Writing functions, conditional statements

Author

Athimed El Taher

Published

November 13, 2024

Writing functions

In the previous lecture we learned about available (built-in) function in R (eg. sum, sd, mean, …).
In this lecture, we will lean how to write functions by ourselves.

Function structure

 myFunction <- function(arg1, arg2, ... ){
    statements
    return(object)
 }

the formals() (or args()) - list of arguments, with/without default value
the code in the body() with returned value at the end
the environment(), the “map” of the location of the function’s variables.

User-Defined functions

When writing a function, we need to:

Think about a meaningful function name.
Similarly, when calling a function, set function’s argument(s) inside normal parentheses.
Specify function body between curly brackets (contains the code executed when we call this function)
Output/Return (or not) a value

Function output

The output value of a function can be:

A single variable or value.
The results of multiple calculations that need to be combined before being returned.
The return value can be vector c(), list list() or table data.frame(), …

The return value can be specified:

explicitly: with return command

# function that calculates the `average` (arithmetic mean) of two numbers. 
avg <- function(x, y) {
  return((x+y)/2)
}
avg(1,7)

[1] 4

implicitly: the last evaluated expression is returned (whose result is not assigned to variable)

avg <- function(x, y) {
  (x+y)/2
}
avg(1,7)

[1] 4

This function does not return any result. It returns a variable which contains the result of the function

avg <- function(x, y) {
  z = (x+y)/2
}
avg(1,7) # does not return a value but a variable

# if you want to see the result of the function saved in the variable
print(avg(1,7))

[1] 4

# if you want to use it later, yan can assign it to a variable
avg_1_and_7 = avg(1,7)

Side note

The majority of functions in R exit in one of these two ways:

If everything goes well: Return a value/variable.
In case of failure: Throw an error (usually displayed in read).

The `source()` command

If we develop a function (or a set of functions) that we want to use in multiple projects, we can store these functions in a separate file with .R suffix.

In every session/project we can then simply load that file and make the functions available.
The loading is achieved by calling source function with a name of that file as a file argument: source('data/functions.R')

Such habit allows easier management of our code:

In case we identify a bug, we can easily fix it at one location
Easily shared code/functions with colleagues.

Documentation or at least detailed description of function and its arguments should always be provided (ideally with example).

For code snippets (very short functions), simple comment explaining what function does, is very helpful.

# Function that calculate the average, the standard deviation, the maximum and the minimum of a variable.
## Use only build-in functions
## na.rm to indicate whether NA values should be stripped before the computation proceeds.

summarize_values <- function(x) {
  # Computes arithmetic mean 
  avr = mean(x,na.rm = T)
  
  # Computes the standard deviation
  standard_dev = sd(x,na.rm = T) 
  
  # Return maxima 
  max_val = max(x,na.rm = T) 
  
  # Return minima 
  min_val = min(x,na.rm = T)
  
  # Named vector, summarizing the 4 info. 
  output = c('average'= avr,'standard_dev'= standard_dev,'min_val'= min_val,'max_val'= max_val)
  
  # Returned named vector
  return (output) 
  }

And we can load this function into our session by calling:

# remove the function from our environment
rm("summarize_values") 
ls()

[1] "avg"            "avg_1_and_7"    "fig.height"     "fig.scale"     
[5] "fig.width"      "myFunction"     "show_solutions"

# load the myFunction.R script that contains useful user-defined functions
source("data/summarize_values.R")

# list all functions and variable from the actual environnement
ls()

[1] "avg"              "avg_1_and_7"      "fig.height"      
[4] "fig.scale"        "fig.width"        "myFunction"      
[7] "show_solutions"   "summarize_values"

summarize_values(1:10)

     avg      std  minimum  maximum 
 5.50000  3.02765  1.00000 10.00000

Conditional Statements

Conditional statements are checkpoints in the program that determines ways according to the situation. Conditional statements execute code based on the output of a condition that is TRUE-or-FALSE. The most simple way of writing conditional statements in R is by using the if and else keywords.

If statement

The syntax of the if statement is as follows:

if (condition) {
  code
}

Here the condition is a logical expression that returns either TRUE or FALSE.

If the condition returns TRUE, than the code in curly brackets is executed.
If it returns FALSE, the code inside the brackets is not executed, and R moves on to the next line of code.
The conditions are evaluated in order.

Here an example:

x <- 1
# print `positive` if the number is greater than 0 
if (x > 0) {
  print("positive")
}

[1] "positive"

Else statement

Obviously if the number is not bigger than 0 than the code is not executed. If we want to consider also the case of negative number we need to extend the if statement to if ... else.

if (condition) {
      code1
      } else {
      code2
    }

If the result of the test is TRUE than the code in curly brackets following if is executed (code1) and if the result of the test is FALSE then code in curly brackets after else (code2) is executed.

x <- -1
# print `positive` if the number is greater than 0 
# print `negative` if the number is smaller than 0 
if (x > 0) {
  print("positive")
} else {
  print("negative")
}

[1] "negative"

How to deal with 0? We need to add additional if statement (sometimes referred as “nested”).

x <- 0
# print `positive` if the number is greater than 0 
# print `zero` if the number is equal to 0 
# print `negative` if the number is smaller than 0 
if (x > 0) {
  print("positive")
} else if (x == 0) {
  print("zero")
} else {
  print("negative")
}

[1] "zero"

Multiple if_else() statements

The above example to determine if the number is positive or negative will not work for vector longer than one. We will get the error message that the condition in if statement is has length > 1

x <- c(-1,5,0)
if (x > 0) {
  print("positive")
} else if (x == 0) {
  print("zero")
} else {
  print("negative")
}

If we want to apply if … else on a vector we can take an advantage of the dplyr package case_when function:

This function allows you to vectorise multiple if_else() statements.
Each case is evaluated sequentially and the first match for each element determines the corresponding value in the output vector.
If no cases match, the .default is used as a final “else” statement.
If none of the cases match and no .default is supplied, NA is returned

x <- c(-1,5,0)
case_when(
  x > 0 ~ "positive",
  x == 0 ~ "zero",
  .default = "negative" # correspond to final else
)

[1] "negative" "positive" "zero"

Demo exercice

Given an integer i, return a string array where:

answer[i] == “fizz buzz” if i is divisible by 3 and 5.
answer[i] == “fizz” if i is divisible by 3.
answer[i] == “buzz” if i is divisible by 5.
answer[i] == i (as a string) if none of the above conditions are true.

# The vector
x = 1:15
x

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

# the multiple if_else() statements
case_when(
  x %% 15 == 0 ~ "fizz buzz", # can be divided by 3 and 5 
  x %% 5 == 0 ~ "fizz",
  x %% 3 == 0 ~ "buzz",
  .default = as.character(x) # 
)

 [1] "1"         "2"         "buzz"      "4"         "fizz"     
 [6] "buzz"      "7"         "8"         "buzz"      "fizz"     
[11] "11"        "buzz"      "13"        "14"        "fizz buzz"

Like an if statement, the arguments are evaluated in order, so you must proceed from the most specific to the most general.

# This won't work:
case_when(
  x %%  5 == 0 ~ "fizz",
  x %%  3 == 0 ~ "buzz",
  x %% 15 == 0 ~ "fizz buzz",
  .default = as.character(x)
)

 [1] "1"    "2"    "buzz" "4"    "fizz" "buzz" "7"    "8"    "buzz" "fizz"
[11] "11"   "buzz" "13"   "14"   "fizz"

If none of the cases match and no .default is supplied, NA is used:

case_when(
  x %% 15 == 0 ~ "fizz buzz",
  x %% 5 == 0 ~ "fizz",
  x %% 3 == 0 ~ "buzz",
)

 [1] NA          NA          "buzz"      NA          "fizz"     
 [6] "buzz"      NA          NA          "buzz"      "fizz"     
[11] NA          "buzz"      NA          NA          "fizz buzz"

Note that NA values on the LHS are treated like FALSE and will be assigned the .default value. You must handle them explicitly if you want to use a different value. The exact way to handle missing values is dependent on the set of LHS conditions you use.

x[2:4] <- NA_real_
x

 [1]  1 NA NA NA  5  6  7  8  9 10 11 12 13 14 15

case_when(
  x %% 15 == 0 ~ "fizz buzz",
  x %% 5 == 0 ~ "fizz",
  x %% 3 == 0 ~ "buzz",
  is.na(x) ~ "nope",
  .default = as.character(x)
)

 [1] "1"         "nope"      "nope"      "nope"      "fizz"     
 [6] "buzz"      "7"         "8"         "buzz"      "fizz"     
[11] "11"        "buzz"      "13"        "14"        "fizz buzz"

Cut [Optional]

cut divides a range of values into bins and specify labels for each bin. Let’s imagine a case that you want to classify a numerical vector into four bins (e.g. assign each penguin body_mass_g` into 0-25%, 25-50%, 50-75%, 75-100% bin).

You can do it with multiple nested ifelse statements.

First, let’s look at the body_mass_g distribution and to which values the bin correspond.

myTheme <-  theme_bw(base_size = 10) + 
  theme(panel.grid.major = element_line(colour = "lightgrey", linewidth = 0.1))  +
  theme(axis.text=element_text(size=10))

# Import the dataset
dat = readr::read_csv("data/penguins.csv")

Rows: 344 Columns: 8
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

quantil_body_masse = 
  dat |>
  drop_na(body_mass_g) |>
  pull(body_mass_g) |>
  quantile()

# quantil_body_masse
#  0%  25%  50%  75% 100% 
# 2700 3550 4050 4750 6300 


p1 = ggplot(dat, aes(x=body_mass_g)) + 
  geom_density(color="darkblue", fill="lightblue",alpha=0.2) +
  myTheme
p1

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density()`).

Now if we look at the bin

dt <- data.frame(x=seq(1,nrow(dat)),y=dat$body_mass_g)
dens <- density(dt$y,na.rm = T)
df <- data.frame(x=dens$x, y=dens$y)
probs <- c(0, 0.25, 0.5, 0.75, 1)
quantiles <- quantile(dt$y, prob=probs,na.rm = T)
df$quant <- factor(findInterval(df$x,quantiles))
ggplot(df, aes(x,y)) + geom_line() + geom_ribbon(aes(ymin=0, ymax=y, fill=quant)) + scale_x_continuous(breaks=quantiles) + scale_fill_brewer(guide="none")

Now if we split the data into these 4 bins

dat = dat |>
  mutate(
      weight_bin = case_when(
      body_mass_g <= quantile(body_mass_g,0.25,na.rm=T) ~ "0-25%",
      body_mass_g <= quantile(body_mass_g,0.5,na.rm=T) ~ "25-50%",
      body_mass_g <= quantile(body_mass_g,0.75,na.rm=T) ~ "50-75%",
      body_mass_g <= quantile(body_mass_g,1,na.rm=T) ~ "75-100%",

    )
  )


dat |>
  count(weight_bin)

# A tibble: 5 × 2
  weight_bin     n
  <chr>      <int>
1 0-25%         89
2 25-50%        87
3 50-75%        81
4 75-100%       85
5 <NA>           2

f = as.factor(dat$weight_bin)

It is not so easy to follow the code above.

However, there is a trick how to achieve the same with more readable and more efficient code (also less typing).

# breaks (0, 25%, 50%... percentile)
# if you want to specify the percentile: seq(0,1,by=0.25)
br = dat |>
  pull(body_mass_g) |>
  quantile(x=_, seq(0,1,by=0.25),na.rm=T)

# br
#  0%  25%  50%  75% 100% 
# 2700 3550 4050 4750 6300 
# Which correspond to: 
# [2.7e+03,3.55e+03] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]


dat |>
  pull(body_mass_g) |>
    cut(x=_,br) #  the smallest value is not included, notice the round bracket...

  [1] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03] 
  [4] <NA>                (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03]
  [7] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] 
 [10] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03]
 [13] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03] (4.05e+03,4.75e+03]
 [16] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03]
 [19] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] 
 [22] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
 [25] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03] 
 [28] (2.7e+03,3.55e+03]  (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03]
 [31] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03] (2.7e+03,3.55e+03] 
 [34] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03]
 [37] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03]  (2.7e+03,3.55e+03] 
 [40] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03]
 [43] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] 
 [46] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]  (2.7e+03,3.55e+03] 
 [49] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] 
 [52] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03]
 [55] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03] (2.7e+03,3.55e+03] 
 [58] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03]
 [61] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03] (3.55e+03,4.05e+03]
 [64] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03]
 [67] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] 
 [70] (4.05e+03,4.75e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
 [73] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03] (3.55e+03,4.05e+03]
 [76] (4.05e+03,4.75e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
 [79] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03] (2.7e+03,3.55e+03] 
 [82] (4.05e+03,4.75e+03] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03]
 [85] (2.7e+03,3.55e+03]  (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03]
 [88] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
 [91] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] 
 [94] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03]
 [97] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] 
[100] (4.05e+03,4.75e+03] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03]
[103] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] 
[106] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[109] (2.7e+03,3.55e+03]  (4.75e+03,6.3e+03]  (3.55e+03,4.05e+03]
[112] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03]
[115] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] 
[118] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03]  (2.7e+03,3.55e+03] 
[121] (2.7e+03,3.55e+03]  (2.7e+03,3.55e+03]  (2.7e+03,3.55e+03] 
[124] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03]
[127] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] 
[130] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03]  (2.7e+03,3.55e+03] 
[133] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] 
[136] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03]
[139] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] 
[142] (2.7e+03,3.55e+03]  (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03]
[145] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03] (4.05e+03,4.75e+03]
[148] (2.7e+03,3.55e+03]  (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03]
[151] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03]
[154] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] 
[157] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] 
[160] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] 
[163] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03]
[166] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] 
[169] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03] 
[172] (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03] 
[175] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03] 
[178] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] 
[181] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03] 
[184] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03] 
[187] (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03] 
[190] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] 
[193] (3.55e+03,4.05e+03] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03]
[196] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03] 
[199] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03] 
[202] (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03] 
[205] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03] 
[208] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] 
[211] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03]
[214] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] 
[217] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03]
[220] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] 
[223] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03] 
[226] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] 
[229] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03]
[232] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] 
[235] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03]
[238] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] 
[241] (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03] 
[244] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] 
[247] (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03] 
[250] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] 
[253] (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03] 
[256] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] 
[259] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03]
[262] (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03] 
[265] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]  (4.05e+03,4.75e+03]
[268] (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03] 
[271] (4.75e+03,6.3e+03]  <NA>                (4.75e+03,6.3e+03] 
[274] (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03]  (4.75e+03,6.3e+03] 
[277] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[280] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[283] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03] (4.05e+03,4.75e+03]
[286] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[289] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[292] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03]
[295] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03] (3.55e+03,4.05e+03]
[298] (2.7e+03,3.55e+03]  (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03]
[301] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] 
[304] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03]
[307] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] 
[310] (4.05e+03,4.75e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[313] (3.55e+03,4.05e+03] (4.75e+03,6.3e+03]  <NA>               
[316] (4.05e+03,4.75e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[319] (2.7e+03,3.55e+03]  (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03]
[322] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]  (4.05e+03,4.75e+03]
[325] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03] (2.7e+03,3.55e+03] 
[328] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[331] (2.7e+03,3.55e+03]  (2.7e+03,3.55e+03]  (2.7e+03,3.55e+03] 
[334] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03] 
[337] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[340] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03]  (3.55e+03,4.05e+03]
[343] (4.05e+03,4.75e+03] (3.55e+03,4.05e+03]
4 Levels: (2.7e+03,3.55e+03] (3.55e+03,4.05e+03] ... (4.75e+03,6.3e+03]

ff = dat |>
  pull(body_mass_g) |>
    cut(x=_,br,include.lowest=TRUE) #  the smallest value is not included, notice the round bracket...

class(ff)

[1] "factor"

levels(ff) <- c("0-25%","25-50%","50-75%","75-100%")
 
identical(f,ff)

[1] TRUE

Switch statement [Optional]

switch is related to if statement discussed before. switch allows efficient selection of one of its arguments. Essentially, it allows to write the if statements in a compact and more readable way.

switch should be used only for character inputs. It can also handle other types (numeric) but the interpretation is more difficult: there is a special handling of floats, we cannot provide default value.

centreData <- function(x, type) {  
   
     cen = case_when(
       type == 'mean' ~ mean(x,na.rm=T),
       type == 'median' ~  median(x,na.rm = T),
       type == 'trimmed' ~  mean(x, trim=0.1),
     )
         return(x - cen)
 }

# Function provide information about the uniform distribution on the interval from min to max.
y <- runif(10) # default: runif(n, min = 0, max = 1)
y

 [1] 0.89517884 0.13234106 0.86585707 0.84565641 0.09742589 0.88882577
 [7] 0.70490113 0.39846469 0.37198646 0.94705853

centreData(y, type="mean")

 [1]  0.28040925 -0.48242852  0.25108749  0.23088683 -0.51734370
 [6]  0.27405619  0.09013154 -0.21630489 -0.24278312  0.33228895

centreData(y, type="median")

 [1]  0.11990007 -0.64293771  0.09057830  0.07037764 -0.67785288
 [6]  0.11354700 -0.07037764 -0.37681408 -0.40329231  0.17177976

centreData(y, type="trimmed")

 [1]  0.2572774 -0.5055604  0.2279556  0.2077550 -0.5404755  0.2509243
 [7]  0.0669997 -0.2394367 -0.2659150  0.3091571

centreData(y, type="other") # return NA in case the type is not recognized.

 [1] NA NA NA NA NA NA NA NA NA NA

Now using the `switch` function. You can also add a specific message to stop the function in case the data type is nor recognized.

centreData <- function(x, type) {
  cen <- switch(type,
                mean = mean(x),
                median = median(x),
                trimmed = mean(x, trim = .1),
                stop("Invalid `type` value"))
  return(x - cen)
}
centreData(y, type="mean")

 [1]  0.28040925 -0.48242852  0.25108749  0.23088683 -0.51734370
 [6]  0.27405619  0.09013154 -0.21630489 -0.24278312  0.33228895

centreData(y, type="median")

 [1]  0.11990007 -0.64293771  0.09057830  0.07037764 -0.67785288
 [6]  0.11354700 -0.07037764 -0.37681408 -0.40329231  0.17177976

centreData(y, type="trimmed")

 [1]  0.2572774 -0.5055604  0.2279556  0.2077550 -0.5404755  0.2509243
 [7]  0.0669997 -0.2394367 -0.2659150  0.3091571

# centreData(y, type="other")
# return: Error in centreData(y, type = "other") : Invalid `type` value

It is a good practice to always include case handling for unmatched input, in the case above it should throw an error (stop()).

stop and warning

In some cases we would like to stop execution of a function and return error to the user similarly we did in the code above. For that we would use the stop() function.

avg3 <- function(x) {
  if (any(is.na(x))) {
    stop(sprintf("There are %d NA's in `x`. Please remove them before calling this function!", sum(is.na(x))))
  }
  sum(x)/length(x)
}
avg3(c(1,2,3,NA))

Error in avg3(c(1, 2, 3, NA)): There are 1 NA's in `x`. Please remove them before calling this function!

If we instead rather want to warn the user that something might be wrong, we can use warning() function.

avg4 <- function(x) {
  if (any(is.na(x))) {
    warning(sprintf("%d NA's removed before calculating the average.", sum(is.na(x))))
    x <- x[!is.na(x)]
  }
  sum(x)/length(x)
}
avg4(c(1,2,3,NA))

Warning in avg4(c(1, 2, 3, NA)): 1 NA's removed before calculating the
average.

[1] 2

Or simply inform user with message() function.

More details about error handling can be found in the Advanced R book by Hadley Wickham.

Exercises

Exercise 1

Like the last past weeks, we will use the Palmer Penguins data set for our exercises and illustrations.

# Import the dataset
dat = readr::read_csv("data/penguins.csv")

Rows: 344 Columns: 8
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Write a function that calculates the average of three values.
Use YOUR function to calculate the average body_mass_g of the penguins of the 81rd, 82rd and 83rd rows of dat.
Write a function that calculates the average of all values of a vector.
Use YOUR function to calculate the average body_mass_g of all penguins in dat. Hint: Remove NA values before the calculation

Solution 1

# Import the dataset
dat = readr::read_csv("data/penguins.csv")

Rows: 344 Columns: 8
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# 1A. 
# Function that calculates the average between three number and return the average
avg <- function(x, y, z) { # x,y and z are numbers
  return ((x+y+z)/3)
}

# 1B.
# Calculate the average of the body masse (g) of the penguin in the row 81,82 and 83 of the dataset
avg(dat[81, "body_mass_g"],dat[82, "body_mass_g"],dat[83, "body_mass_g"]) # (3200 + 4700 + 3800) / 3 = 3900

  body_mass_g
1        3900

# 1C.
# Function that calculates the average of ALL numbers of a vector
avg <- function(x) { # x is a vector 
  return (sum(x)/length(x))
}


# 1D.
# Calculate the average body_mass_g of all penguins in dat

# Like this, the function do not deal with NA:
avg(dat[, "body_mass_g"]) # return NA

[1] NA

## Solution to this:  
## Remove NA before hands
dat |>
  drop_na(body_mass_g) |>
  pull(body_mass_g) |>
  avg()

[1] 4201.754

Exercise 2

Write a function that calculates the mean, the standard deviation, the maximum and the minimum of the variable flipper_length_mm of the dat:

Use build-in functions within the body of YOUR function.
Write a function that deal with NA .
The function return a named vector with the 4 information.

Solution 2

summarize_values <- function(x) { # x is a vector 

  avr = mean(x,na.rm = T)
  standard_dev = sd(x,na.rm = T)
  max_val = max(x,na.rm = T)
  min_val = min(x,na.rm = T)
  
  out = c('avg'=avr,'std'=standard_dev,'minimum'=min_val,'maximum'=max_val)
  
  return (out)

  }
 

dat |>
  pull(flipper_length_mm) |>
    summarize_values()

      avg       std   minimum   maximum 
200.91520  14.06171 172.00000 231.00000

Exercise 3

If we go back to the avg function you wrote in exercise 1D. How can you modify it, in order to allow the user to choose handling (or not) the NA’s. Hint: Something similar to na.rm in some base R functions.
Use YOUR function to calculate the average body_mass_g of all penguins in dat (without removing the NAs before applying the function).

Solution 3

# 3A. Modified version of AVG function, which handel NAs.

#' Calculate average, with conditional satement about handeling or not NA's values
#'
#' @param x Numeric vector
#' @param na.rm Logical Index
#' @return Average value (accounting for NAs or not) \code{x}
#' @examples
#' x <- c(1.5,3.4,4,6.1,5.2)
#' avg(x)

avg <- function(x, na.rm=FALSE) { # by default, na.rm = FALSE
  # 1. If na.rm == TRUE, remove NA from vector X:  
  if (na.rm) { 
    x <- x[!is.na(x)]
  }
  # 2. calculating average
  sum(x)/length(x)
}



# 3B. calculate the average `body_mass_g` of all penguins in `dat` without the NA's.

dat |>
  pull(body_mass_g) |>
    avg(na.rm=T)

[1] 4201.754

Exercise 4

Write a piece of code to download a file from internet (URL) under this conditions:

Load the penguin dataset (https://ivanek.github.io/introductionToR/data/penguins.csv) and save it to your data folder.
We would like to download the file (function download.file) only, if it does not exists locally. This we can check with the build-in file.exists function.
If it exists locally, we would like to send a message (function message()) saying that the file already exist.
Write this function in the simplest way.
Check https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/download.file and https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/message for more info.

Solution 4

# Function to load an non-existing file from URL

# check if the file exists:
file.exists("data/penguins.csv")

[1] TRUE

# check if it does not exists:
!file.exists("data/penguins.csv")

[1] FALSE

downloadFile <- function(url, file) {
  if (!file.exists(file)) { # check if the file does not exist
    download.file(url, file)
  } else {
    message("File exists already, not downloading")
  }
}



## test each step of fonction: 
### condition1: File does not exit
#!file.exists(file) # condition = TRUE --> file does not exist yet
#download.file(url, file) # code executed with build-in function

#!file.exists(file) # condition = FALSE --> NOW the file exist yet
#message("File exists already, not downloading") # message in red 


# If you would like to test the function step by step: 
## 1. write the variable
url = "https://ivanek.github.io/introductionToR/data/penguins.csv"
file = "data/penguins.csv"

# Check if the file exist: 
!file.exists(file)# The file already exist

[1] FALSE

downloadFile(url, file)

File exists already, not downloading

Exercise 5

Knowing these information:

In the water surrounding the Biscoe island you find only krills.
In the water surrounding the Dream island you find only squids.
In the water surrounding the Torgersen island you find only fish.

Add a new column to the penguin dataset that specifies what food diet each penguin has.

Solution 5

# `case_when()` is particularly useful inside `mutate()` when you want to
# create a new variable that relies on an existing
# variables
dat = dat |>
  mutate(
    diet = case_when(
      island == "Biscoe" ~ "krill",
      island == "Dream" ~ "squid",
      .default = "fish"
    )
  )

table(dat$diet) # Base R


 fish krill squid 
   52   168   124

Exercise 6

Add an additional column into penguin dataset, classifying if the penguin body_mass_g is greater than median, or smaller/equal (factor with two levels: “greater, less”). Make sure the NA values in body_mass_g stay NA values in the new column.

Solution 6

# if the weight is greater than median (using ifelse)

dat = dat |>
  mutate(
      weight_category = case_when(
      body_mass_g > median(body_mass_g,na.rm=T) ~ "greater",
      body_mass_g <= median(body_mass_g,na.rm=T) ~ "less",
    )
  )


dat |>
  count(weight_category)

# A tibble: 3 × 2
  weight_category     n
  <chr>           <int>
1 greater           166
2 less              176
3 <NA>                2

Session info

R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] writexl_1.5.1    readxl_1.4.3     lubridate_1.9.3  forcats_1.0.0   
 [5] stringr_1.5.1    dplyr_1.1.4      purrr_1.0.2      readr_2.1.5     
 [9] tidyr_1.3.1      tibble_3.2.1     tidyverse_2.0.0  gridExtra_2.3   
[13] ggplot2_3.5.1    devtools_2.4.5   usethis_3.0.0    BiocStyle_2.33.1
[17] png_0.1-8        knitr_1.48      

loaded via a namespace (and not attached):
 [1] gtable_0.3.6        xfun_0.49           htmlwidgets_1.6.4  
 [4] remotes_2.5.0       tzdb_0.4.0          vctrs_0.6.5        
 [7] tools_4.4.2         generics_0.1.3      parallel_4.4.2     
[10] fansi_1.0.6         pkgconfig_2.0.3     RColorBrewer_1.1-3 
[13] lifecycle_1.0.4     farver_2.1.2        compiler_4.4.2     
[16] munsell_0.5.1       httpuv_1.6.15       htmltools_0.5.8.1  
[19] yaml_2.3.10         later_1.3.2         pillar_1.9.0       
[22] crayon_1.5.3        urlchecker_1.0.1    ellipsis_0.3.2     
[25] cachem_1.1.0        sessioninfo_1.2.2   mime_0.12          
[28] tidyselect_1.2.1    digest_0.6.37       stringi_1.8.4      
[31] labeling_0.4.3      fastmap_1.2.0       colorspace_2.1-1   
[34] cli_3.6.3           magrittr_2.0.3      pkgbuild_1.4.5     
[37] utf8_1.2.4          withr_3.0.2         scales_1.3.0       
[40] promises_1.3.0      bit64_4.5.2         timechange_0.3.0   
[43] rmarkdown_2.29      bit_4.5.0           cellranger_1.1.0   
[46] hms_1.1.3           memoise_2.0.1       shiny_1.9.1        
[49] evaluate_1.0.1      miniUI_0.1.1.1      profvis_0.4.0      
[52] rlang_1.1.4         Rcpp_1.0.13-1       xtable_1.8-4       
[55] glue_1.8.0          BiocManager_1.30.25 pkgload_1.4.0      
[58] rstudioapi_0.17.1   vroom_1.6.5         jsonlite_1.8.9     
[61] R6_2.5.1            fs_1.6.5

Writing functions

Function structure

User-Defined functions

Function output

The source() command

Conditional Statements

If statement

Else statement

Multiple if_else() statements

Cut [Optional]

Switch statement [Optional]

stop and warning

Exercises

Exercise 1

Exercise 2

Exercise 3

Exercise 4

Exercise 5

Exercise 6

Session info

The `source()` command