<- function(arg1, arg2, ... ){
myFunction
statementsreturn(object)
}
Writing functions, conditional statements
Writing functions
In the previous lecture we learned about available (built-in) function in R
(eg. sum
, sd
, mean
, …).
In this lecture, we will lean how to write functions by ourselves.
Function structure
- the
formals()
(orargs()
) - list of arguments, with/without default value - the code in the
body()
with returned value at the end - the
environment()
, the “map” of the location of the function’s variables.
User-Defined functions
When writing a function, we need to:
- Think about a meaningful function name.
- Similarly, when calling a function, set function’s argument(s) inside normal parentheses.
- Specify function body between curly brackets (contains the code executed when we call this function)
- Output/Return (or not) a value
Function output
The output value of a function can be:
- A single variable or value.
- The results of multiple calculations that need to be combined before being returned.
- The return value can be vector
c()
, listlist()
or tabledata.frame()
, …
The return value can be specified:
- explicitly: with
return
command
# function that calculates the `average` (arithmetic mean) of two numbers.
<- function(x, y) {
avg return((x+y)/2)
}avg(1,7)
[1] 4
- implicitly: the last evaluated expression is returned (whose result is not assigned to variable)
<- function(x, y) {
avg +y)/2
(x
}avg(1,7)
[1] 4
This function does not return any result. It returns a variable which contains the result of the function
<- function(x, y) {
avg = (x+y)/2
z
}avg(1,7) # does not return a value but a variable
# if you want to see the result of the function saved in the variable
print(avg(1,7))
[1] 4
# if you want to use it later, yan can assign it to a variable
= avg(1,7) avg_1_and_7
The majority of functions in R
exit in one of these two ways:
- If everything goes well: Return a value/variable.
- In case of failure: Throw an error (usually displayed in read).
The source()
command
If we develop a function (or a set of functions) that we want to use in multiple projects, we can store these functions in a separate file with .R
suffix.
In every session/project we can then simply load that file and make the functions available.
The loading is achieved by calling source
function with a name of that file as a file
argument: source('data/functions.R')
Such habit allows easier management of our code:
- In case we identify a bug, we can easily fix it at one location
- Easily shared code/functions with colleagues.
Documentation or at least detailed description of function and its arguments should always be provided (ideally with example).
For code snippets (very short functions), simple comment explaining what function does, is very helpful.
# Function that calculate the average, the standard deviation, the maximum and the minimum of a variable.
## Use only build-in functions
## na.rm to indicate whether NA values should be stripped before the computation proceeds.
<- function(x) {
summarize_values # Computes arithmetic mean
= mean(x,na.rm = T)
avr
# Computes the standard deviation
= sd(x,na.rm = T)
standard_dev
# Return maxima
= max(x,na.rm = T)
max_val
# Return minima
= min(x,na.rm = T)
min_val
# Named vector, summarizing the 4 info.
= c('average'= avr,'standard_dev'= standard_dev,'min_val'= min_val,'max_val'= max_val)
output
# Returned named vector
return (output)
}
And we can load this function into our session by calling:
# remove the function from our environment
rm("summarize_values")
ls()
[1] "avg" "avg_1_and_7" "fig.height" "fig.scale"
[5] "fig.width" "myFunction" "show_solutions"
# load the myFunction.R script that contains useful user-defined functions
source("data/summarize_values.R")
# list all functions and variable from the actual environnement
ls()
[1] "avg" "avg_1_and_7" "fig.height"
[4] "fig.scale" "fig.width" "myFunction"
[7] "show_solutions" "summarize_values"
summarize_values(1:10)
avg std minimum maximum
5.50000 3.02765 1.00000 10.00000
Conditional Statements
Conditional statements are checkpoints in the program that determines ways according to the situation. Conditional statements execute code based on the output of a condition that is TRUE
-or-FALSE
. The most simple way of writing conditional statements in R
is by using the if
and else
keywords.
If statement
The syntax of the if
statement is as follows:
if (condition) {
code }
Here the condition is a logical expression that returns either TRUE
or FALSE
.
- If the condition returns
TRUE
, than the code in curly brackets is executed. - If it returns
FALSE,
the code inside the brackets is not executed, and R moves on to the next line of code. - The conditions are evaluated in order.
Here an example:
<- 1
x # print `positive` if the number is greater than 0
if (x > 0) {
print("positive")
}
[1] "positive"
Else statement
Obviously if the number is not bigger than 0 than the code is not executed. If we want to consider also the case of negative number we need to extend the if
statement to if ... else
.
if (condition) {
code1else {
}
code2 }
If the result of the test is TRUE
than the code in curly brackets following if
is executed (code1
) and if the result of the test is FALSE
then code in curly brackets after else
(code2
) is executed.
<- -1
x # print `positive` if the number is greater than 0
# print `negative` if the number is smaller than 0
if (x > 0) {
print("positive")
else {
} print("negative")
}
[1] "negative"
How to deal with 0
? We need to add additional if
statement (sometimes referred as “nested”).
<- 0
x # print `positive` if the number is greater than 0
# print `zero` if the number is equal to 0
# print `negative` if the number is smaller than 0
if (x > 0) {
print("positive")
else if (x == 0) {
} print("zero")
else {
} print("negative")
}
[1] "zero"
Multiple if_else() statements
The above example to determine if the number is positive or negative will not work for vector longer than one. We will get the error message that the condition in if
statement is has length > 1
<- c(-1,5,0)
x if (x > 0) {
print("positive")
else if (x == 0) {
} print("zero")
else {
} print("negative")
}
If we want to apply if
… else
on a vector we can take an advantage of the dplyr
package case_when
function:
- This function allows you to vectorise multiple if_else() statements.
- Each case is evaluated sequentially and the first match for each element determines the corresponding value in the output vector.
- If no cases match, the
.default
is used as a final “else” statement. - If none of the cases match and no
.default
is supplied, NA is returned
<- c(-1,5,0)
x case_when(
> 0 ~ "positive",
x == 0 ~ "zero",
x .default = "negative" # correspond to final else
)
[1] "negative" "positive" "zero"
Given an integer i
, return a string array where:
- answer[i] == “fizz buzz” if i is divisible by 3 and 5.
- answer[i] == “fizz” if i is divisible by 3.
- answer[i] == “buzz” if i is divisible by 5.
- answer[i] == i (as a string) if none of the above conditions are true.
# The vector
= 1:15
x x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# the multiple if_else() statements
case_when(
%% 15 == 0 ~ "fizz buzz", # can be divided by 3 and 5
x %% 5 == 0 ~ "fizz",
x %% 3 == 0 ~ "buzz",
x .default = as.character(x) #
)
[1] "1" "2" "buzz" "4" "fizz"
[6] "buzz" "7" "8" "buzz" "fizz"
[11] "11" "buzz" "13" "14" "fizz buzz"
Like an if
statement, the arguments are evaluated in order, so you must proceed from the most specific to the most general.
# This won't work:
case_when(
%% 5 == 0 ~ "fizz",
x %% 3 == 0 ~ "buzz",
x %% 15 == 0 ~ "fizz buzz",
x .default = as.character(x)
)
[1] "1" "2" "buzz" "4" "fizz" "buzz" "7" "8" "buzz" "fizz"
[11] "11" "buzz" "13" "14" "fizz"
If none of the cases match and no .default
is supplied, NA is used:
case_when(
%% 15 == 0 ~ "fizz buzz",
x %% 5 == 0 ~ "fizz",
x %% 3 == 0 ~ "buzz",
x )
[1] NA NA "buzz" NA "fizz"
[6] "buzz" NA NA "buzz" "fizz"
[11] NA "buzz" NA NA "fizz buzz"
Note that NA
values on the LHS are treated like FALSE
and will be assigned the .default
value. You must handle them explicitly if you want to use a different value. The exact way to handle missing values is dependent on the set of LHS conditions you use.
2:4] <- NA_real_
x[ x
[1] 1 NA NA NA 5 6 7 8 9 10 11 12 13 14 15
case_when(
%% 15 == 0 ~ "fizz buzz",
x %% 5 == 0 ~ "fizz",
x %% 3 == 0 ~ "buzz",
x is.na(x) ~ "nope",
.default = as.character(x)
)
[1] "1" "nope" "nope" "nope" "fizz"
[6] "buzz" "7" "8" "buzz" "fizz"
[11] "11" "buzz" "13" "14" "fizz buzz"
Cut [Optional]
cut
divides a range of values into bins and specify labels for each bin. Let’s imagine a case that you want to classify a numerical vector into four bins (e.g. assign each penguin body_mass_g
` into 0-25%, 25-50%, 50-75%, 75-100% bin).
You can do it with multiple nested ifelse
statements.
First, let’s look at the body_mass_g
distribution and to which values the bin correspond.
<- theme_bw(base_size = 10) +
myTheme theme(panel.grid.major = element_line(colour = "lightgrey", linewidth = 0.1)) +
theme(axis.text=element_text(size=10))
# Import the dataset
= readr::read_csv("data/penguins.csv") dat
Rows: 344 Columns: 8
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
=
quantil_body_masse |>
dat drop_na(body_mass_g) |>
pull(body_mass_g) |>
quantile()
# quantil_body_masse
# 0% 25% 50% 75% 100%
# 2700 3550 4050 4750 6300
= ggplot(dat, aes(x=body_mass_g)) +
p1 geom_density(color="darkblue", fill="lightblue",alpha=0.2) +
myTheme p1
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density()`).
Now if we look at the bin
<- data.frame(x=seq(1,nrow(dat)),y=dat$body_mass_g)
dt <- density(dt$y,na.rm = T)
dens <- data.frame(x=dens$x, y=dens$y)
df <- c(0, 0.25, 0.5, 0.75, 1)
probs <- quantile(dt$y, prob=probs,na.rm = T)
quantiles $quant <- factor(findInterval(df$x,quantiles))
dfggplot(df, aes(x,y)) + geom_line() + geom_ribbon(aes(ymin=0, ymax=y, fill=quant)) + scale_x_continuous(breaks=quantiles) + scale_fill_brewer(guide="none")
Now if we split the data into these 4 bins
= dat |>
dat mutate(
weight_bin = case_when(
<= quantile(body_mass_g,0.25,na.rm=T) ~ "0-25%",
body_mass_g <= quantile(body_mass_g,0.5,na.rm=T) ~ "25-50%",
body_mass_g <= quantile(body_mass_g,0.75,na.rm=T) ~ "50-75%",
body_mass_g <= quantile(body_mass_g,1,na.rm=T) ~ "75-100%",
body_mass_g
)
)
|>
dat count(weight_bin)
# A tibble: 5 × 2
weight_bin n
<chr> <int>
1 0-25% 89
2 25-50% 87
3 50-75% 81
4 75-100% 85
5 <NA> 2
= as.factor(dat$weight_bin) f
It is not so easy to follow the code above.
However, there is a trick how to achieve the same with more readable and more efficient code (also less typing).
# breaks (0, 25%, 50%... percentile)
# if you want to specify the percentile: seq(0,1,by=0.25)
= dat |>
br pull(body_mass_g) |>
quantile(x=_, seq(0,1,by=0.25),na.rm=T)
# br
# 0% 25% 50% 75% 100%
# 2700 3550 4050 4750 6300
# Which correspond to:
# [2.7e+03,3.55e+03] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]
|>
dat pull(body_mass_g) |>
cut(x=_,br) # the smallest value is not included, notice the round bracket...
[1] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03]
[4] <NA> (2.7e+03,3.55e+03] (3.55e+03,4.05e+03]
[7] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]
[10] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03]
[13] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03]
[16] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03]
[19] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]
[22] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[25] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03]
[28] (2.7e+03,3.55e+03] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03]
[31] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03]
[34] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03]
[37] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03] (2.7e+03,3.55e+03]
[40] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03]
[43] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]
[46] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] (2.7e+03,3.55e+03]
[49] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]
[52] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03]
[55] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03]
[58] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03]
[61] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03] (3.55e+03,4.05e+03]
[64] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03]
[67] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]
[70] (4.05e+03,4.75e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[73] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03] (3.55e+03,4.05e+03]
[76] (4.05e+03,4.75e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[79] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03]
[82] (4.05e+03,4.75e+03] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03]
[85] (2.7e+03,3.55e+03] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03]
[88] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[91] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]
[94] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03]
[97] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]
[100] (4.05e+03,4.75e+03] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03]
[103] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]
[106] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[109] (2.7e+03,3.55e+03] (4.75e+03,6.3e+03] (3.55e+03,4.05e+03]
[112] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03]
[115] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]
[118] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03] (2.7e+03,3.55e+03]
[121] (2.7e+03,3.55e+03] (2.7e+03,3.55e+03] (2.7e+03,3.55e+03]
[124] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03]
[127] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]
[130] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03] (2.7e+03,3.55e+03]
[133] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]
[136] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03]
[139] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]
[142] (2.7e+03,3.55e+03] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03]
[145] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03]
[148] (2.7e+03,3.55e+03] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03]
[151] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03]
[154] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]
[157] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]
[160] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]
[163] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03]
[166] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]
[169] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03]
[172] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03]
[175] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03]
[178] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]
[181] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03]
[184] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03]
[187] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03]
[190] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]
[193] (3.55e+03,4.05e+03] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03]
[196] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03]
[199] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03]
[202] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03]
[205] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03]
[208] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]
[211] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03]
[214] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]
[217] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03]
[220] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]
[223] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03]
[226] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]
[229] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03]
[232] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]
[235] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03]
[238] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]
[241] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03]
[244] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]
[247] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03]
[250] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]
[253] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03]
[256] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03]
[259] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03]
[262] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03]
[265] (4.05e+03,4.75e+03] (4.75e+03,6.3e+03] (4.05e+03,4.75e+03]
[268] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03]
[271] (4.75e+03,6.3e+03] <NA> (4.75e+03,6.3e+03]
[274] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03] (4.75e+03,6.3e+03]
[277] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[280] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[283] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03]
[286] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[289] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[292] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03]
[295] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03] (3.55e+03,4.05e+03]
[298] (2.7e+03,3.55e+03] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03]
[301] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]
[304] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (4.05e+03,4.75e+03]
[307] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03]
[310] (4.05e+03,4.75e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[313] (3.55e+03,4.05e+03] (4.75e+03,6.3e+03] <NA>
[316] (4.05e+03,4.75e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[319] (2.7e+03,3.55e+03] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03]
[322] (4.05e+03,4.75e+03] (2.7e+03,3.55e+03] (4.05e+03,4.75e+03]
[325] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03]
[328] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[331] (2.7e+03,3.55e+03] (2.7e+03,3.55e+03] (2.7e+03,3.55e+03]
[334] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03]
[337] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03] (3.55e+03,4.05e+03]
[340] (3.55e+03,4.05e+03] (2.7e+03,3.55e+03] (3.55e+03,4.05e+03]
[343] (4.05e+03,4.75e+03] (3.55e+03,4.05e+03]
4 Levels: (2.7e+03,3.55e+03] (3.55e+03,4.05e+03] ... (4.75e+03,6.3e+03]
= dat |>
ff pull(body_mass_g) |>
cut(x=_,br,include.lowest=TRUE) # the smallest value is not included, notice the round bracket...
class(ff)
[1] "factor"
levels(ff) <- c("0-25%","25-50%","50-75%","75-100%")
identical(f,ff)
[1] TRUE
Switch statement [Optional]
switch
is related to if
statement discussed before. switch
allows efficient selection of one of its arguments. Essentially, it allows to write the if
statements in a compact and more readable way.
switch
should be used only for character
inputs. It can also handle other types (numeric
) but the interpretation is more difficult: there is a special handling of floats, we cannot provide default value.
<- function(x, type) {
centreData
= case_when(
cen == 'mean' ~ mean(x,na.rm=T),
type == 'median' ~ median(x,na.rm = T),
type == 'trimmed' ~ mean(x, trim=0.1),
type
)return(x - cen)
}
# Function provide information about the uniform distribution on the interval from min to max.
<- runif(10) # default: runif(n, min = 0, max = 1)
y y
[1] 0.89517884 0.13234106 0.86585707 0.84565641 0.09742589 0.88882577
[7] 0.70490113 0.39846469 0.37198646 0.94705853
centreData(y, type="mean")
[1] 0.28040925 -0.48242852 0.25108749 0.23088683 -0.51734370
[6] 0.27405619 0.09013154 -0.21630489 -0.24278312 0.33228895
centreData(y, type="median")
[1] 0.11990007 -0.64293771 0.09057830 0.07037764 -0.67785288
[6] 0.11354700 -0.07037764 -0.37681408 -0.40329231 0.17177976
centreData(y, type="trimmed")
[1] 0.2572774 -0.5055604 0.2279556 0.2077550 -0.5404755 0.2509243
[7] 0.0669997 -0.2394367 -0.2659150 0.3091571
centreData(y, type="other") # return NA in case the type is not recognized.
[1] NA NA NA NA NA NA NA NA NA NA
Now using the `switch` function. You can also add a specific message to stop the function in case the data type is nor recognized.
<- function(x, type) {
centreData <- switch(type,
cen mean = mean(x),
median = median(x),
trimmed = mean(x, trim = .1),
stop("Invalid `type` value"))
return(x - cen)
}centreData(y, type="mean")
[1] 0.28040925 -0.48242852 0.25108749 0.23088683 -0.51734370
[6] 0.27405619 0.09013154 -0.21630489 -0.24278312 0.33228895
centreData(y, type="median")
[1] 0.11990007 -0.64293771 0.09057830 0.07037764 -0.67785288
[6] 0.11354700 -0.07037764 -0.37681408 -0.40329231 0.17177976
centreData(y, type="trimmed")
[1] 0.2572774 -0.5055604 0.2279556 0.2077550 -0.5404755 0.2509243
[7] 0.0669997 -0.2394367 -0.2659150 0.3091571
# centreData(y, type="other")
# return: Error in centreData(y, type = "other") : Invalid `type` value
It is a good practice to always include case handling for unmatched input, in the case above it should throw an error (stop()
).
stop and warning
In some cases we would like to stop execution of a function and return error to the user similarly we did in the code above. For that we would use the stop()
function.
<- function(x) {
avg3 if (any(is.na(x))) {
stop(sprintf("There are %d NA's in `x`. Please remove them before calling this function!", sum(is.na(x))))
}sum(x)/length(x)
}avg3(c(1,2,3,NA))
Error in avg3(c(1, 2, 3, NA)): There are 1 NA's in `x`. Please remove them before calling this function!
If we instead rather want to warn the user that something might be wrong, we can use warning()
function.
<- function(x) {
avg4 if (any(is.na(x))) {
warning(sprintf("%d NA's removed before calculating the average.", sum(is.na(x))))
<- x[!is.na(x)]
x
}sum(x)/length(x)
}avg4(c(1,2,3,NA))
Warning in avg4(c(1, 2, 3, NA)): 1 NA's removed before calculating the
average.
[1] 2
Or simply inform user with message()
function.
More details about error handling can be found in the Advanced R book by Hadley Wickham.
Exercises
Exercise 1
Like the last past weeks, we will use the Palmer Penguins data set for our exercises and illustrations.
# Import the dataset
= readr::read_csv("data/penguins.csv") dat
Rows: 344 Columns: 8
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
- Write a function that calculates the average of three values.
- Use YOUR function to calculate the average
body_mass_g
of the penguins of the 81rd, 82rd and 83rd rows ofdat
. - Write a function that calculates the average of all values of a vector.
- Use YOUR function to calculate the average
body_mass_g
of all penguins indat
. Hint: Remove NA values before the calculation
# Import the dataset
= readr::read_csv("data/penguins.csv") dat
Rows: 344 Columns: 8
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# 1A.
# Function that calculates the average between three number and return the average
<- function(x, y, z) { # x,y and z are numbers
avg return ((x+y+z)/3)
}
# 1B.
# Calculate the average of the body masse (g) of the penguin in the row 81,82 and 83 of the dataset
avg(dat[81, "body_mass_g"],dat[82, "body_mass_g"],dat[83, "body_mass_g"]) # (3200 + 4700 + 3800) / 3 = 3900
body_mass_g
1 3900
# 1C.
# Function that calculates the average of ALL numbers of a vector
<- function(x) { # x is a vector
avg return (sum(x)/length(x))
}
# 1D.
# Calculate the average body_mass_g of all penguins in dat
# Like this, the function do not deal with NA:
avg(dat[, "body_mass_g"]) # return NA
[1] NA
## Solution to this:
## Remove NA before hands
|>
dat drop_na(body_mass_g) |>
pull(body_mass_g) |>
avg()
[1] 4201.754
Exercise 2
Write a function that calculates the mean, the standard deviation, the maximum and the minimum of the variable flipper_length_mm
of the dat
:
- Use build-in functions within the body of YOUR function.
- Write a function that deal with
NA
. - The function return a named vector with the 4 information.
<- function(x) { # x is a vector
summarize_values
= mean(x,na.rm = T)
avr = sd(x,na.rm = T)
standard_dev = max(x,na.rm = T)
max_val = min(x,na.rm = T)
min_val
= c('avg'=avr,'std'=standard_dev,'minimum'=min_val,'maximum'=max_val)
out
return (out)
}
|>
dat pull(flipper_length_mm) |>
summarize_values()
avg std minimum maximum
200.91520 14.06171 172.00000 231.00000
Exercise 3
If we go back to the
avg
function you wrote in exercise 1D. How can you modify it, in order to allow the user to choose handling (or not) theNA
’s. Hint: Something similar tona.rm
in some base R functions.Use YOUR function to calculate the average
body_mass_g
of all penguins indat
(without removing the NAs before applying the function).
# 3A. Modified version of AVG function, which handel NAs.
#' Calculate average, with conditional satement about handeling or not NA's values
#'
#' @param x Numeric vector
#' @param na.rm Logical Index
#' @return Average value (accounting for NAs or not) \code{x}
#' @examples
#' x <- c(1.5,3.4,4,6.1,5.2)
#' avg(x)
<- function(x, na.rm=FALSE) { # by default, na.rm = FALSE
avg # 1. If na.rm == TRUE, remove NA from vector X:
if (na.rm) {
<- x[!is.na(x)]
x
}# 2. calculating average
sum(x)/length(x)
}
# 3B. calculate the average `body_mass_g` of all penguins in `dat` without the NA's.
|>
dat pull(body_mass_g) |>
avg(na.rm=T)
[1] 4201.754
Exercise 4
Write a piece of code to download a file from internet (URL) under this conditions:
- Load the penguin dataset (https://ivanek.github.io/introductionToR/data/penguins.csv) and save it to your
data
folder. - We would like to download the file (function
download.file
) only, if it does not exists locally. This we can check with the build-infile.exists
function. - If it exists locally, we would like to send a message (function
message()
) saying that the file already exist. - Write this function in the simplest way.
- Check https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/download.file and https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/message for more info.
# Function to load an non-existing file from URL
# check if the file exists:
file.exists("data/penguins.csv")
[1] TRUE
# check if it does not exists:
!file.exists("data/penguins.csv")
[1] FALSE
<- function(url, file) {
downloadFile if (!file.exists(file)) { # check if the file does not exist
download.file(url, file)
else {
} message("File exists already, not downloading")
}
}
## test each step of fonction:
### condition1: File does not exit
#!file.exists(file) # condition = TRUE --> file does not exist yet
#download.file(url, file) # code executed with build-in function
#!file.exists(file) # condition = FALSE --> NOW the file exist yet
#message("File exists already, not downloading") # message in red
# If you would like to test the function step by step:
## 1. write the variable
= "https://ivanek.github.io/introductionToR/data/penguins.csv"
url = "data/penguins.csv"
file
# Check if the file exist:
!file.exists(file)# The file already exist
[1] FALSE
downloadFile(url, file)
File exists already, not downloading
Exercise 5
Knowing these information:
- In the water surrounding the Biscoe island you find only krills.
- In the water surrounding the Dream island you find only squids.
- In the water surrounding the Torgersen island you find only fish.
Add a new column to the penguin dataset that specifies what food diet each penguin has.
# `case_when()` is particularly useful inside `mutate()` when you want to
# create a new variable that relies on an existing
# variables
= dat |>
dat mutate(
diet = case_when(
== "Biscoe" ~ "krill",
island == "Dream" ~ "squid",
island .default = "fish"
)
)
table(dat$diet) # Base R
fish krill squid
52 168 124
Exercise 6
Add an additional column into penguin dataset, classifying if the penguin body_mass_g
is greater than median
, or smaller/equal (factor with two levels: “greater, less”). Make sure the NA
values in body_mass_g
stay NA
values in the new column.
# if the weight is greater than median (using ifelse)
= dat |>
dat mutate(
weight_category = case_when(
> median(body_mass_g,na.rm=T) ~ "greater",
body_mass_g <= median(body_mass_g,na.rm=T) ~ "less",
body_mass_g
)
)
|>
dat count(weight_category)
# A tibble: 3 × 2
weight_category n
<chr> <int>
1 greater 166
2 less 176
3 <NA> 2
Session info
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Zurich
tzcode source: internal
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] writexl_1.5.1 readxl_1.4.3 lubridate_1.9.3 forcats_1.0.0
[5] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
[9] tidyr_1.3.1 tibble_3.2.1 tidyverse_2.0.0 gridExtra_2.3
[13] ggplot2_3.5.1 devtools_2.4.5 usethis_3.0.0 BiocStyle_2.33.1
[17] png_0.1-8 knitr_1.48
loaded via a namespace (and not attached):
[1] gtable_0.3.6 xfun_0.49 htmlwidgets_1.6.4
[4] remotes_2.5.0 tzdb_0.4.0 vctrs_0.6.5
[7] tools_4.4.2 generics_0.1.3 parallel_4.4.2
[10] fansi_1.0.6 pkgconfig_2.0.3 RColorBrewer_1.1-3
[13] lifecycle_1.0.4 farver_2.1.2 compiler_4.4.2
[16] munsell_0.5.1 httpuv_1.6.15 htmltools_0.5.8.1
[19] yaml_2.3.10 later_1.3.2 pillar_1.9.0
[22] crayon_1.5.3 urlchecker_1.0.1 ellipsis_0.3.2
[25] cachem_1.1.0 sessioninfo_1.2.2 mime_0.12
[28] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.4
[31] labeling_0.4.3 fastmap_1.2.0 colorspace_2.1-1
[34] cli_3.6.3 magrittr_2.0.3 pkgbuild_1.4.5
[37] utf8_1.2.4 withr_3.0.2 scales_1.3.0
[40] promises_1.3.0 bit64_4.5.2 timechange_0.3.0
[43] rmarkdown_2.29 bit_4.5.0 cellranger_1.1.0
[46] hms_1.1.3 memoise_2.0.1 shiny_1.9.1
[49] evaluate_1.0.1 miniUI_0.1.1.1 profvis_0.4.0
[52] rlang_1.1.4 Rcpp_1.0.13-1 xtable_1.8-4
[55] glue_1.8.0 BiocManager_1.30.25 pkgload_1.4.0
[58] rstudioapi_0.17.1 vroom_1.6.5 jsonlite_1.8.9
[61] R6_2.5.1 fs_1.6.5