In this session, we will see how to use R’s vectorised operations to perform calculations more efficiently, and introduce functions in R and how to write them.
We often need to process sets of values rather than single values. Such sets can be stored inside a vector. In many programming languages, operations on vectors are typically performed element by element in a loop:
for (i=0; i < length; i++) {
z[i] = x[i] + y[i];
}
What this loop does:
Start i at 0.
Keep looping while i is less than length.
After each iteration, increase i by 1.
Within the loop, it perform the addition for the i-th elements
of x and y, and store the result in z[i].
In interpreted languages such as R, this would be computationally
inefficient, as the code within the loop is interpreted again for each
iteration.
However, interpreted languages often provide common vectorised
operations. A vectorised operation is simply an operation that is done
on an entire vector rather than on single values, removing the need
for interpretation of code at each iteration. For example, the loop
above would be replaced by:
z <- x + y
Vectorised operations are typically much faster to compute than looping single-value operations, and also easier to read. Consequently, vectorised operations are essential for efficient R programming.
If you feel the impulse of writing a loop, think twice before doing it! Indeed, R is particularly geared towards vectorised operations.
On the one hand, most things in R are already vectors (or composed of vectors):
On the other hand, R also includes default behaviours to help work with vectors:
NB: matrices (arrays) are often not recycled with the same rules, and often must have the same dimensions, without which warnings or errors are thrown. See also: R documentation about recycling
x:Python code:
x = [0,1,2,3,4,5]
x = [item+1 for item in x]
x
## [1, 2, 3, 4, 5, 6]
However, in R code we can write this:
x <- 0:5
x + 1
## [1] 1 2 3 4 5 6
x + 1 is the addition of two vectors of lengths 6
and 1. For this to work, the vectors need to have the same length
and R recycles (repeats) the shorter vector (1) to match the
longer vector (x).The shorter vector is always recycled to match the longer one.
y <- 5:0
z <- x + y
z
## [1] 5 5 5 5 5 5
Can you guess and explain the results in the following cases?
r <- c(1, 1, 1, 1)
s <- c(1, 3)
r + s
## [1] 2 4 2 4
=> R recycles the shorter vector along the long one and perform the
addition element-wise. The expression c(1,1,1,1) + c(1,3) is
equivalent to c(1,1,1,1) + c(1,3,1,3)
v <- c(1, 2, 3)
r + v
## Warning in r + v: longer object length is not a multiple of shorter object length
## [1] 2 3 4 2
R still recycles the shorter vector, but in this case the longer object length is not a multiple of shorter object length, and R throws a warning.
r + 5L
## [1] 6 6 6 6
=> This is the addition of a numerical vector (r) to an integer
number (5L). In this case, R converts the integer vector also to a
numerical vector and recycles it as above. Compare class(r) and
class(5L) to class(r + 5L).
matrix is internally represented as a vector.
M (length(M)) is the product of its
dimensions (prod(dim(M))).m <- matrix(11:16, nrow = 2, ncol = 3)
m # matrix printed as a matrix
## [,1] [,2] [,3]
## [1,] 11 13 15
## [2,] 12 14 16
m[2, 2] # extracting a single value from a matrix with 2D coordinates
## [1] 14
m[4] # extracting the same value from the matrix using the internal vector's index
## [1] 14
m[1:length(m)] # extracting the underlying vector --> m[1:6]
## [1] 11 12 13 14 15 16
Note: notice that (by default) matrices are filled column by column!
m1 <- matrix(1:4, nrow = 2, ncol = 2)
m1
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
# add 1
m1 + 1 # equivalent to m1 + matrix(c(1,1,1,1),nrow=2,ncol=2)
## [,1] [,2]
## [1,] 2 4
## [2,] 3 5
Examples:
m1 + c(1:2)
## [,1] [,2]
## [1,] 2 4
## [2,] 4 6
m1 + c(1:4)
## [,1] [,2]
## [1,] 2 6
## [2,] 4 8
Can you guess what would happen by evaluating this statement?
m1 + c(1:3)
## Warning in m1 + c(1:3): longer object length is not a multiple of shorter object length
## [,1] [,2]
## [1,] 2 6
## [2,] 4 5
=> R again recycles the shorter vector and performs the addition element-wise, with a warning.
Interestingly, if the vector were longer, R throws an error about incompatible dimensions (see also R documentation about recycling):
m1 + 1:8
Error: dims [product 4] do not match the length of object [8]
m2 <- matrix(5:8, nrow = 2, ncol = 2)
m1 + m2 # element-wise addition
## [,1] [,2]
## [1,] 6 10
## [2,] 8 12
Can you guess what would happen in this case?
m3 <- matrix(1:6, nrow = 3, ncol = 2)
m1 + m3
Error in m1 + m3 : non-conformable arrays
=> In operations with more than one matrix or array, all the matrices
(arrays) must have the same dim attribute (see also R documentation
about
recycling).
R does not automatically recycle non-conforming matrices and throws an
error about incompatible dimensions.
"*") is element-wise multiplication
(similar to addition).m1 * m2 is not to be confused with
matrix multiplicationm1 * m2
## [,1] [,2]
## [1,] 5 21
## [2,] 12 32
%*% operator instead of the * operator.\[ C_{ij} = \sum_{k=1}^n A_{ik}*B_{kj} \]
Figure 1: From https://en.wikipedia.org/wiki/Matrix_multiplication
Multiply the rows of the first matrix by the columns of the second matrix and add up the results.
The number of columns in first matrix must match the number of rows in second matrix.
Example:
m1 %*% m2
## [,1] [,2]
## [1,] 23 31
## [2,] 34 46
Step-by-step calculation
Top-left : (Row 1 of m1 × Column 1 of m2):
(1∗5)+(3∗6)=5+18=23
Top-right : (Row 1 of m1 × Column 2 of m2):
(1∗7)+(3∗8)=7+24=31
Bottom-left: (Row 2 of m1 × Column 1 of m2):
(2∗5)+(4∗6)=10+24=34
Bottom-right: (Row 2 of m1 × Column 2 of m2):
(2∗7)+(4∗8)=14+32=46
Matrix multiplication is used in many statistical approaches (for example linear models).
m3 <- matrix(1:6, nrow = 3, ncol = 2)
m3
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
c(1, 2, 3) %*% m3
## [,1] [,2]
## [1,] 14 32
m3 %*% c(2, 2)
## [,1]
## [1,] 10
## [2,] 14
## [3,] 18
Hint: \((°C * 9/5) + 32 = °F\).
x <- c(-20, 0, 100)
(x * 9/5) + 32
## [1] -4 32 212
m <- matrix(1:20, nrow=4, ncol=5), what are the
results of:m <- matrix(1:20, nrow = 4, ncol = 5)
# a)
m^0.5 # or sqrt(m)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.000000 2.236068 3.000000 3.605551 4.123106
## [2,] 1.414214 2.449490 3.162278 3.741657 4.242641
## [3,] 1.732051 2.645751 3.316625 3.872983 4.358899
## [4,] 2.000000 2.828427 3.464102 4.000000 4.472136
# b)
m * c(0, 1, 2, 3)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 0 0
## [2,] 2 6 10 14 18
## [3,] 6 14 22 30 38
## [4,] 12 24 36 48 60
# c)
m > c(100, 0)
## [,1] [,2] [,3] [,4] [,5]
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] TRUE TRUE TRUE TRUE TRUE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] TRUE TRUE TRUE TRUE TRUE
# d)
m * m # or m^2
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 25 81 169 289
## [2,] 4 36 100 196 324
## [3,] 9 49 121 225 361
## [4,] 16 64 144 256 400
# e) m %*% m # returns error: non-conformable arguments, we need to transpose
# one of the matrices
t(m) %*% m
## [,1] [,2] [,3] [,4] [,5]
## [1,] 30 70 110 150 190
## [2,] 70 174 278 382 486
## [3,] 110 278 446 614 782
## [4,] 150 382 614 846 1078
## [5,] 190 486 782 1078 1374
In R, functions are pieces of R code that are kept inside R
objects of class "function". They can take input via arguments
(e.g. some integers, vectors, character strings…), process that
input, and then return some output (e.g. a matrix, a figure, some
text, a file…).
Many functions in R are written themselves in R. R can show you the
code inside a function.
To view the code implementation of a function, type the name of the
function without the parentheses “(” and “)”. Here is an example:
# R's implementation for standard deviation (a one-line function)
sd
## function (x, na.rm = FALSE)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
## na.rm = na.rm))
## <bytecode: 0x12901f3f8>
## <environment: namespace:stats>
A great many functions are provided in base R, and many more can be added to your session by loading packages.
Finally, it is easy in R to create and work with user-defined functions, which is what we will explore here. Creating and using your own functions helps keep your code tidy and readable, avoids copy-pasting code multiple times, and also helps with debugging, especially within an IDE such as RStudio.
function(...). myFunction <- function(arg1, arg2, ... ){
statements
return(object)
}
myFunction in the example above).
Naming rules for functions are identical to rules for other R
objects (names should start with a letter, should be descriptive,
etc.).+)
and can be called using `+` (function name enclosed in
backticks).Internally, a function has three parts, which can be viewed using the following functions:
formals() (or args()): the list of arguments which
define how you call the functionbody(): the statements/code executed by the functionenvironment(): the “map” of the location of the function’s
variables.Let’s consider a concrete example:
sd
## function (x, na.rm = FALSE)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
## na.rm = na.rm))
## <bytecode: 0x12901f3f8>
## <environment: namespace:stats>
# R's implementation of standard deviation
formals(sd)
## $x
##
##
## $na.rm
## [1] FALSE
args(sd) # more human-readable version of formals()
## function (x, na.rm = FALSE)
## NULL
body(sd)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
## na.rm = na.rm))
environment(sd)
## <environment: namespace:stats>
sum, “+”, “[”, sin,
cos, etc.). For efficiency, they directly call internal code by
using the .Primitive function, and do not have these three
components.sum
## function (..., na.rm = FALSE) .Primitive("sum")
# R's implementation of sum
formals(sum)
## NULL
body(sum)
## NULL
environment(sum)
## NULL
+”, “<-”, “=”, etc.) are in fact special functions
that take one or two arguments, and that can be written without
parantheses.function name + round parentheses + arguments inside parentheses
For example: Randomly draw a vector of size 1000 from a normal distribution
rnorm # function (n, mean = 0, sd = 1)
## function (n, mean = 0, sd = 1)
## .Call(C_rnorm, n, mean, sd)
## <bytecode: 0x12ba91158>
## <environment: namespace:stats>
x <- rnorm(1000) #
The arguments can be defined in two ways:
sd() the x)sd(),
the na.rm argument has a default value of FALSE)When calling a function, the values in the call are matched to arguments either by name or by position.
For example, the following call does probably not what you intended (and depending on your R version may even throw an error):
# x <- rnorm(1000) # reminder
sd(TRUE, x) # incorrect, expects first `x` and second `na.rm`
In the above call, arguments were matched by position, equivalent to
sd(x = TRUE, na.rm = x).
You can pass arguments in an arbitrary order by using their names, in which case R will first match all named arguments, and then the remaining unnamed ones by position:
# ... which can make the code less readable
sd(na.rm = TRUE, x = x)
## [1] 0.9997483
sd(na.rm = TRUE, x)
## [1] 0.9997483
sd(TRUE, x = x)
## [1] 0.9997483
a and b.add <- function(a, b) {
a + b
}
add(1:3, 1:3)
## [1] 2 4 6
...A function can only handle as many arguments as specified in
formals, unless the function uses the special ... argument (e.g.
the plot function).
The special argument ... (dot-dot-dot, or ellipsis) can accept any
number of named/unnamed values. Here is an illustrative
[example:\\](example:){.uri}
count_args <- function(...) {
arglist <- list(...)
message("you called me using ", length(arglist), " arguments")
}
count_args()
## you called me using 0 arguments
count_args(1:10)
## you called me using 1 arguments
count_args(1, 2, 3, 4, 5)
## you called me using 5 arguments
The ellipsis argument will be assigned all arguments that could not be name- nor position-matched.
Exercise 9
How many arguments are “eaten up” by the ellipsis (...) in the
following examples:
count_args(1:2)
count_args(1, 2)
count_args(1:2, 3:4)
count_args(1:4)
count_args(1:2): 1 (a single vector with two elements)count_args(1, 2): 2 (two length-one vectors)count_args(1:2, 3:4): 2 (two length-two vectors)count_args(1:4): 1 (a single vector with four elements)An environment is a place in R’s memory where R objects can be stored, including values, datasets, functions, other environments, and more.
By default, when working from the command line, you are working within the global environment.
The scope of a variable defines where a variable “lives”. It is the region of code in which a variable is visible and can be used in expressions.
The arguments of a function have a scope that is limited to the body
of that function, e.g. myarg does not exist in the global
environment, but it does exist inside of the environment of
myfunction(myarg = 1):
# no myarg in the current environment
exists("myarg")
## [1] FALSE
# but myarg exists inside of myfunction
myfunction <- function(myarg = 1) {
return(exists("myarg"))
}
myfunction()
## [1] TRUE
environment of the function itself (remember that each function
gets its own environment, one of the three parts of a function).Here is an illustration:
f <- function() {
x <- 1
y <- 2
c(x, y)
}
f()
## [1] 1 2
rm(f) # clean-up
# exists(x)
If a variable is not defined inside a function, R will look one level
up, in its parent environment:
x <- 2
g <- function() {
y <- 1
c(x, y) # x is defined in the parent environement
}
g()
## [1] 2 1
rm(x, g) # clean-up
x <- 2
g <- function() {
y <- 1
c(x, y)
}
x <- 3
# g() uses the value of `x` at the time of the call to g()
g()
## [1] 3 1
rm(x, g)
()) and skips non-function objects when searching:sum # the function
## function (..., na.rm = FALSE) .Primitive("sum")
sum <- 15
sum # the variable, not the function
## [1] 15
sum(c(1, 2, 3)) # still works
## [1] 6
sum
## [1] 15
help(sum) # still works
rm(sum) # removing the variable...
sum # ...unmasks the function
## function (..., na.rm = FALSE) .Primitive("sum")
With every function call, the environment of the function is freshly created, which means that functions have no automatic memory.
sum(1,2,3,4,5) ?sum(1, 2, 3, 4, 5)
## [1] 15
mean(1,2,3,4,5) ?mean(1, 2, 3, 4, 5)
## [1] 1
# check the help page of both functions
`?`(sum)
`?`(mean)
=> We can see that sum uses the dot-dot-dot ... argument, which means
that all arguments passed without a specific name are just summed-up
together. Even if we provide logical value! The only way how to tell
that the logical is a na.rm argument is to directly name it.
sum(1, 2, 3, 4, 5, TRUE)
## [1] 16
sum(1, 2, 3, 4, 5, na.rm = TRUE)
## [1] 15
=> The function mean on the other hand calculates the mean only from
first argument. The others are passed to internal functions which do the
calculation. It also accepts a na.rm argument. It can also accept
trim argument for calculation of a trimmed mean.
Provide the numbers in a single vector.
sum(c(1, 2, 3, 4, 5))
## [1] 15
mean(c(1, 2, 3, 4, 5))
## [1] 3
C_to_F <- function(x) {
(x * 9/5) + 32
}
C_to_F(c(-20, 0, 100))
## [1] -4 32 212
gb_val <- 10
edit_global_val <- function(x) {
gb_val <- gb_val + x
return(gb_val)
}
# apply the edit...
edit_global_val(5)
# ...but did the global value actually change?
gb_val
## [1] 15
## [1] 10
=> The function returns the edited global value… But the actual
global value has not changed! This is because the moment we assigned a
value to an object called gb_val inside the function using
gb_val <- gb_val + x, we actually created a copy of gb_val that is
limited to the scope of the function. As such, the assignment only
changes the local copy, not the global one.
Is there a way around this? Beyond the scope of this session, but…
gb_val <- 10
edit_global_val <- function(x) {
# force the assignment to occur within the global environment
assign("gb_val", gb_val + x, envir = parent.env(environment()))
return(gb_val)
}
# apply the edit...
edit_global_val(5)
## [1] 15
# ...but did the global value actually change?
gb_val
## [1] 15
Loose rule: avoid editing values in other environments from within functions.