In this session, we will see how to use R’s vectorised operations to perform calculations more efficiently, and introduce functions in R and how to write them.

1 Vectorized operations

1.1 Introduction

We often need to process sets of values rather than single values. Such sets can be stored inside a vector. In many programming languages, operations on vectors are typically performed element by element in a loop:

for (i=0; i < length; i++) {
  z[i] = x[i] + y[i];
}

What this loop does:

  • Start i at 0.

  • Keep looping while i is less than length.

  • After each iteration, increase i by 1.

  • Within the loop, it perform the addition for the i-th elements of x and y, and store the result in z[i].

In interpreted languages such as R, this would be computationally inefficient, as the code within the loop is interpreted again for each iteration.
However, interpreted languages often provide common vectorised operations. A vectorised operation is simply an operation that is done on an entire vector rather than on single values, removing the need for interpretation of code at each iteration. For example, the loop above would be replaced by:

z <- x + y
  • Vectorised operations are typically much faster to compute than looping single-value operations, and also easier to read. Consequently, vectorised operations are essential for efficient R programming.

  • If you feel the impulse of writing a loop, think twice before doing it! Indeed, R is particularly geared towards vectorised operations.

    • On the one hand, most things in R are already vectors (or composed of vectors):

      • a single number is a (length-one) vector;
      • matrices are internally represented as vectors.
    • On the other hand, R also includes default behaviours to help work with vectors:

      • if vectors in an operation are not all of the same length, the shorter vector is recycled (repeated) as needed.
      • many basic operations (adding, multiplying, calculating means, …) are vectorised.
  • NB: matrices (arrays) are often not recycled with the same rules, and often must have the same dimensions, without which warnings or errors are thrown. See also: R documentation about recycling

1.2 Simple mathematical operations

1.2.1 Vector and single number

  • add, subtract, divide, multiply, modulo …
  • in other programming languages (e.g. C, Perl, Python…), we would use a ‘for’ loop in order to modify each element of the vector x:

Python code:

x = [0,1,2,3,4,5]
x = [item+1 for item in x]
x
## [1, 2, 3, 4, 5, 6]

However, in R code we can write this:

x <- 0:5
x + 1
## [1] 1 2 3 4 5 6
  • How does this work?
  1. The expression x + 1 is the addition of two vectors of lengths 6 and 1. For this to work, the vectors need to have the same length and R recycles (repeats) the shorter vector (1) to match the longer vector (x).
  2. Now that both vectors are of the same length, additions are performed element-wise.
  3. Print result to screen.

The shorter vector is always recycled to match the longer one.

1.2.2 Two vectors

  • Similarly, we can add elements of two vectors:
y <- 5:0
z <- x + y
z
## [1] 5 5 5 5 5 5
  • \(z_i\), the element at index \(i\) of the result \(z\), is the sum of the elements \(x_i\) and \(y_i\): \(z_i = x_i + y_i\)
  • Remember: R performs this operation element-wise

Can you guess and explain the results in the following cases?

  • Exercise 1
r <- c(1, 1, 1, 1)
s <- c(1, 3)
r + s
## [1] 2 4 2 4

=> R recycles the shorter vector along the long one and perform the addition element-wise. The expression c(1,1,1,1) + c(1,3) is equivalent to c(1,1,1,1) + c(1,3,1,3)

  • Exercise 2
v <- c(1, 2, 3)
r + v
## Warning in r + v: longer object length is not a multiple of shorter object length
## [1] 2 3 4 2

R still recycles the shorter vector, but in this case the longer object length is not a multiple of shorter object length, and R throws a warning.

  • Exercise 3
r + 5L
## [1] 6 6 6 6

=> This is the addition of a numerical vector (r) to an integer number (5L). In this case, R converts the integer vector also to a numerical vector and recycles it as above. Compare class(r) and class(5L) to class(r + 5L).

1.2.3 Matrix and single number

  • In R, a matrix is internally represented as a vector.
    • Values can be looked up using 2-dimensional coordinates (rows, columns)
    • Values can be looked up using the underlying 1-dimensional index.
    • The length of a matrix M (length(M)) is the product of its dimensions (prod(dim(M))).
  • For example, a 2-by-3 matrix has length 6 (= 2 rows x 3 cols):
m <- matrix(11:16, nrow = 2, ncol = 3)
m  # matrix printed as a matrix
##      [,1] [,2] [,3]
## [1,]   11   13   15
## [2,]   12   14   16
m[2, 2]  # extracting a single value from a matrix with 2D coordinates
## [1] 14
m[4]  # extracting the same value from the matrix using the internal vector's index
## [1] 14
m[1:length(m)]  # extracting the underlying vector --> m[1:6]
## [1] 11 12 13 14 15 16

Note: notice that (by default) matrices are filled column by column!

  • When performing vector operations on matrices, the shorter vectors are recycled to that length. Hence, in the example below, R recycles a single number to match the dimensions of a matrix, and again performs calculations element-wise.
m1 <- matrix(1:4, nrow = 2, ncol = 2)
m1
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
# add 1
m1 + 1  # equivalent to m1 + matrix(c(1,1,1,1),nrow=2,ncol=2)
##      [,1] [,2]
## [1,]    2    4
## [2,]    3    5

1.2.4 Matrix and vector

Examples:

m1 + c(1:2)
##      [,1] [,2]
## [1,]    2    4
## [2,]    4    6
m1 + c(1:4)
##      [,1] [,2]
## [1,]    2    6
## [2,]    4    8

Can you guess what would happen by evaluating this statement?

  • Exercise 4
m1 + c(1:3)
## Warning in m1 + c(1:3): longer object length is not a multiple of shorter object length
##      [,1] [,2]
## [1,]    2    6
## [2,]    4    5

=> R again recycles the shorter vector and performs the addition element-wise, with a warning.

Interestingly, if the vector were longer, R throws an error about incompatible dimensions (see also R documentation about recycling):

m1 + 1:8
Error: dims [product 4] do not match the length of object [8]

1.2.5 Two matrices

  • As before, all simple mathematical operations are applied element-wise.
m2 <- matrix(5:8, nrow = 2, ncol = 2)
m1 + m2  # element-wise addition
##      [,1] [,2]
## [1,]    6   10
## [2,]    8   12

Can you guess what would happen in this case?

  • Exercise 5
m3 <- matrix(1:6, nrow = 3, ncol = 2)
m1 + m3
Error in m1 + m3 : non-conformable arrays

=> In operations with more than one matrix or array, all the matrices (arrays) must have the same dim attribute (see also R documentation about recycling). R does not automatically recycle non-conforming matrices and throws an error about incompatible dimensions.

1.3 Linear algebra

1.3.1 Simple multiplication

  • Simple multiplication ("*") is element-wise multiplication (similar to addition).
  • Important: The expression m1 * m2 is not to be confused with matrix multiplication
    (see below how to do that in R).
  • Matrices need to have identical dimensions.
m1 * m2
##      [,1] [,2]
## [1,]    5   21
## [2,]   12   32

1.3.2 Matrix product of two matrices

  • If we want to calculate the matrix-product of two matrices, we need to use the %*% operator instead of the * operator.
  • In this case, the calculation is based on the formula below, illustrated
    in the figure.

\[ C_{ij} = \sum_{k=1}^n A_{ik}*B_{kj} \]

  • Multiply the rows of the first matrix by the columns of the second matrix and add up the results.

  • The number of columns in first matrix must match the number of rows in second matrix.

Example:

m1 %*% m2
##      [,1] [,2]
## [1,]   23   31
## [2,]   34   46

  • Step-by-step calculation

    1. Top-left : (Row 1 of m1 × Column 1 of m2): (1∗5)+(3∗6)=5+18=23

    2. Top-right : (Row 1 of m1 × Column 2 of m2): (1∗7)+(3∗8)=7+24=31

    3. Bottom-left: (Row 2 of m1 × Column 1 of m2): (2∗5)+(4∗6)=10+24=34

    4. Bottom-right: (Row 2 of m1 × Column 2 of m2): (2∗7)+(4∗8)=14+32=46

  • Matrix multiplication is used in many statistical approaches (for example linear models).

1.3.3 Matrix product of one matrix and a vector

  • We can also calculate the product of matrix and a vector. The vector’s length must match one of the matrices dimensions, depending on which side of the operation the vector is.
    • Left: row vector of length equal to height of matrix
    • Right: column vector of length equal to width of matrix
m3 <- matrix(1:6, nrow = 3, ncol = 2)
m3
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
c(1, 2, 3) %*% m3
##      [,1] [,2]
## [1,]   14   32
m3 %*% c(2, 2)
##      [,1]
## [1,]   10
## [2,]   14
## [3,]   18

1.4 More exercises

  • Exercise 6
    Convert the following set of temperatures measured in degrees of Celsius to Fahrenheit, -20°C, 0°C, 100°C, using vectorized arithmetic.

Hint: \((°C * 9/5) + 32 = °F\).

x <- c(-20, 0, 100)
(x * 9/5) + 32
## [1]  -4  32 212
  • Exercise 7
    Given a matrix m m <- matrix(1:20, nrow=4, ncol=5), what are the results of:
  1. m ^ 0.5
  2. m * c(0,1,2,3)
  3. m > c(100, 0)
  4. m * m
  5. m %*% m
m <- matrix(1:20, nrow = 4, ncol = 5)
# a)
m^0.5  # or sqrt(m)
##          [,1]     [,2]     [,3]     [,4]     [,5]
## [1,] 1.000000 2.236068 3.000000 3.605551 4.123106
## [2,] 1.414214 2.449490 3.162278 3.741657 4.242641
## [3,] 1.732051 2.645751 3.316625 3.872983 4.358899
## [4,] 2.000000 2.828427 3.464102 4.000000 4.472136
# b)
m * c(0, 1, 2, 3)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    2    6   10   14   18
## [3,]    6   14   22   30   38
## [4,]   12   24   36   48   60
# c)
m > c(100, 0)
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,]  TRUE  TRUE  TRUE  TRUE  TRUE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,]  TRUE  TRUE  TRUE  TRUE  TRUE
# d)
m * m  # or m^2
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1   25   81  169  289
## [2,]    4   36  100  196  324
## [3,]    9   49  121  225  361
## [4,]   16   64  144  256  400
# e) m %*% m # returns error: non-conformable arguments, we need to transpose
# one of the matrices
t(m) %*% m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   30   70  110  150  190
## [2,]   70  174  278  382  486
## [3,]  110  278  446  614  782
## [4,]  150  382  614  846 1078
## [5,]  190  486  782 1078 1374

2 Functions

2.1 Introduction

  • In R, functions are pieces of R code that are kept inside R objects of class "function". They can take input via arguments (e.g. some integers, vectors, character strings…), process that input, and then return some output (e.g. a matrix, a figure, some text, a file…).

  • Many functions in R are written themselves in R. R can show you the code inside a function.
    To view the code implementation of a function, type the name of the function without the parentheses “(” and “)”. Here is an example:

# R's implementation for standard deviation (a one-line function)
sd
## function (x, na.rm = FALSE) 
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
##     na.rm = na.rm))
## <bytecode: 0x12901f3f8>
## <environment: namespace:stats>
  • A great many functions are provided in base R, and many more can be added to your session by loading packages.

  • Finally, it is easy in R to create and work with user-defined functions, which is what we will explore here. Creating and using your own functions helps keep your code tidy and readable, avoids copy-pasting code multiple times, and also helps with debugging, especially within an IDE such as RStudio.

2.2 Function structure

2.2.1 Code structure

  • You create your own functions using function(...).
    The basic code structure of a function is given below.
    myFunction <- function(arg1, arg2, ... ){
      statements
      return(object)
    }

2.2.2 Naming convention

  • Functions are typically named (myFunction in the example above). Naming rules for functions are identical to rules for other R objects (names should start with a letter, should be descriptive, etc.).
  • Some functions have special names (e.g. the addition function +) and can be called using `+` (function name enclosed in backticks).
  • In special cases, we can use anonymous functions (without names, which are also called “lambda functions” in other programming languages), which we will see in the next session.

2.2.3 Internal parts of a function object

  • Internally, a function has three parts, which can be viewed using the following functions:

    • the formals() (or args()): the list of arguments which define how you call the function
    • the body(): the statements/code executed by the function
    • the environment(): the “map” of the location of the function’s variables.

Let’s consider a concrete example:

sd
## function (x, na.rm = FALSE) 
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
##     na.rm = na.rm))
## <bytecode: 0x12901f3f8>
## <environment: namespace:stats>
# R's implementation of standard deviation
formals(sd)
## $x
## 
## 
## $na.rm
## [1] FALSE
args(sd)  # more human-readable version of formals()
## function (x, na.rm = FALSE) 
## NULL
body(sd)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
##     na.rm = na.rm))
environment(sd)
## <environment: namespace:stats>

2.2.4 Primitive functions

  • Exceptions to the structure presented above are “primitive” (internally implemented) functions (e.g. sum, “+”, “[”, sin, cos, etc.). For efficiency, they directly call internal code by using the .Primitive function, and do not have these three components.
sum
## function (..., na.rm = FALSE)  .Primitive("sum")
# R's implementation of sum
formals(sum)
## NULL
body(sum)
## NULL
environment(sum)
## NULL
  • Operators (“+”, “<-”, “=”, etc.) are in fact special functions that take one or two arguments, and that can be written without parantheses.

2.3 Calling functions

  • To call a function is simply to run that function’s code on some input arguments. We have already done this many times in this course:
function name + round parentheses + arguments inside parentheses

For example: Randomly draw a vector of size 1000 from a normal distribution

rnorm  # function (n, mean = 0, sd = 1) 
## function (n, mean = 0, sd = 1) 
## .Call(C_rnorm, n, mean, sd)
## <bytecode: 0x12ba91158>
## <environment: namespace:stats>
x <- rnorm(1000)  # 

2.3.1 Arguments

  • The arguments can be defined in two ways:

    • mandatory arguments, without a default value (in the case of sd() the x)
    • optional arguments, with a default value (in the case of sd(), the na.rm argument has a default value of FALSE)
  • When calling a function, the values in the call are matched to arguments either by name or by position.

For example, the following call does probably not what you intended (and depending on your R version may even throw an error):

# x <- rnorm(1000) # reminder
sd(TRUE, x)  # incorrect, expects first `x` and second `na.rm`

In the above call, arguments were matched by position, equivalent to sd(x = TRUE, na.rm = x).

You can pass arguments in an arbitrary order by using their names, in which case R will first match all named arguments, and then the remaining unnamed ones by position:

# ... which can make the code less readable
sd(na.rm = TRUE, x = x)
## [1] 0.9997483
sd(na.rm = TRUE, x)
## [1] 0.9997483
sd(TRUE, x = x)
## [1] 0.9997483
  • Exercise 8
    Write a function that calculates the (element-wise) addition of two vectors a and b.
add <- function(a, b) {
    a + b
}
add(1:3, 1:3)
## [1] 2 4 6

2.3.2 Dot-dot-dot operator ...

  • A function can only handle as many arguments as specified in formals, unless the function uses the special ... argument (e.g. the plot function).

  • The special argument ... (dot-dot-dot, or ellipsis) can accept any number of named/unnamed values. Here is an illustrative [example:\\](example:){.uri}

count_args <- function(...) {
    arglist <- list(...)
    message("you called me using ", length(arglist), " arguments")
}
count_args()
## you called me using 0 arguments
count_args(1:10)
## you called me using 1 arguments
count_args(1, 2, 3, 4, 5)
## you called me using 5 arguments
  • The ellipsis argument will be assigned all arguments that could not be name- nor position-matched.

  • Exercise 9
    How many arguments are “eaten up” by the ellipsis (...) in the following examples:

count_args(1:2)
count_args(1, 2)
count_args(1:2, 3:4)
count_args(1:4)
  • count_args(1:2): 1 (a single vector with two elements)
  • count_args(1, 2): 2 (two length-one vectors)
  • count_args(1:2, 3:4): 2 (two length-two vectors)
  • count_args(1:4): 1 (a single vector with four elements)

2.4 Environments and variable scoping

  • An environment is a place in R’s memory where R objects can be stored, including values, datasets, functions, other environments, and more.

  • By default, when working from the command line, you are working within the global environment.

  • The scope of a variable defines where a variable “lives”. It is the region of code in which a variable is visible and can be used in expressions.

  • The arguments of a function have a scope that is limited to the body of that function, e.g. myarg does not exist in the global environment, but it does exist inside of the environment of myfunction(myarg = 1):

# no myarg in the current environment
exists("myarg")
## [1] FALSE
# but myarg exists inside of myfunction
myfunction <- function(myarg = 1) {
    return(exists("myarg"))
}
myfunction()
## [1] TRUE
  • You can however use variables inside a function that “live” outside, i.e. in the environment from which the function was called (e.g. a parent function, or the global environment). This works due to R’s variable lookup rules, described in more detail below.

2.4.1 Where to look-up for variables and functions (lexical lookup)

  • By default, R first tries to find variables or functions in the environment of the function itself (remember that each function gets its own environment, one of the three parts of a function).
  • If it cannot find an object there, it tries to find the object in the parent environment, from which the function was called.
  • This is continued until it reaches the global environment.
  • If no object is found with that name by then, R throws an error.

Here is an illustration:

f <- function() {
    x <- 1
    y <- 2
    c(x, y)
}
f()
## [1] 1 2
rm(f)  # clean-up
# exists(x)

If a variable is not defined inside a function, R will look one level up, in its parent environment:

x <- 2
g <- function() {
    y <- 1
    c(x, y)  # x is defined in the parent environement
}
g()
## [1] 2 1
rm(x, g)  # clean-up
  • This behavior might be problematic if there are already many variables in an R session, and a function might find one that by chance has a matching name. In most cases, this is not a desired behavior, as it may lead to different results depending on the current state of the global environment (i.e. the value of the variable). It is good practice to always test your code and functions inside a fresh R session. This is particularly important if you develop code in the global environment first, before moving it to an encapsulating function, as the objects you created will remain in the global environment.

2.4.2 When to look-up for variables / functions (dynamic look-up)

  • If the function refers to variable defined outside of its body, it uses the value at the time of the function call, not at the time when we define the function (the lookup happens during the function call).
x <- 2
g <- function() {
    y <- 1
    c(x, y)
}
x <- 3
# g() uses the value of `x` at the time of the call to g()
g()
## [1] 3 1
rm(x, g)

2.4.3 Same name – function vs. variable

  • Be careful when choosing the names for the objects and functions. If an object with the same name already exists in an environment, it will be overwritten with a new value. The only exception is for functions defined in base R or in packages, which are just “masked” by the newly created object, but can be unmasked again when it is removed.
  • In addition, R recognizes when you intend to call a function (round brackets ()) and skips non-function objects when searching:
sum  # the function
## function (..., na.rm = FALSE)  .Primitive("sum")
sum <- 15
sum  # the variable, not the function
## [1] 15
sum(c(1, 2, 3))  # still works
## [1] 6
sum
## [1] 15
help(sum)  # still works
rm(sum)  # removing the variable...
sum  # ...unmasks the function
## function (..., na.rm = FALSE)  .Primitive("sum")

2.4.4 A fresh start with every function call

With every function call, the environment of the function is freshly created, which means that functions have no automatic memory.

2.5 More exercises

  • Exercise 10a
    What is the output of sum(1,2,3,4,5) ?
sum(1, 2, 3, 4, 5)
## [1] 15
  • Exercise 10b
    What is the output of mean(1,2,3,4,5) ?
mean(1, 2, 3, 4, 5)
## [1] 1
  • Exercise 10c
    What is the reason for the observations from the two previous exercises?
# check the help page of both functions
`?`(sum)
`?`(mean)

=> We can see that sum uses the dot-dot-dot ... argument, which means that all arguments passed without a specific name are just summed-up together. Even if we provide logical value! The only way how to tell that the logical is a na.rm argument is to directly name it.

sum(1, 2, 3, 4, 5, TRUE)
## [1] 16
sum(1, 2, 3, 4, 5, na.rm = TRUE)
## [1] 15

=> The function mean on the other hand calculates the mean only from first argument. The others are passed to internal functions which do the calculation. It also accepts a na.rm argument. It can also accept trim argument for calculation of a trimmed mean.

  • Exercise 10d
    How can we achieve the desired result?

Provide the numbers in a single vector.

sum(c(1, 2, 3, 4, 5))
## [1] 15
mean(c(1, 2, 3, 4, 5))
## [1] 3
  • Exercise 11
    Write a function that converts temperatures in degrees Fahrenheit to degrees Celsius. C.F. previous exercise on the subject.
C_to_F <- function(x) {
    (x * 9/5) + 32
}
C_to_F(c(-20, 0, 100))
## [1]  -4  32 212
  • Exercise 12
    Can you predict what the code below does? Can you explain the resulting behaviour?
gb_val <- 10
edit_global_val <- function(x) {
    gb_val <- gb_val + x
    return(gb_val)
}
# apply the edit...
edit_global_val(5)
# ...but did the global value actually change?
gb_val
## [1] 15
## [1] 10

=> The function returns the edited global value… But the actual global value has not changed! This is because the moment we assigned a value to an object called gb_val inside the function using gb_val <- gb_val + x, we actually created a copy of gb_val that is limited to the scope of the function. As such, the assignment only changes the local copy, not the global one.

Is there a way around this? Beyond the scope of this session, but…

gb_val <- 10
edit_global_val <- function(x) {
    # force the assignment to occur within the global environment
    assign("gb_val", gb_val + x, envir = parent.env(environment()))
    return(gb_val)
}
# apply the edit...
edit_global_val(5)
## [1] 15
# ...but did the global value actually change?
gb_val
## [1] 15

Loose rule: avoid editing values in other environments from within functions.