In this session, we will see how to use R’s vectorised operations to perform calculations more efficiently, and introduce functions in R and how to write them.
for (i=0; i < length; i++) {
z[i] = x[i] + y[i];
}
z <- x + y
Vectorised operations are typically much faster to compute than looping single-value operations, and also easier to read. Consequently, vectorised operations are essential for efficient R programming.
If you feel the impulse of writing a loop, think twice before doing it! Indeed, R is particularly geared towards vectorised operations.
On the one hand, most things in R are already vectors (or composed of vectors):
On the other hand, R also includes default behaviours to help work with vectors:
NB: matrices (arrays) are often not recycled with the same rules, and often must have the same dimensions, without which warnings or errors are thrown. See also: R documentation about recycling
x
:Python code:
x = [0,1,2,3,4,5]
x = [item+1 for item in x]
x
## [1, 2, 3, 4, 5, 6]
However, in R code we can write this:
x <- 0:5
x + 1
## [1] 1 2 3 4 5 6
x + 1
is the addition of two vectors of lengths 6 and 1.
For this to work, the vectors need to have the same length and R recycles
(repeats) the shorter vector (1
) to match the longer vector (x
).The shorter vector is always recycled to match the longer one.
y <- 5:0
z <- x + y
z
## [1] 5 5 5 5 5 5
Can you guess and explain the results in the following cases?
r <- c(1,1,1,1)
s <- c(1,3)
r + s
## [1] 2 4 2 4
=> R recycles the shorter vector along the long one and perform the addition
element-wise. The expression c(1,1,1,1) + c(1,3)
is equivalent to
c(1,1,1,1) + c(1,3,1,3)
v <- c(1,2,3)
r + v
## Warning in r + v: longer object length is not a multiple of shorter object length
## [1] 2 3 4 2
R still recycles the shorter vector, but in this case the two vectors do not have compatible lengths, and R throws a warning.
r + 5L
## [1] 6 6 6 6
=> This is the addition of a numerical vector (r
) to an integer number (5L
).
In this case, R converts the integer vector also to a numerical vector and
recycles it as above. Compare class(r)
and class(5L)
to class(r + 5L)
.
matrix
is internally represented as a vector.
M
(length(M)
) is the product of its dimensions
(prod(dim(M))
).m <- matrix(11:16, nrow = 2, ncol = 3)
m # matrix printed as a matrix
## [,1] [,2] [,3]
## [1,] 11 13 15
## [2,] 12 14 16
m[2,2] # extracting a single value from a matrix with 2D coordinates
## [1] 14
m[4] # extracting the same value from the matrix using the internal vector's index
## [1] 14
m[1:length(m)] # extracting the underlying vector
## [1] 11 12 13 14 15 16
Note: notice that (by default) matrices are filled column by column!
m1 <- matrix(1:4, nrow = 2, ncol = 2)
m1
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
# add 1
m1 + 1
## [,1] [,2]
## [1,] 2 4
## [2,] 3 5
Examples:
m1 + c(1:2)
## [,1] [,2]
## [1,] 2 4
## [2,] 4 6
m1 + c(1:4)
## [,1] [,2]
## [1,] 2 6
## [2,] 4 8
Can you guess what would happen by evaluating this statement?
m1 + c(1:3)
## Warning in m1 + c(1:3): longer object length is not a multiple of shorter object length
## [,1] [,2]
## [1,] 2 6
## [2,] 4 5
=> R again recycles the shorter vector and performs the addition element-wise, with a warning.
Interestingly, if the vector were longer, R throws an error about incompatible dimensions (see also R documentation about recycling):
m1 + 1:8
Error: dims [product 4] do not match the length of object [8]
m1 * m2
is not to be confused with matrix
multiplicationm2 <- matrix(5:8, nrow = 2, ncol = 2)
m1 + m2 # element-wise addition
## [,1] [,2]
## [1,] 6 10
## [2,] 8 12
Can you guess what would happen in this case?
m3 <- matrix(1:6, nrow = 3, ncol = 2)
m1 + m3
Error in m1 + m3 : non-conformable arrays
=> In operations with more than one matrix or array, all the matrices (arrays)
must have the same dim
attribute (see also R documentation about recycling).
R does not automatically recycle non-conforming matrices and throws an error
about incompatible dimensions.
"*"
) is element-wise multiplication
(similar to addition).m1 * m2
## [,1] [,2]
## [1,] 5 21
## [2,] 12 32
%*%
operator instead of the *
operator.\[ C_{ij} = \sum_{k=1}^n A_{ik}*B_{kj} \]
NB: It is also obvious that the number of columns in first matrix must match the number of rows in second matrix.
Example:
m1 %*% m2
## [,1] [,2]
## [1,] 23 31
## [2,] 34 46
m3 <- matrix(1:6, nrow = 3, ncol = 2)
m3
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
c(1,2,3) %*% m3
## [,1] [,2]
## [1,] 14 32
m3 %*% c(2,2)
## [,1]
## [1,] 10
## [2,] 14
## [3,] 18
# with logical operator
x <- 1:1000
sum(x[c(TRUE, FALSE)])
## [1] 250000
# or with seq
sum(seq(1, 1000, 2))
## [1] 250000
Hint: \((°C * 9/5) + 32 = °F\).
x <- c(-20, 0, 100)
(x * 9/5) + 32
## [1] -4 32 212
m <- matrix(1:20, nrow=4, ncol=5)
, what are the results of:m <- matrix(1:20, nrow = 4, ncol = 5)
# a)
m ^ 0.5 # or sqrt(m)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.000000 2.236068 3.000000 3.605551 4.123106
## [2,] 1.414214 2.449490 3.162278 3.741657 4.242641
## [3,] 1.732051 2.645751 3.316625 3.872983 4.358899
## [4,] 2.000000 2.828427 3.464102 4.000000 4.472136
# b)
m * c(0,1,2,3)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 0 0
## [2,] 2 6 10 14 18
## [3,] 6 14 22 30 38
## [4,] 12 24 36 48 60
# c)
m > c(100, 0)
## [,1] [,2] [,3] [,4] [,5]
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] TRUE TRUE TRUE TRUE TRUE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] TRUE TRUE TRUE TRUE TRUE
# d)
m * m # or m^2
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 25 81 169 289
## [2,] 4 36 100 196 324
## [3,] 9 49 121 225 361
## [4,] 16 64 144 256 400
# e)
# m %*% m # returns error: non-conformable arguments, we need to transpose one of the matrices
t(m) %*% m
## [,1] [,2] [,3] [,4] [,5]
## [1,] 30 70 110 150 190
## [2,] 70 174 278 382 486
## [3,] 110 278 446 614 782
## [4,] 150 382 614 846 1078
## [5,] 190 486 782 1078 1374
# a)
1/1 + 1/(2^2) + 1/(3^2) # non-vectorised
## [1] 1.361111
# b)
x <- 1:100
sum(1/(x^2)) # vectorised
## [1] 1.634984
In R, functions are pieces of R code that are kept inside R objects of class
"function"
. They can take input via other R objects passed as arguments
(e.g. some integers, vectors, character strings…), process that input,
and then return some output in the form of another R object or something else
(e.g. a matrix, a figure, some text, a file…).
Many functions in R are written themselves in R, and also call other
R functions. R can show you the code inside a function.
To view the code implementation of a function, type the name of the function
without the parentheses “(” and “)”. Here is an example:
# R's implementation for set intersection
intersect
## function (x, y)
## {
## if (is.null(x) || is.null(y))
## return(NULL)
## u <- as.vector(x)
## v <- as.vector(y)
## c(u[!duplicated(unclass(u)) & (match(u, v, 0L) > 0L)], v[numeric()])
## }
## <bytecode: 0x1191defc8>
## <environment: namespace:base>
# R's implementation for standard deviation (a one-line function)
sd
## function (x, na.rm = FALSE)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
## na.rm = na.rm))
## <bytecode: 0x12d808ba8>
## <environment: namespace:stats>
A great many functions are provided in base R, and many more can be added to your session by loading packages.
Finally, it is easy in R to create and work with user-defined functions, which is what we will explore here. Creating and using your own functions helps keep your code tidy and readable, avoids copy-pasting code multiple times, and also helps with debugging, especially within an IDE such as RStudio.
function(...)
. myFunction <- function(arg1, arg2, ... ){
statements
return(object)
}
myFunction
in the example above). Naming rules
for functions are identical to rules for other R objects (names should start
with a letter, should be descriptive, etc.).+
)
and can be called using `+
` (function name enclosed in backticks).Internally, a function has three parts, which can be viewed using the following functions:
formals()
(or args()
): the list of arguments which define how you
call the functionbody()
: the statements/code executed by the functionenvironment()
: the “map” of the location of the function’s variables.Let’s consider a concrete example:
# R's implementation of standard deviation
formals(sd)
## $x
##
##
## $na.rm
## [1] FALSE
args(sd) # more human-readable version of formals()
## function (x, na.rm = FALSE)
## NULL
body(sd)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
## na.rm = na.rm))
environment(sd)
## <environment: namespace:stats>
sd
## function (x, na.rm = FALSE)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
## na.rm = na.rm))
## <bytecode: 0x12d808ba8>
## <environment: namespace:stats>
sum
, “+
”, “[
”, sin
, cos
, etc.).
For efficiency, they directly call internal code by using the .Primitive
function, and do not have these three components.# R's implementation of sum
formals(sum)
## NULL
body(sum)
## NULL
environment(sum)
## NULL
sum
## function (..., na.rm = FALSE) .Primitive("sum")
+
”, “<-
”, “=
”, etc.) are in fact special functions that take
one or two arguments, and that can be written without parantheses.mean
, summary
, print
) and
work by calling specific methods depending on the class of the arguments
provided. Covering these is beyond the scope of this session, however, here
is how to recognise one (beyond it being mentioned in the help page!):# R's generic implementation of mean
body(mean)
## UseMethod("mean")
# => body is made up entirely of a call to a method dispatcher.
function name + round parentheses + arguments inside parentheses
For example:
formals(sd) # view the arguments of the sd function
## $x
##
##
## $na.rm
## [1] FALSE
x <- rnorm(1000) # randomly draw a vector of size 1000 from a normal distribution
sd(x) # calculate the sd of the previous vector
## [1] 1.010394
invisible()
function on their output.# quick example
makeBigMatrix <- function() {
invisible(matrix(rnorm(1000*1000), nrow=1000, ncol=1000))
}
makeBigMatrix() # matrix returned is invisible: no printing to screen
m <- makeBigMatrix()
m[1:5,1:5] # can still retrieve and work with the returned matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.9635418 -0.1850505 0.2341487 -0.20526008 -0.55730703
## [2,] -0.2843295 0.6698055 1.9439888 0.04467237 1.47766488
## [3,] 0.9716795 0.4679881 0.3149382 -1.25812107 0.01324809
## [4,] 0.3488799 1.8589399 0.5091091 0.79241692 0.19902000
## [5,] 0.8091254 -1.1175470 0.4368816 -0.00209724 -1.88837251
return()
statement, the function
will return what was returned by its last executed statement.myadd <- function(a,b) {
a + b
}
myadd(3, 2)
## [1] 5
# => still returns a value
The arguments can be defined in two ways:
sd()
the x
)sd()
, the na.rm
argument has a default value of FALSE
)When calling a function, the values in the call are matched to arguments either by name or by position.
For example, the following call does probably not what you intended (and depending on your R version may even throw an error):
# x <- rnorm(1000) # reminder
sd(TRUE, x) # incorrect, expects first `x` and second `na.rm`
In the above call, arguments were matched by position, equivalent to
sd(x = TRUE, na.rm = x)
.
You can pass arguments in an arbitrary order by using their names, in which case R will first match all named arguments, and then the remaining unnamed ones by position:
# ... which can make the code less readable
sd(na.rm = TRUE, x = x)
## [1] 1.010394
sd(na.rm = TRUE, x)
## [1] 1.010394
sd(TRUE, x = x)
## [1] 1.010394
y <- c(x, NA)
sd(y)
## [1] NA
sd(y, na.rm = TRUE)
## [1] 1.010394
sd(y, TRUE)
## [1] 1.010394
sd(TRUE, x = y)
## [1] 1.010394
sd(y)
: by position (using the default value of na.rm = FALSE
)sd(y, na.rm = TRUE)
: by name (na.rm
) and by position (y
)sd(y, TRUE)
: by positionsd(TRUE, x = y)
: by name (y
) and by position (TRUE
)a
and b
.add <- function(a, b) {
a + b
}
add(1:3, 1:3)
## [1] 2 4 6
a
and b
, but that can also be called with just a
. In that case, it will calculate the addition of a
with itself.# using first argument as default value
add2 <- function(a, b = a) {
return(a + b)
}
add2(1:3, 1:3)
## [1] 2 4 6
add2(1:3)
## [1] 2 4 6
# using NULL as default value
add3 <- function(a, b = NULL) {
if (is.null(b)) {
b = a
}
return(a + b)
}
add3(2:4, 2:4)
## [1] 4 6 8
add3(2:4)
## [1] 4 6 8
x
and n
and multiplies the elements
of x
with n
. Check that x
is a numerical vector and n
is a single
numerical value, and throw an error if this is not the case
(hint: see ?stop
or ?stopifnot
for information of how to throw an error).# using if
multn <- function(x, n) {
if (!is.numeric(x) || !is.numeric(n) || length(n) != 1L) {
stop("Error in arguments")
}
return(x * n)
}
multn(1:3, 2)
## [1] 2 4 6
# multn(1:3, c(2,2))
# using stopifnot
multn2 <- function(x, n) {
stopifnot(exprs = {
is.numeric(x)
is.numeric(n)
length(n) == 1L
})
return(x * n)
}
multn2(2:4, 2)
## [1] 4 6 8
# multn2(1:3, c(2,2))
...
A function can only handle as many arguments as specified in formals
, unless
the function uses the special ...
argument (e.g. the plot
function).
The special argument ...
(dot-dot-dot, or ellipsis) can accept any number of
named/unnamed values. Here is an illustrative example:
count_args <- function(...) {
arglist <- list(...)
message("you called me using ", length(arglist), " arguments")
}
count_args()
## you called me using 0 arguments
count_args(1:10)
## you called me using 1 arguments
count_args(1, 2, 3, 4, 5)
## you called me using 5 arguments
The ellipsis argument will be assigned all arguments that could not be name- nor position-matched.
Exercise 14
How many arguments are “eaten up” by the ellipsis (...
) in the following examples:
count_args(1:2)
count_args(1, 2)
count_args(1:2, 3:4)
count_args(1:4)
count_args(1:2)
: 1 (a single vector with two elements)count_args(1, 2)
: 2 (two length-one vectors)count_args(1:2, 3:4)
: 2 (two length-two vectors)count_args(1:4)
: 1 (a single vector with four elements)do.call
We have seen above that the arguments of a function (e.g. returned by
formals()
) are, in fact, a list object, and that functions with an ellipsis
argument can handle any number of arguments. What can we do when we don’t know
in advance how many arguments we will want to pass to such a function?
It is possible to create an argument list as an R object, and then
execute the function call on it using do.call
; the first argument to do.call
is the function to be called, and the second is the list with the arguments.
For example:
inputList <- list(y, na.rm = TRUE)
do.call(sd, inputList)
## [1] 1.010394
r <- list(row1 = 1:3, row2 = 4:6)
# assuming we don't know how many elements there are in r
do.call(rbind, r)
## [,1] [,2] [,3]
## row1 1 2 3
## row2 4 5 6
An environment is a place in R’s memory where R objects can be stored, including values, datasets, functions, other environments, and more.
By default, when working from the command line, you are working within the global environment.
The scope of a variable defines where a variable “lives”. It is the region of code in which a variable is visible and can be used in expressions.
The arguments of a function have a scope that is limited to the body of that
function, e.g. myarg
does not exist in the global environment, but it does
exist inside of the environment of myfunction(myarg = 1)
:
# no myarg in the current environment
exists("myarg")
## [1] FALSE
# but myarg exists inside of myfunction
myfunction <- function(myarg = 1) {
return(exists("myarg"))
}
myfunction()
## [1] TRUE
environment
of the function itself (remember that each function gets its own environment,
one of the three parts of a function).Here is an illustration:
f <- function() {
x <- 1
y <- 2
c(x, y)
}
f()
## [1] 1 2
rm(f) # clean-up
If a variable is not defined inside a function, R will look one level up, in
its parent environment
:
x <- 2
g <- function() {
y <- 1
c(x, y)
}
g()
## [1] 2 1
rm(x, g) # clean-up
x <- 2
g <- function() {
y <- 1
c(x, y)
}
x <- 3
# g() uses the value of `x` at the time of the call to g()
g()
## [1] 3 1
rm(x, g)
()
) and skips non-function objects when searching:sum # the function
## function (..., na.rm = FALSE) .Primitive("sum")
sum <- 15
sum # the variable, not the function
## [1] 15
sum(c(1,2,3)) # still works
## [1] 6
sum
## [1] 15
help(sum) # still works
rm(sum) # removing the variable...
sum # ...unmasks the function
## function (..., na.rm = FALSE) .Primitive("sum")
With every function call, the environment of the function is freshly created, which means that functions have no automatic memory.
The function below checks if a
exists. If not, it creates a
with a value of
zero, otherwise it adds 1 to the already existing a
.
Can you guess what would the output of two subsequent calls of fresh
function?
fresh <- function() {
if (!exists("a")) {
a <- 0
} else {
a <- a + 1
}
a
}
exists("a") # check it does not exist globally
## [1] FALSE
fresh()
## [1] 0
fresh()
## [1] 0
As the variable a
does not exist at the time we call the function (neither in
the function itself nor in the global space), the fresh
function returns 0
in both calls. a
temporarily lives in the environment of the function fresh
,
but every time fresh
is called, it will get a newly created environment.
For more details see Hadley Wickham’s book Advanced R
sum(1,2,3,4,5)
?sum(1,2,3,4,5)
## [1] 15
mean(1,2,3,4,5)
?mean(1,2,3,4,5)
## [1] 1
# check the help page of both functions
?sum
?mean
=> We can see that sum uses the dot-dot-dot ...
argument, which means that all
arguments passed without a specific name are just summed-up together. Even if we
provide logical value! The only way how to tell that the logical is a na.rm
argument is to directly name it.
sum(1,2,3,4,5, TRUE)
## [1] 16
sum(1,2,3,4,5, na.rm = TRUE)
## [1] 15
=> The function mean
on the other hand calculates the mean only from first
argument. The others are passed to internal functions which do the calculation.
It also accepts a na.rm
argument. It can also accept trim
argument for
calculation of a trimmed mean.
Provide the numbers in a single vector.
sum(c(1,2,3,4,5))
## [1] 15
mean(c(1,2,3,4,5))
## [1] 3
C_to_F <- function(x) {
(x * 9/5) + 32
}
C_to_F(c(-20, 0, 100))
## [1] -4 32 212
gb_val <- 10
edit_global_val <- function(x) {
gb_val <- gb_val + x
return(gb_val)
}
# apply the edit...
edit_global_val(5)
# ...but did the global value actually change?
gb_val
## [1] 15
## [1] 10
=> The function returns the edited global value… But the actual global value
has not changed! This is because the moment we assigned a value to an object
called gb_val
inside the function using gb_val <- gb_val + x
, we actually
created a copy of gb_val
that is limited to the scope of the function. As
such, the assignment only changes the local copy, not the global one.
Is there a way around this? Beyond the scope of this session, but…
gb_val <- 10
edit_global_val <- function(x) {
# force the assignment to occur within the global environment
assign("gb_val", gb_val + x, envir = parent.env(environment()))
return(gb_val)
}
# apply the edit...
edit_global_val(5)
## [1] 15
# ...but did the global value actually change?
gb_val
## [1] 15
Loose rule: avoid editing values in other environments from within functions.