For-loops, `apply` family of functions

Author

Michal Kloc

Published

November 20, 2024

Goals of today’s lecture

Learn the concept of loops
Implementation of the for-loop
Implementation of the apply() family of functions
Apply the knowledge from previous lectures on a new dataset

Functions (short recap)

Calling a Function

Last time we wrote the function avg together. It has two arguments: x (without a default value) and na.rm (with a default value):

avg <- function(x, na.rm = FALSE) {
    if (na.rm) {
        x <- x[!is.na(x)]
    }
    out <- sum(x)/length(x)
    return(out)
}

We can call the function avg with the following arguments:

x must always be provided, because it has no default value.
na.rm has a default value (FALSE), so it does not always have to be specified. We only need to set it if we want to override the default.

Remember, we can list all arguments by calling formals() or args() with the function name as an argument.

args(avg)

function (x, na.rm = FALSE) 
NULL

formals(avg)

$x


$na.rm
[1] FALSE

Here we pass y as the argument x.

y <- c(1, 7, 10, 20, 50, 60, NA)
avg(y)

[1] NA

avg(y, na.rm = TRUE)

[1] 24.66667

In the first call, avg uses the default na.rm = FALSE and returns NA; in the second call we override it with na.rm = TRUE.
For better readability of the code, keep the arguments in a function call in the same order as in the function definition.
Naming the additional arguments explicitly often improves readability too, even though we do not have to name them.
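The point about argument order and naming can be illustrated with a small sketch. It repeats avg() so that it runs on its own; the result names (res.positional, res.named, res.reordered) are only illustrative.

```r
# avg() as defined above, repeated so the example is self-contained
avg <- function(x, na.rm = FALSE) {
    if (na.rm) {
        x <- x[!is.na(x)]
    }
    out <- sum(x)/length(x)
    return(out)
}

y <- c(1, 7, 10, 20, 50, 60, NA)

res.positional <- avg(y, TRUE)             # works, but what does TRUE mean here?
res.named <- avg(y, na.rm = TRUE)          # same call, easier to read
res.reordered <- avg(na.rm = TRUE, x = y)  # valid, but harder to follow

# all three calls give the same value
print(c(res.positional, res.named, res.reordered))
```

All three calls are equivalent for R; the second form is usually the easiest one for a human reader.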

Writing a Function

Start by writing the code that will become the body() of your future function, and think about the value it should return.
Define all arguments (with or without default values). Ideally the function is self-contained, i.e. it only refers to objects passed in as arguments.
Choose a meaningful name and add a description of what the function does.
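These steps might look as follows for a small, made-up task. The function value.range() is a hypothetical helper used only for illustration; it is not part of the lecture code.

```r
# Step 1: write the code for one concrete case first
x <- c(3, 8, NA, 5)
x.clean <- x[!is.na(x)]
max(x.clean) - min(x.clean)

# Steps 2 and 3: define the arguments, choose a name and describe the function
# value.range: difference between the largest and the smallest value of x
value.range <- function(x, na.rm = FALSE) {
    if (na.rm) {
        x <- x[!is.na(x)]
    }
    out <- max(x) - min(x)
    return(out)
}

value.range(x, na.rm = TRUE)
```

Note that the function body only uses x and na.rm, so it does not depend on any object defined outside of it.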

`for()` loop

So far, whenever we needed to perform the same operation on all elements of a vector, we tried to use a vectorized operation. This is also the most efficient way in R.

However, sometimes vectorization is not possible. In that case we simply iterate (walk) over the individual elements and apply the operation to each of them separately. For such cases we can use a for() loop. Note that we write it with parentheses: it is indeed a function, with the “walking rule” as an argument.

We recommend using for() loops only when the number of iterations is small (e.g. plotting). A for() loop can easily become inefficient. This is also why we learned about vectorized operations first.

We will also discuss some alternatives later today.
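To make the contrast concrete, here is a minimal sketch that squares the elements of a vector once with a vectorized operation and once with a for() loop. The object names are illustrative.

```r
x <- c(2, 4, 6, 8)

# vectorized: the operation is applied to all elements at once
squares.vectorized <- x^2

# for() loop: the same result, computed one element at a time
squares.loop <- numeric(length(x))  # pre-allocate the result vector
for (i in seq_along(x)) {
    squares.loop[i] <- x[i]^2
}

identical(squares.vectorized, squares.loop)
```

Both approaches give the same result here, but the vectorized version is shorter and should be preferred whenever it is available.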

Iterate using numerical index

There are multiple ways to implement for() loops. One is to iterate over the indices of an object.

object <- c("A", "B", "C")
# loop
for (i in seq_along(object)) {
    print(i)
    print(object[i])
}

[1] 1
[1] "A"
[1] 2
[1] "B"
[1] 3
[1] "C"

The syntax is indeed very similar to the definition of a function. In the parentheses we define the values over which the so-called ‘dummy’ variable i iterates, whereas in the curly brackets we write the code that is executed for each i. Here, the function seq_along() produces a sequence of integers matching the length of the object (starting from 1), thus assigning an index to each element. The body of the for() loop runs as many times as the object has elements.

We can get a similar vector of integers by calling 1:length().

# object of length 3
object <- c("A", "B", "C")
## seq_along
seq_along(object)

[1] 1 2 3

## length
1:length(object)

[1] 1 2 3

This works well in many cases, but it does not correctly handle objects of length 0:

# object of length 0
object0 <- character()
object0

character(0)

## seq_along
seq_along(object0)

integer(0)

## length
1:length(object0)

[1] 1 0

Therefore, always use the seq_along() function!
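The practical consequence shows up once the two sequences are used in a loop. The following sketch simply counts how often the loop body runs for an empty object; the counter names are illustrative.

```r
object0 <- character()  # object of length 0

# count how often the loop body runs with each approach
n.runs.length <- 0
for (i in 1:length(object0)) {
    n.runs.length <- n.runs.length + 1
}

n.runs.seq_along <- 0
for (i in seq_along(object0)) {
    n.runs.seq_along <- n.runs.seq_along + 1
}

print(n.runs.length)     # 2: the loop ran for i = 1 and i = 0
print(n.runs.seq_along)  # 0: the loop body was skipped
```

With 1:length() the body runs twice although there is nothing to iterate over, which typically leads to errors or NA values further down in the code.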

Iterate using element names

If the individual elements have names, we can iterate over the names and use them to subset the object and work with its individual elements.

We can also use the same names to store the results of the loop in a new object (this will be discussed later).

object <- c(x = "A", y = "B", z = "C")
object

  x   y   z 
"A" "B" "C"

names(object)

[1] "x" "y" "z"

for (j in names(object)) {
    print(j)
    print(object[j])
}

[1] "x"
  x 
"A" 
[1] "y"
  y 
"B" 
[1] "z"
  z 
"C"

Iterate directly over elements

In this case we use neither a numerical index nor element names; instead, the variable k directly takes the value of each element in turn.

object <- c(x = "A", y = "B", z = "C")
for (k in object) {
    print(k)
}

[1] "A"
[1] "B"
[1] "C"
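Iterating directly over the elements is convenient for reading values, but neither the position nor the name of the current element is available inside the loop. When results need to be stored, iterating over names (or indices) is therefore handier. A minimal sketch, with an illustrative result vector:

```r
object <- c(x = "A", y = "B", z = "C")

# pre-allocate a result vector and give it the same names as the input
result <- character(length(object))
names(result) <- names(object)

for (j in names(object)) {
    # object[[j]] extracts the bare element (without its name)
    result[j] <- tolower(object[[j]])
}
print(result)
```

The result keeps the names x, y and z, so each output value can be traced back to its input element.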

Exercise:

Using a for-loop, print the first 20 elements of the Fibonacci sequence

Solution

fib.i_1 <- 0  #0th order term
fib.i <- 1  #1st term
for (i in seq(2, 20)) {
    print(paste0("the ", i, "th value of the Fibonacci sequence is:"))
    fib.num <- fib.i + fib.i_1  #Fibonacci step
    print(fib.num)
    # update which values I will be adding in the next step
    fib.i_1 <- fib.i
    fib.i <- fib.num
}

[1] "the 2th value of the Fibonacci sequence is:"
[1] 1
[1] "the 3th value of the Fibonacci sequence is:"
[1] 2
[1] "the 4th value of the Fibonacci sequence is:"
[1] 3
[1] "the 5th value of the Fibonacci sequence is:"
[1] 5
[1] "the 6th value of the Fibonacci sequence is:"
[1] 8
[1] "the 7th value of the Fibonacci sequence is:"
[1] 13
[1] "the 8th value of the Fibonacci sequence is:"
[1] 21
[1] "the 9th value of the Fibonacci sequence is:"
[1] 34
[1] "the 10th value of the Fibonacci sequence is:"
[1] 55
[1] "the 11th value of the Fibonacci sequence is:"
[1] 89
[1] "the 12th value of the Fibonacci sequence is:"
[1] 144
[1] "the 13th value of the Fibonacci sequence is:"
[1] 233
[1] "the 14th value of the Fibonacci sequence is:"
[1] 377
[1] "the 15th value of the Fibonacci sequence is:"
[1] 610
[1] "the 16th value of the Fibonacci sequence is:"
[1] 987
[1] "the 17th value of the Fibonacci sequence is:"
[1] 1597
[1] "the 18th value of the Fibonacci sequence is:"
[1] 2584
[1] "the 19th value of the Fibonacci sequence is:"
[1] 4181
[1] "the 20th value of the Fibonacci sequence is:"
[1] 6765
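An alternative sketch stores the sequence in a pre-allocated vector instead of printing each value. It assumes that the sequence starts with 0 and 1, so the 20 stored elements are the terms 0 to 19; other conventions start with 1, 1.

```r
n <- 20
fib <- numeric(n)  # pre-allocate the result vector
fib[1] <- 0        # assumption: the sequence starts with 0, 1
fib[2] <- 1
# 3:n is only safe because n is fixed to a value of at least 3 here
for (i in 3:n) {
    fib[i] <- fib[i - 1] + fib[i - 2]
}
print(fib)
```

Keeping the values in a vector makes them available for further computations after the loop has finished.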

Example: Using `for()` loop to read multiple files

From now on, we will work with a simplified dataset originally available from Kaggle web site (see here for more details). The data contain actigraph recordings of motor activity of patients diagnosed with depression (condition group) as well as healthy individuals (control group). The Kaggle database serves as a source of high-quality data for machine learning. In the case of this dataset, it is believed that these data can be predictive of the depression status of a patient. Such a task is, of course, way beyond our scope. We will, nevertheless, explore the dataset together and compute some basic statistics.

First, create a new empty R script in your working directory. We will start by importing the tidyverse package and we will also create a new directory where we will store the figures.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

dir.create("figures_lecture10", showWarnings = FALSE)  #create a folder to store figures

Let’s now download the data and unzip them into the 10_data folder.

download.file(url = "https://ivanek.github.io/introductionToR/data/Kaggle_mentalHealthDataset.zip",
    destfile = "Kaggle_mentalHealthDataset.zip")
# the file name must match destfile exactly (file names are case-sensitive on many systems)
unzip("Kaggle_mentalHealthDataset.zip", exdir = "10_data")  # unzip your file

The actigraph data are now in the folder 10_data/Kaggle_MentalHealthDataset, stored as a separate csv (comma-separated values) file for each patient. Reading multiple separate files into R is an example of an operation that cannot be vectorized, so we have to load these files one by one in a loop. First, let’s have a look at some of the files to see their structure. The csv files can be opened in standard spreadsheet editors, but we can conveniently use R. We start by storing the names of the individual files.

# get the vector of the file names
filenames.patients <- list.files("10_data/Kaggle_MentalHealthDataset")
print(filenames.patients)

 [1] "condition_1.csv"  "condition_10.csv" "condition_11.csv" "condition_12.csv"
 [5] "condition_13.csv" "condition_14.csv" "condition_15.csv" "condition_16.csv"
 [9] "condition_17.csv" "condition_18.csv" "condition_19.csv" "condition_2.csv" 
[13] "condition_20.csv" "condition_21.csv" "condition_22.csv" "condition_23.csv"
[17] "condition_3.csv"  "condition_4.csv"  "condition_5.csv"  "condition_6.csv" 
[21] "condition_7.csv"  "condition_8.csv"  "condition_9.csv"  "control_1.csv"   
[25] "control_10.csv"   "control_11.csv"   "control_12.csv"   "control_13.csv"  
[29] "control_14.csv"   "control_15.csv"   "control_16.csv"   "control_17.csv"  
[33] "control_18.csv"   "control_19.csv"   "control_2.csv"    "control_20.csv"  
[37] "control_3.csv"    "control_4.csv"    "control_5.csv"    "control_6.csv"   
[41] "control_7.csv"    "control_8.csv"    "control_9.csv"

They can be imported as tibbles using the read_csv() function:

# explore some of the files
patient.23 <- read_csv(file.path("10_data/Kaggle_MentalHealthDataset", filenames.patients[23]),
    show_col_types = FALSE)  #show_col_types = FALSE just to remove some unnecessary reporting
print(patient.23)

# A tibble: 20,318 × 2
   timestamp activity
       <dbl>    <dbl>
 1         1        5
 2         2        5
 3         3        5
 4         4        5
 5         5        5
 6         6       17
 7         7       15
 8         8        5
 9         9        5
10        10        5
# ℹ 20,308 more rows

patient.15 <- read_csv(file.path("10_data/Kaggle_MentalHealthDataset", filenames.patients[15]),
    show_col_types = FALSE)
print(patient.15)

# A tibble: 21,772 × 2
   timestamp activity
       <dbl>    <dbl>
 1         1      111
 2         2       66
 3         3      157
 4         4       73
 5         5      142
 6         6      131
 7         7      137
 8         8        5
 9         9      107
10        10      223
# ℹ 21,762 more rows

The structure is fairly simple. For each patient the tibble contains two columns. The first one (timestamp) shows the timepoints (in minutes from the beginning of the measurement) and the column activity shows the respective measured value. What differs between the patients is the number of rows which reflects the fact that the patients carried the actigraph for different periods of time. This is important for correct reading of these data into R.

We will use the following strategy:

Since all the records share the same column structure, we will read them one by one and then “stack” them on top of each other, so that in the end they are stored as one very long tibble. The functions read_csv() and bind_rows() will come in handy here.
When stacking the individual tibbles like this, we need to keep track of which patient each row belongs to. Thus, we will use mutate() to add a patient.id column to each tibble before binding them together.

# pre-allocate an empty list
tbl.depression <- vector(mode = "list", length = length(filenames.patients))
# alternative way of pre-allocation tbl.depression <- list(NULL)
# length(tbl.depression) <- length(filenames.patients)
for (i in seq_along(filenames.patients)) {
    patient.i <- read_csv(file.path("10_data/Kaggle_MentalHealthDataset", filenames.patients[i]),
        show_col_types = FALSE)
    patient.i <- patient.i |>
        mutate(patient.id = str_remove(filenames.patients[i], "\\.csv$"), .before = timestamp)  #add column with the patient's id as the first column; the dot is escaped so it is matched literally
    tbl.depression[[i]] <- patient.i  #remember, to access a single list element, you need double brackets
}
tbl.depression <- bind_rows(tbl.depression)  #glue the tibbles on top of each other

Now all the data are stored in tbl.depression. We can check whether all the patients have been included:

# distinct() generates a tibble with one column, pull() turns this column into
# a simple char vector
patients <- tbl.depression |>
    distinct(patient.id) |>
    pull()
print(patients)  #we see all the patients

 [1] "condition_1"  "condition_10" "condition_11" "condition_12" "condition_13"
 [6] "condition_14" "condition_15" "condition_16" "condition_17" "condition_18"
[11] "condition_19" "condition_2"  "condition_20" "condition_21" "condition_22"
[16] "condition_23" "condition_3"  "condition_4"  "condition_5"  "condition_6" 
[21] "condition_7"  "condition_8"  "condition_9"  "control_1"    "control_10"  
[26] "control_11"   "control_12"   "control_13"   "control_14"   "control_15"  
[31] "control_16"   "control_17"   "control_18"   "control_19"   "control_2"   
[36] "control_20"   "control_3"    "control_4"    "control_5"    "control_6"   
[41] "control_7"    "control_8"    "control_9"
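To experiment with the read, annotate and stack pattern without downloading the dataset, the following self-contained sketch applies the same idea to two tiny csv files written to a temporary folder. It uses base R stand-ins (read.csv(), cbind(), do.call(rbind, ...)) instead of the tidyverse functions from the lecture, and all file and object names are made up for illustration.

```r
# create two small csv files in a temporary folder (toy data, not the real dataset)
toy.dir <- file.path(tempdir(), "toy_actigraph")
dir.create(toy.dir, showWarnings = FALSE)
write.csv(data.frame(timestamp = 1:3, activity = c(5, 17, 15)),
    file.path(toy.dir, "condition_1.csv"), row.names = FALSE)
write.csv(data.frame(timestamp = 1:2, activity = c(111, 66)),
    file.path(toy.dir, "control_1.csv"), row.names = FALSE)

toy.files <- list.files(toy.dir, pattern = "\\.csv$")
toy.list <- vector(mode = "list", length = length(toy.files))  # pre-allocate
for (i in seq_along(toy.files)) {
    toy.i <- read.csv(file.path(toy.dir, toy.files[i]))
    # add the id (file name without the suffix) as the first column
    toy.i <- cbind(patient.id = sub("\\.csv$", "", toy.files[i]), toy.i)
    toy.list[[i]] <- toy.i
}
toy.all <- do.call(rbind, toy.list)  # stack the data frames on top of each other
print(toy.all)
```

The stacked result has one row per measurement and an id column that records which file each row came from, just like tbl.depression.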

We have the data in the long format (indeed the tbl.depression has 1225245 rows!). As discussed in previous lectures, this is actually a useful representation for plotting with ggplot. Let’s visually explore the data for a few patients

#select patients to plot
plot.patients <- c("condition_1","condition_12","condition_6", "control_1", "control_2", "control_3")
#we have the long format (good for plotting)


#pdf("figures_lecture10/actigraph_example.pdf", width = 4, height = 6)
tbl.depression |>
  filter(patient.id %in% plot.patients) |> #filter only the rows with patient.ids in our selection
  ggplot(aes(x = timestamp, y = activity)) +
  geom_line() + 
  facet_wrap(~ patient.id, ncol = 1)

#dev.off()

We can observe several things. First, for some patients (control_1 and control_3) the device recorded a very long sequence but apparently was not worn the whole time. Second, there is clear periodic behavior for each individual, which reflects circadian rhythms. Indeed, between timestamps 0 and 20000 minutes we see roughly 14 peaks corresponding to 14 days. And third, at first glance there is no noticeable difference between the patients and the controls (that is why it is believed that an ML approach could in principle help extract some “hidden” features with predictive capacity).

Let us now transform the data into the wide format, meaning that the recorded data over time will be stored in rows and individual patients will be the columns of the tibble.

# to wide format (timepoints in rows, patients in columns)
tbl.depression.wd <- tbl.depression |>
    pivot_wider(names_from = patient.id, values_from = activity)
print(dim(tbl.depression.wd))

[1] 65407    44

This format is more “explanatory” since in the rows we can directly see the measured values from all the samples in the same timestamp. But remember, we already saw that different samples have measurements for different time intervals. So in tbl.depression.wd there will inevitably be missing values (NAs), and we should keep this in mind.

any(is.na(tbl.depression))  #FALSE

[1] FALSE

any(is.na(tbl.depression.wd))  #TRUE

[1] TRUE

Let’s ask a simple question: Can a maximal measured actigraph value be linked to the depression status? That means, in the tibble tbl.depression.wd we want to apply the max() function on each column corresponding to patients/controls, store these values and compare them between the groups.

# solution using for loop
max.forloop <- numeric(length = length(patients))  #prepare the vector for collecting the results
names(max.forloop) <- patients
for (i in seq_along(patients)) {
    max.forloop[i] <- tbl.depression.wd |>
        select(all_of(patients[i])) |>
        na.omit() |>
        max()
}
print(max.forloop)

 condition_1 condition_10 condition_11 condition_12 condition_13 condition_14 
        3526         4935         3506         3622         4859         2929 
condition_15 condition_16 condition_17 condition_18 condition_19  condition_2 
        6825         6964         3195         1355         8000         4228 
condition_20 condition_21 condition_22 condition_23  condition_3  condition_4 
        2150         3869         3222         5931         3847         6776 
 condition_5  condition_6  condition_7  condition_8  condition_9    control_1 
        4609         4129         4778         5437         2821         6117 
  control_10   control_11   control_12   control_13   control_14   control_15 
        4245         6752         2682         3180         5931         4935 
  control_16   control_17   control_18   control_19    control_2   control_20 
        8000         4658         4517         2918         4927         8000 
   control_3    control_4    control_5    control_6    control_7    control_8 
        3974         4354         3314         7431         8000         6321 
   control_9 
        2674

Replacing `for()` loop with `apply()` function

The for-loop in this case can be replaced by the more compact apply() function, which has the following structure:

apply(X, MARGIN, FUN)

X is an array or matrix (or another rectangular, matrix-like object; a tibble or data frame is converted internally via as.matrix()).
MARGIN is either 1 (apply the function to each row of X) or 2 (to each column of X).
FUN is the function to be applied.
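The effect of MARGIN is easiest to see on a small matrix. The following sketch uses a toy matrix and illustrative object names.

```r
# toy matrix with 2 rows and 3 columns
m <- matrix(1:6, nrow = 2)
print(m)

row.sums <- apply(m, MARGIN = 1, FUN = sum)  # one value per row
col.sums <- apply(m, MARGIN = 2, FUN = sum)  # one value per column
print(row.sums)
print(col.sums)
```

The length of the result matches the number of rows for MARGIN = 1 and the number of columns for MARGIN = 2.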

The idea behind this function exactly matches our need to apply the max() function over the patient/control columns in tbl.depression.wd. In the first step, we remove the first column, timestamp, via select(-timestamp); this column does not represent any patient but rather labels the rows. Next, we apply FUN = max over the columns, which means MARGIN = 2. Since we know that the data contain NA values, we have to pass na.rm = TRUE to the function.

# using apply function
max.apply <- tbl.depression.wd |>
    select(-timestamp) |>
    apply(MARGIN = 2, FUN = max, na.rm = TRUE)
# you can do the same thing in multiple ways
identical(max.forloop, max.apply)  #TRUE

[1] TRUE

Note that na.rm = TRUE is not an argument of the apply() function itself but rather an argument of max(). This shows how additional arguments needed by FUN are passed through apply().
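The following sketch shows the difference on a toy matrix with one missing value. The column names patient_a and patient_b are made up for illustration.

```r
# toy matrix with one missing value; columns play the role of patients
m <- matrix(c(1, NA, 3, 4), nrow = 2,
    dimnames = list(NULL, c("patient_a", "patient_b")))

max.default <- apply(m, MARGIN = 2, FUN = max)             # na.rm not set
max.narm <- apply(m, MARGIN = 2, FUN = max, na.rm = TRUE)  # na.rm handed on to max()
print(max.default)
print(max.narm)
```

Without na.rm = TRUE, the column containing the NA yields NA; with it, the missing value is ignored.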

Exercise:

Can we compute the max value from the original long-format tbl.depression using group_by() and summarise()?

Solution

max.summarise <- tbl.depression |>
    group_by(patient.id) |>
    summarise(max.value = max(activity))
# results in tibble format, but essentially the same
print(max.summarise)

# A tibble: 43 × 2
   patient.id   max.value
   <chr>            <dbl>
 1 condition_1       3526
 2 condition_10      4935
 3 condition_11      3506
 4 condition_12      3622
 5 condition_13      4859
 6 condition_14      2929
 7 condition_15      6825
 8 condition_16      6964
 9 condition_17      3195
10 condition_18      1355
# ℹ 33 more rows

Let’s plot the results (revisit the lecture about plotting if needed).

# plot the maximal values for patients and controls
tbl.maxRes <- tibble(patient.id = names(max.apply), max.value = unname(max.apply))
# alternative: enframe() elegantly builds a tibble from a named vector
# tbl.maxRes <- enframe(max.apply, name = 'patient.id', value = 'max.value')
# we will add a column with the diagnosis for each patient (depression or control)
# the first 23 rows are depression patients, the remaining 20 are controls
diagnosis <- c(rep("depression", 23), rep("control", 20))
tbl.maxRes <- tbl.maxRes |>
    mutate(diagnosis = diagnosis)
# adaptation of the boxplot from the lecture about plotting
# pdf('figures_lecture10/actigraph_max_detected.pdf', width = 5, height = 5)
tbl.maxRes |>
    ggplot(aes(x = diagnosis, y = max.value)) + geom_boxplot(outlier.size = -1) +
    theme_minimal() + theme(axis.text = element_text(size = 15), axis.title = element_text(size = 18),
    plot.title = element_text(size = 16, face = "bold")) + ggtitle("max. recorded actigraph value") +
    geom_jitter(alpha = 0.5, size = 2, height = 0, width = 0.25)

# dev.off()

As expected, the maximal detected value alone does not seem to differ between the groups (you will learn more about how to do statistical tests in R in the upcoming lectures). Let’s say we want to compute some other descriptive statistics in both groups, such as the mean, the median and the 3rd quartile. To do that, we will write a simple function:

descriptive.stat <- function(x, na.rm = TRUE) {
    if (na.rm) {
        x <- na.omit(x)
    }
    stats <- c(mean(x), median(x), quantile(x, 0.75))
    names(stats) <- c("mean", "median", "3rd quartile")
    return(stats)
}

This function can then be called on each patient with apply(). Before we do that, remember that in some cases data were still being collected while the device was no longer worn properly (check the figure with examples of actigraph recordings that we generated). Thus, we will truncate the dataset to the first 20000 minutes.

# restrict the data to the first 20000 timesteps (select the first 20000 rows)
tbl.depression.wd.trunc <- tbl.depression.wd |>
    head(20000)

Let’s now apply our function descriptive.stat() on the truncated data

stat.apply <- tbl.depression.wd.trunc |>
    select(-timestamp) |>
    apply(MARGIN = 2, FUN = descriptive.stat)
print(stat.apply)

             condition_1 condition_10 condition_11 condition_12 condition_13
mean            152.0564     307.5275     139.6669     166.5213     258.9798
median           18.0000     138.0000       0.0000      29.0000     101.0000
3rd quartile    197.0000     468.0000     131.0000     219.0000     407.0000
             condition_14 condition_15 condition_16 condition_17 condition_18
mean             78.41665     115.3727     245.6214      86.4196     74.40955
median            0.00000       3.0000      47.0000      15.0000      8.00000
3rd quartile     49.00000     140.0000     373.0000     106.0000     93.00000
             condition_19 condition_2 condition_20 condition_21 condition_22
mean             163.4425    207.7044      68.2422      78.4702     165.2143
median            26.0000     32.0000       0.0000       0.0000      21.0000
3rd quartile     195.0000    268.0000      45.0000      26.0000     275.0000
             condition_23 condition_3 condition_4 condition_5 condition_6
mean             252.6683    264.8822    293.8063    173.9925    203.8545
median            50.0000     52.0000     87.0000     36.0000     21.0000
3rd quartile     394.0000    394.0000    421.0000    210.0000    294.0000
             condition_7 condition_8 condition_9 control_1 control_10
mean            285.4507    184.9345    179.8345   232.176   288.5357
median           32.0000      3.0000     35.0000    60.000    52.0000
3rd quartile    381.0000    172.0000    231.0000   334.000   454.0000
             control_11 control_12 control_13 control_14 control_15 control_16
mean           192.8289   151.0216   181.8055   375.4785   296.6492   244.6996
median          32.0000    18.0000    44.0000    93.0000   103.0000   107.0000
3rd quartile   242.0000   227.0000   247.0000   564.0000   438.0000   360.0000
             control_17 control_18 control_19 control_2 control_20 control_3
mean           251.5357   296.6109   225.4137  380.3896   359.0317   249.135
median          98.0000   134.0000    62.0000  131.0000   166.0000   103.000
3rd quartile   373.0000   425.0000   327.0000  600.0000   587.0000   372.000
             control_4 control_5 control_6 control_7 control_8 control_9
mean          206.4836  308.9847  349.2488  406.4155  406.8263   139.374
median         30.0000  143.0000  131.0000  148.0000  250.0000    10.000
3rd quartile  231.0000  500.0000  483.0000  568.0000  644.0000   152.000

The function can also be defined directly within the apply() call:

stat.apply.allinone <- tbl.depression.wd.trunc |>
    select(-timestamp) |>
    apply(MARGIN = 2, function(x) {
        x <- na.omit(x)
        stats <- c(mean(x), median(x), quantile(x, 0.75))
        names(stats) <- c("mean", "median", "3rd quartile")
        return(stats)
    })
identical(stat.apply.allinone, stat.apply)

[1] TRUE

One-dimensional alternatives: `lapply()`, `sapply()`, `vapply()`

If we want to call a function over a one-dimensional object (list, vector), there are alternatives to the apply() function with the following syntax:

lapply(X, FUN)
sapply(X, FUN)
vapply(X, FUN, FUN.VALUE)

X is a vector or list
FUN is the function to be applied
FUN.VALUE (only vapply()) is a template for the expected output per element

These three functions work in a similar way: they traverse a set of data such as a list or a vector and call the specified function on each element. They differ in the output format, though.

lapply() returns a list.
sapply() returns a simplified version of the output, i.e. a vector or matrix, if the output for every element has the same length. If the lengths are not identical, it returns a list instead. This can cause problems if the subsequent code expects a vector or matrix as input.
vapply() also returns a simplified version of the output, but it requires a specification of the output per element. This extra work pays off: we can be sure what output type is returned. Therefore, using vapply() is a much safer approach than using sapply(), and it is often even faster.

The recommendation is to use vapply() and define the expected output; otherwise use lapply().

Let’s check their functionality on a simple (and admittedly artificial) example. We will extract the vector of the computed means for each patient/control, square each element using the l/s/v-apply() functions, and compare the output.

patient.means <- stat.apply["mean", ]
lapply(patient.means, function(x) x^2)  #returns a list

$condition_1
[1] 23121.15

$condition_10
[1] 94573.16

$condition_11
[1] 19506.84

$condition_12
[1] 27729.34

$condition_13
[1] 67070.54

$condition_14
[1] 6149.171

$condition_15
[1] 13310.86

$condition_16
[1] 60329.85

$condition_17
[1] 7468.347

$condition_18
[1] 5536.781

$condition_19
[1] 26713.45

$condition_2
[1] 43141.12

$condition_20
[1] 4656.998

$condition_21
[1] 6157.572

$condition_22
[1] 27295.76

$condition_23
[1] 63841.24

$condition_3
[1] 70162.55

$condition_4
[1] 86322.14

$condition_5
[1] 30273.41

$condition_6
[1] 41556.66

$condition_7
[1] 81482.13

$condition_8
[1] 34200.75

$condition_9
[1] 32340.43

$control_1
[1] 53905.69

$control_10
[1] 83252.85

$control_11
[1] 37183

$control_12
[1] 22807.54

$control_13
[1] 33053.24

$control_14
[1] 140984.1

$control_15
[1] 88000.75

$control_16
[1] 59877.92

$control_17
[1] 63270.18

$control_18
[1] 87978.03

$control_19
[1] 50811.36

$control_2
[1] 144696.2

$control_20
[1] 128903.8

$control_3
[1] 62068.25

$control_4
[1] 42635.48

$control_5
[1] 95471.54

$control_6
[1] 121974.7

$control_7
[1] 165173.5

$control_8
[1] 165507.6

$control_9
[1] 19425.11

sapply(patient.means, function(x) x^2)  #simplified output, vector in this case

 condition_1 condition_10 condition_11 condition_12 condition_13 condition_14 
   23121.149    94573.163    19506.843    27729.343    67070.537     6149.171 
condition_15 condition_16 condition_17 condition_18 condition_19  condition_2 
   13310.860    60329.848     7468.347     5536.781    26713.451    43141.118 
condition_20 condition_21 condition_22 condition_23  condition_3  condition_4 
    4656.998     6157.572    27295.765    63841.245    70162.553    86322.142 
 condition_5  condition_6  condition_7  condition_8  condition_9    control_1 
   30273.407    41556.657    81482.131    34200.752    32340.429    53905.695 
  control_10   control_11   control_12   control_13   control_14   control_15 
   83252.850    37183.004    22807.539    33053.240   140984.142    88000.748 
  control_16   control_17   control_18   control_19    control_2   control_20 
   59877.919    63270.183    87978.026    50811.359   144696.248   128903.798 
   control_3    control_4    control_5    control_6    control_7    control_8 
   62068.248    42635.477    95471.545   121974.724   165173.518   165507.598 
   control_9 
   19425.112

vapply(patient.means, function(x) x^2, FUN.VALUE = numeric(1))

 condition_1 condition_10 condition_11 condition_12 condition_13 condition_14 
   23121.149    94573.163    19506.843    27729.343    67070.537     6149.171 
condition_15 condition_16 condition_17 condition_18 condition_19  condition_2 
   13310.860    60329.848     7468.347     5536.781    26713.451    43141.118 
condition_20 condition_21 condition_22 condition_23  condition_3  condition_4 
    4656.998     6157.572    27295.765    63841.245    70162.553    86322.142 
 condition_5  condition_6  condition_7  condition_8  condition_9    control_1 
   30273.407    41556.657    81482.131    34200.752    32340.429    53905.695 
  control_10   control_11   control_12   control_13   control_14   control_15 
   83252.850    37183.004    22807.539    33053.240   140984.142    88000.748 
  control_16   control_17   control_18   control_19    control_2   control_20 
   59877.919    63270.183    87978.026    50811.359   144696.248   128903.798 
   control_3    control_4    control_5    control_6    control_7    control_8 
   62068.248    42635.477    95471.545   121974.724   165173.518   165507.598 
   control_9 
   19425.112
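In the example above, sapply() and vapply() return the same vector, so it does not show why vapply() is considered safer. The following sketch uses a toy list and a hypothetical helper, keep.large(), whose output length differs between elements.

```r
values <- list(a = 1:3, b = 4:6, c = 7:8)

# same output length for every element: sapply() simplifies to a vector
n.elements <- sapply(values, length)
print(n.elements)

# output length differs between elements: sapply() silently returns a list
keep.large <- function(v) v[v > 2]  # hypothetical helper for illustration
res.sapply <- sapply(values, keep.large)
print(class(res.sapply))

# vapply() stops with an error instead, because we promised one number per element
res.vapply <- tryCatch(
    vapply(values, keep.large, FUN.VALUE = numeric(1)),
    error = function(e) e
)
print(inherits(res.vapply, "error"))
```

With sapply() the unexpected list would only cause trouble later in the code, whereas vapply() reports the problem right where it occurs.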

Exercise:

In this simple task, the easiest way would be to just type patient.means^2. A slightly more practical task is to rewrite the original for-loop from today’s lecture, which reads in the data files, using the lapply() function. Can you manage?

Solution

dataFiles <- lapply(file.path("10_data/Kaggle_MentalHealthDataset", filenames.patients),
    function(x) {
        data <- read_csv(x, show_col_types = FALSE)
        # get the name of the sample, first remove the path
        name <- basename(x)
        # now remove the suffix (the dot is escaped so it is matched literally)
        name <- str_remove(name, "\\.csv$")
        # return the tibble with the patient's id as the first column
        data |>
            mutate(patient.id = name, .before = timestamp)
    })
# stack the data on top of each other
dataFiles <- bind_rows(dataFiles)
identical(tbl.depression, dataFiles)

[1] TRUE

Session info

sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
 [5] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
 [9] ggplot2_3.5.1   tidyverse_2.0.0 png_0.1-8       knitr_1.49     

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        generics_0.1.3    stringi_1.8.4     hms_1.1.3        
 [5] digest_0.6.37     magrittr_2.0.3    evaluate_1.0.1    timechange_0.3.0 
 [9] fastmap_1.2.0     jsonlite_1.8.9    formatR_1.14      fansi_1.0.6      
[13] scales_1.3.0      codetools_0.2-20  cli_3.6.3         rlang_1.1.4      
[17] crayon_1.5.3      bit64_4.5.2       munsell_0.5.1     withr_3.0.2      
[21] yaml_2.3.10       tools_4.3.2       parallel_4.3.2    tzdb_0.4.0       
[25] colorspace_2.1-1  vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4  
[29] htmlwidgets_1.6.4 bit_4.5.0         vroom_1.6.5       pkgconfig_2.0.3  
[33] pillar_1.9.0      gtable_0.3.6      glue_1.8.0        xfun_0.49        
[37] tidyselect_1.2.1  rstudioapi_0.17.1 farver_2.1.2      htmltools_0.5.8.1
[41] rmarkdown_2.29    labeling_0.4.3    compiler_4.3.2
