For-loops, apply family of functions
Goals of today’s lecture
- Learn the concept of loops
- Implementation of the for-loop
- Implementation of the apply() family of functions
- Apply the knowledge from previous lectures on a new dataset
Functions (short recap)
Calling a Function
Last time we wrote together the function avg with two arguments (x without a default value, na.rm with a default value):
avg <- function(x, na.rm = FALSE) {
    if (na.rm) {
        x <- x[!is.na(x)]
    }
    out <- sum(x)/length(x)
    return(out)
}
We can call the function avg with the following arguments:
- x: we need to provide this argument, since it does not have a default value
- na.rm: it has a default value (FALSE) and therefore does not always have to be specified. We only have to set it if we want to override the default.
Remember, we can list all arguments by calling formals() or args() with the function name as an argument.
args(avg)
function (x, na.rm = FALSE)
NULL
formals(avg)
$x
$na.rm
[1] FALSE
- Here we provide y as the argument x.
y <- c(1, 7, 10, 20, 50, 60, NA)
avg(y)
[1] NA
avg(y, na.rm = TRUE)
[1] 24.66667
- In the first call, the function avg uses the default na.rm = FALSE and returns NA; in the second call we override it with na.rm = TRUE.
- For better readability of the code, keep the order of arguments in the function call as they are defined in the function.
Naming additional arguments explicitly often improves readability too, even though we do not have to name them.
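As a minimal sketch of the readability point above, the following calls all give the same result; only the way the arguments are matched differs. The object names res_positional, res_named and res_swapped are introduced here purely for illustration.

```r
# avg() as defined in this lecture
avg <- function(x, na.rm = FALSE) {
    if (na.rm) {
        x <- x[!is.na(x)]
    }
    sum(x)/length(x)
}

y <- c(1, 7, 10, 20, 50, 60, NA)

# positional matching: works, but the meaning of TRUE is not obvious
res_positional <- avg(y, TRUE)

# named matching in the order of the definition: easiest to read
res_named <- avg(x = y, na.rm = TRUE)

# named arguments may come in any order (legal, but harder to read)
res_swapped <- avg(na.rm = TRUE, x = y)

print(c(res_positional, res_named, res_swapped))
```

All three calls are equivalent for R; the second form is simply the one a reader can understand without looking up the function definition.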
Writing a Function
- Start by writing the code, the body() of your future function, and think about the returned value
- Define all arguments (without/with default value); the function should ideally be self-contained (only refer to objects defined as arguments)
- Choose a meaningful name and add a description of what the function does.
for() loop
So far, whenever we needed to perform the same operation on all elements of a vector, we tried to do it using a vectorized operation. This is also the most efficient way in R.
However, sometimes vectorization is not possible. Then we simply iterate (walk) over the individual elements and apply the function to each of them separately. For such cases, we can use the for() loop. Note that we write it with brackets: it is indeed a function with the “walking rule” as an argument.
We recommend using for() loops only for cases with a small number of iterations (e.g. plotting). A for() loop can easily become inefficient. This is also the reason why we learned about vectorized operations first.
We will also discuss some alternatives later today.
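To make the contrast concrete, here is a small sketch that computes the same result once with a vectorized call and once with a for() loop. The vector x and the result names are made up for this illustration; sqrt() merely stands in for any vectorized function.

```r
x <- c(1, 4, 9, 16)

# vectorized: a single call handles all elements at once
root_vec <- sqrt(x)

# for() loop: pre-allocate the result, then fill it element by element
root_loop <- numeric(length(x))
for (i in seq_along(x)) {
    root_loop[i] <- sqrt(x[i])
}

# both approaches give the same values
print(identical(root_vec, root_loop))
```

The loop needs three extra lines and an explicit result vector, which is why the vectorized form is preferred whenever it is available.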
Iterate using numerical index
There are multiple ways to implement for() loops. One is to iterate over the indices of an object.
object <- c("A", "B", "C")
# loop
for (i in seq_along(object)) {
print(i)
print(object[i])
}
[1] 1
[1] "A"
[1] 2
[1] "B"
[1] 3
[1] "C"
The syntax is indeed very similar to the definition of a function. In the round brackets we define over which values the so-called ‘dummy’ variable i iterates, whereas in the curly brackets we write the code that is executed for each i. Here, the function seq_along() produces a sequence of integers matching the length of the object (starting from 1), thus assigning an index to each element. The body of the for() loop runs as many times as the object has elements.
We can get a similar vector of integers by calling 1:length().
# object of length 3
object <- c("A", "B", "C")
## seq_along
seq_along(object)
[1] 1 2 3
## length
1:length(object)
[1] 1 2 3
This works well in many cases; however, it does not correctly handle an object of length 0:
# object of length 0
object0 <- character()
object0
character(0)
## seq_along
seq_along(object0)
integer(0)
## length
1:length(object0)
[1] 1 0
Therefore, always use the seq_along() function!
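The consequence for an actual loop can be sketched as follows. The counters n_seq_along and n_colon are hypothetical helper variables that only count how often the loop body runs.

```r
object0 <- character()

# with seq_along() the body is never executed for an empty object
n_seq_along <- 0
for (i in seq_along(object0)) {
    n_seq_along <- n_seq_along + 1
}

# with 1:length() the body runs for i = 1 and i = 0,
# although there is nothing to iterate over
n_colon <- 0
for (i in 1:length(object0)) {
    n_colon <- n_colon + 1
}

print(c(seq_along = n_seq_along, colon = n_colon))
```

In real code the two unwanted iterations would typically index non-existing elements and produce NA values or errors that are hard to trace back.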
Iterate using element names
If the individual elements have names, we can iterate over the names, use them to subset the object, and work with its individual elements.
We can also use the same names to store the results of the loop in a new object (will be discussed later).
object <- c(x = "A", y = "B", z = "C")
object
x y z
"A" "B" "C"
names(object)
[1] "x" "y" "z"
for (j in names(object)) {
print(j)
print(object[j])
}
[1] "x"
x
"A"
[1] "y"
y
"B"
[1] "z"
z
"C"
Iterate directly over elements
In this case we use neither a numerical index nor element names; instead, k iterates directly over each element of the object.
object <- c(x = "A", y = "B", z = "C")
for (k in object) {
print(k)
}
[1] "A"
[1] "B"
[1] "C"
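Iterating directly over elements is the shortest form, but it leaves no handle for storing results. Iterating over names (or indices) does. The following is a minimal sketch of storing loop results by name; the vector result is introduced only for this illustration and tolower() is a placeholder for any per-element operation.

```r
object <- c(x = "A", y = "B", z = "C")

# pre-allocate the result and give it the same names as the input
result <- character(length(object))
names(result) <- names(object)

for (j in names(object)) {
    # object[[j]] extracts the bare element (without its name)
    result[j] <- tolower(object[[j]])
}

print(result)
```

Because the result carries the same names as the input, each value can later be matched back to the element it was computed from.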
Exercise:
Using a for-loop, print the first 20 elements of the Fibonacci sequence
fib.i_1 <- 0 #0th order term
fib.i <- 1 #1st term
for (i in seq(2, 20)) {
    print(paste0("the ", i, "th value of the Fibonacci sequence is:"))
    fib.num <- fib.i + fib.i_1 #Fibonacci step
    print(fib.num)
    # update which values I will be adding in the next step
    fib.i_1 <- fib.i
    fib.i <- fib.num
}
[1] "the 2th value of the Fibonacci sequence is:"
[1] 1
[1] "the 3th value of the Fibonacci sequence is:"
[1] 2
[1] "the 4th value of the Fibonacci sequence is:"
[1] 3
[1] "the 5th value of the Fibonacci sequence is:"
[1] 5
[1] "the 6th value of the Fibonacci sequence is:"
[1] 8
[1] "the 7th value of the Fibonacci sequence is:"
[1] 13
[1] "the 8th value of the Fibonacci sequence is:"
[1] 21
[1] "the 9th value of the Fibonacci sequence is:"
[1] 34
[1] "the 10th value of the Fibonacci sequence is:"
[1] 55
[1] "the 11th value of the Fibonacci sequence is:"
[1] 89
[1] "the 12th value of the Fibonacci sequence is:"
[1] 144
[1] "the 13th value of the Fibonacci sequence is:"
[1] 233
[1] "the 14th value of the Fibonacci sequence is:"
[1] 377
[1] "the 15th value of the Fibonacci sequence is:"
[1] 610
[1] "the 16th value of the Fibonacci sequence is:"
[1] 987
[1] "the 17th value of the Fibonacci sequence is:"
[1] 1597
[1] "the 18th value of the Fibonacci sequence is:"
[1] 2584
[1] "the 19th value of the Fibonacci sequence is:"
[1] 4181
[1] "the 20th value of the Fibonacci sequence is:"
[1] 6765
Example: Using for() loop to read multiple files
From now on, we will work with a simplified dataset originally available from the Kaggle website. The data contain actigraph recordings of motor activity of patients diagnosed with depression (condition group) as well as healthy individuals (control group). The Kaggle database serves as a source of high-quality data for machine learning. In the case of this dataset, it is believed that these data can be predictive of the depression status of a patient. Such a task is, of course, way beyond our scope. We will, nevertheless, explore the dataset together and compute some basic statistics.
First, create a new empty R script in your working directory. We will start by importing the tidyverse package and we will also create a new directory where we will store the figures.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dir.create("figures_lecture10", showWarnings = FALSE) #create a folder to store figures
Let’s now download the data and unzip them into the 10_data folder.
download.file(url = "https://ivanek.github.io/introductionToR/data/Kaggle_mentalHealthDataset.zip",
    destfile = "Kaggle_mentalHealthDataset.zip")
# the file name must match destfile exactly (file names can be case-sensitive)
unzip("Kaggle_mentalHealthDataset.zip", exdir = "10_data") # unzip your file
The actigraph data are now in the folder 10_data/Kaggle_MentalHealthDataset and are stored for each patient separately as csv (comma-separated values) files. Reading multiple separate files into R is an example of an operation that cannot be vectorized, so we will have to load these files one by one in a loop. First, let’s have a look at some of the files so we can see their structure. The csv files can be opened in standard spreadsheet editors, but we can conveniently use R. We first store the names of the individual files.
# get the vector of the file names
filenames.patients <- list.files("10_data/Kaggle_MentalHealthDataset")
print(filenames.patients)
[1] "condition_1.csv" "condition_10.csv" "condition_11.csv" "condition_12.csv"
[5] "condition_13.csv" "condition_14.csv" "condition_15.csv" "condition_16.csv"
[9] "condition_17.csv" "condition_18.csv" "condition_19.csv" "condition_2.csv"
[13] "condition_20.csv" "condition_21.csv" "condition_22.csv" "condition_23.csv"
[17] "condition_3.csv" "condition_4.csv" "condition_5.csv" "condition_6.csv"
[21] "condition_7.csv" "condition_8.csv" "condition_9.csv" "control_1.csv"
[25] "control_10.csv" "control_11.csv" "control_12.csv" "control_13.csv"
[29] "control_14.csv" "control_15.csv" "control_16.csv" "control_17.csv"
[33] "control_18.csv" "control_19.csv" "control_2.csv" "control_20.csv"
[37] "control_3.csv" "control_4.csv" "control_5.csv" "control_6.csv"
[41] "control_7.csv" "control_8.csv" "control_9.csv"
They can be imported as tibbles using the read_csv() function.
# explore some of the files
patient.23 <- read_csv(file.path("10_data/Kaggle_MentalHealthDataset", filenames.patients[23]),
    show_col_types = FALSE) #show_col_types = FALSE just to remove some unnecessary reporting
print(patient.23)
# A tibble: 20,318 × 2
timestamp activity
<dbl> <dbl>
1 1 5
2 2 5
3 3 5
4 4 5
5 5 5
6 6 17
7 7 15
8 8 5
9 9 5
10 10 5
# ℹ 20,308 more rows
patient.15 <- read_csv(file.path("10_data/Kaggle_MentalHealthDataset", filenames.patients[15]),
    show_col_types = FALSE)
print(patient.15)
# A tibble: 21,772 × 2
timestamp activity
<dbl> <dbl>
1 1 111
2 2 66
3 3 157
4 4 73
5 5 142
6 6 131
7 7 137
8 8 5
9 9 107
10 10 223
# ℹ 21,762 more rows
The structure is fairly simple. For each patient the tibble contains two columns. The first one (timestamp) shows the time points (in minutes from the beginning of the measurement) and the column activity shows the respective measured value. What differs between the patients is the number of rows, which reflects the fact that the patients carried the actigraph for different periods of time. This is important for reading these data into R correctly.
We will use the following strategy:
- Since all the records share the same column structure, we will read them one by one and then “stack” them on top of each other, so that they are stored as one very long tibble in the end. Thus, the functions read_csv() and bind_rows() will come in handy.
- After this simple stacking of the individual tibbles, we still need to keep track of which patient each row belongs to. Thus, we will use mutate() to add a patient.id column to each tibble before binding them together.
# pre-allocate an empty list
tbl.depression <- vector(mode = "list", length = length(filenames.patients))
# alternative way of pre-allocation:
# tbl.depression <- list(NULL)
# length(tbl.depression) <- length(filenames.patients)
for (i in seq_along(filenames.patients)) {
    patient.i <- read_csv(file.path("10_data/Kaggle_MentalHealthDataset", filenames.patients[i]),
        show_col_types = FALSE)
    patient.i <- patient.i |>
        mutate(patient.id = str_remove(filenames.patients[i], ".csv"), .before = timestamp) #add column with the patient's id as the first column
    tbl.depression[[i]] <- patient.i #remember, to enter a single list element, you need double brackets
}
tbl.depression <- bind_rows(tbl.depression) #glue the tibbles on top of each other
Now, all the data are stored in tbl.depression.
We can check if all the patients have been included.
# distinct() generates a tibble with one column, pull() turns this column into
# a simple char vector
patients <- tbl.depression |>
    distinct(patient.id) |>
    pull()
print(patients) #we see all the patients
[1] "condition_1" "condition_10" "condition_11" "condition_12" "condition_13"
[6] "condition_14" "condition_15" "condition_16" "condition_17" "condition_18"
[11] "condition_19" "condition_2" "condition_20" "condition_21" "condition_22"
[16] "condition_23" "condition_3" "condition_4" "condition_5" "condition_6"
[21] "condition_7" "condition_8" "condition_9" "control_1" "control_10"
[26] "control_11" "control_12" "control_13" "control_14" "control_15"
[31] "control_16" "control_17" "control_18" "control_19" "control_2"
[36] "control_20" "control_3" "control_4" "control_5" "control_6"
[41] "control_7" "control_8" "control_9"
We have the data in the long format (indeed, tbl.depression has 1225245 rows!). As discussed in previous lectures, this is actually a useful representation for plotting with ggplot. Let’s visually explore the data for a few patients.
#select patients to plot
plot.patients <- c("condition_1","condition_12","condition_6", "control_1", "control_2", "control_3")
#we have the long format (good for plotting)
#pdf("figures_lecture10/actigraph_example.pdf", width = 4, height = 6)
tbl.depression |>
    filter(patient.id %in% plot.patients) |> #filter only the rows with patient.ids in our selection
    ggplot(aes(x = timestamp, y = activity)) +
    geom_line() +
    facet_wrap(~ patient.id, ncol = 1)
#dev.off()
We can observe several things. First, for some patients (control_1 and control_3) the device recorded a very long sequence but apparently was not carried all the time. Second, there is a clear periodic behavior for each individual, which reflects circadian rhythms. Indeed, between time stamps 0 and 20000 minutes we see roughly 14 peaks corresponding to 14 days. And third, at first glance there is no noticeable difference between the patients and the controls (that is why it is believed that an ML approach could in principle help extract some “hidden” features with predictive capacity).
Let us now transform the data into the wide format, meaning that the recorded data over time will be stored in rows and individual patients will be the columns of the tibble.
# to wide format (timepoints in rows, patients in columns)
tbl.depression.wd <- tbl.depression |>
    pivot_wider(names_from = patient.id, values_from = activity)
print(dim(tbl.depression.wd))
[1] 65407 44
This format is more “explanatory” since in the rows we can directly see the measured values from all the samples at the same timestamp. But remember, we already saw that different samples have measurements for different time intervals. So in tbl.depression.wd there will inevitably be missing values (NAs), and we should keep this in mind.
any(is.na(tbl.depression)) #FALSE
[1] FALSE
any(is.na(tbl.depression.wd)) #TRUE
[1] TRUE
Let’s ask a simple question: Can the maximal measured actigraph value be linked to the depression status? That means, in the tibble tbl.depression.wd we want to apply the max() function to each column corresponding to patients/controls, store these values and compare them between the groups.
# solution using for loop
max.forloop <- numeric(length = length(patients)) #prepare the vector for collecting the results
names(max.forloop) <- patients
for (i in seq_along(patients)) {
    max.forloop[i] <- tbl.depression.wd |>
        select(patients[i]) |>
        na.omit() |>
        max()
}
print(max.forloop)
condition_1 condition_10 condition_11 condition_12 condition_13 condition_14
3526 4935 3506 3622 4859 2929
condition_15 condition_16 condition_17 condition_18 condition_19 condition_2
6825 6964 3195 1355 8000 4228
condition_20 condition_21 condition_22 condition_23 condition_3 condition_4
2150 3869 3222 5931 3847 6776
condition_5 condition_6 condition_7 condition_8 condition_9 control_1
4609 4129 4778 5437 2821 6117
control_10 control_11 control_12 control_13 control_14 control_15
4245 6752 2682 3180 5931 4935
control_16 control_17 control_18 control_19 control_2 control_20
8000 4658 4517 2918 4927 8000
control_3 control_4 control_5 control_6 control_7 control_8
3974 4354 3314 7431 8000 6321
control_9
2674
Replacing for() loop with apply() function
The for-loop in this case can be replaced by the more compact and more efficient apply() function with the following structure
apply(X, MARGIN, FUN)
- X is an array or matrix (or another rectangular, matrix-like object; a tibble or data frame is internally converted via the as.matrix() function)
- FUN is the function to be applied
- MARGIN is either 1 (apply the function to each row of X) or 2 (to each column of X)
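The role of MARGIN is easiest to see on a tiny matrix. The matrix m below and its row and column names are invented for this sketch.

```r
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))
# m looks like this:
#    a b c
# r1 1 3 5
# r2 2 4 6

# MARGIN = 1: FUN is applied to each row, one result per row
row_max <- apply(m, MARGIN = 1, FUN = max)

# MARGIN = 2: FUN is applied to each column, one result per column
col_max <- apply(m, MARGIN = 2, FUN = max)

print(row_max)
print(col_max)
```

The result is a named vector whose names come from the dimension that was iterated over.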
The idea behind this function exactly matches our need to apply the max() function over the patient/control columns in tbl.depression.wd. To implement it, we first remove the first column timestamp via select(-timestamp). Indeed, this column does not represent any patient but rather labels the rows. Next, we apply the function FUN = max over the columns, which means MARGIN = 2. Since we know that the data contain NA values, we have to add the flag na.rm = TRUE to the function.
# using apply function
max.apply <- tbl.depression.wd |>
    select(-timestamp) |>
    apply(MARGIN = 2, FUN = max, na.rm = TRUE)
# you can do the same thing in multiple ways
identical(max.forloop, max.apply) #TRUE
[1] TRUE
Note that na.rm = TRUE is not an argument of the apply() function itself but rather an argument of the max() function. This shows how additional arguments needed by FUN are passed through apply().
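The same forwarding works for any argument of FUN. Below is a small sketch on a made-up matrix m; the second call forwards two arguments (probs and na.rm) to quantile().

```r
m <- cbind(a = c(1, NA, 3), b = c(10, 20, 30))

# without na.rm the NA propagates into the result for column a
max_default <- apply(m, MARGIN = 2, FUN = max)

# na.rm = TRUE is not used by apply(); it is handed over to max()
max_narm <- apply(m, MARGIN = 2, FUN = max, na.rm = TRUE)

# several extra arguments can be forwarded at once, here to quantile()
q75 <- apply(m, MARGIN = 2, FUN = quantile, probs = 0.75, na.rm = TRUE)

print(max_default)
print(max_narm)
print(q75)
```

Naming the forwarded arguments explicitly, as done here, avoids any confusion with the arguments of apply() itself.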
Exercise:
Can we compute the max value from the original long-format tbl.depression using group_by() and summarize()?
max.summarise <- tbl.depression |>
    group_by(patient.id) |>
    summarise(max.value = max(activity))
# results in tibble format, but essentially the same
print(max.summarise)
# A tibble: 43 × 2
patient.id max.value
<chr> <dbl>
1 condition_1 3526
2 condition_10 4935
3 condition_11 3506
4 condition_12 3622
5 condition_13 4859
6 condition_14 2929
7 condition_15 6825
8 condition_16 6964
9 condition_17 3195
10 condition_18 1355
# ℹ 33 more rows
Let’s plot the results (revisit the lecture about plotting if needed).
# plot the maximal values for patients and controls
tbl.maxRes <- tibble(patient.id = names(max.apply), max.value = unname(max.apply))
# alternative: simply builds a tibble from a named vector, elegant
# tbl.maxRes <- enframe(max.apply, name = 'patient.id', value = 'max.value')
# we will add a column with diagnosis for each patient (depression or control)
# 23 first rows are depression patients, the rest 20 are controls
diagnosis <- c(rep("depression", 23), rep("control", 20))
tbl.maxRes <- tbl.maxRes |>
    mutate(diagnosis = diagnosis)
# adaptation of the boxplot from the lecture about plotting
# pdf('figures_lecture10/actigraph_max_detected.pdf', width = 5, height = 5)
tbl.maxRes |>
    ggplot(aes(x = diagnosis, y = max.value)) + geom_boxplot(outlier.size = -1) +
    theme_minimal() + theme(axis.text = element_text(size = 15), axis.title = element_text(size = 18),
    plot.title = element_text(size = 16, face = "bold")) + ggtitle("max. recorded actigraph value") +
    geom_jitter(alpha = 0.5, size = 2, height = 0, width = 0.25)
# dev.off()
As expected, the maximal detected value alone does not seem to differ between the groups (you will learn more about how to do statistical tests in R in the upcoming lectures). Let’s say we want to compute some other descriptive statistics in both groups, such as the mean, the median and the 3rd quartile. To do that, we will write a simple function
descriptive.stat <- function(x, na.rm = TRUE) {
    if (na.rm) {
        x <- na.omit(x)
    }
    stats <- c(mean(x), median(x), quantile(x, 0.75))
    names(stats) <- c("mean", "median", "3rd quartile")
    return(stats)
}
This function can then be called on each patient with apply(). Before we do that, remember that in some cases data were still being recorded while the device was no longer carried properly (check the figure with examples of actigraph recordings that we generated). Thus, we will truncate the dataset to the first 20000 minutes.
# restrict the data to 0:20000 timesteps (select first 20000 rows)
tbl.depression.wd.trunc <- tbl.depression.wd |>
    head(20000)
Let’s now apply our function descriptive.stat() to the truncated data
stat.apply <- tbl.depression.wd.trunc |>
    select(-timestamp) |>
    apply(MARGIN = 2, FUN = descriptive.stat)
print(stat.apply)
condition_1 condition_10 condition_11 condition_12 condition_13
mean 152.0564 307.5275 139.6669 166.5213 258.9798
median 18.0000 138.0000 0.0000 29.0000 101.0000
3rd quartile 197.0000 468.0000 131.0000 219.0000 407.0000
condition_14 condition_15 condition_16 condition_17 condition_18
mean 78.41665 115.3727 245.6214 86.4196 74.40955
median 0.00000 3.0000 47.0000 15.0000 8.00000
3rd quartile 49.00000 140.0000 373.0000 106.0000 93.00000
condition_19 condition_2 condition_20 condition_21 condition_22
mean 163.4425 207.7044 68.2422 78.4702 165.2143
median 26.0000 32.0000 0.0000 0.0000 21.0000
3rd quartile 195.0000 268.0000 45.0000 26.0000 275.0000
condition_23 condition_3 condition_4 condition_5 condition_6
mean 252.6683 264.8822 293.8063 173.9925 203.8545
median 50.0000 52.0000 87.0000 36.0000 21.0000
3rd quartile 394.0000 394.0000 421.0000 210.0000 294.0000
condition_7 condition_8 condition_9 control_1 control_10
mean 285.4507 184.9345 179.8345 232.176 288.5357
median 32.0000 3.0000 35.0000 60.000 52.0000
3rd quartile 381.0000 172.0000 231.0000 334.000 454.0000
control_11 control_12 control_13 control_14 control_15 control_16
mean 192.8289 151.0216 181.8055 375.4785 296.6492 244.6996
median 32.0000 18.0000 44.0000 93.0000 103.0000 107.0000
3rd quartile 242.0000 227.0000 247.0000 564.0000 438.0000 360.0000
control_17 control_18 control_19 control_2 control_20 control_3
mean 251.5357 296.6109 225.4137 380.3896 359.0317 249.135
median 98.0000 134.0000 62.0000 131.0000 166.0000 103.000
3rd quartile 373.0000 425.0000 327.0000 600.0000 587.0000 372.000
control_4 control_5 control_6 control_7 control_8 control_9
mean 206.4836 308.9847 349.2488 406.4155 406.8263 139.374
median 30.0000 143.0000 131.0000 148.0000 250.0000 10.000
3rd quartile 231.0000 500.0000 483.0000 568.0000 644.0000 152.000
The function can also be defined directly within the apply() call
stat.apply.allinone <- tbl.depression.wd.trunc |>
    select(-timestamp) |>
    apply(MARGIN = 2, function(x) {
        x <- na.omit(x)
        stats <- c(mean(x), median(x), quantile(x, 0.75))
        names(stats) <- c("mean", "median", "3rd quartile")
        return(stats)
    })
identical(stat.apply.allinone, stat.apply)
[1] TRUE
One-dimensional alternatives: lapply(), sapply(), vapply()
If we want to call a function over a one-dimensional object (list, vector), there are alternatives to the apply() function with the following syntax
lapply(X, FUN)
sapply(X, FUN)
vapply(X, FUN, FUN.VALUE)
- X is a vector or list
- FUN is the function to be applied
- FUN.VALUE (only vapply()) is the definition of the expected output per element
These three functions work in a similar way, traversing a set of data such as a list or a vector and calling the specified function on each element. They differ in the output format, though.
- lapply() returns a list
- sapply() returns a simplified version of the output, i.e. a vector or matrix, if the output for every element has the same length. If the lengths are not identical, it returns a list. This might create an issue if the subsequent code expects a vector or matrix as input.
- vapply() also returns a simplified version of the output; however, it requires a specification of the output per element. This extra work pays off: we can be sure what output type is returned. Therefore, using vapply() is a much safer approach than using sapply(), and it is often even faster.
The recommendation is to use vapply() and define the expected output; otherwise use lapply().
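The difference in safety can be sketched on a small, made-up list whose elements have different lengths. The list lst and the helper keep_large() are hypothetical and exist only for this illustration.

```r
# a list with elements of different lengths
lst <- list(a = 1:3, b = 1:5, c = integer(0))

# helper returning a different number of values per element
keep_large <- function(v) v[v > 2]

# sapply() cannot simplify here and silently falls back to a list
out_sapply <- sapply(lst, keep_large)
print(class(out_sapply))

# vapply() stops with an error as soon as the output does not match FUN.VALUE
out_vapply <- tryCatch(
    vapply(lst, keep_large, FUN.VALUE = integer(1)),
    error = function(e) "vapply stopped with an error"
)
print(out_vapply)

# when every element yields exactly one value, both return the same vector
len_sapply <- sapply(lst, length)
len_vapply <- vapply(lst, length, FUN.VALUE = integer(1))
print(identical(len_sapply, len_vapply))
```

The silent fallback of sapply() is what makes it risky inside longer scripts, whereas vapply() fails early and at the place where the unexpected output is produced.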
Let’s check their functionality on a simple (and admittedly artificial) example. We will extract the vector of the computed means for each patient/control, square each element using the lapply(), sapply() and vapply() functions, and compare the outputs.
patient.means <- stat.apply["mean", ]
lapply(patient.means, function(x) x^2) #returns a list
$condition_1
[1] 23121.15
$condition_10
[1] 94573.16
$condition_11
[1] 19506.84
$condition_12
[1] 27729.34
$condition_13
[1] 67070.54
$condition_14
[1] 6149.171
$condition_15
[1] 13310.86
$condition_16
[1] 60329.85
$condition_17
[1] 7468.347
$condition_18
[1] 5536.781
$condition_19
[1] 26713.45
$condition_2
[1] 43141.12
$condition_20
[1] 4656.998
$condition_21
[1] 6157.572
$condition_22
[1] 27295.76
$condition_23
[1] 63841.24
$condition_3
[1] 70162.55
$condition_4
[1] 86322.14
$condition_5
[1] 30273.41
$condition_6
[1] 41556.66
$condition_7
[1] 81482.13
$condition_8
[1] 34200.75
$condition_9
[1] 32340.43
$control_1
[1] 53905.69
$control_10
[1] 83252.85
$control_11
[1] 37183
$control_12
[1] 22807.54
$control_13
[1] 33053.24
$control_14
[1] 140984.1
$control_15
[1] 88000.75
$control_16
[1] 59877.92
$control_17
[1] 63270.18
$control_18
[1] 87978.03
$control_19
[1] 50811.36
$control_2
[1] 144696.2
$control_20
[1] 128903.8
$control_3
[1] 62068.25
$control_4
[1] 42635.48
$control_5
[1] 95471.54
$control_6
[1] 121974.7
$control_7
[1] 165173.5
$control_8
[1] 165507.6
$control_9
[1] 19425.11
sapply(patient.means, function(x) x^2) #simplified output, vector in this case
condition_1 condition_10 condition_11 condition_12 condition_13 condition_14
23121.149 94573.163 19506.843 27729.343 67070.537 6149.171
condition_15 condition_16 condition_17 condition_18 condition_19 condition_2
13310.860 60329.848 7468.347 5536.781 26713.451 43141.118
condition_20 condition_21 condition_22 condition_23 condition_3 condition_4
4656.998 6157.572 27295.765 63841.245 70162.553 86322.142
condition_5 condition_6 condition_7 condition_8 condition_9 control_1
30273.407 41556.657 81482.131 34200.752 32340.429 53905.695
control_10 control_11 control_12 control_13 control_14 control_15
83252.850 37183.004 22807.539 33053.240 140984.142 88000.748
control_16 control_17 control_18 control_19 control_2 control_20
59877.919 63270.183 87978.026 50811.359 144696.248 128903.798
control_3 control_4 control_5 control_6 control_7 control_8
62068.248 42635.477 95471.545 121974.724 165173.518 165507.598
control_9
19425.112
vapply(patient.means, function(x) x^2, FUN.VALUE = numeric(1))
condition_1 condition_10 condition_11 condition_12 condition_13 condition_14
23121.149 94573.163 19506.843 27729.343 67070.537 6149.171
condition_15 condition_16 condition_17 condition_18 condition_19 condition_2
13310.860 60329.848 7468.347 5536.781 26713.451 43141.118
condition_20 condition_21 condition_22 condition_23 condition_3 condition_4
4656.998 6157.572 27295.765 63841.245 70162.553 86322.142
condition_5 condition_6 condition_7 condition_8 condition_9 control_1
30273.407 41556.657 81482.131 34200.752 32340.429 53905.695
control_10 control_11 control_12 control_13 control_14 control_15
83252.850 37183.004 22807.539 33053.240 140984.142 88000.748
control_16 control_17 control_18 control_19 control_2 control_20
59877.919 63270.183 87978.026 50811.359 144696.248 128903.798
control_3 control_4 control_5 control_6 control_7 control_8
62068.248 42635.477 95471.545 121974.724 165173.518 165507.598
control_9
19425.112
Exercise:
In this simple task, the easiest way would be to just type patient.means^2. A slightly more practical task is to rewrite the original for-loop from today’s lecture that reads in the data files, using the lapply() function. Can you manage?
dataFiles <- lapply(file.path("10_data/Kaggle_MentalHealthDataset", filenames.patients),
    function(x) {
        data <- read_csv(x, show_col_types = FALSE)
        # get the name of the sample, first remove the path
        name <- str_remove(x, "10_data/Kaggle_MentalHealthDataset/")
        # now remove the suffix
        name <- str_remove(name, ".csv")
        data <- data |>
            mutate(patient.id = name, .before = timestamp)
    })
# stack the data on top of each other
dataFiles <- bind_rows(dataFiles)
identical(tbl.depression, dataFiles)
[1] TRUE
Session info
sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Zurich
tzcode source: internal
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[9] ggplot2_3.5.1 tidyverse_2.0.0 png_0.1-8 knitr_1.49
loaded via a namespace (and not attached):
[1] utf8_1.2.4 generics_0.1.3 stringi_1.8.4 hms_1.1.3
[5] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.1 timechange_0.3.0
[9] fastmap_1.2.0 jsonlite_1.8.9 formatR_1.14 fansi_1.0.6
[13] scales_1.3.0 codetools_0.2-20 cli_3.6.3 rlang_1.1.4
[17] crayon_1.5.3 bit64_4.5.2 munsell_0.5.1 withr_3.0.2
[21] yaml_2.3.10 tools_4.3.2 parallel_4.3.2 tzdb_0.4.0
[25] colorspace_2.1-1 vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4
[29] htmlwidgets_1.6.4 bit_4.5.0 vroom_1.6.5 pkgconfig_2.0.3
[33] pillar_1.9.0 gtable_0.3.6 glue_1.8.0 xfun_0.49
[37] tidyselect_1.2.1 rstudioapi_0.17.1 farver_2.1.2 htmltools_0.5.8.1
[41] rmarkdown_2.29 labeling_0.4.3 compiler_4.3.2