RStudio is an Integrated Development Environment (IDE) for the R programming language.
It is one of the most productive option for using R
It enables interactive data analysis, reproducible research and result publishing
It can be locally installed (desktop version) or accessed via a web browser (RStudio server).
RStudio interface:
Working style
For each lecture, maintain a <lecture_name>.R or <lecture_name>.Rmd script on your working directory, storing all your commands
proper indentation of the code (CMD+I in Rstudio) is always a plus for readability
A code without comments is useless.
Always keep in mind reproducibility. Most of us use R/Bioconductor for their research: many months after the analysis was performed, some reviewer/colleague might ask for the code used to generate some specific result.
Use a clean working directory structure.
Recommended steps/commands at the beginning of every R script
Start by setting up the working directory (setwd()) with absolute path
List what is in workspace with ls() to check that it is clean
Load libraries with library()
When quitting R, do not save history and data. This might lead to a big mess later (and anyway all commands necessary to reproduce the analysis are in your script, right?)
When using R Studio, navigate to Tools -> Global Options -> General and unset “restore .RData into workspace on startup” and set “Save workspace .RData on exit” to Never.
Tip: use the auto completion (via TAB) as much as possible
Getting help
Really read the error messages! They might seem cryptic at first, but yet often contain useful information.
First carefully look at R help pages for functions (e.g., ?lm or pressing F1 when the cursor is positioned inside the function name) and to the vignettes of the packages
Use LLMs (responsibly) to both analyze possible causes of error messages and get suggestions to avoid them
Talk to colleagues to get a different perspective on the problem
If no success, ask on forums or mailing lists
Don’t forget to attach to your question the result of the function sessionInfo() (or the results of the function session_info() from the sessioninfo package, slightly easier to read). This documents what system you are using, the version of R and the different packages
# save the outpoutcapture.output(devtools::session_info(pkgs='attached'),file="data/01/session_info_RevisionRBasics.txt")
Do not forget to attach some code example allowing others to reproduce the issue, ideally a Minimal Reproducible Example
Be sure to work on an up-to-date system (R and Bioconductor)! Bugs are regularly fixed in Bioconductor packages
Important!
Asking questions is highly encouraged
If something is not working or understood, reach out for help as early as possible
Steep learning curve: practice, practice and practice again!
The best option is to get access to some genomics dataset to analyze (from your lab, from a paper your like, etc) and to train and apply what you learn during the lectures on it
Short revision of basic R
In the introduction to R basics, we mostly used functions from the tidyverse, which provides a modern, widely used collection of R packages designed for data science. However, for some aspects, we also illustrated how to perform the same operations using “base R” commands, that is, functions that are loaded with R by default rather than provided in a separate add-on package.
Import file
Read data from a text file into R represented as a tibble or data.frame.
Load the packages required for analysis or install them if not present.
# Install package if not available# install.packages("palmerpenguins")# Load the librarylibrary(palmerpenguins)# If you have the data saved in a file, you would read it from disk# into tibble# dat = readr::read_csv("data/01/penguins.csv")# or into data frame# df = read.table("data/01/penguins.csv", sep = ",", header = TRUE)# Rename to match the rest of the analysisdat =tibble(penguins)
str(dat)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
dim(dat)
[1] 344 8
dat
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm
<fct> <fct> <dbl> <dbl> <int>
1 Adelie Torgersen 39.1 18.7 181
2 Adelie Torgersen 39.5 17.4 186
3 Adelie Torgersen 40.3 18 195
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen 36.7 19.3 193
6 Adelie Torgersen 39.3 20.6 190
7 Adelie Torgersen 38.9 17.8 181
8 Adelie Torgersen 39.2 19.6 195
9 Adelie Torgersen 34.1 18.1 193
10 Adelie Torgersen 42 20.2 190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>
# data framedf =as.data.frame(dat) head(df)
species island bill_length_mm bill_depth_mm flipper_length_mm
1 Adelie Torgersen 39.1 18.7 181
2 Adelie Torgersen 39.5 17.4 186
3 Adelie Torgersen 40.3 18.0 195
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen 36.7 19.3 193
6 Adelie Torgersen 39.3 20.6 190
body_mass_g sex year
1 3750 male 2007
2 3800 female 2007
3 3250 female 2007
4 NA <NA> 2007
5 3450 female 2007
6 3650 male 2007
str(df)
'data.frame': 344 obs. of 8 variables:
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
dim(df)
[1] 344 8
The penguins dataset can be stored in a tibble object (dat) or in data.frame (df). There are three main differences in the usage of a data frame vs a tibble: printing, subsetting and availability of row names.
the dataset contains 344 rows (observations) and 8 columns (variables): 344 penguin observations, with 8 information per observations (variable)
Each column of a data.frame is usually of an atomic (=simple) data type. Atomic data types have values that cannot be divided or broken down further.
Common atomic data types include:
logical (TRUE, FALSE and NA i.e., not available)
integer (3L). Use the L when creating and tells R to store this as an integer.
numeric (3.14, Inf, NaN)
character ("three")
factor for categorical variables (e.g., blood types "A","B","AB","0"). Variables that can take on one of a limited, and usually fixed, number of possible values.
Common data types derived from (are the combination of) the atomic ones include:
data.frame/tibble: all columns are of the same length, but each can be a different data type
matrix: all columns/rows are the same length and all elements are of the same atomic data type
list: no restriction on length/data type per element. Can have character names
You can query for the data type (also called class) of a variable using the function class()
class(df$body_mass_g) # atomic data type. Cannot be broken down further.
[1] "integer"
Change data type:
str(df)
'data.frame': 344 obs. of 8 variables:
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
head(dat$sex)
[1] male female female <NA> female male
Levels: female male
# Sometimes the factor might not be the best column type, we change sex to # character for illustration purposesdat$sex =as.character(dat$sex)df$sex =as.character(df$sex)head(dat$sex)
[1] "male" "female" "female" NA "female" "male"
# If you check the levels of the variable: levels(df$island) # [1] "Biscoe" "Dream" "Torgersen"
[1] "Biscoe" "Dream" "Torgersen"
# However, imagine Torgersen is the main island# you can choose which level to put as reference leveldf$island <-relevel(df$island, ref ="Torgersen")levels(df$island) # [1] "Torgersen" "Biscoe" "Dream"
[1] "Torgersen" "Biscoe" "Dream"
Remember that behind the scenes, a factor is encoded with numerical values.
Whether a string variable should be considered as categorical or not depends on the question asked.
Subsetting
A common operation is to subset data sets, that is, to only retain some of the records. The specification of what to retain can be done in many different ways. There are many ways of achieving it:
1. by numeric index:
slice: To retain a pre-specified set of rows, we can use the slice function from dplyr.
Select: the select function, also from dplyr, allows us to retain a given set of columns
# first rowslice(dat, 1)
# A tibble: 1 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torge… 39.1 18.7 181 3750
# ℹ 2 more variables: sex <chr>, year <int>
#slice(df, 1)# first, third and fifth rowslice(dat, c(1, 3, 5))
# A tibble: 3 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torge… 39.1 18.7 181 3750
2 Adelie Torge… 40.3 18 195 3250
3 Adelie Torge… 36.7 19.3 193 3450
# ℹ 2 more variables: sex <chr>, year <int>
#slice(df, c(1, 3, 5))# Check if dat and df give the same result:# first to third column using tibbleon_dat =select(dat,1:3)# first to third column using data.frameon_df =select(df,1:3)table(on_dat==on_df)
TRUE
1030
Side note
A more intuitive way, which reduces the number of intermediate variables that are explicitly created and scales to more complex operations, is to use the pipe operator (|>). Pipe is an infix operator that enables function composition g(f(x)) = x |> f() |> g().
# Select column 1:3, and row 1:5 of the dat tibble or df data.framedat |>slice(1:5) |>select(1:3)
# A tibble: 5 × 3
species island bill_length_mm
<fct> <fct> <dbl>
1 Adelie Torgersen 39.1
2 Adelie Torgersen 39.5
3 Adelie Torgersen 40.3
4 Adelie Torgersen NA
5 Adelie Torgersen 36.7
square brackets []: Subsetting can also be done using base R commands. This type of subsetting is used a lot in S4 classes, such as SinglCellExperiment that underpin most of genomics analysis in Bioconductor.
# first five rows, first three columnsdat[1:5, 1:3]
# A tibble: 5 × 3
species island bill_length_mm
<fct> <fct> <dbl>
1 Adelie Torgersen 39.1
2 Adelie Torgersen 39.5
3 Adelie Torgersen 40.3
4 Adelie Torgersen NA
5 Adelie Torgersen 36.7
2. by row- and/or column names with:
Can be achieved with select or square brackets. An important shortcut to extract a single column is provided by the $ sign
# Use row number and colnamesdat[1:5, c("species", "island", "bill_length_mm")]
# A tibble: 5 × 3
species island bill_length_mm
<fct> <fct> <dbl>
1 Adelie Torgersen 39.1
2 Adelie Torgersen 39.5
3 Adelie Torgersen 40.3
4 Adelie Torgersen NA
5 Adelie Torgersen 36.7
Side note
There is an important difference between a tibble and a data.frame when it comes to subsetting using the bracket operator. By default, extracting a single column from a data.frame returns a vector (not a data.frame), while performing the same operation on a tibble returns another tibble. The key to getting a data.frame when extracting a column from a data.frame is to add the argument drop = FALSE. The $ operator always returns a vector. The tidyverse counterpart of $ operator that can be used in pipes is pull function from dplyr package.
# extract the first five elements in the 'species' columnstr(df[,'species']) # vector
# Now we are back to subsetting by row numbers, this is useful e.g. for# reordering, here we return the rows of interest in reverse orderdat[rev(which(dat$island=='Torgersen'& dat$year ==2007)),]
%in%: %in% This operator is used to identify if a series of elements of a vector belong to another vector. This is also very useful for subsetting data frames
Adding a new column to a data set, or replacing an existing column
mutate: function from dplyr
# Add another column to `dat` that contains the sum of the `bill_length_mm` and `bill_depth_mm` columns.dat |>mutate(lpd = bill_length_mm + bill_depth_mm)
# A tibble: 344 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm
<fct> <fct> <dbl> <dbl> <int>
1 Adelie Torgersen 39.1 18.7 181
2 Adelie Torgersen 39.5 17.4 186
3 Adelie Torgersen 40.3 18 195
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen 36.7 19.3 193
6 Adelie Torgersen 39.3 20.6 190
7 Adelie Torgersen 38.9 17.8 181
8 Adelie Torgersen 39.2 19.6 195
9 Adelie Torgersen 34.1 18.1 193
10 Adelie Torgersen 42 20.2 190
# ℹ 334 more rows
# ℹ 4 more variables: body_mass_g <int>, sex <chr>, year <int>, lpd <dbl>
#Add column in `dat` that contains the bill length in cm. Display only the species column and the newly created columndat |>mutate(bill_length_cm = bill_length_mm /10) |>select(species, bill_length_cm)
Mutate is a very powerful function able to solve many of problems that you might encounter when trying to transform your data. It accepts many functions and tidyselect semantics on the righthand side.
# Example of more complicated mutates, only for illustration purposes# This example doesn't use a native R pipe |>, but a bit more flexible # "tidyverse" pipe %>%dat %>%mutate(s =apply(., 1, \(x) paste(x, collapse ="_"))) %>%mutate(ex =rowSums(across(starts_with("bill")))) %>%select(s, ex)
# A tibble: 344 × 2
s ex
<chr> <dbl>
1 Adelie_Torgersen_39.1_18.7_181_3750_male_2007 57.8
2 Adelie_Torgersen_39.5_17.4_186_3800_female_2007 56.9
3 Adelie_Torgersen_40.3_18.0_195_3250_female_2007 58.3
4 Adelie_Torgersen_NA_NA_NA_NA_NA_2007 NA
5 Adelie_Torgersen_36.7_19.3_193_3450_female_2007 56
6 Adelie_Torgersen_39.3_20.6_190_3650_male_2007 59.9
7 Adelie_Torgersen_38.9_17.8_181_3625_female_2007 56.7
8 Adelie_Torgersen_39.2_19.6_195_4675_male_2007 58.8
9 Adelie_Torgersen_34.1_18.1_193_3475_NA_2007 52.2
10 Adelie_Torgersen_42.0_20.2_190_4250_NA_2007 62.2
# ℹ 334 more rows
$: $ can also be used to add a new column to a data.frame
dat$bill_length_cm = dat$bill_length_mm /10
Finding all unique combinations
unique function from base R can be used to accomplish the same with a bit of extra typing compared to tidyverse.
unique(dat[, c("species", "island")])
# A tibble: 5 × 2
species island
<fct> <fct>
1 Adelie Torgersen
2 Adelie Biscoe
3 Adelie Dream
4 Gentoo Biscoe
5 Chinstrap Dream
distinct(): function from dplyr which returns all distinct rows in a data.frame or tibble. A useful argument to distinct() is .keep_all, that evaluates the uniqueness of specified column combinations, and keeps the first observation of the rest of the columns.
dat |>arrange(body_mass_g) |>distinct(species, island, .keep_all =TRUE)
Function summary from base R offers a quick overview of the basic summary statistics for each column.
summary(dat)
species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 Length:344 Min. :2007
1st Qu.:190.0 1st Qu.:3550 Class :character 1st Qu.:2007
Median :197.0 Median :4050 Mode :character Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
bill_length_cm
Min. :3.210
1st Qu.:3.922
Median :4.445
Mean :4.392
3rd Qu.:4.850
Max. :5.960
NA's :2
However, we might need to compute additional ones such as SD, SEM etc. When performing the calculations ourselves, we need to handle NAs gracefully. We already got some indication of presence of NAs from the summary function.
mean(dat$bill_length_cm) # return NA
[1] NA
# Deal with missing datamean(dat$bill_length_cm, na.rm =TRUE)
[1] 4.392193
sd(dat$bill_length_cm, na.rm =TRUE)
[1] 0.5459584
For our example it seems more useful to split the flipper length according to species and/or sex and compute summary statistics on these data subsets.
The data set can be split in six groups: 1. Adelie - male 2. Adelie - female 3. Chinstrap - male 4. Chinstrap - female 5. Gentoo - male 6. Gentoo - female
1. Split into lists and compute mean using apply functions (base R)
# Split the data in four groups: dataSplit <-split(dat$flipper_length_mm, list(dat$species, df$sex))dataSplit # the result is a list with for each group, the flipper_length_mm of the corresponding species and sex
# Apply on each group, the mean calculation: groupMeans <-sapply(dataSplit, mean) # sapply is specific to apply a function on a list groupMeans # Output is the mean of each group, saved in a list
aggregate(flipper_length_mm ~ species + sex, data=dat, FUN=meanAndStd)
species sex flipper_length_mm.mean flipper_length_mm.std
1 Adelie female 187.794521 5.595035
2 Chinstrap female 191.735294 5.754096
3 Gentoo female 212.706897 3.897856
4 Adelie male 192.410959 6.599317
5 Chinstrap male 199.911765 5.976558
6 Gentoo male 221.540984 5.673252
3. Using split-apply-combine strategy (tidyverse)
# We can use filter(sex %in% c("male", "female")) in the pipe to filter out NA# sexdat |>summarise(fmean =mean(flipper_length_mm, na.rm =TRUE),fsd =sd(flipper_length_mm, na.rm =TRUE),.by =c(species, sex) )
# A tibble: 8 × 4
species sex fmean fsd
<fct> <chr> <dbl> <dbl>
1 Adelie male 192. 6.60
2 Adelie female 188. 5.60
3 Adelie <NA> 186. 6.11
4 Gentoo female 213. 3.90
5 Gentoo male 222. 5.67
6 Gentoo <NA> 216. 1.26
7 Chinstrap female 192. 5.75
8 Chinstrap male 200. 5.98
Visual summaries
There are good reasons to visualize the data instead of deriving conclusions from summary statistics alone:
It provides more complete quality check on a dataset. For example, extreme values can render summary statistics meaningless
Meaningful trends are often easier to grasp visually.
With base graphics
A boxplot will give a visual summary of the data spread per grouping variable.
It is best applied to continuous data
The formula interface allows to split the data into subsets before plotting it
par(mar=c(10,5,1,1))boxplot(flipper_length_mm ~ species + sex, data=dat,las=2,xlab='')
Boxplots convey summary statistics (median, data spread and outliers) for large datasets
For small datasets (<= 5 points per group) it is more informative to plot the data points
Secondary plotting functions such as lines() and points() can be called after the main figure has been drawn to add elements
boxplot(flipper_length_mm ~ species + sex, data=dat, outline=F,las=2,xlab='')for (n in1:length(dataSplit)) { # for each group/box x <-jitter(rep(n, length(dataSplit[[n]])),amount=0.1) # amount to shift all the points if same valuepoints(x, dataSplit[[n]], pch=1.2, cex=0.5,col='blue',lwd=0.8)}
A scatterplot will give a visual summary of relationship of two numeric (usually continuous) variables.
We can add additional mappings such as points color, size or symbol
there seems to be a dependence of flipper length on body mass
the Gentoo species is heavier than the other two and their males are heavier than females
for the same body mass, the females might have longer flippers
With ggplot2
An elegant alternative to the base graphic system is the ggplot2 package system, which is an implementation of “The Grammar of Graphics” (Wilkinson 2005). Here each figure is composed of the same few building blocks, such as data, aesthetics and scales.
suppressPackageStartupMessages(library(ggplot2))dat |># what data to plotdrop_na() |># we remove NA for consistency with previous plotsggplot() +# ggplot callaes(x = species:sex, y = flipper_length_mm) +# aesthetics = which variablessgeom_boxplot(outlier.size =-1, fill ="lightblue") +# add boxplotgeom_jitter(shape =21, size =1, width =0.15, color ="grey20") +# add pointstheme_minimal(base_size =14) +# change font size labs(x ="species - sex", y ="flipper length (mm)", title ="Penguins")
Fitting a linear model
Linear models are used to test hypotheses on the relation between a numerical variable and a set of independent predictors. These can be:
numerical as in linear regression
categorical as for t-test or ANOVA
A combination of both
We can use the function lm() to test hypotheses about the dependence of flipper length(mm) on species and sex in our penguin dataset.
linmod <-lm(flipper_length_mm ~ species * sex, data=dat)summary(linmod)
Call:
lm(formula = flipper_length_mm ~ species * sex, data = dat)
Residuals:
Min 1Q Median 3Q Max
-15.7945 -3.4110 0.0882 3.4590 17.5890
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 187.7945 0.6619 283.721 < 2e-16 ***
speciesChinstrap 3.9408 1.1742 3.356 0.000884 ***
speciesGentoo 24.9124 0.9947 25.044 < 2e-16 ***
sexmale 4.6164 0.9361 4.932 1.3e-06 ***
speciesChinstrap:sexmale 3.5600 1.6606 2.144 0.032782 *
speciesGentoo:sexmale 4.2176 1.3971 3.019 0.002737 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.655 on 327 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8396, Adjusted R-squared: 0.8372
F-statistic: 342.4 on 5 and 327 DF, p-value: < 2.2e-16
To correctly interpret the coefficients of a linear model remember that by default they reflect differences to a baseline
In our model, the baseline (here Intercept) is defined by the set of reference levels Adelie (for species) and female (for sex), respectively.
Lets check how the model coefficients relate to the group means calculated above:
# Intercept. The test statistics evaluate the difference of Adelie.female# (baseline) penguin flipper length (average) from 0. As expected, this is# highly significant but not at all relevant.groupMeans['Adelie.female']
Adelie.female
187.7945
## speciesChinstrap: Difference between Chinstrap amd Adelie in female groupgroupMeans['Chinstrap.female'] - groupMeans['Adelie.female'] # deference to baseline
Chinstrap.female
3.940774
## sexmale: Difference between female and male for species AdeliegroupMeans['Adelie.male'] - groupMeans['Adelie.female'] # deference to baseline
Adelie.male
4.616438
## speciesChinstrap:sexmale: A difference of differences (called 'interaction')(groupMeans['Chinstrap.male'] - groupMeans['Chinstrap.female']) - (groupMeans['Adelie.male'] - groupMeans['Adelie.female'] )
Chinstrap.male
3.560032
Exercises
In the exercise we will use data from a glucose response experiment (secretin data). Five patients with arterial hypertension were dosed with the hormone secretin. Among other roles, secretin is stimulating insulin release from the pancreas after oral glucose intake. Blood glucose levels were recorded in duplicates before administration and in intervals of 20, 30, 60 and 90 minutes after administration of the hormone. The dataset is a subset of the secretin data from the ISwR R package.
1. Read-in
Read-in the example dataset using an appropriate import function. Consult the help pages if you are uncertain about the function usage.
Solution 1
# Import the datasetdata = readr::read_csv("data/01/secretin.csv")
Rows: 50 Columns: 4
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (3): patient, time, replicate
dbl (1): glucose
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
2. Data structure
How is the data structured? Which variables are present and what are their data types?
Solution 2
data
# A tibble: 50 × 4
glucose patient time replicate
<dbl> <chr> <chr> <chr>
1 92 A pre a
2 93 A pre b
3 84 A 20 a
4 88 A 20 b
5 88 A 30 a
6 90 A 30 b
7 86 A 60 a
8 89 A 60 b
9 87 A 90 a
10 90 A 90 b
# ℹ 40 more rows
str(data)
spc_tbl_ [50 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ glucose : num [1:50] 92 93 84 88 88 90 86 89 87 90 ...
$ patient : chr [1:50] "A" "A" "A" "A" ...
$ time : chr [1:50] "pre" "pre" "20" "20" ...
$ replicate: chr [1:50] "a" "b" "a" "b" ...
- attr(*, "spec")=
.. cols(
.. glucose = col_double(),
.. patient = col_character(),
.. time = col_character(),
.. replicate = col_character()
.. )
- attr(*, "problems")=<externalptr>
3. Data “clean-up”
If needed, change the data type of the column of the data frame so that they are represented with a relevant data type. When needed, relevel factors with the function relevel() to have a meaningful reference level.
Solution 3
## Some columns are better treated as factordata$patient <-factor(data$patient)data$time <-factor(data$time)data$replicate <-factor(data$replicate)## Relevel time to have "pre" as referencedata$time <-relevel(data$time, ref="pre")
4. Exploring
How many observations per time and/or patient are present? Use the functions split(), sapply() and length() to answer the question
Be careful about differentiating biological and technical replication. Here the patients represent the biological replicates, and the 2 measurements per patient and time point are purely technical replicates. It is rarely the case that technical variation is larger than biological variation, and if so, it often can be ignored. We will average the technical replicates for the next steps of the analysis.
Use the aggregate() function to average the glucose levels for each time/patient combination across replicates
Solution 5
data <-aggregate(glucose ~ time+patient, data, FUN=mean) head(data)
time patient glucose
1 pre A 92.5
2 20 A 86.0
3 30 A 89.0
4 60 A 87.5
5 90 A 88.5
6 pre B 85.0
6. Plotting
Construct a boxplot of replicate-averaged glucose levels versus time points:
Overlay it with the single data points
Highlight the different patients with different colors
Solution 6
ggplot(data = data, aes(x = time, y = glucose)) +geom_boxplot() +geom_point(aes(color=patient), size =2.5) +ggtitle('Secretin Data') +xlab('Time') +ylab('Glucose Level') +theme_minimal(base_size =14)
# Of note, with only 5 points per condition a boxplot is not an optimal plot# choice. Plotting only the points is simpler, and since the observations are# available for every patient, we can connect them with lines to see the# evolution with time.ggplot(data = data, aes(x = time, y = glucose)) +geom_point(aes(color=patient), size =2.5) +geom_line(aes(group=patient)) +ggtitle('Secretin Data') +xlab('Time') +ylab('Glucose Level') +theme_minimal(base_size =14)
7. Subsetting
Create a subset of the (replicate-averaged) data containing only time-point pre and 20 using the function subset().
Solution 7
dataSubset <-subset(data, time =="pre"| time =="20")## alternative solution using %in%dataSubset <-subset(data, time %in%c("pre","20"))
8. Linear model
On this reduced dataset, run a linear model to evaluate the change in glucose levels at 20 minutes relative to the baseline levels. What do you conclude?
Solution 8a
mod.time <-lm(glucose ~ time, dataSubset)summary(mod.time)
Call:
lm(formula = glucose ~ time, data = dataSubset)
Residuals:
Min 1Q Median 3Q Max
-8.800 -3.425 0.700 2.200 9.200
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 90.300 2.490 36.258 3.67e-10 ***
time20 -6.500 3.522 -1.846 0.102
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.569 on 8 degrees of freedom
Multiple R-squared: 0.2986, Adjusted R-squared: 0.2109
F-statistic: 3.406 on 1 and 8 DF, p-value: 0.1022
# It seems that the strong downregulation of glucose (-6.5 units in 20 minutes)# is not significant (p=0.1). But the R-squared value indicates a bad fit. The# boxplots above already indicated a large patient specific offset. So its# probably a good choice to account for it in the model.
Implement a model taking the sample pairing by patients into account. What do you conclude now?
Solution 8b
mod.time.patient <-lm(glucose ~ time + patient, dataSubset)summary(mod.time.patient)
# Additionally accounting for patients makes the strong downregulation of glucose# highly significant (p=0.0066) and removes a lot of unexplained variance (the# R-squared is now 95%). Although the patient-specific effects are usually not of# direct interest, its important to address them in the model!
Compare the first model with the patient offset model using anova
Solution 8c
anova(mod.time, mod.time.patient)
Analysis of Variance Table
Model 1: glucose ~ time
Model 2: glucose ~ time + patient
Res.Df RSS Df Sum of Sq F Pr(>F)
1 8 248.10
2 4 15.75 4 232.35 14.752 0.01158 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What do you conclude now?
Solution 8d
# The anova result tells us that the "time.patient" model explains the data# significantly better than expected. Here, a better explanation means a smaller# residual variation. Remember, the "time.patient" model is more complex (has 4# more parameters compared with the simple "time" model), so it has to explain the# data better. But this expectation is exceeded making it significantly better.
9. Writing to a file
Write the summary for the model coefficients of the previous model to a text file as well as your R session information (using SessionInfo())
For each individual, you have a series of glucose measurements taken at discrete time points (See Question 6, second plot). By assuming that the glucose concentration changes linearly between consecutive time points, you can model the data as a piecewise linear function. Your goal is to approximate the definite integral of this function for each patient separately — i.e., the area under the curve (AUC)—using the trapezoidal rule, which computes the sum of the areas of trapezoids formed between adjacent data points, using only base R. For extra challenge, try to do it without the help of LLMs.
Use the data set as is
Treat the lowest value of glucose level for each individual as a baseline (equal to zero)
Repeat the analysis above using tidyverse
Solution 10
# We can start with aggregated datadata
time patient glucose
1 pre A 92.5
2 20 A 86.0
3 30 A 89.0
4 60 A 87.5
5 90 A 88.5
6 pre B 85.0
7 20 B 75.0
8 30 B 79.0
9 60 B 77.5
10 90 B 76.0
11 pre C 90.5
12 20 C 85.0
13 30 C 89.5
14 60 C 90.0
15 90 C 92.5
16 pre D 88.0
17 20 D 80.0
18 30 D 85.0
19 60 D 82.5
20 90 D 83.5
21 pre E 95.5
22 20 E 93.0
23 30 E 94.5
24 60 E 92.0
25 90 E 90.0
# Create a trapezoid rule AUC functionpiecewise_auc =function(time, values) {sum(diff(time) * (head(values, -1) +tail(values, -1)) /2)}## Base R# Convert the time to integers (we need it for interval widths)# relabel pre as 0data$time =factor(data$time, labels =c(0,20,30,60,90))# Converting factors to it's numeric labels, we first need to convert to characterdata$time =as.integer(as.character(data$time))# Aggregate here would be difficult since we need to pass two arguments to # auc calculation We use `by` insteadauc_orig =data.frame(auc =by(data, data$patient, function(df) {piecewise_auc(df$time, df$glucose) }))# For baseline adjustment, we first split by patient, use anonymous function# to subtract the minimum and unsplit the data back to original formdata_baseline =lapply(split(data, data$patient), function(x) { x$glucose = x$glucose -min(x$glucose)return(x) # explicitly return x from anonymous function}) |>unsplit(data$patient)# Calculate the auc in a way similar to previousauc_baseline =data.frame(auc =by(data_baseline, data_baseline$patient, function(df) {piecewise_auc(df$time, df$glucose) }))auc_orig
auc
A 7947.5
B 7020.0
C 8057.5
D 7507.5
E 8350.0
auc_baseline
auc
A 207.5
B 270.0
C 407.5
D 307.5
E 250.0
## Tidyversedata |>mutate(time =factor(time, labels =c(0, 20, 30, 60, 90))) |>mutate(time =as.integer(as.character(time))) |>group_by(patient) |># following operations will be performed per groupmutate(baseline = glucose -min(glucose)) |>summarise(auc_orig =piecewise_auc(time, glucose),auc_baseline =piecewise_auc(time, baseline) )
# A tibble: 5 × 3
patient auc_orig auc_baseline
<fct> <dbl> <dbl>
1 A 7948. 208.
2 B 7020 270
3 C 8058. 408.
4 D 7508. 308.
5 E 8350 250
# You can see that for more complicated transformations, `tidyverse` usually # results in less verbose code that thanks to pipe operator and transformation# verbs can be easily read and understood.