Overview and revision of R basic

Author

Igor Cervenka

Published

February 19, 2025

Course website

Lecture program is available at https://ivanek.github.io/analysisOfGenomicsDataWithR

  • Includes R installation instructions
  • Includes handouts and link to supplementary data files
  • Updated after lecture to include answers to exercises
  • Bookmark it!

Lectures

Wednesday, 14:15-16:00, Biozentrum, Seminarraum U1.191

  • Teachers: bioinformaticians at the Department of Biomedicine of the Unibas, and at the FMI.
  • Each course will be a mix of lecture and exercises.
  • Bring your laptop
  • No lecture on 12.03.2025 due to Fasnachstferien
  • You can write us additional questions anytime on the Etherpad on ADAM.

Exam

  • 2 ECTS credits
  • Two online exam sessions: 09.04.2025 and 28.05.2025
  • 12 multiple choice questions per exam (with one correct answer per question)
  • To pass, you need score above 50% overall, i.e., at least 12 out of 24 questions correct
  • Requires computer with R and Bioconductor installed (i.e. some multiple-choice questions require a bit of coding)
  • All material allowed, communication strictly prohibited

We recommend free auditors to also take the exam, it’s a good training!

Learning goals

  • Understand the basic concepts involved in the analysis of most common types of genomics data (bulk and single-cell mRNA-Seq, ATAC-Seq, ChIP-Seq)
  • Be able to perform some simple analysis workflow using the Bioconductor framework, visualize and interpret the results.

Requirements

A solid knowledge of R basics

For example:

  • Understand the usage of basic data types such as factor or data.frame/tibble
  • Load, modify and summarize data fitting into a data.frame/tibble
  • Visualize the data, e.g., as a scatterplot or boxplot
  • Run and interpret a simple linear model
  • Understand error messages and get help from the documentation

Resources for catching-up (and for future reference)

If you find today’s exercises too challenging you will have to catch up fast. Here are a few resources:

Online resources

Books

  • The R chapters in: Buffalo V (2015) Bioinformatics data skills. O’Reilly. SwissBib link
  • Dalgaard P (2008) Introductory statistics with R. Springer Science & Business Media. SwissBib link

R Cheat sheets

Various cheat sheets at the RStudio website, for example:

Installation and setup

  • See the course webpage for the general options
  • R needs to be up-to-date (version >= 4.4.2)
  • Ask for help if you encounter troubles
  • Each lectures will list individual requirements

Best practices

Rstudio

  • RStudio is an Integrated Development Environment (IDE) for the R programming language.
  • It is one of the most productive option for using R
  • It enables interactive data analysis, reproducible research and result publishing
  • It can be locally installed (desktop version) or accessed via a web browser (RStudio server).
  • RStudio interface:

Working style

  • For each lecture, maintain a <lecture_name>.R or <lecture_name>.Rmd script on your working directory, storing all your commands

    • proper indentation of the code (CMD+I in Rstudio) is always a plus for readability
    • A code without comments is useless.
  • Always keep in mind reproducibility. Most of us use R/Bioconductor for their research: many months after the analysis was performed, some reviewer/colleague might ask for the code used to generate some specific result.

  • Use a clean working directory structure.

  • Recommended steps/commands at the beginning of every R script

    • Start by setting up the working directory (setwd()) with absolute path
    • List what is in workspace with ls() to check that it is clean
    • Load libraries with library()
  • When quitting R, do not save history and data. This might lead to a big mess later (and anyway all commands necessary to reproduce the analysis are in your script, right?)

  • When using R Studio, navigate to Tools -> Global Options -> General and unset “restore .RData into workspace on startup” and set “Save workspace .RData on exit” to Never.

  • Tip: use the auto completion (via TAB) as much as possible

Getting help

  • Really read the error messages! They might seem cryptic at first, but yet often contain useful information.
  • First carefully look at R help pages for functions (e.g., ?lm or pressing F1 when the cursor is positioned inside the function name) and to the vignettes of the packages
  • Search on Google, on forums (e.g., stack overflow) or mailing lists archives (e.g., the Bioconductor support site)
  • Use LLMs (responsibly) to both analyze possible causes of error messages and get suggestions to avoid them
  • Talk to colleagues to get a different perspective on the problem
  • If no success, ask on forums or mailing lists
    • Don’t forget to attach to your question the result of the function sessionInfo() (or the results of the function session_info() from the sessioninfo package, slightly easier to read). This documents what system you are using, the version of R and the different packages

      devtools::session_info()
      ─ Session info ──────────────────────────────────────────────────────────
       setting  value
       version  R version 4.4.2 (2024-10-31)
       os       macOS Sonoma 14.5
       system   aarch64, darwin23.6.0
       ui       unknown
       language (EN)
       collate  en_US.UTF-8
       ctype    en_US.UTF-8
       tz       Europe/Zurich
       date     2025-02-19
       pandoc   3.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
      
      ─ Packages ──────────────────────────────────────────────────────────────
       package     * version date (UTC) lib source
       cachem        1.1.0   2024-05-16 [1] CRAN (R 4.4.1)
       cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.4.2)
       cli           3.6.3   2024-06-21 [1] CRAN (R 4.4.1)
       colorspace    2.1-1   2024-07-26 [1] CRAN (R 4.4.2)
       devtools    * 2.4.5   2022-10-11 [1] CRAN (R 4.4.2)
       digest        0.6.37  2024-08-19 [1] CRAN (R 4.4.2)
       dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.4.1)
       ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.4.2)
       evaluate      1.0.3   2025-01-10 [1] CRAN (R 4.4.2)
       fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.4.1)
       forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.4.2)
       fs            1.6.5   2024-10-30 [1] CRAN (R 4.4.2)
       generics      0.1.3   2022-07-05 [1] CRAN (R 4.4.1)
       ggplot2     * 3.5.1   2024-04-23 [1] CRAN (R 4.4.2)
       glue          1.8.0   2024-09-30 [1] CRAN (R 4.4.2)
       gtable        0.3.6   2024-10-25 [1] CRAN (R 4.4.2)
       hms           1.1.3   2023-03-21 [1] CRAN (R 4.4.2)
       htmltools     0.5.8.1 2024-04-04 [1] CRAN (R 4.4.1)
       htmlwidgets   1.6.4   2023-12-06 [1] CRAN (R 4.4.2)
       httpuv        1.6.15  2024-03-26 [1] CRAN (R 4.4.2)
       jsonlite      1.8.9   2024-09-20 [1] CRAN (R 4.4.2)
       knitr       * 1.49    2024-11-08 [1] CRAN (R 4.4.2)
       later         1.4.1   2024-11-27 [1] CRAN (R 4.4.2)
       lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.1)
       lubridate   * 1.9.4   2024-12-08 [1] CRAN (R 4.4.2)
       magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.4.1)
       memoise       2.0.1   2021-11-26 [1] CRAN (R 4.4.1)
       mime          0.12    2021-09-28 [1] CRAN (R 4.4.1)
       miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.4.2)
       munsell       0.5.1   2024-04-01 [1] CRAN (R 4.4.2)
       pillar        1.10.1  2025-01-07 [1] CRAN (R 4.4.2)
       pkgbuild      1.4.5   2024-10-28 [1] CRAN (R 4.4.2)
       pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.4.1)
       pkgload       1.4.0   2024-06-28 [1] CRAN (R 4.4.1)
       profvis       0.4.0   2024-09-20 [1] CRAN (R 4.4.2)
       promises      1.3.2   2024-11-28 [1] CRAN (R 4.4.2)
       purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.4.1)
       R6            2.5.1   2021-08-19 [1] CRAN (R 4.4.1)
       Rcpp          1.0.14  2025-01-12 [1] CRAN (R 4.4.2)
       readr       * 2.1.5   2024-01-10 [1] CRAN (R 4.4.2)
       readxl      * 1.4.3   2023-07-06 [1] CRAN (R 4.4.2)
       remotes       2.5.0   2024-03-17 [1] CRAN (R 4.4.1)
       rlang         1.1.4   2024-06-04 [1] CRAN (R 4.4.1)
       rmarkdown     2.29    2024-11-04 [1] CRAN (R 4.4.2)
       rstudioapi    0.17.1  2024-10-22 [1] CRAN (R 4.4.2)
       scales        1.3.0   2023-11-28 [1] CRAN (R 4.4.2)
       sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.4.1)
       shiny         1.10.0  2024-12-14 [1] CRAN (R 4.4.2)
       stringi       1.8.4   2024-05-06 [1] CRAN (R 4.4.1)
       stringr     * 1.5.1   2023-11-14 [1] CRAN (R 4.4.1)
       tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.4.1)
       tidyr       * 1.3.1   2024-01-24 [1] CRAN (R 4.4.2)
       tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.4.1)
       tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.4.2)
       timechange    0.3.0   2024-01-18 [1] CRAN (R 4.4.2)
       tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.4.2)
       urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.4.2)
       usethis     * 3.1.0   2024-11-26 [1] CRAN (R 4.4.2)
       vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.1)
       withr         3.0.2   2024-10-28 [1] CRAN (R 4.4.2)
       writexl     * 1.5.1   2024-10-04 [1] CRAN (R 4.4.2)
       xfun          0.50    2025-01-07 [1] CRAN (R 4.4.2)
       xtable        1.8-4   2019-04-21 [1] CRAN (R 4.4.2)
       yaml          2.3.10  2024-07-26 [1] CRAN (R 4.4.1)
      
       [1] /Users/igor/R/lib
       [2] /opt/homebrew/lib/R/4.4/site-library
       [3] /opt/homebrew/Cellar/r/4.4.2_2/lib/R/library
      
      ─────────────────────────────────────────────────────────────────────────
      # save the outpout
      capture.output(devtools::session_info(pkgs='attached'),file="data/01/session_info_RevisionRBasics.txt") 
    • Do not forget to attach some code example allowing others to reproduce the issue, ideally a Minimal Reproducible Example

    • Be sure to work on an up-to-date system (R and Bioconductor)! Bugs are regularly fixed in Bioconductor packages

Important!

  • Asking questions is highly encouraged
  • If something is not working or understood, reach out for help as early as possible
  • Steep learning curve: practice, practice and practice again!
    • The best option is to get access to some genomics dataset to analyze (from your lab, from a paper your like, etc) and to train and apply what you learn during the lectures on it

Short revision of basic R

In the introduction to R basics, we mostly used functions from the tidyverse, which provides a modern, widely used collection of R packages designed for data science. However, for some aspects, we also illustrated how to perform the same operations using “base R” commands, that is, functions that are loaded with R by default rather than provided in a separate add-on package.

Import file

Read data from a text file into R represented as a tibble or data.frame.

Load the packages required for analysis or install them if not present.

suppressPackageStartupMessages({
  #install.packages(tidyverse)
  library(tidyverse)
  #install.packages(readxl)
  library(readxl)
  #install.packages(writexl)
  library(writexl)
  #install.packages(readr)
  library(readr)
})

We will be using data set called palmerpenguins.

# Install package if not available
# install.packages("palmerpenguins")

# Load the library
library(palmerpenguins)

# If you have the data saved in a file, you would read it from disk
# into tibble
# dat = readr::read_csv("data/01/penguins.csv")
# or into data frame
# df = read.table("data/01/penguins.csv", sep = ",", header = TRUE)

# Rename to match the rest of the analysis
dat = tibble(penguins)
str(dat)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
dim(dat)
[1] 344   8
dat
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <fct>   <fct>              <dbl>         <dbl>             <int>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>
# data frame
df = as.data.frame(dat) 
head(df)
  species    island bill_length_mm bill_depth_mm flipper_length_mm
1  Adelie Torgersen           39.1          18.7               181
2  Adelie Torgersen           39.5          17.4               186
3  Adelie Torgersen           40.3          18.0               195
4  Adelie Torgersen             NA            NA                NA
5  Adelie Torgersen           36.7          19.3               193
6  Adelie Torgersen           39.3          20.6               190
  body_mass_g    sex year
1        3750   male 2007
2        3800 female 2007
3        3250 female 2007
4          NA   <NA> 2007
5        3450 female 2007
6        3650   male 2007
str(df)
'data.frame':   344 obs. of  8 variables:
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
dim(df)
[1] 344   8
  • The penguins dataset can be stored in a tibble object (dat) or in data.frame (df). There are three main differences in the usage of a data frame vs a tibble: printing, subsetting and availability of row names.
  • the dataset contains 344 rows (observations) and 8 columns (variables): 344 penguin observations, with 8 information per observations (variable)
  • Each column of a data.frame is usually of an atomic (=simple) data type. Atomic data types have values that cannot be divided or broken down further.
  • Common atomic data types include:
    • logical (TRUE, FALSE and NA i.e., not available)
    • integer (3L). Use the L when creating and tells R to store this as an integer.
    • numeric (3.14, Inf, NaN)
    • character ("three")
    • factor for categorical variables (e.g., blood types "A","B","AB","0"). Variables that can take on one of a limited, and usually fixed, number of possible values.

Common data types derived from (are the combination of) the atomic ones include:

  • data.frame/tibble: all columns are of the same length, but each can be a different data type

  • matrix: all columns/rows are the same length and all elements are of the same atomic data type

  • list: no restriction on length/data type per element. Can have character names

  • You can query for the data type (also called class) of a variable using the function class()

class(df) # "data.frame"
[1] "data.frame"
class(dat) # [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
[1] "tbl_df"     "tbl"        "data.frame"
class(df$body_mass_g) # atomic data type. Cannot be broken down further. 
[1] "integer"
  • Change data type:
str(df)
'data.frame':   344 obs. of  8 variables:
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
head(dat$sex)
[1] male   female female <NA>   female male  
Levels: female male
# Sometimes the factor might not be the best column type, we change sex to 
# character for illustration purposes
dat$sex = as.character(dat$sex)
df$sex = as.character(df$sex)

head(dat$sex)
[1] "male"   "female" "female" NA       "female" "male"  
# If you check the levels of the variable: 
levels(df$island) # [1] "Biscoe"    "Dream"     "Torgersen"
[1] "Biscoe"    "Dream"     "Torgersen"
# However, imagine Torgersen is the main island
# you can choose which level to put as reference level
df$island <- relevel(df$island, ref = "Torgersen")
levels(df$island) # [1] "Torgersen" "Biscoe"    "Dream"  
[1] "Torgersen" "Biscoe"    "Dream"    
  • Remember that behind the scenes, a factor is encoded with numerical values.
  • Whether a string variable should be considered as categorical or not depends on the question asked.

Subsetting

A common operation is to subset data sets, that is, to only retain some of the records. The specification of what to retain can be done in many different ways. There are many ways of achieving it:

1. by numeric index:

  • slice: To retain a pre-specified set of rows, we can use the slice function from dplyr.
  • Select: the select function, also from dplyr, allows us to retain a given set of columns
# first row
slice(dat, 1)
# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
1 Adelie  Torge…           39.1          18.7               181        3750
# ℹ 2 more variables: sex <chr>, year <int>
#slice(df, 1)

# first, third and fifth row
slice(dat, c(1, 3, 5))
# A tibble: 3 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
1 Adelie  Torge…           39.1          18.7               181        3750
2 Adelie  Torge…           40.3          18                 195        3250
3 Adelie  Torge…           36.7          19.3               193        3450
# ℹ 2 more variables: sex <chr>, year <int>
#slice(df, c(1, 3, 5))

# Check if dat and df give the same result:
# first to third column using tibble
on_dat = select(dat,1:3)
# first to third column using data.frame
on_df = select(df,1:3)
table(on_dat==on_df)

TRUE 
1030 
Side note

A more intuitive way, which reduces the number of intermediate variables that are explicitly created and scales to more complex operations, is to use the pipe operator (|>). Pipe is an infix operator that enables function composition g(f(x)) = x |> f() |> g().

# Select column 1:3, and row 1:5 of the dat tibble or df data.frame
dat |> 
  slice(1:5) |>
    select(1:3)
# A tibble: 5 × 3
  species island    bill_length_mm
  <fct>   <fct>              <dbl>
1 Adelie  Torgersen           39.1
2 Adelie  Torgersen           39.5
3 Adelie  Torgersen           40.3
4 Adelie  Torgersen           NA  
5 Adelie  Torgersen           36.7
  • square brackets []: Subsetting can also be done using base R commands. This type of subsetting is used a lot in S4 classes, such as SinglCellExperiment that underpin most of genomics analysis in Bioconductor.
# first five rows, first three columns
dat[1:5, 1:3]
# A tibble: 5 × 3
  species island    bill_length_mm
  <fct>   <fct>              <dbl>
1 Adelie  Torgersen           39.1
2 Adelie  Torgersen           39.5
3 Adelie  Torgersen           40.3
4 Adelie  Torgersen           NA  
5 Adelie  Torgersen           36.7

2. by row- and/or column names with:

Can be achieved with select or square brackets. An important shortcut to extract a single column is provided by the $ sign

# Use the column names
colnames(dat)
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"             
dim(dat) # 344   8
[1] 344   8
select(dat,species)
# A tibble: 344 × 1
   species
   <fct>  
 1 Adelie 
 2 Adelie 
 3 Adelie 
 4 Adelie 
 5 Adelie 
 6 Adelie 
 7 Adelie 
 8 Adelie 
 9 Adelie 
10 Adelie 
# ℹ 334 more rows
# using `$` and square brackets
head(dat$species) 
[1] Adelie Adelie Adelie Adelie Adelie Adelie
Levels: Adelie Chinstrap Gentoo
length(dat$species)
[1] 344
# Use row number and colnames
dat[1:5, c("species", "island", "bill_length_mm")]
# A tibble: 5 × 3
  species island    bill_length_mm
  <fct>   <fct>              <dbl>
1 Adelie  Torgersen           39.1
2 Adelie  Torgersen           39.5
3 Adelie  Torgersen           40.3
4 Adelie  Torgersen           NA  
5 Adelie  Torgersen           36.7
Side note

There is an important difference between a tibble and a data.frame when it comes to subsetting using the bracket operator. By default, extracting a single column from a data.frame returns a vector (not a data.frame), while performing the same operation on a tibble returns another tibble. The key to getting a data.frame when extracting a column from a data.frame is to add the argument drop = FALSE. The $ operator always returns a vector. The tidyverse counterpart of $ operator that can be used in pipes is pull function from dplyr package.

# extract the first five elements in the 'species' column
str(df[,'species']) # vector
 Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
length(df[,'species']) 
[1] 344
str(dat[,'species']) # tibble
tibble [344 × 1] (S3: tbl_df/tbl/data.frame)
 $ species: Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
dim(dat[,'species'])
[1] 344   1
str(pull(dat, species))
 Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
length(pull(dat, species))
[1] 344

3. by logical index

  • filter: dplyr function to subset based on observed values
dat |>
  filter(year == 2007 & island == "Torgersen")
# A tibble: 20 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <fct>   <fct>              <dbl>         <dbl>             <int>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
11 Adelie  Torgersen           37.8          17.1               186
12 Adelie  Torgersen           37.8          17.3               180
13 Adelie  Torgersen           41.1          17.6               182
14 Adelie  Torgersen           38.6          21.2               191
15 Adelie  Torgersen           34.6          21.1               198
16 Adelie  Torgersen           36.6          17.8               185
17 Adelie  Torgersen           38.7          19                 195
18 Adelie  Torgersen           42.5          20.7               197
19 Adelie  Torgersen           34.4          18.4               184
20 Adelie  Torgersen           46            21.5               194
# ℹ 3 more variables: body_mass_g <int>, sex <chr>, year <int>
# can also be done in multiple steps
dat |>
  filter(year == 2007) |>
  filter(island == "Torgersen")
# A tibble: 20 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <fct>   <fct>              <dbl>         <dbl>             <int>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
11 Adelie  Torgersen           37.8          17.1               186
12 Adelie  Torgersen           37.8          17.3               180
13 Adelie  Torgersen           41.1          17.6               182
14 Adelie  Torgersen           38.6          21.2               191
15 Adelie  Torgersen           34.6          21.1               198
16 Adelie  Torgersen           36.6          17.8               185
17 Adelie  Torgersen           38.7          19                 195
18 Adelie  Torgersen           42.5          20.7               197
19 Adelie  Torgersen           34.4          18.4               184
20 Adelie  Torgersen           46            21.5               194
# ℹ 3 more variables: body_mass_g <int>, sex <chr>, year <int>
  • which: using square brackets and base R function which
# logical vector: 
head(dat$island=='Torgersen' & dat$year == 2007)
[1] TRUE TRUE TRUE TRUE TRUE TRUE
# Return only the rows that evaluate to TRUE
dat[dat$island=='Torgersen' & dat$year == 2007,]
# A tibble: 20 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <fct>   <fct>              <dbl>         <dbl>             <int>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
11 Adelie  Torgersen           37.8          17.1               186
12 Adelie  Torgersen           37.8          17.3               180
13 Adelie  Torgersen           41.1          17.6               182
14 Adelie  Torgersen           38.6          21.2               191
15 Adelie  Torgersen           34.6          21.1               198
16 Adelie  Torgersen           36.6          17.8               185
17 Adelie  Torgersen           38.7          19                 195
18 Adelie  Torgersen           42.5          20.7               197
19 Adelie  Torgersen           34.4          18.4               184
20 Adelie  Torgersen           46            21.5               194
# ℹ 3 more variables: body_mass_g <int>, sex <chr>, year <int>
# Convert TRUE to corresponding row numbers with `which` function
which(dat$island=='Torgersen' & dat$year == 2007)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# Now we are back to subsetting by row numbers, this is useful e.g. for
# reordering, here we return the rows of interest in reverse order
dat[rev(which(dat$island=='Torgersen' & dat$year == 2007)),]
# A tibble: 20 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <fct>   <fct>              <dbl>         <dbl>             <int>
 1 Adelie  Torgersen           46            21.5               194
 2 Adelie  Torgersen           34.4          18.4               184
 3 Adelie  Torgersen           42.5          20.7               197
 4 Adelie  Torgersen           38.7          19                 195
 5 Adelie  Torgersen           36.6          17.8               185
 6 Adelie  Torgersen           34.6          21.1               198
 7 Adelie  Torgersen           38.6          21.2               191
 8 Adelie  Torgersen           41.1          17.6               182
 9 Adelie  Torgersen           37.8          17.3               180
10 Adelie  Torgersen           37.8          17.1               186
11 Adelie  Torgersen           42            20.2               190
12 Adelie  Torgersen           34.1          18.1               193
13 Adelie  Torgersen           39.2          19.6               195
14 Adelie  Torgersen           38.9          17.8               181
15 Adelie  Torgersen           39.3          20.6               190
16 Adelie  Torgersen           36.7          19.3               193
17 Adelie  Torgersen           NA            NA                  NA
18 Adelie  Torgersen           40.3          18                 195
19 Adelie  Torgersen           39.5          17.4               186
20 Adelie  Torgersen           39.1          18.7               181
# ℹ 3 more variables: body_mass_g <int>, sex <chr>, year <int>
  • subset: the subset() function allows to combine several logical filters.
subset(dat, island == "Torgersen" & year == 2007)
# A tibble: 20 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <fct>   <fct>              <dbl>         <dbl>             <int>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
11 Adelie  Torgersen           37.8          17.1               186
12 Adelie  Torgersen           37.8          17.3               180
13 Adelie  Torgersen           41.1          17.6               182
14 Adelie  Torgersen           38.6          21.2               191
15 Adelie  Torgersen           34.6          21.1               198
16 Adelie  Torgersen           36.6          17.8               185
17 Adelie  Torgersen           38.7          19                 195
18 Adelie  Torgersen           42.5          20.7               197
19 Adelie  Torgersen           34.4          18.4               184
20 Adelie  Torgersen           46            21.5               194
# ℹ 3 more variables: body_mass_g <int>, sex <chr>, year <int>
  • %in%: %in% This operator is used to identify if a series of elements of a vector belong to another vector. This is also very useful for subsetting data frames
c("Torgersen","Biscoe","Dream",'dream') %in% levels(dat$island)
[1]  TRUE  TRUE  TRUE FALSE
levels(df$island) %in% c("Torgersen","Biscoe","Dream",'dream') 
[1] TRUE TRUE TRUE

Adding a new column to a data set, or replacing an existing column

  • mutate: function from dplyr
# Add another column to `dat` that contains the sum of the `bill_length_mm` and `bill_depth_mm` columns.
dat |>
  mutate(lpd = bill_length_mm + bill_depth_mm)
# A tibble: 344 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <fct>   <fct>              <dbl>         <dbl>             <int>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
# ℹ 334 more rows
# ℹ 4 more variables: body_mass_g <int>, sex <chr>, year <int>, lpd <dbl>
#Add column in `dat` that contains the bill length in cm. Display only the species column and the newly created column
dat |>
  mutate(bill_length_cm = bill_length_mm / 10) |>
  select(species, bill_length_cm)
# A tibble: 344 × 2
   species bill_length_cm
   <fct>            <dbl>
 1 Adelie            3.91
 2 Adelie            3.95
 3 Adelie            4.03
 4 Adelie           NA   
 5 Adelie            3.67
 6 Adelie            3.93
 7 Adelie            3.89
 8 Adelie            3.92
 9 Adelie            3.41
10 Adelie            4.2 
# ℹ 334 more rows

Mutate is a very powerful function able to solve many of problems that you might encounter when trying to transform your data. It accepts many functions and tidyselect semantics on the righthand side.

# Example of more complicated mutates, only for illustration purposes
# This example doesn't use a native R pipe |>, but a bit more flexible 
# "tidyverse" pipe %>%
dat %>%
  mutate(s = apply(., 1, \(x) paste(x, collapse = "_"))) %>%
  mutate(ex = rowSums(across(starts_with("bill")))) %>%
  select(s, ex)
# A tibble: 344 × 2
   s                                                  ex
   <chr>                                           <dbl>
 1 Adelie_Torgersen_39.1_18.7_181_3750_male_2007    57.8
 2 Adelie_Torgersen_39.5_17.4_186_3800_female_2007  56.9
 3 Adelie_Torgersen_40.3_18.0_195_3250_female_2007  58.3
 4 Adelie_Torgersen_NA_NA_NA_NA_NA_2007             NA  
 5 Adelie_Torgersen_36.7_19.3_193_3450_female_2007  56  
 6 Adelie_Torgersen_39.3_20.6_190_3650_male_2007    59.9
 7 Adelie_Torgersen_38.9_17.8_181_3625_female_2007  56.7
 8 Adelie_Torgersen_39.2_19.6_195_4675_male_2007    58.8
 9 Adelie_Torgersen_34.1_18.1_193_3475_NA_2007      52.2
10 Adelie_Torgersen_42.0_20.2_190_4250_NA_2007      62.2
# ℹ 334 more rows
  • $: $ can also be used to add a new column to a data.frame
dat$bill_length_cm =  dat$bill_length_mm / 10

Finding all unique combinations

  • unique function from base R can be used to accomplish the same with a bit of extra typing compared to tidyverse.
unique(dat[, c("species", "island")])
# A tibble: 5 × 2
  species   island   
  <fct>     <fct>    
1 Adelie    Torgersen
2 Adelie    Biscoe   
3 Adelie    Dream    
4 Gentoo    Biscoe   
5 Chinstrap Dream    
  • distinct(): function from dplyr which returns all distinct rows in a data.frame or tibble. A useful argument to distinct() is .keep_all, that evaluates the uniqueness of specified column combinations, and keeps the first observation of the rest of the columns.
dat |>
  arrange(body_mass_g) |>
  distinct(species, island, .keep_all = TRUE)
# A tibble: 5 × 9
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
1 Chinst… Dream            46.9          16.6               192        2700
2 Adelie  Biscoe           36.5          16.6               181        2850
3 Adelie  Dream            33.1          16.1               178        2900
4 Adelie  Torge…           38.6          17                 188        2900
5 Gentoo  Biscoe           42.7          13.7               208        3950
# ℹ 3 more variables: sex <chr>, year <int>, bill_length_cm <dbl>

Computing summary statistics

Function summary from base R offers a quick overview of the basic summary statistics for each column.

summary(dat)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex                 year     
 Min.   :172.0     Min.   :2700   Length:344         Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   Class :character   1st Qu.:2007  
 Median :197.0     Median :4050   Mode  :character   Median :2008  
 Mean   :200.9     Mean   :4202                      Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                      3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                      Max.   :2009  
 NA's   :2         NA's   :2                                       
 bill_length_cm 
 Min.   :3.210  
 1st Qu.:3.922  
 Median :4.445  
 Mean   :4.392  
 3rd Qu.:4.850  
 Max.   :5.960  
 NA's   :2      

However, we might need to compute additional ones such as SD, SEM etc. When performing the calculations ourselves, we need to handle NAs gracefully. We already got some indication of presence of NAs from the summary function.

mean(dat$bill_length_cm) # return NA
[1] NA
# Deal with missing data
mean(dat$bill_length_cm, na.rm = TRUE)
[1] 4.392193
sd(dat$bill_length_cm, na.rm = TRUE)
[1] 0.5459584

For our example it seems more useful to split the flipper length according to species and/or sex and compute summary statistics on these data subsets.

The data set can be split in six groups: 1. Adelie - male 2. Adelie - female 3. Chinstrap - male 4. Chinstrap - female 5. Gentoo - male 6. Gentoo - female

1. Split into lists and compute mean using apply functions (base R)

# Split the data in four groups: 
dataSplit <- split(dat$flipper_length_mm, list(dat$species, df$sex))
dataSplit # the result is a list with for each group, the flipper_length_mm of the corresponding species and sex
$Adelie.female
 [1] 186 195 193 181 182 185 195 184 174 189 187 187 172 178 188 195 180
[18] 181 182 186 185 190 186 190 187 186 181 185 185 184 195 190 190 196
[35] 190 191 187 189 187 191 189 190 202 185 187 190 178 192 183 193 199
[52] 181 198 193 191 188 189 187 176 186 191 191 190 193 187 191 185 193
[69] 188 192 184 195 187

$Chinstrap.female
 [1] 192 188 198 178 195 193 185 190 181 190 181 187 195 200 191 187 187
[18] 199 195 192 187 196 196 190 198 199 193 187 191 194 189 195 202 198

$Gentoo.female
 [1] 211 210 210 211 209 214 214 210 210 209 215 213 215 210 209 207 220
[18] 213 208 208 210 217 210 213 210 210 217 208 208 208 214 219 220 216
[35] 217 216 209 215 212 212 212 218 212 218 212 214 222 203 219 215 210
[52] 208 216 213 217 214 215 212

$Adelie.male
 [1] 181 190 195 191 198 197 194 180 185 180 183 180 178 184 196 190 184
[18] 195 196 190 182 191 188 200 191 193 194 195 192 192 188 198 190 197
[35] 195 184 195 196 193 194 190 189 205 186 208 196 192 203 190 184 190
[52] 197 191 197 196 199 189 198 202 199 195 210 197 199 190 200 193 187
[69] 190 185 190 193 201

$Chinstrap.male
 [1] 196 193 197 197 198 194 201 201 197 195 191 193 197 200 205 201 203
[18] 195 210 205 210 196 201 212 187 201 203 197 203 202 206 207 193 210

$Gentoo.male
 [1] 230 218 215 219 215 216 213 217 221 222 218 215 215 215 220 222 230
[18] 220 219 208 225 216 222 225 215 220 225 220 220 224 221 231 230 229
[35] 223 221 221 230 220 223 221 224 228 218 230 228 224 226 216 225 228
[52] 228 215 219 209 229 230 230 222 222 213
# Apply on each group, the mean calculation: 
groupMeans <- sapply(dataSplit, mean) # sapply is specific to apply a function on a list 
groupMeans # Output is the mean of each group, saved in a list
   Adelie.female Chinstrap.female    Gentoo.female      Adelie.male 
        187.7945         191.7353         212.7069         192.4110 
  Chinstrap.male      Gentoo.male 
        199.9118         221.5410 

2. Use aggregate (base R)

The same is achieved with the function aggregate() (with or without a formula interface), which returns a data frame.

aggregate(dat$flipper_length_mm, list(dat$species, dat$sex), FUN=mean)
    Group.1 Group.2        x
1    Adelie  female 187.7945
2 Chinstrap  female 191.7353
3    Gentoo  female 212.7069
4    Adelie    male 192.4110
5 Chinstrap    male 199.9118
6    Gentoo    male 221.5410
aggregate(flipper_length_mm ~ species + sex, data=dat, FUN=mean)
    species    sex flipper_length_mm
1    Adelie female          187.7945
2 Chinstrap female          191.7353
3    Gentoo female          212.7069
4    Adelie   male          192.4110
5 Chinstrap   male          199.9118
6    Gentoo   male          221.5410

To compute the mean and standard deviation at the same time, we can define a function that returns both:

meanAndStd <- function(x) {
  c(mean=mean(x, na.rm = TRUE), std=sd(x, na.rm = TRUE))
}

This function can then be used in combination with aggregate, or sapply.

sapply(dataSplit, meanAndStd)
     Adelie.female Chinstrap.female Gentoo.female Adelie.male
mean    187.794521       191.735294    212.706897  192.410959
std       5.595035         5.754096      3.897856    6.599317
     Chinstrap.male Gentoo.male
mean     199.911765  221.540984
std        5.976558    5.673252
aggregate(flipper_length_mm ~ species + sex, data=dat, FUN=meanAndStd)
    species    sex flipper_length_mm.mean flipper_length_mm.std
1    Adelie female             187.794521              5.595035
2 Chinstrap female             191.735294              5.754096
3    Gentoo female             212.706897              3.897856
4    Adelie   male             192.410959              6.599317
5 Chinstrap   male             199.911765              5.976558
6    Gentoo   male             221.540984              5.673252

3. Using split-apply-combine strategy (tidyverse)

# We can use filter(sex %in% c("male", "female")) in the pipe to filter out NA
# sex
dat |>
  summarise(
    fmean = mean(flipper_length_mm, na.rm = TRUE),
    fsd = sd(flipper_length_mm, na.rm = TRUE),
    .by = c(species, sex)
  )
# A tibble: 8 × 4
  species   sex    fmean   fsd
  <fct>     <chr>  <dbl> <dbl>
1 Adelie    male    192.  6.60
2 Adelie    female  188.  5.60
3 Adelie    <NA>    186.  6.11
4 Gentoo    female  213.  3.90
5 Gentoo    male    222.  5.67
6 Gentoo    <NA>    216.  1.26
7 Chinstrap female  192.  5.75
8 Chinstrap male    200.  5.98

Visual summaries

There are good reasons to visualize the data instead of deriving conclusions from summary statistics alone:

  • It provides more complete quality check on a dataset. For example, extreme values can render summary statistics meaningless
  • Meaningful trends are often easier to grasp visually.

With base graphics

A boxplot will give a visual summary of the data spread per grouping variable.

  • It is best applied to continuous data
  • The formula interface allows to split the data into subsets before plotting it
par(mar=c(10,5,1,1))
boxplot(flipper_length_mm ~ species + sex, data=dat,las=2,xlab='')

  • Boxplots convey summary statistics (median, data spread and outliers) for large datasets
  • For small datasets (<= 5 points per group) it is more informative to plot the data points
  • Secondary plotting functions such as lines() and points() can be called after the main figure has been drawn to add elements
boxplot(flipper_length_mm ~ species + sex, data=dat, outline=F,las=2,xlab='')
for (n in 1:length(dataSplit)) { # for each group/box
  x <- jitter(rep(n, length(dataSplit[[n]])),amount=0.1) # amount to shift all the points if same value
  points(x, dataSplit[[n]], pch=1.2, cex=0.5,col='blue',lwd=0.8)
}

A scatterplot will give a visual summary of relationship of two numeric (usually continuous) variables.

  • We can add additional mappings such as points color, size or symbol
dat$sex <- as.factor(dat$sex)
plot(
  y = dat$flipper_length_mm,
  x = dat$body_mass_g,
  col = as.numeric(dat$species),
  pch = as.numeric(dat$sex),
  xlab = "body mass (g)",
  ylab = "flipper length (mm)"
)
legend("top", legend = levels(dat$sex), pch = 1:2)
legend("topleft", legend = levels(dat$species), pch = 1, col = 1:3)

This scatterplot shows that:

  • there seems to be a dependence of flipper length on body mass
  • the Gentoo species is heavier than the other two and their males are heavier than females
  • for the same body mass, the females might have longer flippers

With ggplot2

An elegant alternative to the base graphic system is the ggplot2 package system, which is an implementation of “The Grammar of Graphics” (Wilkinson 2005). Here each figure is composed of the same few building blocks, such as data, aesthetics and scales.

suppressPackageStartupMessages(library(ggplot2))

dat |> # what data to plot
  drop_na()  |> # we remove NA for consistency with previous plots
  ggplot() + # ggplot call
    aes(x = species:sex, y = flipper_length_mm) + # aesthetics = which variabless
    geom_boxplot(outlier.size = -1, fill = "lightblue") + # add boxplot
    geom_jitter(shape = 21, size = 1, width = 0.15, color = "grey20") + # add points
    theme_minimal(base_size = 14) + # change font size 
    labs(x = "species - sex", y = "flipper length (mm)", title = "Penguins")

Fitting a linear model

Linear models are used to test hypotheses on the relation between a numerical variable and a set of independent predictors. These can be:

  • numerical as in linear regression
  • categorical as for t-test or ANOVA
  • A combination of both

We can use the function lm() to test hypotheses about the dependence of flipper length(mm) on species and sex in our penguin dataset.

linmod <- lm(flipper_length_mm ~ species * sex, data=dat)
summary(linmod)

Call:
lm(formula = flipper_length_mm ~ species * sex, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.7945  -3.4110   0.0882   3.4590  17.5890 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)              187.7945     0.6619 283.721  < 2e-16 ***
speciesChinstrap           3.9408     1.1742   3.356 0.000884 ***
speciesGentoo             24.9124     0.9947  25.044  < 2e-16 ***
sexmale                    4.6164     0.9361   4.932  1.3e-06 ***
speciesChinstrap:sexmale   3.5600     1.6606   2.144 0.032782 *  
speciesGentoo:sexmale      4.2176     1.3971   3.019 0.002737 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.655 on 327 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8396,    Adjusted R-squared:  0.8372 
F-statistic: 342.4 on 5 and 327 DF,  p-value: < 2.2e-16
levels(dat$sex) # "female" "male"   
[1] "female" "male"  
levels(dat$species) #  "Adelie"    "Chinstrap" "Gentoo"  
[1] "Adelie"    "Chinstrap" "Gentoo"   
  • To correctly interpret the coefficients of a linear model remember that by default they reflect differences to a baseline
  • In our model, the baseline (here Intercept) is defined by the set of reference levels Adelie (for species) and female (for sex), respectively.

Lets check how the model coefficients relate to the group means calculated above:

# Intercept. The test statistics evaluate the difference of Adelie.female
# (baseline) penguin flipper length (average) from 0. As expected, this is
# highly significant but not at all relevant.
groupMeans['Adelie.female']
Adelie.female 
     187.7945 
## speciesChinstrap: Difference between Chinstrap amd Adelie in female group
groupMeans['Chinstrap.female'] - groupMeans['Adelie.female'] # deference to baseline 
Chinstrap.female 
        3.940774 
## sexmale: Difference between female and male for species Adelie
groupMeans['Adelie.male'] - groupMeans['Adelie.female'] # deference to baseline
Adelie.male 
   4.616438 
## speciesChinstrap:sexmale: A difference of differences (called 'interaction')
(groupMeans['Chinstrap.male'] - groupMeans['Chinstrap.female']) - (groupMeans['Adelie.male'] - groupMeans['Adelie.female'] ) 
Chinstrap.male 
      3.560032 

Exercises

In the exercise we will use data from a glucose response experiment (secretin data). Five patients with arterial hypertension were dosed with the hormone secretin. Among other roles, secretin is stimulating insulin release from the pancreas after oral glucose intake. Blood glucose levels were recorded in duplicates before administration and in intervals of 20, 30, 60 and 90 minutes after administration of the hormone. The dataset is a subset of the secretin data from the ISwR R package.

1. Read-in

Read-in the example dataset using an appropriate import function. Consult the help pages if you are uncertain about the function usage.

# Import the dataset
data = readr::read_csv("data/01/secretin.csv")
Rows: 50 Columns: 4
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (3): patient, time, replicate
dbl (1): glucose

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Data structure

How is the data structured? Which variables are present and what are their data types?

data
# A tibble: 50 × 4
   glucose patient time  replicate
     <dbl> <chr>   <chr> <chr>    
 1      92 A       pre   a        
 2      93 A       pre   b        
 3      84 A       20    a        
 4      88 A       20    b        
 5      88 A       30    a        
 6      90 A       30    b        
 7      86 A       60    a        
 8      89 A       60    b        
 9      87 A       90    a        
10      90 A       90    b        
# ℹ 40 more rows
str(data)
spc_tbl_ [50 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ glucose  : num [1:50] 92 93 84 88 88 90 86 89 87 90 ...
 $ patient  : chr [1:50] "A" "A" "A" "A" ...
 $ time     : chr [1:50] "pre" "pre" "20" "20" ...
 $ replicate: chr [1:50] "a" "b" "a" "b" ...
 - attr(*, "spec")=
  .. cols(
  ..   glucose = col_double(),
  ..   patient = col_character(),
  ..   time = col_character(),
  ..   replicate = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

3. Data “clean-up”

If needed, change the data type of the column of the data frame so that they are represented with a relevant data type. When needed, relevel factors with the function relevel() to have a meaningful reference level.

## Some columns are better treated as factor
data$patient <- factor(data$patient)
data$time <- factor(data$time)
data$replicate <- factor(data$replicate)

## Relevel time to have "pre" as reference
data$time <- relevel(data$time, ref="pre")

4. Exploring

How many observations per time and/or patient are present? Use the functions split(), sapply() and length() to answer the question

sp <- split(data$glucose, list(data$time))
sapply(sp, length)
pre  20  30  60  90 
 10  10  10  10  10 
sp <- split(data$glucose, list(data$patient))
sapply(sp, length)
 A  B  C  D  E 
10 10 10 10 10 
sp <- split(data$glucose, list(data$time, data$patient))
sapply(sp, length)
pre.A  20.A  30.A  60.A  90.A pre.B  20.B  30.B  60.B  90.B pre.C  20.C 
    2     2     2     2     2     2     2     2     2     2     2     2 
 30.C  60.C  90.C pre.D  20.D  30.D  60.D  90.D pre.E  20.E  30.E  60.E 
    2     2     2     2     2     2     2     2     2     2     2     2 
 90.E 
    2 

5. Summarizing the data

Be careful about differentiating biological and technical replication. Here the patients represent the biological replicates, and the 2 measurements per patient and time point are purely technical replicates. It is rarely the case that technical variation is larger than biological variation, and if so, it often can be ignored. We will average the technical replicates for the next steps of the analysis.

Use the aggregate() function to average the glucose levels for each time/patient combination across replicates

data <- aggregate(glucose ~ time+patient, data, FUN=mean) 
head(data)
  time patient glucose
1  pre       A    92.5
2   20       A    86.0
3   30       A    89.0
4   60       A    87.5
5   90       A    88.5
6  pre       B    85.0

6. Plotting

Construct a boxplot of replicate-averaged glucose levels versus time points:

  • Overlay it with the single data points
  • Highlight the different patients with different colors
ggplot(data = data, aes(x = time, y = glucose)) + 
  geom_boxplot() + 
  geom_point(aes(color=patient), size = 2.5) +
  ggtitle('Secretin Data') + 
  xlab('Time') + 
  ylab('Glucose Level') + 
  theme_minimal(base_size = 14)

# Of note, with only 5 points per condition a boxplot is not an optimal plot
# choice. Plotting only the points is simpler, and since the observations are
# available for every patient, we can connect them with lines to see the
# evolution with time.

ggplot(data = data, aes(x = time, y = glucose)) + 
  geom_point(aes(color=patient), size = 2.5) +
  geom_line(aes(group=patient)) +
  ggtitle('Secretin Data') + 
  xlab('Time') + 
  ylab('Glucose Level') + 
  theme_minimal(base_size = 14)

7. Subsetting

Create a subset of the (replicate-averaged) data containing only time-point pre and 20 using the function subset().

dataSubset <- subset(data, time == "pre" | time == "20")
## alternative solution using %in%
dataSubset <- subset(data, time %in% c("pre","20"))

8. Linear model

  • On this reduced dataset, run a linear model to evaluate the change in glucose levels at 20 minutes relative to the baseline levels. What do you conclude?
mod.time <- lm(glucose ~ time, dataSubset)
summary(mod.time)

Call:
lm(formula = glucose ~ time, data = dataSubset)

Residuals:
   Min     1Q Median     3Q    Max 
-8.800 -3.425  0.700  2.200  9.200 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   90.300      2.490  36.258 3.67e-10 ***
time20        -6.500      3.522  -1.846    0.102    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.569 on 8 degrees of freedom
Multiple R-squared:  0.2986,    Adjusted R-squared:  0.2109 
F-statistic: 3.406 on 1 and 8 DF,  p-value: 0.1022
# It seems that the strong downregulation of glucose (-6.5 units in 20 minutes)
# is not significant (p=0.1). But the R-squared value indicates a bad fit. The
# boxplots above already indicated a large patient specific offset. So its
# probably a good choice to account for it in the model.
  • Implement a model taking the sample pairing by patients into account. What do you conclude now?
mod.time.patient <- lm(glucose ~ time + patient, dataSubset)
summary(mod.time.patient)

Call:
lm(formula = glucose ~ time + patient, data = dataSubset)

Residuals:
         1          2          6          7         11         12 
 5.995e-15 -6.647e-15  1.750e+00 -1.750e+00 -5.000e-01  5.000e-01 
        16         17         21         22 
 7.500e-01 -7.500e-01 -2.000e+00  2.000e+00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   92.500      1.537  60.181 4.57e-07 ***
time20        -6.500      1.255  -5.179  0.00661 ** 
patientB      -9.250      1.984  -4.662  0.00958 ** 
patientC      -1.500      1.984  -0.756  0.49177    
patientD      -5.250      1.984  -2.646  0.05724 .  
patientE       5.000      1.984   2.520  0.06537 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.984 on 4 degrees of freedom
Multiple R-squared:  0.9555,    Adjusted R-squared:  0.8998 
F-statistic: 17.17 on 5 and 4 DF,  p-value: 0.008291
# Additionally accounting for patients makes the strong downregulation of glucose
# highly significant (p=0.0066) and removes a lot of unexplained variance (the
# R-squared is now 95%). Although the patient-specific effects are usually not of
# direct interest, its important to address them in the model!
  • Compare the first model with the patient offset model using anova
anova(mod.time, mod.time.patient)
Analysis of Variance Table

Model 1: glucose ~ time
Model 2: glucose ~ time + patient
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1      8 248.10                              
2      4  15.75  4    232.35 14.752 0.01158 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • What do you conclude now?
# The anova result tells us that the "time.patient" model explains the data
# significantly better than expected. Here, a better explanation means a smaller
# residual variation. Remember, the "time.patient" model is more complex (has 4
# more parameters compared with the simple "time" model), so it has to explain the
# data better. But this expectation is exceeded making it significantly better.

9. Writing to a file

Write the summary for the model coefficients of the previous model to a text file as well as your R session information (using SessionInfo())

df <- summary(mod.time.patient)$coefficients
write.table(df, file='data/01/secretin_model_summary.txt')

10. Bonus exercise

For each individual, you have a series of glucose measurements taken at discrete time points (See Question 6, second plot). By assuming that the glucose concentration changes linearly between consecutive time points, you can model the data as a piecewise linear function. Your goal is to approximate the definite integral of this function for each patient separately — i.e., the area under the curve (AUC)—using the trapezoidal rule, which computes the sum of the areas of trapezoids formed between adjacent data points, using only base R. For extra challenge, try to do it without the help of LLMs.

  • Use the data set as is
  • Treat the lowest value of glucose level for each individual as a baseline (equal to zero)
  • Repeat the analysis above using tidyverse
# We can start with aggregated data
data
   time patient glucose
1   pre       A    92.5
2    20       A    86.0
3    30       A    89.0
4    60       A    87.5
5    90       A    88.5
6   pre       B    85.0
7    20       B    75.0
8    30       B    79.0
9    60       B    77.5
10   90       B    76.0
11  pre       C    90.5
12   20       C    85.0
13   30       C    89.5
14   60       C    90.0
15   90       C    92.5
16  pre       D    88.0
17   20       D    80.0
18   30       D    85.0
19   60       D    82.5
20   90       D    83.5
21  pre       E    95.5
22   20       E    93.0
23   30       E    94.5
24   60       E    92.0
25   90       E    90.0
# Create a trapezoid rule AUC function
piecewise_auc = function(time, values) {
  sum(diff(time) * (head(values, -1) + tail(values, -1)) / 2)
}

## Base R
# Convert the time to integers (we need it for interval widths)
# relabel pre as 0
data$time = factor(data$time, labels = c(0,20,30,60,90))
# Converting factors to it's numeric labels, we first need to convert to character
data$time = as.integer(as.character(data$time))

# Aggregate here would be difficult since we need to pass two arguments to 
# auc calculation We use `by` instead
auc_orig = data.frame(
  auc = by(data, data$patient, function(df) {
    piecewise_auc(df$time, df$glucose)
  })
)

# For baseline adjustment, we first split by patient, use anonymous function
# to subtract the minimum and unsplit the data back to original form
data_baseline = lapply(split(data, data$patient), function(x) {
  x$glucose = x$glucose - min(x$glucose)
  return(x) # explicitly return x from anonymous function
}) |> 
  unsplit(data$patient)

# Calculate the auc in a way similar to previous
auc_baseline = data.frame(
  auc = by(data_baseline, data_baseline$patient, function(df) {
    piecewise_auc(df$time, df$glucose)
  })
)

auc_orig
     auc
A 7947.5
B 7020.0
C 8057.5
D 7507.5
E 8350.0
auc_baseline
    auc
A 207.5
B 270.0
C 407.5
D 307.5
E 250.0
## Tidyverse
data |>
  mutate(time = factor(time, labels = c(0, 20, 30, 60, 90))) |>
  mutate(time = as.integer(as.character(time))) |>
  group_by(patient) |> # following operations will be performed per group
  mutate(baseline = glucose - min(glucose)) |>
  summarise(
    auc_orig = piecewise_auc(time, glucose),
    auc_baseline = piecewise_auc(time, baseline)
  )
# A tibble: 5 × 3
  patient auc_orig auc_baseline
  <fct>      <dbl>        <dbl>
1 A          7948.         208.
2 B          7020          270 
3 C          8058.         408.
4 D          7508.         308.
5 E          8350          250 
# You can see that for more complicated transformations, `tidyverse` usually 
# results in less verbose code that thanks to pipe operator and transformation
# verbs can be easily read and understood.

Session info

R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin23.6.0
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /opt/homebrew/Cellar/openblas/0.3.28/lib/libopenblasp-r0.3.28.dylib 
LAPACK: /opt/homebrew/Cellar/r/4.4.2_2/lib/R/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] palmerpenguins_0.1.1 writexl_1.5.1        readxl_1.4.3        
 [4] lubridate_1.9.4      forcats_1.0.0        stringr_1.5.1       
 [7] dplyr_1.1.4          purrr_1.0.2          readr_2.1.5         
[10] tidyr_1.3.1          tibble_3.2.1         tidyverse_2.0.0     
[13] ggplot2_3.5.1        devtools_2.4.5       usethis_3.1.0       
[16] knitr_1.49          

loaded via a namespace (and not attached):
 [1] gtable_0.3.6      xfun_0.50         htmlwidgets_1.6.4
 [4] remotes_2.5.0     tzdb_0.4.0        vctrs_0.6.5      
 [7] tools_4.4.2       generics_0.1.3    parallel_4.4.2   
[10] pkgconfig_2.0.3   lifecycle_1.0.4   compiler_4.4.2   
[13] farver_2.1.2      munsell_0.5.1     httpuv_1.6.15    
[16] htmltools_0.5.8.1 yaml_2.3.10       later_1.4.1      
[19] pillar_1.10.1     crayon_1.5.3      urlchecker_1.0.1 
[22] ellipsis_0.3.2    cachem_1.1.0      sessioninfo_1.2.2
[25] mime_0.12         tidyselect_1.2.1  digest_0.6.37    
[28] stringi_1.8.4     labeling_0.4.3    fastmap_1.2.0    
[31] grid_4.4.2        colorspace_2.1-1  cli_3.6.3        
[34] magrittr_2.0.3    pkgbuild_1.4.5    utf8_1.2.4       
[37] withr_3.0.2       scales_1.3.0      promises_1.3.2   
[40] bit64_4.5.2       timechange_0.3.0  rmarkdown_2.29   
[43] bit_4.5.0.1       cellranger_1.1.0  hms_1.1.3        
[46] memoise_2.0.1     shiny_1.10.0      evaluate_1.0.3   
[49] miniUI_0.1.1.1    profvis_0.4.0     rlang_1.1.4      
[52] Rcpp_1.0.14       xtable_1.8-4      glue_1.8.0       
[55] pkgload_1.4.0     rstudioapi_0.17.1 vroom_1.6.5      
[58] jsonlite_1.8.9    R6_2.5.1          fs_1.6.5