Overview and revision of R basic

Author

Athimed El Taher

Published

February 28, 2024

Course website

Lecture program is available at https://ivanek.github.io/analysisOfGenomicsDataWithR/

  • Includes R installation instructions
  • Includes handouts and link to supplementary data files
  • Updated after lecture to include answers to exercises
  • Bookmark it!

Lectures

Wednesday, 14h15-16h, Biozentrum, Seminarraum U1.191

  • Teachers: bioinformaticians at the Department of Biomedicine of the Unibas, and at the FMI.
  • Each course will be a mix of lecture and exercises.
  • Bring your laptop
  • No lecture on 01.05.2024 due to Tag der Arbeit
  • You can write us additional questions anytime on the Etherpad on ADAM.

Exam

  • 2 ECTS credits
  • Two online exam sessions: 03.04.2024 and 29.05.2024
  • 12 multiple choice questions per exam (with one correct answer per question)
  • To pass, you need score above 50% overall, i.e., at least 12 out of 24 questions correct
  • Requires computer with R and Bioconductor installed (i.e. some multiple-choice questions require a bit of coding)
  • All material allowed, communication strictly prohibited

We recommend free auditors to also take the exam, it’s a good training!

Learning goals

  • Understand the basic concepts involved in the analysis of most common types of genomics data (bulk and single-cell mRNA-Seq, ATAC-Seq, ChIP-Seq)
  • Be able to perform some simple analysis workflow using the Bioconductor framework, visualize and interpret the results.

Requirements

A solid knowledge of R basics

For example:

  • Understand the usage of basic data types such as factor or data.frame/tibble
  • Load, modify and summarize data fitting into a data.frame/tibble
  • Visualize the data, e.g., as a scatterplot or boxplot
  • Run and interpret a simple linear model
  • Understand error messages and get help from the documentation

Resources for catching-up (and for future reference)

If you find today’s exercises too challenging you will have to catch up fast. Here are a few resources:

Online resources

Books

  • The R chapters in: Buffalo V (2015) Bioinformatics data skills. O’Reilly. SwissBib link
  • Dalgaard P (2008) Introductory statistics with R. Springer Science & Business Media. SwissBib link

R Cheat sheets

Various cheat sheets at the RStudio website, for example:

Installation and setup

  • See the course webpage for the general options
  • R needs to be up-to-date (version >= 4.3.2)
  • Ask for help if you encounter troubles
  • Each lectures will list individual requirements

Best practices

Rstudio

  • RStudio is an Integrated Development Environment (IDE) for the R programming language.
  • It is one of the most productive option for using R
  • It enables interactive data analysis, reproducible research and result publishing
  • It can be locally installed (desktop version) or accessed via a web browser (RStudio server).
  • RStudio interface:

Working style

  • For each lecture, maintain a <lecture_name>.R or <lecture_name>.Rmd script on your working directory, storing all your commands

    • proper indentation of the code (CMD+I in Rstudio) is always a plus for readability
    • A code without comments is useless.
  • Always keep in mind reproducibility. Most of us use R/Bioconductor for their research: many months after the analysis was performed, some reviewer/colleague might ask for the code used to generate some specific result.

  • Use a clean working directory structure.

  • Recommended steps/commands at the beginning of every R script

    • Start by setting up the working directory (setwd()) with absolute path
    • List what is in workspace with ls() to check that it is clean
    • Load libraries with library()
  • When quitting R, do not save history and data. This might lead to a big mess later (and anyway all commands necessary to reproduce the analysis are in your script, right?)

  • Tip: use the auto completion (via TAB) as much as possible

Getting help

  • Really read the error messages! They might seem cryptic at first, but yet often contain useful information.
  • First carefully look at R help pages for functions (e.g., ?lm) and to the vignettes of the packages
  • Search on Google, on forums (e.g., stack overflow) or mailing lists archives (e.g., the Bioconductor support site)
  • Talk to colleagues to get a different perspective on the problem
  • If no success, ask on forums or mailing lists
    • Don’t forget to attach to your question the result of the function sessionInfo() (or the results of the function session_info() from the sessioninfo package, slightly easier to read). This documents what system you are using, the version of R and the different packages

      devtools::session_info()
      ─ Session info ──────────────────────────────────────────────────────────
       setting  value
       version  R version 4.3.2 (2023-10-31)
       os       macOS Big Sur 11.4
       system   aarch64, darwin20
       ui       X11
       language (EN)
       collate  en_US.UTF-8
       ctype    en_US.UTF-8
       tz       Europe/Zurich
       date     2024-02-28
       pandoc   3.1.1 @ /private/var/folders/g_/jj13tm5907vdk15mdl6vvpbr0000gp/T/AppTranslocation/52721D3A-B671-4426-9525-B80654446EF6/d/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
      
      ─ Packages ──────────────────────────────────────────────────────────────
       package     * version date (UTC) lib source
       cachem        1.0.8   2023-05-01 [1] CRAN (R 4.3.0)
       callr         3.7.3   2022-11-02 [1] CRAN (R 4.3.0)
       cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.3.0)
       cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
       colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
       crayon        1.5.2   2022-09-29 [1] CRAN (R 4.3.0)
       devtools    * 2.4.5   2022-10-11 [1] CRAN (R 4.3.0)
       digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.0)
       dplyr       * 1.1.3   2023-09-03 [1] CRAN (R 4.3.0)
       ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.3.0)
       evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.1)
       fansi         1.0.5   2023-10-08 [1] CRAN (R 4.3.1)
       fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
       forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
       fs            1.6.3   2023-07-20 [1] CRAN (R 4.3.0)
       generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
       ggplot2     * 3.5.0   2024-02-23 [1] CRAN (R 4.3.1)
       glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
       gtable        0.3.4   2023-08-21 [1] CRAN (R 4.3.0)
       hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
       htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.3.1)
       htmlwidgets   1.6.2   2023-03-17 [1] CRAN (R 4.3.0)
       httpuv        1.6.12  2023-10-23 [1] CRAN (R 4.3.1)
       jsonlite      1.8.7   2023-06-29 [1] CRAN (R 4.3.0)
       knitr       * 1.45    2023-10-30 [1] CRAN (R 4.3.1)
       later         1.3.1   2023-05-02 [1] CRAN (R 4.3.0)
       lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
       lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.3.1)
       magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
       memoise       2.0.1   2021-11-26 [1] CRAN (R 4.3.0)
       mime          0.12    2021-09-28 [1] CRAN (R 4.3.0)
       miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.3.0)
       munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.0)
       pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
       pkgbuild      1.4.2   2023-06-26 [1] CRAN (R 4.3.0)
       pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
       pkgload       1.3.3   2023-09-22 [1] CRAN (R 4.3.1)
       prettyunits   1.2.0   2023-09-24 [1] CRAN (R 4.3.1)
       processx      3.8.2   2023-06-30 [1] CRAN (R 4.3.0)
       profvis       0.3.8   2023-05-02 [1] CRAN (R 4.3.0)
       promises      1.2.1   2023-08-10 [1] CRAN (R 4.3.0)
       ps            1.7.5   2023-04-18 [1] CRAN (R 4.3.0)
       purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
       R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
       Rcpp          1.0.11  2023-07-06 [1] CRAN (R 4.3.0)
       readr       * 2.1.5   2024-01-10 [1] CRAN (R 4.3.1)
       readxl      * 1.4.3   2023-07-06 [1] CRAN (R 4.3.0)
       remotes       2.4.2.1 2023-07-18 [1] CRAN (R 4.3.0)
       rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
       rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.1)
       rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.0)
       scales        1.3.0   2023-11-28 [1] CRAN (R 4.3.1)
       sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
       shiny         1.7.5.1 2023-10-14 [1] CRAN (R 4.3.1)
       stringi       1.7.12  2023-01-11 [1] CRAN (R 4.3.0)
       stringr     * 1.5.0   2022-12-02 [1] CRAN (R 4.3.0)
       tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
       tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.0)
       tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
       tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
       timechange    0.2.0   2023-01-11 [1] CRAN (R 4.3.0)
       tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
       urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.3.0)
       usethis     * 2.2.2   2023-07-06 [1] CRAN (R 4.3.0)
       utf8          1.2.4   2023-10-22 [1] CRAN (R 4.3.1)
       vctrs         0.6.4   2023-10-12 [1] CRAN (R 4.3.1)
       withr         2.5.2   2023-10-30 [1] CRAN (R 4.3.1)
       writexl     * 1.5.0   2024-02-09 [1] CRAN (R 4.3.1)
       xfun          0.41    2023-11-01 [1] CRAN (R 4.3.1)
       xtable        1.8-4   2019-04-21 [1] CRAN (R 4.3.0)
       yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.0)
      
       [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
      
      ─────────────────────────────────────────────────────────────────────────
      # save the outpout
      capture.output(devtools::session_info(pkgs='attached'),file="data/01/session_info_RevisionRBasics.txt") 
    • Do not forget to attach some code example allowing others to reproduce the issue

    • Be sure to work on an up-to-date system (R and Bioconductor)! Bugs are regularly fixed in Bioconductor packages

Important!

  • Asking questions is highly encouraged
  • If something is not working or understood, reach out for help as early as possible
  • Steep learning curve: practice, practice and practice again!
    • The best option is to get access to some genomics dataset to analyze (from your lab, from a paper your like, etc) and to train and apply what you learn during the lectures on it

Short revision of basic R

In the introduction to R basics, we mostly used functions from the tidyverse, which provides a modern, widely used collection of R packages designed for data science. However, for some aspects, we also illustrated how to perform the same operations using “base R” commands, that is, functions that are loaded with R by default rather than provided in a separate add-on package.

Import file

Read data from a text file into R represented as a tibble or data.frame.

setwd('~/Documents/teaching/2024/analysisOfGenomicsDataWithR')

suppressPackageStartupMessages({
#install.packages(tidyverse)
library(tidyverse)
#install.packages(readxl)
library(readxl)
#install.packages(writexl)
library(writexl)
#install.packages(readr)
library(readr)
})


# tibble
dat = readr::read_csv("data/01/penguins.csv")
Rows: 344 Columns: 8
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(dat)
spc_tbl_ [344 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ species          : chr [1:344] "Adelie" "Adelie" "Adelie" "Adelie" ...
 $ island           : chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: num [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : num [1:344] 3750 3800 3250 NA 3450 ...
 $ sex              : chr [1:344] "male" "female" "female" NA ...
 $ year             : num [1:344] 2007 2007 2007 2007 2007 ...
 - attr(*, "spec")=
  .. cols(
  ..   species = col_character(),
  ..   island = col_character(),
  ..   bill_length_mm = col_double(),
  ..   bill_depth_mm = col_double(),
  ..   flipper_length_mm = col_double(),
  ..   body_mass_g = col_double(),
  ..   sex = col_character(),
  ..   year = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
dim(dat)
[1] 344   8
dat
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>
# data frame
df = as.data.frame(dat) 
head(df)
  species    island bill_length_mm bill_depth_mm flipper_length_mm
1  Adelie Torgersen           39.1          18.7               181
2  Adelie Torgersen           39.5          17.4               186
3  Adelie Torgersen           40.3          18.0               195
4  Adelie Torgersen             NA            NA                NA
5  Adelie Torgersen           36.7          19.3               193
6  Adelie Torgersen           39.3          20.6               190
  body_mass_g    sex year
1        3750   male 2007
2        3800 female 2007
3        3250 female 2007
4          NA   <NA> 2007
5        3450 female 2007
6        3650   male 2007
str(df)
'data.frame':   344 obs. of  8 variables:
 $ species          : chr  "Adelie" "Adelie" "Adelie" "Adelie" ...
 $ island           : chr  "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
 $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: num  181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : num  3750 3800 3250 NA 3450 ...
 $ sex              : chr  "male" "female" "female" NA ...
 $ year             : num  2007 2007 2007 2007 2007 ...
dim(df)
[1] 344   8
# or: 
df = read.table("data/01/penguins.csv",sep=',',h=T)
  • The penguins dataset can be stored in a tibble object (dat) or in data.frame (df). There are two main differences in the usage of a data frame vs a tibble: printing, and subsetting.
  • the dataset contains 344 rows (observations) and 8 columns (variables): 344 penguin observations, with 8 information per observations (variable)
  • Each column of a data.frame is of an atomic (=simple) data type. Atomic data types have values that cannot be divided or broken down further.
  • Common atomic data types include:
    • logical (TRUE, FALSE and NA i.e., not available)
    • integer (3L). Use the L when creating and tells R to store this as an integer.
    • numeric (3.14, Inf, NaN)
    • character ("three")
    • factor for categorical variables (e.g., blood types "A","B","AB","0"). Variables that can take on one of a limited, and usually fixed, number of possible values.

Common data types derived from (are the combination of) the atomic ones include:

  • data.frame/tibble: all columns are of the same length, but each can be a different atomic data type

  • matrix: all columns/rows are the same length and all elements are of the same atomic data type

  • list: no restriction on length/data type per element

  • You can query for the data type (also called class) of a variable using the function class()

class(df) # "data.frame"
[1] "data.frame"
class(dat) # [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
class(df$body_mass_g) # atomic data type. Cannot be broken down further. 
[1] "integer"
  • Change data type:
str(df)
'data.frame':   344 obs. of  8 variables:
 $ species          : chr  "Adelie" "Adelie" "Adelie" "Adelie" ...
 $ island           : chr  "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
 $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : chr  "male" "female" "female" NA ...
 $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
df$species <- factor(df$species)
df$island <- factor(df$island)

# If you check the levels of the variable: 
levels(df$island) # [1] "Biscoe"    "Dream"     "Torgersen"
[1] "Biscoe"    "Dream"     "Torgersen"
# However, imagine Torgersen is the main island
# you can choose which level to put as reference level
df$island <- relevel(df$island, ref = "Torgersen")
levels(df$island) # [1] "Torgersen" "Biscoe"    "Dream"  
[1] "Torgersen" "Biscoe"    "Dream"    
  • Remember that behind the scenes, a factor is encoded with numerical values.
  • Whether a string variable should be considered as categorical or not depends on the question asked.

Subsetting

A common operation is to subset data sets, that is, to only retain some of the records. The specification of what to retain can be done in many different ways. There are many ways of achieving it:

  1. by numeric index:
  • slice: To retain a pre-specified set of rows, we can use the slice function from dplyr.
  • Select: the select function, also from dplyr, allows us to retain a given set of columns
# first row
slice(dat, 1)
# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torge…           39.1          18.7               181        3750
# ℹ 2 more variables: sex <chr>, year <dbl>
#slice(df, 1)

# first, third and fifth row
slice(dat, c(1, 3, 5))
# A tibble: 3 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torge…           39.1          18.7               181        3750
2 Adelie  Torge…           40.3          18                 195        3250
3 Adelie  Torge…           36.7          19.3               193        3450
# ℹ 2 more variables: sex <chr>, year <dbl>
#slice(df, c(1, 3, 5))

# Check if dat and df give the same result:
# first to third column using tibble
on_dat = select(dat,1:3)
# first to third column using data.frame
on_df = select(df,1:3)
table(on_dat==on_df)

TRUE 
1030 
Side note

A more intuitive way, which reduces the number of intermediate variables that are explicitly created and scales to more complex operations, is to use the pipe operator (|>).

# Select column 1:3, and row 1:5 of the dat tibble or df data.frame
dat |> 
  slice(1:5) |>
    select(1:3)
# A tibble: 5 × 3
  species island    bill_length_mm
  <chr>   <chr>              <dbl>
1 Adelie  Torgersen           39.1
2 Adelie  Torgersen           39.5
3 Adelie  Torgersen           40.3
4 Adelie  Torgersen           NA  
5 Adelie  Torgersen           36.7
  • square brackets []: Subsetting can also be done using base R commands.
# first five rows, first three columns
dat[1:5, 1:3]
# A tibble: 5 × 3
  species island    bill_length_mm
  <chr>   <chr>              <dbl>
1 Adelie  Torgersen           39.1
2 Adelie  Torgersen           39.5
3 Adelie  Torgersen           40.3
4 Adelie  Torgersen           NA  
5 Adelie  Torgersen           36.7
  1. by row- and/or column names with:

Can be achieved with select or square brackets. An important shortcut to extract a single column is provided by the $ sign

# Use the column names
colnames(dat)
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"             
dim(dat) # 344   8
[1] 344   8
select(dat,species)
# A tibble: 344 × 1
   species
   <chr>  
 1 Adelie 
 2 Adelie 
 3 Adelie 
 4 Adelie 
 5 Adelie 
 6 Adelie 
 7 Adelie 
 8 Adelie 
 9 Adelie 
10 Adelie 
# ℹ 334 more rows
# using `$` and square brackets
head(dat$species) 
[1] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
length(dat$species)
[1] 344
# Use row number and colnames
dat[1:5, c("species", "island", "bill_length_mm")]
# A tibble: 5 × 3
  species island    bill_length_mm
  <chr>   <chr>              <dbl>
1 Adelie  Torgersen           39.1
2 Adelie  Torgersen           39.5
3 Adelie  Torgersen           40.3
4 Adelie  Torgersen           NA  
5 Adelie  Torgersen           36.7
Side note

There is an important difference between a tibble and a data.frame when it comes to subsetting using the bracket operator. By default, extracting a single column from a data.frame returns a vector (not a data.frame), while performing the same operation on a tibble returns another tibble. The key to getting a data.frame when extracting a column from a data.frame is to add the argument drop = FALSE. The $ operator always returns a vector.

# extract the first five elements in the 'species' column
str(df[,'species']) # vector
 Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
length(df[,'species']) 
[1] 344
str(dat[,'species']) # tibble
tibble [344 × 1] (S3: tbl_df/tbl/data.frame)
 $ species: chr [1:344] "Adelie" "Adelie" "Adelie" "Adelie" ...
dim(dat[,'species'])
[1] 344   1
  1. by logical index
  • filter: dplyr function to subset based on observed values
dat |>
  filter(year == 2007 & island == "Torgersen")
# A tibble: 20 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
11 Adelie  Torgersen           37.8          17.1               186
12 Adelie  Torgersen           37.8          17.3               180
13 Adelie  Torgersen           41.1          17.6               182
14 Adelie  Torgersen           38.6          21.2               191
15 Adelie  Torgersen           34.6          21.1               198
16 Adelie  Torgersen           36.6          17.8               185
17 Adelie  Torgersen           38.7          19                 195
18 Adelie  Torgersen           42.5          20.7               197
19 Adelie  Torgersen           34.4          18.4               184
20 Adelie  Torgersen           46            21.5               194
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>
# can also be done in multiple steps
dat |>
  filter(year == 2007) |>
  filter(island == "Torgersen")
# A tibble: 20 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
11 Adelie  Torgersen           37.8          17.1               186
12 Adelie  Torgersen           37.8          17.3               180
13 Adelie  Torgersen           41.1          17.6               182
14 Adelie  Torgersen           38.6          21.2               191
15 Adelie  Torgersen           34.6          21.1               198
16 Adelie  Torgersen           36.6          17.8               185
17 Adelie  Torgersen           38.7          19                 195
18 Adelie  Torgersen           42.5          20.7               197
19 Adelie  Torgersen           34.4          18.4               184
20 Adelie  Torgersen           46            21.5               194
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>
  • which: using square brackets and base R function which
# logical vector: 
head(dat$island=='Torgersen' & dat$year == 2007)
[1] TRUE TRUE TRUE TRUE TRUE TRUE
# index of logical == TRUE
which(dat$island=='Torgersen' & dat$year == 2007)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# Use index to subset the df keeping only logical==T: 
dat[which(dat$island=='Torgersen' & dat$year == 2007),]
# A tibble: 20 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
11 Adelie  Torgersen           37.8          17.1               186
12 Adelie  Torgersen           37.8          17.3               180
13 Adelie  Torgersen           41.1          17.6               182
14 Adelie  Torgersen           38.6          21.2               191
15 Adelie  Torgersen           34.6          21.1               198
16 Adelie  Torgersen           36.6          17.8               185
17 Adelie  Torgersen           38.7          19                 195
18 Adelie  Torgersen           42.5          20.7               197
19 Adelie  Torgersen           34.4          18.4               184
20 Adelie  Torgersen           46            21.5               194
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>
  • subset: the subset() function allows to combine several logical filters.
subset(dat, island == "Torgersen" & year == 2007)
# A tibble: 20 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
11 Adelie  Torgersen           37.8          17.1               186
12 Adelie  Torgersen           37.8          17.3               180
13 Adelie  Torgersen           41.1          17.6               182
14 Adelie  Torgersen           38.6          21.2               191
15 Adelie  Torgersen           34.6          21.1               198
16 Adelie  Torgersen           36.6          17.8               185
17 Adelie  Torgersen           38.7          19                 195
18 Adelie  Torgersen           42.5          20.7               197
19 Adelie  Torgersen           34.4          18.4               184
20 Adelie  Torgersen           46            21.5               194
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>
  • %in%: %in% This operator is used to identify if a series of elements of a vector belong to another vector. This is also very useful for subsetting data frames
c("Torgersen","Biscoe","Dream",'dream') %in% levels(dat$island)
[1] FALSE FALSE FALSE FALSE
levels(df$island) %in% c("Torgersen","Biscoe","Dream",'dream') 
[1] TRUE TRUE TRUE

Adding a new column to a data set, or replacing an existing column

  • mutate: function from dplyr
# Add another column to `dat` that contains the sum of the `bill_length_mm` and `bill_depth_mm` columns.
dat |>
  mutate(lpd = bill_length_mm + bill_depth_mm)
# A tibble: 344 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
# ℹ 334 more rows
# ℹ 4 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>, lpd <dbl>
#Add column in `dat` that contains the bill length in cm. Display only the species column and the newly created column
dat |>
  mutate(bill_length_cm = bill_length_mm / 10) |>
  select(species, bill_length_cm)
# A tibble: 344 × 2
   species bill_length_cm
   <chr>            <dbl>
 1 Adelie            3.91
 2 Adelie            3.95
 3 Adelie            4.03
 4 Adelie           NA   
 5 Adelie            3.67
 6 Adelie            3.93
 7 Adelie            3.89
 8 Adelie            3.92
 9 Adelie            3.41
10 Adelie            4.2 
# ℹ 334 more rows
  • $: $ can also be used to add a new column to a data.frame
dat$bill_length_cm =  dat$bill_length_mm / 10

Finding all unique combinations

  • distinct(): function from dplyr which returns all distinct rows in a data.frame or tibble.
dat |>
  distinct(species, island)
# A tibble: 5 × 2
  species   island   
  <chr>     <chr>    
1 Adelie    Torgersen
2 Adelie    Biscoe   
3 Adelie    Dream    
4 Gentoo    Biscoe   
5 Chinstrap Dream    

Computing summary statistics

Common summary statistics for numerical variables are mean and standard deviation:

mean(dat$bill_length_cm) # return NA
[1] NA
# Deal with missing data
mean(dat$bill_length_cm,na.rm = T)
[1] 4.392193
sd(dat$bill_length_cm,na.rm = T)
[1] 0.5459584

For our example it seems more useful to split the flipper length according to species and/or sex and compute summary statistics on these data subsets.

The data set can be split in six groups: 1. Adelie.male 2. Adelie.female 3. Chinstrap.male 4. Chinstrap.female 5. Gentoo.male 6. Gentoo.female

# Split the data in four groups: 
dataSplit <- split(dat$flipper_length_mm, list(dat$species, df$sex))
dataSplit # the result is a list with for each group, the flipper_length_mm of the corresponding species and sex
$Adelie.female
 [1] 186 195 193 181 182 185 195 184 174 189 187 187 172 178 188 195 180
[18] 181 182 186 185 190 186 190 187 186 181 185 185 184 195 190 190 196
[35] 190 191 187 189 187 191 189 190 202 185 187 190 178 192 183 193 199
[52] 181 198 193 191 188 189 187 176 186 191 191 190 193 187 191 185 193
[69] 188 192 184 195 187

$Chinstrap.female
 [1] 192 188 198 178 195 193 185 190 181 190 181 187 195 200 191 187 187
[18] 199 195 192 187 196 196 190 198 199 193 187 191 194 189 195 202 198

$Gentoo.female
 [1] 211 210 210 211 209 214 214 210 210 209 215 213 215 210 209 207 220
[18] 213 208 208 210 217 210 213 210 210 217 208 208 208 214 219 220 216
[35] 217 216 209 215 212 212 212 218 212 218 212 214 222 203 219 215 210
[52] 208 216 213 217 214 215 212

$Adelie.male
 [1] 181 190 195 191 198 197 194 180 185 180 183 180 178 184 196 190 184
[18] 195 196 190 182 191 188 200 191 193 194 195 192 192 188 198 190 197
[35] 195 184 195 196 193 194 190 189 205 186 208 196 192 203 190 184 190
[52] 197 191 197 196 199 189 198 202 199 195 210 197 199 190 200 193 187
[69] 190 185 190 193 201

$Chinstrap.male
 [1] 196 193 197 197 198 194 201 201 197 195 191 193 197 200 205 201 203
[18] 195 210 205 210 196 201 212 187 201 203 197 203 202 206 207 193 210

$Gentoo.male
 [1] 230 218 215 219 215 216 213 217 221 222 218 215 215 215 220 222 230
[18] 220 219 208 225 216 222 225 215 220 225 220 220 224 221 231 230 229
[35] 223 221 221 230 220 223 221 224 228 218 230 228 224 226 216 225 228
[52] 228 215 219 209 229 230 230 222 222 213
# Apply on each group, the mean calculation: 
groupMeans <- lapply(dataSplit, mean) # lapply is specific to apply a function on a list 
groupMeans # Output is the mean of each group, saved in a list
$Adelie.female
[1] 187.7945

$Chinstrap.female
[1] 191.7353

$Gentoo.female
[1] 212.7069

$Adelie.male
[1] 192.411

$Chinstrap.male
[1] 199.9118

$Gentoo.male
[1] 221.541
## Same thing, but the output is a named vector:
groupMeans <- sapply(dataSplit, mean)
groupMeans
   Adelie.female Chinstrap.female    Gentoo.female      Adelie.male 
        187.7945         191.7353         212.7069         192.4110 
  Chinstrap.male      Gentoo.male 
        199.9118         221.5410 

The same is achieved with the function aggregate() (with or without a formula interface), which returns a data frame.

aggregate(dat$flipper_length_mm, list(dat$species, dat$sex), FUN=mean)
    Group.1 Group.2        x
1    Adelie  female 187.7945
2 Chinstrap  female 191.7353
3    Gentoo  female 212.7069
4    Adelie    male 192.4110
5 Chinstrap    male 199.9118
6    Gentoo    male 221.5410
aggregate(flipper_length_mm ~ species + sex, data=dat, FUN=mean)
    species    sex flipper_length_mm
1    Adelie female          187.7945
2 Chinstrap female          191.7353
3    Gentoo female          212.7069
4    Adelie   male          192.4110
5 Chinstrap   male          199.9118
6    Gentoo   male          221.5410

To compute the mean and standard deviation at the same time, we can define a function that returns both:

meanAndStd <- function(x) {
  c(mean=mean(x,na.rm=T), std=sd(x,na.rm=T))
}

meanAndStd(dat$flipper_length_mm) # What happen if you do not add na.rm=T? 
     mean       std 
200.91520  14.06171 

This function can then be used in combination with aggregate, or sapply.

sumStats <- aggregate(flipper_length_mm ~ species + sex, data=dat,
                      FUN=meanAndStd)
sumStats
    species    sex flipper_length_mm.mean flipper_length_mm.std
1    Adelie female             187.794521              5.595035
2 Chinstrap female             191.735294              5.754096
3    Gentoo female             212.706897              3.897856
4    Adelie   male             192.410959              6.599317
5 Chinstrap   male             199.911765              5.976558
6    Gentoo   male             221.540984              5.673252
sapply(dataSplit, meanAndStd)
     Adelie.female Chinstrap.female Gentoo.female Adelie.male
mean    187.794521       191.735294    212.706897  192.410959
std       5.595035         5.754096      3.897856    6.599317
     Chinstrap.male Gentoo.male
mean     199.911765  221.540984
std        5.976558    5.673252

Visual summaries

There are good reasons to visualize the data instead of deriving conclusions from summary statistics alone:

  • It provides more complete quality check on a dataset. For example, extreme values can render summary statistics meaningless
  • Meaningful trends are often easier to grasp visually.

With base graphics

dat$species = as.factor(dat$species)
dat$sex = as.factor(dat$sex)
plot(dat$flipper_length_mm, 
     col=as.numeric(dat$species), 
     pch=as.numeric(dat$sex), 
     ylab="flipper length (mm)")
legend('top', legend = levels(dat$sex), pch = 1:2)
legend('topright', legend = levels(dat$species), pch = 1, col = 1:3)

This scatter plot is not so meaningful because the x-axis represents only the order of the Penguins in the data frame (it just happens so that they are ordered by species and sex).

A boxplot might be a better option here: it gives a visual summary of the data spread.

  • It is best applied to continuous data
  • The formula interface allows to split the data into subsets before plotting it
par(mar=c(10,5,1,1))
boxplot(flipper_length_mm ~ species + sex, data=dat,las=2,xlab='')

  • Boxplots convey summary statistics (median, data spread and outliers) for large datasets
  • For small datasets (<= 5 points per group) it is more informative to plot the data points
  • Secondary plotting functions such as lines() and points() can be called after the main figure has been drawn to add elements
boxplot(flipper_length_mm ~ species + sex, data=dat, outline=F,las=2,xlab='')
for (n in 1:length(dataSplit)) { # for each group/box
  x <- jitter(rep(n, length(dataSplit[[n]])),amount=0.1) # amount to shift all the points if same value
  points(x, dataSplit[[n]], pch=1.2, cex=0.5,col='blue',lwd=0.8)
}

With ggplot2

An elegant alternative to the base graphic system is the ggplot2 package system, which is an implementation of “The Grammar of Graphics” (Wilkinson 2005). Here each figure is composed of the same few building blocks, such as data, aesthetics and scales.

#install.packages(ggplot2)
suppressPackageStartupMessages({

ggplot(data=dat) + 
  aes(x = species:sex) + 
  aes(y = flipper_length_mm) + 
  geom_boxplot(outlier.size = -1) + 
  geom_jitter(pch=1,
              size = 2.5, 
              height = 0, 
              width = 0.1) +
  theme_minimal(base_size = 14)
})
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

# If you wish to remove NA
set.seed(123)
dat |>
  drop_na()  |>
  ggplot() + 
  aes(x = species:sex) + 
  aes(y = flipper_length_mm) + 
  geom_boxplot(outlier.size = -1) + 
  geom_jitter(pch=1,
              size = 2.5, 
              height = 0, 
              width = 0.1) +
  theme_minimal(base_size = 14)

Fitting a linear model

Linear models are used to test hypotheses on the relation between a numerical variable and a set of independent predictors. These can be:

  • numerical as in linear regression
  • categorical as for t-test or ANOVA
  • A combination of both

We can use the function lm() to test hypotheses about the dependence of flipper length(mm) on species and sex in our penguin dataset.

linmod <- lm(flipper_length_mm ~ species * sex, data=dat)
summary(linmod)

Call:
lm(formula = flipper_length_mm ~ species * sex, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.7945  -3.4110   0.0882   3.4590  17.5890 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)              187.7945     0.6619 283.721  < 2e-16 ***
speciesChinstrap           3.9408     1.1742   3.356 0.000884 ***
speciesGentoo             24.9124     0.9947  25.044  < 2e-16 ***
sexmale                    4.6164     0.9361   4.932  1.3e-06 ***
speciesChinstrap:sexmale   3.5600     1.6606   2.144 0.032782 *  
speciesGentoo:sexmale      4.2176     1.3971   3.019 0.002737 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.655 on 327 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8396,    Adjusted R-squared:  0.8372 
F-statistic: 342.4 on 5 and 327 DF,  p-value: < 2.2e-16
levels(dat$sex) # "female" "male"   
[1] "female" "male"  
levels(dat$species) #  "Adelie"    "Chinstrap" "Gentoo"  
[1] "Adelie"    "Chinstrap" "Gentoo"   
  • To correctly interpret the coefficients of a linear model remember that by default they reflect differences to a baseline
  • In our model, the baseline (here Intercept) is defined by the set of reference levels Adelie (for species) and female (for sex), respectively.

Lets check how the model coefficients relate to the group means calculated above:

## Intercept. The test statistics evaluate the difference of Adelie.female (baseline) penguin flipper length (average) from 0. As expected, this is highly significant but not at all relevant.
groupMeans['Adelie.female']
Adelie.female 
     187.7945 
## speciesChinstrap: Difference between Chinstrap amd Adelie in female group
groupMeans['Chinstrap.female'] - groupMeans['Adelie.female'] # deference to baseline 
Chinstrap.female 
        3.940774 
## sexmale: Difference between female and male for species Adelie
groupMeans['Adelie.male'] - groupMeans['Adelie.female'] # deference to baseline
Adelie.male 
   4.616438 
## speciesChinstrap:sexmale: A difference of differences (called 'interaction')
(groupMeans['Chinstrap.male'] - groupMeans['Chinstrap.female']) - (groupMeans['Adelie.male'] - groupMeans['Adelie.female'] ) 
Chinstrap.male 
      3.560032 

Exercises

In the exercise we will use data from a glucose response experiment (secretin data). Five patients with arterial hypertension were dosed with the hormone secretin. Among other roles, secretin is stimulating insulin release from the pancreas after oral glucose intake. Blood glucose levels were recorded in duplicates before administration and in intervals of 20, 30, 60 and 90 minutes after administration of the hormone. The dataset is a subset of the secretin data from the ISwR R package.

1. Read-in

Read-in the example dataset using an appropriate import function. Consult the help pages if you are uncertain about the function usage.

# Import the dataset
data = readr::read_csv("data/01/secretin.csv")
Rows: 50 Columns: 4
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (3): patient, time, replicate
dbl (1): glucose

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Data structure

How is the data structured? Which variables are present and what are their data types?

data
# A tibble: 50 × 4
   glucose patient time  replicate
     <dbl> <chr>   <chr> <chr>    
 1      92 A       pre   a        
 2      93 A       pre   b        
 3      84 A       20    a        
 4      88 A       20    b        
 5      88 A       30    a        
 6      90 A       30    b        
 7      86 A       60    a        
 8      89 A       60    b        
 9      87 A       90    a        
10      90 A       90    b        
# ℹ 40 more rows
str(data)
spc_tbl_ [50 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ glucose  : num [1:50] 92 93 84 88 88 90 86 89 87 90 ...
 $ patient  : chr [1:50] "A" "A" "A" "A" ...
 $ time     : chr [1:50] "pre" "pre" "20" "20" ...
 $ replicate: chr [1:50] "a" "b" "a" "b" ...
 - attr(*, "spec")=
  .. cols(
  ..   glucose = col_double(),
  ..   patient = col_character(),
  ..   time = col_character(),
  ..   replicate = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

3. Data “clean-up”

If needed, change the data type of the column of the data frame so that they are represented with a relevant data type. When needed, relevel factors with the function relevel() to have a meaningful reference level.

## Some columns are better treated as factor
data$patient <- factor(data$patient)
data$time <- factor(data$time)
data$replicate <- factor(data$replicate)

## Relevel time to have "pre" as reference
data$time <- relevel(data$time, ref="pre")

4. Exploring

How many observations per time and/or patient are present? Use the functions split(), sapply() and length() to answer the question

sp <- split(data$glucose, list(data$time))
sapply(sp, length)
pre  20  30  60  90 
 10  10  10  10  10 
sp <- split(data$glucose, list(data$patient))
sapply(sp, length)
 A  B  C  D  E 
10 10 10 10 10 
sp <- split(data$glucose, list(data$time, data$patient))
sapply(sp, length)
pre.A  20.A  30.A  60.A  90.A pre.B  20.B  30.B  60.B  90.B pre.C  20.C 
    2     2     2     2     2     2     2     2     2     2     2     2 
 30.C  60.C  90.C pre.D  20.D  30.D  60.D  90.D pre.E  20.E  30.E  60.E 
    2     2     2     2     2     2     2     2     2     2     2     2 
 90.E 
    2 

5. Summarizing the data

Be careful about differentiating biological and technical replication. Here the patients represent the biological replicates, and the 2 measurements per patient and time point are purely technical replicates. It is rarely the case that technical variation is larger than biological variation, and if so, it often can be ignored. We will average the technical replicates for the next steps of the analysis.

Use the aggregate() function to average the glucose levels for each time/patient combination across replicates

data <- aggregate(glucose ~ time+patient, data, FUN=mean) 
head(data)
  time patient glucose
1  pre       A    92.5
2   20       A    86.0
3   30       A    89.0
4   60       A    87.5
5   90       A    88.5
6  pre       B    85.0

6. Plotting

Construct a boxplot of replicate-averaged glucose levels versus time points:

  • Overlay it with the single data points
  • Highlight the different patients with different colors
ggplot(data = data, aes(x = time, y = glucose)) + 
  geom_boxplot() + 
  geom_point(aes(color=patient), size = 2.5) +
  ggtitle('Secretin Data') + 
  xlab('Time') + 
  ylab('Glucose Level') + 
  theme_minimal(base_size = 14)

## Of note, with only 5 points per condition a boxplot is not an optimal plot choice. Plotting only the points is simpler, and since the observations are available for every patient, we can connect them with lines to see the evolution with time.  

ggplot(data = data, aes(x = time, y = glucose)) + 
  geom_point(aes(color=patient), size = 2.5) +
  geom_line(aes(group=patient)) +
  ggtitle('Secretin Data') + 
  xlab('Time') + 
  ylab('Glucose Level') + 
  theme_minimal(base_size = 14)

7. Subsetting

Create a subset of the (replicate-averaged) data containing only time-point pre and 20 using the function subset().

dataSubset <- subset(data, time == "pre" | time == "20")
## alternative solution using %in%
dataSubset <- subset(data, time %in% c("pre","20"))

8. Linear model

  • On this reduced dataset, run a linear model to evaluate the change in glucose levels at 20 minutes relative to the baseline levels. What do you conclude?
mod.time <- lm(glucose ~ time, dataSubset)
summary(mod.time)

Call:
lm(formula = glucose ~ time, data = dataSubset)

Residuals:
   Min     1Q Median     3Q    Max 
-8.800 -3.425  0.700  2.200  9.200 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   90.300      2.490  36.258 3.67e-10 ***
time20        -6.500      3.522  -1.846    0.102    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.569 on 8 degrees of freedom
Multiple R-squared:  0.2986,    Adjusted R-squared:  0.2109 
F-statistic: 3.406 on 1 and 8 DF,  p-value: 0.1022
# It seems that the strong downregulation of glucose (-6.5 units in 20 minutes) is not significant (p=0.1). But the R-squared value indicates a bad fit. 
# The boxplots above already indicated a large patient specific offset. So its probably a good choice to account for it in the model.
  • Implement a model taking the sample pairing by patients into account. What do you conclude now?
mod.time.patient <- lm(glucose ~ time + patient, dataSubset)
summary(mod.time.patient)

Call:
lm(formula = glucose ~ time + patient, data = dataSubset)

Residuals:
         1          2          6          7         11         12 
 5.995e-15 -6.647e-15  1.750e+00 -1.750e+00 -5.000e-01  5.000e-01 
        16         17         21         22 
 7.500e-01 -7.500e-01 -2.000e+00  2.000e+00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   92.500      1.537  60.181 4.57e-07 ***
time20        -6.500      1.255  -5.179  0.00661 ** 
patientB      -9.250      1.984  -4.662  0.00958 ** 
patientC      -1.500      1.984  -0.756  0.49177    
patientD      -5.250      1.984  -2.646  0.05724 .  
patientE       5.000      1.984   2.520  0.06537 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.984 on 4 degrees of freedom
Multiple R-squared:  0.9555,    Adjusted R-squared:  0.8998 
F-statistic: 17.17 on 5 and 4 DF,  p-value: 0.008291
#Additionally accounting for patients makes the strong downregulation of glucose highly significant (p=0.0066) and removes a lot of unexplained variance (the R-squared is now 95%). Although the patient-specific effects are usually not of direct interest, its important to address them in the model!
  • Compare the first model with the patient offset model using anova
anova(mod.time, mod.time.patient)
Analysis of Variance Table

Model 1: glucose ~ time
Model 2: glucose ~ time + patient
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1      8 248.10                              
2      4  15.75  4    232.35 14.752 0.01158 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • What do you conclude now?
# The anova result tells us that the "time.patient" model explains the data significantly better than expected. Here, a better explanation means a smaller residual variation. Remember, the "time.patient" model is more complex (has 4 more parameters compared with the simple "time" model), so it has to explain the data better. But this expectation is exceeded making it significantly better.

9. Writing to a file

Write the summary for the model coefficients of the previous model to a text file as well as your R session information (using SessionInfo())

df <- summary(mod.time.patient)$coefficients
write.table(df, file='data/01/secretin_model_summary.txt')

Session info

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Big Sur 11.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] writexl_1.5.0   readxl_1.4.3    lubridate_1.9.3 forcats_1.0.0  
 [5] stringr_1.5.0   dplyr_1.1.3     purrr_1.0.2     readr_2.1.5    
 [9] tidyr_1.3.0     tibble_3.2.1    tidyverse_2.0.0 ggplot2_3.5.0  
[13] devtools_2.4.5  usethis_2.2.2   knitr_1.45     

loaded via a namespace (and not attached):
 [1] gtable_0.3.4      xfun_0.41         htmlwidgets_1.6.2
 [4] remotes_2.4.2.1   processx_3.8.2    callr_3.7.3      
 [7] tzdb_0.4.0        vctrs_0.6.4       tools_4.3.2      
[10] ps_1.7.5          generics_0.1.3    parallel_4.3.2   
[13] fansi_1.0.5       pkgconfig_2.0.3   lifecycle_1.0.3  
[16] farver_2.1.1      compiler_4.3.2    munsell_0.5.0    
[19] httpuv_1.6.12     htmltools_0.5.7   yaml_2.3.7       
[22] later_1.3.1       pillar_1.9.0      crayon_1.5.2     
[25] urlchecker_1.0.1  ellipsis_0.3.2    cachem_1.0.8     
[28] sessioninfo_1.2.2 mime_0.12         tidyselect_1.2.0 
[31] digest_0.6.33     stringi_1.7.12    labeling_0.4.3   
[34] fastmap_1.1.1     grid_4.3.2        colorspace_2.1-0 
[37] cli_3.6.1         magrittr_2.0.3    pkgbuild_1.4.2   
[40] utf8_1.2.4        withr_2.5.2       prettyunits_1.2.0
[43] scales_1.3.0      promises_1.2.1    bit64_4.0.5      
[46] timechange_0.2.0  rmarkdown_2.25    bit_4.0.5        
[49] cellranger_1.1.0  hms_1.1.3         memoise_2.0.1    
[52] shiny_1.7.5.1     evaluate_0.23     miniUI_0.1.1.1   
[55] profvis_0.3.8     rlang_1.1.1       Rcpp_1.0.11      
[58] xtable_1.8-4      glue_1.6.2        pkgload_1.3.3    
[61] rstudioapi_0.15.0 vroom_1.6.4       jsonlite_1.8.7   
[64] R6_2.5.1          fs_1.6.3