RStudio is an Integrated Development Environment (IDE) for the R programming language.
It is one of the most productive option for using R
It enables interactive data analysis, reproducible research and result publishing
It can be locally installed (desktop version) or accessed via a web browser (RStudio server).
RStudio interface:
Working style
For each lecture, maintain a <lecture_name>.R or <lecture_name>.Rmd script on your working directory, storing all your commands
proper indentation of the code (CMD+I in Rstudio) is always a plus for readability
A code without comments is useless.
Always keep in mind reproducibility. Most of us use R/Bioconductor for their research: many months after the analysis was performed, some reviewer/colleague might ask for the code used to generate some specific result.
Use a clean working directory structure.
Recommended steps/commands at the beginning of every R script
Start by setting up the working directory (setwd()) with absolute path
List what is in workspace with ls() to check that it is clean
Load libraries with library()
When quitting R, do not save history and data. This might lead to a big mess later (and anyway all commands necessary to reproduce the analysis are in your script, right?)
Tip: use the auto completion (via TAB) as much as possible
Getting help
Really read the error messages! They might seem cryptic at first, but yet often contain useful information.
First carefully look at R help pages for functions (e.g., ?lm) and to the vignettes of the packages
Talk to colleagues to get a different perspective on the problem
If no success, ask on forums or mailing lists
Don’t forget to attach to your question the result of the function sessionInfo() (or the results of the function session_info() from the sessioninfo package, slightly easier to read). This documents what system you are using, the version of R and the different packages
# save the outpoutcapture.output(devtools::session_info(pkgs='attached'),file="data/01/session_info_RevisionRBasics.txt")
Do not forget to attach some code example allowing others to reproduce the issue
Be sure to work on an up-to-date system (R and Bioconductor)! Bugs are regularly fixed in Bioconductor packages
Important!
Asking questions is highly encouraged
If something is not working or understood, reach out for help as early as possible
Steep learning curve: practice, practice and practice again!
The best option is to get access to some genomics dataset to analyze (from your lab, from a paper your like, etc) and to train and apply what you learn during the lectures on it
Short revision of basic R
In the introduction to R basics, we mostly used functions from the tidyverse, which provides a modern, widely used collection of R packages designed for data science. However, for some aspects, we also illustrated how to perform the same operations using “base R” commands, that is, functions that are loaded with R by default rather than provided in a separate add-on package.
Import file
Read data from a text file into R represented as a tibble or data.frame.
Rows: 344 Columns: 8
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(dat)
spc_tbl_ [344 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ species : chr [1:344] "Adelie" "Adelie" "Adelie" "Adelie" ...
$ island : chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: num [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : num [1:344] 3750 3800 3250 NA 3450 ...
$ sex : chr [1:344] "male" "female" "female" NA ...
$ year : num [1:344] 2007 2007 2007 2007 2007 ...
- attr(*, "spec")=
.. cols(
.. species = col_character(),
.. island = col_character(),
.. bill_length_mm = col_double(),
.. bill_depth_mm = col_double(),
.. flipper_length_mm = col_double(),
.. body_mass_g = col_double(),
.. sex = col_character(),
.. year = col_double()
.. )
- attr(*, "problems")=<externalptr>
dim(dat)
[1] 344 8
dat
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm
<chr> <chr> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181
2 Adelie Torgersen 39.5 17.4 186
3 Adelie Torgersen 40.3 18 195
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen 36.7 19.3 193
6 Adelie Torgersen 39.3 20.6 190
7 Adelie Torgersen 38.9 17.8 181
8 Adelie Torgersen 39.2 19.6 195
9 Adelie Torgersen 34.1 18.1 193
10 Adelie Torgersen 42 20.2 190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>
# data framedf =as.data.frame(dat) head(df)
species island bill_length_mm bill_depth_mm flipper_length_mm
1 Adelie Torgersen 39.1 18.7 181
2 Adelie Torgersen 39.5 17.4 186
3 Adelie Torgersen 40.3 18.0 195
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen 36.7 19.3 193
6 Adelie Torgersen 39.3 20.6 190
body_mass_g sex year
1 3750 male 2007
2 3800 female 2007
3 3250 female 2007
4 NA <NA> 2007
5 3450 female 2007
6 3650 male 2007
str(df)
'data.frame': 344 obs. of 8 variables:
$ species : chr "Adelie" "Adelie" "Adelie" "Adelie" ...
$ island : chr "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
$ bill_length_mm : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: num 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : num 3750 3800 3250 NA 3450 ...
$ sex : chr "male" "female" "female" NA ...
$ year : num 2007 2007 2007 2007 2007 ...
The penguins dataset can be stored in a tibble object (dat) or in data.frame (df). There are two main differences in the usage of a data frame vs a tibble: printing, and subsetting.
the dataset contains 344 rows (observations) and 8 columns (variables): 344 penguin observations, with 8 information per observations (variable)
Each column of a data.frame is of an atomic (=simple) data type. Atomic data types have values that cannot be divided or broken down further.
Common atomic data types include:
logical (TRUE, FALSE and NA i.e., not available)
integer (3L). Use the L when creating and tells R to store this as an integer.
numeric (3.14, Inf, NaN)
character ("three")
factor for categorical variables (e.g., blood types "A","B","AB","0"). Variables that can take on one of a limited, and usually fixed, number of possible values.
Common data types derived from (are the combination of) the atomic ones include:
data.frame/tibble: all columns are of the same length, but each can be a different atomic data type
matrix: all columns/rows are the same length and all elements are of the same atomic data type
list: no restriction on length/data type per element
You can query for the data type (also called class) of a variable using the function class()
class(df$body_mass_g) # atomic data type. Cannot be broken down further.
[1] "integer"
Change data type:
str(df)
'data.frame': 344 obs. of 8 variables:
$ species : chr "Adelie" "Adelie" "Adelie" "Adelie" ...
$ island : chr "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
$ bill_length_mm : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : chr "male" "female" "female" NA ...
$ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
df$species <-factor(df$species)df$island <-factor(df$island)# If you check the levels of the variable: levels(df$island) # [1] "Biscoe" "Dream" "Torgersen"
[1] "Biscoe" "Dream" "Torgersen"
# However, imagine Torgersen is the main island# you can choose which level to put as reference leveldf$island <-relevel(df$island, ref ="Torgersen")levels(df$island) # [1] "Torgersen" "Biscoe" "Dream"
[1] "Torgersen" "Biscoe" "Dream"
Remember that behind the scenes, a factor is encoded with numerical values.
Whether a string variable should be considered as categorical or not depends on the question asked.
Subsetting
A common operation is to subset data sets, that is, to only retain some of the records. The specification of what to retain can be done in many different ways. There are many ways of achieving it:
by numeric index:
slice: To retain a pre-specified set of rows, we can use the slice function from dplyr.
Select: the select function, also from dplyr, allows us to retain a given set of columns
# first rowslice(dat, 1)
# A tibble: 1 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torge… 39.1 18.7 181 3750
# ℹ 2 more variables: sex <chr>, year <dbl>
#slice(df, 1)# first, third and fifth rowslice(dat, c(1, 3, 5))
# A tibble: 3 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torge… 39.1 18.7 181 3750
2 Adelie Torge… 40.3 18 195 3250
3 Adelie Torge… 36.7 19.3 193 3450
# ℹ 2 more variables: sex <chr>, year <dbl>
#slice(df, c(1, 3, 5))# Check if dat and df give the same result:# first to third column using tibbleon_dat =select(dat,1:3)# first to third column using data.frameon_df =select(df,1:3)table(on_dat==on_df)
TRUE
1030
Side note
A more intuitive way, which reduces the number of intermediate variables that are explicitly created and scales to more complex operations, is to use the pipe operator (|>).
# Select column 1:3, and row 1:5 of the dat tibble or df data.framedat |>slice(1:5) |>select(1:3)
# A tibble: 5 × 3
species island bill_length_mm
<chr> <chr> <dbl>
1 Adelie Torgersen 39.1
2 Adelie Torgersen 39.5
3 Adelie Torgersen 40.3
4 Adelie Torgersen NA
5 Adelie Torgersen 36.7
square brackets []: Subsetting can also be done using base R commands.
# first five rows, first three columnsdat[1:5, 1:3]
# A tibble: 5 × 3
species island bill_length_mm
<chr> <chr> <dbl>
1 Adelie Torgersen 39.1
2 Adelie Torgersen 39.5
3 Adelie Torgersen 40.3
4 Adelie Torgersen NA
5 Adelie Torgersen 36.7
by row- and/or column names with:
Can be achieved with select or square brackets. An important shortcut to extract a single column is provided by the $ sign
# Use row number and colnamesdat[1:5, c("species", "island", "bill_length_mm")]
# A tibble: 5 × 3
species island bill_length_mm
<chr> <chr> <dbl>
1 Adelie Torgersen 39.1
2 Adelie Torgersen 39.5
3 Adelie Torgersen 40.3
4 Adelie Torgersen NA
5 Adelie Torgersen 36.7
Side note
There is an important difference between a tibble and a data.frame when it comes to subsetting using the bracket operator. By default, extracting a single column from a data.frame returns a vector (not a data.frame), while performing the same operation on a tibble returns another tibble. The key to getting a data.frame when extracting a column from a data.frame is to add the argument drop = FALSE. The $ operator always returns a vector.
# extract the first five elements in the 'species' columnstr(df[,'species']) # vector
%in%: %in% This operator is used to identify if a series of elements of a vector belong to another vector. This is also very useful for subsetting data frames
Adding a new column to a data set, or replacing an existing column
mutate: function from dplyr
# Add another column to `dat` that contains the sum of the `bill_length_mm` and `bill_depth_mm` columns.dat |>mutate(lpd = bill_length_mm + bill_depth_mm)
# A tibble: 344 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm
<chr> <chr> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181
2 Adelie Torgersen 39.5 17.4 186
3 Adelie Torgersen 40.3 18 195
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen 36.7 19.3 193
6 Adelie Torgersen 39.3 20.6 190
7 Adelie Torgersen 38.9 17.8 181
8 Adelie Torgersen 39.2 19.6 195
9 Adelie Torgersen 34.1 18.1 193
10 Adelie Torgersen 42 20.2 190
# ℹ 334 more rows
# ℹ 4 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>, lpd <dbl>
#Add column in `dat` that contains the bill length in cm. Display only the species column and the newly created columndat |>mutate(bill_length_cm = bill_length_mm /10) |>select(species, bill_length_cm)
$: $ can also be used to add a new column to a data.frame
dat$bill_length_cm = dat$bill_length_mm /10
Finding all unique combinations
distinct(): function from dplyr which returns all distinct rows in a data.frame or tibble.
dat |>distinct(species, island)
# A tibble: 5 × 2
species island
<chr> <chr>
1 Adelie Torgersen
2 Adelie Biscoe
3 Adelie Dream
4 Gentoo Biscoe
5 Chinstrap Dream
Computing summary statistics
Common summary statistics for numerical variables are mean and standard deviation:
mean(dat$bill_length_cm) # return NA
[1] NA
# Deal with missing datamean(dat$bill_length_cm,na.rm = T)
[1] 4.392193
sd(dat$bill_length_cm,na.rm = T)
[1] 0.5459584
For our example it seems more useful to split the flipper length according to species and/or sex and compute summary statistics on these data subsets.
The data set can be split in six groups: 1. Adelie.male 2. Adelie.female 3. Chinstrap.male 4. Chinstrap.female 5. Gentoo.male 6. Gentoo.female
# Split the data in four groups: dataSplit <-split(dat$flipper_length_mm, list(dat$species, df$sex))dataSplit # the result is a list with for each group, the flipper_length_mm of the corresponding species and sex
# Apply on each group, the mean calculation: groupMeans <-lapply(dataSplit, mean) # lapply is specific to apply a function on a list groupMeans # Output is the mean of each group, saved in a list
This scatter plot is not so meaningful because the x-axis represents only the order of the Penguins in the data frame (it just happens so that they are ordered by species and sex).
A boxplot might be a better option here: it gives a visual summary of the data spread.
It is best applied to continuous data
The formula interface allows to split the data into subsets before plotting it
par(mar=c(10,5,1,1))boxplot(flipper_length_mm ~ species + sex, data=dat,las=2,xlab='')
Boxplots convey summary statistics (median, data spread and outliers) for large datasets
For small datasets (<= 5 points per group) it is more informative to plot the data points
Secondary plotting functions such as lines() and points() can be called after the main figure has been drawn to add elements
boxplot(flipper_length_mm ~ species + sex, data=dat, outline=F,las=2,xlab='')for (n in1:length(dataSplit)) { # for each group/box x <-jitter(rep(n, length(dataSplit[[n]])),amount=0.1) # amount to shift all the points if same valuepoints(x, dataSplit[[n]], pch=1.2, cex=0.5,col='blue',lwd=0.8)}
With ggplot2
An elegant alternative to the base graphic system is the ggplot2 package system, which is an implementation of “The Grammar of Graphics” (Wilkinson 2005). Here each figure is composed of the same few building blocks, such as data, aesthetics and scales.
To correctly interpret the coefficients of a linear model remember that by default they reflect differences to a baseline
In our model, the baseline (here Intercept) is defined by the set of reference levels Adelie (for species) and female (for sex), respectively.
Lets check how the model coefficients relate to the group means calculated above:
## Intercept. The test statistics evaluate the difference of Adelie.female (baseline) penguin flipper length (average) from 0. As expected, this is highly significant but not at all relevant.groupMeans['Adelie.female']
Adelie.female
187.7945
## speciesChinstrap: Difference between Chinstrap amd Adelie in female groupgroupMeans['Chinstrap.female'] - groupMeans['Adelie.female'] # deference to baseline
Chinstrap.female
3.940774
## sexmale: Difference between female and male for species AdeliegroupMeans['Adelie.male'] - groupMeans['Adelie.female'] # deference to baseline
Adelie.male
4.616438
## speciesChinstrap:sexmale: A difference of differences (called 'interaction')(groupMeans['Chinstrap.male'] - groupMeans['Chinstrap.female']) - (groupMeans['Adelie.male'] - groupMeans['Adelie.female'] )
Chinstrap.male
3.560032
Exercises
In the exercise we will use data from a glucose response experiment (secretin data). Five patients with arterial hypertension were dosed with the hormone secretin. Among other roles, secretin is stimulating insulin release from the pancreas after oral glucose intake. Blood glucose levels were recorded in duplicates before administration and in intervals of 20, 30, 60 and 90 minutes after administration of the hormone. The dataset is a subset of the secretin data from the ISwR R package.
1. Read-in
Read-in the example dataset using an appropriate import function. Consult the help pages if you are uncertain about the function usage.
Solution 1
# Import the datasetdata = readr::read_csv("data/01/secretin.csv")
Rows: 50 Columns: 4
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (3): patient, time, replicate
dbl (1): glucose
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
2. Data structure
How is the data structured? Which variables are present and what are their data types?
Solution 2
data
# A tibble: 50 × 4
glucose patient time replicate
<dbl> <chr> <chr> <chr>
1 92 A pre a
2 93 A pre b
3 84 A 20 a
4 88 A 20 b
5 88 A 30 a
6 90 A 30 b
7 86 A 60 a
8 89 A 60 b
9 87 A 90 a
10 90 A 90 b
# ℹ 40 more rows
str(data)
spc_tbl_ [50 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ glucose : num [1:50] 92 93 84 88 88 90 86 89 87 90 ...
$ patient : chr [1:50] "A" "A" "A" "A" ...
$ time : chr [1:50] "pre" "pre" "20" "20" ...
$ replicate: chr [1:50] "a" "b" "a" "b" ...
- attr(*, "spec")=
.. cols(
.. glucose = col_double(),
.. patient = col_character(),
.. time = col_character(),
.. replicate = col_character()
.. )
- attr(*, "problems")=<externalptr>
3. Data “clean-up”
If needed, change the data type of the column of the data frame so that they are represented with a relevant data type. When needed, relevel factors with the function relevel() to have a meaningful reference level.
Solution 3
## Some columns are better treated as factordata$patient <-factor(data$patient)data$time <-factor(data$time)data$replicate <-factor(data$replicate)## Relevel time to have "pre" as referencedata$time <-relevel(data$time, ref="pre")
4. Exploring
How many observations per time and/or patient are present? Use the functions split(), sapply() and length() to answer the question
Be careful about differentiating biological and technical replication. Here the patients represent the biological replicates, and the 2 measurements per patient and time point are purely technical replicates. It is rarely the case that technical variation is larger than biological variation, and if so, it often can be ignored. We will average the technical replicates for the next steps of the analysis.
Use the aggregate() function to average the glucose levels for each time/patient combination across replicates
Solution 5
data <-aggregate(glucose ~ time+patient, data, FUN=mean) head(data)
time patient glucose
1 pre A 92.5
2 20 A 86.0
3 30 A 89.0
4 60 A 87.5
5 90 A 88.5
6 pre B 85.0
6. Plotting
Construct a boxplot of replicate-averaged glucose levels versus time points:
Overlay it with the single data points
Highlight the different patients with different colors
Solution 6
ggplot(data = data, aes(x = time, y = glucose)) +geom_boxplot() +geom_point(aes(color=patient), size =2.5) +ggtitle('Secretin Data') +xlab('Time') +ylab('Glucose Level') +theme_minimal(base_size =14)
## Of note, with only 5 points per condition a boxplot is not an optimal plot choice. Plotting only the points is simpler, and since the observations are available for every patient, we can connect them with lines to see the evolution with time. ggplot(data = data, aes(x = time, y = glucose)) +geom_point(aes(color=patient), size =2.5) +geom_line(aes(group=patient)) +ggtitle('Secretin Data') +xlab('Time') +ylab('Glucose Level') +theme_minimal(base_size =14)
7. Subsetting
Create a subset of the (replicate-averaged) data containing only time-point pre and 20 using the function subset().
Solution 7
dataSubset <-subset(data, time =="pre"| time =="20")## alternative solution using %in%dataSubset <-subset(data, time %in%c("pre","20"))
8. Linear model
On this reduced dataset, run a linear model to evaluate the change in glucose levels at 20 minutes relative to the baseline levels. What do you conclude?
Solution 8a
mod.time <-lm(glucose ~ time, dataSubset)summary(mod.time)
Call:
lm(formula = glucose ~ time, data = dataSubset)
Residuals:
Min 1Q Median 3Q Max
-8.800 -3.425 0.700 2.200 9.200
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 90.300 2.490 36.258 3.67e-10 ***
time20 -6.500 3.522 -1.846 0.102
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.569 on 8 degrees of freedom
Multiple R-squared: 0.2986, Adjusted R-squared: 0.2109
F-statistic: 3.406 on 1 and 8 DF, p-value: 0.1022
# It seems that the strong downregulation of glucose (-6.5 units in 20 minutes) is not significant (p=0.1). But the R-squared value indicates a bad fit. # The boxplots above already indicated a large patient specific offset. So its probably a good choice to account for it in the model.
Implement a model taking the sample pairing by patients into account. What do you conclude now?
Solution 8b
mod.time.patient <-lm(glucose ~ time + patient, dataSubset)summary(mod.time.patient)
#Additionally accounting for patients makes the strong downregulation of glucose highly significant (p=0.0066) and removes a lot of unexplained variance (the R-squared is now 95%). Although the patient-specific effects are usually not of direct interest, its important to address them in the model!
Compare the first model with the patient offset model using anova
Solution 8c
anova(mod.time, mod.time.patient)
Analysis of Variance Table
Model 1: glucose ~ time
Model 2: glucose ~ time + patient
Res.Df RSS Df Sum of Sq F Pr(>F)
1 8 248.10
2 4 15.75 4 232.35 14.752 0.01158 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What do you conclude now?
Solution 8d
# The anova result tells us that the "time.patient" model explains the data significantly better than expected. Here, a better explanation means a smaller residual variation. Remember, the "time.patient" model is more complex (has 4 more parameters compared with the simple "time" model), so it has to explain the data better. But this expectation is exceeded making it significantly better.
9. Writing to a file
Write the summary for the model coefficients of the previous model to a text file as well as your R session information (using SessionInfo())