1 Introduction

1.1 History

R is a free implementation of the S programming language (1976)
First stable version in 2000
The R software is written primarily in C and Fortran

1.2 Why study R?

R is free, open-source, and runs on all platforms
Very comprehensive and versatile
High quality plotting and data visualization
Most widely used language for data science, together with Python.
Large community, increasing over the years: https://www.techrepublic.com/article/r-programming-language-continues-to-grow-in-popularity/, https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010372#sec009
More than 20,000 packages on the R central repository (CRAN, https://cran.r-project.org/web/packages/ ).
For biologists: 2,300 packages dedicated to analysis of genomics data (the Bioconductor project, http://bioconductor.org/ ).

1.3 Aims of the course

Give you a basic understanding of R
Allow you to implement a basic data analysis workflow, or modify an existing one
Get you on board for more advanced courses, for example: https://ivanek.github.io/analysisOfGenomicsDataWithR/

1.4 Challenges

The learning curve can be steep 🤯
R is evolving fast, e.g., there are new packages released every day
There are multiple ways to do something… More or less readable… More or less efficient…

2 Practical information

2.1 Course website

https://ivanek.github.io/introductionToR/
Handouts and link to supplementary data files
Updated after lecture to include answers to exercises
R installation instructions
Details on Exams
Bookmark it!

2.2 Rstudio

Rstudio is an integrated development environment (IDE). It is probably the most ergonomic option for learning and using R nowadays.

Get familiar with Rstudio main panes and their role: Console, Help, Files, Plots, Environment, History

3 Working directory in R

Whenever you are working in R, it is important to know where you are located on the computer, and where the data you want to work with is located.
To keep things ordered, it is good to create a directory on the computer for the course, and maybe a subdirectory for each lecture, a data directory for saved datasets, etc.

3.1 File system structure

A file system is a hierarchy that you can navigate.
Just in case: a “folder” is the same as a “directory”

Source: Software Carpentry

What is at the top of the hierarchy?

The “root”: on Linux or macOS this is represented by a /
What is the absolute path to file paper.doc?

/projectX/paper/paper.doc
What is the absolute path to file CONF.txt?

/projectX/analysis/data/CONF.txt
If I am located in the paper directory, what is the relative path to the file CONF.txt?

../analysis/data/CONF.txt
If I am located in the paper directory, what is the relative path to the file paper.doc?

./paper.doc
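
These path manipulations can also be tried from within R. The sketch below reuses the example paths from the questions above. The files do not need to exist, since the base R functions dirname(), basename() and file.path() only manipulate the path strings.

```r
## Absolute path from the example above (the file itself does not need to exist)
abs_path <- "/projectX/analysis/data/CONF.txt"

dir_part  <- dirname(abs_path)   # directory containing the file
file_part <- basename(abs_path)  # file name only

## Build the relative path from the "paper" directory, piece by piece
rel_path <- file.path("..", "analysis", "data", "CONF.txt")

dir_part   # "/projectX/analysis/data"
file_part  # "CONF.txt"
rel_path   # "../analysis/data/CONF.txt"
```

Building paths with file.path() avoids typing the separators by hand.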

3.2 Knowing where you are “located”

getwd()

## [1] "/Users/rouxj/work/Teaching/introductionToR"

3.3 Moving around in the file system

setwd("/Users/rouxj/work/Teaching/introductionToR/01-introduction/")
setwd("01-introduction/")
setwd("../introductionToR/01-introduction/")

Which alternative do you think is safer?
What could be a potential problem with it?

The first option, because it works wherever you are currently located on the computer. It might however be problematic to share your code with someone, since the path on a different computer will likely be different. In this case, a relative path is safer.
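
To try this without touching your own files, here is a minimal sketch that creates a hypothetical course layout inside R's temporary directory (the directory names are only illustrative) and moves into it with an absolute and then a relative path.

```r
## Hypothetical course layout, created in a temporary directory for illustration
course_dir  <- file.path(tempdir(), "introductionToR")
lecture_dir <- file.path(course_dir, "01-introduction")
dir.create(lecture_dir, recursive = TRUE, showWarnings = FALSE)

old_wd <- getwd()         # remember where we started

setwd(course_dir)         # absolute path: works from anywhere on this computer
setwd("01-introduction")  # relative path: only works from inside course_dir
here <- getwd()

setwd(old_wd)             # go back to the starting directory
basename(here)            # "01-introduction"
```

The second setwd() call would fail if the first one had not been run, which is the limitation of relative paths discussed above.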

4 Let’s start!

4.1 Assignment

You can store the values or results in variables for later use. There are two ways of assigning a value to a variable:

= single equal sign
<- arrow pointing to the left

x <- 10
y = 10

## Display the value of a variable
x
y

## A syntactically valid name consists of letters, numbers and the . or _ characters
x1 <- 10 
x_1 <- 10
x.1 <- 10

## But no numeric at beginning of variable name
1x <- 10 ## does not work

## Variables can be assigned something else than numbers
h <- "hello"

Look at the Environment pane in Rstudio. Do you see these new variables and their values?

NOTE: variables are most often referred to as “objects” in R, and in the course the two words are used as synonyms (for more information see https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects)
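
If you are unsure whether a name is syntactically valid, one option is the base R function make.names(), which returns a name unchanged when it is valid and modifies it otherwise. This is only a quick check, not something you need in everyday code.

```r
## Valid names are returned unchanged
candidates <- c("x1", "x_1", "x.1")
checked <- make.names(candidates)
checked

## An invalid name (starting with a number) gets modified
fixed <- make.names("1x")
fixed  # "X1x"

## class() tells you what kind of value an object holds
x <- 10
h <- "hello"
class(x)  # "numeric"
class(h)  # "character"
```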

4.2 Case sensitivity

a <- 2
A <- 3

4.3 Execution of commands

Commands are executed when you type a return character (Enter). To launch multiple commands at once, separate them with a semicolon ;

a <- 2; A <- 3

4.4 Comments

The # character starts a comment. Everything after it on the same line is ignored by R.

# You can see that variable b is not added to your environment
# b <- 2

4.5 Blank spaces

They play no role as long as the code is not ambiguous. Use spacing to make your code more readable.

b <- 2
b<-2     ## Same result, but less readable
b < - 2  ## This has a different meaning!

## [1] FALSE
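
The reason for the FALSE above is that R reads the last line as a comparison between b and the number -2, not as an assignment. A short sketch to convince yourself (the object name comparison is only used here):

```r
b <- 2
comparison <- b < - 2   # parsed as b < (-2): a comparison, not an assignment
comparison              # FALSE
b                       # b still holds 2
```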

4.6 Calculations

Variables can be used for basic calculations using basic operators

1 * 3
a * 3
a * A
a / A

# Assign result to a new variable
b <- A + a
b

## Of course this has to make sense!
h
h + 1
## Beware: R sometimes performs weird operations between data types without any warning or error!
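
A few illustrations of this warning, as a minimal sketch. Adding a number to a character string fails with an error (caught here with tryCatch() so that the example keeps running), but other mixes of data types are silently converted.

```r
h <- "hello"

## This fails: the error message is captured instead of stopping the script
res <- tryCatch(h + 1, error = function(e) conditionMessage(e))
res

## These run without any warning or error
TRUE + TRUE   # 2: logical values are converted to numbers
"10" == 10    # TRUE: the number is converted to the string "10"
"2" < 10      # FALSE: compared as strings, and "2" sorts after "10"
```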

4.7 Comparisons

## testing equality with double equal sign
a == A
a == h
## testing inequality
a != A

4.8 Functions

R includes numerous built-in functions that can be used to ask R to perform a specific task using pre-written code. Most functions need some information/input in order to run, which is given inside the parentheses and is called “arguments”.

## Some built-in functions and operators
sqrt(a)
a^2
cos(1)
abs(-2)

## Some functions can take more than one argument
rep(1, 5) # repeat 1, 5 times
## It is better for clarity to call the arguments by their names using "="
rep(x=1, times=5)
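
One practical consequence of naming arguments is that their order no longer matters. The sketch below also uses the each argument of rep(), which is documented on its help page.

```r
## Same result with positional and with named arguments, in any order
by_position <- rep(1, 5)
by_name     <- rep(times = 5, x = 1)
identical(by_position, by_name)  # TRUE

## Naming also makes less common arguments explicit
doubled <- rep(x = c(1, 2), each = 2)
doubled  # 1 1 2 2
```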

4.9 Help

Whenever you encounter a new function, it is a good idea to check its help page

help("sqrt")
?sqrt ## Same thing

help("*")
?"*"
?`*`

Look at the different sections of the help pages. The Examples section is usually useful to understand the behavior of a function.

What is this doing?

?help

Can you list the arguments of the help function?

It shows the help page for the help() function!

The arguments topic, package, lib.loc, verbose, try.all.packages or help_type can be used.

If the name of the function is not known, you can (usually) find it with:

help.search("chi square")
??"chi square" ## Same thing

Last resort: search on your favorite search engine!
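
Two other base R tools can complement the help pages; this is a small sketch, not an exhaustive list. apropos() lists the objects whose name contains a given string, and formals() returns the arguments of a function without opening its help page.

```r
## Find functions whose name contains "chisq"
matches <- apropos("chisq")
matches

## List the argument names of the help() function
help_args <- names(formals(help))
help_args
```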

4.10 Autocompletion

This is a very useful feature, for long function or object names. You can type the beginning of the object or function name, and press the TAB key.

## Typing the first few letters of a function
chis ## Press TAB: it's magic!

thisIsaLoooooooongAndWeirdObjectName <- 10
this ## Press TAB: it's magic!

## Typing the first few letters of a function (not unique to a single function)
## Beware, the Rstudio and R Console behaviors differ a bit!
help. 
## - Rstudio will open a suggestion box: choose with the arrows up and down
## - On the console you can press TAB to see which functions start with "help."
##   then add the few letters that discriminate a unique name, 
##   and press TAB again to autocomplete
help.se ## Press TAB: it autocompletes!

This saves a lot of time and typos! It is probably weird and not intuitive at first, and you might have to force yourselves to use it at the beginning, but this is worth it.

Rstudio now offers even smarter code completion features. For example, symbols displayed next to each suggestion are used to disambiguate the various completions.

The auto-completion is now “Always On” by default (this can be changed in the Global Options): the possible completions are displayed after a period of typing inactivity.

Create the following variable, and play a bit with auto-completion:

supercalifragilisticexpialidocious <- 1

4.11 Navigating history

You can navigate among the previous commands you used with the up and down arrows
The History Pane of Rstudio might also be useful
Ctrl+r lets you search for previous commands that include a string you enter.

4.12 Parentheses

Parentheses, braces or brackets need to match. If they do not match, the next line will start with a + instead of the usual >. Close the parenthesis on the new line, or cancel the command with Ctrl+c, or ESC.

(a + b)*2
(a + 
b)*2
(a +

4.13 Some particular reserved words in R

NULL represents the null object in R: it is a reserved word. It is often returned when an expression or function results in an undefined value.
NA stands for “Not Available” and is a logical constant of length 1 which contains a missing value indicator. Be careful, this is not equivalent to the quoted "NA", which is a character string and has no special meaning in R.

The behavior of these 2 reserved words can be complex, and details are too advanced to discuss within this course. Some other reserved words that are maybe easier to understand are:

NaN stands for “Not A Number” and is typically displayed when an invalid computation was conducted.
Inf and -Inf stand for infinity and negative infinity, and are often the result of a division by 0

pi / 0 ## A non-zero number divided by zero creates infinity
0 / 0  ## NaN

1/0 + 1/0 # Inf
1/0 - 1/0 # NaN

cos(Inf) # NaN
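
Without going into the details, a few base R functions are available to test for these special values. A minimal sketch:

```r
## NULL is an empty object
length(NULL)    # 0
is.null(NULL)   # TRUE

## NA is a missing value: comparing with == does not help, use is.na()
NA == NA        # NA
is.na(NA)       # TRUE
is.na("NA")     # FALSE, this is just a character string

## NaN and Inf
is.nan(0 / 0)     # TRUE
is.finite(1 / 0)  # FALSE

## Many functions have an argument to ignore missing values
m <- mean(c(1, NA, 3), na.rm = TRUE)
m  # 2
```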

4.14 Packages

In addition to the built-in functions (the base R functions), you can also use functions from packages developed by others

kde2d() ## Not found
library(MASS)
kde2d() ## Now we get an error message but the function was found :)

The packages sometimes need to be installed:

## Through CRAN
install.packages("RColorBrewer")

## Through Bioconductor, see http://www.bioconductor.org/install/
install.packages("BiocManager")
BiocManager::install("limma")
library(limma)
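
To check whether a package is already installed before trying to use it, one possibility is requireNamespace(). The sketch below assumes nothing about your installation: it only calls kde2d() if MASS is available, using the package::function syntax and simulated data made up for the example.

```r
## TRUE if the MASS package is installed, FALSE otherwise
has_mass <- requireNamespace("MASS", quietly = TRUE)

if (has_mass) {
  set.seed(1)  # make the simulated data reproducible
  ## package::function calls a function without loading the whole package
  dens <- MASS::kde2d(x = rnorm(100), y = rnorm(100), n = 25)
  dim(dens$z)  # density estimated on a 25 x 25 grid
} else {
  message("MASS is not installed; install it with install.packages(\"MASS\")")
}
```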

4.15 Defining your functions

When some piece of code is needed multiple times, it is often a good idea to write your own function (we will see this later in the course)

4.16 Managing your workspace

ls()   # list the objects in your workspace
rm(h)  # remove object h from workspace

Or look at the Environment Pane in Rstudio!

4.17 Saving objects

Individual objects can be stored in a compressed file (.rds format) with:

saveRDS(a, file="a.rds")
## Load this object again
rm(a)
a <- readRDS(file="a.rds")

But this is not really recommended (unless the objects took a very long time to create). It is better to rerun the full analysis and recreate the objects every time.
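
If you want to check that an object survives the round trip unchanged, here is a minimal sketch. It writes to R's temporary directory (an assumption made so that the example does not leave files in your working directory) and keeps a copy of the object to compare against.

```r
a <- 2
a_before <- a                               # copy kept only for the comparison

rds_file <- file.path(tempdir(), "a.rds")   # temporary location for the example
saveRDS(a, file = rds_file)

rm(a)                                       # the object is gone from the workspace
a <- readRDS(file = rds_file)               # and restored from the file

identical(a, a_before)  # TRUE
```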

4.18 Saving commands to a script

Create an .R script in your working directory
Save there all commands and add comments to explain what you are doing and why.

Your colleagues, collaborators, and your “future self” will thank you! (https://xkcd.com/1421/)

You can for example save these from the History pane of Rstudio
You can then run the whole script using:

source("myscript.R")

Or open the script in Rstudio and run each command one by one (or highlight and run a whole chunk of code) using Ctrl+Enter or Cmd+Enter
Notice that RStudio provides nice syntax highlighting
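
To see source() in action without creating a file by hand, the sketch below writes a tiny hypothetical script to R's temporary directory and then runs it. In practice you would write the script in the Rstudio editor instead.

```r
## Write a small commented script to a temporary file (for illustration only)
script_file <- file.path(tempdir(), "myscript.R")
writeLines(c(
  "## Convert a duration in seconds to minutes and seconds",
  "total_seconds <- 582",
  "minutes <- total_seconds %/% 60",
  "seconds <- total_seconds %% 60"
), con = script_file)

## Run all commands of the script at once
source(script_file)

minutes  # 9
seconds  # 42
```

The objects created by the script are available in your workspace after source() has finished.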

4.19 Rmarkdown documents

R Markdown provides a framework to combine your code, its results, and your comments. The resulting output can be an HTML page, a PDF document, a Word file, a slideshow, etc.

To create a new R Markdown document, go to File -> New File -> R Markdown. Choose a title, an output and save it to a specific folder. This creates a .Rmd file, including some header (surrounded by ---), code chunks (surrounded by ```), and simple text with headers (#) and formatting (e.g., _italics_).
R code chunks can be run by clicking the “Run” button, or using Ctrl+Enter or Cmd+Enter just like R scripts.
To produce a complete output, click the “Knit” button.

4.20 Quitting / saving history

q()
quit()
## Save workspace image? [y/n/c]:

R will ask you if you want to save your workspace and history each time you quit R. We recommend answering n (for “no”), provided that you have saved the important objects in .rds format or, better, saved a script with all commands necessary to regenerate the results (see below).

5 Exercises

Create a .R or .Rmd script to write all the code for the exercise and save it to a specific directory you created for the course.

If needed, change the R working directory (using setwd(...)) to the directory you created for the course.

Create two objects a and b and assign each of them a numeric value (you choose).

a <- 10
b <- 5
## You can of course also use the = sign
a = 10
b = 5

Calculate the value of y for x = 1 using the equation y = a*x + b. Repeat for x = -1

x <- 1
y <- a*x + b
y

## [1] 15

x <- -1
y <- a*x + b # This step is needed again, y is not stored as a formula
y

## [1] -5

How to convert 582 seconds to minutes and seconds? Which arithmetic operators could be useful? Check the help page: ?Arithmetic.

582 %/% 60

## [1] 9

582 %% 60

## [1] 42

## 9 minutes and 42 seconds

Add 2 to a and multiply everything by 4 to the power of 3 (all in one line)

(a + 2)*4^3

## [1] 768

What are alternative ways of entering the value of a millionth (i.e., one over a million) into R?

1/1000000

## [1] 1e-06

10^-6

## [1] 1e-06

1e-6

## [1] 1e-06

Create 2 objects: rational equal to 1/3 and decimal equal to 0.33. Are these objects equal? Which one is bigger?

rational <- 1/3
decimal <- 0.33
rational == decimal

## [1] FALSE

rational > decimal

## [1] TRUE

Update decimal to 0.3333333333333333 (sixteen 3s) and redo the test. What is going on?

decimal <- 0.3333333333333333
rational == decimal

## [1] TRUE

## R is storing an approximate number for 1/3, with a limited number of decimals

## Bonus: other ways to compare two numbers:
identical(rational, decimal)

## [1] TRUE

all.equal(rational, decimal) ## allows testing "near equality"

## [1] TRUE
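
The same limitation of number storage shows up in simple sums. Here is a short sketch with a classic example (the values 0.1, 0.2 and 0.3 are not part of the exercise), which shows why all.equal() is usually preferable to == for decimal numbers.

```r
x <- 0.1 + 0.2

x == 0.3                   # FALSE: the two stored approximations differ slightly
isTRUE(all.equal(x, 0.3))  # TRUE: equal up to a small tolerance

## Display more decimals than R prints by default
sprintf("%.20f", 1/3)
```

all.equal() returns a description of the difference instead of FALSE when the numbers differ, so it is wrapped in isTRUE() when a single TRUE or FALSE is needed.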

There is a built-in constant in R for \(\pi\). Can you find how to get it?

Can you round it to 2 decimals? Tip: look at the help page

What is by default the number of decimals used?

help.search("pi")
help.search("round")
?round

round(pi, digits=2)

## [1] 3.14

The help page indicates round(x, digits = 0), so no decimals are kept by default (i.e., if the function is called like `round(pi)`).

5.1 A last exercise?

In chemistry, the \(pH\) scale is used as a measure of the acidity of a substance. Substances with a \(pH\) less than 7 are acidic, and substances with a \(pH\) greater than 7 are alkaline. The \(pH\) is a measure of the number of active positive hydrogen ions in the solution, defined by the following formula, where \([H+]\) is the concentration of hydrogen ions in the solution:

\(pH=-\log_{10}([H+])\)

What is the pH corresponding to a hydrogen ions concentration of \(10^{-4}\)?
If the concentration of hydrogen ions is doubled, what is the effect on the \(pH\)?
If the \(pH\) is increased from 7 to 7.5, what is the change in the concentration of hydrogen ions?

The pH corresponding to a hydrogen ions concentration of \(10^{-4}\) is 4:

\(-\log_{10}(10^{-4}) = \log_{10}(10^{4}) = 4\)

If the concentration of hydrogen ions is doubled, the \(pH\) is decreased by 0.30103:

\(pH_{after} = -\log_{10}(C*2) = -\log_{10}(C) -\log_{10}(2) = pH_{before} - 0.3010\)

For a \(pH\) of 7, the concentration of hydrogen ions is \(10^{-7}\). For a \(pH\) of 7.5, the concentration of hydrogen ions is:

\(-\log_{10}(C) = 7.5\)

\(\log(C)/\log(10) = -7.5\)

\(\log(C) = -7.5 * \log(10)\)

\(C = e^{-7.5 * \log(10)} = 3.162278 \times 10^{-8}\)

The concentration of hydrogen ions is thus multiplied by \(e^{-7.5 * \log(10)}/10^{-7} = 0.3162278\)
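
The same results can be checked directly in R. This is a minimal sketch: the helper function ph() is hypothetical and only defined here for convenience (writing your own functions is covered later in the course), and log10() is the base R function for the base-10 logarithm.

```r
## Hypothetical helper: pH from the hydrogen ion concentration
ph <- function(concentration) -log10(concentration)

## pH for a concentration of 10^-4
ph_1 <- ph(1e-4)
ph_1  # 4

## Effect of doubling the concentration
delta <- ph(2 * 1e-4) - ph(1e-4)
delta  # about -0.30103

## Concentration at pH 7.5, relative to the concentration at pH 7
conc_75 <- 10^-7.5
ratio   <- conc_75 / 10^-7
conc_75  # about 3.162278e-08
ratio    # about 0.3162278
```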

6 To finish

6.1 Good practices

Really read error messages
Use the autocompletion!
Use a clean working directory structure
Quitting: do not save history and data. This might lead to a big mess later.

6.2 Reproducibility

Create an .R or .Rmd script in your working directory to store all your commands and comments. Think of back-ups!
Always keep in mind reproducibility. If you use R for your research, some reviewer or colleague might ask you for the specific commands that were used to generate some result. Your script needs to allow you to reproduce the full analysis in the future.

6.3 Recommended steps at the beginning of every R script

Start by setting up the working directory (setwd), or create an R project from Rstudio, which will ensure that the working directory is located where the .Rproj file is located. See https://martinctc.github.io/blog/rstudio-projects-and-working-directories-a-beginner%27s-guide/
List what is in workspace with ls() to check that it is clean.
Load libraries with library(...)
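
Put together, the beginning of a script could look like the sketch below. The path in the commented setwd() line is a placeholder to adapt, and the stats package is only used as a stand-in for the packages your own analysis needs.

```r
## 1. Working directory: use an Rstudio project, or set it explicitly
## setwd("path/to/your/course/directory")  # placeholder path, adapt before use
getwd()

## 2. Check that the workspace is clean
leftover <- ls()
if (length(leftover) > 0) {
  message("Workspace is not empty: ", paste(leftover, collapse = ", "))
}

## 3. Load the packages needed by the script (stats is a stand-in here)
library(stats)
```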

6.4 Asking for help

First carefully look at R help pages
Search on your favorite search engine
Talk to colleagues to get a different perspective on the problem
If no success, ask on forums (Stack Overflow) or mailing lists (R or Bioconductor)
Don’t forget to give the result of the function sessionInfo() to let people know what system you are using and what versions of R and packages. Many problems are solved by updating to the latest versions of the packages.
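
The object returned by sessionInfo() can also be inspected piece by piece. A short sketch:

```r
info <- sessionInfo()

info$R.version$version.string  # version of R
info$platform                  # operating system / platform
names(info$otherPkgs)          # attached non-base packages (NULL if none)
```

When asking for help, pasting the full printed output of sessionInfo() is usually the simplest option.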

6.5 Questions

Ask your questions to any teacher during the face-to-face lectures.
Write us! Even better in advance on the Etherpad on ADAM

6.6 Cheat sheets

These simple sheets could be useful at the beginning:

6.7 For next week

“Play” with R
Get used to autocompletion and looking at help pages
Why not try making your first R Markdown document?
Find a dataset, your own or a public one, to practice on something other than the course material

Introduction to the “Introduction to R” course

https://ivanek.github.io/introductionToR/

September 18th, 2024