# 1 Goal

Identify differentially expressed (DE) genes.

• Data Preparation
• Normalization
• Data Filtering
• Re-normalization
• Vooming/Data Transformation
• Setting-up model and testing for DE

# 2 Data Preparation

Download file 06_counts.txt from our GitHub page and store it in YOUR_Working_Directory/data/06/.

library(RColorBrewer)  # for nice colors
library(limma)
library(edgeR)
library(org.Mm.eg.db)
dir.create("data/06", recursive = TRUE)
"data/06/06_counts.txt")
## this file contain row.names and col.names => header=TRUE row.names are ENTREZID
## width column containg the length of exons (to be more precise, union of
## exons)
head(cnts)
##           width ES.RNA.a ES.RNA.b NP.RNA.a NP.RNA.b TN.RNA.a TN.RNA.b
## 100009600  4352      181      139      134       59       82      110
## 100009609  2538        0        0        0        0        0        0
## 100009614   564        0        0        0        0        0        0
## 100009664  2396        0        0        0        0        0        0
## 100012     1854        0        0        0        0        0        0
## 100017     2736       64       50       36       25        2       38

We can build a DGEList object from the count table together with - gene annotation (gene name and symbol from annotation package) - factor describing the experimental group for every sample

By setting the remove.zeros parameter to TRUE we can remove genes with 0 counts in all samples.

# gene annotations
annot <- select(org.Mm.eg.db, keys = rownames(cnts), keytype = "ENTREZID", columns = c("ENTREZID",
"SYMBOL", "GENENAME"))
rownames(annot) <- annot$ENTREZID ## add width column to annotation and remove it from count table annot$LENGTH <- cnts[, "width"]
cnts <- cnts[, -1]
head(annot)
##            ENTREZID        SYMBOL                                           GENENAME LENGTH
## 100009600 100009600         Zglp1                   zinc finger, GATA-like protein 1   4352
## 100009609 100009609       Vmn2r65                         vomeronasal 2, receptor 65   2538
## 100009614 100009614       Gm10024                               predicted gene 10024    564
## 100009664 100009664 F630206G17Rik                         RIKEN cDNA F630206G17 gene   2396
## 100012       100012          Oog3                                        oogenesin 3   1854
## 100017       100017       Ldlrap1 low density lipoprotein receptor adaptor protein 1   2736
# experimental groups
colnames(cnts)
## [1] "ES.RNA.a" "ES.RNA.b" "NP.RNA.a" "NP.RNA.b" "TN.RNA.a" "TN.RNA.b"
groups <- sub(".RNA.[ab]", "", colnames(cnts))
groups <- factor(groups)
groups
## [1] ES ES NP NP TN TN
## Levels: ES NP TN
## generate DGEList object, you can use the remove.zeros=TRUE to get rid of genes
## with 0 counts
dge <- DGEList(counts = cnts, group = groups, genes = annot, remove.zeros = FALSE)
dge <- DGEList(counts = cnts, group = groups, genes = annot, remove.zeros = TRUE)  # overwrites the object
dge
## An object of class "DGEList"
## $counts ## ES.RNA.a ES.RNA.b NP.RNA.a NP.RNA.b TN.RNA.a TN.RNA.b ## 100009600 181 139 134 59 82 110 ## 100017 64 50 36 25 2 38 ## 100019 15802 3791 2140 1037 1342 1432 ## 100033459 0 0 0 0 1 0 ## 100034361 15 5 6 4 6 8 ## 18901 more rows ... ## ##$samples
##          group lib.size norm.factors
## ES.RNA.a    ES  8877288            1
## ES.RNA.b    ES  4218362            1
## NP.RNA.a    NP  5916334            1
## NP.RNA.b    NP  3491889            1
## TN.RNA.a    TN  4628253            1
## TN.RNA.b    TN  5163895            1
##
## $genes ## ENTREZID SYMBOL GENENAME LENGTH ## 100009600 100009600 Zglp1 zinc finger, GATA-like protein 1 4352 ## 100017 100017 Ldlrap1 low density lipoprotein receptor adaptor protein 1 2736 ## 100019 100019 Mdn1 midasin AAA ATPase 1 21654 ## 100033459 100033459 Ifi208 interferon activated gene 208 5250 ## 100034361 100034361 Mfap1b microfibrillar-associated protein 1B 1887 ## 18901 more rows ... # 3 Normalizing gene expression data External factors arising from sample preparation or sequencing process can affect the expression level in individual samples (i.e. different sequencing depth). We assume that all samples should have a similar range and distribution of expression values (majority of genes are NOT differentially expressed) One of the possibilities would be to use ‘counts-per-million’ (CPM) or rather log2 CPM values for visualization and downstream analysis. In this case all counts are divided by total counts per sample (library size) and multiplied by 1 million. We additionally use calcNormFactors function from edgeR package to normalize the counts by the method of trimmed mean of M-values (TMM) (Robinson and Oshlack 2010). A trimmed mean is the average after removing the upper and lower x% of the data. • The TMM can eliminate effect of few genes which have very high counts (i.e. in muscle myosin, actin) and which would dominate the library size calculation and have an effect on simple normalization by using only library size • The calculated normalization factors are stored in dge$samples$norm.factors and used as a scaling factor for the library sizes. For our data, the effect of TMM-normalization is mild, as libraries are of similar size, there are no ‘outlier’ gene(s) and therefore most of the scaling factors are relatively close to 1. dge <- calcNormFactors(dge, method = "TMM") dge$samples\$norm.factors
## [1] 0.8719417 0.9597959 1.0564716 1.0375404 1.0081419 1.0813074

To demonstrate the utility of normalization, we duplicate and adjust the data: - counts of the first sample are reduced to 10% of their original values - counts of the second sample are multiplied by 5 - for the third sample we boost the expression of 3 genes by 100 fold.

The scaling factors for this example would be:

## [1] 1.3590202 1.3027145 0.2205393 1.3880234 1.3420554 1.3749016

# 4 Data Filtering

Genes with zero counts were already removed by specifying remove.zeros=TRUE parameter to DGEList function. However genes with only few counts are still present in the DGEList.

It is hard to say if those counts simply represent some noise from data production or if these are truly lowly expressed genes. However, these low counts typically do to provide sufficient evidence for differential expression analysis.

The goal today is to test every single gene for difference in expression across experimental groups. We will perform several thousands of test and we will therefore need to correct the obtained p-values from test for multiple testing problem. The strength of the correction depends on the number of tests: more tests $$\rightarrow$$ stronger the correction (more details later).

Therefore it is meaningful to remove some genes from the downstream analysis and by that reduce the number of performed tests. Keep in mind, that we try to remove only genes that would not be considered as differentially expressed (they have low counts in all samples).

### 4.0.1 How to select threshold?

We typically recommend to check the distributions of log2CPM values and select a appropriate threshold for filtering. We then require that the gene must have log2 CPM values above this threshold in at least ‘n’ samples.

We never specify in which samples (which experimental group), it is therefore non-specific filtering.