1 Part 1: Exam

2 Part 2: Recap exercise

2.1 Introduction

We will reanalyze an RNA-seq dataset published in the following paper:

Bou Sleiman MS et al. Genetic, molecular and physiological basis of variation in Drosophila gut immunocompetence. Nature Communications. 2015;6:7829.

The authors studied how genetic variation impacts gut immunocompetence in Drosophila melanogaster. They performed RNA-seq on gut samples after Pseudomonas entomophila infection, or in unchallenged condition, in susceptible and resistant strains of flies (from the Drosophila Genetic Reference Panel).

  • Find the GEO record associated to this paper.
  • What type of data is available for download on GEO? Where can you find the raw data?
  • Which protocol was used to generate this dataset?
  • Which types of reads were generated? How long are they?
  • How many biological replicates condition does the dataset include?
  • Which aligner was used to align the reads?
  • The GEO record including the RNA-seq data can be found at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE59411.
  • Only processed data (RPKM) are available from GEO. Raw data are available from SRA.
  • The Illumina Truseq stranded protocol was used
  • Single-end reads of 101bp were generated. This info is only available from SRA, by looking at a specific sample (e.g., https://www.ncbi.nlm.nih.gov/sra/SRX652733[accn]) and then at a specific run (e.g., https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1515119)
  • There are 16 samples in total and 4 conditions (untreated/susceptible, untreated/resistant, treated/susceptible, treated/resistant), so 4 replicates per condition.
  • Unfortunately this information is not available neither in GEO, nor in the paper. It is thus not a good idea to rely on the processed data for the reanalysis as we cannot control whether the software/analysis choices are reasonable. In general it is anyway a good idea to start over from the read mapping steps to have control of the whole data analysis process.

2.2 Summary of previous steps

Today, to be able to move forward faster to the expression data analysis with Bioconductor, we already did the pre-processing:

  • Raw data files were downloaded from SRA.
  • Reads were pseudo-aligned to the D. melanogaster transcriptome with the tool Kallisto.
  • A DGEList object was created and saved to an .rds object.

  • Download the DGEList object provided on the “Code, Datasets” column of the course page
  • Import it into R in a variable called dge.
  • Have a look at this dge object. What slots are analogous to the assays, colData and rowData slots of the summarizedExperiment class (cf. lecture 1)?
  • Was the dataset normalized?
# To import the .rds file: 
dge <- readRDS("data/08/DGE.rds")
## or:
#dge <- readRDS(url("https://ivanek.github.io/analysisOfGenomicsDataWithR/data/08/DGE.rds"))
  • The different slots of the dge object are
    • dge$counts -> assays
    • dge$genes -> rowData
    • dge$samples -> colData
  • The dataset was not normalized: all normalization factors in dge$samples$norm.factors are equal to 1.

2.3 Dataset filtering

NOTE: Through the next steps, the R commands are mostly provided but a few blanks are left for your to fill.

It is a good practice to filter out the genes which have a very low expression level in “most” samples. There is low power to detect any differential expression among these genes, and filtering them reduces the burden of multiple testing. We first calculate the CPM values after a first round of TMM normalization, and we then filter genes based on these CPM values:

dge <- calcNormFactors(dge, method = "TMM")
keep <- rowSums(cpm(dge) > ...) >= ... # What do you think would be reasonable values here?
## Subset genes, while updating the total counts in dge$samples
dge <- dge[keep, , keep.lib.sizes=FALSE] 
## Update TMM normalization factors in the new dge object
dge <- calcNormFactors(dge, method = "TMM") 
  • A reasonable filtering approach is to remove genes that are not expressed above 1 CPM (this is an arbitrary number on the lower side of the CPM spectrum) in at least 4 samples (which corresponds to the number of biological replicates in this experiment).
dge <- calcNormFactors(dge, method = "TMM")
keep <- rowSums(cpm(dge) > 1) >= 3 # n=4-1 because we have 4 replicates per condition
## See also the filterByExpr() function from the edgeR package

## keep
##  4311  9342
## Subset genes, while updating the total counts in dge$samples
dge <- dge[keep, , keep.lib.sizes=FALSE] 
## Update TMM normalization factors in the new dge object
dge <- calcNormFactors(dge, method = "TMM") 

2.4 Quality control

It is important to check that all samples can be included in the analysis. We can check how they correlate to each other to pinpoint any outlier sample.

  • Using the code below, calculate the correlation matrix across samples using the log-transformed normalized CPM values.
  • Visualize this matrix on a heatmap. The function heatmap.2 from the gplots package offers more flexibility than some basic functions such as heatmap
  • Use the information from the dge$samples slots to show the sample conditions in the heatmap.
  • What can we conclude from this heatmap?
log2cpm <- cpm(dge, log=TRUE, normalized.lib.sizes=...)

## Visualize their distribution
plotDensities(log2cpm, group=dge$samples$group, legend = "topright")

## Calculate sample correlation matrix
cor_matrix <- cor(log2cpm, method = "pearson")

## Plot heatmap
          margins = c(2, 12),
          RowSideColors=c("orange", "green")[dge$samples$...],
          ColSideColors=c("red", "blue")[dge$samples$...])
log2cpm <- cpm(dge, log=TRUE, normalized.lib.sizes=TRUE)
plotDensities(log2cpm, group=dge$samples$group, legend = "topright")