Exploratory analysis of RNA-seq data in R

Author

Panagiotis Papasaikas, Charlotte Soneson

Published

March 20, 2024

RNAseq analysis


Outline of this course

Week 1:

  1. Basic concepts in RNAseq. From raw data to count data (alignment and gene quantification)

Week 2:

  1. Summarized Experiment,normalization, exploratory analyses and visualization

Week 3:

  1. Differential gene expression, design matrices.

Recap - transcriptomics

RNA-seq quantification

Input data

The input to the quantification is a set of reads and a set of reference sequences (genome/chromosomes or transcripts).

Reading quantifications into R

  • Here, we will focus on the processing of Salmon output.
  • Recall that the main output of Salmon is a text file with transcript abundances. Each sample is stored in a separate text file.
  • Counting tools like Rsubread::featureCounts and QuasR::qCount, on the other hand, directly return gene-level count matrices in your R session.

  • Name is the transcript identifier

  • Length is the annotated transcript length

  • EffectiveLength is the length after accounting for the fact that a read is not equally likely to start anywhere in the transcript

  • TPM is the estimated abundance in transcript-per-million

  • NumReads is the estimated abundance in read counts

  • To read this into R, we can use the tximeta function from the tximeta package.

library(tximeta)
  • The first argument required by the tximeta function is a coldata data.frame, which tells the function which samples we want to import, and where the corresponding quantification files are.
  • As an example, we will use data from the macrophage package. This package contains quant.sf files from Salmon, as well as a coldata file.
  • These files are stored in the macrophage package folder in your R package library. We can find this location and check its content like so:
library(macrophage)
dir <- system.file("extdata", package = "macrophage")
dir
[1] "/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata"
list.files(dir)
[1] "coldata.csv"                   "errs"                         
[3] "gencode.v29_salmon_0.12.0"     "gencode.v29.annotation.gtf.gz"
[5] "PRJEB18997.txt"                "quants"                       
[7] "supp_table_1.csv"              "supp_table_7.csv"             
  • We start by reading the coldata file. If you analyze your own data, you can read this information from a text file, or create the data.frame directly in R.
coldata <- read.csv(file.path(dir, "coldata.csv"))[, c(1, 2, 3, 5)]
dim(coldata)
[1] 24  4
head(coldata)
           names sample_id line_id condition_name
1 SAMEA103885102    diku_A  diku_1          naive
2 SAMEA103885347    diku_B  diku_1           IFNg
3 SAMEA103885043    diku_C  diku_1         SL1344
4 SAMEA103885392    diku_D  diku_1    IFNg_SL1344
5 SAMEA103885182    eiwy_A  eiwy_1          naive
6 SAMEA103885136    eiwy_B  eiwy_1           IFNg
  • In addition to the names column, tximeta requires that coldata has a column named files, pointing to the Salmon output (the quant.sf file) for the respective samples.
coldata$files <- file.path(dir, "quants", coldata$names, "quant.sf.gz")
head(coldata)
           names sample_id line_id condition_name
1 SAMEA103885102    diku_A  diku_1          naive
2 SAMEA103885347    diku_B  diku_1           IFNg
3 SAMEA103885043    diku_C  diku_1         SL1344
4 SAMEA103885392    diku_D  diku_1    IFNg_SL1344
5 SAMEA103885182    eiwy_A  eiwy_1          naive
6 SAMEA103885136    eiwy_B  eiwy_1           IFNg
                                                                                                                      files
1 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata/quants/SAMEA103885102/quant.sf.gz
2 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata/quants/SAMEA103885347/quant.sf.gz
3 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata/quants/SAMEA103885043/quant.sf.gz
4 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata/quants/SAMEA103885392/quant.sf.gz
5 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata/quants/SAMEA103885182/quant.sf.gz
6 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata/quants/SAMEA103885136/quant.sf.gz
all(file.exists(coldata$files))
[1] TRUE
  • Now we have everything we need, and can import the quantifications with tximeta.
  • In this process, we will see that tximeta automatically identifies the source and version of the transcriptome reference that was used for the quantification, and adds some metadata.
  • The imported data will be stored in a SummarizedExperiment container, which you heard about in a previous lecture.
suppressPackageStartupMessages({
  library(SummarizedExperiment)
})
se <- tximeta(coldata = coldata, type = "salmon", dropInfReps = TRUE)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
found matching transcriptome:
[ GENCODE - Homo sapiens - release 29 ]
loading existing TxDb created: 2023-03-22 12:02:57
Loading required package: GenomicFeatures
Loading required package: AnnotationDbi
loading existing transcript ranges created: 2023-03-22 12:02:57
se
class: RangedSummarizedExperiment 
dim: 205870 24 
metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
assays(3): counts abundance length
rownames(205870): ENST00000456328.2 ENST00000450305.2 ...
  ENST00000387460.2 ENST00000387461.2
rowData names(3): tx_id gene_id tx_name
colnames(24): SAMEA103885102 SAMEA103885347 ... SAMEA103885308
  SAMEA103884949
colData names(4): names sample_id line_id condition_name
  • We see that tximeta has identified the transcriptome used for the quantification as GENCODE - Homo sapiens - release 29. How did this happen?
  • In fact, the output directory from Salmon contains much more information than just the quant.sf file! (Side note: this means that it is not advisable to move files out of the folder, or to share only the quant.sf file, since the context is lost):
list.files(file.path(dir, "quants", coldata$names[1]), recursive = TRUE)
 [1] "aux_info/ambig_info.tsv.gz"       "aux_info/bootstrap/bootstraps.gz"
 [3] "aux_info/bootstrap/names.tsv.gz"  "aux_info/exp_gc.gz"              
 [5] "aux_info/expected_bias.gz"        "aux_info/fld.gz"                 
 [7] "aux_info/meta_info.json"          "aux_info/obs_gc.gz"              
 [9] "aux_info/observed_bias_3p.gz"     "aux_info/observed_bias.gz"       
[11] "cmd_info.json"                    "lib_format_counts.json"          
[13] "libParams/flenDist.txt"           "logs/salmon_quant.log.txt"       
[15] "quant.sf.gz"                     
  • In particular, the meta_info.json file contains a hash checksum, which is derived from the set of transcripts used as reference during the quantification and which lets tximeta identify the reference source (by comparing to a table of these hash checksums for commonly used references).
rjson::fromJSON(file = file.path(dir, "quants", coldata$names[1], "aux_info", "meta_info.json"))
$salmon_version
[1] "0.12.0"

$samp_type
[1] "gibbs"

$opt_type
[1] "vb"

$quant_errors
list()

$num_libraries
[1] 1

$library_types
[1] "ISR"

$frag_dist_length
[1] 1001

$seq_bias_correct
[1] FALSE

$gc_bias_correct
[1] TRUE

$num_bias_bins
[1] 4096

$mapping_type
[1] "mapping"

$num_targets
[1] 205870

$serialized_eq_classes
[1] FALSE

$eq_class_properties
list()

$length_classes
[1]    520    669   1065   2328 205012

$index_seq_hash
[1] "40849ed828ea7d6a94af54a5c40e5d87eb0ce0fc1e9513208a5cffe59d442292"

$index_name_hash
[1] "77aca5545a0626421efb4730dd7b95482c77da261f9bdef70d36e25ee68bb7ef"

$index_seq_hash512
[1] "f37ae2a7412efd8518d68c22fc3fcc2478b59833809382abd5d9055475505e516730c70914343af34c9232af92fe22832e70afde6b38c26381097068e1e82551"

$index_name_hash512
[1] "67f286e5ec2895c10aa6e3a8feed04cb3a80747de7b3afb620298078481b303d2f979407fc104d4f3e8af0e82be2acb2f294ee5f2f68c00087f2855120d9f4ed"

$num_bootstraps
[1] 20

$num_processed
[1] 45141218

$num_mapped
[1] 40075160

$percent_mapped
[1] 88.77731

$call
[1] "quant"

$start_time
[1] "Tue Jan 15 21:21:22 2019"

$end_time
[1] "Tue Jan 15 21:29:00 2019"

What’s in the SummarizedExperiment object?

se
class: RangedSummarizedExperiment 
dim: 205870 24 
metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
assays(3): counts abundance length
rownames(205870): ENST00000456328.2 ENST00000450305.2 ...
  ENST00000387460.2 ENST00000387461.2
rowData names(3): tx_id gene_id tx_name
colnames(24): SAMEA103885102 SAMEA103885347 ... SAMEA103885308
  SAMEA103884949
colData names(4): names sample_id line_id condition_name
  • Looking at the size of the SummarizedExperiment object (205,870 rows!) as well as the row names, we see that this object contains transcript-level information.
  • The assays are created by directly importing the values from the quant.sf files and combining this information across the 24 samples:
    • counts - NumReads column
    • abundance - TPM column
    • length - EffectiveLength column
  • We can access any of the assays via the assay or assays functions:
head(assay(se, "counts"), 3)
                  SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392
ENST00000456328.2         22.058         12.404           5.44          0.000
ENST00000450305.2          0.000          0.000           0.00          0.000
ENST00000488147.1        119.092        180.069         161.55         93.747
                  SAMEA103885182 SAMEA103885136 SAMEA103885413 SAMEA103884967
ENST00000456328.2            0.0         10.833          5.119          6.708
ENST00000450305.2            0.0          0.000          0.000          0.000
ENST00000488147.1          145.5        141.607        189.152         98.019
                  SAMEA103885368 SAMEA103885218 SAMEA103885319 SAMEA103885004
ENST00000456328.2          0.000          4.484          0.000          0.000
ENST00000450305.2          0.000          0.000          0.000          0.000
ENST00000488147.1        132.243         88.429         96.871         66.813
                  SAMEA103885284 SAMEA103885059 SAMEA103884898 SAMEA103885157
ENST00000456328.2         27.337         12.401          0.000          6.107
ENST00000450305.2          0.000          0.000          0.000          0.000
ENST00000488147.1        250.127        211.475        144.212        134.842
                  SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021
ENST00000456328.2         23.333         11.794          4.670          0.000
ENST00000450305.2          0.000          0.000          0.000          0.000
ENST00000488147.1        205.167        151.599        134.082         69.402
                  SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949
ENST00000456328.2          7.629          3.907          9.195          0.000
ENST00000450305.2          0.000          0.000          0.000          0.000
ENST00000488147.1        154.885        125.882        183.322         53.233
head(assays(se)[["counts"]], 3)
                  SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392
ENST00000456328.2         22.058         12.404           5.44          0.000
ENST00000450305.2          0.000          0.000           0.00          0.000
ENST00000488147.1        119.092        180.069         161.55         93.747
                  SAMEA103885182 SAMEA103885136 SAMEA103885413 SAMEA103884967
ENST00000456328.2            0.0         10.833          5.119          6.708
ENST00000450305.2            0.0          0.000          0.000          0.000
ENST00000488147.1          145.5        141.607        189.152         98.019
                  SAMEA103885368 SAMEA103885218 SAMEA103885319 SAMEA103885004
ENST00000456328.2          0.000          4.484          0.000          0.000
ENST00000450305.2          0.000          0.000          0.000          0.000
ENST00000488147.1        132.243         88.429         96.871         66.813
                  SAMEA103885284 SAMEA103885059 SAMEA103884898 SAMEA103885157
ENST00000456328.2         27.337         12.401          0.000          6.107
ENST00000450305.2          0.000          0.000          0.000          0.000
ENST00000488147.1        250.127        211.475        144.212        134.842
                  SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021
ENST00000456328.2         23.333         11.794          4.670          0.000
ENST00000450305.2          0.000          0.000          0.000          0.000
ENST00000488147.1        205.167        151.599        134.082         69.402
                  SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949
ENST00000456328.2          7.629          3.907          9.195          0.000
ENST00000450305.2          0.000          0.000          0.000          0.000
ENST00000488147.1        154.885        125.882        183.322         53.233

Exercises

  1. The screenshot of the quant.sf file above was taken from the Salmon output from the sample SAMEA103884898. Read the file into R manually and compare the values in each of the assays in the SummarizedExperiment object to the corresponding values given in the quant.sf file. Note that the file is compressed (the name is quant.sf.gz), but it can still be read by e.g. read.delim().
qf <- read.delim(file.path(dir, "quants", "SAMEA103884898", "quant.sf.gz"), 
                 header = TRUE, as.is = TRUE)
all(rownames(se) == qf$Name)
[1] TRUE
all.equal(unname(assay(se, "counts")[, "SAMEA103884898"]), qf$NumReads)
[1] TRUE
all.equal(unname(assay(se, "abundance")[, "SAMEA103884898"]), qf$TPM)
[1] TRUE
all.equal(unname(assay(se, "length")[, "SAMEA103884898"]), qf$EffectiveLength)
[1] TRUE
  1. Check that for each sample, the sum of the TPMs for all the transcripts is indeed 1 million, as expected.
colSums(assay(se, "abundance"))
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182 
         1e+06          1e+06          1e+06          1e+06          1e+06 
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218 
         1e+06          1e+06          1e+06          1e+06          1e+06 
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898 
         1e+06          1e+06          1e+06          1e+06          1e+06 
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021 
         1e+06          1e+06          1e+06          1e+06          1e+06 
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949 
         1e+06          1e+06          1e+06          1e+06 

Row (feature) annotations and genomic ranges

  • You may have noted that se is in fact a RangedSummarizedExperiment object (rather than “just” a SummarizedExperiment object). What does this mean?
  • Let’s look at the information we have about the rows (transcripts) in the object:
rowRanges(se)
GRanges object with 205870 ranges and 3 metadata columns:
                    seqnames      ranges strand |     tx_id           gene_id
                       <Rle>   <IRanges>  <Rle> | <integer>   <CharacterList>
  ENST00000456328.2     chr1 11869-14409      + |         1 ENSG00000223972.5
  ENST00000450305.2     chr1 12010-13670      + |         2 ENSG00000223972.5
  ENST00000488147.1     chr1 14404-29570      - |      9483 ENSG00000227232.5
  ENST00000619216.1     chr1 17369-17436      - |      9484 ENSG00000278267.1
  ENST00000473358.1     chr1 29554-31097      + |         3 ENSG00000243485.5
                ...      ...         ...    ... .       ...               ...
  ENST00000361681.2     chrM 14149-14673      - |    206692 ENSG00000198695.2
  ENST00000387459.1     chrM 14674-14742      - |    206693 ENSG00000210194.1
  ENST00000361789.2     chrM 14747-15887      + |    206684 ENSG00000198727.2
  ENST00000387460.2     chrM 15888-15953      + |    206685 ENSG00000210195.2
  ENST00000387461.2     chrM 15956-16023      - |    206694 ENSG00000210196.2
                              tx_name
                          <character>
  ENST00000456328.2 ENST00000456328.2
  ENST00000450305.2 ENST00000450305.2
  ENST00000488147.1 ENST00000488147.1
  ENST00000619216.1 ENST00000619216.1
  ENST00000473358.1 ENST00000473358.1
                ...               ...
  ENST00000361681.2 ENST00000361681.2
  ENST00000387459.1 ENST00000387459.1
  ENST00000361789.2 ENST00000361789.2
  ENST00000387460.2 ENST00000387460.2
  ENST00000387461.2 ENST00000387461.2
  -------
  seqinfo: 25 sequences (1 circular) from hg38 genome
  • By knowing the source and version of the reference used for the quantification, tximeta was able to retrieve the annotation files and decorate the object with information about the transcripts, such as the chromosome and position, and the corresponding gene ID. Importantly, Salmon did not use (or know about) any of this during the quantification! It needs only the transcript sequences.
  • The information about transcript locations in the genome is provided in a GRanges object, which you saw already a couple of weeks ago.
  • This gives us lots of possibilities of interacting with our data based on the genomic locations of the features.
  • Let’s say for example that we want to only consider transcripts located between position 1e6 and 2e6 on chromosome 3. We can subset the SummarizedExperiment object (including the assays) to only these features:
## Define the target region
gr <- GRanges(seqnames = "chr3", 
              ranges = IRanges(
                start = 1e6,
                end = 2e6),
              strand = "*")
## Subset SummarizedExperiment by overlap
sesub <- IRanges::subsetByOverlaps(se, gr)
sesub
class: RangedSummarizedExperiment 
dim: 13 24 
metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
assays(3): counts abundance length
rownames(13): ENST00000451707.1 ENST00000463924.3 ... ENST00000397467.2
  ENST00000418972.1
rowData names(3): tx_id gene_id tx_name
colnames(24): SAMEA103885102 SAMEA103885347 ... SAMEA103885308
  SAMEA103884949
colData names(4): names sample_id line_id condition_name
rowRanges(sesub)
GRanges object with 13 ranges and 3 metadata columns:
                    seqnames          ranges strand |     tx_id
                       <Rle>       <IRanges>  <Rle> | <integer>
  ENST00000451707.1     chr3 1008135-1012934      + |     33364
  ENST00000463924.3     chr3 1065580-1065875      - |     39749
  ENST00000446702.6     chr3 1092576-1404217      + |     33365
  ENST00000397479.6     chr3 1092658-1403442      + |     33366
  ENST00000350110.2     chr3 1092661-1403594      + |     33367
                ...      ...             ...    ... .       ...
  ENST00000449338.1     chr3 1394173-1394448      - |     39750
  ENST00000423801.1     chr3 1595777-1596245      - |     39751
  ENST00000415533.1     chr3 1730070-1730474      + |     33371
  ENST00000397467.2     chr3 1905651-1906125      + |     33372
  ENST00000418972.1     chr3 1962381-1987470      - |     39752
                               gene_id           tx_name
                       <CharacterList>       <character>
  ENST00000451707.1  ENSG00000235158.1 ENST00000451707.1
  ENST00000463924.3  ENSG00000244169.3 ENST00000463924.3
  ENST00000446702.6 ENSG00000134115.12 ENST00000446702.6
  ENST00000397479.6 ENSG00000134115.12 ENST00000397479.6
  ENST00000350110.2 ENSG00000134115.12 ENST00000350110.2
                ...                ...               ...
  ENST00000449338.1  ENSG00000232563.1 ENST00000449338.1
  ENST00000423801.1  ENSG00000184423.5 ENST00000423801.1
  ENST00000415533.1  ENSG00000214074.3 ENST00000415533.1
  ENST00000397467.2  ENSG00000214073.2 ENST00000397467.2
  ENST00000418972.1  ENSG00000225044.1 ENST00000418972.1
  -------
  seqinfo: 25 sequences (1 circular) from hg38 genome
  • If we just want the annotation columns, without the ranges, we can get those with the rowData accessor:
rowData(se)
DataFrame with 205870 rows and 3 columns
                      tx_id           gene_id           tx_name
                  <integer>   <CharacterList>       <character>
ENST00000456328.2         1 ENSG00000223972.5 ENST00000456328.2
ENST00000450305.2         2 ENSG00000223972.5 ENST00000450305.2
ENST00000488147.1      9483 ENSG00000227232.5 ENST00000488147.1
ENST00000619216.1      9484 ENSG00000278267.1 ENST00000619216.1
ENST00000473358.1         3 ENSG00000243485.5 ENST00000473358.1
...                     ...               ...               ...
ENST00000361681.2    206692 ENSG00000198695.2 ENST00000361681.2
ENST00000387459.1    206693 ENSG00000210194.1 ENST00000387459.1
ENST00000361789.2    206684 ENSG00000198727.2 ENST00000361789.2
ENST00000387460.2    206685 ENSG00000210195.2 ENST00000387460.2
ENST00000387461.2    206694 ENSG00000210196.2 ENST00000387461.2

Column (sample) annotations

  • Similar to the row annotations in rowData, the SummarizedExperiment object contains sample annotations in the colData slot.
colData(se)
DataFrame with 24 rows and 4 columns
                        names   sample_id     line_id condition_name
                  <character> <character> <character>    <character>
SAMEA103885102 SAMEA103885102      diku_A      diku_1          naive
SAMEA103885347 SAMEA103885347      diku_B      diku_1           IFNg
SAMEA103885043 SAMEA103885043      diku_C      diku_1         SL1344
SAMEA103885392 SAMEA103885392      diku_D      diku_1    IFNg_SL1344
SAMEA103885182 SAMEA103885182      eiwy_A      eiwy_1          naive
...                       ...         ...         ...            ...
SAMEA103885021 SAMEA103885021      podx_D      podx_1    IFNg_SL1344
SAMEA103885262 SAMEA103885262      qaqx_A      qaqx_1          naive
SAMEA103885228 SAMEA103885228      qaqx_B      qaqx_1           IFNg
SAMEA103885308 SAMEA103885308      qaqx_C      qaqx_1         SL1344
SAMEA103884949 SAMEA103884949      qaqx_D      qaqx_1    IFNg_SL1344
  • Storing this information in the object rather than in an external text file or data.frame is very helpful to make sure things stay matched. For example, we can easily subset the object to only the naive and IFNg conditions, and we can be sure that the annotations in colData still match up with the columns of the count matrix:
se[, se$condition_name %in% c("naive", "IFNg")]
class: RangedSummarizedExperiment 
dim: 205870 12 
metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
assays(3): counts abundance length
rownames(205870): ENST00000456328.2 ENST00000450305.2 ...
  ENST00000387460.2 ENST00000387461.2
rowData names(3): tx_id gene_id tx_name
colnames(12): SAMEA103885102 SAMEA103885347 ... SAMEA103885262
  SAMEA103885228
colData names(4): names sample_id line_id condition_name

Exercises

  1. Subset the SummarizedExperiment object to only contain transcripts on the X chromosome, and only samples from the naive group. How many rows and columns do the resulting object contain?
sesub2 <- se[seqnames(rowRanges(se)) == "chrX", 
             se$condition_name == "naive"]
sesub2
class: RangedSummarizedExperiment 
dim: 6431 6 
metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
assays(3): counts abundance length
rownames(6431): ENST00000431238.7 ENST00000399012.6 ...
  ENST00000464205.6 ENST00000507418.6
rowData names(3): tx_id gene_id tx_name
colnames(6): SAMEA103885102 SAMEA103885182 ... SAMEA103885111
  SAMEA103885262
colData names(4): names sample_id line_id condition_name
rowRanges(sesub2)
GRanges object with 6431 ranges and 3 metadata columns:
                    seqnames              ranges strand |     tx_id
                       <Rle>           <IRanges>  <Rle> | <integer>
  ENST00000431238.7     chrX       253743-255091      + |    199270
  ENST00000399012.6     chrX       276322-303353      + |    199271
  ENST00000484611.7     chrX       276324-291537      + |    199272
  ENST00000430923.7     chrX       276353-291629      + |    199273
  ENST00000445062.6     chrX       281055-288869      + |    199274
                ...      ...                 ...    ... .       ...
  ENST00000483079.6     chrX 156022786-156023531      + |    202580
  ENST00000496301.6     chrX 156023367-156025666      + |    202581
  ENST00000483286.6     chrX 156023824-156025554      + |    202582
  ENST00000464205.6     chrX 156024071-156025554      + |    202583
  ENST00000507418.6     chrX 156025664-156027877      - |    205759
                               gene_id           tx_name
                       <CharacterList>       <character>
  ENST00000431238.7  ENSG00000228572.7 ENST00000431238.7
  ENST00000399012.6 ENSG00000182378.13 ENST00000399012.6
  ENST00000484611.7 ENSG00000182378.13 ENST00000484611.7
  ENST00000430923.7 ENSG00000182378.13 ENST00000430923.7
  ENST00000445062.6 ENSG00000182378.13 ENST00000445062.6
                ...                ...               ...
  ENST00000483079.6 ENSG00000182484.15 ENST00000483079.6
  ENST00000496301.6 ENSG00000182484.15 ENST00000496301.6
  ENST00000483286.6 ENSG00000182484.15 ENST00000483286.6
  ENST00000464205.6 ENSG00000182484.15 ENST00000464205.6
  ENST00000507418.6  ENSG00000227159.8 ENST00000507418.6
  -------
  seqinfo: 25 sequences (1 circular) from hg38 genome
dim(sesub2)
[1] 6431    6

Summarizing on the gene level

  • As we saw, the features in the SummarizedExperiment object above are individual transcripts, rather than genes.
  • Most of the time, however, we want to do analysis on the gene level, since the gene-level abundances are more robust and sometimes more interpretable than transcript-level abundances.

Summarizing the row annotations

  • The rowData contains the information about the corresponding gene for each transcript, in the gene_id column.
  • tximeta provides a function to summarize on the gene level.
    • Counts are added up
    • TPMs are added up
    • Transcript lengths are added up after weighting by the respective transcript TPMs
seg <- summarizeToGene(se)
loading existing TxDb created: 2023-03-22 12:02:57
obtaining transcript-to-gene mapping from database
loading existing gene ranges created: 2023-03-22 14:25:40
summarizing abundance
summarizing counts
summarizing length
seg
class: RangedSummarizedExperiment 
dim: 58294 24 
metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
assays(3): counts abundance length
rownames(58294): ENSG00000000003.14 ENSG00000000005.5 ...
  ENSG00000285993.1 ENSG00000285994.1
rowData names(2): gene_id tx_ids
colnames(24): SAMEA103885102 SAMEA103885347 ... SAMEA103885308
  SAMEA103884949
colData names(4): names sample_id line_id condition_name
  • Now we have a new RangedSummarizedExperiment object, with one row per gene.
  • The row ranges have been summarized as well, and can be used for subsetting and interpretation just as for the transcripts.
rowRanges(seg)
GRanges object with 58294 ranges and 2 metadata columns:
                     seqnames              ranges strand |            gene_id
                        <Rle>           <IRanges>  <Rle> |        <character>
  ENSG00000000003.14     chrX 100627109-100639991      - | ENSG00000000003.14
   ENSG00000000005.5     chrX 100584802-100599885      + |  ENSG00000000005.5
  ENSG00000000419.12    chr20   50934867-50958555      - | ENSG00000000419.12
  ENSG00000000457.13     chr1 169849631-169894267      - | ENSG00000000457.13
  ENSG00000000460.16     chr1 169662007-169854080      + | ENSG00000000460.16
                 ...      ...                 ...    ... .                ...
   ENSG00000285990.1    chr14   19244904-19269380      - |  ENSG00000285990.1
   ENSG00000285991.1     chr6 149817937-149896011      - |  ENSG00000285991.1
   ENSG00000285992.1     chr8   47129262-47132628      + |  ENSG00000285992.1
   ENSG00000285993.1    chr18   46409197-46410645      - |  ENSG00000285993.1
   ENSG00000285994.1    chr10   12563151-12567351      + |  ENSG00000285994.1
                                                                         tx_ids
                                                                <CharacterList>
  ENSG00000000003.14  ENST00000612152.4,ENST00000373020.8,ENST00000614008.4,...
   ENSG00000000005.5                        ENST00000373031.4,ENST00000485971.1
  ENSG00000000419.12  ENST00000371588.9,ENST00000466152.5,ENST00000371582.8,...
  ENSG00000000457.13 ENST00000367771.10,ENST00000367770.5,ENST00000367772.8,...
  ENSG00000000460.16  ENST00000498289.5,ENST00000472795.5,ENST00000496973.5,...
                 ...                                                        ...
   ENSG00000285990.1                                          ENST00000649331.1
   ENSG00000285991.1                                          ENST00000647612.1
   ENSG00000285992.1                                          ENST00000648949.1
   ENSG00000285993.1                                          ENST00000650266.1
   ENSG00000285994.1                                          ENST00000648650.1
  -------
  seqinfo: 25 sequences (1 circular) from hg38 genome
rowRanges(IRanges::subsetByOverlaps(seg, gr))
GRanges object with 8 ranges and 2 metadata columns:
                     seqnames          ranges strand |            gene_id
                        <Rle>       <IRanges>  <Rle> |        <character>
  ENSG00000134115.12     chr3 1092576-1404217      + | ENSG00000134115.12
   ENSG00000184423.5     chr3 1595777-1596245      - |  ENSG00000184423.5
   ENSG00000214073.2     chr3 1905651-1906125      + |  ENSG00000214073.2
   ENSG00000214074.3     chr3 1730070-1730474      + |  ENSG00000214074.3
   ENSG00000225044.1     chr3 1962381-1987470      - |  ENSG00000225044.1
   ENSG00000232563.1     chr3 1394173-1394448      - |  ENSG00000232563.1
   ENSG00000235158.1     chr3 1008135-1012934      + |  ENSG00000235158.1
   ENSG00000244169.3     chr3 1065580-1065875      - |  ENSG00000244169.3
                                                                        tx_ids
                                                               <CharacterList>
  ENSG00000134115.12 ENST00000446702.6,ENST00000397479.6,ENST00000350110.2,...
   ENSG00000184423.5                                         ENST00000423801.1
   ENSG00000214073.2                                         ENST00000397467.2
   ENSG00000214074.3                                         ENST00000415533.1
   ENSG00000225044.1                                         ENST00000418972.1
   ENSG00000232563.1                                         ENST00000449338.1
   ENSG00000235158.1                                         ENST00000451707.1
   ENSG00000244169.3                                         ENST00000463924.3
  -------
  seqinfo: 25 sequences (1 circular) from hg38 genome

Exercises

  1. Convince yourself that the summarization rules listed above are correct, e.g., by looking at a few example genes. Hint: the transcript IDs associated with each gene ID are included in the tx_ids column of rowData(seg).
## A gene (ENSG00000001561.6) with a single transcript (ENST00000321037.4)
assay(se, "counts")["ENST00000321037.4", ]
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182 
            75            297             53            196            203 
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218 
           503            161            207             84            348 
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898 
            83            166             14             50             13 
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021 
            22            114            321            110             93 
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949 
           191            544            164            232 
assay(seg, "counts")["ENSG00000001561.6", ]
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182 
            75            297             53            196            203 
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218 
           503            161            207             84            348 
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898 
            83            166             14             50             13 
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021 
            22            114            321            110             93 
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949 
           191            544            164            232 
assay(se, "abundance")["ENST00000321037.4", ]
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182 
      0.771645       3.012179       0.483955       1.783965       2.069614 
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218 
      5.345586       1.531936       2.077714       0.812283       3.780581 
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898 
      0.866642       1.653559       0.201541       0.774063       0.169594 
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021 
      0.290852       1.672535       4.366731       1.264230       1.178936 
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949 
      1.993649       6.215327       1.838931       2.405130 
assay(seg, "abundance")["ENSG00000001561.6", ]
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182 
      0.771645       3.012179       0.483955       1.783965       2.069614 
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218 
      5.345586       1.531936       2.077714       0.812283       3.780581 
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898 
      0.866642       1.653559       0.201541       0.774063       0.169594 
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021 
      0.290852       1.672535       4.366731       1.264230       1.178936 
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949 
      1.993649       6.215327       1.838931       2.405130 
assay(se, "length")["ENST00000321037.4", ]
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182 
      3867.923       3899.012       3904.951       3869.813       3779.375 
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218 
      3767.240       3838.151       3833.323       3869.960       3718.040 
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898 
      3973.670       3845.925       3263.954       3219.796       3378.609 
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021 
      3275.387       3205.206       3473.419       3536.652       3369.202 
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949 
      3568.291       3587.445       3669.451       3661.080 
assay(seg, "length")["ENSG00000001561.6", ]
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182 
      3867.923       3899.012       3904.951       3869.813       3779.375 
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218 
      3767.240       3838.151       3833.323       3869.960       3718.040 
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898 
      3973.670       3845.925       3263.954       3219.796       3378.609 
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021 
      3275.387       3205.206       3473.419       3536.652       3369.202 
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949 
      3568.291       3587.445       3669.451       3661.080 
## A gene (ENSG00000000003.14) with multiple transcripts (ENST00000612152.4, ENST00000373020.8, ENST00000614008.4, ENST00000496771.5, ENST00000494424.1)
gn <- "ENSG00000000003.14"
txs <- c("ENST00000612152.4", "ENST00000373020.8", "ENST00000614008.4", 
         "ENST00000496771.5", "ENST00000494424.1")
colSums(assay(se, "counts")[txs, ])
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182 
        83.818        144.156         71.239        130.372        241.527 
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218 
       140.124        176.200        131.309        106.873         32.232 
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898 
        61.065         34.597         67.708         69.832         54.947 
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021 
        38.000        243.366        151.900        442.684        191.000 
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949 
       953.947        511.918        298.509        217.753 
assay(seg, "counts")[gn, ]
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182 
        83.818        144.156         71.239        130.372        241.527 
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218 
       140.124        176.200        131.309        106.873         32.232 
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898 
        61.065         34.597         67.708         69.832         54.947 
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021 
        38.000        243.366        151.900        442.684        191.000 
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949 
       953.947        511.918        298.509        217.753 
colSums(assay(se, "abundance")[txs, ])
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182 
      1.675253       2.831133       1.289042       2.299402       4.922389 
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218 
      2.889044       3.234795       2.732676       1.979229       0.699002 
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898 
      1.252130       0.662058       1.639050       1.785457       1.252458 
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021 
      0.851824       5.991920       3.754027       9.203870       4.196592 
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949 
     18.465397      10.837159       6.115662       4.146591 
assay(seg, "abundance")[gn, ]
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182 
      1.675253       2.831133       1.289042       2.299402       4.922389 
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218 
      2.889044       3.234795       2.732676       1.979229       0.699002 
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898 
      1.252130       0.662058       1.639050       1.785457       1.252458 
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021 
      0.851824       5.991920       3.754027       9.203870       4.196592 
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949 
     18.465397      10.837159       6.115662       4.146591 
colSums(assay(se, "length")[txs, ] * 
            scale(assay(se, "abundance")[txs, ], 
                  center = FALSE, 
                  scale = colSums(assay(se, "abundance")[txs, ])))
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182 
      1991.076       2013.493       1970.577       1997.052       1890.622 
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218 
      1941.822       1989.272       1848.819       2020.717       1862.530 
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898 
      2023.468       2001.930       1941.024       1949.595       1933.715 
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021 
      1931.731       1909.937       1911.907       1955.005       1943.886 
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949 
      1924.158       1936.135       2008.338       1993.125 
assay(seg, "length")[gn, ]
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182 
      1991.076       2013.493       1970.577       1997.052       1890.622 
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218 
      1941.822       1989.272       1848.819       2020.717       1862.530 
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898 
      2023.468       2001.930       1941.024       1949.595       1933.715 
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021 
      1931.731       1909.937       1911.907       1955.005       1943.886 
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949 
      1924.158       1936.135       2008.338       1993.125 

Adding additional annotations

  • At this point, the only information we have about the genes in our data set, apart from their genomic location and the associated transcript IDs, is the Ensembl ID.
  • Often we need additional annotations, such as gene symbols.
  • Bioconductor provides a host of annotation packages (recall from a previous lecture):
    • OrgDb packages, providing gene-based annotations for a given organism
    • TxDb and EnsDb packages, providing transcript ranges for a given genome build
    • BSgenome packages, providing the genome sequence for a given genome build
  • For our purposes here, the appropriate OrgDb package is the most suitable, since it contains gene-centric ID conversion tables.
  • Since this is human data, we will use the org.Hs.eg.db package.
library(org.Hs.eg.db)

Using addIds

  • The easiest way of adding a column is by using the addIds function from tximeta. For example, to add gene symbols:
seg <- addIds(seg, "SYMBOL")
it appears the rows are gene IDs, setting 'gene' to TRUE
mapping to new IDs using org.Hs.eg.db
if all matching IDs are desired, and '1:many mappings' are reported,
set multiVals='list' to obtain all the matching IDs
'select()' returned 1:many mapping between keys and columns
head(rowData(seg))
DataFrame with 6 rows and 3 columns
                              gene_id
                          <character>
ENSG00000000003.14 ENSG00000000003.14
ENSG00000000005.5   ENSG00000000005.5
ENSG00000000419.12 ENSG00000000419.12
ENSG00000000457.13 ENSG00000000457.13
ENSG00000000460.16 ENSG00000000460.16
ENSG00000000938.12 ENSG00000000938.12
                                                                       tx_ids
                                                              <CharacterList>
ENSG00000000003.14  ENST00000612152.4,ENST00000373020.8,ENST00000614008.4,...
ENSG00000000005.5                         ENST00000373031.4,ENST00000485971.1
ENSG00000000419.12  ENST00000371588.9,ENST00000466152.5,ENST00000371582.8,...
ENSG00000000457.13 ENST00000367771.10,ENST00000367770.5,ENST00000367772.8,...
ENSG00000000460.16  ENST00000498289.5,ENST00000472795.5,ENST00000496973.5,...
ENSG00000000938.12  ENST00000374005.7,ENST00000399173.5,ENST00000374004.5,...
                        SYMBOL
                   <character>
ENSG00000000003.14      TSPAN6
ENSG00000000005.5         TNMD
ENSG00000000419.12        DPM1
ENSG00000000457.13       SCYL3
ENSG00000000460.16       FIRRM
ENSG00000000938.12         FGR
  • To see a list of the possible columns, use the columns function from the AnnotationDbi package:
AnnotationDbi::columns(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
[21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[26] "UNIPROT"     
  • We can even add annotations where we expect multiple mapping values, e.g., associated GO terms:
seg <- addIds(seg, "GO", multiVals = "list")
it appears the rows are gene IDs, setting 'gene' to TRUE
mapping to new IDs using org.Hs.eg.db
if all matching IDs are desired, and '1:many mappings' are reported,
set multiVals='list' to obtain all the matching IDs
'select()' returned 1:many mapping between keys and columns
head(rowData(seg))
DataFrame with 6 rows and 4 columns
                              gene_id
                          <character>
ENSG00000000003.14 ENSG00000000003.14
ENSG00000000005.5   ENSG00000000005.5
ENSG00000000419.12 ENSG00000000419.12
ENSG00000000457.13 ENSG00000000457.13
ENSG00000000460.16 ENSG00000000460.16
ENSG00000000938.12 ENSG00000000938.12
                                                                       tx_ids
                                                              <CharacterList>
ENSG00000000003.14  ENST00000612152.4,ENST00000373020.8,ENST00000614008.4,...
ENSG00000000005.5                         ENST00000373031.4,ENST00000485971.1
ENSG00000000419.12  ENST00000371588.9,ENST00000466152.5,ENST00000371582.8,...
ENSG00000000457.13 ENST00000367771.10,ENST00000367770.5,ENST00000367772.8,...
ENSG00000000460.16  ENST00000498289.5,ENST00000472795.5,ENST00000496973.5,...
ENSG00000000938.12  ENST00000374005.7,ENST00000399173.5,ENST00000374004.5,...
                        SYMBOL                                   GO
                   <character>                               <list>
ENSG00000000003.14      TSPAN6 GO:0005515,GO:0005886,GO:0039532,...
ENSG00000000005.5         TNMD GO:0001886,GO:0001937,GO:0005515,...
ENSG00000000419.12        DPM1 GO:0004169,GO:0004582,GO:0004582,...
ENSG00000000457.13       SCYL3 GO:0000139,GO:0005515,GO:0005524,...
ENSG00000000460.16       FIRRM                           GO:0005515
ENSG00000000938.12         FGR GO:0001784,GO:0001784,GO:0001819,...

Exercises

  1. Add a column with Entrez Gene IDs to the SummarizedExperiment object
seg <- addIds(seg, "ENTREZID")
rowData(seg)
DataFrame with 58294 rows and 5 columns
                              gene_id
                          <character>
ENSG00000000003.14 ENSG00000000003.14
ENSG00000000005.5   ENSG00000000005.5
ENSG00000000419.12 ENSG00000000419.12
ENSG00000000457.13 ENSG00000000457.13
ENSG00000000460.16 ENSG00000000460.16
...                               ...
ENSG00000285990.1   ENSG00000285990.1
ENSG00000285991.1   ENSG00000285991.1
ENSG00000285992.1   ENSG00000285992.1
ENSG00000285993.1   ENSG00000285993.1
ENSG00000285994.1   ENSG00000285994.1
                                                                       tx_ids
                                                              <CharacterList>
ENSG00000000003.14  ENST00000612152.4,ENST00000373020.8,ENST00000614008.4,...
ENSG00000000005.5                         ENST00000373031.4,ENST00000485971.1
ENSG00000000419.12  ENST00000371588.9,ENST00000466152.5,ENST00000371582.8,...
ENSG00000000457.13 ENST00000367771.10,ENST00000367770.5,ENST00000367772.8,...
ENSG00000000460.16  ENST00000498289.5,ENST00000472795.5,ENST00000496973.5,...
...                                                                       ...
ENSG00000285990.1                                           ENST00000649331.1
ENSG00000285991.1                                           ENST00000647612.1
ENSG00000285992.1                                           ENST00000648949.1
ENSG00000285993.1                                           ENST00000650266.1
ENSG00000285994.1                                           ENST00000648650.1
                        SYMBOL                                   GO    ENTREZID
                   <character>                               <list> <character>
ENSG00000000003.14      TSPAN6 GO:0005515,GO:0005886,GO:0039532,...        7105
ENSG00000000005.5         TNMD GO:0001886,GO:0001937,GO:0005515,...       64102
ENSG00000000419.12        DPM1 GO:0004169,GO:0004582,GO:0004582,...        8813
ENSG00000000457.13       SCYL3 GO:0000139,GO:0005515,GO:0005524,...       57147
ENSG00000000460.16       FIRRM                           GO:0005515       55732
...                        ...                                  ...         ...
ENSG00000285990.1       NBEAP6                                   NA   100418904
ENSG00000285991.1           NA                                   NA          NA
ENSG00000285992.1           NA                                   NA          NA
ENSG00000285993.1           NA                                   NA          NA
ENSG00000285994.1           NA                                   NA          NA

Normalization for library size differences

  • The library size or sequencing depth of an RNA-seq sample is typically used to refer to the number of reads assigned to genes - in other words, the column sums of the count matrix of the experiment.
  • These can vary quite a lot between samples even in the same experiment.
sort(colSums(assay(seg, "counts")))
SAMEA103885059 SAMEA103885284 SAMEA103885157 SAMEA103884919 SAMEA103884898 
      31636942       33348797       33349938       33448610       33708215 
SAMEA103885111 SAMEA103885021 SAMEA103885319 SAMEA103885276 SAMEA103885218 
      34187620       34531636       34923684       36376670       37805027 
SAMEA103884967 SAMEA103885308 SAMEA103885368 SAMEA103885004 SAMEA103885136 
      38244703       38330402       39226849       39366479       39702126 
SAMEA103885102 SAMEA103885413 SAMEA103885347 SAMEA103884949 SAMEA103885182 
      40075160       40089867       40467958       40479409       40738760 
SAMEA103885228 SAMEA103885043 SAMEA103885392 SAMEA103885262 
      41204929       41833047       42535450       45746950 
  • Differences in library size need to be accounted for, since they mean that the observed counts are not directly comparable between samples. If a sample is sequenced twice as deeply, all genes will have twice the number of reads, but that doesn’t mean that all genes are twice as highly expressed!
  • The simplest normalization approach would be to divide the counts for a sample by a factor proportional to the library size.
    • However, this can have unintended consequences if a small set of genes are strongly upregulated in one condition compared to another.
    • Since this approach effectively fixes the total number of reads, if a few genes are upregulated it means the rest will look downregulated!

  • Instead, a better idea is to attempt to estimate the scaling factors (typically called “size factors” or “normalization factors” depending on exactly what they represent) from non-differentially expressed genes only.
    • Assume “most genes” are not differentially expressed.
    • Estimate size factors by comparing library sizes after excluding genes with the largest fold changes between samples.
    • TMM (edgeR)
    • RLE (DESeq/DESeq2)

  • It is important to note that (as you will see later) the counts are not usually explicitly normalized before differential expression analysis (e.g., with edgeR or DESeq2). These packages take the raw counts as input, and build in the normalization in the statistical model.
  • However, for exploratory analysis (and some other differential expression analysis methods), the counts are explicitly normalized, and often also log-transformed, before analysis.

Exploratory analysis and visualization

  • The first step of an RNA-seq analysis (after the raw read QC and quantification) is often an exploratory analysis.
  • The purpose of this is to get a feeling for the strongest signals in the data, and to see whether these agree with expectations.
  • The most common approach is Principal Component Analysis, or PCA.
  • PCA finds linear combinations of the original features (genes) that represent as much of the variance in the original data set as possible.
  • Note that it doesn’t use any information about the grouping of the samples, and as such, it is an unsupervised method.
  • Before applying PCA, the read counts are normalized and typically transformed to a scale where the variance does not depend on the mean. This is important to avoid the expression level driving the PCA.
  • One common transformation is the variance-stabilizing transformation, in the DESeq2 package.
library(DESeq2)
## We must first convert the SummarizedExperiment object to a 
## DESeqDataSet object (which is, effectively, a SummarizedExperiment
## object with additional information about the experimental design)
seg <- DESeqDataSet(seg, design = ~ condition_name)
using counts and average transcript lengths from tximeta
Warning in DESeqDataSet(seg, design = ~condition_name): some variables in
design formula are characters, converting to factors
## We filter out the genes that are not expressed
seg <- seg[rowSums(assay(seg, "counts")) > 0, ]

## Apply the variance stabilizing transformation
vst <- varianceStabilizingTransformation(seg, blind = TRUE)
using 'avgTxLength' from assays(dds), correcting for library size
## Plot the mean-variance relationship for the raw counts as well as 
## transformed data
seg <- DESeq2::estimateSizeFactors(seg)
using 'avgTxLength' from assays(dds), correcting for library size
vsn::meanSdPlot(counts(seg, normalized = TRUE))

vsn::meanSdPlot(assay(normTransform(seg)))

vsn::meanSdPlot(assay(vst))

  • After transforming the data, we apply Principal Component Analysis using the prcomp() R function.
  • For more details on PCA, see e.g. this introduction.
library(ggplot2)
rv <- rowVars(assay(vst))
sel <- order(rv, decreasing = TRUE)[1:500]
pca <- prcomp(t(assay(vst)[sel, ]))
df <- data.frame(pca$x[, 1:2], colData(vst))
ggplot(df, aes(x = PC1, y = PC2, color = condition_name)) + 
  geom_point(size = 6) + theme_bw()

Alternatives to the SummarizedExperiment container

  • Many Bioconductor packages for RNA-seq analysis use data represented in the form of a SummarizedExperiment object (or a class that extends SummarizedExperiment).
  • The DESeqDataSet that we saw above is one example.
  • Later we will also see the SingleCellExperiment class, which is very similar to the SummarizedExperiment class but allows us to store the results of dimension reduction algorithms inside the object.
  • However, there are also widely used packages (like edgeR) that use other representations. In the case of edgeR, this is the DGEList class.
suppressPackageStartupMessages({
  library(edgeR)
})
dge <- DGEList(counts = assay(seg, "counts"), 
               samples = colData(seg), 
               genes = rowData(seg))

Session info

sessionInfo()
R version 4.3.1 Patched (2023-06-20 r84599)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] edgeR_3.42.4                limma_3.56.2               
 [3] ggplot2_3.4.4               DESeq2_1.40.1              
 [5] org.Hs.eg.db_3.17.0         GenomicFeatures_1.52.1     
 [7] AnnotationDbi_1.62.1        SummarizedExperiment_1.30.2
 [9] Biobase_2.60.0              GenomicRanges_1.52.0       
[11] GenomeInfoDb_1.36.1         IRanges_2.34.0             
[13] S4Vectors_0.38.1            BiocGenerics_0.46.0        
[15] MatrixGenerics_1.12.2       matrixStats_1.0.0          
[17] macrophage_1.16.0           tximeta_1.18.0             
[19] BiocStyle_2.28.0            png_0.1-8                  
[21] knitr_1.44                 

loaded via a namespace (and not attached):
  [1] rstudioapi_0.15.0             jsonlite_1.8.7               
  [3] tximport_1.28.0               magrittr_2.0.3               
  [5] farver_2.1.1                  rmarkdown_2.25               
  [7] BiocIO_1.10.0                 zlibbioc_1.46.0              
  [9] vctrs_0.6.4                   memoise_2.0.1                
 [11] Rsamtools_2.16.0              RCurl_1.98-1.12              
 [13] htmltools_0.5.6.1             S4Arrays_1.0.4               
 [15] progress_1.2.2                AnnotationHub_3.8.0          
 [17] curl_5.1.0                    htmlwidgets_1.6.2            
 [19] cachem_1.0.8                  GenomicAlignments_1.36.0     
 [21] mime_0.12                     lifecycle_1.0.3              
 [23] pkgconfig_2.0.3               Matrix_1.6-1.1               
 [25] R6_2.5.1                      fastmap_1.1.1                
 [27] GenomeInfoDbData_1.2.10       shiny_1.7.5                  
 [29] digest_0.6.33                 colorspace_2.1-0             
 [31] RSQLite_2.3.1                 filelock_1.0.2               
 [33] labeling_0.4.3                fansi_1.0.5                  
 [35] httr_1.4.7                    compiler_4.3.1               
 [37] bit64_4.0.5                   withr_2.5.1                  
 [39] BiocParallel_1.34.2           DBI_1.1.3                    
 [41] hexbin_1.28.3                 biomaRt_2.56.1               
 [43] rappdirs_0.3.3                DelayedArray_0.26.3          
 [45] rjson_0.2.21                  tools_4.3.1                  
 [47] interactiveDisplayBase_1.38.0 httpuv_1.6.11                
 [49] glue_1.6.2                    restfulr_0.0.15              
 [51] promises_1.2.1                generics_0.1.3               
 [53] gtable_0.3.4                  tzdb_0.4.0                   
 [55] preprocessCore_1.62.1         ensembldb_2.24.0             
 [57] hms_1.1.3                     xml2_1.3.5                   
 [59] utf8_1.2.3                    XVector_0.40.0               
 [61] BiocVersion_3.17.1            pillar_1.9.0                 
 [63] stringr_1.5.0                 vroom_1.6.4                  
 [65] later_1.3.1                   dplyr_1.1.3                  
 [67] BiocFileCache_2.8.0           lattice_0.21-9               
 [69] rtracklayer_1.60.0            bit_4.0.5                    
 [71] tidyselect_1.2.0              locfit_1.5-9.8               
 [73] Biostrings_2.68.1             ProtGenerics_1.32.0          
 [75] xfun_0.40                     stringi_1.7.12               
 [77] lazyeval_0.2.2                yaml_2.3.7                   
 [79] evaluate_0.22                 codetools_0.2-19             
 [81] tibble_3.2.1                  BiocManager_1.30.22          
 [83] cli_3.6.1                     affyio_1.70.0                
 [85] xtable_1.8-4                  munsell_0.5.0                
 [87] Rcpp_1.0.11                   dbplyr_2.3.4                 
 [89] XML_3.99-0.14                 parallel_4.3.1               
 [91] ellipsis_0.3.2                readr_2.1.4                  
 [93] blob_1.2.4                    prettyunits_1.2.0            
 [95] AnnotationFilter_1.24.0       bitops_1.0-7                 
 [97] scales_1.2.1                  affy_1.78.0                  
 [99] purrr_1.0.2                   crayon_1.5.2                 
[101] rlang_1.1.1                   vsn_3.68.0                   
[103] KEGGREST_1.40.0