library(tximeta)
Exploratory analysis of RNA-seq data in R
RNAseq analysis
Outline of this course
Week 1:
- Basic concepts in RNAseq. From raw data to count data (alignment and gene quantification)
Week 2:
- Summarized Experiment,normalization, exploratory analyses and visualization
Week 3:
- Differential gene expression, design matrices.
Recap - transcriptomics
RNA-seq quantification
Input data
The input to the quantification is a set of reads and a set of reference sequences (genome/chromosomes or transcripts).
Reading quantifications into R
- Here, we will focus on the processing of
Salmon
output. - Recall that the main output of
Salmon
is a text file with transcript abundances. Each sample is stored in a separate text file. - Counting tools like
Rsubread::featureCounts
andQuasR::qCount
, on the other hand, directly return gene-level count matrices in your R session.
Name
is the transcript identifierLength
is the annotated transcript lengthEffectiveLength
is the length after accounting for the fact that a read is not equally likely to start anywhere in the transcriptTPM
is the estimated abundance in transcript-per-millionNumReads
is the estimated abundance in read countsTo read this into R, we can use the
tximeta
function from the tximeta package.
- The first argument required by the
tximeta
function is acoldata
data.frame, which tells the function which samples we want to import, and where the corresponding quantification files are. - As an example, we will use data from the macrophage package. This package contains
quant.sf
files fromSalmon
, as well as acoldata
file. - These files are stored in the
macrophage
package folder in your R package library. We can find this location and check its content like so:
library(macrophage)
<- system.file("extdata", package = "macrophage")
dir dir
[1] "/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata"
list.files(dir)
[1] "coldata.csv" "errs"
[3] "gencode.v29_salmon_0.12.0" "gencode.v29.annotation.gtf.gz"
[5] "PRJEB18997.txt" "quants"
[7] "supp_table_1.csv" "supp_table_7.csv"
- We start by reading the
coldata
file. If you analyze your own data, you can read this information from a text file, or create the data.frame directly in R.
<- read.csv(file.path(dir, "coldata.csv"))[, c(1, 2, 3, 5)]
coldata dim(coldata)
[1] 24 4
head(coldata)
names sample_id line_id condition_name
1 SAMEA103885102 diku_A diku_1 naive
2 SAMEA103885347 diku_B diku_1 IFNg
3 SAMEA103885043 diku_C diku_1 SL1344
4 SAMEA103885392 diku_D diku_1 IFNg_SL1344
5 SAMEA103885182 eiwy_A eiwy_1 naive
6 SAMEA103885136 eiwy_B eiwy_1 IFNg
- In addition to the
names
column,tximeta
requires thatcoldata
has a column namedfiles
, pointing to the Salmon output (the quant.sf file) for the respective samples.
$files <- file.path(dir, "quants", coldata$names, "quant.sf.gz")
coldatahead(coldata)
names sample_id line_id condition_name
1 SAMEA103885102 diku_A diku_1 naive
2 SAMEA103885347 diku_B diku_1 IFNg
3 SAMEA103885043 diku_C diku_1 SL1344
4 SAMEA103885392 diku_D diku_1 IFNg_SL1344
5 SAMEA103885182 eiwy_A eiwy_1 naive
6 SAMEA103885136 eiwy_B eiwy_1 IFNg
files
1 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata/quants/SAMEA103885102/quant.sf.gz
2 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata/quants/SAMEA103885347/quant.sf.gz
3 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata/quants/SAMEA103885043/quant.sf.gz
4 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata/quants/SAMEA103885392/quant.sf.gz
5 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata/quants/SAMEA103885182/quant.sf.gz
6 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata/quants/SAMEA103885136/quant.sf.gz
all(file.exists(coldata$files))
[1] TRUE
- Now we have everything we need, and can import the quantifications with
tximeta
. - In this process, we will see that
tximeta
automatically identifies the source and version of the transcriptome reference that was used for the quantification, and adds some metadata. - The imported data will be stored in a
SummarizedExperiment
container, which you heard about in a previous lecture.
suppressPackageStartupMessages({
library(SummarizedExperiment)
})<- tximeta(coldata = coldata, type = "salmon", dropInfReps = TRUE) se
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
found matching transcriptome:
[ GENCODE - Homo sapiens - release 29 ]
loading existing TxDb created: 2023-03-22 12:02:57
Loading required package: GenomicFeatures
Loading required package: AnnotationDbi
loading existing transcript ranges created: 2023-03-22 12:02:57
se
class: RangedSummarizedExperiment
dim: 205870 24
metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
assays(3): counts abundance length
rownames(205870): ENST00000456328.2 ENST00000450305.2 ...
ENST00000387460.2 ENST00000387461.2
rowData names(3): tx_id gene_id tx_name
colnames(24): SAMEA103885102 SAMEA103885347 ... SAMEA103885308
SAMEA103884949
colData names(4): names sample_id line_id condition_name
- We see that
tximeta
has identified the transcriptome used for the quantification as GENCODE - Homo sapiens - release 29. How did this happen? - In fact, the output directory from Salmon contains much more information than just the quant.sf file! (Side note: this means that it is not advisable to move files out of the folder, or to share only the quant.sf file, since the context is lost):
list.files(file.path(dir, "quants", coldata$names[1]), recursive = TRUE)
[1] "aux_info/ambig_info.tsv.gz" "aux_info/bootstrap/bootstraps.gz"
[3] "aux_info/bootstrap/names.tsv.gz" "aux_info/exp_gc.gz"
[5] "aux_info/expected_bias.gz" "aux_info/fld.gz"
[7] "aux_info/meta_info.json" "aux_info/obs_gc.gz"
[9] "aux_info/observed_bias_3p.gz" "aux_info/observed_bias.gz"
[11] "cmd_info.json" "lib_format_counts.json"
[13] "libParams/flenDist.txt" "logs/salmon_quant.log.txt"
[15] "quant.sf.gz"
- In particular, the
meta_info.json
file contains a hash checksum, which is derived from the set of transcripts used as reference during the quantification and which letstximeta
identify the reference source (by comparing to a table of these hash checksums for commonly used references).
::fromJSON(file = file.path(dir, "quants", coldata$names[1], "aux_info", "meta_info.json")) rjson
$salmon_version
[1] "0.12.0"
$samp_type
[1] "gibbs"
$opt_type
[1] "vb"
$quant_errors
list()
$num_libraries
[1] 1
$library_types
[1] "ISR"
$frag_dist_length
[1] 1001
$seq_bias_correct
[1] FALSE
$gc_bias_correct
[1] TRUE
$num_bias_bins
[1] 4096
$mapping_type
[1] "mapping"
$num_targets
[1] 205870
$serialized_eq_classes
[1] FALSE
$eq_class_properties
list()
$length_classes
[1] 520 669 1065 2328 205012
$index_seq_hash
[1] "40849ed828ea7d6a94af54a5c40e5d87eb0ce0fc1e9513208a5cffe59d442292"
$index_name_hash
[1] "77aca5545a0626421efb4730dd7b95482c77da261f9bdef70d36e25ee68bb7ef"
$index_seq_hash512
[1] "f37ae2a7412efd8518d68c22fc3fcc2478b59833809382abd5d9055475505e516730c70914343af34c9232af92fe22832e70afde6b38c26381097068e1e82551"
$index_name_hash512
[1] "67f286e5ec2895c10aa6e3a8feed04cb3a80747de7b3afb620298078481b303d2f979407fc104d4f3e8af0e82be2acb2f294ee5f2f68c00087f2855120d9f4ed"
$num_bootstraps
[1] 20
$num_processed
[1] 45141218
$num_mapped
[1] 40075160
$percent_mapped
[1] 88.77731
$call
[1] "quant"
$start_time
[1] "Tue Jan 15 21:21:22 2019"
$end_time
[1] "Tue Jan 15 21:29:00 2019"
What’s in the SummarizedExperiment object?
se
class: RangedSummarizedExperiment
dim: 205870 24
metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
assays(3): counts abundance length
rownames(205870): ENST00000456328.2 ENST00000450305.2 ...
ENST00000387460.2 ENST00000387461.2
rowData names(3): tx_id gene_id tx_name
colnames(24): SAMEA103885102 SAMEA103885347 ... SAMEA103885308
SAMEA103884949
colData names(4): names sample_id line_id condition_name
- Looking at the size of the
SummarizedExperiment
object (205,870 rows!) as well as the row names, we see that this object contains transcript-level information. - The assays are created by directly importing the values from the quant.sf files and combining this information across the 24 samples:
- counts - NumReads column
- abundance - TPM column
- length - EffectiveLength column
- We can access any of the assays via the
assay
orassays
functions:
head(assay(se, "counts"), 3)
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392
ENST00000456328.2 22.058 12.404 5.44 0.000
ENST00000450305.2 0.000 0.000 0.00 0.000
ENST00000488147.1 119.092 180.069 161.55 93.747
SAMEA103885182 SAMEA103885136 SAMEA103885413 SAMEA103884967
ENST00000456328.2 0.0 10.833 5.119 6.708
ENST00000450305.2 0.0 0.000 0.000 0.000
ENST00000488147.1 145.5 141.607 189.152 98.019
SAMEA103885368 SAMEA103885218 SAMEA103885319 SAMEA103885004
ENST00000456328.2 0.000 4.484 0.000 0.000
ENST00000450305.2 0.000 0.000 0.000 0.000
ENST00000488147.1 132.243 88.429 96.871 66.813
SAMEA103885284 SAMEA103885059 SAMEA103884898 SAMEA103885157
ENST00000456328.2 27.337 12.401 0.000 6.107
ENST00000450305.2 0.000 0.000 0.000 0.000
ENST00000488147.1 250.127 211.475 144.212 134.842
SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021
ENST00000456328.2 23.333 11.794 4.670 0.000
ENST00000450305.2 0.000 0.000 0.000 0.000
ENST00000488147.1 205.167 151.599 134.082 69.402
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949
ENST00000456328.2 7.629 3.907 9.195 0.000
ENST00000450305.2 0.000 0.000 0.000 0.000
ENST00000488147.1 154.885 125.882 183.322 53.233
head(assays(se)[["counts"]], 3)
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392
ENST00000456328.2 22.058 12.404 5.44 0.000
ENST00000450305.2 0.000 0.000 0.00 0.000
ENST00000488147.1 119.092 180.069 161.55 93.747
SAMEA103885182 SAMEA103885136 SAMEA103885413 SAMEA103884967
ENST00000456328.2 0.0 10.833 5.119 6.708
ENST00000450305.2 0.0 0.000 0.000 0.000
ENST00000488147.1 145.5 141.607 189.152 98.019
SAMEA103885368 SAMEA103885218 SAMEA103885319 SAMEA103885004
ENST00000456328.2 0.000 4.484 0.000 0.000
ENST00000450305.2 0.000 0.000 0.000 0.000
ENST00000488147.1 132.243 88.429 96.871 66.813
SAMEA103885284 SAMEA103885059 SAMEA103884898 SAMEA103885157
ENST00000456328.2 27.337 12.401 0.000 6.107
ENST00000450305.2 0.000 0.000 0.000 0.000
ENST00000488147.1 250.127 211.475 144.212 134.842
SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021
ENST00000456328.2 23.333 11.794 4.670 0.000
ENST00000450305.2 0.000 0.000 0.000 0.000
ENST00000488147.1 205.167 151.599 134.082 69.402
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949
ENST00000456328.2 7.629 3.907 9.195 0.000
ENST00000450305.2 0.000 0.000 0.000 0.000
ENST00000488147.1 154.885 125.882 183.322 53.233
Exercises
- The screenshot of the quant.sf file above was taken from the Salmon output from the sample SAMEA103884898. Read the file into R manually and compare the values in each of the assays in the
SummarizedExperiment
object to the corresponding values given in the quant.sf file. Note that the file is compressed (the name is quant.sf.gz), but it can still be read by e.g.read.delim()
.
Solution
<- read.delim(file.path(dir, "quants", "SAMEA103884898", "quant.sf.gz"),
qf header = TRUE, as.is = TRUE)
all(rownames(se) == qf$Name)
[1] TRUE
all.equal(unname(assay(se, "counts")[, "SAMEA103884898"]), qf$NumReads)
[1] TRUE
all.equal(unname(assay(se, "abundance")[, "SAMEA103884898"]), qf$TPM)
[1] TRUE
all.equal(unname(assay(se, "length")[, "SAMEA103884898"]), qf$EffectiveLength)
[1] TRUE
- Check that for each sample, the sum of the TPMs for all the transcripts is indeed 1 million, as expected.
Solution
colSums(assay(se, "abundance"))
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182
1e+06 1e+06 1e+06 1e+06 1e+06
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218
1e+06 1e+06 1e+06 1e+06 1e+06
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898
1e+06 1e+06 1e+06 1e+06 1e+06
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021
1e+06 1e+06 1e+06 1e+06 1e+06
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949
1e+06 1e+06 1e+06 1e+06
Row (feature) annotations and genomic ranges
- You may have noted that
se
is in fact aRangedSummarizedExperiment
object (rather than “just” aSummarizedExperiment
object). What does this mean? - Let’s look at the information we have about the rows (transcripts) in the object:
rowRanges(se)
GRanges object with 205870 ranges and 3 metadata columns:
seqnames ranges strand | tx_id gene_id
<Rle> <IRanges> <Rle> | <integer> <CharacterList>
ENST00000456328.2 chr1 11869-14409 + | 1 ENSG00000223972.5
ENST00000450305.2 chr1 12010-13670 + | 2 ENSG00000223972.5
ENST00000488147.1 chr1 14404-29570 - | 9483 ENSG00000227232.5
ENST00000619216.1 chr1 17369-17436 - | 9484 ENSG00000278267.1
ENST00000473358.1 chr1 29554-31097 + | 3 ENSG00000243485.5
... ... ... ... . ... ...
ENST00000361681.2 chrM 14149-14673 - | 206692 ENSG00000198695.2
ENST00000387459.1 chrM 14674-14742 - | 206693 ENSG00000210194.1
ENST00000361789.2 chrM 14747-15887 + | 206684 ENSG00000198727.2
ENST00000387460.2 chrM 15888-15953 + | 206685 ENSG00000210195.2
ENST00000387461.2 chrM 15956-16023 - | 206694 ENSG00000210196.2
tx_name
<character>
ENST00000456328.2 ENST00000456328.2
ENST00000450305.2 ENST00000450305.2
ENST00000488147.1 ENST00000488147.1
ENST00000619216.1 ENST00000619216.1
ENST00000473358.1 ENST00000473358.1
... ...
ENST00000361681.2 ENST00000361681.2
ENST00000387459.1 ENST00000387459.1
ENST00000361789.2 ENST00000361789.2
ENST00000387460.2 ENST00000387460.2
ENST00000387461.2 ENST00000387461.2
-------
seqinfo: 25 sequences (1 circular) from hg38 genome
- By knowing the source and version of the reference used for the quantification,
tximeta
was able to retrieve the annotation files and decorate the object with information about the transcripts, such as the chromosome and position, and the corresponding gene ID. Importantly, Salmon did not use (or know about) any of this during the quantification! It needs only the transcript sequences. - The information about transcript locations in the genome is provided in a
GRanges
object, which you saw already a couple of weeks ago. - This gives us lots of possibilities of interacting with our data based on the genomic locations of the features.
- Let’s say for example that we want to only consider transcripts located between position 1e6 and 2e6 on chromosome 3. We can subset the SummarizedExperiment object (including the assays) to only these features:
## Define the target region
<- GRanges(seqnames = "chr3",
gr ranges = IRanges(
start = 1e6,
end = 2e6),
strand = "*")
## Subset SummarizedExperiment by overlap
<- IRanges::subsetByOverlaps(se, gr)
sesub sesub
class: RangedSummarizedExperiment
dim: 13 24
metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
assays(3): counts abundance length
rownames(13): ENST00000451707.1 ENST00000463924.3 ... ENST00000397467.2
ENST00000418972.1
rowData names(3): tx_id gene_id tx_name
colnames(24): SAMEA103885102 SAMEA103885347 ... SAMEA103885308
SAMEA103884949
colData names(4): names sample_id line_id condition_name
rowRanges(sesub)
GRanges object with 13 ranges and 3 metadata columns:
seqnames ranges strand | tx_id
<Rle> <IRanges> <Rle> | <integer>
ENST00000451707.1 chr3 1008135-1012934 + | 33364
ENST00000463924.3 chr3 1065580-1065875 - | 39749
ENST00000446702.6 chr3 1092576-1404217 + | 33365
ENST00000397479.6 chr3 1092658-1403442 + | 33366
ENST00000350110.2 chr3 1092661-1403594 + | 33367
... ... ... ... . ...
ENST00000449338.1 chr3 1394173-1394448 - | 39750
ENST00000423801.1 chr3 1595777-1596245 - | 39751
ENST00000415533.1 chr3 1730070-1730474 + | 33371
ENST00000397467.2 chr3 1905651-1906125 + | 33372
ENST00000418972.1 chr3 1962381-1987470 - | 39752
gene_id tx_name
<CharacterList> <character>
ENST00000451707.1 ENSG00000235158.1 ENST00000451707.1
ENST00000463924.3 ENSG00000244169.3 ENST00000463924.3
ENST00000446702.6 ENSG00000134115.12 ENST00000446702.6
ENST00000397479.6 ENSG00000134115.12 ENST00000397479.6
ENST00000350110.2 ENSG00000134115.12 ENST00000350110.2
... ... ...
ENST00000449338.1 ENSG00000232563.1 ENST00000449338.1
ENST00000423801.1 ENSG00000184423.5 ENST00000423801.1
ENST00000415533.1 ENSG00000214074.3 ENST00000415533.1
ENST00000397467.2 ENSG00000214073.2 ENST00000397467.2
ENST00000418972.1 ENSG00000225044.1 ENST00000418972.1
-------
seqinfo: 25 sequences (1 circular) from hg38 genome
- If we just want the annotation columns, without the ranges, we can get those with the
rowData
accessor:
rowData(se)
DataFrame with 205870 rows and 3 columns
tx_id gene_id tx_name
<integer> <CharacterList> <character>
ENST00000456328.2 1 ENSG00000223972.5 ENST00000456328.2
ENST00000450305.2 2 ENSG00000223972.5 ENST00000450305.2
ENST00000488147.1 9483 ENSG00000227232.5 ENST00000488147.1
ENST00000619216.1 9484 ENSG00000278267.1 ENST00000619216.1
ENST00000473358.1 3 ENSG00000243485.5 ENST00000473358.1
... ... ... ...
ENST00000361681.2 206692 ENSG00000198695.2 ENST00000361681.2
ENST00000387459.1 206693 ENSG00000210194.1 ENST00000387459.1
ENST00000361789.2 206684 ENSG00000198727.2 ENST00000361789.2
ENST00000387460.2 206685 ENSG00000210195.2 ENST00000387460.2
ENST00000387461.2 206694 ENSG00000210196.2 ENST00000387461.2
Column (sample) annotations
- Similar to the row annotations in
rowData
, the SummarizedExperiment object contains sample annotations in thecolData
slot.
colData(se)
DataFrame with 24 rows and 4 columns
names sample_id line_id condition_name
<character> <character> <character> <character>
SAMEA103885102 SAMEA103885102 diku_A diku_1 naive
SAMEA103885347 SAMEA103885347 diku_B diku_1 IFNg
SAMEA103885043 SAMEA103885043 diku_C diku_1 SL1344
SAMEA103885392 SAMEA103885392 diku_D diku_1 IFNg_SL1344
SAMEA103885182 SAMEA103885182 eiwy_A eiwy_1 naive
... ... ... ... ...
SAMEA103885021 SAMEA103885021 podx_D podx_1 IFNg_SL1344
SAMEA103885262 SAMEA103885262 qaqx_A qaqx_1 naive
SAMEA103885228 SAMEA103885228 qaqx_B qaqx_1 IFNg
SAMEA103885308 SAMEA103885308 qaqx_C qaqx_1 SL1344
SAMEA103884949 SAMEA103884949 qaqx_D qaqx_1 IFNg_SL1344
- Storing this information in the object rather than in an external text file or data.frame is very helpful to make sure things stay matched. For example, we can easily subset the object to only the naive and IFNg conditions, and we can be sure that the annotations in
colData
still match up with the columns of the count matrix:
$condition_name %in% c("naive", "IFNg")] se[, se
class: RangedSummarizedExperiment
dim: 205870 12
metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
assays(3): counts abundance length
rownames(205870): ENST00000456328.2 ENST00000450305.2 ...
ENST00000387460.2 ENST00000387461.2
rowData names(3): tx_id gene_id tx_name
colnames(12): SAMEA103885102 SAMEA103885347 ... SAMEA103885262
SAMEA103885228
colData names(4): names sample_id line_id condition_name
Exercises
- Subset the
SummarizedExperiment
object to only contain transcripts on the X chromosome, and only samples from the naive group. How many rows and columns do the resulting object contain?
Solution
<- se[seqnames(rowRanges(se)) == "chrX",
sesub2 $condition_name == "naive"]
se sesub2
class: RangedSummarizedExperiment
dim: 6431 6
metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
assays(3): counts abundance length
rownames(6431): ENST00000431238.7 ENST00000399012.6 ...
ENST00000464205.6 ENST00000507418.6
rowData names(3): tx_id gene_id tx_name
colnames(6): SAMEA103885102 SAMEA103885182 ... SAMEA103885111
SAMEA103885262
colData names(4): names sample_id line_id condition_name
rowRanges(sesub2)
GRanges object with 6431 ranges and 3 metadata columns:
seqnames ranges strand | tx_id
<Rle> <IRanges> <Rle> | <integer>
ENST00000431238.7 chrX 253743-255091 + | 199270
ENST00000399012.6 chrX 276322-303353 + | 199271
ENST00000484611.7 chrX 276324-291537 + | 199272
ENST00000430923.7 chrX 276353-291629 + | 199273
ENST00000445062.6 chrX 281055-288869 + | 199274
... ... ... ... . ...
ENST00000483079.6 chrX 156022786-156023531 + | 202580
ENST00000496301.6 chrX 156023367-156025666 + | 202581
ENST00000483286.6 chrX 156023824-156025554 + | 202582
ENST00000464205.6 chrX 156024071-156025554 + | 202583
ENST00000507418.6 chrX 156025664-156027877 - | 205759
gene_id tx_name
<CharacterList> <character>
ENST00000431238.7 ENSG00000228572.7 ENST00000431238.7
ENST00000399012.6 ENSG00000182378.13 ENST00000399012.6
ENST00000484611.7 ENSG00000182378.13 ENST00000484611.7
ENST00000430923.7 ENSG00000182378.13 ENST00000430923.7
ENST00000445062.6 ENSG00000182378.13 ENST00000445062.6
... ... ...
ENST00000483079.6 ENSG00000182484.15 ENST00000483079.6
ENST00000496301.6 ENSG00000182484.15 ENST00000496301.6
ENST00000483286.6 ENSG00000182484.15 ENST00000483286.6
ENST00000464205.6 ENSG00000182484.15 ENST00000464205.6
ENST00000507418.6 ENSG00000227159.8 ENST00000507418.6
-------
seqinfo: 25 sequences (1 circular) from hg38 genome
dim(sesub2)
[1] 6431 6
Summarizing on the gene level
- As we saw, the features in the SummarizedExperiment object above are individual transcripts, rather than genes.
- Most of the time, however, we want to do analysis on the gene level, since the gene-level abundances are more robust and sometimes more interpretable than transcript-level abundances.
Summarizing the row annotations
- The
rowData
contains the information about the corresponding gene for each transcript, in thegene_id
column. - tximeta provides a function to summarize on the gene level.
- Counts are added up
- TPMs are added up
- Transcript lengths are added up after weighting by the respective transcript TPMs
<- summarizeToGene(se) seg
loading existing TxDb created: 2023-03-22 12:02:57
obtaining transcript-to-gene mapping from database
loading existing gene ranges created: 2023-03-22 14:25:40
summarizing abundance
summarizing counts
summarizing length
seg
class: RangedSummarizedExperiment
dim: 58294 24
metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
assays(3): counts abundance length
rownames(58294): ENSG00000000003.14 ENSG00000000005.5 ...
ENSG00000285993.1 ENSG00000285994.1
rowData names(2): gene_id tx_ids
colnames(24): SAMEA103885102 SAMEA103885347 ... SAMEA103885308
SAMEA103884949
colData names(4): names sample_id line_id condition_name
- Now we have a new RangedSummarizedExperiment object, with one row per gene.
- The row ranges have been summarized as well, and can be used for subsetting and interpretation just as for the transcripts.
rowRanges(seg)
GRanges object with 58294 ranges and 2 metadata columns:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
ENSG00000000003.14 chrX 100627109-100639991 - | ENSG00000000003.14
ENSG00000000005.5 chrX 100584802-100599885 + | ENSG00000000005.5
ENSG00000000419.12 chr20 50934867-50958555 - | ENSG00000000419.12
ENSG00000000457.13 chr1 169849631-169894267 - | ENSG00000000457.13
ENSG00000000460.16 chr1 169662007-169854080 + | ENSG00000000460.16
... ... ... ... . ...
ENSG00000285990.1 chr14 19244904-19269380 - | ENSG00000285990.1
ENSG00000285991.1 chr6 149817937-149896011 - | ENSG00000285991.1
ENSG00000285992.1 chr8 47129262-47132628 + | ENSG00000285992.1
ENSG00000285993.1 chr18 46409197-46410645 - | ENSG00000285993.1
ENSG00000285994.1 chr10 12563151-12567351 + | ENSG00000285994.1
tx_ids
<CharacterList>
ENSG00000000003.14 ENST00000612152.4,ENST00000373020.8,ENST00000614008.4,...
ENSG00000000005.5 ENST00000373031.4,ENST00000485971.1
ENSG00000000419.12 ENST00000371588.9,ENST00000466152.5,ENST00000371582.8,...
ENSG00000000457.13 ENST00000367771.10,ENST00000367770.5,ENST00000367772.8,...
ENSG00000000460.16 ENST00000498289.5,ENST00000472795.5,ENST00000496973.5,...
... ...
ENSG00000285990.1 ENST00000649331.1
ENSG00000285991.1 ENST00000647612.1
ENSG00000285992.1 ENST00000648949.1
ENSG00000285993.1 ENST00000650266.1
ENSG00000285994.1 ENST00000648650.1
-------
seqinfo: 25 sequences (1 circular) from hg38 genome
rowRanges(IRanges::subsetByOverlaps(seg, gr))
GRanges object with 8 ranges and 2 metadata columns:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
ENSG00000134115.12 chr3 1092576-1404217 + | ENSG00000134115.12
ENSG00000184423.5 chr3 1595777-1596245 - | ENSG00000184423.5
ENSG00000214073.2 chr3 1905651-1906125 + | ENSG00000214073.2
ENSG00000214074.3 chr3 1730070-1730474 + | ENSG00000214074.3
ENSG00000225044.1 chr3 1962381-1987470 - | ENSG00000225044.1
ENSG00000232563.1 chr3 1394173-1394448 - | ENSG00000232563.1
ENSG00000235158.1 chr3 1008135-1012934 + | ENSG00000235158.1
ENSG00000244169.3 chr3 1065580-1065875 - | ENSG00000244169.3
tx_ids
<CharacterList>
ENSG00000134115.12 ENST00000446702.6,ENST00000397479.6,ENST00000350110.2,...
ENSG00000184423.5 ENST00000423801.1
ENSG00000214073.2 ENST00000397467.2
ENSG00000214074.3 ENST00000415533.1
ENSG00000225044.1 ENST00000418972.1
ENSG00000232563.1 ENST00000449338.1
ENSG00000235158.1 ENST00000451707.1
ENSG00000244169.3 ENST00000463924.3
-------
seqinfo: 25 sequences (1 circular) from hg38 genome
Exercises
- Convince yourself that the summarization rules listed above are correct, e.g., by looking at a few example genes. Hint: the transcript IDs associated with each gene ID are included in the
tx_ids
column ofrowData(seg)
.
Solution
## A gene (ENSG00000001561.6) with a single transcript (ENST00000321037.4)
assay(se, "counts")["ENST00000321037.4", ]
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182
75 297 53 196 203
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218
503 161 207 84 348
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898
83 166 14 50 13
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021
22 114 321 110 93
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949
191 544 164 232
assay(seg, "counts")["ENSG00000001561.6", ]
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182
75 297 53 196 203
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218
503 161 207 84 348
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898
83 166 14 50 13
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021
22 114 321 110 93
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949
191 544 164 232
assay(se, "abundance")["ENST00000321037.4", ]
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182
0.771645 3.012179 0.483955 1.783965 2.069614
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218
5.345586 1.531936 2.077714 0.812283 3.780581
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898
0.866642 1.653559 0.201541 0.774063 0.169594
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021
0.290852 1.672535 4.366731 1.264230 1.178936
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949
1.993649 6.215327 1.838931 2.405130
assay(seg, "abundance")["ENSG00000001561.6", ]
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182
0.771645 3.012179 0.483955 1.783965 2.069614
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218
5.345586 1.531936 2.077714 0.812283 3.780581
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898
0.866642 1.653559 0.201541 0.774063 0.169594
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021
0.290852 1.672535 4.366731 1.264230 1.178936
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949
1.993649 6.215327 1.838931 2.405130
assay(se, "length")["ENST00000321037.4", ]
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182
3867.923 3899.012 3904.951 3869.813 3779.375
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218
3767.240 3838.151 3833.323 3869.960 3718.040
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898
3973.670 3845.925 3263.954 3219.796 3378.609
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021
3275.387 3205.206 3473.419 3536.652 3369.202
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949
3568.291 3587.445 3669.451 3661.080
assay(seg, "length")["ENSG00000001561.6", ]
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182
3867.923 3899.012 3904.951 3869.813 3779.375
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218
3767.240 3838.151 3833.323 3869.960 3718.040
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898
3973.670 3845.925 3263.954 3219.796 3378.609
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021
3275.387 3205.206 3473.419 3536.652 3369.202
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949
3568.291 3587.445 3669.451 3661.080
## A gene (ENSG00000000003.14) with multiple transcripts (ENST00000612152.4, ENST00000373020.8, ENST00000614008.4, ENST00000496771.5, ENST00000494424.1)
<- "ENSG00000000003.14"
gn <- c("ENST00000612152.4", "ENST00000373020.8", "ENST00000614008.4",
txs "ENST00000496771.5", "ENST00000494424.1")
colSums(assay(se, "counts")[txs, ])
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182
83.818 144.156 71.239 130.372 241.527
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218
140.124 176.200 131.309 106.873 32.232
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898
61.065 34.597 67.708 69.832 54.947
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021
38.000 243.366 151.900 442.684 191.000
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949
953.947 511.918 298.509 217.753
assay(seg, "counts")[gn, ]
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182
83.818 144.156 71.239 130.372 241.527
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218
140.124 176.200 131.309 106.873 32.232
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898
61.065 34.597 67.708 69.832 54.947
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021
38.000 243.366 151.900 442.684 191.000
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949
953.947 511.918 298.509 217.753
colSums(assay(se, "abundance")[txs, ])
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182
1.675253 2.831133 1.289042 2.299402 4.922389
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218
2.889044 3.234795 2.732676 1.979229 0.699002
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898
1.252130 0.662058 1.639050 1.785457 1.252458
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021
0.851824 5.991920 3.754027 9.203870 4.196592
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949
18.465397 10.837159 6.115662 4.146591
assay(seg, "abundance")[gn, ]
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182
1.675253 2.831133 1.289042 2.299402 4.922389
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218
2.889044 3.234795 2.732676 1.979229 0.699002
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898
1.252130 0.662058 1.639050 1.785457 1.252458
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021
0.851824 5.991920 3.754027 9.203870 4.196592
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949
18.465397 10.837159 6.115662 4.146591
colSums(assay(se, "length")[txs, ] *
scale(assay(se, "abundance")[txs, ],
center = FALSE,
scale = colSums(assay(se, "abundance")[txs, ])))
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182
1991.076 2013.493 1970.577 1997.052 1890.622
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218
1941.822 1989.272 1848.819 2020.717 1862.530
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898
2023.468 2001.930 1941.024 1949.595 1933.715
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021
1931.731 1909.937 1911.907 1955.005 1943.886
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949
1924.158 1936.135 2008.338 1993.125
assay(seg, "length")[gn, ]
SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392 SAMEA103885182
1991.076 2013.493 1970.577 1997.052 1890.622
SAMEA103885136 SAMEA103885413 SAMEA103884967 SAMEA103885368 SAMEA103885218
1941.822 1989.272 1848.819 2020.717 1862.530
SAMEA103885319 SAMEA103885004 SAMEA103885284 SAMEA103885059 SAMEA103884898
2023.468 2001.930 1941.024 1949.595 1933.715
SAMEA103885157 SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021
1931.731 1909.937 1911.907 1955.005 1943.886
SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949
1924.158 1936.135 2008.338 1993.125
Adding additional annotations
- At this point, the only information we have about the genes in our data set, apart from their genomic location and the associated transcript IDs, is the Ensembl ID.
- Often we need additional annotations, such as gene symbols.
- Bioconductor provides a host of annotation packages (recall from a previous lecture):
OrgDb
packages, providing gene-based annotations for a given organismTxDb
andEnsDb
packages, providing transcript ranges for a given genome buildBSgenome
packages, providing the genome sequence for a given genome build
- For our purposes here, the appropriate
OrgDb
package is the most suitable, since it contains gene-centric ID conversion tables. - Since this is human data, we will use the
org.Hs.eg.db
package.
library(org.Hs.eg.db)
Using addIds
- The easiest way of adding a column is by using the
addIds
function from tximeta. For example, to add gene symbols:
<- addIds(seg, "SYMBOL") seg
it appears the rows are gene IDs, setting 'gene' to TRUE
mapping to new IDs using org.Hs.eg.db
if all matching IDs are desired, and '1:many mappings' are reported,
set multiVals='list' to obtain all the matching IDs
'select()' returned 1:many mapping between keys and columns
head(rowData(seg))
DataFrame with 6 rows and 3 columns
gene_id
<character>
ENSG00000000003.14 ENSG00000000003.14
ENSG00000000005.5 ENSG00000000005.5
ENSG00000000419.12 ENSG00000000419.12
ENSG00000000457.13 ENSG00000000457.13
ENSG00000000460.16 ENSG00000000460.16
ENSG00000000938.12 ENSG00000000938.12
tx_ids
<CharacterList>
ENSG00000000003.14 ENST00000612152.4,ENST00000373020.8,ENST00000614008.4,...
ENSG00000000005.5 ENST00000373031.4,ENST00000485971.1
ENSG00000000419.12 ENST00000371588.9,ENST00000466152.5,ENST00000371582.8,...
ENSG00000000457.13 ENST00000367771.10,ENST00000367770.5,ENST00000367772.8,...
ENSG00000000460.16 ENST00000498289.5,ENST00000472795.5,ENST00000496973.5,...
ENSG00000000938.12 ENST00000374005.7,ENST00000399173.5,ENST00000374004.5,...
SYMBOL
<character>
ENSG00000000003.14 TSPAN6
ENSG00000000005.5 TNMD
ENSG00000000419.12 DPM1
ENSG00000000457.13 SCYL3
ENSG00000000460.16 FIRRM
ENSG00000000938.12 FGR
- To see a list of the possible columns, use the
columns
function from theAnnotationDbi
package:
::columns(org.Hs.eg.db) AnnotationDbi
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
[16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
[21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
[26] "UNIPROT"
- We can even add annotations where we expect multiple mapping values, e.g., associated GO terms:
<- addIds(seg, "GO", multiVals = "list") seg
it appears the rows are gene IDs, setting 'gene' to TRUE
mapping to new IDs using org.Hs.eg.db
if all matching IDs are desired, and '1:many mappings' are reported,
set multiVals='list' to obtain all the matching IDs
'select()' returned 1:many mapping between keys and columns
head(rowData(seg))
DataFrame with 6 rows and 4 columns
gene_id
<character>
ENSG00000000003.14 ENSG00000000003.14
ENSG00000000005.5 ENSG00000000005.5
ENSG00000000419.12 ENSG00000000419.12
ENSG00000000457.13 ENSG00000000457.13
ENSG00000000460.16 ENSG00000000460.16
ENSG00000000938.12 ENSG00000000938.12
tx_ids
<CharacterList>
ENSG00000000003.14 ENST00000612152.4,ENST00000373020.8,ENST00000614008.4,...
ENSG00000000005.5 ENST00000373031.4,ENST00000485971.1
ENSG00000000419.12 ENST00000371588.9,ENST00000466152.5,ENST00000371582.8,...
ENSG00000000457.13 ENST00000367771.10,ENST00000367770.5,ENST00000367772.8,...
ENSG00000000460.16 ENST00000498289.5,ENST00000472795.5,ENST00000496973.5,...
ENSG00000000938.12 ENST00000374005.7,ENST00000399173.5,ENST00000374004.5,...
SYMBOL GO
<character> <list>
ENSG00000000003.14 TSPAN6 GO:0005515,GO:0005886,GO:0039532,...
ENSG00000000005.5 TNMD GO:0001886,GO:0001937,GO:0005515,...
ENSG00000000419.12 DPM1 GO:0004169,GO:0004582,GO:0004582,...
ENSG00000000457.13 SCYL3 GO:0000139,GO:0005515,GO:0005524,...
ENSG00000000460.16 FIRRM GO:0005515
ENSG00000000938.12 FGR GO:0001784,GO:0001784,GO:0001819,...
Exercises
- Add a column with Entrez Gene IDs to the SummarizedExperiment object
Solution
<- addIds(seg, "ENTREZID")
seg rowData(seg)
DataFrame with 58294 rows and 5 columns
gene_id
<character>
ENSG00000000003.14 ENSG00000000003.14
ENSG00000000005.5 ENSG00000000005.5
ENSG00000000419.12 ENSG00000000419.12
ENSG00000000457.13 ENSG00000000457.13
ENSG00000000460.16 ENSG00000000460.16
... ...
ENSG00000285990.1 ENSG00000285990.1
ENSG00000285991.1 ENSG00000285991.1
ENSG00000285992.1 ENSG00000285992.1
ENSG00000285993.1 ENSG00000285993.1
ENSG00000285994.1 ENSG00000285994.1
tx_ids
<CharacterList>
ENSG00000000003.14 ENST00000612152.4,ENST00000373020.8,ENST00000614008.4,...
ENSG00000000005.5 ENST00000373031.4,ENST00000485971.1
ENSG00000000419.12 ENST00000371588.9,ENST00000466152.5,ENST00000371582.8,...
ENSG00000000457.13 ENST00000367771.10,ENST00000367770.5,ENST00000367772.8,...
ENSG00000000460.16 ENST00000498289.5,ENST00000472795.5,ENST00000496973.5,...
... ...
ENSG00000285990.1 ENST00000649331.1
ENSG00000285991.1 ENST00000647612.1
ENSG00000285992.1 ENST00000648949.1
ENSG00000285993.1 ENST00000650266.1
ENSG00000285994.1 ENST00000648650.1
SYMBOL GO ENTREZID
<character> <list> <character>
ENSG00000000003.14 TSPAN6 GO:0005515,GO:0005886,GO:0039532,... 7105
ENSG00000000005.5 TNMD GO:0001886,GO:0001937,GO:0005515,... 64102
ENSG00000000419.12 DPM1 GO:0004169,GO:0004582,GO:0004582,... 8813
ENSG00000000457.13 SCYL3 GO:0000139,GO:0005515,GO:0005524,... 57147
ENSG00000000460.16 FIRRM GO:0005515 55732
... ... ... ...
ENSG00000285990.1 NBEAP6 NA 100418904
ENSG00000285991.1 NA NA NA
ENSG00000285992.1 NA NA NA
ENSG00000285993.1 NA NA NA
ENSG00000285994.1 NA NA NA
Normalization for library size differences
- The library size or sequencing depth of an RNA-seq sample is typically used to refer to the number of reads assigned to genes - in other words, the column sums of the count matrix of the experiment.
- These can vary quite a lot between samples even in the same experiment.
sort(colSums(assay(seg, "counts")))
SAMEA103885059 SAMEA103885284 SAMEA103885157 SAMEA103884919 SAMEA103884898
31636942 33348797 33349938 33448610 33708215
SAMEA103885111 SAMEA103885021 SAMEA103885319 SAMEA103885276 SAMEA103885218
34187620 34531636 34923684 36376670 37805027
SAMEA103884967 SAMEA103885308 SAMEA103885368 SAMEA103885004 SAMEA103885136
38244703 38330402 39226849 39366479 39702126
SAMEA103885102 SAMEA103885413 SAMEA103885347 SAMEA103884949 SAMEA103885182
40075160 40089867 40467958 40479409 40738760
SAMEA103885228 SAMEA103885043 SAMEA103885392 SAMEA103885262
41204929 41833047 42535450 45746950
- Differences in library size need to be accounted for, since they mean that the observed counts are not directly comparable between samples. If a sample is sequenced twice as deeply, all genes will have twice the number of reads, but that doesn’t mean that all genes are twice as highly expressed!
- The simplest normalization approach would be to divide the counts for a sample by a factor proportional to the library size.
- However, this can have unintended consequences if a small set of genes are strongly upregulated in one condition compared to another.
- Since this approach effectively fixes the total number of reads, if a few genes are upregulated it means the rest will look downregulated!
- Instead, a better idea is to attempt to estimate the scaling factors (typically called “size factors” or “normalization factors” depending on exactly what they represent) from non-differentially expressed genes only.
- It is important to note that (as you will see later) the counts are not usually explicitly normalized before differential expression analysis (e.g., with edgeR or DESeq2). These packages take the raw counts as input, and build in the normalization in the statistical model.
- However, for exploratory analysis (and some other differential expression analysis methods), the counts are explicitly normalized, and often also log-transformed, before analysis.
Exploratory analysis and visualization
- The first step of an RNA-seq analysis (after the raw read QC and quantification) is often an exploratory analysis.
- The purpose of this is to get a feeling for the strongest signals in the data, and to see whether these agree with expectations.
- The most common approach is Principal Component Analysis, or PCA.
- PCA finds linear combinations of the original features (genes) that represent as much of the variance in the original data set as possible.
- Note that it doesn’t use any information about the grouping of the samples, and as such, it is an unsupervised method.
- Before applying PCA, the read counts are normalized and typically transformed to a scale where the variance does not depend on the mean. This is important to avoid the expression level driving the PCA.
- One common transformation is the variance-stabilizing transformation, in the DESeq2 package.
library(DESeq2)
## We must first convert the SummarizedExperiment object to a
## DESeqDataSet object (which is, effectively, a SummarizedExperiment
## object with additional information about the experimental design)
<- DESeqDataSet(seg, design = ~ condition_name) seg
using counts and average transcript lengths from tximeta
Warning in DESeqDataSet(seg, design = ~condition_name): some variables in
design formula are characters, converting to factors
## We filter out the genes that are not expressed
<- seg[rowSums(assay(seg, "counts")) > 0, ]
seg
## Apply the variance stabilizing transformation
<- varianceStabilizingTransformation(seg, blind = TRUE) vst
using 'avgTxLength' from assays(dds), correcting for library size
## Plot the mean-variance relationship for the raw counts as well as
## transformed data
<- DESeq2::estimateSizeFactors(seg) seg
using 'avgTxLength' from assays(dds), correcting for library size
::meanSdPlot(counts(seg, normalized = TRUE)) vsn
::meanSdPlot(assay(normTransform(seg))) vsn
::meanSdPlot(assay(vst)) vsn
- After transforming the data, we apply Principal Component Analysis using the
prcomp()
R function. - For more details on PCA, see e.g. this introduction.
library(ggplot2)
<- rowVars(assay(vst))
rv <- order(rv, decreasing = TRUE)[1:500]
sel <- prcomp(t(assay(vst)[sel, ]))
pca <- data.frame(pca$x[, 1:2], colData(vst))
df ggplot(df, aes(x = PC1, y = PC2, color = condition_name)) +
geom_point(size = 6) + theme_bw()
Alternatives to the SummarizedExperiment container
- Many Bioconductor packages for RNA-seq analysis use data represented in the form of a
SummarizedExperiment
object (or a class that extendsSummarizedExperiment
). - The
DESeqDataSet
that we saw above is one example. - Later we will also see the
SingleCellExperiment
class, which is very similar to theSummarizedExperiment
class but allows us to store the results of dimension reduction algorithms inside the object. - However, there are also widely used packages (like edgeR) that use other representations. In the case of edgeR, this is the
DGEList
class.
suppressPackageStartupMessages({
library(edgeR)
})<- DGEList(counts = assay(seg, "counts"),
dge samples = colData(seg),
genes = rowData(seg))
Session info
sessionInfo()
R version 4.3.1 Patched (2023-06-20 r84599)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Zurich
tzcode source: internal
attached base packages:
[1] stats4 grid stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] edgeR_3.42.4 limma_3.56.2
[3] ggplot2_3.4.4 DESeq2_1.40.1
[5] org.Hs.eg.db_3.17.0 GenomicFeatures_1.52.1
[7] AnnotationDbi_1.62.1 SummarizedExperiment_1.30.2
[9] Biobase_2.60.0 GenomicRanges_1.52.0
[11] GenomeInfoDb_1.36.1 IRanges_2.34.0
[13] S4Vectors_0.38.1 BiocGenerics_0.46.0
[15] MatrixGenerics_1.12.2 matrixStats_1.0.0
[17] macrophage_1.16.0 tximeta_1.18.0
[19] BiocStyle_2.28.0 png_0.1-8
[21] knitr_1.44
loaded via a namespace (and not attached):
[1] rstudioapi_0.15.0 jsonlite_1.8.7
[3] tximport_1.28.0 magrittr_2.0.3
[5] farver_2.1.1 rmarkdown_2.25
[7] BiocIO_1.10.0 zlibbioc_1.46.0
[9] vctrs_0.6.4 memoise_2.0.1
[11] Rsamtools_2.16.0 RCurl_1.98-1.12
[13] htmltools_0.5.6.1 S4Arrays_1.0.4
[15] progress_1.2.2 AnnotationHub_3.8.0
[17] curl_5.1.0 htmlwidgets_1.6.2
[19] cachem_1.0.8 GenomicAlignments_1.36.0
[21] mime_0.12 lifecycle_1.0.3
[23] pkgconfig_2.0.3 Matrix_1.6-1.1
[25] R6_2.5.1 fastmap_1.1.1
[27] GenomeInfoDbData_1.2.10 shiny_1.7.5
[29] digest_0.6.33 colorspace_2.1-0
[31] RSQLite_2.3.1 filelock_1.0.2
[33] labeling_0.4.3 fansi_1.0.5
[35] httr_1.4.7 compiler_4.3.1
[37] bit64_4.0.5 withr_2.5.1
[39] BiocParallel_1.34.2 DBI_1.1.3
[41] hexbin_1.28.3 biomaRt_2.56.1
[43] rappdirs_0.3.3 DelayedArray_0.26.3
[45] rjson_0.2.21 tools_4.3.1
[47] interactiveDisplayBase_1.38.0 httpuv_1.6.11
[49] glue_1.6.2 restfulr_0.0.15
[51] promises_1.2.1 generics_0.1.3
[53] gtable_0.3.4 tzdb_0.4.0
[55] preprocessCore_1.62.1 ensembldb_2.24.0
[57] hms_1.1.3 xml2_1.3.5
[59] utf8_1.2.3 XVector_0.40.0
[61] BiocVersion_3.17.1 pillar_1.9.0
[63] stringr_1.5.0 vroom_1.6.4
[65] later_1.3.1 dplyr_1.1.3
[67] BiocFileCache_2.8.0 lattice_0.21-9
[69] rtracklayer_1.60.0 bit_4.0.5
[71] tidyselect_1.2.0 locfit_1.5-9.8
[73] Biostrings_2.68.1 ProtGenerics_1.32.0
[75] xfun_0.40 stringi_1.7.12
[77] lazyeval_0.2.2 yaml_2.3.7
[79] evaluate_0.22 codetools_0.2-19
[81] tibble_3.2.1 BiocManager_1.30.22
[83] cli_3.6.1 affyio_1.70.0
[85] xtable_1.8-4 munsell_0.5.0
[87] Rcpp_1.0.11 dbplyr_2.3.4
[89] XML_3.99-0.14 parallel_4.3.1
[91] ellipsis_0.3.2 readr_2.1.4
[93] blob_1.2.4 prettyunits_1.2.0
[95] AnnotationFilter_1.24.0 bitops_1.0-7
[97] scales_1.2.1 affy_1.78.0
[99] purrr_1.0.2 crayon_1.5.2
[101] rlang_1.1.1 vsn_3.68.0
[103] KEGGREST_1.40.0