STAR --runThreadN 24 \
--runMode genomeGenerate \
--genomeDir my_genome \
--genomeFastaFiles my_genome.fa \
--sjdbGTFfile my_genes.gtf \
--sjdbOverhang 99
Introduction to RNA-seq, quantification and import into R
Introduction to RNA sequencing
RNA sequencing (RNA-Seq) is revolutionizing the study of the transcriptome
Evolution of transcriptomics technologies:
- Northern blot –> single genes
- RT-PCR –> several genes
- Microarrays –> Targeted genomes
- RNAseq –> whole genomes
RNA-Seq takes over other transcriptomics technologies
(Adapted from Lachmann et al: https://doi.org/10.1038/s41467-018-03751-6 )
Advantages of RNAseq over microarrays:
- Highly sensitive and accurate
- Detects both known and novel features
- Can be applied to any species
Caveats of RNAseq:
- Results are highly sensitive to protocol and analysis parameters
- Difficulty to analyse repetitive regions
RNAseq: high throughput sequencing of cDNA using NGS technologies
What is RNAseq?
High-throughput sequencing of cDNA to assess the presence/quantity of RNA in a biological sample at a given moment in time
How does RNAseq work?
–> Sequencing of every RNA molecule –> Counting the number of RNA molecules, or reads
, per transcript / gene
Output of an RNAseq experiment:
The summarized RNAseq data is a count table. Typically features (usually genes or transcript isoforms) are arranged in the rows and assayed samples are arranged in the columns:
Applications of RNAseq
RNAseq comes as a replacement to microarrays…
… to study changes occurring in disease states, in response to therapeutics, under different environmental conditions, etc
What can be obtained/analyzed from RNAseq?
- Gene expression (different cell fractions, RNA populations can be targeted or selected for)
- Transcript isoforms (alternative splicing, alternative pA…)
- Gene fusions
- Single nucleotide variants
- Allele-specific gene expression
RNAseq analysis
Outline of this course
Week 1:
- Basic concepts in RNAseq. From raw data to count data (alignment and gene quantification)
Week 2:
- Summarized Experiment,normalization, exploratory analyses and visualization
Week 3:
- Differential gene expression, design matrices.
The different RNA-seq protocols
Prior to analysis, a good knowledge of which RNA-seq protocol was used is needed. The analysis parameters depend on it.
Paired-end vs. single-end RNAseq protocols
- Paired-end libraries contain sequence information of sequence linkage over longer distances
- Paired-end sequencing is more expensive in terms of price/sequenced fragment (same coverage but half the fragments for the same price)
- Paired-end gives more resolution, in particular in repeat regions and over transcript isoforms
Stranded vs. unstranded RNAseq protocols
- Stranded protocols are useful with regards for example to overlapping genes
Poly-A selected vs. ribo-depleted RNAseq protocols
A large fraction on the endogenous RNA originates from ribosomal RNA sequences. Typical approaches to enrich for non rRNA sequences include:
- Poly-A libraries for analysis of mRNA only. RNA products that are not polyadenylated (e.g. histone genes) will be absent/underrepresented.
- ribo-depleted libaries give access to the coding and non-coding transcriptome + non-polyA transcripts. Higher fraction of intronic reads.
RNA sequencing workflow
Inside an illumina flowcell
Every chip (flowcell) contains multiple lanes and every lane contains two columns of 50-60 tiles. Each tile contains several million “clusters” that contain primers:
In every cycle sequences with addapters are bound to primers, bridged to nearby primers and extended. After dissociation the cycle is repeated resulting in clusters of identical sequences. Labelled nucleotides are then added and scanned one at a time during incorporation giving the sequencing information.
During a labelled nucleotide addition cycle a tile area would look smth like this:
The FASTQ format
Sequencing information is recorded in a FASTQ file. This file contains information on the coordinates of the cluster in the flowcell, the read sequence and a string encoding the quality of the read nucleotides:
Quality scores explained: https://en.wikipedia.org/wiki/FASTQ_format#Quality
Quality control
Every sample in a sequecing run comes with a QC report (typically from the FASTQC tool). A FASTQC report is a summary of several indicators that allow us to pinpoint quality issues (e.g sequence duplication levels, contamination):
A couple of examples:
Exercise 1:
For the following examples of fastqc reports identify the issues in the FASTQC quality report:
Recap - transcriptomics
RNA-seq quantification
Input data
- Quantification starts from a set of reads and a set of reference sequences
- The reference sequences are typically either chromosomes (=genomic sequences) or transcript isoforms (=transcriptomic sequences)
- Both types of reference files can be downloaded from (e.g.):
- GENCODE (only human and mouse): https://www.gencodegenes.org/
- Ensembl (lots of species): http://www.ensembl.org/info/data/ftp/index.html
- Always stick to the same source and version of reference files during a project!
Approach I: Genome alignment and overlap counting
Alignment
- The genome sequence (FASTA file) is obtained from one of the reference databases
- In addition, a GTF file with the genomic locations of exons is obtained (from the same source as the genome)
- Reads are aligned to the genome (after indexing the latter for increased computational efficiency)
- For RNA-seq, the aligner must be able to accommodate spliced reads
- Example
STAR
command, genome indexing
- Example
STAR
command, read alignment
STAR --runThreadN 24 \
--runMode alignReads \
--genomeDir my_genome \
--readFilesIn S1_read1.fq.gz S1_read2.fq.gz \
--readFilesCommand zcat \
--outFileNamePrefix output/S1/ \
--outSAMtype BAM SortedByCoordinate
- The output file from the alignment step is typically a SAM file (or a BAM file, which is just a binary version of the SAM file)
- The SAM/BAM file contains a header, which provides information about the reference sequences and the call to the aligner, and the main body, with one line per alignment
- Detailed information about the SAM file format is available from https://samtools.github.io/hts-specs/SAMv1.pdf
- Interpretation of SAM flags: https://broadinstitute.github.io/picard/explain-flags.html
- Illustration of the CIGAR string: https://genome.sph.umich.edu/wiki/SAM#What_is_a_CIGAR.3F
Read counting
- The abundance of a gene is obtained by counting the number of reads that overlap the exonic regions of the gene
- Reads that overlap multiple genes (or map to multiple genomic locations) are typically discarded
- Several R packages are available for doing the counting:
- They can also be generated by STAR as part of the alignment process
- We can also use this approach to quantify other features, such as individual exons (here a read is typically counted for all the exons that it overlaps)
- This is useful e.g. if we are interested in finding specific exons that are only expressed in certain biological conditions
A problem with overlap counting
- Imagine a situation where we have two samples, each expressing a gene.
- In sample 1, isoform T1 (of length L) is expressed (at a level 2E).
- In sample 2, isoform T2 (of length 2L) is expressed (at a level E).
- When the transcripts are fragmented for sequencing, each molecule of isoform T2 contributes twice as many reads as a molecule of isoform T1.
- Thus, when we count the number of reads overlapping the gene, both samples get the same number, even though the gene was twice as highly expressed in sample 1.
- There are ways to address this issue, but we need additional information.
Approach II: Transcriptome mapping and abundance estimation
Mapping
- The transcriptome sequences (FASTA file) is obtained from one of the reference databases
- In addition, a text file mapping transcript IDs to the corresponding gene IDs should be prepared
- Reads are mapped to the transcripts (after indexing the latter for increased computational efficiency)
- For each read, we record the set of transcripts that it could have been generated from
- Each combination of transcripts form an equivalence class, and we construct a table of counts for these equivalence classes
Quantification
- Next, we estimate the individual transcript abundances that are most likely, given the observed equivalence class counts
- Note that we are not assigning a particular read to a specific transcript, but rather estimating underlying abundances!
- These “expected” abundances do not have to be integers
- Multimapping reads are naturally accounted for via the equivalence classes
- Several tools are available for performing this type of “alignment-free” quantification
- Example command,
Salmon
, transcriptome indexing
salmon index -i my_transcripts.idx \
-t my_transcripts.fasta
- Example command,
Salmon
, mapping/quantification
salmon quant -i my_transcripts.idx -l A \
-1 sample1_1.fastq -2 sample1_2.fastq \
-p 10 -o results/sample1 --validateMappings \
--numBootstraps 30 --seqBias --gcBias
- The estimated transcript-level abundances can be added up on the gene level
- It’s important to note that if we just add up counts from different isoforms, the resulting gene count has the same problem as the gene count from the overlap counting (\(C\) counts from a shorter isoform does not correspond to the same abundance as \(C\) counts from a longer isoform)! However, with the transcript abundance estimation approach we have additional information that can help us rectify it (see this paper for more information).
- The basic output from the transcript abundance estimation tools is a text file with estimated abundances. This one is from Salmon:
Name
is the transcript identifierLength
is the annotated transcript lengthEffectiveLength
is the length after accounting for the fact that a read is not equally likely to start anywhere in the transcriptTPM
is the estimated abundance in transcript-per-millionNumReads
is the estimated abundance in read counts
TPM vs read counts - what’s the difference?
- The read count depends on the length of the feature, in addition to its abundance.
- The TPM (similar to the nowadays less used RPKM/FPKM) aims to provide an abundance unit that is proportional to the number of molecules of a given target gene/transcript in a sample.
- Most of the widely used framework for differential gene expression directly model the counts, since these contain information related to the certainty of the abundance estimate, which is lost in the TPMs.
- Another commonly used abundance unit is the CPM (counts per million), which is obtained by normalizing the counts for differences in library size (but not by feature length).
How do the two quantification approaches differ?
- For most genes, counts are similar with the two approaches.
- Alignment-free methods are faster than methods based on genome alignment, since the former don’t perform a base-by-base matching.
- Salmon provides more highly resolved information than (e.g.) featureCounts (transcript-level rather than gene-level).
- Alignment-free methods typically use a larger fraction of the reads, since they don’t discard multimapping reads.
- Alignment-free methods don’t provide a precise alignment to the genome (which may be useful, e.g., for visualization in genome browsers).
Reading quantifications into R
- Here, we will focus on the processing of
Salmon
output. - Recall that the main output of
Salmon
is a text file with transcript abundances.
- To read this into R, we can use the
tximeta
function from the tximeta package.
library(tximeta)
?tximeta
- The first argument required by the
tximeta
function is acoldata
data.frame, which tells the function which samples we want to import, and where the corresponding quantification files are. - As an example, we will use data from the macrophage package. This package contains
quant.sf
files fromSalmon
, as well as acoldata
file. - These files are stored in the
macrophage
package folder in your R package library. We can find this location and check its content like so:
library(macrophage)
<- system.file("extdata", package = "macrophage")
dir dir
[1] "/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata"
list.files(dir)
[1] "coldata.csv" "errs"
[3] "gencode.v29_salmon_0.12.0" "gencode.v29.annotation.gtf.gz"
[5] "PRJEB18997.txt" "quants"
[7] "supp_table_1.csv" "supp_table_7.csv"
- We start by reading the
coldata
file. If you analyze your own data, you can read this information from a text file, or create the data.frame directly in R.
<- read.csv(file.path(dir, "coldata.csv"))[, c(1, 2, 3, 5)]
coldata dim(coldata)
[1] 24 4
head(coldata)
names sample_id line_id condition_name
1 SAMEA103885102 diku_A diku_1 naive
2 SAMEA103885347 diku_B diku_1 IFNg
3 SAMEA103885043 diku_C diku_1 SL1344
4 SAMEA103885392 diku_D diku_1 IFNg_SL1344
5 SAMEA103885182 eiwy_A eiwy_1 naive
6 SAMEA103885136 eiwy_B eiwy_1 IFNg
- In addition to the
names
column,tximeta
requires thatcoldata
has a column namedfiles
, pointing to the Salmon output (the quant.sf file) for the respective samples.
$files <- file.path(dir, "quants", coldata$names, "quant.sf.gz")
coldatahead(coldata)
names sample_id line_id condition_name
1 SAMEA103885102 diku_A diku_1 naive
2 SAMEA103885347 diku_B diku_1 IFNg
3 SAMEA103885043 diku_C diku_1 SL1344
4 SAMEA103885392 diku_D diku_1 IFNg_SL1344
5 SAMEA103885182 eiwy_A eiwy_1 naive
6 SAMEA103885136 eiwy_B eiwy_1 IFNg
files
1 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata/quants/SAMEA103885102/quant.sf.gz
2 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata/quants/SAMEA103885347/quant.sf.gz
3 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata/quants/SAMEA103885043/quant.sf.gz
4 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata/quants/SAMEA103885392/quant.sf.gz
5 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata/quants/SAMEA103885182/quant.sf.gz
6 /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/macrophage/extdata/quants/SAMEA103885136/quant.sf.gz
all(file.exists(coldata$files))
[1] TRUE
- Now we have everything we need, and can import the quantifications with
tximeta
. - In this process, we will see that
tximeta
automatically identifies the source and version of the transcriptome reference that was used for the quantification, and adds some metadata. - The imported data will be stored in a
SummarizedExperiment
container, which you heard about in a previous lecture.
suppressPackageStartupMessages({
library(SummarizedExperiment)
})<- tximeta(coldata = coldata, type = "salmon", dropInfReps = TRUE) se
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
found matching transcriptome:
[ GENCODE - Homo sapiens - release 29 ]
loading existing TxDb created: 2023-03-22 12:02:57
Loading required package: GenomicFeatures
Loading required package: AnnotationDbi
loading existing transcript ranges created: 2023-03-22 12:02:57
se
class: RangedSummarizedExperiment
dim: 205870 24
metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
assays(3): counts abundance length
rownames(205870): ENST00000456328.2 ENST00000450305.2 ...
ENST00000387460.2 ENST00000387461.2
rowData names(3): tx_id gene_id tx_name
colnames(24): SAMEA103885102 SAMEA103885347 ... SAMEA103885308
SAMEA103884949
colData names(4): names sample_id line_id condition_name
- We see that
tximeta
has identified the transcriptome used for the quantification as GENCODE - Homo sapiens - release 29. How did this happen? - In fact, the output directory from Salmon contains much more information than just the quant.sf file! (Side note: this means that it is not advisable to move files out of the folder, or to share only the quant.sf file, since the context is lost):
list.files(file.path(dir, "quants", coldata$names[1]), recursive = TRUE)
[1] "aux_info/ambig_info.tsv.gz" "aux_info/bootstrap/bootstraps.gz"
[3] "aux_info/bootstrap/names.tsv.gz" "aux_info/exp_gc.gz"
[5] "aux_info/expected_bias.gz" "aux_info/fld.gz"
[7] "aux_info/meta_info.json" "aux_info/obs_gc.gz"
[9] "aux_info/observed_bias_3p.gz" "aux_info/observed_bias.gz"
[11] "cmd_info.json" "lib_format_counts.json"
[13] "libParams/flenDist.txt" "logs/salmon_quant.log.txt"
[15] "quant.sf.gz"
- In particular, the
meta_info.json
file contains a hash checksum, which is derived from the set of transcripts used as reference during the quantification and which letstximeta
identify the reference source (by comparing to a table of these hash checksums for commonly used references).
::fromJSON(file = file.path(dir, "quants", coldata$names[1], "aux_info", "meta_info.json")) rjson
$salmon_version
[1] "0.12.0"
$samp_type
[1] "gibbs"
$opt_type
[1] "vb"
$quant_errors
list()
$num_libraries
[1] 1
$library_types
[1] "ISR"
$frag_dist_length
[1] 1001
$seq_bias_correct
[1] FALSE
$gc_bias_correct
[1] TRUE
$num_bias_bins
[1] 4096
$mapping_type
[1] "mapping"
$num_targets
[1] 205870
$serialized_eq_classes
[1] FALSE
$eq_class_properties
list()
$length_classes
[1] 520 669 1065 2328 205012
$index_seq_hash
[1] "40849ed828ea7d6a94af54a5c40e5d87eb0ce0fc1e9513208a5cffe59d442292"
$index_name_hash
[1] "77aca5545a0626421efb4730dd7b95482c77da261f9bdef70d36e25ee68bb7ef"
$index_seq_hash512
[1] "f37ae2a7412efd8518d68c22fc3fcc2478b59833809382abd5d9055475505e516730c70914343af34c9232af92fe22832e70afde6b38c26381097068e1e82551"
$index_name_hash512
[1] "67f286e5ec2895c10aa6e3a8feed04cb3a80747de7b3afb620298078481b303d2f979407fc104d4f3e8af0e82be2acb2f294ee5f2f68c00087f2855120d9f4ed"
$num_bootstraps
[1] 20
$num_processed
[1] 45141218
$num_mapped
[1] 40075160
$percent_mapped
[1] 88.77731
$call
[1] "quant"
$start_time
[1] "Tue Jan 15 21:21:22 2019"
$end_time
[1] "Tue Jan 15 21:29:00 2019"
Appendix: Accessing and downloading RNA-sequencing data from GEO
Most journals require that all datasets (including RNAseq) used in a publication are deposited in public genomic data repositories. The NCBI repository is the Sequence Read Archive or SRA
. Metadata on the experiments are deposited in GEO (Gene Expression Omnibus).
In order to query, download and obtain fastq files of published experiments on then typicall can:
- Download
.sra files
from GEO (raw files) - Convert .sra files into
.fastq files
(input for alignment) using fastq-dump
The GEO identifier of the RNAseq dataset can be found in the original article.
After a quick search for ‘gse’ in the whole article, we can find that data has been deposited in GEO under GSE33252
Fetching useful information prior to analysis
Have a careful look at:
- Material and Method section of the article
- GEO database (use GEO id)
- SRA database (use SRR id)
and get information about:
- protocol
- analysis
- Single-end reads
- 36b-long reads
- First four samples: PolyA, unstranded
- Last two samples: Total RNA, ribo-depleted, stranded
- Genome assembly used: mm9
Downloading raw data: SRA
Raw data are stored in the GEO/SRA databases under the .sra form. The .sra files need to be downloaded and converted into .fastq files, which can then be aligned.
.sra files can be downloaded from the GEO database
Converting the .sra files to .fastq files
.sra to .fastq conversion is done using the fastq-dump tool from the SRA toolkit
To run fastq-dump from Rstudio, use a .sh script (in Rstudio: new text document saved as
# ... specify the working directory containing the .sra file
cd <directoryname>
# ... convert .sra to .fastq (single read)
fastq-dump <filename.sra> --gzip --stdout > <filename.fq.gz>
# ... convert .sra to .fastq (paired-end)
fastq-dump <./filename.sra> --gzip --split-files
Downloading data from within R using the SRAdb package
The SRAdb package provides access to the Sequence Read Archive database from within R. It can be used to access, query and download SRA datasets.
This is accomplished by parsing all the NCBI SRA metadata into a SQLite database that can be stored and queried locally.
library(SRAdb)
<- file.path(system.file('extdata', package='SRAdb'),'SRAmetadb_demo.sqlite' ) #Note this is a downsized SQLdb only for illustration purposes
sqlfile # Create a connection to the SQLdatabase:
<- dbConnect(SQLite(),sqlfile)
sra_con
#download sra data files from the ftp site:
getSRAfile( c("SRR000648"), sra_con, fileType ='sra', destDir = "data/05/")
#Convert to fastq:
system ("fastq-dump data/05/SRR000648.sra") #This requires the sra command line tools to be installed in the system
#Or directly download fastq files from EBI using the ftp protocol:
getFASTQinfo( c("SRR000648"), sra_con, srcType ='ftp')
getSRAfile( c("SRR000648"), sra_con, fileType ='fastq',destDir = "data/05/")
For more information on the versatile SRAdb functionality see:
https://www.bioconductor.org/packages/release/bioc/vignettes/SRAdb/inst/doc/SRAdb.pdf
Session info
sessionInfo()
R version 4.3.1 Patched (2023-06-20 r84599)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Zurich
tzcode source: internal
attached base packages:
[1] stats4 grid stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GenomicFeatures_1.52.1 AnnotationDbi_1.62.1
[3] SummarizedExperiment_1.30.2 Biobase_2.60.0
[5] GenomicRanges_1.52.0 GenomeInfoDb_1.36.1
[7] IRanges_2.34.0 S4Vectors_0.38.1
[9] BiocGenerics_0.46.0 MatrixGenerics_1.12.2
[11] matrixStats_1.0.0 macrophage_1.16.0
[13] tximeta_1.18.0 BiocStyle_2.28.0
[15] png_0.1-8 knitr_1.44
loaded via a namespace (and not attached):
[1] DBI_1.1.3 bitops_1.0-7
[3] biomaRt_2.56.1 rlang_1.1.1
[5] magrittr_2.0.3 compiler_4.3.1
[7] RSQLite_2.3.1 vctrs_0.6.4
[9] stringr_1.5.0 ProtGenerics_1.32.0
[11] pkgconfig_2.0.3 crayon_1.5.2
[13] fastmap_1.1.1 dbplyr_2.3.4
[15] XVector_0.40.0 ellipsis_0.3.2
[17] utf8_1.2.3 Rsamtools_2.16.0
[19] promises_1.2.1 rmarkdown_2.25
[21] tzdb_0.4.0 purrr_1.0.2
[23] bit_4.0.5 xfun_0.40
[25] zlibbioc_1.46.0 cachem_1.0.8
[27] jsonlite_1.8.7 progress_1.2.2
[29] blob_1.2.4 later_1.3.1
[31] DelayedArray_0.26.3 BiocParallel_1.34.2
[33] interactiveDisplayBase_1.38.0 parallel_4.3.1
[35] prettyunits_1.2.0 R6_2.5.1
[37] stringi_1.7.12 rtracklayer_1.60.0
[39] Rcpp_1.0.11 readr_2.1.4
[41] httpuv_1.6.11 Matrix_1.6-1.1
[43] tidyselect_1.2.0 rstudioapi_0.15.0
[45] yaml_2.3.7 codetools_0.2-19
[47] curl_5.1.0 lattice_0.21-9
[49] tibble_3.2.1 withr_2.5.1
[51] shiny_1.7.5 KEGGREST_1.40.0
[53] evaluate_0.22 BiocFileCache_2.8.0
[55] xml2_1.3.5 Biostrings_2.68.1
[57] pillar_1.9.0 BiocManager_1.30.22
[59] filelock_1.0.2 generics_0.1.3
[61] vroom_1.6.4 RCurl_1.98-1.12
[63] BiocVersion_3.17.1 ensembldb_2.24.0
[65] hms_1.1.3 xtable_1.8-4
[67] glue_1.6.2 lazyeval_0.2.2
[69] tools_4.3.1 AnnotationHub_3.8.0
[71] BiocIO_1.10.0 GenomicAlignments_1.36.0
[73] XML_3.99-0.14 GenomeInfoDbData_1.2.10
[75] restfulr_0.0.15 cli_3.6.1
[77] rappdirs_0.3.3 fansi_1.0.5
[79] S4Arrays_1.0.4 dplyr_1.1.3
[81] AnnotationFilter_1.24.0 digest_0.6.33
[83] tximport_1.28.0 rjson_0.2.21
[85] htmlwidgets_1.6.2 memoise_2.0.1
[87] htmltools_0.5.6.1 lifecycle_1.0.3
[89] httr_1.4.7 mime_0.12
[91] bit64_4.0.5