Introduction to RNA-seq, quantification and import into R

Author

Charlotte Soneson and Panagiotis Papasaikas

Published

March 5, 2025

What is RNA-seq?

High-throughput sequencing of cDNA to assess the presence/quantity of RNA in a biological sample at a given moment in time.
Can be used to analyze gene expression, transcript isoforms, gene fusions, single nuclotide variants, allele-specific gene expression, …
In this course we focus on short-read sequencing data. However, more recently the use of long-read (‘third generation’) sequencing (e.g. provided by Oxford Nanopore Technologies/ONT and Pacific Biosciences/PacBio) has been increasing.

While RNA-seq is currently widely used for studying the transcriptome, it is not the first (or the only) technology that allows us to do so - historically, the research community has used e.g.:

Northern blots –> single genes
RT-PCR –> several genes
Microarrays –> targeted transcriptomes
RNAseq –> whole transcriptomes

Outline of this course

Week 1:

Basic concepts in RNA-seq. From raw data to count data (alignment and gene quantification).

Week 2:

SummarizedExperiment, normalization, exploratory analyses and visualization.

Week 3:

Differential gene expression, design matrices.

RNA-seq workflow

(Figure from Martin and Wang (2011))

Poly-A selection vs. ribo-depletion

A large fraction on the endogenous RNA originates from ribosomal RNA sequences.

(Figure from Palazzo and Lee (2015))

Typical approaches to enrich for non-rRNA sequences include:

poly-A selection for analysis of mRNA only. RNA products that are not polyadenylated (e.g. histone genes) will be absent/underrepresented.
ribo-depletion gives access to the coding and non-coding transcriptome + non-polyA transcripts. This type of libraries have a higher fraction of intronic reads than poly-A libraries.

Illumina sequencing

Every chip (flowcell) contains multiple lanes and every lane contains two columns of 50-60 tiles. Each tile contains several million “clusters” that contain primers:

Sequence fragments with adapters are bound to primers, bridged to nearby primers and extended. After dissociation the cycle is repeated resulting in clusters of identical sequences. Labelled nucleotides are then added and scanned one at a time during incorporation giving the sequence information.

During a labelled nucleotide addition cycle a tile area would look something like this:

Variations of the RNA-seq protocol

Prior to analysis, a good knowledge of the RNA-seq protocol that was used is needed, as the parameters of the downstream analysis depend on it.

Paired-end vs. single-end RNA-seq protocols

Paired-end libraries contain sequence information of sequence linkage over longer distances.
Paired-end sequencing is more expensive in terms of price/sequenced fragment (same coverage but half the fragments for the same price).
Paired-end gives more resolution, in particular in repeat regions and over transcript isoforms.

Stranded vs. unstranded RNA-seq protocols

Stranded protocols are useful e.g. to distinguish overlapping genes (on different strands). Strandedness is obtained by degrading one strand after cDNA synthesis.

(Figure from Griffith et al (2015))

The FASTQ format

Sequencing information is recorded in a FASTQ file. This file contains information on the coordinates of the cluster in the flowcell, the read sequence and a string encoding the quality of the read nucleotides:

For more information about quality (phred) scores, see e.g. https://en.wikipedia.org/wiki/FASTQ_format#Quality.

Quality control

The first step when analyzing sequencing data is typically to perform quality assessment of the raw reads. The most common tool for this is FastQC, which creates a summary report for each FASTQ file. This report provides a summary of several indicators that allow us to pinpoint quality issues (e.g sequence duplication levels, contamination):

Here is an example:

FastQC example

Exercise 1:

For the following examples of FastQC reports, can you identify the quality issues?

RNA-seq quantification

Input data

Quantification starts from a set of reads and a set of reference sequences

The reference sequences are typically either chromosomes (=genomic sequences) or transcript isoforms (=transcriptomic sequences)
Both types of reference files can be downloaded from (e.g.):
- GENCODE (only human and mouse): https://www.gencodegenes.org/
- Ensembl (lots of species): http://www.ensembl.org/info/data/ftp/index.html
- Always stick to the same source and version of reference files during a project!

Approach I: Genome alignment and overlap counting

Alignment

The genome sequence (FASTA file) is obtained from one of the reference databases

In addition, a GTF file with the genomic locations of exons is obtained (from the same source as the genome)

Reads are aligned to the genome (after indexing the latter for increased computational efficiency)
For RNA-seq, the aligner must be able to accommodate spliced reads
- STAR - command line application
- HISAT2 - command line application
- QuasR - R/Bioconductor package, incorporates HISAT2 for alignment of RNA-seq reads

Example STAR command, genome indexing

STAR --runThreadN 24 \
     --runMode genomeGenerate \
     --genomeDir my_genome \
     --genomeFastaFiles my_genome.fa \
     --sjdbGTFfile my_genes.gtf \
     --sjdbOverhang 99

Example STAR command, read alignment

STAR --runThreadN 24 \
     --runMode alignReads \
     --genomeDir my_genome \
     --readFilesIn S1_read1.fq.gz S1_read2.fq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix output/S1/ \
     --outSAMtype BAM SortedByCoordinate

The output file from the alignment step is typically a SAM file (or a BAM file, which is just a binary version of the SAM file)
The SAM/BAM file contains a header, which provides information about the reference sequences and the call to the aligner, and the main body, with one line per alignment

Detailed information about the SAM file format is available from https://samtools.github.io/hts-specs/SAMv1.pdf
Interpretation of SAM flags: https://broadinstitute.github.io/picard/explain-flags.html
Illustration of the CIGAR string: https://genome.sph.umich.edu/wiki/SAM#What_is_a_CIGAR.3F

Read counting

The abundance of a gene is obtained by counting the number of reads that overlap the exonic regions of the gene
Reads that overlap multiple genes (or map to multiple genomic locations) are typically discarded
Several R packages are available for doing the counting:
- Rsubread (featureCounts() function)
- QuasR (qCount() function)
They can also be generated by STAR as part of the alignment process

We can also use this approach to quantify other features, such as individual exons (here a read is typically counted for all the exons that it overlaps)
This is useful e.g. if we are interested in finding specific exons that are only expressed in certain biological conditions

A problem with overlap counting

Imagine a situation where we have two samples, each expressing a gene.
In sample 1, isoform T1 (of length L) is expressed (at a level 2E).
In sample 2, isoform T2 (of length 2L) is expressed (at a level E).

When the transcripts are fragmented for sequencing, each molecule of isoform T2 contributes twice as many reads as a molecule of isoform T1.

Thus, when we count the number of reads overlapping the gene, both samples get the same number, even though the gene was twice as highly expressed in sample 1.

There are ways to address this issue, but we need additional information.

Approach II: Transcriptome mapping and abundance estimation

Mapping

The transcriptome sequences (FASTA file) is obtained from one of the reference databases

In addition, a text file mapping transcript IDs to the corresponding gene IDs should be prepared
Reads are mapped to the transcripts (after indexing the latter for increased computational efficiency)
For each read, we record the set of transcripts that it could have been generated from
Each combination of transcripts form an equivalence class, and we construct a table of counts for these equivalence classes

Quantification

Next, we estimate the individual transcript abundances that are most likely, given the observed equivalence class counts
Note that we are not assigning a particular read to a specific transcript, but rather estimating underlying abundances!
These “expected” abundances do not have to be integers
Multimapping reads are naturally accounted for via the equivalence classes
Several tools are available for performing this type of “alignment-free” quantification
- Salmon
- kallisto

Example command, Salmon, transcriptome indexing

salmon index -i my_transcripts.idx \
     -t <(cat my_transcripts.fasta my_genome.fasta) \
     -d chromosome_names.txt -k 31

Example command, Salmon, mapping/quantification

salmon quant -i my_transcripts.idx -l A \
     -1 sample1_1.fastq -2 sample1_2.fastq \
     -p 10 -o results/sample1 \
     --numBootstraps 30 --seqBias --gcBias

The estimated transcript-level abundances can be added up on the gene level

It’s important to note that if we just add up counts from different isoforms, the resulting gene count has the same problem as the gene count from the overlap counting (\(C\) counts from a shorter isoform does not correspond to the same abundance as \(C\) counts from a longer isoform)! However, with the transcript abundance estimation approach we have additional information that can help us rectify it (see this paper for more information).
The basic output from the transcript abundance estimation tools is a text file with estimated abundances. This one is from Salmon:

Name is the transcript identifier
Length is the annotated transcript length
EffectiveLength is the length after accounting for the fact that a read is not equally likely to start anywhere in the transcript
TPM is the estimated abundance in transcript-per-million
NumReads is the estimated abundance in read counts

TPM vs read counts - what’s the difference?

The read count depends on the length of the feature, in addition to its abundance.
The TPM (similar to the nowadays less used RPKM/FPKM) aims to provide an abundance unit that is proportional to the number of molecules of a given target gene/transcript in a sample.
Most of the widely used framework for differential gene expression directly model the counts, since these contain information related to the uncertainty of the abundance estimate, which is lost in the TPMs.

Another commonly used abundance unit is the CPM (counts per million), which is obtained by normalizing the counts for differences in library size (but not by feature length).

How do the two quantification approaches differ?

For most genes, counts are similar with the two approaches.
Alignment-free methods are faster than methods based on genome alignment, since the former don’t perform a base-by-base matching.
Salmon provides more highly resolved information than (e.g.) featureCounts (transcript-level rather than gene-level).
Alignment-free methods typically use a larger fraction of the reads, since they don’t discard multimapping reads.
Alignment-free methods don’t provide a precise alignment to the genome (which may be useful, e.g., for visualization in genome browsers).

Reading quantifications into R

Here, we will focus on the processing of Salmon output.
Recall that the main output of Salmon is a text file with transcript abundances.

To read this into R, we can use the tximeta function from the tximeta package.

library(tximeta)
?tximeta

The first argument required by the tximeta function is a coldata data.frame, which tells the function which samples we want to import, and where the corresponding quantification files are.
As an example, we will use data from the macrophage package. This package contains quant.sf files from Salmon, as well as a coldata file.
These files are stored in the macrophage package folder in your R package library. We can find this location and check its content like so:

library(macrophage)
dir <- system.file("extdata", package = "macrophage")
dir

[1] "/Users/charlottesoneson/Library/R/arm64/4.5/library/__bioc321/macrophage/extdata"

list.files(dir)

[1] "coldata.csv"                   "errs"                         
[3] "gencode.v29_salmon_0.12.0"     "gencode.v29.annotation.gtf.gz"
[5] "PRJEB18997.txt"                "quants"                       
[7] "supp_table_1.csv"              "supp_table_7.csv"

We start by reading the coldata file. If you analyze your own data, you can read this information from a text file, or create the data.frame directly in R.

coldata <- read.csv(file.path(dir, "coldata.csv"))[, c(1, 2, 3, 5)]
dim(coldata)

[1] 24  4

head(coldata)

           names sample_id line_id condition_name
1 SAMEA103885102    diku_A  diku_1          naive
2 SAMEA103885347    diku_B  diku_1           IFNg
3 SAMEA103885043    diku_C  diku_1         SL1344
4 SAMEA103885392    diku_D  diku_1    IFNg_SL1344
5 SAMEA103885182    eiwy_A  eiwy_1          naive
6 SAMEA103885136    eiwy_B  eiwy_1           IFNg

In addition to the names column, tximeta requires that coldata has a column named files, pointing to the Salmon output (the quant.sf file) for the respective samples.

coldata$files <- file.path(dir, "quants", coldata$names, "quant.sf.gz")
head(coldata)

           names sample_id line_id condition_name
1 SAMEA103885102    diku_A  diku_1          naive
2 SAMEA103885347    diku_B  diku_1           IFNg
3 SAMEA103885043    diku_C  diku_1         SL1344
4 SAMEA103885392    diku_D  diku_1    IFNg_SL1344
5 SAMEA103885182    eiwy_A  eiwy_1          naive
6 SAMEA103885136    eiwy_B  eiwy_1           IFNg
                                                                                                               files
1 /Users/charlottesoneson/Library/R/arm64/4.5/library/__bioc321/macrophage/extdata/quants/SAMEA103885102/quant.sf.gz
2 /Users/charlottesoneson/Library/R/arm64/4.5/library/__bioc321/macrophage/extdata/quants/SAMEA103885347/quant.sf.gz
3 /Users/charlottesoneson/Library/R/arm64/4.5/library/__bioc321/macrophage/extdata/quants/SAMEA103885043/quant.sf.gz
4 /Users/charlottesoneson/Library/R/arm64/4.5/library/__bioc321/macrophage/extdata/quants/SAMEA103885392/quant.sf.gz
5 /Users/charlottesoneson/Library/R/arm64/4.5/library/__bioc321/macrophage/extdata/quants/SAMEA103885182/quant.sf.gz
6 /Users/charlottesoneson/Library/R/arm64/4.5/library/__bioc321/macrophage/extdata/quants/SAMEA103885136/quant.sf.gz

all(file.exists(coldata$files))

[1] TRUE

Now we have everything we need, and can import the quantifications with tximeta.
In this process, we will see that tximeta automatically identifies the source and version of the transcriptome reference that was used for the quantification, and adds some metadata.
The imported data will be stored in a SummarizedExperiment container, which you heard about in a previous lecture.

suppressPackageStartupMessages({
  library(SummarizedExperiment)
})
se <- tximeta(coldata = coldata, type = "salmon", dropInfReps = TRUE)

importing quantifications

reading in files with read_tsv

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
found matching transcriptome:
[ GENCODE - Homo sapiens - release 29 ]
loading existing TxDb created: 2022-03-23 15:48:47
Loading required package: GenomicFeatures
Loading required package: AnnotationDbi
loading existing transcript ranges created: 2022-03-23 15:48:49

se

class: RangedSummarizedExperiment 
dim: 205870 24 
metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
assays(3): counts abundance length
rownames(205870): ENST00000456328.2 ENST00000450305.2 ...
  ENST00000387460.2 ENST00000387461.2
rowData names(3): tx_id gene_id tx_name
colnames(24): SAMEA103885102 SAMEA103885347 ... SAMEA103885308
  SAMEA103884949
colData names(4): names sample_id line_id condition_name

We see that tximeta has identified the transcriptome used for the quantification as GENCODE - Homo sapiens - release 29. How did this happen?
In fact, the output directory from Salmon contains much more information than just the quant.sf file! (Side note: this means that it is not advisable to move files out of the folder, or to share only the quant.sf file, since the context is lost):

list.files(file.path(dir, "quants", coldata$names[1]), recursive = TRUE)

 [1] "aux_info/ambig_info.tsv.gz"       "aux_info/bootstrap/bootstraps.gz"
 [3] "aux_info/bootstrap/names.tsv.gz"  "aux_info/exp_gc.gz"              
 [5] "aux_info/expected_bias.gz"        "aux_info/fld.gz"                 
 [7] "aux_info/meta_info.json"          "aux_info/obs_gc.gz"              
 [9] "aux_info/observed_bias_3p.gz"     "aux_info/observed_bias.gz"       
[11] "cmd_info.json"                    "lib_format_counts.json"          
[13] "libParams/flenDist.txt"           "logs/salmon_quant.log.txt"       
[15] "quant.sf.gz"

In particular, the meta_info.json file contains a hash checksum, which is derived from the set of transcripts used as reference during the quantification and which lets tximeta identify the reference source (by comparing to a table of these hash checksums for commonly used references).

rjson::fromJSON(file = file.path(dir, "quants", coldata$names[1], "aux_info", "meta_info.json"))

$salmon_version
[1] "0.12.0"

$samp_type
[1] "gibbs"

$opt_type
[1] "vb"

$quant_errors
list()

$num_libraries
[1] 1

$library_types
[1] "ISR"

$frag_dist_length
[1] 1001

$seq_bias_correct
[1] FALSE

$gc_bias_correct
[1] TRUE

$num_bias_bins
[1] 4096

$mapping_type
[1] "mapping"

$num_targets
[1] 205870

$serialized_eq_classes
[1] FALSE

$eq_class_properties
list()

$length_classes
[1]    520    669   1065   2328 205012

$index_seq_hash
[1] "40849ed828ea7d6a94af54a5c40e5d87eb0ce0fc1e9513208a5cffe59d442292"

$index_name_hash
[1] "77aca5545a0626421efb4730dd7b95482c77da261f9bdef70d36e25ee68bb7ef"

$index_seq_hash512
[1] "f37ae2a7412efd8518d68c22fc3fcc2478b59833809382abd5d9055475505e516730c70914343af34c9232af92fe22832e70afde6b38c26381097068e1e82551"

$index_name_hash512
[1] "67f286e5ec2895c10aa6e3a8feed04cb3a80747de7b3afb620298078481b303d2f979407fc104d4f3e8af0e82be2acb2f294ee5f2f68c00087f2855120d9f4ed"

$num_bootstraps
[1] 20

$num_processed
[1] 45141218

$num_mapped
[1] 40075160

$percent_mapped
[1] 88.77731

$call
[1] "quant"

$start_time
[1] "Tue Jan 15 21:21:22 2019"

$end_time
[1] "Tue Jan 15 21:29:00 2019"

Appendix: Accessing and downloading RNA-sequencing data from GEO

Most journals require that all datasets (including RNA-seq) used in a publication are deposited in public genomic data repositories. The NCBI repository is the Sequence Read Archive or SRA. Metadata about the experiments are deposited in GEO (Gene Expression Omnibus).

There are several ways of querying and downloading FASTQ files from published experiments. One option is to download .sra files from GEO (raw files), and then convert these .sra files into FASTQ files (input for alignment) using fastq-dump.

Another option is to use fastq-dl, which allows us to directly retrieve FASTQ files for a given SRA accession number.

The GEO identifier of a published (RNA-seq) dataset can typically be found in the corresponding article.

In this example, the data has been deposited in GEO under accession number GSE33252

Before downloading data, it is often useful to have a careful look at the:

Material and Methods section of the article
GEO database (use GEO id)
SRA database (use SRR id)

and get information about:

the protocol that was used to generate the data
the analysis steps that may have already been performed

In GEO, we can click further on each individual sample, to retrieve more (sample-specific) information.

From the GEO landing page, we can also follow the link to the Sequence Read Archive.

In this example, the data consists of single-end, 36 bp long reads, and a ribo-zero kit was used to remove ribosomal RNA.

Downloading raw data: SRA

Raw data are stored in the GEO/SRA databases in the .sra format. The .sra files need to be downloaded and converted into FASTQ files, which can then be aligned.

.sra files can be downloaded from GEO.

Converting the .sra files to .fastq files

.sra to FASTQ conversion can be done using the fastq-dump tool from the SRA toolkit

Note that fastq-dump is a command line utility (not an R package), and should be run from a terminal.

# ... specify the working directory containing the .sra file
cd <directoryname>

# ... convert .sra to .fastq (single-end reads)
fastq-dump <filename.sra> --gzip --stdout > <filename.fq.gz>

# ... convert .sra to .fastq (paired-end reads)
fastq-dump <./filename.sra> --gzip --split-files

Downloading data from within R using the SRAdb package

The SRAdb package provides access to the Sequence Read Archive database from within R. It can be used to access, query and download SRA datasets.

This is accomplished by parsing all the NCBI SRA metadata into a SQLite database that can be stored and queried locally.

## Code below adapted from the help page of `SRAdb::getSRAfile`

library(SRAdb)

# Define the path to a demo SQL database (only for illustration purposes)
sqlfile <- file.path(system.file('extdata', package = 'SRAdb'),
                     'SRAmetadb_demo.sqlite' ) 

# Create a connection to the SQLdatabase:
sra_con <- dbConnect(SQLite(), sqlfile)

# Download sra data files from the ftp site:
getSRAfile("SRR000648", sra_con, fileType = 'sra', destDir = "data/05/")

# Convert to fastq
# This requires the sra command line tools to be installed in the system
system("fastq-dump data/05/SRR000648.sra") 

# Or directly download fastq files from EBI using the ftp protocol:
getFASTQinfo("SRR000648", sra_con, srcType = 'ftp')
getSRAfile("SRR000648", sra_con, fileType = 'fastq', destDir = "data/05/")

For more information on the versatile SRAdb functionality see:

https://www.bioconductor.org/packages/release/bioc/vignettes/SRAdb/inst/doc/SRAdb.pdf

Session info

sessionInfo()

R Under development (unstable) (2025-01-03 r87521)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] GenomicFeatures_1.59.1      AnnotationDbi_1.69.0       
 [3] SummarizedExperiment_1.37.0 Biobase_2.67.0             
 [5] GenomicRanges_1.59.1        GenomeInfoDb_1.43.4        
 [7] IRanges_2.41.3              S4Vectors_0.45.4           
 [9] BiocGenerics_0.53.6         generics_0.1.3             
[11] MatrixGenerics_1.19.1       matrixStats_1.5.0          
[13] macrophage_1.23.0           tximeta_1.25.2             
[15] BiocStyle_2.35.0            png_0.1-8                  
[17] knitr_1.49                 

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1         dplyr_1.1.4              blob_1.2.4              
 [4] filelock_1.0.3           Biostrings_2.75.4        bitops_1.0-9            
 [7] fastmap_1.2.0            RCurl_1.98-1.16          lazyeval_0.2.2          
[10] BiocFileCache_2.15.1     GenomicAlignments_1.43.0 XML_3.99-0.18           
[13] digest_0.6.37            lifecycle_1.0.4          ProtGenerics_1.39.2     
[16] KEGGREST_1.47.0          RSQLite_2.3.9            magrittr_2.0.3          
[19] compiler_4.5.0           rlang_1.1.5              progress_1.2.3          
[22] tools_4.5.0              yaml_2.3.10              rtracklayer_1.67.1      
[25] prettyunits_1.2.0        S4Arrays_1.7.2           htmlwidgets_1.6.4       
[28] bit_4.5.0.1              curl_6.2.1               DelayedArray_0.33.6     
[31] xml2_1.3.7               abind_1.4-8              BiocParallel_1.41.2     
[34] withr_3.0.2              purrr_1.0.4              txdbmaker_1.3.1         
[37] biomaRt_2.63.1           cli_3.6.4                rmarkdown_2.29          
[40] crayon_1.5.3             rstudioapi_0.17.1        tzdb_0.4.0              
[43] httr_1.4.7               rjson_0.2.23             DBI_1.2.3               
[46] cachem_1.1.0             stringr_1.5.1            parallel_4.5.0          
[49] AnnotationFilter_1.31.0  BiocManager_1.30.25      XVector_0.47.2          
[52] restfulr_0.0.15          vctrs_0.6.5              Matrix_1.7-1            
[55] jsonlite_1.9.0           hms_1.1.3                bit64_4.6.0-1           
[58] archive_1.1.11           ensembldb_2.31.0         glue_1.8.0              
[61] codetools_0.2-20         stringi_1.8.4            BiocVersion_3.21.1      
[64] BiocIO_1.17.1            UCSC.utils_1.3.1         tibble_3.2.1            
[67] pillar_1.10.1            rappdirs_0.3.3           htmltools_0.5.8.1       
[70] GenomeInfoDbData_1.2.13  httr2_1.1.0              R6_2.6.1                
[73] dbplyr_2.5.0             tximport_1.35.0          vroom_1.6.5             
[76] evaluate_1.0.3           lattice_0.22-6           readr_2.1.5             
[79] AnnotationHub_3.15.0     Rsamtools_2.23.1         memoise_2.0.1           
[82] SparseArray_1.7.6        xfun_0.51                pkgconfig_2.0.3