Introduction to single-cell RNA sequencing

Author

Charlotte Soneson, Athimed El Taher, Panagiotis Papasaikas

Published

April 1, 2026

Why single-cell assays?

(Image credit: Elizabeth Sergison, Honeycomb Biotechnologies)

  • Limited input material (e.g early development, circulating tumor cells, …)
  • Sample heterogeneity (mixture of cell types / states)
    • Number and frequency of cell populations
    • Markers for cell populations
    • Variability of gene expression among cells
    • Cell population relationships (developmental/pseudo-temporal ordering)
    • Discriminate between changes in overall (bulk) gene expression due to changes in cell type composition, and changes due to changes in expression levels within a given cell type

(Image credit: 10x Genomics: Getting started with single cell gene expression)

Generating scRNA-seq data

  • Currently, the most common classes of protocols are
    • low-throughput, plate-based, full transcript coverage, with high sequencing depth (e.g., Smart-Seq2)
    • high-throughput, droplet-based, 3’ biased, with low sequencing depth (e.g., 10x Genomics Chromium)
  • We will not go through the different protocols in detail, but if you want to read more, see e.g.
  • Which protocol to choose for a given application depends on the goal of the study.
    • Plate-based: many detected genes per cell, fewer cells, isoform analysis (perhaps) possible
    • Droplet-based: fewer detected genes per cell (mostly the highly expressed ones), many cells

Unique molecular identifiers

  • Because of the small amount of input material, strong amplification is required during library preparation
  • By attaching a tag to each original transcript molecule, before the amplification, reads generated from the same molecule can be identified and collapsed at the quantification step
  • These tags are called unique molecular identifiers (UMIs), and are part of many of the single-cell sequencing protocols (in particular the droplet-based ones)

Example read structure (10x Genomics Chromium v3)

(Image credit: 10x Genomics)

  • The sample index identifies the library (in case of multiplexing)
  • The 10x barcode (or cell index) identifies the cell within the library
  • The UMI identifies the transcript molecule within a cell and gene
  • The insert is the actual cDNA sequence

Quantification of scRNA-seq data

  • The quantification approach depends on the type of protocol
  • For Smart-Seq2 and similar protocols, one often uses the same tools as for bulk RNA-seq (STAR+featureCounts, Salmon, kallisto)
  • For 10x Genomics and similar protocols (where reads are generated from the 3’ end of the transcript, and additionally we need to collapse reads with the same UMI within a gene), there are specialized tools, often building on one of the bulk RNA-seq tools:
  • Depending on the quantification tools, there are dedicated functions for importing the quantifications into R:
    • DropletUtils and Seurat contain functions for reading CellRanger output
    • tximeta can read alevin output (specify type = "alevin")
    • fishpond can read alevin-fry output
    • BUSpaRse can read kallisto|bustools output

Representation of scRNA-seq data

  • We have seen how bulk RNA-seq data is often represented in a SummarizedExperiment object (at least within the Bioconductor framework)
  • Single-cell RNA-seq has a similar structure (a count matrix, with genes as rows and cells as columns), and is represented in a SingleCellExperiment object
  • The SingleCellExperiment object is an extension of the SummarizedExperiment object
  • The main difference is that a SingleCellExperiment object can also store reduced dimension representations (PCA, tSNE, UMAP)
  • There are also other commonly used representation formats, such as the Seurat object (R) and anndata (python).
  • Several packages are available for converting between these formats

(Original image credit: Jim Hester / https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html)

Load SingleCellExperiment

For this lecture, we will use an example data set from the TENxPBMCData Bioconductor package. This package contains a number of data sets of PBMCs (peripheral blood mononuclear cells), obtained with the 10x Genomics Chromium platform. Here we will use one containing approximately 3,000 PBMCs.

suppressPackageStartupMessages({
    library(TENxPBMCData)
    library(SingleCellExperiment)
    library(scuttle)
    library(ggplot2)
})

# Import dataset:
sce <- TENxPBMCData("pbmc3k")
see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
loading from cache

We can see that sce is indeed a SingleCellExperiment object, and that the structure is similar to that of a SummarizedExperiment object.

sce
class: SingleCellExperiment 
dim: 32738 2700 
metadata(0):
assays(1): counts
rownames(32738): ENSG00000243485 ENSG00000237613 ...
  ENSG00000215616 ENSG00000215611
rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
colnames: NULL
colData names(11): Sample Barcode ... Individual Date_published
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

Explore object

The object contains annotations for the genes (rows), including the Ensembl gene ID and the gene symbol. Since the gene symbols are generally easier to remember and interpret, we will use them as row names (keeping the information about the corresponding Ensembl ID in the object). Only if the gene symbols are not unique we will fall back to the Ensembl ID. This can be achieved using the uniquifyFeatureNames function from the scuttle Bioconductor package.

# Dimension of SingleCellExperiment
dim(sce) # genes X cells (barcodes)
[1] 32738  2700
# colnames of SingleCellExperiment
head(colnames(sce))
NULL
# rownames of SingleCellExperiment
head(rownames(sce)) 
[1] "ENSG00000243485" "ENSG00000237613" "ENSG00000186092" "ENSG00000238009"
[5] "ENSG00000239945" "ENSG00000237683"
# SingleCellExperiment colData
colData(sce)
DataFrame with 2700 rows and 11 columns
          Sample          Barcode       Sequence   Library
     <character>      <character>    <character> <integer>
1         pbmc3k AAACATACAACCAC-1 AAACATACAACCAC         1
2         pbmc3k AAACATTGAGCTAC-1 AAACATTGAGCTAC         1
3         pbmc3k AAACATTGATCAGC-1 AAACATTGATCAGC         1
4         pbmc3k AAACCGTGCTTCCG-1 AAACCGTGCTTCCG         1
5         pbmc3k AAACCGTGTATGCG-1 AAACCGTGTATGCG         1
...          ...              ...            ...       ...
2696      pbmc3k TTTCGAACTCTCAT-1 TTTCGAACTCTCAT         1
2697      pbmc3k TTTCTACTGAGGCA-1 TTTCTACTGAGGCA         1
2698      pbmc3k TTTCTACTTCCTCG-1 TTTCTACTTCCTCG         1
2699      pbmc3k TTTGCATGAGAGGC-1 TTTGCATGAGAGGC         1
2700      pbmc3k TTTGCATGCCTCAC-1 TTTGCATGCCTCAC         1
     Cell_ranger_version Tissue_status Barcode_type   Chemistry
             <character>   <character>  <character> <character>
1                 v1.1.0            NA      GemCode Chromium_v1
2                 v1.1.0            NA      GemCode Chromium_v1
3                 v1.1.0            NA      GemCode Chromium_v1
4                 v1.1.0            NA      GemCode Chromium_v1
5                 v1.1.0            NA      GemCode Chromium_v1
...                  ...           ...          ...         ...
2696              v1.1.0            NA      GemCode Chromium_v1
2697              v1.1.0            NA      GemCode Chromium_v1
2698              v1.1.0            NA      GemCode Chromium_v1
2699              v1.1.0            NA      GemCode Chromium_v1
2700              v1.1.0            NA      GemCode Chromium_v1
     Sequence_platform    Individual Date_published
           <character>   <character>    <character>
1           NextSeq500 HealthyDonor2     2016-05-26
2           NextSeq500 HealthyDonor2     2016-05-26
3           NextSeq500 HealthyDonor2     2016-05-26
4           NextSeq500 HealthyDonor2     2016-05-26
5           NextSeq500 HealthyDonor2     2016-05-26
...                ...           ...            ...
2696        NextSeq500 HealthyDonor2     2016-05-26
2697        NextSeq500 HealthyDonor2     2016-05-26
2698        NextSeq500 HealthyDonor2     2016-05-26
2699        NextSeq500 HealthyDonor2     2016-05-26
2700        NextSeq500 HealthyDonor2     2016-05-26
# SingleCellExperiment rowData
rowData(sce)
DataFrame with 32738 rows and 3 columns
                     ENSEMBL_ID  Symbol_TENx       Symbol
                    <character>  <character>  <character>
ENSG00000243485 ENSG00000243485   MIR1302-10           NA
ENSG00000237613 ENSG00000237613      FAM138A      FAM138A
ENSG00000186092 ENSG00000186092        OR4F5        OR4F5
ENSG00000238009 ENSG00000238009 RP11-34P13.7 LOC100996442
ENSG00000239945 ENSG00000239945 RP11-34P13.8           NA
...                         ...          ...          ...
ENSG00000215635 ENSG00000215635   AC145205.1           NA
ENSG00000268590 ENSG00000268590        BAGE5           NA
ENSG00000251180 ENSG00000251180   CU459201.1           NA
ENSG00000215616 ENSG00000215616   AC002321.2           NA
ENSG00000215611 ENSG00000215611   AC002321.1           NA
# set gene symbol as row names of the data object, use Ensembl ID if not unique
rownames(sce) <- uniquifyFeatureNames(ID = rownames(sce), 
                                      names = rowData(sce)$Symbol_TENx)

# set the colnames: 
colnames(sce) <- sce$Sequence

sce
class: SingleCellExperiment 
dim: 32738 2700 
metadata(0):
assays(1): counts
rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
colnames(2700): AAACATACAACCAC AAACATTGAGCTAC ... TTTGCATGAGAGGC
  TTTGCATGCCTCAC
colData names(11): Sample Barcode ... Individual Date_published
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

Assays

Accessing the assay data, row and column annotations is done in the same way as for SummarizedExperiment objects. There is also an additional counts() accessor for the counts assay.

# All assays present in the sce object
assayNames(sce)
[1] "counts"
# Access the "counts" assay in different ways 
counts_method1 <- assays(sce)[["counts"]]
counts_method2 <- assay(sce, "counts")
counts_method3 <- counts(sce)

counts_method1[1:5, 1:3]
<5 x 3> sparse DelayedMatrix object of type "integer":
             AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC
MIR1302-10                0              0              0
FAM138A                   0              0              0
OR4F5                     0              0              0
RP11-34P13.7              0              0              0
RP11-34P13.8              0              0              0
counts_method2[1:5, 1:3]
<5 x 3> sparse DelayedMatrix object of type "integer":
             AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC
MIR1302-10                0              0              0
FAM138A                   0              0              0
OR4F5                     0              0              0
RP11-34P13.7              0              0              0
RP11-34P13.8              0              0              0
counts_method3[1:5, 1:3]
<5 x 3> sparse DelayedMatrix object of type "integer":
             AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC
MIR1302-10                0              0              0
FAM138A                   0              0              0
OR4F5                     0              0              0
RP11-34P13.7              0              0              0
RP11-34P13.8              0              0              0

Sparse matrix representation

While the structure of the scRNAseq data is similar to that of the bulk data, there are also important differences that affect the downstream analysis. One of these differences is that single-cell data is much more sparse; in other words, there are many more zeros in the count matrix from a single-cell experiment than from a bulk experiment.

This is due to things such as:

  • the much lower sequencing depth per cell/sample (especially for droplet-based protocols)
  • not every individual cell expressing each gene
  • a failure of the library preparation to capture many of the expressed transcript molecules

Let’s check the fraction of zeros in our count matrix:

# fraction of 0 per barcode/cell
frac_0_per_cell <- colMeans(counts(sce) == 0)
head(frac_0_per_cell)
AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG AAACCGTGTATGCG 
     0.9761439      0.9587024      0.9654530      0.9706763      0.9840552 
AAACGCACTGGTAC 
     0.9761134 
# fraction of 0 across all cells
mean(counts(sce) == 0)
[1] 0.9741281

We also calculate the range of library sizes, noting that these are much smaller than the typical values for bulk RNA-seq:

# Library size (=total amount of reads per cell/barcode)
head(colSums(counts(sce)))
AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG AAACCGTGTATGCG 
          2421           4903           3149           2639            981 
AAACGCACTGGTAC 
          2164 
# Summary of library size distribution per cell
summary(colSums(counts(sce)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    548    1758    2197    2367    2763   15844 

The number of cells in a scRNA-seq data set is typically much (several orders of magnitude) larger than the number of samples in a bulk RNA-seq experiment. Hence, the count matrices can get very large. However, since most of the values are zero, efficient storage modes, where only the non-zero values and the corresponding matrix positions are stored, can be employed. We can make sure that the count matrix in our object is indeed such a sparse matrix (in this particular data set, it is actually provided as a DelayedMatrix, which is beyond the scope of this course to discuss in detail, but which can be suitable for very large data matrices that do not fit in memory).

class(counts(sce))
[1] "DelayedMatrix"
attr(,"package")
[1] "DelayedArray"
counts(sce) <- as(counts(sce), "dgCMatrix")
class(counts(sce))
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"
counts(sce)[1:10, 1:10]
10 x 10 sparse Matrix of class "dgCMatrix"
  [[ suppressing 10 column names 'AAACATACAACCAC', 'AAACATTGAGCTAC', 'AAACATTGATCAGC' ... ]]
                                 
MIR1302-10    . . . . . . . . . .
FAM138A       . . . . . . . . . .
OR4F5         . . . . . . . . . .
RP11-34P13.7  . . . . . . . . . .
RP11-34P13.8  . . . . . . . . . .
AL627309.1    . . . . . . . . . .
RP11-34P13.14 . . . . . . . . . .
RP11-34P13.9  . . . . . . . . . .
AP006222.2    . . . . . . . . . .
RP4-669L17.10 . . . . . . . . . .

Properties of scRNA-seq data

We already noted a couple of differences between scRNA-seq data and bulk data:

  • The former typically contains many more observations
  • The count matrix is much more sparse.
  • The low amount of starting material for scRNA-seq experiments also results in a high sampling noise
  • There is a lower correlation among cells than among bulk RNA-seq samples.

There are many genes that are not observed in any given cell, but still have a high number of assigned counts in other cells. This indicates that the gene detection is to some extent random, as expected by the sampling process. In other words, it is not always the same genes that go undetected in all cells.

While we don’t observe each gene in each cell, there is still a clear association between the overall expression level of a gene (e.g., total count) and the fraction of cells where it is detected.

ggplot(data.frame(total_count = rowSums(counts(sce)),
                  frac_det = rowMeans(counts(sce) > 0)),
       aes(x = total_count, y = frac_det)) + 
    geom_point() + 
    labs(x = "Total count per gene", 
         y = "Fraction of cells in which the gene is detected") + 
    scale_x_log10() + 
    theme_bw()
Warning in scale_x_log10(): log-10 transformation introduced infinite
values.

Exercise 1 - Exploration

Explore the SingleCellExperiment object:

  • dimension: How many genes and cells are in the object?
  • colnames: What are the column names of the object?
  • rownames: What are the row names of the object?
  • colData: What info is stored in the colData (cell metadata)?
  • rowData: What info is stored in the rowData (gene metadata)

Exercise 2

Check that the counts() accessor returns the same matrix as you would get if you access the counts assay using assay() or assays().

How many genes are detected in at least one cell? How many genes are detected for each individual cell?

Quality control (QC)

  • Low-quality libraries can arise for a variety of reasons (e.g. cell damage during dissociation, failure in library preparation)
  • They are problematic as they can contribute to misleading results in downstream analyses (form their own distinct cluster, interfere with characterization of population heterogeneity, large variability in scaling factors,…)
  • Low quality cells usually have low total counts, few expressed genes and high mitochondrial proportion of the counts.
  • To calculate these summary QC statistics (for both cells and genes), we again use the scuttle Bioconductor package.
  • By providing a subset of genes (here, mitochondrial genes), we can calculate the fraction of counts falling in these genes.
# add QC statistics per gene
sce <- addPerFeatureQCMetrics(sce)

# Two new columns (mean and detected) are added to the rowData 
rowData(sce)
DataFrame with 32738 rows and 5 columns
                  ENSEMBL_ID  Symbol_TENx       Symbol      mean  detected
                 <character>  <character>  <character> <numeric> <numeric>
MIR1302-10   ENSG00000243485   MIR1302-10           NA         0         0
FAM138A      ENSG00000237613      FAM138A      FAM138A         0         0
OR4F5        ENSG00000186092        OR4F5        OR4F5         0         0
RP11-34P13.7 ENSG00000238009 RP11-34P13.7 LOC100996442         0         0
RP11-34P13.8 ENSG00000239945 RP11-34P13.8           NA         0         0
...                      ...          ...          ...       ...       ...
AC145205.1   ENSG00000215635   AC145205.1           NA         0         0
BAGE5        ENSG00000268590        BAGE5           NA         0         0
CU459201.1   ENSG00000251180   CU459201.1           NA         0         0
AC002321.2   ENSG00000215616   AC002321.2           NA         0         0
AC002321.1   ENSG00000215611   AC002321.1           NA         0         0
# 1. mean: mean expression of the gene (mean number of read across all cells) 
# 2. detected: percentage of cells where the gene is detected (where count>0)

# add QC statistics per cell: 
# first extract list of mitochondrial genes ("MT" pattern within their symbol)
mt <- rownames(sce)[grep("^MT-", rowData(sce)$Symbol_TENx)]
mt # 13 genes
 [1] "MT-ND1"  "MT-ND2"  "MT-CO1"  "MT-CO2"  "MT-ATP8" "MT-ATP6" "MT-CO3" 
 [8] "MT-ND3"  "MT-ND4L" "MT-ND4"  "MT-ND5"  "MT-ND6"  "MT-CYB" 
# provide this list to addPerCellQCMetrics to calculate summaries 
# globally as well as specifically for the gene subset
sce <- addPerCellQCMetrics(sce, subsets = list(MT = mt))

# Six new columns are added to the colData 
colData(sce)
DataFrame with 2700 rows and 17 columns
                    Sample          Barcode       Sequence   Library
               <character>      <character>    <character> <integer>
AAACATACAACCAC      pbmc3k AAACATACAACCAC-1 AAACATACAACCAC         1
AAACATTGAGCTAC      pbmc3k AAACATTGAGCTAC-1 AAACATTGAGCTAC         1
AAACATTGATCAGC      pbmc3k AAACATTGATCAGC-1 AAACATTGATCAGC         1
AAACCGTGCTTCCG      pbmc3k AAACCGTGCTTCCG-1 AAACCGTGCTTCCG         1
AAACCGTGTATGCG      pbmc3k AAACCGTGTATGCG-1 AAACCGTGTATGCG         1
...                    ...              ...            ...       ...
TTTCGAACTCTCAT      pbmc3k TTTCGAACTCTCAT-1 TTTCGAACTCTCAT         1
TTTCTACTGAGGCA      pbmc3k TTTCTACTGAGGCA-1 TTTCTACTGAGGCA         1
TTTCTACTTCCTCG      pbmc3k TTTCTACTTCCTCG-1 TTTCTACTTCCTCG         1
TTTGCATGAGAGGC      pbmc3k TTTGCATGAGAGGC-1 TTTGCATGAGAGGC         1
TTTGCATGCCTCAC      pbmc3k TTTGCATGCCTCAC-1 TTTGCATGCCTCAC         1
               Cell_ranger_version Tissue_status Barcode_type   Chemistry
                       <character>   <character>  <character> <character>
AAACATACAACCAC              v1.1.0            NA      GemCode Chromium_v1
AAACATTGAGCTAC              v1.1.0            NA      GemCode Chromium_v1
AAACATTGATCAGC              v1.1.0            NA      GemCode Chromium_v1
AAACCGTGCTTCCG              v1.1.0            NA      GemCode Chromium_v1
AAACCGTGTATGCG              v1.1.0            NA      GemCode Chromium_v1
...                            ...           ...          ...         ...
TTTCGAACTCTCAT              v1.1.0            NA      GemCode Chromium_v1
TTTCTACTGAGGCA              v1.1.0            NA      GemCode Chromium_v1
TTTCTACTTCCTCG              v1.1.0            NA      GemCode Chromium_v1
TTTGCATGAGAGGC              v1.1.0            NA      GemCode Chromium_v1
TTTGCATGCCTCAC              v1.1.0            NA      GemCode Chromium_v1
               Sequence_platform    Individual Date_published       sum
                     <character>   <character>    <character> <numeric>
AAACATACAACCAC        NextSeq500 HealthyDonor2     2016-05-26      2421
AAACATTGAGCTAC        NextSeq500 HealthyDonor2     2016-05-26      4903
AAACATTGATCAGC        NextSeq500 HealthyDonor2     2016-05-26      3149
AAACCGTGCTTCCG        NextSeq500 HealthyDonor2     2016-05-26      2639
AAACCGTGTATGCG        NextSeq500 HealthyDonor2     2016-05-26       981
...                          ...           ...            ...       ...
TTTCGAACTCTCAT        NextSeq500 HealthyDonor2     2016-05-26      3461
TTTCTACTGAGGCA        NextSeq500 HealthyDonor2     2016-05-26      3447
TTTCTACTTCCTCG        NextSeq500 HealthyDonor2     2016-05-26      1684
TTTGCATGAGAGGC        NextSeq500 HealthyDonor2     2016-05-26      1024
TTTGCATGCCTCAC        NextSeq500 HealthyDonor2     2016-05-26      1985
                detected subsets_MT_sum subsets_MT_detected
               <integer>      <numeric>           <integer>
AAACATACAACCAC       781             73                  10
AAACATTGAGCTAC      1352            186                  10
AAACATTGATCAGC      1131             28                   8
AAACCGTGCTTCCG       960             46                  10
AAACCGTGTATGCG       522             12                   5
...                  ...            ...                 ...
TTTCGAACTCTCAT      1155             73                   7
TTTCTACTGAGGCA      1227             32                   8
TTTCTACTTCCTCG       622             37                   7
TTTGCATGAGAGGC       454             21                   7
TTTGCATGCCTCAC       724             16                   6
               subsets_MT_percent     total
                        <numeric> <numeric>
AAACATACAACCAC           3.015283      2421
AAACATTGAGCTAC           3.793596      4903
AAACATTGATCAGC           0.889171      3149
AAACCGTGCTTCCG           1.743085      2639
AAACCGTGTATGCG           1.223242       981
...                           ...       ...
TTTCGAACTCTCAT           2.109217      3461
TTTCTACTGAGGCA           0.928343      3447
TTTCTACTTCCTCG           2.197150      1684
TTTGCATGAGAGGC           2.050781      1024
TTTGCATGCCTCAC           0.806045      1985
# 1. sum: library size of the cell
# 2. detected: number of genes detected in the cell
# 3. subsets_MT_sum: total number of reads on the mitochondrial genes
# 4. subsets_MT_detected: number of mitochondrial genes detected
# 5. subsets_MT_percent: percentage of reads mapped to mitochondrial genes
# 6. total: identical to sum

Exercise 3

Use the newly calculated summary statistics to see whether there is an association between the total count and the number of detected genes in a cell, or between the average count and the number of observed values for a gene.

Library size and number of detected genes

We can plot the distribution of library sizes (sum) and the number of detected genes (detected) across the cells.

suppressPackageStartupMessages({
    library(patchwork)
})
g1 <- ggplot(as.data.frame(colData(sce)), 
             aes(x = log10(sum))) + 
    geom_histogram(bins = 30, fill = "lightgrey", 
                   color = "darkgrey") + 
    labs(x = "log10(total count per cell)") + 
    theme_bw()
g2 <- ggplot(as.data.frame(colData(sce)), 
             aes(x = detected)) + 
    geom_histogram(bins = 30, fill = "lightgrey", 
                   color = "darkgrey") + 
    labs(x = "number of genes detected per cell") + 
    theme_bw()
g1 + g2

Mitochondrial genes

The proportion of counts assigned to mitochondrial genes is another useful indicator of cell quality, since high numbers of such reads can be indicative of cell damage.

ggplot(as.data.frame(colData(sce)), 
       aes(x = subsets_MT_percent)) + 
    geom_histogram(bins = 30, fill = "lightgrey", 
                   color = "darkgrey") + 
    labs(x = "percentage of counts assigned to mitochondrial genes") + 
    theme_bw()

Filtering

Now that we have calculated the QC metrics, we will use them to filter out low-quality cells that will be excluded from the rest of the analysis.

The optimal parameters for filtering are debated and likely data set dependent. Here are two commonly used strategies:

  1. Fixed threshold: For example, you might consider cells to be of low quality if their library sizes is below 100,000 reads; or if they express fewer than 5,000 genes or have mitochondrial proportions above 10%. However, while simple, this strategy requires substantial experience to determine appropriate thresholds for each experimental protocol and biological system.

  2. Adaptive thresholds: A typical approach is to remove cells that fall ‘too far’ from the average cells on one or more of the considered criteria. This makes the implicit assumption that ‘most’ cells are of good quality, which is often sensible. It should also be noted that in some cases, cells that seem to be of bad quality can do so for biological reasons. For example, certain cell types express very few genes, or have a high metabolic rate and consequently express a lot of mitochondrial genes.

For example, here we exclude cells according to two criteria:

  • few detected genes
  • high fraction of reads assigned to mitochondrial genes

For each of these criteria, we exclude cells that are more than 4 median absolute deviations (MAD) from the median across cells, in the direction indicating low quality. The isOutlier function from scuttle can help to extract such cells.

sce$low_detected <- isOutlier(
    sce$detected, type = "lower", 
    log = TRUE, nmads = 4
)

sce$high_mt <- isOutlier(
    sce$subsets_MT_percent, type = "higher",
    log = FALSE, nmads = 4
)

g1 <- ggplot(as.data.frame(colData(sce)), 
             aes(x = rank(-detected), y = detected, color = low_detected)) + 
    geom_point() + 
    labs(x = "cell rank",
         y = "number of detected genes") + 
    theme_bw()
g2 <- ggplot(as.data.frame(colData(sce)), 
             aes(x = rank(subsets_MT_percent), y = subsets_MT_percent, 
                 color = high_mt)) +
    geom_point() + 
    labs(x = "cell rank",
         y = "proportion of counts assigned to mitochondrial genes") + 
    theme_bw()
g1 + g2

Exercise 4

Filter out (remove) the cells with low number of detected genes or high mitochondrial counts and look at the distribution of the remaining cells.

Normalization

Just as for bulk RNA-seq, the raw scRNA-seq counts are not directly comparable across cells due to, e.g., differences in library sizes. This typically arises from technical differences due to the difficulty of achieving consistent library preparation with minimal starting material.

Thus, we need to apply a normalization strategy. There are many approaches to normalization of scRNA-seq data (see e.g. Lytal et al. 2020 and Cole et al. 2019 for comparisons). In this lecture we will focus on scaling normalization, which is the simplest and most commonly used class of normalization strategies. This involves dividing all counts for each cell by a cell-specific scaling factor, often called a “size factor”, similarly as we did for the bulk RNA-seq data before.

Here, we will use one scaling strategy implemented in the scran package. Similarly to the TMM and DESeq normalization approaches, this works by estimating a size factor for each cell, which incorporates the library size as well as a measure of the RNA composition.

However, the bulk RNA-seq methods are sometimes struggling with scRNA-seq data due to the large number of zeros; the scran approach solves this by repeatedly pooling multiple cells (which reduces the number of zeros), calculating size factors for the pools, and deriving individual size factors for the cells from those of the pools. After calculating the size factors, we normalize the observed counts and log-transform the values. The new “log counts” are placed in a new assay in sce, named logcounts.

suppressPackageStartupMessages({
    library(scran)
})

# quickCluster(): pre-clustering step where cells in each cluster are normalized separately. 
# The size factors are rescaled to be comparable across clusters.
# Avoid assumption that most genes are non-DE across the entire population. 
set.seed(871293L)
clust <- quickCluster(sce) 
sce <- computeSumFactors(sce, cluster = clust, min.mean = 0.1)

# new sizeFactor column: 
colnames(colData(sce))
 [1] "Sample"              "Barcode"             "Sequence"           
 [4] "Library"             "Cell_ranger_version" "Tissue_status"      
 [7] "Barcode_type"        "Chemistry"           "Sequence_platform"  
[10] "Individual"          "Date_published"      "sum"                
[13] "detected"            "subsets_MT_sum"      "subsets_MT_detected"
[16] "subsets_MT_percent"  "total"               "low_detected"       
[19] "high_mt"             "sizeFactor"         
# The size factor for each cell represents the estimate of the relative bias in 
# that cell, so division of its counts by its size factor should remove that bias
# To access the size factor: 
sizeFactors(sce)[1:10]
 [1] 0.8916905 1.7609335 1.5073993 1.1070487 0.5736406 0.9310324 0.9238984
 [8] 0.9542445 0.6071420 0.4883361
# Size factors are clearly correlated with (but not identical to) library sizes
ggplot(colData(sce), aes(x = total, y = sizeFactor)) + 
    geom_point() + 
    theme_bw()

# Compute log-transformed normalized expression values (using size factors): 
# divides the count for each gene with the appropriate size factor for that cell
sce <- logNormCounts(sce)
assayNames(sce) # new assay named "logcounts"
[1] "counts"    "logcounts"

Feature selection

scRNA-seq is widely used to characterize heterogeneity across cells. To compare cells based on their gene expression profiles, we can use procedures like clustering and dimensionality reduction which involves aggregating per-gene differences into a single (dis)similarity metric. The choice of genes to use in this calculation has a crucial impact on the behavior of the metric.

In practice, we want to identify genes that contain useful information about the biology of the system while removing genes that contain mostly random noise.

Mean-variance relationship

Variation in gene abundance estimates between different cells can be thought of as the convolution of the technical (mainly sampling) and the biological (e.g cell type) sources of variance. Typically one wants to isolate and focus on the biological variance so that differences due to experimental noise (technical noise) have as small an impact as possible on subsequent analyses.

There are different approaches to disentangling the technical and biological variability in a single-cell data set:

1. Assume that “most” genes exhibit only technical variability (i.e., most genes are not differentially expressed between different studied cell types).

2. Assume that the technical variance follows a Poisson distribution (a common distribution capturing sampling variability in count data), and that deviations from the Poisson expectation corresponds to biological variability.

Quantifying per-gene variation

The simplest approach to quantifying per-gene variation is to compute the variance of the log-normalized expression values (i.e., “log-counts”) for each gene across all cells. This requires modelling of the mean-variance relationship because the log-transformation is often not a variance stabilizing transformation (total variance of a gene is driven more by its abundance than its underlying biological heterogeneity). To account for this effect, we fit a trend to the variance with respect to abundance across all genes.

# Fit a trend to the variance with respect to abundance across all genes
dec.trend <- modelGeneVar(sce)

# Visualize the fit
fit.trend <- metadata(dec.trend)

plot(fit.trend$mean, fit.trend$var, xlab = "Mean of log-expression",
     ylab = "Variance of log-expression")
curve(fit.trend$trend(x), col = "dodgerblue", add = TRUE, lwd = 2)

Quantifying technical noise

However, the previous assumption may be problematic in scenarios where many genes at a particular abundance are affected by a biological process. We can avoid this problem by creating a trend by making some distributional assumptions about the noise. UMI counts typically exhibit near-Poisson variation and can be used to construct a mean-variance trend.

set.seed(7891L)
dec.pois <- modelGeneVarByPoisson(sce)
plot(dec.pois$mean, dec.pois$total, xlab = "Mean of log-expression",
     ylab = "Variance of log-expression")
curve(metadata(dec.pois)$trend(x), col = "dodgerblue", add = TRUE)

In each case, the biological variance of a gene can be estimated as the difference between the total variance and the modeled technical variance. This value can be used to rank the genes, and to select a set of “highly variable” genes to use as the basis of further downstream analysis. Here, we select the top 2,000 genes based on the trend fit, and add these to the metadata of the SingleCellExperiment object.

head(dec.trend[order(dec.trend$bio, decreasing = TRUE), ])
DataFrame with 6 rows and 6 columns
             mean     total      tech       bio      p.value          FDR
        <numeric> <numeric> <numeric> <numeric>    <numeric>    <numeric>
LYZ      1.639005   3.97107  0.883921   3.08715 3.88759e-147 2.15554e-143
S100A9   1.073371   3.41736  0.774743   2.64262 1.89712e-140 6.31133e-137
HLA-DRA  1.552600   3.12767  0.876509   2.25116  1.33932e-80  1.85652e-77
FTL      3.657163   2.96517  0.714211   2.25096 3.04694e-120 7.24041e-117
CD74     2.218245   2.73070  0.865694   1.86500  2.39372e-57  1.73118e-54
S100A8   0.804017   2.51019  0.656455   1.85374  5.65314e-97  8.54858e-94
head(dec.pois[order(dec.pois$bio, decreasing = TRUE), ])
DataFrame with 6 rows and 6 columns
             mean     total      tech       bio   p.value       FDR
        <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
LYZ       1.63901   3.97107  0.666763   3.30431         0         0
S100A9    1.07337   3.41736  0.656436   2.76092         0         0
FTL       3.65716   2.96517  0.225499   2.73967         0         0
HLA-DRA   1.55260   3.12767  0.677265   2.45041         0         0
CD74      2.21824   2.73070  0.549723   2.18097         0         0
FTH1      3.49779   2.26951  0.253019   2.01649         0         0
# many genes are found in both approaches
table(getTopHVGs(dec.trend, n = 2000) %in% 
          getTopHVGs(dec.pois, n = 2000))

FALSE  TRUE 
  533  1467 
# Save hvgs into metadata
metadata(sce)$hvgs <- getTopHVGs(dec.trend, n = 2000)

Data visualization

For bulk RNA-seq data, we typically use PCA for visualization and exploratory analysis. This can of course also be applied to single-cell RNA-seq data. However, other methods (e.g., tSNE and UMAP) are more commonly used for scRNA-seq, since the first two principal components (which are typically used for visualization) often don’t capture a large fraction of the variance in large and heterogeneous single-cell data sets. Both tSNE and UMAP are non-linear dimension reduction methods, which focus to a larger extent on retaining small distances. That means that cells that are similar to each other (e.g., cells of the same cell type) tend to be placed close together in the low-dimensional representation, whereas larger distances are potentially less faithfully represented. PCA, on the other hand, tends to put more focus on preserving the large cell-to-cell distances (the “global structure”). Importantly, even though PCA is rarely used for visualization of large scRNA-seq data sets in two-dimensional plots, it is often used as a first dimension reduction step, and other approaches such as tSNE and UMAP are subsequently applied to the PCA output.

Here, we therefore first apply PCA to our data set. We supply the set of highly variable genes derived above, to only calculate the principal components based on these. By default, the runPCA function from the scater package will apply the PCA to the ‘logcounts’ assay. We plot the first two principal components (note that we extract 30 components, which will be used as the basis for the tSNE later) and color by the different colData entries.

suppressPackageStartupMessages({
    library(scater)
})

sce <- runPCA(sce, exprs_values = "logcounts", ncomponents = 30, 
              subset_row = metadata(sce)$hvgs)

# A new reducedDimNames entry was created
reducedDimNames(sce)
[1] "PCA"
## PCA plotting
# color according to colData
plotReducedDim(sce, "PCA", colour_by = "detected")

plotReducedDim(sce, "PCA", colour_by = "sum")

plotReducedDim(sce, "PCA", colour_by = "subsets_MT_sum")

Next, we apply tSNE, which is a non-linear, stochastic dimension reduction technique that attempts to find a mapping of the data to a low-dimensional subspace while preserving local distances between cells. The non-linear character of tSNE means that it will often produce projections that better resolve differences between cell groups. The better separation of tSNE comes at the cost of interpretability; while in a tSNE representation similar cells are placed close to each other, longer distances in the representation are not guaranteed to reflect true relationships. This means that it is risky to draw conclusions of “similarity” or “dissimilarity” from the positional relationships of different cell groupings that appear in a tSNE plot. In addition, the stochastic nature of tSNE means that every time the algorithm is applied a different representation will be produced unlesss a random seed is set.

# random seed because of stochastic nature of tSNE 
set.seed(789134L)
sce <- runTSNE(sce, dimred = "PCA")
reducedDimNames(sce) 
[1] "PCA"  "TSNE"
plotReducedDim(sce, "TSNE", colour_by = "detected")

plotReducedDim(sce, "TSNE", colour_by = "sum")

plotReducedDim(sce, "TSNE", colour_by = "subsets_MT_sum")

We can often identify specific cell types and gain some understanding of the data directly from the visualization (other types of cell type identification, e.g. via cell type clustering, will be discussed in the next lecture). Typically, this would be done by colouring the cells according to the expression level of known marker genes for certain cell types.

## Color by expression of a B-cell marker (MS4A1)
plotReducedDim(sce, "TSNE", colour_by = "MS4A1")

## Color by expression of a T-cell marker
plotReducedDim(sce, "TSNE", colour_by = "CD3D")

plotReducedDim(sce, "TSNE", colour_by = "CCR7")

## Color by expression of a Monocyte marker
plotReducedDim(sce, "TSNE", colour_by = "LYZ")

## Color by expression of a platelet marker
plotReducedDim(sce, "TSNE", colour_by = "PPBP")

Interactive data exploration

For complex data sets like those from scRNA-seq, it is often helpful to be able to explore the data interactively. The iSEE Bioconductor package provides an easy way to create a flexible visualization interface for any type of ‘rectangular’ data (i.e., data that can be represented using a matrix), including scRNA-seq data.

suppressPackageStartupMessages({
    library(iSEE)
})

iSEE(sce)

Note that iSEE does not perform any analysis of the data set, but rather provides a interface to exploring pre-computed results that are stored in a SummarizedExperiment object (or any extension, such as SingleCellExperiment). Thus, it often becomes more and more useful the further the analysis proceeds and the more results are stored in the object.

Session info

R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.3.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] scater_1.38.0               scran_1.38.0               
 [3] patchwork_1.3.2             ggplot2_4.0.2              
 [5] scuttle_1.20.0              TENxPBMCData_1.28.0        
 [7] HDF5Array_1.38.0            h5mread_1.2.1              
 [9] rhdf5_2.54.1                DelayedArray_0.36.0        
[11] SparseArray_1.11.10         S4Arrays_1.11.1            
[13] abind_1.4-8                 Matrix_1.7-4               
[15] SingleCellExperiment_1.32.0 SummarizedExperiment_1.40.0
[17] Biobase_2.70.0              GenomicRanges_1.62.1       
[19] Seqinfo_1.0.0               IRanges_2.44.0             
[21] S4Vectors_0.48.0            BiocGenerics_0.56.0        
[23] generics_0.1.4              MatrixGenerics_1.22.0      
[25] matrixStats_1.5.0           BiocStyle_2.38.0           
[27] png_0.1-8                   knitr_1.51                 

loaded via a namespace (and not attached):
 [1] DBI_1.3.0            gridExtra_2.3        httr2_1.2.2         
 [4] rlang_1.1.7          magrittr_2.0.4       otel_0.2.0          
 [7] compiler_4.5.2       RSQLite_2.4.6        vctrs_0.7.1         
[10] pkgconfig_2.0.3      crayon_1.5.3         fastmap_1.2.0       
[13] dbplyr_2.5.2         XVector_0.50.0       labeling_0.4.3      
[16] rmarkdown_2.30       ggbeeswarm_0.7.3     purrr_1.2.1         
[19] bit_4.6.0            xfun_0.56            bluster_1.20.0      
[22] cachem_1.1.0         beachmat_2.26.0      jsonlite_2.0.0      
[25] blob_1.3.0           rhdf5filters_1.22.0  Rhdf5lib_1.32.0     
[28] BiocParallel_1.44.0  irlba_2.3.7          parallel_4.5.2      
[31] cluster_2.1.8.2      R6_2.6.1             RColorBrewer_1.1-3  
[34] limma_3.66.0         Rcpp_1.1.1           igraph_2.2.2        
[37] tidyselect_1.2.1     viridis_0.6.5        rstudioapi_0.18.0   
[40] dichromat_2.0-0.1    yaml_2.3.12          codetools_0.2-20    
[43] curl_7.0.0           lattice_0.22-9       tibble_3.3.1        
[46] withr_3.0.2          KEGGREST_1.50.0      S7_0.2.1            
[49] Rtsne_0.17           evaluate_1.0.5       BiocFileCache_3.0.0 
[52] ExperimentHub_3.0.0  Biostrings_2.78.0    pillar_1.11.1       
[55] BiocManager_1.30.27  filelock_1.0.3       BiocVersion_3.22.0  
[58] scales_1.4.0         glue_1.8.0           metapod_1.18.0      
[61] tools_4.5.2          AnnotationHub_4.0.0  BiocNeighbors_2.4.0 
[64] ScaledMatrix_1.18.0  locfit_1.5-9.12      cowplot_1.2.0       
[67] edgeR_4.8.2          AnnotationDbi_1.72.0 beeswarm_0.4.0      
[70] BiocSingular_1.26.1  vipor_0.4.7          cli_3.6.5           
[73] rsvd_1.0.5           rappdirs_0.3.4       viridisLite_0.4.3   
[76] dplyr_1.2.0          gtable_0.3.6         digest_0.6.39       
[79] ggrepel_0.9.7        dqrng_0.4.1          htmlwidgets_1.6.4   
[82] farver_2.1.2         memoise_2.0.1        htmltools_0.5.9     
[85] lifecycle_1.0.5      httr_1.4.8           statmod_1.5.1       
[88] bit64_4.6.0-1