Charlotte Soneson, Athimed El Taher, Panagiotis Papasaikas
Published
April 1, 2026
Why single-cell assays?
(Image credit: Elizabeth Sergison, Honeycomb Biotechnologies)
Limited input material (e.g early development, circulating tumor cells, …)
Sample heterogeneity (mixture of cell types / states)
Number and frequency of cell populations
Markers for cell populations
Variability of gene expression among cells
Cell population relationships (developmental/pseudo-temporal ordering)
Discriminate between changes in overall (bulk) gene expression that are due to shifts in cell type composition, and those due to altered expression levels within a given cell type
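The last point can be made concrete with a small numerical sketch (all numbers invented): the bulk expression of a gene can change purely because the cell type proportions shift, even if the expression within each cell type stays constant.

```r
# Hypothetical example: one gene, two cell types A and B (numbers invented)
expr_A <- 10   # mean expression of the gene in cell type A
expr_B <- 2    # mean expression of the gene in cell type B

# Condition 1: 50% A cells, 50% B cells
bulk_1 <- 0.5 * expr_A + 0.5 * expr_B   # 6

# Condition 2: 80% A cells, 20% B cells; per-cell-type expression unchanged
bulk_2 <- 0.8 * expr_A + 0.2 * expr_B   # 8.4

# The bulk signal changes although no cell changed its expression level;
# single-cell data can distinguish the two scenarios, bulk data cannot
c(bulk_1 = bulk_1, bulk_2 = bulk_2)
```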
(Image credit: 10x Genomics: Getting started with single cell gene expression)
Generating scRNA-seq data
Currently, the most common classes of protocols are
low-throughput, plate-based, full-transcript coverage, with high sequencing depth (e.g., Smart-Seq2)
high-throughput, droplet-based, 3’-end (tag) coverage, with lower sequencing depth per cell (e.g., 10x Genomics Chromium)
Which protocol to choose for a given application depends on the goal of the study.
Plate-based: many detected genes per cell, fewer cells, isoform analysis (perhaps) possible
Droplet-based: fewer detected genes per cell (mostly the highly expressed ones), many cells
Unique molecular identifiers
Because of the small amount of input material, strong amplification is required during library preparation
By attaching a tag to each original transcript molecule, before the amplification, reads generated from the same molecule can be identified and collapsed at the quantification step
These tags are called unique molecular identifiers (UMIs), and are part of many of the single-cell sequencing protocols (in particular the droplet-based ones)
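The collapsing idea can be sketched in base R on a toy set of reads (all identifiers invented): reads sharing the same cell barcode, gene, and UMI are counted as a single molecule.

```r
# Toy example of UMI collapsing (all identifiers invented)
reads <- data.frame(
  cell = c("ACGT", "ACGT", "ACGT", "TTAG"),
  gene = c("GeneX", "GeneX", "GeneX", "GeneX"),
  umi  = c("AAC", "AAC", "GTT", "AAC")  # first two reads are PCR duplicates
)

# Number of reads per cell and gene (inflated by amplification)
table(reads$cell, reads$gene)

# Number of unique molecules: count distinct UMIs per cell and gene
umi_counts <- aggregate(umi ~ cell + gene, data = reads,
                        FUN = function(x) length(unique(x)))
umi_counts
# Cell ACGT has 3 reads for GeneX but only 2 unique molecules
```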
Example read structure (10x Genomics Chromium v3)
(Image credit: 10x Genomics)
The sample index identifies the library (in case of multiplexing)
The 10x barcode (or cell index) identifies the cell within the library
The UMI identifies the transcript molecule within a cell and gene
The insert is the actual cDNA sequence
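As a rough sketch of how such a read decomposes, assuming the commonly cited Chromium v3 layout of a 16 bp cell barcode followed by a 12 bp UMI in read 1 (the example sequence itself is invented):

```r
# Invented example read 1 sequence: 16 bp cell barcode + 12 bp UMI
read1 <- "ACGTACGTACGTACGTAAACCCGGGTTT"

cell_barcode <- substr(read1, 1, 16)   # identifies the cell
umi          <- substr(read1, 17, 28)  # identifies the molecule
# The cDNA insert is sequenced separately as read 2

c(cell_barcode = cell_barcode, umi = umi)
```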
Quantification of scRNA-seq data
The quantification approach depends on the type of protocol
For Smart-Seq2 and similar protocols, one often uses the same tools as for bulk RNA-seq (STAR+featureCounts, Salmon, kallisto)
For 10x Genomics and similar protocols (where reads are generated from the 3’ end of the transcript, and additionally we need to collapse reads with the same UMI within a gene), there are specialized tools, often building on one of the bulk RNA-seq tools (e.g., CellRanger, STARsolo, alevin-fry, kallisto|bustools)
We have seen how bulk RNA-seq data is often represented in a SummarizedExperiment object (at least within the Bioconductor framework)
Single-cell RNA-seq has a similar structure (a count matrix, with genes as rows and cells as columns), and is represented in a SingleCellExperiment object
The SingleCellExperiment object is an extension of the SummarizedExperiment object
The main difference is that a SingleCellExperiment object can also store reduced dimension representations (PCA, tSNE, UMAP)
There are also other commonly used representation formats, such as the Seurat object (R) and AnnData (Python).
Several packages are available for converting between these formats
(Original image credit: Jim Hester / https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html)
Load SingleCellExperiment
For this lecture, we will use an example data set from the TENxPBMCData Bioconductor package. This package contains a number of data sets of PBMCs (peripheral blood mononuclear cells), obtained with the 10x Genomics Chromium platform. Here we will use one containing approximately 3,000 PBMCs.
The object contains annotations for the genes (rows), including the Ensembl gene ID and the gene symbol. Since gene symbols are generally easier to remember and interpret, we will use them as row names (keeping the information about the corresponding Ensembl ID in the object). Only if a gene symbol is not unique will we fall back to the Ensembl ID. This can be achieved using the uniquifyFeatureNames function from the scuttle Bioconductor package.
```r
# Dimension of SingleCellExperiment
dim(sce) # genes x cells (barcodes)
```
[1] 32738 2700
```r
# colnames of SingleCellExperiment
head(colnames(sce))
```
NULL
```r
# rowData (gene annotation) of SingleCellExperiment
rowData(sce)
```
DataFrame with 32738 rows and 3 columns
ENSEMBL_ID Symbol_TENx Symbol
<character> <character> <character>
ENSG00000243485 ENSG00000243485 MIR1302-10 NA
ENSG00000237613 ENSG00000237613 FAM138A FAM138A
ENSG00000186092 ENSG00000186092 OR4F5 OR4F5
ENSG00000238009 ENSG00000238009 RP11-34P13.7 LOC100996442
ENSG00000239945 ENSG00000239945 RP11-34P13.8 NA
... ... ... ...
ENSG00000215635 ENSG00000215635 AC145205.1 NA
ENSG00000268590 ENSG00000268590 BAGE5 NA
ENSG00000251180 ENSG00000251180 CU459201.1 NA
ENSG00000215616 ENSG00000215616 AC002321.2 NA
ENSG00000215611 ENSG00000215611 AC002321.1 NA
```r
# set gene symbol as row names of the data object, use Ensembl ID if not unique
rownames(sce) <- uniquifyFeatureNames(ID = rownames(sce),
                                      names = rowData(sce)$Symbol_TENx)
# set the colnames
colnames(sce) <- sce$Sequence
sce
```
Accessing the assay data, row and column annotations is done in the same way as for SummarizedExperiment objects. There is also an additional counts() accessor for the counts assay.
```r
# All assays present in the sce object
assayNames(sce)
```
[1] "counts"
```r
# Access the "counts" assay in different ways
counts_method1 <- assays(sce)[["counts"]]
counts_method2 <- assay(sce, "counts")
counts_method3 <- counts(sce)
counts_method1[1:5, 1:3]
```
While the structure of the scRNAseq data is similar to that of the bulk data, there are also important differences that affect the downstream analysis. One of these differences is that single-cell data is much more sparse; in other words, there are many more zeros in the count matrix from a single-cell experiment than from a bulk experiment.
This is due to things such as:
the much lower sequencing depth per cell/sample (especially for droplet-based protocols)
not every individual cell expressing each gene
a failure of the library preparation to capture many of the expressed transcript molecules
Let’s check the fraction of zeros in our count matrix:
```r
# fraction of zeros per barcode/cell
frac_0_per_cell <- colMeans(counts(sce) == 0)
head(frac_0_per_cell)
```
```r
# Summary of library size distribution per cell
summary(colSums(counts(sce)))
```
Min. 1st Qu. Median Mean 3rd Qu. Max.
548 1758 2197 2367 2763 15844
The number of cells in a scRNA-seq data set is typically much (several orders of magnitude) larger than the number of samples in a bulk RNA-seq experiment. Hence, the count matrices can get very large. However, since most of the values are zero, efficient storage modes, where only the non-zero values and the corresponding matrix positions are stored, can be employed. We can make sure that the count matrix in our object is indeed such a sparse matrix (in this particular data set, it is actually provided as a DelayedMatrix, which is beyond the scope of this course to discuss in detail, but which can be suitable for very large data matrices that do not fit in memory).
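To illustrate why sparse storage helps (on a simulated matrix, not the PBMC data): the Matrix package stores only the non-zero values and their positions, which for a mostly-zero matrix takes far less memory than the dense representation.

```r
library(Matrix)
set.seed(1)

# Simulated count matrix with ~95% zeros
dense <- matrix(0, nrow = 2000, ncol = 500)
idx <- sample(length(dense), size = 0.05 * length(dense))
dense[idx] <- rpois(length(idx), lambda = 3) + 1

# Store only the non-zero values and their positions
sparse <- as(dense, "CsparseMatrix")

# The sparse representation needs far less memory
as.numeric(object.size(dense))
as.numeric(object.size(sparse))
mean(dense == 0)  # fraction of zeros
```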
We already noted a couple of differences between scRNA-seq data and bulk data:
The former typically contains many more observations
The count matrix is much more sparse.
The low amount of starting material for scRNA-seq experiments also results in high sampling noise
There is a lower correlation among cells than among bulk RNA-seq samples.
There are many genes that are not observed in any given cell, but still have a high number of assigned counts in other cells. This indicates that the gene detection is to some extent random, as expected by the sampling process. In other words, it is not always the same genes that go undetected in all cells.
While we don’t observe each gene in each cell, there is still a clear association between the overall expression level of a gene (e.g., total count) and the fraction of cells where it is detected.
```r
ggplot(data.frame(total_count = rowSums(counts(sce)),
                  frac_det = rowMeans(counts(sce) > 0)),
       aes(x = total_count, y = frac_det)) +
  geom_point() +
  labs(x = "Total count per gene",
       y = "Fraction of cells in which the gene is detected") +
  scale_x_log10() +
  theme_bw()
```
Warning in scale_x_log10(): log-10 transformation introduced infinite
values.
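This saturating relationship is expected even from pure Poisson sampling: if the counts for a gene in a cell are Poisson with mean mu, the probability of detecting the gene is 1 - exp(-mu). A small simulation (invented parameters) reproduces the curve:

```r
set.seed(42)

# Simulated genes spanning a range of mean expression levels
mu <- 10^seq(-3, 1, length.out = 50)
n_cells <- 1000

# For each gene, draw Poisson counts across cells and compute detection fraction
frac_det <- sapply(mu, function(m) mean(rpois(n_cells, m) > 0))

# Compare with the analytical Poisson expectation 1 - exp(-mu)
expected <- 1 - exp(-mu)
round(head(cbind(mu = mu, observed = frac_det, expected = expected)), 4)
```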
Exercise 1 - Exploration
Explore the SingleCellExperiment object:
dimension: How many genes and cells are in the object?
colnames: What are the column names of the object?
rownames: What are the row names of the object?
colData: What info is stored in the colData (cell metadata)?
rowData: What info is stored in the rowData (gene metadata)
Exercise 2
Check that the counts() accessor returns the same matrix as you would get if you access the counts assay using assay() or assays().
How many genes are detected in at least one cell? How many genes are detected for each individual cell?
Quality control (QC)
Low-quality libraries can arise for a variety of reasons (e.g. cell damage during dissociation, failure in library preparation)
They are problematic as they can contribute to misleading results in downstream analyses (e.g., they may form their own distinct clusters, interfere with the characterization of population heterogeneity, or show large variability in their scaling factors)
Low quality cells usually have low total counts, few expressed genes and high mitochondrial proportion of the counts.
To calculate these summary QC statistics (for both cells and genes), we again use the scuttle Bioconductor package.
By providing a subset of genes (here, mitochondrial genes), we can calculate the fraction of counts falling in these genes.
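The per-cell summaries computed by scuttle can be sketched in base R on a toy matrix (gene and cell names invented), which makes clear what each statistic measures:

```r
# Toy count matrix: 4 genes (one mitochondrial) x 3 cells
counts_mat <- matrix(c(5, 0, 2,
                       0, 3, 0,
                       1, 0, 4,
                       2, 2, 2),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(c("GeneA", "GeneB", "GeneC", "MT-ND1"),
                                     c("cell1", "cell2", "cell3")))
is_mt <- grepl("^MT-", rownames(counts_mat))

qc <- data.frame(
  sum      = colSums(counts_mat),       # library size per cell
  detected = colSums(counts_mat > 0),   # genes detected per cell
  mt_pct   = 100 * colSums(counts_mat[is_mt, , drop = FALSE]) /
             colSums(counts_mat)        # percent of counts from MT genes
)
qc
```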
```r
# add QC statistics per gene
sce <- addPerFeatureQCMetrics(sce)
# Two new columns (mean and detected) are added to the rowData
rowData(sce)
```
DataFrame with 32738 rows and 5 columns
ENSEMBL_ID Symbol_TENx Symbol mean detected
<character> <character> <character> <numeric> <numeric>
MIR1302-10 ENSG00000243485 MIR1302-10 NA 0 0
FAM138A ENSG00000237613 FAM138A FAM138A 0 0
OR4F5 ENSG00000186092 OR4F5 OR4F5 0 0
RP11-34P13.7 ENSG00000238009 RP11-34P13.7 LOC100996442 0 0
RP11-34P13.8 ENSG00000239945 RP11-34P13.8 NA 0 0
... ... ... ... ... ...
AC145205.1 ENSG00000215635 AC145205.1 NA 0 0
BAGE5 ENSG00000268590 BAGE5 NA 0 0
CU459201.1 ENSG00000251180 CU459201.1 NA 0 0
AC002321.2 ENSG00000215616 AC002321.2 NA 0 0
AC002321.1 ENSG00000215611 AC002321.1 NA 0 0
```r
# 1. mean: mean expression of the gene (mean count across all cells)
# 2. detected: percentage of cells in which the gene is detected (count > 0)

# add QC statistics per cell:
# first extract the mitochondrial genes ("MT-" prefix in their symbol)
mt <- rownames(sce)[grep("^MT-", rowData(sce)$Symbol_TENx)]
mt # 13 genes
```
```r
# provide this list to addPerCellQCMetrics to calculate summaries
# globally as well as specifically for the gene subset
sce <- addPerCellQCMetrics(sce, subsets = list(MT = mt))
# Six new columns are added to the colData
colData(sce)
```
```r
# 1. sum: library size of the cell
# 2. detected: number of genes detected in the cell
# 3. subsets_MT_sum: total number of counts assigned to mitochondrial genes
# 4. subsets_MT_detected: number of mitochondrial genes detected
# 5. subsets_MT_percent: percentage of counts assigned to mitochondrial genes
# 6. total: identical to sum
```
Exercise 3
Use the newly calculated summary statistics to see whether there is an association between the total count and the number of detected genes in a cell, or between the average count and the number of observed values for a gene.
Library size and number of detected genes
We can plot the distribution of library sizes (sum) and the number of detected genes (detected) across the cells.
```r
suppressPackageStartupMessages({
  library(patchwork)
})
g1 <- ggplot(as.data.frame(colData(sce)), aes(x = log10(sum))) +
  geom_histogram(bins = 30, fill = "lightgrey", color = "darkgrey") +
  labs(x = "log10(total count per cell)") +
  theme_bw()
g2 <- ggplot(as.data.frame(colData(sce)), aes(x = detected)) +
  geom_histogram(bins = 30, fill = "lightgrey", color = "darkgrey") +
  labs(x = "number of genes detected per cell") +
  theme_bw()
g1 + g2
```
Mitochondrial genes
The proportion of counts assigned to mitochondrial genes is another useful indicator of cell quality, since high numbers of such reads can be indicative of cell damage.
```r
ggplot(as.data.frame(colData(sce)), aes(x = subsets_MT_percent)) +
  geom_histogram(bins = 30, fill = "lightgrey", color = "darkgrey") +
  labs(x = "percentage of counts assigned to mitochondrial genes") +
  theme_bw()
```
Filtering
Now that we have calculated the QC metrics, we will use them to filter out low-quality cells that will be excluded from the rest of the analysis.
The optimal parameters for filtering are debated and likely data set dependent. Here are two commonly used strategies:
Fixed threshold: For example, you might consider cells to be of low quality if their library size is below 100,000 reads, if they express fewer than 5,000 genes, or if their mitochondrial proportion is above 10%. However, while simple, this strategy requires substantial experience to determine appropriate thresholds for each experimental protocol and biological system.
Adaptive thresholds: A typical approach is to remove cells that fall ‘too far’ from the average cell on one or more of the considered criteria. This makes the implicit assumption that ‘most’ cells are of good quality, which is often sensible. It should also be noted that in some cases, cells can appear to be of bad quality for biological reasons. For example, certain cell types express very few genes, or have a high metabolic rate and consequently express many mitochondrial genes.
For example, here we exclude cells according to two criteria:
few detected genes
high fraction of reads assigned to mitochondrial genes
For each of these criteria, we exclude cells that are more than 4 median absolute deviations (MAD) from the median across cells, in the direction indicating low quality. The isOutlier function from scuttle can help to extract such cells.
```r
sce$low_detected <- isOutlier(
  sce$detected, type = "lower", log = TRUE, nmads = 4
)
sce$high_mt <- isOutlier(
  sce$subsets_MT_percent, type = "higher", log = FALSE, nmads = 4
)
g1 <- ggplot(as.data.frame(colData(sce)),
             aes(x = rank(-detected), y = detected, color = low_detected)) +
  geom_point() +
  labs(x = "cell rank", y = "number of detected genes") +
  theme_bw()
g2 <- ggplot(as.data.frame(colData(sce)),
             aes(x = rank(subsets_MT_percent), y = subsets_MT_percent,
                 color = high_mt)) +
  geom_point() +
  labs(x = "cell rank",
       y = "proportion of counts assigned to mitochondrial genes") +
  theme_bw()
g1 + g2
```
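The MAD rule itself can be sketched in base R (toy percentages invented): a value is flagged when it lies more than nmads median absolute deviations from the median, in the direction indicating low quality.

```r
# Toy per-cell mitochondrial percentages, with one clear outlier
mt_pct <- c(2, 3, 2.5, 3.5, 2.8, 3.1, 2.2, 45)
nmads <- 4

med <- median(mt_pct)
dev <- mad(mt_pct)           # median absolute deviation (scaled)
upper <- med + nmads * dev   # threshold in the "higher" direction

high_mt <- mt_pct > upper
high_mt  # only the cell with 45% is flagged
```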
Exercise 4
Filter out (remove) the cells with low number of detected genes or high mitochondrial counts and look at the distribution of the remaining cells.
Normalization
Just as for bulk RNA-seq, the raw scRNA-seq counts are not directly comparable across cells due to, e.g., differences in library sizes. This typically arises from technical differences due to the difficulty of achieving consistent library preparation with minimal starting material.
Thus, we need to apply a normalization strategy. There are many approaches to normalization of scRNA-seq data (see e.g. Lytal et al. 2020 and Cole et al. 2019 for comparisons). In this lecture we will focus on scaling normalization, which is the simplest and most commonly used class of normalization strategies. This involves dividing all counts for each cell by a cell-specific scaling factor, often called a “size factor”, similarly as we did for the bulk RNA-seq data before.
Here, we will use one scaling strategy implemented in the scran package. Similarly to the TMM and DESeq normalization approaches, this works by estimating a size factor for each cell, which incorporates the library size as well as a measure of the RNA composition.
However, bulk RNA-seq methods sometimes struggle with scRNA-seq data due to the large number of zeros; the scran approach solves this by repeatedly pooling multiple cells (which reduces the number of zeros), calculating size factors for the pools, and deriving individual size factors for the cells from those of the pools. After calculating the size factors, we normalize the observed counts and log-transform the values. The new “log counts” are placed in a new assay in sce, named logcounts.
```r
suppressPackageStartupMessages({
  library(scran)
})
# quickCluster(): pre-clustering step; cells in each cluster are normalized
# separately, and the size factors are rescaled to be comparable across clusters.
# This avoids assuming that most genes are non-DE across the entire population.
set.seed(871293L)
clust <- quickCluster(sce)
sce <- computeSumFactors(sce, cluster = clust, min.mean = 0.1)
# new sizeFactor column:
colnames(colData(sce))
```
```r
# The size factor for each cell represents the estimate of the relative bias in
# that cell, so division of its counts by its size factor should remove that bias.
# To access the size factors:
sizeFactors(sce)[1:10]
```
```r
# Size factors are clearly correlated with (but not identical to) library sizes
ggplot(as.data.frame(colData(sce)), aes(x = total, y = sizeFactor)) +
  geom_point() +
  theme_bw()
```
```r
# Compute log-transformed normalized expression values (using size factors):
# divides the count for each gene by the appropriate size factor for that cell
sce <- logNormCounts(sce)
assayNames(sce) # new assay named "logcounts"
```
[1] "counts" "logcounts"
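The scaling step itself can be sketched in base R on a toy matrix. Here we use simple library-size factors as a stand-in for scran's pooled size factors: each cell's counts are divided by its size factor and log-transformed with a pseudocount.

```r
# Toy count matrix: 3 genes x 4 cells
m <- matrix(c(10, 20, 5, 40,
               0,  4, 1,  8,
               2,  0, 1,  4),
            nrow = 3, byrow = TRUE)

# Library-size factors, centered so their mean is 1
# (scran's pooled, deconvolved factors would replace these)
libsize <- colSums(m)
sf <- libsize / mean(libsize)

# Divide each cell's counts by its size factor, then log-transform
norm <- sweep(m, 2, sf, "/")
logcounts <- log2(norm + 1)

round(logcounts, 2)
```

After scaling, every cell has the same effective library size, so the remaining differences between cells are not driven by sequencing depth.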
Feature selection
scRNA-seq is widely used to characterize heterogeneity across cells. To compare cells based on their gene expression profiles, we can use procedures like clustering and dimensionality reduction, which involve aggregating per-gene differences into a single (dis)similarity metric. The choice of genes to use in this calculation has a crucial impact on the behavior of the metric.
In practice, we want to identify genes that contain useful information about the biology of the system while removing genes that contain mostly random noise.
Mean-variance relationship
Variation in gene abundance estimates between different cells can be thought of as the convolution of technical (mainly sampling) and biological (e.g., cell type) sources of variance. Typically, one wants to isolate and focus on the biological variance, so that differences due to experimental (technical) noise have as small an impact as possible on subsequent analyses.
There are different approaches to disentangling the technical and biological variability in a single-cell data set:
1. Assume that “most” genes exhibit only technical variability (i.e., most genes are not differentially expressed between different studied cell types).
2. Assume that the technical variance follows a Poisson distribution (a common distribution capturing sampling variability in count data), and that deviations from the Poisson expectation correspond to biological variability.
Quantifying per-gene variation
The simplest approach to quantifying per-gene variation is to compute the variance of the log-normalized expression values (i.e., “log-counts”) for each gene across all cells. This requires modelling of the mean-variance relationship because the log-transformation is often not a variance stabilizing transformation (total variance of a gene is driven more by its abundance than its underlying biological heterogeneity). To account for this effect, we fit a trend to the variance with respect to abundance across all genes.
```r
# Fit a trend to the variance with respect to abundance across all genes
dec.trend <- modelGeneVar(sce)
# Visualize the fit
fit.trend <- metadata(dec.trend)
plot(fit.trend$mean, fit.trend$var,
     xlab = "Mean of log-expression",
     ylab = "Variance of log-expression")
curve(fit.trend$trend(x), col = "dodgerblue", add = TRUE, lwd = 2)
```
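The logic behind this decomposition can be sketched with base R on simulated data (all values invented, with loess as a stand-in for scran's trend fit): fit a smooth mean-variance trend, treat the trend as the technical component, and rank genes by the residual ("biological") variance.

```r
set.seed(123)

# Simulated per-gene means and variances with a known trend plus noise
n_genes <- 500
gene_mean <- runif(n_genes, 0.1, 5)
trend_true <- 0.5 * gene_mean / (1 + gene_mean)  # invented trend shape
gene_var <- trend_true + rnorm(n_genes, sd = 0.02)

# Fit a smooth trend of variance against mean
fit <- loess(gene_var ~ gene_mean)
tech <- predict(fit, gene_mean)  # modeled "technical" variance
bio <- gene_var - tech           # residual "biological" variance

# Highly variable genes: largest residual variance
hvgs <- order(bio, decreasing = TRUE)[1:50]
head(hvgs)
```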
Quantifying technical noise
However, the previous assumption may be problematic in scenarios where many genes at a particular abundance are affected by a biological process. We can avoid this problem by instead constructing the trend from distributional assumptions about the technical noise: UMI counts typically exhibit near-Poisson variation, which can be used to build the mean-variance trend.
```r
set.seed(7891L)
dec.pois <- modelGeneVarByPoisson(sce)
plot(dec.pois$mean, dec.pois$total,
     xlab = "Mean of log-expression",
     ylab = "Variance of log-expression")
curve(metadata(dec.pois)$trend(x), col = "dodgerblue", add = TRUE)
```
In each case, the biological variance of a gene can be estimated as the difference between the total variance and the modeled technical variance. This value can be used to rank the genes, and to select a set of “highly variable” genes to use as the basis of further downstream analysis. Here, we select the top 2,000 genes based on the trend fit, and add these to the metadata of the SingleCellExperiment object.
```r
# many genes are found by both approaches
table(getTopHVGs(dec.trend, n = 2000) %in% getTopHVGs(dec.pois, n = 2000))
```
FALSE TRUE
533 1467
```r
# Save HVGs into the metadata
metadata(sce)$hvgs <- getTopHVGs(dec.trend, n = 2000)
```
Data visualization
For bulk RNA-seq data, we typically use PCA for visualization and exploratory analysis. This can of course also be applied to single-cell RNA-seq data. However, other methods (e.g., tSNE and UMAP) are more commonly used for scRNA-seq, since the first two principal components (which are typically used for visualization) often don’t capture a large fraction of the variance in large and heterogeneous single-cell data sets. Both tSNE and UMAP are non-linear dimension reduction methods, which focus to a larger extent on retaining small distances. That means that cells that are similar to each other (e.g., cells of the same cell type) tend to be placed close together in the low-dimensional representation, whereas larger distances are potentially less faithfully represented. PCA, on the other hand, tends to put more focus on preserving the large cell-to-cell distances (the “global structure”). Importantly, even though PCA is rarely used for visualization of large scRNA-seq data sets in two-dimensional plots, it is often used as a first dimension reduction step, and other approaches such as tSNE and UMAP are subsequently applied to the PCA output.
Here, we therefore first apply PCA to our data set. We supply the set of highly variable genes derived above, to only calculate the principal components based on these. By default, the runPCA function from the scater package will apply the PCA to the ‘logcounts’ assay. We plot the first two principal components (note that we extract 30 components, which will be used as the basis for the tSNE later) and color by the different colData entries.
```r
suppressPackageStartupMessages({
  library(scater)
})
sce <- runPCA(sce, exprs_values = "logcounts", ncomponents = 30,
              subset_row = metadata(sce)$hvgs)
# A new reducedDimNames entry was created
reducedDimNames(sce)
```
[1] "PCA"
```r
## PCA plot, colored according to a colData column
plotReducedDim(sce, "PCA", colour_by = "detected")
```
Next, we apply tSNE, which is a non-linear, stochastic dimension reduction technique that attempts to find a mapping of the data to a low-dimensional subspace while preserving local distances between cells. The non-linear character of tSNE means that it will often produce projections that better resolve differences between cell groups. This better separation comes at the cost of interpretability: while similar cells are placed close to each other in a tSNE representation, longer distances are not guaranteed to reflect true relationships. This means that it is risky to draw conclusions about the "similarity" or "dissimilarity" of different cell groupings from their positions in a tSNE plot. In addition, the stochastic nature of tSNE means that every time the algorithm is applied, a different representation will be produced unless a random seed is set.
```r
# set a random seed because of the stochastic nature of tSNE
set.seed(789134L)
sce <- runTSNE(sce, dimred = "PCA")
reducedDimNames(sce)
```
We can often identify specific cell types and gain some understanding of the data directly from the visualization (other types of cell type identification, e.g. via cell type clustering, will be discussed in the next lecture). Typically, this would be done by colouring the cells according to the expression level of known marker genes for certain cell types.
```r
## Color by expression of a B-cell marker (MS4A1)
plotReducedDim(sce, "TSNE", colour_by = "MS4A1")
```

```r
## Color by expression of a T-cell marker (CD3D)
plotReducedDim(sce, "TSNE", colour_by = "CD3D")
```

```r
plotReducedDim(sce, "TSNE", colour_by = "CCR7")
```

```r
## Color by expression of a monocyte marker (LYZ)
plotReducedDim(sce, "TSNE", colour_by = "LYZ")
```

```r
## Color by expression of a platelet marker (PPBP)
plotReducedDim(sce, "TSNE", colour_by = "PPBP")
```
Interactive data exploration
For complex data sets like those from scRNA-seq, it is often helpful to be able to explore the data interactively. The iSEE Bioconductor package provides an easy way to create a flexible visualization interface for any type of ‘rectangular’ data (i.e., data that can be represented using a matrix), including scRNA-seq data.
Note that iSEE does not perform any analysis of the data set, but rather provides an interface for exploring pre-computed results that are stored in a SummarizedExperiment object (or any extension, such as SingleCellExperiment). Thus, it typically becomes more useful the further the analysis proceeds and the more results are stored in the object.