Introduction to single-cell RNA sequencing

Author

Athimed El Taher

Published

April 10, 2024

Why single-cell assays?

(Image credit: Elizabeth Sergison, Honeycomb Biotechnologies)

Limited input material (e.g early development, circulating tumor cells, …)
Sample heterogeneity (mixture of cell types / states)
- Number and frequency of cell populations
- Markers for cell populations
- Variability of gene expression among cells
- Cell population relationships (developmental/pseudo-temporal ordering)
- Discriminate between changes in overall (bulk) gene expression due to changes in cell type composition, and changes due to changes in expression levels within a given cell type

(Image credit: 10x Genomics: Getting started with single cell gene expression)

Generating scRNA-seq data

Currently, the most common classes of protocols are
- low-throughput, plate-based, full transcript coverage, with high sequencing depth (e.g., Smart-Seq2)
- high-throughput, droplet-based, 3’ biased, with low sequencing depth (e.g., 10x Genomics Chromium)
We will not go through the different protocols in detail, but if you want to read more, see e.g.
- Mereu et al. 2020
- Ding et al. 2020
Which protocol to choose for a given application depends on the goal of the study.
- Plate-based: many detected genes per cell, fewer cells, isoform analysis (perhaps) possible
- Droplet-based: fewer detected genes per cell (mostly the highly expressed ones), many cells

Unique molecular identifiers

Because of the small amount of input material, strong amplification is required during library preparation
By attaching a tag to each original transcript molecule, before the amplification, reads generated from the same molecule can be identified and collapsed at the quantification step
These tags are called unique molecular identifiers (UMIs), and are part of many of the single-cell sequencing protocols (in particular the droplet-based ones)

(Image credit: lexogen: rna-lexicon chapter8)

Example of UMI application

(Image credit: lexogen: rna-lexicon chapter8)

Example read structure (10x Genomics Chromium v3)

(Image credit: 10x Genomics)

The sample index identifies the library (in case of multiplexing)
The 10x barcode (or cell index) identifies the cell within the library
The UMI identifies the transcript molecule within a cell and gene
The insert is the actual cDNA sequence

Quantification of scRNA-seq data

The quantification approach depends on the type of protocol
For Smart-Seq2 and similar protocols, one often uses the same tools as for bulk RNA-seq (STAR+featureCounts, Salmon, kallisto)
For 10x Genomics and similar protocols (where reads are generated from the 3’ end of the transcript, and additionally we need to collapse reads with the same UMI within a gene), there are specialized tools, often building on one of the bulk RNA-seq tools:
- alevin (based on Salmon)
- kallisto|bustools (based on kallisto)
- STARsolo (based on STAR alignment)
- CellRanger (specifically for 10x Genomics, based on STAR alignment)
Depending on the quantification tools, there are dedicated functions for importing the quantifications into R:
- DropletUtils and Seurat contain functions for reading CellRanger output
- tximeta can read alevin output (specify type = "alevin")
- BUSpaRse can read kallisto|bustools output

Representation of scRNA-seq data

We have seen how bulk RNA-seq data is often represented in a SummarizedExperiment object (at least within the Bioconductor framework)
Single-cell RNA-seq has a similar structure (a count matrix, with genes as rows and cells as columns), and is represented in a SingleCellExperiment object
The SingleCellExperiment object is an extension of the SummarizedExperiment object
The main difference is that a SingleCellExperiment object can also store reduced dimension representations (PCA, tSNE, UMAP)

(Original image credit: Jim Hester / https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html)

Example data

Load SingleCellExperiment

For this lecture, we will use an example data set from the TENxPBMCData Bioconductor package. This package contains a number of PBMC data sets obtained with the 10x Genomics Chromium platform. Here we will use one containing approximately 3,000 peripheral blood mononuclear cells.

suppressPackageStartupMessages({
library(TENxPBMCData)
library(SingleCellExperiment)
})
# Import dataset:
sce = TENxPBMCData("pbmc5k-CITEseq")

see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation

loading from cache

We can see that sce is indeed a SingleCellExperiment object, and that the structure is similar to that of a SummarizedExperiment object.

sce

class: SingleCellExperiment 
dim: 33538 5247 
metadata(0):
assays(1): counts
rownames(33538): ENSG00000243485 ENSG00000237613 ...
  ENSG00000277475 ENSG00000268674
rowData names(4): ENSEMBL_ID Symbol_TENx Type Symbol
colnames: NULL
colData names(11): Sample Barcode ... Individual Date_published
reducedDimNames(0):
mainExpName: Gene Expression
altExpNames(1): Antibody Capture

Exploration

The object contains gene annotations including the Ensembl gene ID and the gene symbol; to make it easier to interpret the results we combine these into a gene identifier that is used as the row names of the data object.

# Dimension of SingleCellExperiment: 
dim(sce) # genes X cells (barcodes)

[1] 33538  5247

# colnames SingleCellExperiment:
head(colnames(sce))

NULL

# rownames SingleCellExperiment:
head(rownames(sce))

[1] "ENSG00000243485" "ENSG00000237613" "ENSG00000186092" "ENSG00000238009"
[5] "ENSG00000239945" "ENSG00000239906"

# SingleCellExperiment colData info:
colData(sce)

DataFrame with 5247 rows and 11 columns
             Sample            Barcode         Sequence   Library
        <character>        <character>      <character> <integer>
1    pbmc5k-CITEseq AAACCCAAGAGACAAG-1 AAACCCAAGAGACAAG         1
2    pbmc5k-CITEseq AAACCCAAGGCCTAGA-1 AAACCCAAGGCCTAGA         1
3    pbmc5k-CITEseq AAACCCAGTCGTGCCA-1 AAACCCAGTCGTGCCA         1
4    pbmc5k-CITEseq AAACCCATCGTGCATA-1 AAACCCATCGTGCATA         1
5    pbmc5k-CITEseq AAACGAAAGACAAGCC-1 AAACGAAAGACAAGCC         1
...             ...                ...              ...       ...
5243 pbmc5k-CITEseq TTTGTTGCAGCACAAG-1 TTTGTTGCAGCACAAG         1
5244 pbmc5k-CITEseq TTTGTTGCAGTCTTCC-1 TTTGTTGCAGTCTTCC         1
5245 pbmc5k-CITEseq TTTGTTGCATGGCCCA-1 TTTGTTGCATGGCCCA         1
5246 pbmc5k-CITEseq TTTGTTGCATTGCCGG-1 TTTGTTGCATTGCCGG         1
5247 pbmc5k-CITEseq TTTGTTGGTCCGGCAT-1 TTTGTTGGTCCGGCAT         1
     Cell_ranger_version Tissue_status Barcode_type   Chemistry
             <character>   <character>  <character> <character>
1                 v3.0.2            NA     Chromium Chromium_v3
2                 v3.0.2            NA     Chromium Chromium_v3
3                 v3.0.2            NA     Chromium Chromium_v3
4                 v3.0.2            NA     Chromium Chromium_v3
5                 v3.0.2            NA     Chromium Chromium_v3
...                  ...           ...          ...         ...
5243              v3.0.2            NA     Chromium Chromium_v3
5244              v3.0.2            NA     Chromium Chromium_v3
5245              v3.0.2            NA     Chromium Chromium_v3
5246              v3.0.2            NA     Chromium Chromium_v3
5247              v3.0.2            NA     Chromium Chromium_v3
     Sequence_platform   Individual Date_published
           <character>  <character>    <character>
1              NovaSeq HealthyDonor     2019-05-29
2              NovaSeq HealthyDonor     2019-05-29
3              NovaSeq HealthyDonor     2019-05-29
4              NovaSeq HealthyDonor     2019-05-29
5              NovaSeq HealthyDonor     2019-05-29
...                ...          ...            ...
5243           NovaSeq HealthyDonor     2019-05-29
5244           NovaSeq HealthyDonor     2019-05-29
5245           NovaSeq HealthyDonor     2019-05-29
5246           NovaSeq HealthyDonor     2019-05-29
5247           NovaSeq HealthyDonor     2019-05-29

# SingleCellExperiment rowData info:
rowData(sce)

DataFrame with 33538 rows and 4 columns
                     ENSEMBL_ID Symbol_TENx            Type       Symbol
                    <character> <character>     <character>  <character>
ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression           NA
ENSG00000237613 ENSG00000237613     FAM138A Gene Expression      FAM138A
ENSG00000186092 ENSG00000186092       OR4F5 Gene Expression        OR4F5
ENSG00000238009 ENSG00000238009  AL627309.1 Gene Expression LOC100996442
ENSG00000239945 ENSG00000239945  AL627309.3 Gene Expression           NA
...                         ...         ...             ...          ...
ENSG00000277856 ENSG00000277856  AC233755.2 Gene Expression           NA
ENSG00000275063 ENSG00000275063  AC233755.1 Gene Expression LOC102723407
ENSG00000271254 ENSG00000271254  AC240274.1 Gene Expression LOC102724250
ENSG00000277475 ENSG00000277475  AC213203.1 Gene Expression           NA
ENSG00000268674 ENSG00000268674     FAM231C Gene Expression           NA

# Rename rownames:
# combine Ensembl ID and gene symbol and set as row names of the data object. 
rownames(sce) = paste0(rowData(sce)$ENSEMBL_ID, "_", 
                        rowData(sce)$Symbol_TENx)

# Name the colnames: 
colnames(sce) = sce$Sequence

Assays

Accessing the assay data, row and column annotations is done in the same way as for SummarizedExperiment objects. There is also an additional counts() accessor for the counts assay.

# All assays present in the sce object: 
assays(sce)

List of length 1
names(1): counts

# Access sce's assays: 
counts_method1 = assays(sce)[['counts']]
counts_method1[1:5,1:3]

<5 x 3> sparse DelayedMatrix object of type "integer":
                            AAACCCAAGAGACAAG AAACCCAAGGCCTAGA
ENSG00000243485_MIR1302-2HG                0                0
ENSG00000237613_FAM138A                    0                0
ENSG00000186092_OR4F5                      0                0
ENSG00000238009_AL627309.1                 0                0
ENSG00000239945_AL627309.3                 0                0
                            AAACCCAGTCGTGCCA
ENSG00000243485_MIR1302-2HG                0
ENSG00000237613_FAM138A                    0
ENSG00000186092_OR4F5                      0
ENSG00000238009_AL627309.1                 0
ENSG00000239945_AL627309.3                 0

counts_method2 = assay(sce,'counts')
counts_method2[1:5,1:3]

<5 x 3> sparse DelayedMatrix object of type "integer":
                            AAACCCAAGAGACAAG AAACCCAAGGCCTAGA
ENSG00000243485_MIR1302-2HG                0                0
ENSG00000237613_FAM138A                    0                0
ENSG00000186092_OR4F5                      0                0
ENSG00000238009_AL627309.1                 0                0
ENSG00000239945_AL627309.3                 0                0
                            AAACCCAGTCGTGCCA
ENSG00000243485_MIR1302-2HG                0
ENSG00000237613_FAM138A                    0
ENSG00000186092_OR4F5                      0
ENSG00000238009_AL627309.1                 0
ENSG00000239945_AL627309.3                 0

counts_method3 = counts(sce)
counts_method3[1:5,1:3]

<5 x 3> sparse DelayedMatrix object of type "integer":
                            AAACCCAAGAGACAAG AAACCCAAGGCCTAGA
ENSG00000243485_MIR1302-2HG                0                0
ENSG00000237613_FAM138A                    0                0
ENSG00000186092_OR4F5                      0                0
ENSG00000238009_AL627309.1                 0                0
ENSG00000239945_AL627309.3                 0                0
                            AAACCCAGTCGTGCCA
ENSG00000243485_MIR1302-2HG                0
ENSG00000237613_FAM138A                    0
ENSG00000186092_OR4F5                      0
ENSG00000238009_AL627309.1                 0
ENSG00000239945_AL627309.3                 0

Sparse matrix

While the structure of the scRNAseq data is similar to that of the bulk data, there are also important differences that affect the downstream analysis. One of these differences is that single-cell data is much more sparse; in other words, there are many more zeros in the count matrix from a single-cell experiment than from a bulk experiment.

counts(sce)[1:10,1:3]

<10 x 3> sparse DelayedMatrix object of type "integer":
                            AAACCCAAGAGACAAG AAACCCAAGGCCTAGA
ENSG00000243485_MIR1302-2HG                0                0
ENSG00000237613_FAM138A                    0                0
ENSG00000186092_OR4F5                      0                0
ENSG00000238009_AL627309.1                 0                0
ENSG00000239945_AL627309.3                 0                0
ENSG00000239906_AL627309.2                 0                0
ENSG00000241599_AL627309.4                 0                0
ENSG00000236601_AL732372.1                 0                0
ENSG00000284733_OR4F29                     0                0
ENSG00000235146_AC114498.1                 0                0
                            AAACCCAGTCGTGCCA
ENSG00000243485_MIR1302-2HG                0
ENSG00000237613_FAM138A                    0
ENSG00000186092_OR4F5                      0
ENSG00000238009_AL627309.1                 0
ENSG00000239945_AL627309.3                 0
ENSG00000239906_AL627309.2                 0
ENSG00000241599_AL627309.4                 0
ENSG00000236601_AL732372.1                 0
ENSG00000284733_OR4F29                     0
ENSG00000235146_AC114498.1                 0

This is due to things such as:

the much lower sequencing depth per cell/sample (especially for droplet-based protocols)
not every individual cell expressing each gene
a failure of the library preparation to capture many of the expressed transcript molecules

Let’s check the fraction of zeros in our count matrix:

# fraction of 0 per barcode/cell:
frac_0_per_cell = colMeans(counts(sce) == 0)

# fraction of 0 for all cells: 
mean(counts(sce) == 0) # On average, the faction of 0 in a cell is of 94%

[1] 0.9473629

We also calculate the range of library sizes, noting that these are much smaller than the typical values for bulk RNA-seq:

# Library size (=total amount of reads per cell/barcode):
head(colSums(counts(sce)))

AAACCCAAGAGACAAG AAACCCAAGGCCTAGA AAACCCAGTCGTGCCA AAACCCATCGTGCATA 
            7375             3772             4902             6704 
AAACGAAAGACAAGCC AAACGAAAGAGTGACC 
            3900            11781

# Summary of library size per cells:
summary(colSums(counts(sce)))

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    501    3221    5506    6142    7686   53587

The number of cells in a scRNA-seq data set is typically much (several orders of magnitude) larger than the number of samples in a bulk RNA-seq experiment. Hence, the count matrices can get very large. However, since most of the values are zero, efficient storage modes, where only the non-zero values and the corresponding matrix positions are stored, can be employed. We can make sure that the count matrix in our object is indeed such a sparse matrix (in this particular data set, it is actually provided as a DelayedMatrix, which is beyond the scope of this course to discuss in detail, but which can be suitable for very large data matrices that do not fit in memory).

class(counts(sce))

[1] "DelayedMatrix"
attr(,"package")
[1] "DelayedArray"

counts(sce) = as(counts(sce), "dgCMatrix")
class(counts(sce))

[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"

counts(sce)[1:10, 1:10]

10 x 10 sparse Matrix of class "dgCMatrix"

  [[ suppressing 10 column names 'AAACCCAAGAGACAAG', 'AAACCCAAGGCCTAGA', 'AAACCCAGTCGTGCCA' ... ]]

                                               
ENSG00000243485_MIR1302-2HG . . . . . . . . . .
ENSG00000237613_FAM138A     . . . . . . . . . .
ENSG00000186092_OR4F5       . . . . . . . . . .
ENSG00000238009_AL627309.1  . . . . . . . . . .
ENSG00000239945_AL627309.3  . . . . . . . . . .
ENSG00000239906_AL627309.2  . . . . . . . . . .
ENSG00000241599_AL627309.4  . . . . . . . . . .
ENSG00000236601_AL732372.1  . . . . . . . . . .
ENSG00000284733_OR4F29      . . . . . . . . . .
ENSG00000235146_AC114498.1  . . . . . . . . . .

Properties of scRNA-seq data

We already noted a couple of differences between scRNA-seq data and bulk data:

The former typically contains many more observations
The count matrix is much more sparse.
The low amount of starting material for scRNA-seq experiments also results in a high sampling noise
Lower correlation among cells than among bulk RNA-seq samples.

There are many genes that are not observed in one of the cells, but still have a high number of assigned counts in the other cell. This indicates that the gene detection is to some extent random, as expected by the sampling process. In other words, it is not always the same genes that go undetected in all cells.

While we don’t observe each gene in each cell, there is still a clear association between the overall expression level of a gene (e.g., total count) and the fraction of cells where it is detected.

plot(rowSums(counts(sce)), 1-rowMeans(counts(sce) == 0), log = "x",xlab='total count per gene',ylab='fraction of cell in which the gene is detected')

Warning in xy.coords(x, y, xlabel, ylabel, log): 12698 x values <= 0
omitted from logarithmic plot

Quality control (QC)

Low-quality libraries can rise from a variety of reason (e.g. cell damage during dissociation, failure in library preparation)
They are problematic as they can contribute to misleading results in downstream analyses (form their own distinct cluster, interfere with characterization of population heterogeneity, aggressive scaling factors,…)
Low quality cells usually have low total counts, few expressed genes and high mitochondrial proportion.
To calculate these summary QC statistics, we use the Bioconductor package scater. For both for the cells and for the genes.
By providing a subset of genes (here, mitochondrial genes), we can calculate the fraction of counts falling in these genes.

suppressPackageStartupMessages({library(scater)})

# QC statistics per gene: 
sce = addPerFeatureQCMetrics(sce)

# Two new colunms (mean and detected) in rowData matrix: 
rowData(sce)

DataFrame with 33538 rows and 6 columns
                                 ENSEMBL_ID Symbol_TENx            Type
                                <character> <character>     <character>
ENSG00000243485_MIR1302-2HG ENSG00000243485 MIR1302-2HG Gene Expression
ENSG00000237613_FAM138A     ENSG00000237613     FAM138A Gene Expression
ENSG00000186092_OR4F5       ENSG00000186092       OR4F5 Gene Expression
ENSG00000238009_AL627309.1  ENSG00000238009  AL627309.1 Gene Expression
ENSG00000239945_AL627309.3  ENSG00000239945  AL627309.3 Gene Expression
...                                     ...         ...             ...
ENSG00000277856_AC233755.2  ENSG00000277856  AC233755.2 Gene Expression
ENSG00000275063_AC233755.1  ENSG00000275063  AC233755.1 Gene Expression
ENSG00000271254_AC240274.1  ENSG00000271254  AC240274.1 Gene Expression
ENSG00000277475_AC213203.1  ENSG00000277475  AC213203.1 Gene Expression
ENSG00000268674_FAM231C     ENSG00000268674     FAM231C Gene Expression
                                  Symbol        mean  detected
                             <character>   <numeric> <numeric>
ENSG00000243485_MIR1302-2HG           NA   0.0000000   0.00000
ENSG00000237613_FAM138A          FAM138A   0.0000000   0.00000
ENSG00000186092_OR4F5              OR4F5   0.0000000   0.00000
ENSG00000238009_AL627309.1  LOC100996442   0.0051458   0.51458
ENSG00000239945_AL627309.3            NA   0.0000000   0.00000
...                                  ...         ...       ...
ENSG00000277856_AC233755.2            NA 0.000190585 0.0190585
ENSG00000275063_AC233755.1  LOC102723407 0.000381170 0.0381170
ENSG00000271254_AC240274.1  LOC102724250 0.011244521 1.0672765
ENSG00000277475_AC213203.1            NA 0.000000000 0.0000000
ENSG00000268674_FAM231C               NA 0.000000000 0.0000000

# Information per cells: stored in colData (not in rowData)
# 1. mean: mean expression of the gene (mean number of read across all cells) 
# 2. detected: number of cell where the gene is detected (number of cell where count>0)

## QC statistics per cell: 
# Information on subset of genes (target)
# Mitochrondrial gene ("MT" pattern within their symbol name):
mt = rownames(sce)[grep("^MT-", rowData(sce)$Symbol_TENx)]
mt # 13 genes

 [1] "ENSG00000198888_MT-ND1"  "ENSG00000198763_MT-ND2" 
 [3] "ENSG00000198804_MT-CO1"  "ENSG00000198712_MT-CO2" 
 [5] "ENSG00000228253_MT-ATP8" "ENSG00000198899_MT-ATP6"
 [7] "ENSG00000198938_MT-CO3"  "ENSG00000198840_MT-ND3" 
 [9] "ENSG00000212907_MT-ND4L" "ENSG00000198886_MT-ND4" 
[11] "ENSG00000198786_MT-ND5"  "ENSG00000198695_MT-ND6" 
[13] "ENSG00000198727_MT-CYB"

# For each cell, general statistics and statistics about the genes in the targeted group: 
sce = addPerCellQCMetrics(sce, subsets = list(MT = mt))


# Six new colunms in colData matrix: 
colData(sce)

DataFrame with 5247 rows and 20 columns
                         Sample            Barcode         Sequence
                    <character>        <character>      <character>
AAACCCAAGAGACAAG pbmc5k-CITEseq AAACCCAAGAGACAAG-1 AAACCCAAGAGACAAG
AAACCCAAGGCCTAGA pbmc5k-CITEseq AAACCCAAGGCCTAGA-1 AAACCCAAGGCCTAGA
AAACCCAGTCGTGCCA pbmc5k-CITEseq AAACCCAGTCGTGCCA-1 AAACCCAGTCGTGCCA
AAACCCATCGTGCATA pbmc5k-CITEseq AAACCCATCGTGCATA-1 AAACCCATCGTGCATA
AAACGAAAGACAAGCC pbmc5k-CITEseq AAACGAAAGACAAGCC-1 AAACGAAAGACAAGCC
...                         ...                ...              ...
TTTGTTGCAGCACAAG pbmc5k-CITEseq TTTGTTGCAGCACAAG-1 TTTGTTGCAGCACAAG
TTTGTTGCAGTCTTCC pbmc5k-CITEseq TTTGTTGCAGTCTTCC-1 TTTGTTGCAGTCTTCC
TTTGTTGCATGGCCCA pbmc5k-CITEseq TTTGTTGCATGGCCCA-1 TTTGTTGCATGGCCCA
TTTGTTGCATTGCCGG pbmc5k-CITEseq TTTGTTGCATTGCCGG-1 TTTGTTGCATTGCCGG
TTTGTTGGTCCGGCAT pbmc5k-CITEseq TTTGTTGGTCCGGCAT-1 TTTGTTGGTCCGGCAT
                   Library Cell_ranger_version Tissue_status Barcode_type
                 <integer>         <character>   <character>  <character>
AAACCCAAGAGACAAG         1              v3.0.2            NA     Chromium
AAACCCAAGGCCTAGA         1              v3.0.2            NA     Chromium
AAACCCAGTCGTGCCA         1              v3.0.2            NA     Chromium
AAACCCATCGTGCATA         1              v3.0.2            NA     Chromium
AAACGAAAGACAAGCC         1              v3.0.2            NA     Chromium
...                    ...                 ...           ...          ...
TTTGTTGCAGCACAAG         1              v3.0.2            NA     Chromium
TTTGTTGCAGTCTTCC         1              v3.0.2            NA     Chromium
TTTGTTGCATGGCCCA         1              v3.0.2            NA     Chromium
TTTGTTGCATTGCCGG         1              v3.0.2            NA     Chromium
TTTGTTGGTCCGGCAT         1              v3.0.2            NA     Chromium
                   Chemistry Sequence_platform   Individual Date_published
                 <character>       <character>  <character>    <character>
AAACCCAAGAGACAAG Chromium_v3           NovaSeq HealthyDonor     2019-05-29
AAACCCAAGGCCTAGA Chromium_v3           NovaSeq HealthyDonor     2019-05-29
AAACCCAGTCGTGCCA Chromium_v3           NovaSeq HealthyDonor     2019-05-29
AAACCCATCGTGCATA Chromium_v3           NovaSeq HealthyDonor     2019-05-29
AAACGAAAGACAAGCC Chromium_v3           NovaSeq HealthyDonor     2019-05-29
...                      ...               ...          ...            ...
TTTGTTGCAGCACAAG Chromium_v3           NovaSeq HealthyDonor     2019-05-29
TTTGTTGCAGTCTTCC Chromium_v3           NovaSeq HealthyDonor     2019-05-29
TTTGTTGCATGGCCCA Chromium_v3           NovaSeq HealthyDonor     2019-05-29
TTTGTTGCATTGCCGG Chromium_v3           NovaSeq HealthyDonor     2019-05-29
TTTGTTGGTCCGGCAT Chromium_v3           NovaSeq HealthyDonor     2019-05-29
                       sum  detected subsets_MT_sum subsets_MT_detected
                 <numeric> <integer>      <numeric>           <integer>
AAACCCAAGAGACAAG      7375      2363            467                  11
AAACCCAAGGCCTAGA      3772      1259            343                  13
AAACCCAGTCGTGCCA      4902      1578            646                  11
AAACCCATCGTGCATA      6704      1908            426                  11
AAACGAAAGACAAGCC      3900      1589            363                  10
...                    ...       ...            ...                 ...
TTTGTTGCAGCACAAG      4754      1649            468                  12
TTTGTTGCAGTCTTCC      6373      1901            553                  11
TTTGTTGCATGGCCCA      1110       496            383                  11
TTTGTTGCATTGCCGG     12220      3443           1287                  12
TTTGTTGGTCCGGCAT      1932       651            752                  12
                 subsets_MT_percent altexps_Antibody Capture_sum
                          <numeric>                    <numeric>
AAACCCAAGAGACAAG            6.33220                         5178
AAACCCAAGGCCTAGA            9.09332                         2893
AAACCCAGTCGTGCCA           13.17829                         3635
AAACCCATCGTGCATA            6.35442                         3700
AAACGAAAGACAAGCC            9.30769                         3423
...                             ...                          ...
TTTGTTGCAGCACAAG            9.84434                         2262
TTTGTTGCAGTCTTCC            8.67723                         2886
TTTGTTGCATGGCCCA           34.50450                         1140
TTTGTTGCATTGCCGG           10.53191                         1919
TTTGTTGGTCCGGCAT           38.92340                         5634
                 altexps_Antibody Capture_detected
                                         <integer>
AAACCCAAGAGACAAG                                31
AAACCCAAGGCCTAGA                                29
AAACCCAGTCGTGCCA                                29
AAACCCATCGTGCATA                                31
AAACGAAAGACAAGCC                                28
...                                            ...
TTTGTTGCAGCACAAG                                31
TTTGTTGCAGTCTTCC                                30
TTTGTTGCATGGCCCA                                24
TTTGTTGCATTGCCGG                                28
TTTGTTGGTCCGGCAT                                31
                 altexps_Antibody Capture_percent     total
                                        <numeric> <numeric>
AAACCCAAGAGACAAG                          41.2491     12553
AAACCCAAGGCCTAGA                          43.4059      6665
AAACCCAGTCGTGCCA                          42.5794      8537
AAACCCATCGTGCATA                          35.5632     10404
AAACGAAAGACAAGCC                          46.7431      7323
...                                           ...       ...
TTTGTTGCAGCACAAG                          32.2406      7016
TTTGTTGCAGTCTTCC                          31.1697      9259
TTTGTTGCATGGCCCA                          50.6667      2250
TTTGTTGCATTGCCGG                          13.5724     14139
TTTGTTGGTCCGGCAT                          74.4647      7566

# Information per cells: stored in colData (not in rowData)
# 1. sum: library size of the cell
# 2. detected: number of genes detected in the cell
# 3. subsets_MT_sum: total number of reads on the targeted genes
# 4. subsets_MT_detected: number of targeted genes detected
# 5. subsets_MT_percent: percentage of reads mapped on mitochondrial genes
# - Additional information on altExp (see end of the lecture)

Library size and gene detected

we can plot the distribution of library sizes (sum) and the number of detected genes (detected) across the cells

par(mfrow=c(1,3))
hist(log10(sce$sum), breaks = 30,xlab='log10(sum of counts per cell)',main='distribution of library sizes')
hist(sce$detected, breaks = 30,xlab='number of genes detected per cell',main='distribution of detected genes')

Mitochrondrial genes

The proportion of counts assigned to mitochondrial genes is another useful indicator of cell quality, since high numbers of such reads can be associated to cell damage.

hist(sce$subsets_MT_percent, breaks = 30,main='Distribution of mitochondrial genes proportion',xlab='proportion of counts')

Filtering

Now that we have calculated the QC metrics, we will use it to filter out low-quality cells that will be excluded from the rest of the analysis.

The optimal parameters for filtering are debated and likely data set dependent. Here are two commonly used strategies:

Fixed threshold: For example, you might consider cells to be of low quality if their library sizes is below 100,000 reads; or if they express fewer than 5,000 genes; have mitochondrial proportions above 10%. However, while simple, this strategy requires substantial experience to determine appropriate thresholds for each experimental protocol and biological system.
Adaptive thresholds: A typical approach is to remove cells that fall ‘too far’ from the average cells on one or more of the considered criteria. This makes the implicit assumption that ‘most’ cells are of good quality, which is often sensible. It should also be noted that in some cases, cells that seem to be of bad quality can do so for biological reasons. For example, certain cell types express very few genes, or have a high metabolic rate and consequently express a lot of mitochondrial genes.

For exemple, here we exclude cells according to two criteria:

few detected genes
high fraction of mitochondrial genes

For each of these criteria, we exclude cells that are more than 4 median absolute deviations (MAD) from the median across cells, in the direction indicating low quality.

low_detected = isOutlier(sce$detected, type = "lower", 
                          log = TRUE, nmads = 4) # lower than 225.1728 genes

high_mt = isOutlier(sce$subsets_MT_percent, type = "higher",
                     log = FALSE, nmads = 4) # higher than 21.94558 % percent of read on mitochondrial genes

par(mfrow=c(1,2))
plot(rank(-sce$detected), sce$detected, col = low_detected + 1,ylab='number of gene detected',xlab='cell rank')
plot(rank(sce$subsets_MT_percent), sce$subsets_MT_percent,
     col = high_mt + 1,ylab=' proportion of counts assigned to mitochondrial genes',xlab='cell rank')

# To filter out the cells: 
sce$retain = !low_detected & !high_mt


# Number of cell retained and number of cell filtered out:
table(sce$retain)


FALSE  TRUE 
 1006  4241

# All cell with a low detected amount of genes have also a high mt percentage:
which(low_detected==TRUE)%in%which(high_mt==TRUE)

 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

# Dimension of sce object change:
dim(sce)

[1] 33538  5247

sce = sce[, sce$retain]
dim(sce) # 4241 cells left (1006 cell removed --> bad quality cells)

[1] 33538  4241

# look at the distribution of the remaining cells
par(mfrow=c(1,3))
hist(log10(sce$sum), breaks = 30,xlab='log10(sum of counts per cell)',main='distribution of library sizes')
hist(sce$detected, breaks = 30,xlab='number of genes detected per cell',main='distribution of detected genes')
hist(sce$subsets_MT_percent, breaks = 30,main='Distribution of mitochondrial genes proportion',xlab='proportion of counts')

Normalization

Just as for bulk RNA-seq, the raw scRNA-seq counts are not directly comparable across cells due to, e.g., differences in library sizes. This typically arise from technical differences due to the difficulty of achieving consistent library preparation with minimal starting material.

Thus, we need to apply a normalization strategy. There are many approaches to normalization of scRNA-seq data (see e.g. Lytal et al. 2020 and Cole et al. 2019 for comparisons). In this lecture we will focus on scaling normalization, which is the simplest and most commonly used class of normalization strategies. This involves dividing all counts for each cell by a cell-specific scaling factor, often called a “size factor”.

Here, we will use one scaling strategy implemented in the scran package. Similarly to the TMM and DESeq normalization approaches, this works by estimating a size factor for each cell, which incorporates the library size as well as a measure of the RNA composition . The removal of composition biases is a well-studied problem (e.g. bulk RNA-seq). It assumes that most genes are not DE between cells. Any systematic difference in count size across the non-DE majority of genes between two cells is assumed to represent bias that is used to compute an appropriate size factor for its removal.

However, the bulk RNA-seq methods are sometimes struggling with scRNA-seq due to the large number of zeros; the scran approach solves this by repeatedly pooling multiple cells (which reduces the number of zeros), calculating size factors for the pools, and deriving individual size factors for the cells from those of the pools. After calculating the size factors, we normalize the observed counts and log-transform the values. The new “log counts” are placed in a new assay in sce, named logcounts.

The deconvolution stategy

Simplification of a complex signal and consists of several key steps:

Defining a pool of cells
Summing expression values across all cells in the pool
Normalizing the cell pool against an average reference, using the summed expression value
Repeating this for many different pools of cells to construct a linear system
Deconvolving the pool-based size factors to their cell-based counterparts (Fig. 3)

# Normalization by deconvolution: Scaling normalization by deconvolving size factors from cell pools.
suppressPackageStartupMessages({library(scran)})



# quickCluster(): pre-clustering step where cells in each cluster are normalized separately. 
# The size factors are rescaled to be comparable across clusters.  Avoid assumption that most genes are non-DE across the entire population. 
set.seed(100)
clust = quickCluster(sce) 
sce = computeSumFactors(sce, cluster=clust, min.mean=0.1) # min.mean: minimum average count of genes to be used for normalization. 

# New sizeFactor column: 
colnames(colData(sce))

 [1] "Sample"                           
 [2] "Barcode"                          
 [3] "Sequence"                         
 [4] "Library"                          
 [5] "Cell_ranger_version"              
 [6] "Tissue_status"                    
 [7] "Barcode_type"                     
 [8] "Chemistry"                        
 [9] "Sequence_platform"                
[10] "Individual"                       
[11] "Date_published"                   
[12] "sum"                              
[13] "detected"                         
[14] "subsets_MT_sum"                   
[15] "subsets_MT_detected"              
[16] "subsets_MT_percent"               
[17] "altexps_Antibody Capture_sum"     
[18] "altexps_Antibody Capture_detected"
[19] "altexps_Antibody Capture_percent" 
[20] "total"                            
[21] "retain"                           
[22] "sizeFactor"

# The size factor for each cell represents the estimate of the relative bias in that cell, so division of its counts by its size factor should remove that biais
# To access the size factor: 
sizeFactors(sce)[1:10]

 [1] 1.0697706 0.4570165 0.6278514 0.8410376 0.5639728 1.7096026 0.5648314
 [8] 1.3857665 1.4768468 0.5850588

# Compute log-transformed normalized expression values (using size factor): 
# divides the count for each gene with the appropriate size factor for that cell
sce = logNormCounts(sce)
assayNames(sce) # new assays logcounts

[1] "counts"    "logcounts"

Feature selection

scRNA-seq is wiedly used to characterize heterogeneity across cells. To compare cells based on their gene expression profiles, we can use procedures like clustering and dimensionality reduction which involves aggregating per-gene differences into a single (dis)similarity metric. The choice of genes to use in this calculation has a crutial impact on the behavior of the metric.

In pratical, we want to identify genes that contain useful information about the biology of the system while removing genes that contain random noise.

Mean-variance relationship

Variation in gene abundance estimates between different cells can be thought of as the convolution of the technical (mainly sampling) and the biological (e.g cell type) sources of variance. Typically one wants to isolate and focus on the biological variance so that differences due to experimental noise (technical noise) have as small an impact as possible on subsequent analyses.

There are different approaches to disentangling the technical and biological variability in a single-cell data set:

1. The “most” genes exhibit only technical variability (i.e., most genes are not differentially expressed between different studied cell types).

2. The technical variance follows a Poisson distribution (a common distribution capturing sampling variability in count data), and that deviations from the Poisson expectation corresponds to biological variability.

Quantifying per-gene variation

The simplest approach to quantifying per-gene variation is to compute the variance of the log-normalized expression values (i.e., “log-counts” ) for each gene across all cells. This requires modelling of the mean-variance relationship beacause the log-transformation is often not a variance stabilizing transformation (total variance of a gene is driven more by its abundance than its underlying biological heterogeneity). To account for this effect, we fit a trend to the variance with respect to abundance across all genes

# Fit a trend to the variance with respect to abundance across all genes:
dec.trend = modelGeneVar(sce)

# Visualizing the fit:
fit.trend = metadata(dec.trend)
par(mfrow=c(1,2))
plot(fit.trend$mean, fit.trend$var, xlab = "Mean of log-expression",
     ylab = "Variance of log-expression")
curve(fit.trend$trend(x), col = "dodgerblue", add = TRUE, lwd = 2)

Quantifying technical noise

However, the previous assumption may be problematic in scenarios where many genes at a particular abundance are affected by a biological process. We can avoid this problem by creating a trend by making some distributional assumptions about the noise. UMI counts typically exhibit near-Poisson variation and can be used to construct a mean-variance trend

## Poisson assumption: Makes some distributional assumptions about the technial noise. 
# random seed because of stochastic nature of Poisson modeling
set.seed(123)
dec.pois = modelGeneVarByPoisson(sce)
plot(dec.pois$mean, dec.pois$total, xlab = "Mean of log-expression",
     ylab = "Variance of log-expression")
curve(metadata(dec.pois)$trend(x), col = "dodgerblue", add = TRUE)

In each case, the biological variance of a gene can be estimated as the difference between the total variance and the modeled technical variance. This value can be used to rank the genes, and to select a set of “highly variable” genes to use as the basis of further downstream analysis. Here, we select the top 1,000 genes based on the trend fit, and add these to the metadata of the SingleCellExperiment object.

head(dec.trend[order(dec.trend$bio, decreasing = TRUE), ])

DataFrame with 6 rows and 6 columns
                              mean     total      tech       bio
                         <numeric> <numeric> <numeric> <numeric>
ENSG00000090382_LYZ        2.56480   9.27648  0.987912   8.28857
ENSG00000163220_S100A9     2.13558   8.15133  0.993180   7.15815
ENSG00000143546_S100A8     1.54234   5.47531  0.953660   4.52165
ENSG00000204287_HLA-DRA    1.97873   4.83492  0.990328   3.84459
ENSG00000101439_CST3       1.55593   4.12547  0.955645   3.16982
ENSG00000196126_HLA-DRB1   1.87228   4.04842  0.985001   3.06342
                              p.value          FDR
                            <numeric>    <numeric>
ENSG00000090382_LYZ      3.43209e-221 7.08075e-217
ENSG00000163220_S100A9   7.54485e-164 7.78289e-160
ENSG00000143546_S100A8    3.50916e-72  2.41325e-68
ENSG00000204287_HLA-DRA   4.34061e-49  1.49252e-45
ENSG00000101439_CST3      2.18125e-36  4.50015e-33
ENSG00000196126_HLA-DRB1  3.12392e-32  4.60354e-29

head(dec.pois[order(dec.pois$bio, decreasing = TRUE), ])

DataFrame with 6 rows and 6 columns
                              mean     total      tech       bio   p.value
                         <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000090382_LYZ        2.56480   9.27648  0.553318   8.72317         0
ENSG00000163220_S100A9     2.13558   8.15133  0.674412   7.47692         0
ENSG00000143546_S100A8     1.54234   5.47531  0.787616   4.68769         0
ENSG00000204287_HLA-DRA    1.97873   4.83492  0.715043   4.11987         0
ENSG00000101439_CST3       1.55593   4.12547  0.786727   3.33874         0
ENSG00000196126_HLA-DRB1   1.87228   4.04842  0.739656   3.30876         0
                               FDR
                         <numeric>
ENSG00000090382_LYZ              0
ENSG00000163220_S100A9           0
ENSG00000143546_S100A8           0
ENSG00000204287_HLA-DRA          0
ENSG00000101439_CST3             0
ENSG00000196126_HLA-DRB1         0

modelGeneVar_top1000 = rownames(dec.trend[order(dec.trend$bio, decreasing = TRUE), ])[1:1000]
modelGeneVarByPoisson_top1000 = rownames(dec.pois[order(dec.pois$bio, decreasing = TRUE), ])[1:1000]
# majority of genes found in both approach. 
table(modelGeneVar_top1000%in%modelGeneVarByPoisson_top1000)


FALSE  TRUE 
  192   808

hvg.var = getTopHVGs(dec.pois, n = 1000)
head(hvg.var)

[1] "ENSG00000090382_LYZ"      "ENSG00000163220_S100A9"  
[3] "ENSG00000143546_S100A8"   "ENSG00000204287_HLA-DRA" 
[5] "ENSG00000101439_CST3"     "ENSG00000196126_HLA-DRB1"

# Save hvgs into metadata (does not need to have the same mount of row/col as for rowData or colData)
metadata(sce)$hvgs = hvg.var

Data visualization

For bulk RNA-seq data, we typically use PCA for visualization and exploratory analysis. This can of course also be applied to single-cell RNA-seq data. However, other methods (e.g., tSNE and UMAP) are more commonly used for scRNA-seq. Both tSNE and UMAP are non-linear dimension reduction methods, which focus to a larger extent on retaining small distances. That means that cells that are similar to each other (e.g., cells of the same cell type) tend to be placed close together in the low-dimensional representation, whereas larger distances are potentially less faithfully represented. PCA, on the other hand, tend to put more focus on preserving the large cell-to-cell distances. Importantly, even though PCA is rarely used for visualization of large scRNA-seq data sets in two-dimensional plots, it is often used as a first dimension reduction step, and other approaches such as tSNE and UMAP are subsequently applied to the PCA output.

Here, we therefore first apply PCA to our data set. We supply the set of highly variable genes derived above, to only calculate the principal components based on these. By default, the runPCA function from the scater package will apply the PCA to the ‘logcounts’ assay. We plot the first two principal components (note that we extract 50 components, which will be used as the basis for the tSNE later) and color by the different colData entries.

# PCA: Principal component analysis is a technique for reducing the dimensionality of complex datasets, increasing interpretability but at the same time minimizing information loss
sce = runPCA(sce, exprs_values = "logcounts", ncomponents = 50, 
              subset_row = metadata(sce)$hvgs)


# New reducedDimNames entry:
reducedDimNames(sce)

[1] "PCA"

## PCA ploting:
# coloration according to colData:
plotReducedDim(sce, "PCA", colour_by = "detected")

plotReducedDim(sce, "PCA", colour_by = "sum")

plotReducedDim(sce, "PCA", colour_by = "subsets_MT_sum")

Next, we apply tsne, which is a non-linear, stochastic dimension reduction technique that attempts to find a mapping of the data on a low subspace while preserving local distances between cells. The non-linear character of tSNE means that it will often produce projections that better resolve differences between cell groups. The better separation of tSNE comes at the cost of interpretability; while in a tSNE representation similar cells are placed close to each other, longer distances in the representation are not guaranteed to reflect true relationships. This means that it is risky to draw conclusions of “similarity” or “dissimilarity” from the positional relationships of different cell groupings that appear in a tSNE plot. In addition, the stochastic nature of tSNE means that every time the algorithm is applied a different representation will be produced unlesss a random seed is set.

# random seed because of stochastic nature of tSNE 
set.seed(123)
sce = runTSNE(sce, dimred = "PCA")  # use 50 PCs of PCA
reducedDimNames(sce)

[1] "PCA"  "TSNE"

plotReducedDim(sce, "TSNE", colour_by = "detected")

plotReducedDim(sce, "TSNE", colour_by = "sum")

plotReducedDim(sce, "TSNE", colour_by = "subsets_MT_sum")

We can often identify specific cell types and gain some understanding of the data directly from the visualization (other types of cell type identification, e.g. via cell type clustering, will be discussed in the next lecture). Typically, this would be done by colouring the cells according to the expression level of known marker genes for certain cell types.

## Color by expression of a B-cell marker (CD20)
plotReducedDim(sce, "TSNE", colour_by = "ENSG00000156738_MS4A1")

## Color by expression of a T-cell marker
plotReducedDim(sce, "TSNE", colour_by = "ENSG00000167286_CD3D")

plotReducedDim(sce, "TSNE", colour_by = "ENSG00000126353_CCR7")

## Color by expression of a Monocyte marker
plotReducedDim(sce, "TSNE", colour_by = "ENSG00000090382_LYZ")

## Color by expression of a platelet marker
plotReducedDim(sce, "TSNE", colour_by = "ENSG00000163736_PPBP")

Cell type annotation

To better identify specific cell type we can also use computational approaches that exploit prior information.

Here we will directly compare our expression profiles to published reference datasets where each cell has already been annotated with its putative biological state by domain experts (https://rdrr.io/github/LTLA/celldex/man/DatabaseImmuneCellExpressionData.html).

Assigning cell labels from reference data

Compare the single-cell expression profiles with previously annotated reference datasets.
Labels can then be assigned to each cell in our uncharacterized test dataset based on the most similar reference sample/cell.
Any published and labelled RNA-seq dataset (bulk or single-cell) can be used as a reference. However, its reliability highly depends on the expertise of the original authors assigning the labels in the first place.

# libraries 
## BiocManager::install("celldex")
suppressPackageStartupMessages({
library('celldex')
library('SingleR')                          
  })



# Reference annotation (https://dice-database.org)
ref.dice = celldex::DatabaseImmuneCellExpressionData()

see ?celldex and browseVignettes('celldex') for documentation

loading from cache

see ?celldex and browseVignettes('celldex') for documentation

loading from cache

colData(ref.dice) # two labels: label.main and label.fine

DataFrame with 1561 rows and 3 columns
           label.main             label.fine   label.ont
          <character>            <character> <character>
TPM_1         B cells         B cells, naive  CL:0000788
TPM_2         B cells         B cells, naive  CL:0000788
TPM_3         B cells         B cells, naive  CL:0000788
TPM_4         B cells         B cells, naive  CL:0000788
TPM_5         B cells         B cells, naive  CL:0000788
...               ...                    ...         ...
TPM_98  T cells, CD8+ T cells, CD8+, naive..  CL:0000906
TPM_99  T cells, CD8+ T cells, CD8+, naive..  CL:0000906
TPM_100 T cells, CD8+ T cells, CD8+, naive..  CL:0000906
TPM_101 T cells, CD8+ T cells, CD8+, naive..  CL:0000906
TPM_102 T cells, CD8+ T cells, CD8+, naive..  CL:0000906

# Make sure both object have the same rownames (ENSEMBL symbol in this case)
rownames(sce) = rowData(sce)$Symbol_TENx


## Prediction using main label
pred = SingleR(test = sce, ref = ref.dice, labels = ref.dice$label.main)


# Add prediction to the sce object
sce$main.annotation.dice = pred$pruned.labels
# Plot on TSNE
plotReducedDim(sce, "TSNE", colour_by = as.data.frame(sce$main.annotation.dice))

Side note

In this lesson we have used Bioconductor packages for the analysis, and represented the data in a SingleCellExperiment container. Just as for bulk RNA-seq, there are also other containers that are often used for single-cell data. The most common alternative is the Seurat object, which is the basis for the workflow provided by the Seurat package. This package provides to a large extent similar capabilities as the Bioconductor packages we have seen in this lecture, and can be used as an alternative. The webpage contains a collection of tutorials (including one for the same data set that we studied here).

Extend to the classical scRNA-seq

Initially developped to measure gene expression at a single cell level, scRNA-seq approaches have rapidely evolved to allow simultaneous profiling of multiple additional features.

In more simple term: transcriptome + another layer of information from the same cell.

Here are a some examples of multimodal approches becoming more and more commonly used in the scRNA-seq field:

Cellular Indexing of Transcriptomes and Epitopes (CITE)-seq: Method that allows to label cell-surface proteins with uniquely barcoded antibodies. The proteins are then captured and quantified along with the RNA molecules in single-cell RNA sequencing. Cell-surface proteins are often useful markers for cell identity. For more detail, see https://www.nature.com/articles/nmeth.4380

(Image credit: Schematic figure of (A) the CITE-seq antibody linked with the barcoded oligo (B) the CITE-seq protocol. Credit: Winston & Gregory Timp (2020), license: CC-BY 4.0)

Cell Hashing: Uses barcoded antibodies to place a “sample tag” on each single cell, enabling different samples to be multiplexed together and run in a single experiment. Multiplexing is commonly used to minimize batch effects which can mask the biological signal. For more detail, see https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1603-1

(Image credit: cell-hashing in www.protocols.io )

Single Cell Immune Profiling (scTCR/scBCR-seq): Comprehensive approach to simultaneously examine cellular heterogeneity of the immune system, T- and B-cell repertoire diversity, and antigen specificity at single cell resolution. Method used for the determination of paired, full-length receptor sequences for T or B cells, and exploration of antigen specificity.

As a more genereal term, multi-omic single cell sequencing groups all layers of information (eg. Chromatin accessibility, DNA methylation,…) that can be collected together with the transcriptome of a cell.

Exercise

E1 - Exploration

A. Explore the SingleCellExperiment object:

Load the “pbmc6k” dataset from the TENxPBMCData
Dimension : Number of genes and number of cells in matrix
Colnames: What are the column names of the matrix ?
Rownames: What are the row names of the matrix ?
colData: What info is stored in the colData (cell’s metadata )
rowData: What info is stored in the rowData (gene’s metadata)

Solution 1A

# load the data: 
sce = TENxPBMCData("pbmc6k")

see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation

loading from cache

# Dimension of SingleCellExperiment: 
dim(sce) # genes X cells (barcodes)

[1] 32738  5419

# colnames SingleCellExperiment:
head(colnames(sce)) # No colnames yet

NULL

# rownames SingleCellExperiment:
head(rownames(sce)) # represent the mouse Ensembl gene ID (http://www.ensembl.org/Mus_musculus/Info/Index)

[1] "ENSG00000243485" "ENSG00000237613" "ENSG00000186092" "ENSG00000238009"
[5] "ENSG00000239945" "ENSG00000237683"

# SingleCellExperiment colData info:
colData(sce) # sample (=cells) information (Barcode,library, Tissue_status,... )

DataFrame with 5419 rows and 11 columns
          Sample          Barcode       Sequence   Library
     <character>      <character>    <character> <integer>
1         pbmc6k AAACATACAACCAC-1 AAACATACAACCAC         1
2         pbmc6k AAACATACACCAGT-1 AAACATACACCAGT         1
3         pbmc6k AAACATACCCGTAA-1 AAACATACCCGTAA         1
4         pbmc6k AAACATACCTAAGC-1 AAACATACCTAAGC         1
5         pbmc6k AAACATACCTTCCG-1 AAACATACCTTCCG         1
...          ...              ...            ...       ...
5415      pbmc6k TTTGCATGCACTAG-1 TTTGCATGCACTAG         1
5416      pbmc6k TTTGCATGCCGAAT-1 TTTGCATGCCGAAT         1
5417      pbmc6k TTTGCATGGAGGTG-1 TTTGCATGGAGGTG         1
5418      pbmc6k TTTGCATGGATAAG-1 TTTGCATGGATAAG         1
5419      pbmc6k TTTGCATGGGCATT-1 TTTGCATGGGCATT         1
     Cell_ranger_version Tissue_status Barcode_type   Chemistry
             <character>   <character>  <character> <character>
1                 v1.1.0            NA      GemCode Chromium_v1
2                 v1.1.0            NA      GemCode Chromium_v1
3                 v1.1.0            NA      GemCode Chromium_v1
4                 v1.1.0            NA      GemCode Chromium_v1
5                 v1.1.0            NA      GemCode Chromium_v1
...                  ...           ...          ...         ...
5415              v1.1.0            NA      GemCode Chromium_v1
5416              v1.1.0            NA      GemCode Chromium_v1
5417              v1.1.0            NA      GemCode Chromium_v1
5418              v1.1.0            NA      GemCode Chromium_v1
5419              v1.1.0            NA      GemCode Chromium_v1
     Sequence_platform    Individual Date_published
           <character>   <character>    <character>
1           NextSeq500 HealthyDonor2     2016-07-31
2           NextSeq500 HealthyDonor2     2016-07-31
3           NextSeq500 HealthyDonor2     2016-07-31
4           NextSeq500 HealthyDonor2     2016-07-31
5           NextSeq500 HealthyDonor2     2016-07-31
...                ...           ...            ...
5415        NextSeq500 HealthyDonor2     2016-07-31
5416        NextSeq500 HealthyDonor2     2016-07-31
5417        NextSeq500 HealthyDonor2     2016-07-31
5418        NextSeq500 HealthyDonor2     2016-07-31
5419        NextSeq500 HealthyDonor2     2016-07-31

# SingleCellExperiment rowData info:
rowData(sce) # gene information (ENSEMBL_ID, Symbol_TENx, ...)

DataFrame with 32738 rows and 3 columns
                     ENSEMBL_ID  Symbol_TENx       Symbol
                    <character>  <character>  <character>
ENSG00000243485 ENSG00000243485   MIR1302-10           NA
ENSG00000237613 ENSG00000237613      FAM138A      FAM138A
ENSG00000186092 ENSG00000186092        OR4F5        OR4F5
ENSG00000238009 ENSG00000238009 RP11-34P13.7 LOC100996442
ENSG00000239945 ENSG00000239945 RP11-34P13.8           NA
...                         ...          ...          ...
ENSG00000215635 ENSG00000215635   AC145205.1           NA
ENSG00000268590 ENSG00000268590        BAGE5           NA
ENSG00000251180 ENSG00000251180   CU459201.1           NA
ENSG00000215616 ENSG00000215616   AC002321.2           NA
ENSG00000215611 ENSG00000215611   AC002321.1           NA

B. Modify rownames: set the row names to gene identifier (= gene ENSEMBL_ID + Symbol_TENx)

Solution 1B

# combine Ensembl ID and gene symbol and set as row names of the data object. 
rownames(sce) = paste0(rowData(sce)$ENSEMBL_ID, "_", 
                        rowData(sce)$Symbol_TENx)

C. The loaded object has no column names. Set the column names to the cell barcode:

First check that the cell barcodes are all unique.
What would you do if they were not unique ?
Why might that happen?

Solution 1C

all(gsub("-1", "", sce$Barcode) == sce$Sequence)

[1] TRUE

any(duplicated(sce$Sequence)) # No duplicate because there is only one sample (Library) present.

[1] FALSE

# if they would be duplicate, we would need to combine the sequence barcode with the library (incase of multiple sample: e.g AAACATACAACCAC-1 for sample 1 and AAACATACAACCAC-2 for sample 2)
colnames(sce) = sce$Sequence

E2

Check that the counts() accessor returns the same matrix as you would get if you access the counts assay using assay() or assays().

Solution 2

identical(assay(sce, "counts"), counts(sce))

[1] TRUE

all(assay(sce, "counts") == counts(sce))   ## alternative

[1] TRUE

identical(assays(sce)[["counts"]], counts(sce))

[1] TRUE

E3 - Quality control

How many genes are detected in at least one cell? How many percent of genes are detected in at least one cell?
Compare it to the number of genes detected in each of the individual cells.
What does this tell you?

Solution 3

## Total number of detected genes
# sum of read per gene: 
# number of gene that have 0 over all cells:
sum(rowSums(counts(sce)) > 0) # 17220

[1] 17220

sum(rowSums(counts(sce)) > 0)/nrow(sce) # 0.5259943

[1] 0.5259943

## Genes detected in single cells
# Select only gene that have at least 1 read over all cells --> make the data frame smaller. 
# sum over the column (for each cells): number of gene expressed per cell
summary(colSums(counts(sce) > 0)) # very large variation of numbers --> This indicates that the gene detection is to some extent random, as expected by the sampling process. In other words, it is not always the same genes that go undetected in all cells.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   81.0   592.0   716.0   748.1   850.0  3185.0

E4

Provide per cell and per gene general QC
Provide per cell QC for a the mitochrondrial subset of genes
Would you expect an association between the total count and the number of detected genes in a cell? Test your prediction.

Solution 4

# General per gene QC:
sce = addPerFeatureQC(sce) # sum and detected for each gene

## # General per cell QC:
mt = rownames(sce)[grep("^MT-", rowData(sce)$Symbol_TENx)] 
sce = addPerCellQC(sce, subsets = list(MT = mt)) # sum and detected gene and targeted genes for each cell 

plot(sce$sum, sce$detected) # yes, the more you sequenced, the more you can detecte genes.

E5

Filter out the cells identified as being of low quality according to:

Too small library size (cells with bellow 1000 total reads)
Too small number of gene detected (outlier approach as shown before)
How many cells have a too small library size AND not enough gene detected ? What can you conclude from this ?
How many cells are left after removal of bad quality cells?

Solution 5

low_sum = sce$sum < 1000
low_detected = isOutlier(sce$detected, type = "lower", 
                          log = TRUE, nmads = 4) # minimum of 243 genes detected to be consider as good quality cells

table(low_sum == TRUE & low_detected==TRUE)


FALSE  TRUE 
 5413     6

# ALL 6 cells with not enough gene detected have also not enough reads to be considered as good quality cells. 
# As shown before, association between the total count and the number of detected genes
# But also: bad quality cells can be identified by multiple QC approach. 

sce$retain = !low_detected & !low_sum
table(sce$retain)


FALSE  TRUE 
  446  4973

sce = sce[, sce$retain]
dim(sce) # 4973 cells left

[1] 32738  4973

E6 - Normalization and data vizualisation

Use a tSNE to visualization the proportion of mitochondrial genes

Identify highly variable genes (n=1000)
Produce a PCA (Hint: you need logcount values)
Produce a tSNE (do not forget to set your seed)
Plot the tSNE accoding to the number of mitochondrial gene detected

Solution 6

# Use Poisson or per-gene variation:
dec.pois = modelGeneVarByPoisson(sce)
hvg.var = getTopHVGs(dec.pois, n = 1000)
metadata(sce)$hvgs = hvg.var

# Compute size factor: 

## Option 1: Using pre-clustering step (quickCluster()) where cells in each cluster are normalized separately and the size factors are rescaled to be comparable across clusters. This avoids the assumption that most genes are non-DE across the entire population - only a non-DE majority is required between pairs of clusters, which is a weaker assumption for highly heterogeneous populations.
set.seed(100)
clust = quickCluster(sce)
sce = computeSumFactors(sce,clust=clust, min.mean = 0.1) 

## Option2: No pre-clustering
sce = computeSumFactors(sce, min.mean = 0.1) 
# logcount for PCA: 
sce = logNormCounts(sce)

# PCA with 50 PCS and HVGs
sce = runPCA(sce, exprs_values = "logcounts", ncomponents = 50, 
              subset_row = metadata(sce)$hvgs)


set.seed(123)
# use 50 PCs of PCA: 
sce = runTSNE(sce, dimred = "PCA")
# plot according to the number of Mitochondrial gene detected
plotReducedDim(sce, "TSNE", colour_by = "subsets_MT_detected")

Session info

R version 4.3.3 (2024-02-29)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Big Sur 11.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] SingleR_2.4.1               celldex_1.12.0             
 [3] scran_1.30.2                scater_1.30.1              
 [5] ggplot2_3.5.0               scuttle_1.12.0             
 [7] TENxPBMCData_1.20.0         HDF5Array_1.30.1           
 [9] rhdf5_2.46.1                DelayedArray_0.28.0        
[11] SparseArray_1.2.4           S4Arrays_1.2.1             
[13] abind_1.4-5                 Matrix_1.6-5               
[15] SingleCellExperiment_1.24.0 SummarizedExperiment_1.32.0
[17] Biobase_2.62.0              GenomicRanges_1.54.1       
[19] GenomeInfoDb_1.38.8         IRanges_2.36.0             
[21] S4Vectors_0.40.2            BiocGenerics_0.48.1        
[23] MatrixGenerics_1.14.0       matrixStats_1.2.0          
[25] BiocStyle_2.30.0            png_0.1-8                  
[27] knitr_1.45                 

loaded via a namespace (and not attached):
  [1] rstudioapi_0.16.0             jsonlite_1.8.8               
  [3] magrittr_2.0.3                ggbeeswarm_0.7.2             
  [5] farver_2.1.1                  rmarkdown_2.26               
  [7] zlibbioc_1.48.2               vctrs_0.6.5                  
  [9] memoise_2.0.1                 DelayedMatrixStats_1.24.0    
 [11] RCurl_1.98-1.14               htmltools_0.5.8              
 [13] AnnotationHub_3.10.0          curl_5.2.1                   
 [15] BiocNeighbors_1.20.2          Rhdf5lib_1.24.2              
 [17] htmlwidgets_1.6.4             cachem_1.0.8                 
 [19] igraph_2.0.3                  mime_0.12                    
 [21] lifecycle_1.0.4               pkgconfig_2.0.3              
 [23] rsvd_1.0.5                    R6_2.5.1                     
 [25] fastmap_1.1.1                 GenomeInfoDbData_1.2.11      
 [27] shiny_1.8.1                   digest_0.6.35                
 [29] colorspace_2.1-0              AnnotationDbi_1.64.1         
 [31] dqrng_0.3.2                   irlba_2.3.5.1                
 [33] ExperimentHub_2.10.0          RSQLite_2.3.6                
 [35] beachmat_2.18.1               filelock_1.0.3               
 [37] labeling_0.4.3                fansi_1.0.6                  
 [39] httr_1.4.7                    compiler_4.3.3               
 [41] bit64_4.0.5                   withr_3.0.0                  
 [43] BiocParallel_1.36.0           viridis_0.6.5                
 [45] DBI_1.2.2                     rappdirs_0.3.3               
 [47] bluster_1.12.0                tools_4.3.3                  
 [49] vipor_0.4.7                   beeswarm_0.4.0               
 [51] interactiveDisplayBase_1.40.0 httpuv_1.6.15                
 [53] glue_1.7.0                    rhdf5filters_1.14.1          
 [55] promises_1.2.1                Rtsne_0.17                   
 [57] cluster_2.1.6                 generics_0.1.3               
 [59] gtable_0.3.4                  BiocSingular_1.18.0          
 [61] ScaledMatrix_1.10.0           metapod_1.10.1               
 [63] utf8_1.2.4                    XVector_0.42.0               
 [65] ggrepel_0.9.5                 BiocVersion_3.18.1           
 [67] pillar_1.9.0                  limma_3.58.1                 
 [69] later_1.3.2                   dplyr_1.1.4                  
 [71] BiocFileCache_2.10.2          lattice_0.22-6               
 [73] bit_4.0.5                     tidyselect_1.2.1             
 [75] locfit_1.5-9.9                Biostrings_2.70.3            
 [77] gridExtra_2.3                 edgeR_4.0.16                 
 [79] xfun_0.43                     statmod_1.5.0                
 [81] yaml_2.3.8                    evaluate_0.23                
 [83] codetools_0.2-20              tibble_3.2.1                 
 [85] BiocManager_1.30.22           cli_3.6.2                    
 [87] xtable_1.8-4                  munsell_0.5.1                
 [89] Rcpp_1.0.12                   dbplyr_2.5.0                 
 [91] parallel_4.3.3                blob_1.2.4                   
 [93] sparseMatrixStats_1.14.0      bitops_1.0-7                 
 [95] viridisLite_0.4.2             scales_1.3.0                 
 [97] purrr_1.0.2                   crayon_1.5.2                 
 [99] rlang_1.1.3                   KEGGREST_1.42.0