suppressPackageStartupMessages({
library(TENxPBMCData)
library(SingleCellExperiment)
})# Import dataset:
= TENxPBMCData("pbmc5k-CITEseq") sce
see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
loading from cache
(Image credit: Elizabeth Sergison, Honeycomb Biotechnologies)
(Image credit: 10x Genomics: Getting started with single cell gene expression)
(Image credit: lexogen: rna-lexicon chapter8)
(Image credit: lexogen: rna-lexicon chapter8)
(Image credit: 10x Genomics)
DropletUtils
and Seurat
contain functions for reading CellRanger
outputtximeta
can read alevin output (specify type = "alevin"
)BUSpaRse
can read kallisto|bustools outputSummarizedExperiment
object (at least within the Bioconductor framework)SingleCellExperiment
objectSingleCellExperiment
object is an extension of the SummarizedExperiment
objectSingleCellExperiment
object can also store reduced dimension representations (PCA, tSNE, UMAP)(Original image credit: Jim Hester / https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html)
For this lecture, we will use an example data set from the TENxPBMCData
Bioconductor package. This package contains a number of PBMC data sets obtained with the 10x Genomics Chromium platform. Here we will use one containing approximately 3,000 peripheral blood mononuclear cells.
suppressPackageStartupMessages({
library(TENxPBMCData)
library(SingleCellExperiment)
})# Import dataset:
= TENxPBMCData("pbmc5k-CITEseq") sce
see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
loading from cache
We can see that sce
is indeed a SingleCellExperiment
object, and that the structure is similar to that of a SummarizedExperiment
object.
sce
class: SingleCellExperiment
dim: 33538 5247
metadata(0):
assays(1): counts
rownames(33538): ENSG00000243485 ENSG00000237613 ...
ENSG00000277475 ENSG00000268674
rowData names(4): ENSEMBL_ID Symbol_TENx Type Symbol
colnames: NULL
colData names(11): Sample Barcode ... Individual Date_published
reducedDimNames(0):
mainExpName: Gene Expression
altExpNames(1): Antibody Capture
The object contains gene annotations including the Ensembl gene ID and the gene symbol; to make it easier to interpret the results we combine these into a gene identifier that is used as the row names of the data object.
# Dimension of SingleCellExperiment:
dim(sce) # genes X cells (barcodes)
[1] 33538 5247
# colnames SingleCellExperiment:
head(colnames(sce))
NULL
# rownames SingleCellExperiment:
head(rownames(sce))
[1] "ENSG00000243485" "ENSG00000237613" "ENSG00000186092" "ENSG00000238009"
[5] "ENSG00000239945" "ENSG00000239906"
# SingleCellExperiment colData info:
colData(sce)
DataFrame with 5247 rows and 11 columns
Sample Barcode Sequence Library
<character> <character> <character> <integer>
1 pbmc5k-CITEseq AAACCCAAGAGACAAG-1 AAACCCAAGAGACAAG 1
2 pbmc5k-CITEseq AAACCCAAGGCCTAGA-1 AAACCCAAGGCCTAGA 1
3 pbmc5k-CITEseq AAACCCAGTCGTGCCA-1 AAACCCAGTCGTGCCA 1
4 pbmc5k-CITEseq AAACCCATCGTGCATA-1 AAACCCATCGTGCATA 1
5 pbmc5k-CITEseq AAACGAAAGACAAGCC-1 AAACGAAAGACAAGCC 1
... ... ... ... ...
5243 pbmc5k-CITEseq TTTGTTGCAGCACAAG-1 TTTGTTGCAGCACAAG 1
5244 pbmc5k-CITEseq TTTGTTGCAGTCTTCC-1 TTTGTTGCAGTCTTCC 1
5245 pbmc5k-CITEseq TTTGTTGCATGGCCCA-1 TTTGTTGCATGGCCCA 1
5246 pbmc5k-CITEseq TTTGTTGCATTGCCGG-1 TTTGTTGCATTGCCGG 1
5247 pbmc5k-CITEseq TTTGTTGGTCCGGCAT-1 TTTGTTGGTCCGGCAT 1
Cell_ranger_version Tissue_status Barcode_type Chemistry
<character> <character> <character> <character>
1 v3.0.2 NA Chromium Chromium_v3
2 v3.0.2 NA Chromium Chromium_v3
3 v3.0.2 NA Chromium Chromium_v3
4 v3.0.2 NA Chromium Chromium_v3
5 v3.0.2 NA Chromium Chromium_v3
... ... ... ... ...
5243 v3.0.2 NA Chromium Chromium_v3
5244 v3.0.2 NA Chromium Chromium_v3
5245 v3.0.2 NA Chromium Chromium_v3
5246 v3.0.2 NA Chromium Chromium_v3
5247 v3.0.2 NA Chromium Chromium_v3
Sequence_platform Individual Date_published
<character> <character> <character>
1 NovaSeq HealthyDonor 2019-05-29
2 NovaSeq HealthyDonor 2019-05-29
3 NovaSeq HealthyDonor 2019-05-29
4 NovaSeq HealthyDonor 2019-05-29
5 NovaSeq HealthyDonor 2019-05-29
... ... ... ...
5243 NovaSeq HealthyDonor 2019-05-29
5244 NovaSeq HealthyDonor 2019-05-29
5245 NovaSeq HealthyDonor 2019-05-29
5246 NovaSeq HealthyDonor 2019-05-29
5247 NovaSeq HealthyDonor 2019-05-29
# SingleCellExperiment rowData info:
rowData(sce)
DataFrame with 33538 rows and 4 columns
ENSEMBL_ID Symbol_TENx Type Symbol
<character> <character> <character> <character>
ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression NA
ENSG00000237613 ENSG00000237613 FAM138A Gene Expression FAM138A
ENSG00000186092 ENSG00000186092 OR4F5 Gene Expression OR4F5
ENSG00000238009 ENSG00000238009 AL627309.1 Gene Expression LOC100996442
ENSG00000239945 ENSG00000239945 AL627309.3 Gene Expression NA
... ... ... ... ...
ENSG00000277856 ENSG00000277856 AC233755.2 Gene Expression NA
ENSG00000275063 ENSG00000275063 AC233755.1 Gene Expression LOC102723407
ENSG00000271254 ENSG00000271254 AC240274.1 Gene Expression LOC102724250
ENSG00000277475 ENSG00000277475 AC213203.1 Gene Expression NA
ENSG00000268674 ENSG00000268674 FAM231C Gene Expression NA
# Rename rownames:
# combine Ensembl ID and gene symbol and set as row names of the data object.
rownames(sce) = paste0(rowData(sce)$ENSEMBL_ID, "_",
rowData(sce)$Symbol_TENx)
# Name the colnames:
colnames(sce) = sce$Sequence
Accessing the assay data, row and column annotations is done in the same way as for SummarizedExperiment
objects. There is also an additional counts()
accessor for the counts
assay.
# All assays present in the sce object:
assays(sce)
List of length 1
names(1): counts
# Access sce's assays:
= assays(sce)[['counts']]
counts_method1 1:5,1:3] counts_method1[
<5 x 3> sparse DelayedMatrix object of type "integer":
AAACCCAAGAGACAAG AAACCCAAGGCCTAGA
ENSG00000243485_MIR1302-2HG 0 0
ENSG00000237613_FAM138A 0 0
ENSG00000186092_OR4F5 0 0
ENSG00000238009_AL627309.1 0 0
ENSG00000239945_AL627309.3 0 0
AAACCCAGTCGTGCCA
ENSG00000243485_MIR1302-2HG 0
ENSG00000237613_FAM138A 0
ENSG00000186092_OR4F5 0
ENSG00000238009_AL627309.1 0
ENSG00000239945_AL627309.3 0
= assay(sce,'counts')
counts_method2 1:5,1:3] counts_method2[
<5 x 3> sparse DelayedMatrix object of type "integer":
AAACCCAAGAGACAAG AAACCCAAGGCCTAGA
ENSG00000243485_MIR1302-2HG 0 0
ENSG00000237613_FAM138A 0 0
ENSG00000186092_OR4F5 0 0
ENSG00000238009_AL627309.1 0 0
ENSG00000239945_AL627309.3 0 0
AAACCCAGTCGTGCCA
ENSG00000243485_MIR1302-2HG 0
ENSG00000237613_FAM138A 0
ENSG00000186092_OR4F5 0
ENSG00000238009_AL627309.1 0
ENSG00000239945_AL627309.3 0
= counts(sce)
counts_method3 1:5,1:3] counts_method3[
<5 x 3> sparse DelayedMatrix object of type "integer":
AAACCCAAGAGACAAG AAACCCAAGGCCTAGA
ENSG00000243485_MIR1302-2HG 0 0
ENSG00000237613_FAM138A 0 0
ENSG00000186092_OR4F5 0 0
ENSG00000238009_AL627309.1 0 0
ENSG00000239945_AL627309.3 0 0
AAACCCAGTCGTGCCA
ENSG00000243485_MIR1302-2HG 0
ENSG00000237613_FAM138A 0
ENSG00000186092_OR4F5 0
ENSG00000238009_AL627309.1 0
ENSG00000239945_AL627309.3 0
While the structure of the scRNAseq data is similar to that of the bulk data, there are also important differences that affect the downstream analysis. One of these differences is that single-cell data is much more sparse; in other words, there are many more zeros in the count matrix from a single-cell experiment than from a bulk experiment.
counts(sce)[1:10,1:3]
<10 x 3> sparse DelayedMatrix object of type "integer":
AAACCCAAGAGACAAG AAACCCAAGGCCTAGA
ENSG00000243485_MIR1302-2HG 0 0
ENSG00000237613_FAM138A 0 0
ENSG00000186092_OR4F5 0 0
ENSG00000238009_AL627309.1 0 0
ENSG00000239945_AL627309.3 0 0
ENSG00000239906_AL627309.2 0 0
ENSG00000241599_AL627309.4 0 0
ENSG00000236601_AL732372.1 0 0
ENSG00000284733_OR4F29 0 0
ENSG00000235146_AC114498.1 0 0
AAACCCAGTCGTGCCA
ENSG00000243485_MIR1302-2HG 0
ENSG00000237613_FAM138A 0
ENSG00000186092_OR4F5 0
ENSG00000238009_AL627309.1 0
ENSG00000239945_AL627309.3 0
ENSG00000239906_AL627309.2 0
ENSG00000241599_AL627309.4 0
ENSG00000236601_AL732372.1 0
ENSG00000284733_OR4F29 0
ENSG00000235146_AC114498.1 0
This is due to things such as:
Let’s check the fraction of zeros in our count matrix:
# fraction of 0 per barcode/cell:
= colMeans(counts(sce) == 0)
frac_0_per_cell
# fraction of 0 for all cells:
mean(counts(sce) == 0) # On average, the faction of 0 in a cell is of 94%
[1] 0.9473629
We also calculate the range of library sizes, noting that these are much smaller than the typical values for bulk RNA-seq:
# Library size (=total amount of reads per cell/barcode):
head(colSums(counts(sce)))
AAACCCAAGAGACAAG AAACCCAAGGCCTAGA AAACCCAGTCGTGCCA AAACCCATCGTGCATA
7375 3772 4902 6704
AAACGAAAGACAAGCC AAACGAAAGAGTGACC
3900 11781
# Summary of library size per cells:
summary(colSums(counts(sce)))
Min. 1st Qu. Median Mean 3rd Qu. Max.
501 3221 5506 6142 7686 53587
The number of cells in a scRNA-seq data set is typically much (several orders of magnitude) larger than the number of samples in a bulk RNA-seq experiment. Hence, the count matrices can get very large. However, since most of the values are zero, efficient storage modes, where only the non-zero values and the corresponding matrix positions are stored, can be employed. We can make sure that the count matrix in our object is indeed such a sparse matrix (in this particular data set, it is actually provided as a DelayedMatrix
, which is beyond the scope of this course to discuss in detail, but which can be suitable for very large data matrices that do not fit in memory).
class(counts(sce))
[1] "DelayedMatrix"
attr(,"package")
[1] "DelayedArray"
counts(sce) = as(counts(sce), "dgCMatrix")
class(counts(sce))
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"
counts(sce)[1:10, 1:10]
10 x 10 sparse Matrix of class "dgCMatrix"
[[ suppressing 10 column names 'AAACCCAAGAGACAAG', 'AAACCCAAGGCCTAGA', 'AAACCCAGTCGTGCCA' ... ]]
ENSG00000243485_MIR1302-2HG . . . . . . . . . .
ENSG00000237613_FAM138A . . . . . . . . . .
ENSG00000186092_OR4F5 . . . . . . . . . .
ENSG00000238009_AL627309.1 . . . . . . . . . .
ENSG00000239945_AL627309.3 . . . . . . . . . .
ENSG00000239906_AL627309.2 . . . . . . . . . .
ENSG00000241599_AL627309.4 . . . . . . . . . .
ENSG00000236601_AL732372.1 . . . . . . . . . .
ENSG00000284733_OR4F29 . . . . . . . . . .
ENSG00000235146_AC114498.1 . . . . . . . . . .
We already noted a couple of differences between scRNA-seq data and bulk data:
There are many genes that are not observed in one of the cells, but still have a high number of assigned counts in the other cell. This indicates that the gene detection is to some extent random, as expected by the sampling process. In other words, it is not always the same genes that go undetected in all cells.
While we don’t observe each gene in each cell, there is still a clear association between the overall expression level of a gene (e.g., total count) and the fraction of cells where it is detected.
plot(rowSums(counts(sce)), 1-rowMeans(counts(sce) == 0), log = "x",xlab='total count per gene',ylab='fraction of cell in which the gene is detected')
Warning in xy.coords(x, y, xlabel, ylabel, log): 12698 x values <= 0
omitted from logarithmic plot
scater
. For both for the cells and for the genes.suppressPackageStartupMessages({library(scater)})
# QC statistics per gene:
= addPerFeatureQCMetrics(sce)
sce
# Two new colunms (mean and detected) in rowData matrix:
rowData(sce)
DataFrame with 33538 rows and 6 columns
ENSEMBL_ID Symbol_TENx Type
<character> <character> <character>
ENSG00000243485_MIR1302-2HG ENSG00000243485 MIR1302-2HG Gene Expression
ENSG00000237613_FAM138A ENSG00000237613 FAM138A Gene Expression
ENSG00000186092_OR4F5 ENSG00000186092 OR4F5 Gene Expression
ENSG00000238009_AL627309.1 ENSG00000238009 AL627309.1 Gene Expression
ENSG00000239945_AL627309.3 ENSG00000239945 AL627309.3 Gene Expression
... ... ... ...
ENSG00000277856_AC233755.2 ENSG00000277856 AC233755.2 Gene Expression
ENSG00000275063_AC233755.1 ENSG00000275063 AC233755.1 Gene Expression
ENSG00000271254_AC240274.1 ENSG00000271254 AC240274.1 Gene Expression
ENSG00000277475_AC213203.1 ENSG00000277475 AC213203.1 Gene Expression
ENSG00000268674_FAM231C ENSG00000268674 FAM231C Gene Expression
Symbol mean detected
<character> <numeric> <numeric>
ENSG00000243485_MIR1302-2HG NA 0.0000000 0.00000
ENSG00000237613_FAM138A FAM138A 0.0000000 0.00000
ENSG00000186092_OR4F5 OR4F5 0.0000000 0.00000
ENSG00000238009_AL627309.1 LOC100996442 0.0051458 0.51458
ENSG00000239945_AL627309.3 NA 0.0000000 0.00000
... ... ... ...
ENSG00000277856_AC233755.2 NA 0.000190585 0.0190585
ENSG00000275063_AC233755.1 LOC102723407 0.000381170 0.0381170
ENSG00000271254_AC240274.1 LOC102724250 0.011244521 1.0672765
ENSG00000277475_AC213203.1 NA 0.000000000 0.0000000
ENSG00000268674_FAM231C NA 0.000000000 0.0000000
# Information per cells: stored in colData (not in rowData)
# 1. mean: mean expression of the gene (mean number of read across all cells)
# 2. detected: number of cell where the gene is detected (number of cell where count>0)
## QC statistics per cell:
# Information on subset of genes (target)
# Mitochrondrial gene ("MT" pattern within their symbol name):
= rownames(sce)[grep("^MT-", rowData(sce)$Symbol_TENx)]
mt # 13 genes mt
[1] "ENSG00000198888_MT-ND1" "ENSG00000198763_MT-ND2"
[3] "ENSG00000198804_MT-CO1" "ENSG00000198712_MT-CO2"
[5] "ENSG00000228253_MT-ATP8" "ENSG00000198899_MT-ATP6"
[7] "ENSG00000198938_MT-CO3" "ENSG00000198840_MT-ND3"
[9] "ENSG00000212907_MT-ND4L" "ENSG00000198886_MT-ND4"
[11] "ENSG00000198786_MT-ND5" "ENSG00000198695_MT-ND6"
[13] "ENSG00000198727_MT-CYB"
# For each cell, general statistics and statistics about the genes in the targeted group:
= addPerCellQCMetrics(sce, subsets = list(MT = mt))
sce
# Six new colunms in colData matrix:
colData(sce)
DataFrame with 5247 rows and 20 columns
Sample Barcode Sequence
<character> <character> <character>
AAACCCAAGAGACAAG pbmc5k-CITEseq AAACCCAAGAGACAAG-1 AAACCCAAGAGACAAG
AAACCCAAGGCCTAGA pbmc5k-CITEseq AAACCCAAGGCCTAGA-1 AAACCCAAGGCCTAGA
AAACCCAGTCGTGCCA pbmc5k-CITEseq AAACCCAGTCGTGCCA-1 AAACCCAGTCGTGCCA
AAACCCATCGTGCATA pbmc5k-CITEseq AAACCCATCGTGCATA-1 AAACCCATCGTGCATA
AAACGAAAGACAAGCC pbmc5k-CITEseq AAACGAAAGACAAGCC-1 AAACGAAAGACAAGCC
... ... ... ...
TTTGTTGCAGCACAAG pbmc5k-CITEseq TTTGTTGCAGCACAAG-1 TTTGTTGCAGCACAAG
TTTGTTGCAGTCTTCC pbmc5k-CITEseq TTTGTTGCAGTCTTCC-1 TTTGTTGCAGTCTTCC
TTTGTTGCATGGCCCA pbmc5k-CITEseq TTTGTTGCATGGCCCA-1 TTTGTTGCATGGCCCA
TTTGTTGCATTGCCGG pbmc5k-CITEseq TTTGTTGCATTGCCGG-1 TTTGTTGCATTGCCGG
TTTGTTGGTCCGGCAT pbmc5k-CITEseq TTTGTTGGTCCGGCAT-1 TTTGTTGGTCCGGCAT
Library Cell_ranger_version Tissue_status Barcode_type
<integer> <character> <character> <character>
AAACCCAAGAGACAAG 1 v3.0.2 NA Chromium
AAACCCAAGGCCTAGA 1 v3.0.2 NA Chromium
AAACCCAGTCGTGCCA 1 v3.0.2 NA Chromium
AAACCCATCGTGCATA 1 v3.0.2 NA Chromium
AAACGAAAGACAAGCC 1 v3.0.2 NA Chromium
... ... ... ... ...
TTTGTTGCAGCACAAG 1 v3.0.2 NA Chromium
TTTGTTGCAGTCTTCC 1 v3.0.2 NA Chromium
TTTGTTGCATGGCCCA 1 v3.0.2 NA Chromium
TTTGTTGCATTGCCGG 1 v3.0.2 NA Chromium
TTTGTTGGTCCGGCAT 1 v3.0.2 NA Chromium
Chemistry Sequence_platform Individual Date_published
<character> <character> <character> <character>
AAACCCAAGAGACAAG Chromium_v3 NovaSeq HealthyDonor 2019-05-29
AAACCCAAGGCCTAGA Chromium_v3 NovaSeq HealthyDonor 2019-05-29
AAACCCAGTCGTGCCA Chromium_v3 NovaSeq HealthyDonor 2019-05-29
AAACCCATCGTGCATA Chromium_v3 NovaSeq HealthyDonor 2019-05-29
AAACGAAAGACAAGCC Chromium_v3 NovaSeq HealthyDonor 2019-05-29
... ... ... ... ...
TTTGTTGCAGCACAAG Chromium_v3 NovaSeq HealthyDonor 2019-05-29
TTTGTTGCAGTCTTCC Chromium_v3 NovaSeq HealthyDonor 2019-05-29
TTTGTTGCATGGCCCA Chromium_v3 NovaSeq HealthyDonor 2019-05-29
TTTGTTGCATTGCCGG Chromium_v3 NovaSeq HealthyDonor 2019-05-29
TTTGTTGGTCCGGCAT Chromium_v3 NovaSeq HealthyDonor 2019-05-29
sum detected subsets_MT_sum subsets_MT_detected
<numeric> <integer> <numeric> <integer>
AAACCCAAGAGACAAG 7375 2363 467 11
AAACCCAAGGCCTAGA 3772 1259 343 13
AAACCCAGTCGTGCCA 4902 1578 646 11
AAACCCATCGTGCATA 6704 1908 426 11
AAACGAAAGACAAGCC 3900 1589 363 10
... ... ... ... ...
TTTGTTGCAGCACAAG 4754 1649 468 12
TTTGTTGCAGTCTTCC 6373 1901 553 11
TTTGTTGCATGGCCCA 1110 496 383 11
TTTGTTGCATTGCCGG 12220 3443 1287 12
TTTGTTGGTCCGGCAT 1932 651 752 12
subsets_MT_percent altexps_Antibody Capture_sum
<numeric> <numeric>
AAACCCAAGAGACAAG 6.33220 5178
AAACCCAAGGCCTAGA 9.09332 2893
AAACCCAGTCGTGCCA 13.17829 3635
AAACCCATCGTGCATA 6.35442 3700
AAACGAAAGACAAGCC 9.30769 3423
... ... ...
TTTGTTGCAGCACAAG 9.84434 2262
TTTGTTGCAGTCTTCC 8.67723 2886
TTTGTTGCATGGCCCA 34.50450 1140
TTTGTTGCATTGCCGG 10.53191 1919
TTTGTTGGTCCGGCAT 38.92340 5634
altexps_Antibody Capture_detected
<integer>
AAACCCAAGAGACAAG 31
AAACCCAAGGCCTAGA 29
AAACCCAGTCGTGCCA 29
AAACCCATCGTGCATA 31
AAACGAAAGACAAGCC 28
... ...
TTTGTTGCAGCACAAG 31
TTTGTTGCAGTCTTCC 30
TTTGTTGCATGGCCCA 24
TTTGTTGCATTGCCGG 28
TTTGTTGGTCCGGCAT 31
altexps_Antibody Capture_percent total
<numeric> <numeric>
AAACCCAAGAGACAAG 41.2491 12553
AAACCCAAGGCCTAGA 43.4059 6665
AAACCCAGTCGTGCCA 42.5794 8537
AAACCCATCGTGCATA 35.5632 10404
AAACGAAAGACAAGCC 46.7431 7323
... ... ...
TTTGTTGCAGCACAAG 32.2406 7016
TTTGTTGCAGTCTTCC 31.1697 9259
TTTGTTGCATGGCCCA 50.6667 2250
TTTGTTGCATTGCCGG 13.5724 14139
TTTGTTGGTCCGGCAT 74.4647 7566
# Information per cells: stored in colData (not in rowData)
# 1. sum: library size of the cell
# 2. detected: number of genes detected in the cell
# 3. subsets_MT_sum: total number of reads on the targeted genes
# 4. subsets_MT_detected: number of targeted genes detected
# 5. subsets_MT_percent: percentage of reads mapped on mitochondrial genes
# - Additional information on altExp (see end of the lecture)
we can plot the distribution of library sizes (sum
) and the number of detected genes (detected
) across the cells
par(mfrow=c(1,3))
hist(log10(sce$sum), breaks = 30,xlab='log10(sum of counts per cell)',main='distribution of library sizes')
hist(sce$detected, breaks = 30,xlab='number of genes detected per cell',main='distribution of detected genes')
The proportion of counts assigned to mitochondrial genes is another useful indicator of cell quality, since high numbers of such reads can be associated to cell damage.
hist(sce$subsets_MT_percent, breaks = 30,main='Distribution of mitochondrial genes proportion',xlab='proportion of counts')
Now that we have calculated the QC metrics, we will use it to filter out low-quality cells that will be excluded from the rest of the analysis.
The optimal parameters for filtering are debated and likely data set dependent. Here are two commonly used strategies:
Fixed threshold: For example, you might consider cells to be of low quality if their library sizes is below 100,000 reads; or if they express fewer than 5,000 genes; have mitochondrial proportions above 10%. However, while simple, this strategy requires substantial experience to determine appropriate thresholds for each experimental protocol and biological system.
Adaptive thresholds: A typical approach is to remove cells that fall ‘too far’ from the average cells on one or more of the considered criteria. This makes the implicit assumption that ‘most’ cells are of good quality, which is often sensible. It should also be noted that in some cases, cells that seem to be of bad quality can do so for biological reasons. For example, certain cell types express very few genes, or have a high metabolic rate and consequently express a lot of mitochondrial genes.
For exemple, here we exclude cells according to two criteria:
For each of these criteria, we exclude cells that are more than 4 median absolute deviations (MAD) from the median across cells, in the direction indicating low quality.
= isOutlier(sce$detected, type = "lower",
low_detected log = TRUE, nmads = 4) # lower than 225.1728 genes
= isOutlier(sce$subsets_MT_percent, type = "higher",
high_mt log = FALSE, nmads = 4) # higher than 21.94558 % percent of read on mitochondrial genes
par(mfrow=c(1,2))
plot(rank(-sce$detected), sce$detected, col = low_detected + 1,ylab='number of gene detected',xlab='cell rank')
plot(rank(sce$subsets_MT_percent), sce$subsets_MT_percent,
col = high_mt + 1,ylab=' proportion of counts assigned to mitochondrial genes',xlab='cell rank')
# To filter out the cells:
$retain = !low_detected & !high_mt
sce
# Number of cell retained and number of cell filtered out:
table(sce$retain)
FALSE TRUE
1006 4241
# All cell with a low detected amount of genes have also a high mt percentage:
which(low_detected==TRUE)%in%which(high_mt==TRUE)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# Dimension of sce object change:
dim(sce)
[1] 33538 5247
= sce[, sce$retain]
sce dim(sce) # 4241 cells left (1006 cell removed --> bad quality cells)
[1] 33538 4241
# look at the distribution of the remaining cells
par(mfrow=c(1,3))
hist(log10(sce$sum), breaks = 30,xlab='log10(sum of counts per cell)',main='distribution of library sizes')
hist(sce$detected, breaks = 30,xlab='number of genes detected per cell',main='distribution of detected genes')
hist(sce$subsets_MT_percent, breaks = 30,main='Distribution of mitochondrial genes proportion',xlab='proportion of counts')
Just as for bulk RNA-seq, the raw scRNA-seq counts are not directly comparable across cells due to, e.g., differences in library sizes. This typically arise from technical differences due to the difficulty of achieving consistent library preparation with minimal starting material.
Thus, we need to apply a normalization strategy. There are many approaches to normalization of scRNA-seq data (see e.g. Lytal et al. 2020 and Cole et al. 2019 for comparisons). In this lecture we will focus on scaling normalization, which is the simplest and most commonly used class of normalization strategies. This involves dividing all counts for each cell by a cell-specific scaling factor, often called a “size factor”.
Here, we will use one scaling strategy implemented in the scran
package. Similarly to the TMM and DESeq normalization approaches, this works by estimating a size factor for each cell, which incorporates the library size as well as a measure of the RNA composition . The removal of composition biases is a well-studied problem (e.g. bulk RNA-seq). It assumes that most genes are not DE between cells. Any systematic difference in count size across the non-DE majority of genes between two cells is assumed to represent bias that is used to compute an appropriate size factor for its removal.
However, the bulk RNA-seq methods are sometimes struggling with scRNA-seq due to the large number of zeros; the scran
approach solves this by repeatedly pooling multiple cells (which reduces the number of zeros), calculating size factors for the pools, and deriving individual size factors for the cells from those of the pools. After calculating the size factors, we normalize the observed counts and log-transform the values. The new “log counts” are placed in a new assay in sce
, named logcounts
.
Simplification of a complex signal and consists of several key steps:
# Normalization by deconvolution: Scaling normalization by deconvolving size factors from cell pools.
suppressPackageStartupMessages({library(scran)})
# quickCluster(): pre-clustering step where cells in each cluster are normalized separately.
# The size factors are rescaled to be comparable across clusters. Avoid assumption that most genes are non-DE across the entire population.
set.seed(100)
= quickCluster(sce)
clust = computeSumFactors(sce, cluster=clust, min.mean=0.1) # min.mean: minimum average count of genes to be used for normalization.
sce
# New sizeFactor column:
colnames(colData(sce))
[1] "Sample"
[2] "Barcode"
[3] "Sequence"
[4] "Library"
[5] "Cell_ranger_version"
[6] "Tissue_status"
[7] "Barcode_type"
[8] "Chemistry"
[9] "Sequence_platform"
[10] "Individual"
[11] "Date_published"
[12] "sum"
[13] "detected"
[14] "subsets_MT_sum"
[15] "subsets_MT_detected"
[16] "subsets_MT_percent"
[17] "altexps_Antibody Capture_sum"
[18] "altexps_Antibody Capture_detected"
[19] "altexps_Antibody Capture_percent"
[20] "total"
[21] "retain"
[22] "sizeFactor"
# The size factor for each cell represents the estimate of the relative bias in that cell, so division of its counts by its size factor should remove that biais
# To access the size factor:
sizeFactors(sce)[1:10]
[1] 1.0697706 0.4570165 0.6278514 0.8410376 0.5639728 1.7096026 0.5648314
[8] 1.3857665 1.4768468 0.5850588
# Compute log-transformed normalized expression values (using size factor):
# divides the count for each gene with the appropriate size factor for that cell
= logNormCounts(sce)
sce assayNames(sce) # new assays logcounts
[1] "counts" "logcounts"
scRNA-seq is wiedly used to characterize heterogeneity across cells. To compare cells based on their gene expression profiles, we can use procedures like clustering and dimensionality reduction which involves aggregating per-gene differences into a single (dis)similarity metric. The choice of genes to use in this calculation has a crutial impact on the behavior of the metric.
In pratical, we want to identify genes that contain useful information about the biology of the system while removing genes that contain random noise.
Variation in gene abundance estimates between different cells can be thought of as the convolution of the technical (mainly sampling) and the biological (e.g cell type) sources of variance. Typically one wants to isolate and focus on the biological variance so that differences due to experimental noise (technical noise) have as small an impact as possible on subsequent analyses.
There are different approaches to disentangling the technical and biological variability in a single-cell data set:
1. The “most” genes exhibit only technical variability (i.e., most genes are not differentially expressed between different studied cell types).
2. The technical variance follows a Poisson distribution (a common distribution capturing sampling variability in count data), and that deviations from the Poisson expectation corresponds to biological variability.
The simplest approach to quantifying per-gene variation is to compute the variance of the log-normalized expression values (i.e., “log-counts” ) for each gene across all cells. This requires modelling of the mean-variance relationship beacause the log-transformation is often not a variance stabilizing transformation (total variance of a gene is driven more by its abundance than its underlying biological heterogeneity). To account for this effect, we fit a trend to the variance with respect to abundance across all genes
# Fit a trend to the variance with respect to abundance across all genes:
= modelGeneVar(sce)
dec.trend
# Visualizing the fit:
= metadata(dec.trend)
fit.trend par(mfrow=c(1,2))
plot(fit.trend$mean, fit.trend$var, xlab = "Mean of log-expression",
ylab = "Variance of log-expression")
curve(fit.trend$trend(x), col = "dodgerblue", add = TRUE, lwd = 2)
However, the previous assumption may be problematic in scenarios where many genes at a particular abundance are affected by a biological process. We can avoid this problem by creating a trend by making some distributional assumptions about the noise. UMI counts typically exhibit near-Poisson variation and can be used to construct a mean-variance trend
## Poisson assumption: Makes some distributional assumptions about the technial noise.
# random seed because of stochastic nature of Poisson modeling
set.seed(123)
= modelGeneVarByPoisson(sce)
dec.pois plot(dec.pois$mean, dec.pois$total, xlab = "Mean of log-expression",
ylab = "Variance of log-expression")
curve(metadata(dec.pois)$trend(x), col = "dodgerblue", add = TRUE)
In each case, the biological variance of a gene can be estimated as the difference between the total variance and the modeled technical variance. This value can be used to rank the genes, and to select a set of “highly variable” genes to use as the basis of further downstream analysis. Here, we select the top 1,000 genes based on the trend fit, and add these to the metadata
of the SingleCellExperiment
object.
head(dec.trend[order(dec.trend$bio, decreasing = TRUE), ])
DataFrame with 6 rows and 6 columns
mean total tech bio
<numeric> <numeric> <numeric> <numeric>
ENSG00000090382_LYZ 2.56480 9.27648 0.987912 8.28857
ENSG00000163220_S100A9 2.13558 8.15133 0.993180 7.15815
ENSG00000143546_S100A8 1.54234 5.47531 0.953660 4.52165
ENSG00000204287_HLA-DRA 1.97873 4.83492 0.990328 3.84459
ENSG00000101439_CST3 1.55593 4.12547 0.955645 3.16982
ENSG00000196126_HLA-DRB1 1.87228 4.04842 0.985001 3.06342
p.value FDR
<numeric> <numeric>
ENSG00000090382_LYZ 3.43209e-221 7.08075e-217
ENSG00000163220_S100A9 7.54485e-164 7.78289e-160
ENSG00000143546_S100A8 3.50916e-72 2.41325e-68
ENSG00000204287_HLA-DRA 4.34061e-49 1.49252e-45
ENSG00000101439_CST3 2.18125e-36 4.50015e-33
ENSG00000196126_HLA-DRB1 3.12392e-32 4.60354e-29
head(dec.pois[order(dec.pois$bio, decreasing = TRUE), ])
DataFrame with 6 rows and 6 columns
mean total tech bio p.value
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000090382_LYZ 2.56480 9.27648 0.553318 8.72317 0
ENSG00000163220_S100A9 2.13558 8.15133 0.674412 7.47692 0
ENSG00000143546_S100A8 1.54234 5.47531 0.787616 4.68769 0
ENSG00000204287_HLA-DRA 1.97873 4.83492 0.715043 4.11987 0
ENSG00000101439_CST3 1.55593 4.12547 0.786727 3.33874 0
ENSG00000196126_HLA-DRB1 1.87228 4.04842 0.739656 3.30876 0
FDR
<numeric>
ENSG00000090382_LYZ 0
ENSG00000163220_S100A9 0
ENSG00000143546_S100A8 0
ENSG00000204287_HLA-DRA 0
ENSG00000101439_CST3 0
ENSG00000196126_HLA-DRB1 0
= rownames(dec.trend[order(dec.trend$bio, decreasing = TRUE), ])[1:1000]
modelGeneVar_top1000 = rownames(dec.pois[order(dec.pois$bio, decreasing = TRUE), ])[1:1000]
modelGeneVarByPoisson_top1000 # majority of genes found in both approach.
table(modelGeneVar_top1000%in%modelGeneVarByPoisson_top1000)
FALSE TRUE
192 808
= getTopHVGs(dec.pois, n = 1000)
hvg.var head(hvg.var)
[1] "ENSG00000090382_LYZ" "ENSG00000163220_S100A9"
[3] "ENSG00000143546_S100A8" "ENSG00000204287_HLA-DRA"
[5] "ENSG00000101439_CST3" "ENSG00000196126_HLA-DRB1"
# Save hvgs into metadata (does not need to have the same mount of row/col as for rowData or colData)
metadata(sce)$hvgs = hvg.var
For bulk RNA-seq data, we typically use PCA for visualization and exploratory analysis. This can of course also be applied to single-cell RNA-seq data. However, other methods (e.g., tSNE and UMAP) are more commonly used for scRNA-seq. Both tSNE and UMAP are non-linear dimension reduction methods, which focus to a larger extent on retaining small distances. That means that cells that are similar to each other (e.g., cells of the same cell type) tend to be placed close together in the low-dimensional representation, whereas larger distances are potentially less faithfully represented. PCA, on the other hand, tend to put more focus on preserving the large cell-to-cell distances. Importantly, even though PCA is rarely used for visualization of large scRNA-seq data sets in two-dimensional plots, it is often used as a first dimension reduction step, and other approaches such as tSNE and UMAP are subsequently applied to the PCA output.
Here, we therefore first apply PCA to our data set. We supply the set of highly variable genes derived above, to only calculate the principal components based on these. By default, the runPCA
function from the scater
package will apply the PCA to the ‘logcounts’ assay. We plot the first two principal components (note that we extract 50 components, which will be used as the basis for the tSNE later) and color by the different colData entries.
# PCA: Principal component analysis is a technique for reducing the dimensionality of complex datasets, increasing interpretability but at the same time minimizing information loss
= runPCA(sce, exprs_values = "logcounts", ncomponents = 50,
sce subset_row = metadata(sce)$hvgs)
# New reducedDimNames entry:
reducedDimNames(sce)
[1] "PCA"
## PCA ploting:
# coloration according to colData:
plotReducedDim(sce, "PCA", colour_by = "detected")
plotReducedDim(sce, "PCA", colour_by = "sum")
plotReducedDim(sce, "PCA", colour_by = "subsets_MT_sum")
Next, we apply tsne, which is a non-linear, stochastic dimension reduction technique that attempts to find a mapping of the data on a low subspace while preserving local distances between cells. The non-linear character of tSNE means that it will often produce projections that better resolve differences between cell groups. The better separation of tSNE comes at the cost of interpretability; while in a tSNE representation similar cells are placed close to each other, longer distances in the representation are not guaranteed to reflect true relationships. This means that it is risky to draw conclusions of “similarity” or “dissimilarity” from the positional relationships of different cell groupings that appear in a tSNE plot. In addition, the stochastic nature of tSNE means that every time the algorithm is applied a different representation will be produced unlesss a random seed is set.
# random seed because of stochastic nature of tSNE
set.seed(123)
= runTSNE(sce, dimred = "PCA") # use 50 PCs of PCA
sce reducedDimNames(sce)
[1] "PCA" "TSNE"
plotReducedDim(sce, "TSNE", colour_by = "detected")
plotReducedDim(sce, "TSNE", colour_by = "sum")
plotReducedDim(sce, "TSNE", colour_by = "subsets_MT_sum")
We can often identify specific cell types and gain some understanding of the data directly from the visualization (other types of cell type identification, e.g. via cell type clustering, will be discussed in the next lecture). Typically, this would be done by colouring the cells according to the expression level of known marker genes for certain cell types.
## Color by expression of a B-cell marker (CD20)
plotReducedDim(sce, "TSNE", colour_by = "ENSG00000156738_MS4A1")
## Color by expression of a T-cell marker
plotReducedDim(sce, "TSNE", colour_by = "ENSG00000167286_CD3D")
plotReducedDim(sce, "TSNE", colour_by = "ENSG00000126353_CCR7")
## Color by expression of a Monocyte marker
plotReducedDim(sce, "TSNE", colour_by = "ENSG00000090382_LYZ")
## Color by expression of a platelet marker
plotReducedDim(sce, "TSNE", colour_by = "ENSG00000163736_PPBP")
To better identify specific cell type we can also use computational approaches that exploit prior information.
Here we will directly compare our expression profiles to published reference datasets where each cell has already been annotated with its putative biological state by domain experts (https://rdrr.io/github/LTLA/celldex/man/DatabaseImmuneCellExpressionData.html).
# libraries
## BiocManager::install("celldex")
suppressPackageStartupMessages({
library('celldex')
library('SingleR')
})
# Reference annotation (https://dice-database.org)
= celldex::DatabaseImmuneCellExpressionData() ref.dice
see ?celldex and browseVignettes('celldex') for documentation
loading from cache
see ?celldex and browseVignettes('celldex') for documentation
loading from cache
colData(ref.dice) # two labels: label.main and label.fine
DataFrame with 1561 rows and 3 columns
label.main label.fine label.ont
<character> <character> <character>
TPM_1 B cells B cells, naive CL:0000788
TPM_2 B cells B cells, naive CL:0000788
TPM_3 B cells B cells, naive CL:0000788
TPM_4 B cells B cells, naive CL:0000788
TPM_5 B cells B cells, naive CL:0000788
... ... ... ...
TPM_98 T cells, CD8+ T cells, CD8+, naive.. CL:0000906
TPM_99 T cells, CD8+ T cells, CD8+, naive.. CL:0000906
TPM_100 T cells, CD8+ T cells, CD8+, naive.. CL:0000906
TPM_101 T cells, CD8+ T cells, CD8+, naive.. CL:0000906
TPM_102 T cells, CD8+ T cells, CD8+, naive.. CL:0000906
# Make sure both object have the same rownames (ENSEMBL symbol in this case)
rownames(sce) = rowData(sce)$Symbol_TENx
## Prediction using main label
= SingleR(test = sce, ref = ref.dice, labels = ref.dice$label.main)
pred
# Add prediction to the sce object
$main.annotation.dice = pred$pruned.labels
sce# Plot on TSNE
plotReducedDim(sce, "TSNE", colour_by = as.data.frame(sce$main.annotation.dice))
In this lesson we have used Bioconductor packages for the analysis, and represented the data in a SingleCellExperiment
container. Just as for bulk RNA-seq, there are also other containers that are often used for single-cell data. The most common alternative is the Seurat
object, which is the basis for the workflow provided by the Seurat
package. This package provides to a large extent similar capabilities as the Bioconductor packages we have seen in this lecture, and can be used as an alternative. The webpage contains a collection of tutorials (including one for the same data set that we studied here).
Initially developped to measure gene expression at a single cell level, scRNA-seq approaches have rapidely evolved to allow simultaneous profiling of multiple additional features.
In more simple term: transcriptome + another layer of information from the same cell.
Here are a some examples of multimodal approches becoming more and more commonly used in the scRNA-seq field:
(Image credit: Schematic figure of (A) the CITE-seq antibody linked with the barcoded oligo (B) the CITE-seq protocol. Credit: Winston & Gregory Timp (2020), license: CC-BY 4.0)
(Image credit: cell-hashing in www.protocols.io )
As a more genereal term, multi-omic single cell sequencing groups all layers of information (eg. Chromatin accessibility, DNA methylation,…) that can be collected together with the transcriptome of a cell.
A. Explore the SingleCellExperiment object:
# load the data:
= TENxPBMCData("pbmc6k") sce
see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
loading from cache
# Dimension of SingleCellExperiment:
dim(sce) # genes X cells (barcodes)
[1] 32738 5419
# colnames SingleCellExperiment:
head(colnames(sce)) # No colnames yet
NULL
# rownames SingleCellExperiment:
head(rownames(sce)) # represent the mouse Ensembl gene ID (http://www.ensembl.org/Mus_musculus/Info/Index)
[1] "ENSG00000243485" "ENSG00000237613" "ENSG00000186092" "ENSG00000238009"
[5] "ENSG00000239945" "ENSG00000237683"
# SingleCellExperiment colData info:
colData(sce) # sample (=cells) information (Barcode,library, Tissue_status,... )
DataFrame with 5419 rows and 11 columns
Sample Barcode Sequence Library
<character> <character> <character> <integer>
1 pbmc6k AAACATACAACCAC-1 AAACATACAACCAC 1
2 pbmc6k AAACATACACCAGT-1 AAACATACACCAGT 1
3 pbmc6k AAACATACCCGTAA-1 AAACATACCCGTAA 1
4 pbmc6k AAACATACCTAAGC-1 AAACATACCTAAGC 1
5 pbmc6k AAACATACCTTCCG-1 AAACATACCTTCCG 1
... ... ... ... ...
5415 pbmc6k TTTGCATGCACTAG-1 TTTGCATGCACTAG 1
5416 pbmc6k TTTGCATGCCGAAT-1 TTTGCATGCCGAAT 1
5417 pbmc6k TTTGCATGGAGGTG-1 TTTGCATGGAGGTG 1
5418 pbmc6k TTTGCATGGATAAG-1 TTTGCATGGATAAG 1
5419 pbmc6k TTTGCATGGGCATT-1 TTTGCATGGGCATT 1
Cell_ranger_version Tissue_status Barcode_type Chemistry
<character> <character> <character> <character>
1 v1.1.0 NA GemCode Chromium_v1
2 v1.1.0 NA GemCode Chromium_v1
3 v1.1.0 NA GemCode Chromium_v1
4 v1.1.0 NA GemCode Chromium_v1
5 v1.1.0 NA GemCode Chromium_v1
... ... ... ... ...
5415 v1.1.0 NA GemCode Chromium_v1
5416 v1.1.0 NA GemCode Chromium_v1
5417 v1.1.0 NA GemCode Chromium_v1
5418 v1.1.0 NA GemCode Chromium_v1
5419 v1.1.0 NA GemCode Chromium_v1
Sequence_platform Individual Date_published
<character> <character> <character>
1 NextSeq500 HealthyDonor2 2016-07-31
2 NextSeq500 HealthyDonor2 2016-07-31
3 NextSeq500 HealthyDonor2 2016-07-31
4 NextSeq500 HealthyDonor2 2016-07-31
5 NextSeq500 HealthyDonor2 2016-07-31
... ... ... ...
5415 NextSeq500 HealthyDonor2 2016-07-31
5416 NextSeq500 HealthyDonor2 2016-07-31
5417 NextSeq500 HealthyDonor2 2016-07-31
5418 NextSeq500 HealthyDonor2 2016-07-31
5419 NextSeq500 HealthyDonor2 2016-07-31
# SingleCellExperiment rowData info:
rowData(sce) # gene information (ENSEMBL_ID, Symbol_TENx, ...)
DataFrame with 32738 rows and 3 columns
ENSEMBL_ID Symbol_TENx Symbol
<character> <character> <character>
ENSG00000243485 ENSG00000243485 MIR1302-10 NA
ENSG00000237613 ENSG00000237613 FAM138A FAM138A
ENSG00000186092 ENSG00000186092 OR4F5 OR4F5
ENSG00000238009 ENSG00000238009 RP11-34P13.7 LOC100996442
ENSG00000239945 ENSG00000239945 RP11-34P13.8 NA
... ... ... ...
ENSG00000215635 ENSG00000215635 AC145205.1 NA
ENSG00000268590 ENSG00000268590 BAGE5 NA
ENSG00000251180 ENSG00000251180 CU459201.1 NA
ENSG00000215616 ENSG00000215616 AC002321.2 NA
ENSG00000215611 ENSG00000215611 AC002321.1 NA
B. Modify rownames: set the row names to gene identifier (= gene ENSEMBL_ID + Symbol_TENx)
# combine Ensembl ID and gene symbol and set as row names of the data object.
rownames(sce) = paste0(rowData(sce)$ENSEMBL_ID, "_",
rowData(sce)$Symbol_TENx)
C. The loaded object has no column names. Set the column names to the cell barcode:
all(gsub("-1", "", sce$Barcode) == sce$Sequence)
[1] TRUE
any(duplicated(sce$Sequence)) # No duplicate because there is only one sample (Library) present.
[1] FALSE
# if they would be duplicate, we would need to combine the sequence barcode with the library (incase of multiple sample: e.g AAACATACAACCAC-1 for sample 1 and AAACATACAACCAC-2 for sample 2)
colnames(sce) = sce$Sequence
Check that the counts()
accessor returns the same matrix as you would get if you access the counts assay using assay()
or assays()
.
identical(assay(sce, "counts"), counts(sce))
[1] TRUE
all(assay(sce, "counts") == counts(sce)) ## alternative
[1] TRUE
identical(assays(sce)[["counts"]], counts(sce))
[1] TRUE
## Total number of detected genes
# sum of read per gene:
# number of gene that have 0 over all cells:
sum(rowSums(counts(sce)) > 0) # 17220
[1] 17220
sum(rowSums(counts(sce)) > 0)/nrow(sce) # 0.5259943
[1] 0.5259943
## Genes detected in single cells
# Select only gene that have at least 1 read over all cells --> make the data frame smaller.
# sum over the column (for each cells): number of gene expressed per cell
summary(colSums(counts(sce) > 0)) # very large variation of numbers --> This indicates that the gene detection is to some extent random, as expected by the sampling process. In other words, it is not always the same genes that go undetected in all cells.
Min. 1st Qu. Median Mean 3rd Qu. Max.
81.0 592.0 716.0 748.1 850.0 3185.0
# General per gene QC:
= addPerFeatureQC(sce) # sum and detected for each gene
sce
## # General per cell QC:
= rownames(sce)[grep("^MT-", rowData(sce)$Symbol_TENx)]
mt = addPerCellQC(sce, subsets = list(MT = mt)) # sum and detected gene and targeted genes for each cell
sce
plot(sce$sum, sce$detected) # yes, the more you sequenced, the more you can detecte genes.
Filter out the cells identified as being of low quality according to:
= sce$sum < 1000
low_sum = isOutlier(sce$detected, type = "lower",
low_detected log = TRUE, nmads = 4) # minimum of 243 genes detected to be consider as good quality cells
table(low_sum == TRUE & low_detected==TRUE)
FALSE TRUE
5413 6
# ALL 6 cells with not enough gene detected have also not enough reads to be considered as good quality cells.
# As shown before, association between the total count and the number of detected genes
# But also: bad quality cells can be identified by multiple QC approach.
$retain = !low_detected & !low_sum
scetable(sce$retain)
FALSE TRUE
446 4973
= sce[, sce$retain]
sce dim(sce) # 4973 cells left
[1] 32738 4973
Use a tSNE to visualization the proportion of mitochondrial genes
# Use Poisson or per-gene variation:
= modelGeneVarByPoisson(sce)
dec.pois = getTopHVGs(dec.pois, n = 1000)
hvg.var metadata(sce)$hvgs = hvg.var
# Compute size factor:
## Option 1: Using pre-clustering step (quickCluster()) where cells in each cluster are normalized separately and the size factors are rescaled to be comparable across clusters. This avoids the assumption that most genes are non-DE across the entire population - only a non-DE majority is required between pairs of clusters, which is a weaker assumption for highly heterogeneous populations.
set.seed(100)
= quickCluster(sce)
clust = computeSumFactors(sce,clust=clust, min.mean = 0.1)
sce
## Option2: No pre-clustering
= computeSumFactors(sce, min.mean = 0.1)
sce # logcount for PCA:
= logNormCounts(sce)
sce
# PCA with 50 PCS and HVGs
= runPCA(sce, exprs_values = "logcounts", ncomponents = 50,
sce subset_row = metadata(sce)$hvgs)
set.seed(123)
# use 50 PCs of PCA:
= runTSNE(sce, dimred = "PCA")
sce # plot according to the number of Mitochondrial gene detected
plotReducedDim(sce, "TSNE", colour_by = "subsets_MT_detected")
R version 4.3.3 (2024-02-29)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Big Sur 11.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Zurich
tzcode source: internal
attached base packages:
[1] stats4 grid stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] SingleR_2.4.1 celldex_1.12.0
[3] scran_1.30.2 scater_1.30.1
[5] ggplot2_3.5.0 scuttle_1.12.0
[7] TENxPBMCData_1.20.0 HDF5Array_1.30.1
[9] rhdf5_2.46.1 DelayedArray_0.28.0
[11] SparseArray_1.2.4 S4Arrays_1.2.1
[13] abind_1.4-5 Matrix_1.6-5
[15] SingleCellExperiment_1.24.0 SummarizedExperiment_1.32.0
[17] Biobase_2.62.0 GenomicRanges_1.54.1
[19] GenomeInfoDb_1.38.8 IRanges_2.36.0
[21] S4Vectors_0.40.2 BiocGenerics_0.48.1
[23] MatrixGenerics_1.14.0 matrixStats_1.2.0
[25] BiocStyle_2.30.0 png_0.1-8
[27] knitr_1.45
loaded via a namespace (and not attached):
[1] rstudioapi_0.16.0 jsonlite_1.8.8
[3] magrittr_2.0.3 ggbeeswarm_0.7.2
[5] farver_2.1.1 rmarkdown_2.26
[7] zlibbioc_1.48.2 vctrs_0.6.5
[9] memoise_2.0.1 DelayedMatrixStats_1.24.0
[11] RCurl_1.98-1.14 htmltools_0.5.8
[13] AnnotationHub_3.10.0 curl_5.2.1
[15] BiocNeighbors_1.20.2 Rhdf5lib_1.24.2
[17] htmlwidgets_1.6.4 cachem_1.0.8
[19] igraph_2.0.3 mime_0.12
[21] lifecycle_1.0.4 pkgconfig_2.0.3
[23] rsvd_1.0.5 R6_2.5.1
[25] fastmap_1.1.1 GenomeInfoDbData_1.2.11
[27] shiny_1.8.1 digest_0.6.35
[29] colorspace_2.1-0 AnnotationDbi_1.64.1
[31] dqrng_0.3.2 irlba_2.3.5.1
[33] ExperimentHub_2.10.0 RSQLite_2.3.6
[35] beachmat_2.18.1 filelock_1.0.3
[37] labeling_0.4.3 fansi_1.0.6
[39] httr_1.4.7 compiler_4.3.3
[41] bit64_4.0.5 withr_3.0.0
[43] BiocParallel_1.36.0 viridis_0.6.5
[45] DBI_1.2.2 rappdirs_0.3.3
[47] bluster_1.12.0 tools_4.3.3
[49] vipor_0.4.7 beeswarm_0.4.0
[51] interactiveDisplayBase_1.40.0 httpuv_1.6.15
[53] glue_1.7.0 rhdf5filters_1.14.1
[55] promises_1.2.1 Rtsne_0.17
[57] cluster_2.1.6 generics_0.1.3
[59] gtable_0.3.4 BiocSingular_1.18.0
[61] ScaledMatrix_1.10.0 metapod_1.10.1
[63] utf8_1.2.4 XVector_0.42.0
[65] ggrepel_0.9.5 BiocVersion_3.18.1
[67] pillar_1.9.0 limma_3.58.1
[69] later_1.3.2 dplyr_1.1.4
[71] BiocFileCache_2.10.2 lattice_0.22-6
[73] bit_4.0.5 tidyselect_1.2.1
[75] locfit_1.5-9.9 Biostrings_2.70.3
[77] gridExtra_2.3 edgeR_4.0.16
[79] xfun_0.43 statmod_1.5.0
[81] yaml_2.3.8 evaluate_0.23
[83] codetools_0.2-20 tibble_3.2.1
[85] BiocManager_1.30.22 cli_3.6.2
[87] xtable_1.8-4 munsell_0.5.1
[89] Rcpp_1.0.12 dbplyr_2.5.0
[91] parallel_4.3.3 blob_1.2.4
[93] sparseMatrixStats_1.14.0 bitops_1.0-7
[95] viridisLite_0.4.2 scales_1.3.0
[97] purrr_1.0.2 crayon_1.5.2
[99] rlang_1.1.3 KEGGREST_1.42.0