# If any package is missing, install it using the following commands and then load it.
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#
# BiocManager::install("SingleCellExperiment")
# library(SingleCellExperiment)
# Loading libraries
suppressPackageStartupMessages({
library(SpatialExperiment)
library(ggspavis)
library(STexampleData)
library(scater)
library(scran)
library(pals)
})
Introduction to spatial transcriptomics
Install necessary libraries
Goal
Introduce spatial transcriptomics technologies and basic analysis of spatial transcriptomics data sets:
- Present technologies for generating spatial transcriptomics data.
- Introduce
SpatialExperiment
andSpatialImage
classes. - Demonstrate an example of a spatial transcriptomics data set.
- Perform quality control of the example data set.
- Perform principal component analysis of the example data set.
Motivation
Single-cell transcriptomics methods presented in two previous lectures allow to study the gene expression in single cells instead of averaged expression of genes in a sample. However, these methods contain a step, where the tissue is dissociated, and thus spatial information of cells is lost. Recently developed spatial omics technologies overcome this limitation and enable molecular measurements with spatial resolution.
A widely used allegory to describe different transcriptomic methods is that the data in bulk RNA sequencing is like a fruit smoothie and in single-cell RNA sequencing like a fruit salad. Spatial data can be represented like a fruit tart, where each piece of fruit has its own distinct place with regard to other pieces of fruit.
Figure: The output of transcriptomics methods pictured as different forms of fruit. In this analogy, each piece of fruit represents a single cell. Figure was taken from pixabay.
Spatial technologies allow to address numerous research questions that can not be answered with bulk and/or single-cell technologies. For example:
- how cells interact and communicate within tissues,
- how cellular context influences gene expression,
- how spatial organization contributes to tissue development and disease.
Spatial technologies including sequencing platforms and recipes for the analysis of generated spatial transcriptmics data evolve extremely fast. This session will provide a short introduction into this field based on what is available right now. For more information refer to “Orchestrating Spatial Transcriptomics Analysis with Bioconductor” (shortly OSTA) book, which introduces the majority of commercially available sequencing platforms to generate spatial data and Bioconductor
packages for the analysis of different types of spatial data.
Platforms for generating spatial transcriptomics data
Platforms for generating spatial transcriptomics data can be broadly grouped into “sequencing-based” and “imaging-based” technologies. While both approaches reveal spatial locations of the gene expression, their underlying technologies differ drastically in capturing spatial information and determining abundance of specific mRNA molecules within tissue.
Sequencing-based platforms
The majority of sequencing-based technologies integrate spatially barcoded arrays with next-generation sequencing to determine the locations and expression levels of transcripts within tissues. In most cases, this is achieved by capturing mRNA transcripts within the tissue using a polyT tail built into unique, spatially barcoded probes on the array. During cDNA synthesis, these spatial barcodes are incorporated into each cDNA molecule. By subsequent library construction and sequencing, researchers can determine the expression levels of transcripts and map them back to their precise locations within the tissue based on their spatial barcodes.
Figure was taken from Rao et al. 2021.
These are representative widely used platforms for sequencing-based spatial transcriptomics:
- Visium, see the provider here,
- Visium HD, see the provider here,
- Stereoseq, see the provider here,
- GeoMx, see the provider here.
The fundamental difference among these technologies is the feature size of the array, which largely determines spatial resolution (see figure below).
Figure was taken from Ju Lim et al. 2025.
Imaging-based platforms
Imaging-based technologies employ single-molecule fluorescence in situ hybridization (smFISH) Femino et al. 1998 as their backbone technology. These technologies enable the simultaneous detection of up to 6000 RNA transcripts in a single experiment through cyclic, highly multiplexed smFISH. This is achieved by using dozens of primary probes that hybridize to specific RNA transcripts, followed by secondary probes labeled with different fluorophores hybridizing to the primary probes. By sequentially hybridizing and imaging fluorescence from these secondary probes, researchers can determine the spatial location and expression levels of individual RNA transcripts within tissues based on transcript-specific fluorescent signatures and their intensity.
Figure was taken from Moffitt et al. 2022.
These are representative widely used platforms for sequencing-based spatial transcriptomics:
The differences between imaging-based platforms are mainly in probe design, probe hybridization, signal amplification and gene decoding (see figure below).
Figure was taken from Ju Lim et al. 2025.
For this type of experiments accurate cell boundary segmentation is crucial for identifying cell types and studying cell-cell interactions.
Generally, sequencing-based platforms tend to provide higher gene coverage (e.g. full-transcriptome), while imaging-based platforms tend to provide higher spatial resolution (e.g. single-cell or subcellular resolution).
Introduction to SpatialExperiment
class
The file structure and formats of data generated by spatial transcriptomics platforms vary between different providers. Nevertheless, all outputs are similar in their essence. Sequencing-based data include spatial locations of array spots and a count matrix. Imaging-based data include transcript locations from spot-calling, polygon boundaries from segmentation, and a count matrix from allocating transcripts to cells. The R/Bioconductor
S4
class called SpatialExperiment
was created for storing data from various spatial -omics experiments.
Structure of SpatialExperiment
class
The SpatialExperiment
class extends the SingleCellExperiment
class for single-cell data to support storage and retrieval of additional information from spot-based and molecule-based platforms, including spatial coordinates, images and image metadata.
SpatialExperiment
class has the following structure:
Figure was taken from Righelli et al. 2022.
As shown in the above figure, a SpatialExperiment
object consists of:
assays
containing expression counts,rowData
containing information on features, i.e. genes,colData
containing information on spots or cells, including nonspatial and spatial metadata,spatialCoords
containing spatial coordinates,imgData
containing image data.
For sequencing-based data, assays
contains a table named counts
containing the gene counts. For imaging-based data, assays
may contain two tables named counts
and molecules
containing total gene counts per cell as well as molecule-level information such as spatial coordinates per molecule.
An example of a 10x Genomics Visium data set
The STexampleData
Bioconductor
package contains a collection of spatial transcriptomics datasets, which have been formatted into the SpatialExperiment
Bioconductor
class, for use in examples, demonstrations, and tutorials. The datasets stem from several technological platforms and have been shaped from various publicly available sources. Some of the datasets include images and/or reference annotation labels.
For the demonstration of the general class structure, we will study an example SpatialExperiment
dataset generated by 10x Genomics Visium technology provided by the STexampleData
package. It was produced for a healthy human brain sample from the dorsolateral prefrontal cortex (DLPFC) region (see here). A tissue block of DLPFC was acquired in the anatomical plane perpendicular to the pial surface and extended to the gray–white matter junction. The block spanned the six cortical layers and white matter (see figure below).
Figure was taken from the original manuscript.
Let us load the data set.
<- Visium_humanDLPFC() spe
see ?STexampleData and browseVignettes('STexampleData') for documentation
loading from cache
spe
class: SpatialExperiment
dim: 33538 4992
metadata(0):
assays(1): counts
rownames(33538): ENSG00000243485 ENSG00000237613 ...
ENSG00000277475 ENSG00000268674
rowData names(3): gene_id gene_name feature_type
colnames(4992): AAACAACGAATAGTTC-1 AAACAAGTATCTCCCA-1 ...
TTGTTTGTATTACACG-1 TTGTTTGTGTAAATTC-1
colData names(8): barcode_id sample_id ... reference
cell_count
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
imgData names(4): sample_id image_id data scaleFactor
Let us explore counts
of the spe
object and its dimensions corresponding to the number of measured genes and spots, respectively:
dim(spe)
[1] 33538 4992
counts(spe)[1:10,1:10]
10 x 10 sparse Matrix of class "dgTMatrix"
[[ suppressing 10 column names 'AAACAACGAATAGTTC-1', 'AAACAAGTATCTCCCA-1', 'AAACAATCTACTAGCA-1' ... ]]
ENSG00000243485 . . . . . . . . . .
ENSG00000237613 . . . . . . . . . .
ENSG00000186092 . . . . . . . . . .
ENSG00000238009 . . . . . . . . . .
ENSG00000239945 . . . . . . . . . .
ENSG00000239906 . . . . . . . . . .
ENSG00000241599 . . . . . . . . . .
ENSG00000236601 . . . . . . . . . .
ENSG00000284733 . . . . . . . . . .
ENSG00000235146 . . . . . . . . . .
Let us explore rowData
slot:
head(rowData(spe))
DataFrame with 6 rows and 3 columns
gene_id gene_name feature_type
<character> <character> <character>
ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression
ENSG00000237613 ENSG00000237613 FAM138A Gene Expression
ENSG00000186092 ENSG00000186092 OR4F5 Gene Expression
ENSG00000238009 ENSG00000238009 AL627309.1 Gene Expression
ENSG00000239945 ENSG00000239945 AL627309.3 Gene Expression
ENSG00000239906 ENSG00000239906 AL627309.2 Gene Expression
Let us have a look at colData
slot:
head(colData(spe))
DataFrame with 6 rows and 8 columns
barcode_id sample_id in_tissue
<character> <character> <integer>
AAACAACGAATAGTTC-1 AAACAACGAATAGTTC-1 sample_151673 0
AAACAAGTATCTCCCA-1 AAACAAGTATCTCCCA-1 sample_151673 1
AAACAATCTACTAGCA-1 AAACAATCTACTAGCA-1 sample_151673 1
AAACACCAATAACTGC-1 AAACACCAATAACTGC-1 sample_151673 1
AAACAGAGCGACTCCT-1 AAACAGAGCGACTCCT-1 sample_151673 1
AAACAGCTTTCAGAAG-1 AAACAGCTTTCAGAAG-1 sample_151673 1
array_row array_col ground_truth reference
<integer> <integer> <character> <character>
AAACAACGAATAGTTC-1 0 16 NA NA
AAACAAGTATCTCCCA-1 50 102 Layer3 Layer3
AAACAATCTACTAGCA-1 3 43 Layer1 Layer1
AAACACCAATAACTGC-1 59 19 WM WM
AAACAGAGCGACTCCT-1 14 94 Layer3 Layer3
AAACAGCTTTCAGAAG-1 43 9 Layer5 Layer5
cell_count
<integer>
AAACAACGAATAGTTC-1 NA
AAACAAGTATCTCCCA-1 6
AAACAATCTACTAGCA-1 16
AAACACCAATAACTGC-1 5
AAACAGAGCGACTCCT-1 2
AAACAGCTTTCAGAAG-1 4
In addition, the SpatialExperiment
object stores spatial coordinates of spots on the image:
head(spatialCoords(spe))
pxl_col_in_fullres pxl_row_in_fullres
AAACAACGAATAGTTC-1 3913 2435
AAACAAGTATCTCCCA-1 9791 8468
AAACAATCTACTAGCA-1 5769 2807
AAACACCAATAACTGC-1 4068 9505
AAACAGAGCGACTCCT-1 9271 4151
AAACAGCTTTCAGAAG-1 3393 7583
Run the following function plotSpots
from ggspavis
package to plot spots using their coordinates:
plotSpots(spe, point_size = 1.3)
The imgData()
accessor can be used to retrieve the image data stored within the object:
imgData(spe)
DataFrame with 2 rows and 4 columns
sample_id image_id data scaleFactor
<character> <character> <list> <numeric>
1 sample_151673 lowres #### 0.0450045
2 sample_151673 hires #### 0.1500150
Here, each row corresponds to one image for a given sample with a given unique image identifier (e.g., its resolution). For each image, columns specify:
- which
sample_id
the image belongs to; - a unique
image_id
in order to accommodate multiple images for a given sample, e.g., of different resolutions; - the image’s
data
, which is aSpatialImage
object; - the
scaleFactor
is used to re-scale spatial coordinates according to the resolution of the image.
Structure of SpatialImage
class
Let us now explore the object data
within the imgData()
accessor. It contains tissue images as a list of SpatialImages
. Each image may be of one of the following sub-classes:
LoadedSpatialImage
- fully realized into memory as a raster object;@image
contains a raster object: a matrix of RGB colors for each pixel in the image.StoredSpatialImage
- an image that is stored in a local file (e.g., as.png
,.jpg
or.tif
), and loaded into memory only on request;@path
specifies a local file from which to retrieve the image.RemoteSpatialImage
- an image that is remotely hosted under some URL and retrieved only on request;@url
specifies where to retrieve the image from.
A SpatialImage
can be accessed using getImg()
or retrieved directly from the imgData()
:
<- getImg(spe)
spi spi
600 x 600 (width x height) LoadedSpatialImage
Data available in an object of class SpatialImage
may be accessed via the imgRaster()
and imgSource()
accessors. Let us visualize the histology staining of the tissue block:
plot(imgRaster(spe))
Use the function plotVisium
from ggspavis
package to visualize the expression of certain genes SNAP25, MOBP and PCP4:
<- rownames(spe)[which(rowData(spe)$gene_name=="SNAP25")]
gene_id plotVisium(spe, annotate = gene_id, highlight = "in_tissue", point_size = 1.3)
<- rownames(spe)[which(rowData(spe)$gene_name=="MOBP")]
gene_id plotVisium(spe, annotate = gene_id, highlight = "in_tissue", point_size = 1.3)
<- rownames(spe)[which(rowData(spe)$gene_name=="PCP4")]
gene_id plotVisium(spe, annotate = gene_id, highlight = "in_tissue", point_size = 1.3)
Quality control of 10x Visium spots
Next, we will learn how to perform the quality control at the spot level.
Low-quality spots can be identified by checking the following characteristics:
- library size accounting for the total unique molecular identifier (UMI) counts per spot,
- alternatively to the library size one can check the number of expressed features accounting for the number of genes with non-zero UMI counts per spot,
- proportion of reads mapping to mitochondrial genes (a high proportion indicates cell damage),
- number of cells per spot (unusually high values can indicate problems with the experiment for the spot).
First, we subset the original SpatialExperiment
object by keeping only spots over tissue:
table(colData(spe)$in_tissue)
0 1
1353 3639
<- spe[, colData(spe)$in_tissue == 1] spe
Now we will treat spots as equivalent to cells and calculate the QC metrics from the scater
package, which were also used for assessing the quality of scRNA-seq data.
Identify mitochondrial genes in the data set:
<- grepl("^MT-", rowData(spe)$gene_name)
is_mito rowData(spe)$gene_name[is_mito]
[1] "MT-ND1" "MT-ND2" "MT-CO1" "MT-CO2" "MT-ATP8" "MT-ATP6"
[7] "MT-CO3" "MT-ND3" "MT-ND4L" "MT-ND4" "MT-ND5" "MT-ND6"
[13] "MT-CYB"
Calculate per-spot QC metrics and store in colData
:
<- addPerCellQC(spe, subsets = list(mito = is_mito))
spe head(colData(spe))
DataFrame with 6 rows and 14 columns
barcode_id sample_id in_tissue
<character> <character> <integer>
AAACAAGTATCTCCCA-1 AAACAAGTATCTCCCA-1 sample_151673 1
AAACAATCTACTAGCA-1 AAACAATCTACTAGCA-1 sample_151673 1
AAACACCAATAACTGC-1 AAACACCAATAACTGC-1 sample_151673 1
AAACAGAGCGACTCCT-1 AAACAGAGCGACTCCT-1 sample_151673 1
AAACAGCTTTCAGAAG-1 AAACAGCTTTCAGAAG-1 sample_151673 1
AAACAGGGTCTATATT-1 AAACAGGGTCTATATT-1 sample_151673 1
array_row array_col ground_truth reference
<integer> <integer> <character> <character>
AAACAAGTATCTCCCA-1 50 102 Layer3 Layer3
AAACAATCTACTAGCA-1 3 43 Layer1 Layer1
AAACACCAATAACTGC-1 59 19 WM WM
AAACAGAGCGACTCCT-1 14 94 Layer3 Layer3
AAACAGCTTTCAGAAG-1 43 9 Layer5 Layer5
AAACAGGGTCTATATT-1 47 13 Layer6 Layer6
cell_count sum detected subsets_mito_sum
<integer> <numeric> <integer> <numeric>
AAACAAGTATCTCCCA-1 6 8458 3586 1407
AAACAATCTACTAGCA-1 16 1667 1150 204
AAACACCAATAACTGC-1 5 3769 1960 430
AAACAGAGCGACTCCT-1 2 5433 2424 1316
AAACAGCTTTCAGAAG-1 4 4278 2264 651
AAACAGGGTCTATATT-1 6 4004 2178 621
subsets_mito_detected subsets_mito_percent
<integer> <numeric>
AAACAAGTATCTCCCA-1 13 16.6351
AAACAATCTACTAGCA-1 11 12.2376
AAACACCAATAACTGC-1 13 11.4089
AAACAGAGCGACTCCT-1 13 24.2223
AAACAGCTTTCAGAAG-1 12 15.2174
AAACAGGGTCTATATT-1 13 15.5095
total
<numeric>
AAACAAGTATCTCCCA-1 8458
AAACAATCTACTAGCA-1 1667
AAACACCAATAACTGC-1 3769
AAACAGAGCGACTCCT-1 5433
AAACAGCTTTCAGAAG-1 4278
AAACAGGGTCTATATT-1 4004
QC based on the library size
Plot spots and color them by the library size:
plotVisium(spe, annotate = "sum", highlight = "in_tissue", point_size = 1.3)
Create a histogram of library sizes:
hist(colData(spe)$sum, breaks = 20)
This data set additionally provides the number of cells per spot. We will make use of this information for performing QC and plot it against the library size. For example, it might help to keep spots with low number of cells despite lower sum of counts. The blue curve corresponds to the smoothed trend line using loess
method. The red horizontal line (argument threshold
) shows our first guess of a possible filtering threshold for library size.
plotSpotQC(spe, plot_type = "scatter",
x_metric = "cell_count", y_metric = "sum",
y_threshold = 600)
`geom_smooth()` using formula = 'y ~ x'
`stat_xsidebin()` using `bins = 30`. Pick better value with
`binwidth`.
`stat_ysidebin()` using `bins = 30`. Pick better value with
`binwidth`.
Store results of this QC step in the colData
object:
<- colData(spe)$sum < 600
qc_lib_size table(qc_lib_size)
qc_lib_size
FALSE TRUE
3631 8
colData(spe)$qc_lib_size <- qc_lib_size
Check spatial pattern of of low-quality spots:
plotSpotQC(spe, plot_type = "spot",annotate = "qc_lib_size", point_size = 1.3)
The plot shows that setting the selected filtering threshold for the library size does not appear to discard any obvious biologically consistent group of spots. Check spatial pattern of filtered spots if the threshold is too high:
<- colData(spe)$sum < 2000
qc_lib_size_2000 colData(spe)$qc_lib_size_2000 <- qc_lib_size_2000
plotSpotQC(spe, plot_type = "spot",
annotate = "qc_lib_size_2000", point_size = 1.3)
When we check the manual annotation of cortical layers on the slide, it becomes apparent that too high threshold tends to discard cells belonging to a specific layer:
plotSpots(spe, annotate = "ground_truth",
pal = "libd_layer_colors", point_size = 1.3)
QC based on the expression of mitochondrial genes
Now we will check the expression of mitochondrial genes. The high expression of these genes indicates presence of damaged cells in the spot.
Create a histogram of the percentage of UMI counts assigned to mitochondrial genes:
hist(colData(spe)$subsets_mito_percent, breaks = 20)
Plot the proportion of UMI counts assigned to mitochondrial genes versus the number of cells per spot. As the threshold, the value 28 was selected:
plotSpotQC(spe, plot_type = "scatter",
x_metric = "cell_count", y_metric = "subsets_mito_percent",
y_threshold = 28)
`geom_smooth()` using formula = 'y ~ x'
`stat_xsidebin()` using `bins = 30`. Pick better value with
`binwidth`.
`stat_ysidebin()` using `bins = 30`. Pick better value with
`binwidth`.
Store results of this QC step in the colData
object:
<- colData(spe)$subsets_mito_percent > 28
qc_mito table(qc_mito)
qc_mito
FALSE TRUE
3622 17
colData(spe)$qc_mito <- qc_mito
Check spatial pattern of low-quality spots:
plotSpotQC(spe, plot_type = "spot",
annotate = "qc_mito", point_size = 1.3)
QC based on the number of cells per spot
The final QC metric will be the number of cells per spot. As it was mentioned above, this measure is available in this data set and can also be used for assessing the quality control of spots.
Plot the histogram of cell counts:
hist(colData(spe)$cell_count, breaks = 20)
Check the distribution of cells per spot:
table(colData(spe)$cell_count)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
84 211 483 623 617 541 421 287 140 92 50 25 18 10 9 3 8
17 18 19 20 21 22 23 25 26 27
2 1 2 3 2 1 1 2 2 1
Plot the number of expressed genes versus the number of cells per spot. Spots with high cell counts have low numbers of expressed genes. This indicates that the experiment has failed for these spots and they should be removed. We selected the value 10 as the threshold:
plotSpotQC(spe, plot_type = "scatter",
x_metric = "cell_count", y_metric = "detected",
x_threshold = 10)
`geom_smooth()` using formula = 'y ~ x'
`stat_xsidebin()` using `bins = 30`. Pick better value with
`binwidth`.
`stat_ysidebin()` using `bins = 30`. Pick better value with
`binwidth`.
Store results of this QC step in the colData
object:
<- colData(spe)$cell_count > 10
qc_cell_count table(qc_cell_count)
qc_cell_count
FALSE TRUE
3549 90
colData(spe)$qc_cell_count <- qc_cell_count
Check spatial pattern of low-quality spots:
plotSpotQC(spe, plot_type = "spot",
annotate = "qc_cell_count", point_size = 1.3)
Shall we remove spots with zero cells? Although the spot is not assigned to any particular cell, it may still contain biologically meaningful information. In this dataset, the region between cell bodies consists of neuropil (dense networks of axons and dendrites). There might be an interest in exploring the information contained in these neuropil spots.
Combination of QC metrics
Next, we will collect low-quality spots based on all calculated metrics and remove them from our object.
Label spots to be discarded:
<- qc_lib_size | qc_mito | qc_cell_count
discard table(discard)
discard
FALSE TRUE
3524 115
Store the result in the colData
object and check the spatial pattern of the combined set of discarded spots:
colData(spe)$discard <- discard
plotSpotQC(spe, plot_type = "spot",
annotate = "discard", point_size = 1.3)
Remove the combined set of low-quality spots:
<- spe[, !colData(spe)$discard]
spe dim(spe)
[1] 33538 3524
Principal component analysis
For further data processing, we will use methods from scater
and scran
packages, which were originally developed for scRNA-seq data, with the assumption that they can be applied to spot-based spatial transcriptomics data by treating spots as cells.
Normalization
We will calculate log-transformed normalized counts (“logcounts”) using library size factors.
<- computeLibraryFactors(spe)
spe summary(sizeFactors(spe))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1321 0.6312 0.9000 1.0000 1.2849 3.7582
<- logNormCounts(spe)
spe assayNames(spe)
[1] "counts" "logcounts"
Defining highly variable genes
Further, we fit the mean-variance relationship and select top highly variable genes:
<- modelGeneVar(spe)
dec <- getTopHVGs(dec, prop=0.1)
sel length(sel)
[1] 1529
Calculating principal components
Calculate principal components using the expression of highly variable genes:
<- runPCA(spe, subset_row=sel) spe
Create a PCA plot and color spots using manually annotated reference. It is noticeable that spots from the same layer tend to cluster together.
plotReducedDim(spe, "PCA", colour_by="ground_truth")
Visualize the coordinates of first four principal components on the slide:
<- reducedDim(spe, "PCA")
pcs <- pcs[, seq_len(4)]
pcs lapply(colnames(pcs), \(.) {
<- pcs[, .]
spe[[.]] plotSpots(spe, annotate=.)
|>
}) wrap_plots(nrow=1) &
geom_point(shape=16, stroke=0, size=0.2) &
scale_color_gradientn(colors=pals::jet()) &
coord_equal() & theme_void() & theme(
legend.position="bottom",
plot.title=element_text(hjust=0.5),
legend.key.width=unit(0.8, "lines"),
legend.key.height=unit(0.4, "lines"))
Look how good principal components recapitulate the manual annotation of cortical layers:
plotSpots(spe, annotate = "ground_truth",
pal = "libd_layer_colors", point_size = 1.3)