Introduction to spatial transcriptomics

Published

April 23, 2025

Install necessary libraries

# If any package is missing, install it using the following commands and then load it.

# if (!require("BiocManager", quietly = TRUE))
#   install.packages("BiocManager")
# 
# BiocManager::install("SingleCellExperiment")
# library(SingleCellExperiment)

# Loading libraries
suppressPackageStartupMessages({
library(SpatialExperiment)
library(ggspavis)
library(STexampleData)
  
library(scater)
library(scran)
  
library(pals)  
})

Goal

Introduce spatial transcriptomics technologies and basic analysis of spatial transcriptomics data sets:

  • Present technologies for generating spatial transcriptomics data.
  • Introduce SpatialExperiment and SpatialImage classes.
  • Demonstrate an example of a spatial transcriptomics data set.
  • Perform quality control of the example data set.
  • Perform principal component analysis of the example data set.

Motivation

Single-cell transcriptomics methods presented in two previous lectures allow to study the gene expression in single cells instead of averaged expression of genes in a sample. However, these methods contain a step, where the tissue is dissociated, and thus spatial information of cells is lost. Recently developed spatial omics technologies overcome this limitation and enable molecular measurements with spatial resolution.

A widely used allegory to describe different transcriptomic methods is that the data in bulk RNA sequencing is like a fruit smoothie and in single-cell RNA sequencing like a fruit salad. Spatial data can be represented like a fruit tart, where each piece of fruit has its own distinct place with regard to other pieces of fruit.

Figure: The output of transcriptomics methods pictured as different forms of fruit. In this analogy, each piece of fruit represents a single cell. Figure was taken from pixabay.

Spatial technologies allow to address numerous research questions that can not be answered with bulk and/or single-cell technologies. For example:

  • how cells interact and communicate within tissues,
  • how cellular context influences gene expression,
  • how spatial organization contributes to tissue development and disease.

Spatial technologies including sequencing platforms and recipes for the analysis of generated spatial transcriptmics data evolve extremely fast. This session will provide a short introduction into this field based on what is available right now. For more information refer to “Orchestrating Spatial Transcriptomics Analysis with Bioconductor” (shortly OSTA) book, which introduces the majority of commercially available sequencing platforms to generate spatial data and Bioconductor packages for the analysis of different types of spatial data.

Platforms for generating spatial transcriptomics data

Platforms for generating spatial transcriptomics data can be broadly grouped into “sequencing-based” and “imaging-based” technologies. While both approaches reveal spatial locations of the gene expression, their underlying technologies differ drastically in capturing spatial information and determining abundance of specific mRNA molecules within tissue.

Sequencing-based platforms

The majority of sequencing-based technologies integrate spatially barcoded arrays with next-generation sequencing to determine the locations and expression levels of transcripts within tissues. In most cases, this is achieved by capturing mRNA transcripts within the tissue using a polyT tail built into unique, spatially barcoded probes on the array. During cDNA synthesis, these spatial barcodes are incorporated into each cDNA molecule. By subsequent library construction and sequencing, researchers can determine the expression levels of transcripts and map them back to their precise locations within the tissue based on their spatial barcodes.

Figure was taken from Rao et al. 2021.

These are representative widely used platforms for sequencing-based spatial transcriptomics:

The fundamental difference among these technologies is the feature size of the array, which largely determines spatial resolution (see figure below).

Figure was taken from Ju Lim et al. 2025.

Imaging-based platforms

Imaging-based technologies employ single-molecule fluorescence in situ hybridization (smFISH) Femino et al. 1998 as their backbone technology. These technologies enable the simultaneous detection of up to 6000 RNA transcripts in a single experiment through cyclic, highly multiplexed smFISH. This is achieved by using dozens of primary probes that hybridize to specific RNA transcripts, followed by secondary probes labeled with different fluorophores hybridizing to the primary probes. By sequentially hybridizing and imaging fluorescence from these secondary probes, researchers can determine the spatial location and expression levels of individual RNA transcripts within tissues based on transcript-specific fluorescent signatures and their intensity.

Figure was taken from Moffitt et al. 2022.

These are representative widely used platforms for sequencing-based spatial transcriptomics:

The differences between imaging-based platforms are mainly in probe design, probe hybridization, signal amplification and gene decoding (see figure below).

Figure was taken from Ju Lim et al. 2025.

For this type of experiments accurate cell boundary segmentation is crucial for identifying cell types and studying cell-cell interactions.

Generally, sequencing-based platforms tend to provide higher gene coverage (e.g. full-transcriptome), while imaging-based platforms tend to provide higher spatial resolution (e.g. single-cell or subcellular resolution).

Introduction to SpatialExperiment class

The file structure and formats of data generated by spatial transcriptomics platforms vary between different providers. Nevertheless, all outputs are similar in their essence. Sequencing-based data include spatial locations of array spots and a count matrix. Imaging-based data include transcript locations from spot-calling, polygon boundaries from segmentation, and a count matrix from allocating transcripts to cells. The R/Bioconductor S4 class called SpatialExperiment was created for storing data from various spatial -omics experiments.

Structure of SpatialExperiment class

The SpatialExperiment class extends the SingleCellExperiment class for single-cell data to support storage and retrieval of additional information from spot-based and molecule-based platforms, including spatial coordinates, images and image metadata.

SpatialExperiment class has the following structure:

Figure was taken from Righelli et al. 2022.

As shown in the above figure, a SpatialExperiment object consists of:

  • assays containing expression counts,
  • rowData containing information on features, i.e. genes,
  • colData containing information on spots or cells, including nonspatial and spatial metadata,
  • spatialCoords containing spatial coordinates,
  • imgData containing image data.

For sequencing-based data, assays contains a table named counts containing the gene counts. For imaging-based data, assays may contain two tables named counts and molecules containing total gene counts per cell as well as molecule-level information such as spatial coordinates per molecule.

An example of a 10x Genomics Visium data set

The STexampleData Bioconductor package contains a collection of spatial transcriptomics datasets, which have been formatted into the SpatialExperiment Bioconductor class, for use in examples, demonstrations, and tutorials. The datasets stem from several technological platforms and have been shaped from various publicly available sources. Some of the datasets include images and/or reference annotation labels.

For the demonstration of the general class structure, we will study an example SpatialExperiment dataset generated by 10x Genomics Visium technology provided by the STexampleData package. It was produced for a healthy human brain sample from the dorsolateral prefrontal cortex (DLPFC) region (see here). A tissue block of DLPFC was acquired in the anatomical plane perpendicular to the pial surface and extended to the gray–white matter junction. The block spanned the six cortical layers and white matter (see figure below).

Figure was taken from the original manuscript.

Let us load the data set.

spe <- Visium_humanDLPFC()
see ?STexampleData and browseVignettes('STexampleData') for documentation
loading from cache
spe
class: SpatialExperiment 
dim: 33538 4992 
metadata(0):
assays(1): counts
rownames(33538): ENSG00000243485 ENSG00000237613 ...
  ENSG00000277475 ENSG00000268674
rowData names(3): gene_id gene_name feature_type
colnames(4992): AAACAACGAATAGTTC-1 AAACAAGTATCTCCCA-1 ...
  TTGTTTGTATTACACG-1 TTGTTTGTGTAAATTC-1
colData names(8): barcode_id sample_id ... reference
  cell_count
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
imgData names(4): sample_id image_id data scaleFactor

Let us explore counts of the spe object and its dimensions corresponding to the number of measured genes and spots, respectively:

dim(spe)
[1] 33538  4992
counts(spe)[1:10,1:10]
10 x 10 sparse Matrix of class "dgTMatrix"
  [[ suppressing 10 column names 'AAACAACGAATAGTTC-1', 'AAACAAGTATCTCCCA-1', 'AAACAATCTACTAGCA-1' ... ]]
                                   
ENSG00000243485 . . . . . . . . . .
ENSG00000237613 . . . . . . . . . .
ENSG00000186092 . . . . . . . . . .
ENSG00000238009 . . . . . . . . . .
ENSG00000239945 . . . . . . . . . .
ENSG00000239906 . . . . . . . . . .
ENSG00000241599 . . . . . . . . . .
ENSG00000236601 . . . . . . . . . .
ENSG00000284733 . . . . . . . . . .
ENSG00000235146 . . . . . . . . . .

Let us explore rowData slot:

head(rowData(spe))
DataFrame with 6 rows and 3 columns
                        gene_id   gene_name    feature_type
                    <character> <character>     <character>
ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression
ENSG00000237613 ENSG00000237613     FAM138A Gene Expression
ENSG00000186092 ENSG00000186092       OR4F5 Gene Expression
ENSG00000238009 ENSG00000238009  AL627309.1 Gene Expression
ENSG00000239945 ENSG00000239945  AL627309.3 Gene Expression
ENSG00000239906 ENSG00000239906  AL627309.2 Gene Expression

Let us have a look at colData slot:

head(colData(spe))
DataFrame with 6 rows and 8 columns
                           barcode_id     sample_id in_tissue
                          <character>   <character> <integer>
AAACAACGAATAGTTC-1 AAACAACGAATAGTTC-1 sample_151673         0
AAACAAGTATCTCCCA-1 AAACAAGTATCTCCCA-1 sample_151673         1
AAACAATCTACTAGCA-1 AAACAATCTACTAGCA-1 sample_151673         1
AAACACCAATAACTGC-1 AAACACCAATAACTGC-1 sample_151673         1
AAACAGAGCGACTCCT-1 AAACAGAGCGACTCCT-1 sample_151673         1
AAACAGCTTTCAGAAG-1 AAACAGCTTTCAGAAG-1 sample_151673         1
                   array_row array_col ground_truth   reference
                   <integer> <integer>  <character> <character>
AAACAACGAATAGTTC-1         0        16           NA          NA
AAACAAGTATCTCCCA-1        50       102       Layer3      Layer3
AAACAATCTACTAGCA-1         3        43       Layer1      Layer1
AAACACCAATAACTGC-1        59        19           WM          WM
AAACAGAGCGACTCCT-1        14        94       Layer3      Layer3
AAACAGCTTTCAGAAG-1        43         9       Layer5      Layer5
                   cell_count
                    <integer>
AAACAACGAATAGTTC-1         NA
AAACAAGTATCTCCCA-1          6
AAACAATCTACTAGCA-1         16
AAACACCAATAACTGC-1          5
AAACAGAGCGACTCCT-1          2
AAACAGCTTTCAGAAG-1          4

In addition, the SpatialExperiment object stores spatial coordinates of spots on the image:

head(spatialCoords(spe))
                   pxl_col_in_fullres pxl_row_in_fullres
AAACAACGAATAGTTC-1               3913               2435
AAACAAGTATCTCCCA-1               9791               8468
AAACAATCTACTAGCA-1               5769               2807
AAACACCAATAACTGC-1               4068               9505
AAACAGAGCGACTCCT-1               9271               4151
AAACAGCTTTCAGAAG-1               3393               7583

Run the following function plotSpots from ggspavis package to plot spots using their coordinates:

plotSpots(spe, point_size = 1.3)

The imgData() accessor can be used to retrieve the image data stored within the object:

imgData(spe)
DataFrame with 2 rows and 4 columns
      sample_id    image_id   data scaleFactor
    <character> <character> <list>   <numeric>
1 sample_151673      lowres   ####   0.0450045
2 sample_151673       hires   ####   0.1500150

Here, each row corresponds to one image for a given sample with a given unique image identifier (e.g., its resolution). For each image, columns specify:

  • which sample_id the image belongs to;
  • a unique image_id in order to accommodate multiple images for a given sample, e.g., of different resolutions;
  • the image’s data, which is a SpatialImage object;
  • the scaleFactor is used to re-scale spatial coordinates according to the resolution of the image.

Structure of SpatialImage class

Let us now explore the object data within the imgData() accessor. It contains tissue images as a list of SpatialImages. Each image may be of one of the following sub-classes:

  • LoadedSpatialImage - fully realized into memory as a raster object; @image contains a raster object: a matrix of RGB colors for each pixel in the image.
  • StoredSpatialImage - an image that is stored in a local file (e.g., as .png, .jpg or .tif), and loaded into memory only on request; @path specifies a local file from which to retrieve the image.
  • RemoteSpatialImage - an image that is remotely hosted under some URL and retrieved only on request; @url specifies where to retrieve the image from.

A SpatialImage can be accessed using getImg() or retrieved directly from the imgData():

spi <- getImg(spe)
spi
600 x 600 (width x height) LoadedSpatialImage

Data available in an object of class SpatialImage may be accessed via the imgRaster() and imgSource() accessors. Let us visualize the histology staining of the tissue block:

plot(imgRaster(spe))

Use the function plotVisium from ggspavis package to visualize the expression of certain genes SNAP25, MOBP and PCP4:

gene_id <- rownames(spe)[which(rowData(spe)$gene_name=="SNAP25")]
plotVisium(spe, annotate = gene_id, highlight = "in_tissue", point_size = 1.3)

gene_id <- rownames(spe)[which(rowData(spe)$gene_name=="MOBP")]
plotVisium(spe, annotate = gene_id, highlight = "in_tissue", point_size = 1.3)

gene_id <- rownames(spe)[which(rowData(spe)$gene_name=="PCP4")]
plotVisium(spe, annotate = gene_id, highlight = "in_tissue", point_size = 1.3)

Quality control of 10x Visium spots

Next, we will learn how to perform the quality control at the spot level.

Low-quality spots can be identified by checking the following characteristics:

  • library size accounting for the total unique molecular identifier (UMI) counts per spot,
  • alternatively to the library size one can check the number of expressed features accounting for the number of genes with non-zero UMI counts per spot,
  • proportion of reads mapping to mitochondrial genes (a high proportion indicates cell damage),
  • number of cells per spot (unusually high values can indicate problems with the experiment for the spot).

First, we subset the original SpatialExperiment object by keeping only spots over tissue:

table(colData(spe)$in_tissue)

   0    1 
1353 3639 
spe <- spe[, colData(spe)$in_tissue == 1]

Now we will treat spots as equivalent to cells and calculate the QC metrics from the scater package, which were also used for assessing the quality of scRNA-seq data.

Identify mitochondrial genes in the data set:

is_mito <- grepl("^MT-", rowData(spe)$gene_name)
rowData(spe)$gene_name[is_mito]
 [1] "MT-ND1"  "MT-ND2"  "MT-CO1"  "MT-CO2"  "MT-ATP8" "MT-ATP6"
 [7] "MT-CO3"  "MT-ND3"  "MT-ND4L" "MT-ND4"  "MT-ND5"  "MT-ND6" 
[13] "MT-CYB" 

Calculate per-spot QC metrics and store in colData:

spe <- addPerCellQC(spe, subsets = list(mito = is_mito))
head(colData(spe))
DataFrame with 6 rows and 14 columns
                           barcode_id     sample_id in_tissue
                          <character>   <character> <integer>
AAACAAGTATCTCCCA-1 AAACAAGTATCTCCCA-1 sample_151673         1
AAACAATCTACTAGCA-1 AAACAATCTACTAGCA-1 sample_151673         1
AAACACCAATAACTGC-1 AAACACCAATAACTGC-1 sample_151673         1
AAACAGAGCGACTCCT-1 AAACAGAGCGACTCCT-1 sample_151673         1
AAACAGCTTTCAGAAG-1 AAACAGCTTTCAGAAG-1 sample_151673         1
AAACAGGGTCTATATT-1 AAACAGGGTCTATATT-1 sample_151673         1
                   array_row array_col ground_truth   reference
                   <integer> <integer>  <character> <character>
AAACAAGTATCTCCCA-1        50       102       Layer3      Layer3
AAACAATCTACTAGCA-1         3        43       Layer1      Layer1
AAACACCAATAACTGC-1        59        19           WM          WM
AAACAGAGCGACTCCT-1        14        94       Layer3      Layer3
AAACAGCTTTCAGAAG-1        43         9       Layer5      Layer5
AAACAGGGTCTATATT-1        47        13       Layer6      Layer6
                   cell_count       sum  detected subsets_mito_sum
                    <integer> <numeric> <integer>        <numeric>
AAACAAGTATCTCCCA-1          6      8458      3586             1407
AAACAATCTACTAGCA-1         16      1667      1150              204
AAACACCAATAACTGC-1          5      3769      1960              430
AAACAGAGCGACTCCT-1          2      5433      2424             1316
AAACAGCTTTCAGAAG-1          4      4278      2264              651
AAACAGGGTCTATATT-1          6      4004      2178              621
                   subsets_mito_detected subsets_mito_percent
                               <integer>            <numeric>
AAACAAGTATCTCCCA-1                    13              16.6351
AAACAATCTACTAGCA-1                    11              12.2376
AAACACCAATAACTGC-1                    13              11.4089
AAACAGAGCGACTCCT-1                    13              24.2223
AAACAGCTTTCAGAAG-1                    12              15.2174
AAACAGGGTCTATATT-1                    13              15.5095
                       total
                   <numeric>
AAACAAGTATCTCCCA-1      8458
AAACAATCTACTAGCA-1      1667
AAACACCAATAACTGC-1      3769
AAACAGAGCGACTCCT-1      5433
AAACAGCTTTCAGAAG-1      4278
AAACAGGGTCTATATT-1      4004

QC based on the library size

Plot spots and color them by the library size:

plotVisium(spe, annotate = "sum", highlight = "in_tissue", point_size = 1.3)

Create a histogram of library sizes:

hist(colData(spe)$sum, breaks = 20)

This data set additionally provides the number of cells per spot. We will make use of this information for performing QC and plot it against the library size. For example, it might help to keep spots with low number of cells despite lower sum of counts. The blue curve corresponds to the smoothed trend line using loess method. The red horizontal line (argument threshold) shows our first guess of a possible filtering threshold for library size.

plotSpotQC(spe, plot_type = "scatter",
          x_metric = "cell_count", y_metric = "sum",
          y_threshold = 600)
`geom_smooth()` using formula = 'y ~ x'
`stat_xsidebin()` using `bins = 30`. Pick better value with
`binwidth`.
`stat_ysidebin()` using `bins = 30`. Pick better value with
`binwidth`.

Store results of this QC step in the colData object:

qc_lib_size <- colData(spe)$sum < 600
table(qc_lib_size)
qc_lib_size
FALSE  TRUE 
 3631     8 
colData(spe)$qc_lib_size <- qc_lib_size

Check spatial pattern of of low-quality spots:

plotSpotQC(spe, plot_type = "spot",annotate = "qc_lib_size", point_size = 1.3)

The plot shows that setting the selected filtering threshold for the library size does not appear to discard any obvious biologically consistent group of spots. Check spatial pattern of filtered spots if the threshold is too high:

qc_lib_size_2000 <- colData(spe)$sum < 2000
colData(spe)$qc_lib_size_2000 <- qc_lib_size_2000
plotSpotQC(spe, plot_type = "spot",
          annotate = "qc_lib_size_2000", point_size = 1.3)

When we check the manual annotation of cortical layers on the slide, it becomes apparent that too high threshold tends to discard cells belonging to a specific layer:

plotSpots(spe, annotate = "ground_truth",
          pal = "libd_layer_colors", point_size = 1.3)

QC based on the expression of mitochondrial genes

Now we will check the expression of mitochondrial genes. The high expression of these genes indicates presence of damaged cells in the spot.

Create a histogram of the percentage of UMI counts assigned to mitochondrial genes:

hist(colData(spe)$subsets_mito_percent, breaks = 20)

Plot the proportion of UMI counts assigned to mitochondrial genes versus the number of cells per spot. As the threshold, the value 28 was selected:

plotSpotQC(spe, plot_type = "scatter", 
           x_metric = "cell_count", y_metric = "subsets_mito_percent", 
           y_threshold = 28)
`geom_smooth()` using formula = 'y ~ x'
`stat_xsidebin()` using `bins = 30`. Pick better value with
`binwidth`.
`stat_ysidebin()` using `bins = 30`. Pick better value with
`binwidth`.

Store results of this QC step in the colData object:

qc_mito <- colData(spe)$subsets_mito_percent > 28
table(qc_mito)
qc_mito
FALSE  TRUE 
 3622    17 
colData(spe)$qc_mito <- qc_mito

Check spatial pattern of low-quality spots:

plotSpotQC(spe, plot_type = "spot", 
           annotate = "qc_mito", point_size = 1.3)

QC based on the number of cells per spot

The final QC metric will be the number of cells per spot. As it was mentioned above, this measure is available in this data set and can also be used for assessing the quality control of spots.

Plot the histogram of cell counts:

hist(colData(spe)$cell_count, breaks = 20)

Check the distribution of cells per spot:

table(colData(spe)$cell_count)

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16 
 84 211 483 623 617 541 421 287 140  92  50  25  18  10   9   3   8 
 17  18  19  20  21  22  23  25  26  27 
  2   1   2   3   2   1   1   2   2   1 

Plot the number of expressed genes versus the number of cells per spot. Spots with high cell counts have low numbers of expressed genes. This indicates that the experiment has failed for these spots and they should be removed. We selected the value 10 as the threshold:

plotSpotQC(spe, plot_type = "scatter", 
           x_metric = "cell_count", y_metric = "detected", 
           x_threshold = 10)
`geom_smooth()` using formula = 'y ~ x'
`stat_xsidebin()` using `bins = 30`. Pick better value with
`binwidth`.
`stat_ysidebin()` using `bins = 30`. Pick better value with
`binwidth`.

Store results of this QC step in the colData object:

qc_cell_count <- colData(spe)$cell_count > 10
table(qc_cell_count)
qc_cell_count
FALSE  TRUE 
 3549    90 
colData(spe)$qc_cell_count <- qc_cell_count

Check spatial pattern of low-quality spots:

plotSpotQC(spe, plot_type = "spot", 
           annotate = "qc_cell_count", point_size = 1.3)

Shall we remove spots with zero cells? Although the spot is not assigned to any particular cell, it may still contain biologically meaningful information. In this dataset, the region between cell bodies consists of neuropil (dense networks of axons and dendrites). There might be an interest in exploring the information contained in these neuropil spots.

Combination of QC metrics

Next, we will collect low-quality spots based on all calculated metrics and remove them from our object.

Label spots to be discarded:

discard <- qc_lib_size | qc_mito | qc_cell_count
table(discard)
discard
FALSE  TRUE 
 3524   115 

Store the result in the colData object and check the spatial pattern of the combined set of discarded spots:

colData(spe)$discard <- discard
plotSpotQC(spe, plot_type = "spot", 
           annotate = "discard", point_size = 1.3)

Remove the combined set of low-quality spots:

spe <- spe[, !colData(spe)$discard]
dim(spe)
[1] 33538  3524

Principal component analysis

For further data processing, we will use methods from scater and scran packages, which were originally developed for scRNA-seq data, with the assumption that they can be applied to spot-based spatial transcriptomics data by treating spots as cells.

Normalization

We will calculate log-transformed normalized counts (“logcounts”) using library size factors.

spe <- computeLibraryFactors(spe)
summary(sizeFactors(spe))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1321  0.6312  0.9000  1.0000  1.2849  3.7582 
spe <- logNormCounts(spe)
assayNames(spe)
[1] "counts"    "logcounts"

Defining highly variable genes

Further, we fit the mean-variance relationship and select top highly variable genes:

dec <- modelGeneVar(spe)
sel <- getTopHVGs(dec, prop=0.1)
length(sel)
[1] 1529

Calculating principal components

Calculate principal components using the expression of highly variable genes:

spe <- runPCA(spe, subset_row=sel)

Create a PCA plot and color spots using manually annotated reference. It is noticeable that spots from the same layer tend to cluster together.

plotReducedDim(spe, "PCA", colour_by="ground_truth")

Visualize the coordinates of first four principal components on the slide:

pcs <- reducedDim(spe, "PCA")
pcs <- pcs[, seq_len(4)]
lapply(colnames(pcs), \(.) {
    spe[[.]] <- pcs[, .]
    plotSpots(spe, annotate=.)
}) |>
    wrap_plots(nrow=1) &
    geom_point(shape=16, stroke=0, size=0.2) &
    scale_color_gradientn(colors=pals::jet()) &
    coord_equal() & theme_void() & theme(
        legend.position="bottom",
        plot.title=element_text(hjust=0.5),
        legend.key.width=unit(0.8, "lines"),
        legend.key.height=unit(0.4, "lines"))

Look how good principal components recapitulate the manual annotation of cortical layers:

plotSpots(spe, annotate = "ground_truth",
          pal = "libd_layer_colors", point_size = 1.3)