Introduction to Bioconductor and standard classes
What is Bioconductor?
- Bioconductor is an R-based ecosystem of packages and workflows.
- It support and disseminate free open source software that facilitates rigorous and reproducible analysis of data from current and emerging biological assays.
- The project was started in 2001 (back then, mostly focused on microarray data)
- Bioconductor updates its release version twice a year. For example, the latest release added 59 new software packages, 6 new experiment data packages, 2 new annotation packages, 1 new book, and many updates to existing packages.
- Currently (release 3.22) hosts:
- 2361 software packages
- 926 annotation packages
- 435 experiment data packages
- 29 workflows
- 6 books
- Covers a wide range of domains:
- sequencing (RNA-seq, ChIP-seq, single-cell analysis, variant calling, …)
- microarrays (methylation, gene expression, copy number, …)
- flow and mass cytometry
- proteomics
- imaging
- …
- There is a core team which, among other things, oversees the project, coordinates review of new package submissions, develops and maintains core infrastructure and maintains daily building and checking of all packages. Most packages, however, are contributed by the community.
- There is an active Bioconductor community, where you can find help, support and people with shared interests. For example, check the website, join the slack workspace, or attend one of the yearly conferences (BioC, EuroBioC, Bioc Asia).
How to install Bioconductor packages
- Detailed instructions are available online.
- There is a new Bioconductor release twice per year (typically April and October).
- Between those times, development is going on in the parallel devel branch.
- Do not mix packages from different releases!
- The available Bioconductor releases also depend on your version of R - if you want the latest Bioconductor packages, you need the latest version of R (eg. Bioconductor 3.22 is compatible with R 4.5)
- Bioconductor packages are installed using the
BiocManagerCRAN package (note that this works also for packages hosted on CRAN or GitHub):
install.packages("BiocManager")
BiocManager::install("limma")- We can check the validity of our Bioconductor installation with
BiocManager(e.g., to see if there are outdated packages):
BiocManager::valid()- We can also see which version is currently installed:
BiocManager::version()[1] '3.22'
BiocManagercan be used to search for available packages. Here, we look for package names containing the word “Mmusculus” (assuming that we are interested in looking for packages containing annotation information for mouse):
BiocManager::available(pattern = "Mmusculus") [1] "BSgenome.Mmusculus.UCSC.mm10" "BSgenome.Mmusculus.UCSC.mm10.masked"
[3] "BSgenome.Mmusculus.UCSC.mm39" "BSgenome.Mmusculus.UCSC.mm8"
[5] "BSgenome.Mmusculus.UCSC.mm8.masked" "BSgenome.Mmusculus.UCSC.mm9"
[7] "BSgenome.Mmusculus.UCSC.mm9.masked" "EnsDb.Mmusculus.v75"
[9] "EnsDb.Mmusculus.v79" "PWMEnrich.Mmusculus.background"
[11] "TxDb.Mmusculus.UCSC.mm10.ensGene" "TxDb.Mmusculus.UCSC.mm10.knownGene"
[13] "TxDb.Mmusculus.UCSC.mm39.knownGene" "TxDb.Mmusculus.UCSC.mm39.refGene"
[15] "TxDb.Mmusculus.UCSC.mm9.knownGene"
- Each Bioconductor package has a landing page (e.g., https://bioconductor.org/packages/limma/) that contains relevant information, including installation instructions, vignettes and the reference manual:
How to get help with Bioconductor packages
- Each function has a help page, detailing all arguments and outputs, and containing runnable examples.
library(limma)
?lmFit- Each package comes with at least one vignette, outlining the intended use of the functions in the package to analyze data. The vignettes can be browsed from the landing page, or directly in your R session, e.g. with
browseVignettes(package = "Biostrings")- If you need help with a Bioconductor package:
- Check the function help pages and the vignette(s) to see whether the answer is there (some packages, like DESeq2, have FAQ sections in their vignette)
- Go to https://support.bioconductor.org/
- Search to see whether your question has already been asked (and answered)
- If not, look through the Posting guide and post your question following these guidelines
- If you follow these steps, you will often get a rapid and informative answer
- The slack workspace has channels covering many topics related to Bioconductor
- The Workflows are complete walkthroughs of specific types of analyses, typically making use of multiple (Bioconductor) packages
Standard Bioconductor data structures
- Every object in R is of a particular data type.
- You have already seen some data types, e.g. atomic vectors (
numeric,characteretc),list,matrix,data.frame,factor. - (Almost) all objects in R have a class. A class is a formal description of the data type, its properties, functionality and relation to other types .
- The
classfunction tells you which class(es) an object is of:
class(c("one", "two", "three"))[1] "character"
class(matrix(1:6, 2, 3))[1] "matrix" "array"
- The class of an object defines the set of allowed operations.
relevel(c("one", "two", "three"), ref = "one")Error in `relevel.default()`:
! 'relevel' only for (unordered) factors
relevel(factor(c("one", "two", "three")), ref = "two")[1] one two three
Levels: two one three
c("1", "2", "3") + 1Error in `c("1", "2", "3") + 1`:
! non-numeric argument to binary operator
as.numeric(c("1", "2", "3")) + 1[1] 2 3 4
- When working with (biological) data, it often makes sense to define more complex data structures, with several building blocks, and to define appropriate methods for each such class.
What types of data do we have to deal with?
In bioinformatics, we work with different types of data.
- Sequences (DNA/RNA)
- Alignments (sequences lined up for comparison)
- Rectangular data (Matrices of numeric values for features and samples)
- Genomic ranges (eg. chromosome intervals)
- Genomic variants (eg. DNA changes like SNPs or insertions)
- Annotation databases (eg. information about genes and their functions)
- Images (Microscopy images, for example.)
- …
For each of these data types, Bioconductor provides standard object classes designed specifically to store and handle them properly.
There are often many ways of creating an object of a given class, including:
- Objects of these classes can be created in three main ways: Reading data from files (e.g., instrument output) using specialized import functions. This is the most common approach. It is very similar to creating a data.frame with read.delim().
- Loading processed data or annotations from Bioconductor packages
- Creating the object manually using a constructor function (usually with the same name as the class). This method requires providing all necessary information and is mainly used for small examples or teaching purposes.
Why use standardized classes?
- Interoperability - When data is stored in standardized formats, different packages can work together seamlessly (eg. workflows). Developers do not need to reimplement the same data structures in multiple places. Functions can be written to operate on a common class and will automatically work across many datasets.
- Recognition - Users do not need to learn a new data representation for every package. Once you understand a standard class, you can apply that knowledge across many tools. This makes it easier to move between workflows.
- Robustness - Standard classes are carefully designed, optimized, and extensively tested. They often include built-in validity checks to ensure that the data has the correct structure and content. This reduces errors, improves reliability, and ensures that analyses are based on well-formed data objects.
As an example, all these Bioconductor packages make use of the SummarizedExperiment class:
In the following we will examine a few standard Bioconductor classes in more detail. Of course, there are many more, and some of these will be covered in later lectures.
GenomicRanges
- The
GenomicRangespackage holds infrastructure for working with (genomic) ranges. - Genomic ranges are regions on a genome, like gene coordinates, ChIP-seq peaks, promoters, SNPs, CpG islands.
- A
GRangesobject stores:- Chromosome (called sequence)
- Interval (start and end)
- Strand (either
+,-or*) - Metadata (optional additional data)
- The
IRangesobject represents just the intervals (start, end, width) without genomic context.
# Load the GenomicRanges package
# This also loads GenomicRanges package dependencies: IRanges and GenomeInfoDb.
library(GenomicRanges)Creating an IRanges object
An IRanges object defines intervals on a number line. You can specify start and end, or start and width, and the third value is automatically calculated.
# Create an IRanges object using start and end positions
(ir <- IRanges::IRanges(start = c(3, 10, 50),
end = c(17, 28, 104)))IRanges object with 3 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 3 17 15
[2] 10 28 19
[3] 50 104 55
# Equivalent: Create the same IRanges object using start and width
(ir <- IRanges(start = c(3, 10, 50),
width = c(15, 19, 55)))IRanges object with 3 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 3 17 15
[2] 10 28 19
[3] 50 104 55
Creating a GRanges object
A GRanges object adds genomic context (chromosome, strand) to the intervals (= IRanges object).
# Create a GRanges object using the IRanges object 'ir'
(gr <- GenomicRanges::GRanges(
seqnames = c("chr1", "chr1", "chr2"),
ranges = ir,
strand = c("+", "+", "-")))GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 3-17 +
[2] chr1 10-28 +
[3] chr2 50-104 -
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
Accessing Range Information
You can extract individual components of the object (start, end, and width) of each range using accessor functions. All these functions come from the IRanges package. They are automatically loaded when you load GenomicRanges.
# Get the start positions of each range
start(gr) [1] 3 10 50
# Get the end positions of each range
end(gr)[1] 17 28 104
# Get the width (length) of each range
width(gr)[1] 15 19 55
Working with chromosome naming conventions
Different databases use different naming styles for chromosomes (e.g., “1” vs “chr1”). You can check and change the reference sequences using function seqlevelsStyle().
# Check the current chromosome naming style
seqlevelsStyle(gr)[1] "UCSC"
# Change to Ensembl style (e.g., "1", "2", ...)
seqlevelsStyle(gr) <- "Ensembl"
grGRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 3-17 +
[2] 1 10-28 +
[3] 2 50-104 -
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
# Change back to UCSC style (e.g., "chr1", "chr2", ...)
seqlevelsStyle(gr) <- "UCSC"Subsetting Ranges
There are many ways of subsetting ranges objects (check out methods(class = "GRanges")). You can subset ranges using indexing or by selecting specific chromosomes.
# Keep the first two ranges
gr[1:2]GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 3-17 +
[2] chr1 10-28 +
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
# Subset using a logical condition
gr[seqnames(gr) == "chr1"]GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 3-17 +
[2] chr1 10-28 +
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
# Remove ranges on chromosome "chr1"
dropSeqlevels(gr, "chr1",pruning.mode="coarse")GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr2 50-104 -
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
# Keep only ranges on chromosome "chr1"
keepSeqlevels(gr, "chr1",pruning.mode="coarse")GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 3-17 +
[2] chr1 10-28 +
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
# Keep only standard chromosomes.
keepStandardChromosomes(gr)GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 3-17 +
[2] chr1 10-28 +
[3] chr2 50-104 -
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
Extending Ranges
You can extend ranges by a fixed number of bases. If no sequence lengths are set, negative coordinates are allowed.
# Extend each range by 5 bases in both directions
grGRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 3-17 +
[2] chr1 10-28 +
[3] chr2 50-104 -
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
gr + 5GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 -2-22 +
[2] chr1 5-33 +
[3] chr2 45-109 -
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
Setting Sequence Lengths
To avoid negative coordinates or ranges exceeding chromosome boundaries, set sequence lengths. Now we get a warning that one of the ranges is extending outside of the reference sequences.
# Check current sequence lengths (NULL if not set)
seqlengths(gr)chr1 chr2
NA NA
# Set sequence lengths for "chr1" and "chr2"
seqlengths(gr) <- c(chr1 = 100, chr2 = 200)
seqlengths(gr)chr1 chr2
100 200
# Now, extending ranges beyond boundaries will trigger a warning
gr + 5Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence chr1. Note
that ranges located on a sequence whose length is unknown (NA) or on a
circular sequence are not considered out-of-bound (use seqlengths() and
isCircular() to get the lengths and circularity flags of the underlying
sequences). You can use trim() to trim these ranges. See
?`trim,GenomicRanges-method` for more information.
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 -2-22 +
[2] chr1 5-33 +
[3] chr2 45-109 -
-------
seqinfo: 2 sequences from an unspecified genome
Trimming Ranges
Use trim() to remove parts of ranges that fall outside sequence boundaries. This would not do anything if we hadn’t set the sequence lengths.
# Trim ranges to fit within sequence boundaries
trim(gr + 5)Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence chr1. Note
that ranges located on a sequence whose length is unknown (NA) or on a
circular sequence are not considered out-of-bound (use seqlengths() and
isCircular() to get the lengths and circularity flags of the underlying
sequences). You can use trim() to trim these ranges. See
?`trim,GenomicRanges-method` for more information.
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 1-22 +
[2] chr1 5-33 +
[3] chr2 45-109 -
-------
seqinfo: 2 sequences from an unspecified genome
Setting the genome
You can assign a genome build (e.g., “hg19”) to the ranges for clarity. This can be useful e.g. when comparing different sets of ranges, making sure they all correspond to the same underlying genome.
# Assign the genome build "hg19" to the ranges
genome(gr) <- "hg19"
grGRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 3-17 +
[2] chr1 10-28 +
[3] chr2 50-104 -
-------
seqinfo: 2 sequences from hg19 genome
Other useful operations
- shifting: Shift ranges by a fixed number of bases.
# Shift all ranges 5 bases to the right
shift(gr, 5)GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 8-22 +
[2] chr1 15-33 +
[3] chr2 55-109 -
-------
seqinfo: 2 sequences from hg19 genome
- Flanking Regions: obtaining flanking sequence. Get flanking regions (e.g., upstream/downstream) of each range. Strand information is used to determine direction.
# Get 2 bases upstream of each range
flank(gr, 2, start = TRUE, both = FALSE) # Upstream has different meaning for +/- strand ! GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 1-2 +
[2] chr1 8-9 +
[3] chr2 105-106 -
-------
seqinfo: 2 sequences from hg19 genome
- Midpoints: finding the midpoint of each range
# Resize each range to width=1, centered at the midpoint
resize(gr, width = 1, fix = "center")GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 10 +
[2] chr1 19 +
[3] chr2 77 -
-------
seqinfo: 2 sequences from hg19 genome
- Overlaps: find overlaps between ranges. Overlaps are strand-specific unless you use
ignore.strand = TRUE. Overlaps are only found between ranges on the same chromosome.
# Create a second GRanges object
gr2 <- GRanges(seqnames = c("chr2", "chr1"),
ranges = IRanges(start = c(90, 5),
end = c(100, 15)),
strand = c("-", "+"))
# Find overlaps between 'gr' and 'gr2'
findOverlaps(query = gr, subject = gr2)Hits object with 3 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 2
[2] 2 2
[3] 3 1
-------
queryLength: 3 / subjectLength: 2
Metadata
GRanges objects can store metadata in the mcols() slot.
# Add a 'score' column to the metadata
mcols(gr)$score <- 1:3
grGRanges object with 3 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <integer>
[1] chr1 3-17 + | 1
[2] chr1 10-28 + | 2
[3] chr2 50-104 - | 3
-------
seqinfo: 2 sequences from hg19 genome
# Extract only the metadata
mcols(gr)DataFrame with 3 rows and 1 column
score
<integer>
1 1
2 2
3 3
# Create a GRanges object with metadata directly
GRanges(seqnames = c("chr2", "chr1"),
ranges = IRanges(start = c(90, 5),
end = c(100, 15)),
strand = c("-", "+"),
extra_column = c("A", "B"))GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | extra_column
<Rle> <IRanges> <Rle> | <character>
[1] chr2 90-100 - | A
[2] chr1 5-15 + | B
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
Exercises
In practice, the locations of known genomic features (exons, genes, transcripts, UTRs, …) along the genome are often stored in so called gtf (gene transfer format) files. These files can be imported into a GRanges object using the import() function from the rtracklayer Bioconductor package.
- Download the
gencode.v28.annotation.1.1.10M.gtffile from the course web page, and read it into R in this way.
BiocManager::install("rtracklayer")Bioconductor version 3.22 (BiocManager 1.30.27), R 4.5.2 (2025-10-31)
Warning: package(s) not installed when version(s) same as or greater than current; use
`force = TRUE` to re-install: 'rtracklayer'
library(rtracklayer)
Attaching package: 'rtracklayer'
The following object is masked from 'package:igraph':
blocks
gtf <- rtracklayer::import("data/02/gencode.v28.annotation.1.1.10M.gtf")- Confirm that the
gtfobject is indeed aGRangesobject
class(gtf)[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"
- How many records are there in the object?
length(gtf)[1] 17955
- From which chromosomes do the records come?
seqlevels(gtf)[1] "chr1"
- Use the
subsetfunction to extract only the records for theCHD5gene. How many annotated transcripts does this gene have?
gtfsub <- subset(gtf, gene_name == "CHD5")
length(subset(gtfsub, type == "transcript"))[1] 6
Lists of genomic ranges
Sometimes, ranges belong together in groups (e.g., exons in a transcript). For this, use a GRangesList:
- Each group is a
GRangesobject. - All groups share the same genome information.
# Create a GRangesList object to store multiple transcripts (or grouped genomic features)
grl <- GRangesList(
## Each transcript is a GRanges object with each exon as individual range:
# Transcript tx1.1 of gene1
tx1.1 = GRanges(seqnames = "chr1",
ranges = IRanges(start = c(50, 125, 188),
end = c(75, 130, 257)),
strand = "+",
# Metadata columns:
symbol = "tx1.1",
transcript = "tx1.1",
gene = "gene1",
exon = paste0("ex1.", 1:3)),
# Transcript tx1.2 of gene1
# Alternative transcript of gene1 with a different exon structure
tx1.2 = GRanges(seqnames = "chr1",
ranges = IRanges(start = c(50, 125, 200),
end = c(75, 130, 227)),
strand = "+",
# Metadata columns:
symbol = "tx1.2",
transcript = "tx1.2",
gene = "gene1",
exon = paste0("ex1.", c(1, 2, 4))),
# Transcript tx2.1 of gene2
# Transcript from a different gene on a different chromosome
tx2.1 = GRanges(seqnames = "chr2",
ranges = IRanges(start = c(289, 318),
end = c(300, 350)),
strand = "-",
symbol = "tx2.1", transcript = "tx2.1",
gene = "gene2", exon = paste0("ex2.", 1:2)))
# Set the lengths of the chromosomes for boundary checks and operations
seqlengths(grl) <- c(chr1 = 1000, chr2 = 500)
grlGRangesList object of length 3:
$tx1.1
GRanges object with 3 ranges and 4 metadata columns:
seqnames ranges strand | symbol transcript gene
<Rle> <IRanges> <Rle> | <character> <character> <character>
[1] chr1 50-75 + | tx1.1 tx1.1 gene1
[2] chr1 125-130 + | tx1.1 tx1.1 gene1
[3] chr1 188-257 + | tx1.1 tx1.1 gene1
exon
<character>
[1] ex1.1
[2] ex1.2
[3] ex1.3
-------
seqinfo: 2 sequences from an unspecified genome
$tx1.2
GRanges object with 3 ranges and 4 metadata columns:
seqnames ranges strand | symbol transcript gene
<Rle> <IRanges> <Rle> | <character> <character> <character>
[1] chr1 50-75 + | tx1.2 tx1.2 gene1
[2] chr1 125-130 + | tx1.2 tx1.2 gene1
[3] chr1 200-227 + | tx1.2 tx1.2 gene1
exon
<character>
[1] ex1.1
[2] ex1.2
[3] ex1.4
-------
seqinfo: 2 sequences from an unspecified genome
$tx2.1
GRanges object with 2 ranges and 4 metadata columns:
seqnames ranges strand | symbol transcript gene
<Rle> <IRanges> <Rle> | <character> <character> <character>
[1] chr2 289-300 - | tx2.1 tx2.1 gene2
[2] chr2 318-350 - | tx2.1 tx2.1 gene2
exon
<character>
[1] ex2.1
[2] ex2.2
-------
seqinfo: 2 sequences from an unspecified genome
The Gviz package can be used to generate a visualization of the first two transcripts (which belong to the same gene):
library(Gviz)
# Create a GeneRegionTrack for transcript tx1.1
g1 <- GeneRegionTrack(
grl$tx1.1,
showId = TRUE, # Show exon IDs on the plot
col = NULL, # No outline color for exons (transparent borders)
fill = "#377EB8", # Fill color for exons (blue)
name = "", # No track name (empty string)
background.title = "transparent", # Transparent background for the track title
col.title = "black" # Color for the track title text
)
# Create a GeneRegionTrack for transcript tx1.2
g2 <- GeneRegionTrack(
grl$tx1.2,
showId = TRUE,
col = NULL,
fill = "#E41A1C",
name = "",
background.title = "transparent",
col.title = "black"
)
# Create a GenomeAxisTrack
# This adds a genomic axis (chromosome scale) to the plot
gtr <- GenomeAxisTrack()
# Combine the tracks into a list
# Order matters: The GenomeAxisTrack is plotted first (bottom), followed by g1 and g2
tracks <- c(gtr, g1, g2)
# Plot the tracks
# This generates the visualization with all three tracks
plotTracks(tracks)SummarizedExperiment
The SummarizedExperiment class (SummarizedExperiment package) stores:
- Your experiment data in a ‘rectangular’ data.
- Annotations and metadata for rows and columns.
It’s widely used in Bioconductor (e.g., for RNA-seq, ChIP-seq, and more) and is the foundation for specialized classes like SingleCellExperiment and DESeqDataSet.
### Structure
- The
SummarizedExperimentobject consists of three types of ‘tables’, denoted- assays (actually a list of matrices of the same size, containing the observed data, e.g., gene expression values)
- rowData (annotations for the rows, i.e., the features). This can be replaced by rowRanges (a
GRangesobject) that can also hold genomic range information, e.g., the locations of genes - colData (annotations for the columns, i.e., the samples)
library(SummarizedExperiment)se <- SummarizedExperiment(
# Assays: The observed data
assays = list(exprs = matrix(10 * runif(15), 5, 3)),
# Column metadata
colData = DataFrame( sample = paste0("Sample", 1:3),
condition = c("A", "A", "B")),
# Row metadata
rowData = DataFrame(gene = paste0("G", 1:5)))
rownames(se) <- rowData(se)$gene
colnames(se) <- colData(se)$sample
seclass: SummarizedExperiment
dim: 5 3
metadata(0):
assays(1): exprs
rownames(5): G1 G2 G3 G4 G5
rowData names(1): gene
colnames(3): Sample1 Sample2 Sample3
colData names(2): sample condition
Accessor functions
The SummarizedExperiment class has accessor functions to extract the individual components.
# Vector of assay names
assayNames(se) # Output: "exprs"[1] "exprs"
# matrix of expression values
assay(se, "exprs") Sample1 Sample2 Sample3
G1 2.595406 0.3962741 7.027536
G2 1.411956 8.4105452 7.270798
G3 1.699197 2.8136119 4.505360
G4 9.351497 1.7157312 6.343644
G5 1.267353 7.6316145 5.252465
# column data (sample metadata)
colData(se)DataFrame with 3 rows and 2 columns
sample condition
<character> <character>
Sample1 Sample1 A
Sample2 Sample2 A
Sample3 Sample3 B
# row data (feature metadata)
rowData(se)DataFrame with 5 rows and 1 column
gene
<character>
G1 G1
G2 G2
G3 G3
G4 G4
G5 G5
There are also convenient subsetting functions. When you subset a SummarizedExperiment object, all parts (assays, row data, and column data) are automatically subset together.
# Subset first 2 rows (genes)
sesub <- se[1:2, ]
# Expression matrix for subset
assay(sesub, "exprs") Sample1 Sample2 Sample3
G1 2.595406 0.3962741 7.027536
G2 1.411956 8.4105452 7.270798
# column metadata for subset
colData(sesub)DataFrame with 3 rows and 2 columns
sample condition
<character> <character>
Sample1 Sample1 A
Sample2 Sample2 A
Sample3 Sample3 B
# Row metadata for subset
rowData(sesub)DataFrame with 2 rows and 1 column
gene
<character>
G1 G1
G2 G2
Exercises
Here we will explore a real RNA-seq data set, which is provided in the airway Bioconductor package and stored in a RangedSummarizedExperiment object.
- Load the package and attach the data set.
library(airway)
data(airway)
airwayclass: RangedSummarizedExperiment
dim: 63677 8
metadata(1): ''
assays(1): counts
rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
ENSG00000273493
rowData names(10): gene_id gene_name ... seq_coord_system symbol
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample
- What type of data value are available in this object? How many samples are there? How many features?
assayNames(airway)[1] "counts"
dim(airway)[1] 63677 8
- What is the total number of sequencing reads assigned to genes for each sample?
colSums(assay(airway, "counts"))SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
20637971 18809481 25348649 15163415 24448408 30818215 19126151
SRR1039521
21164133
Class inheritance
Classes can build on top of other classes, like adding features to a basic model.
How it works: * Start with a basic class (e.g., SummarizedExperiment). * Extend it to create a new class with extra features (e.g., SingleCellExperiment).
The new class keeps all the features of the original class. You don’t have to rewrite what’s already there.
library(SingleCellExperiment)getClass("SingleCellExperiment")Class "SingleCellExperiment" [package "SingleCellExperiment"]
Slots:
Name: int_elementMetadata int_colData
Class: DataFrame DataFrame
Name: int_metadata rowRanges
Class: list GenomicRanges_OR_GRangesList
Name: colData assays
Class: DataFrame Assays_OR_NULL
Name: NAMES elementMetadata
Class: character_OR_NULL DataFrame
Name: metadata
Class: list
Extends:
Class "RangedSummarizedExperiment", directly
Class "SummarizedExperiment", by class "RangedSummarizedExperiment", distance 2
Class "RectangularData", by class "RangedSummarizedExperiment", distance 3
Class "Vector", by class "RangedSummarizedExperiment", distance 3
Class "Annotated", by class "RangedSummarizedExperiment", distance 4
Class "vector_OR_Vector", by class "RangedSummarizedExperiment", distance 4
Due to inheritance it might not be immediately clear what methods are defined for a given class. For most Bioconductor classes (all so called S4 classes) one way to find out is via the showMethods() function. This lists all functions that can be used with SingleCellExperiment, including methods inherited from parent classes.
showMethods(class = "SingleCellExperiment")Biostrings
Biostrings is a package containing infrastructure for dealing with biological sequences.
XString: Represents a single sequence (e.g., DNAString, RNAString).XStringSet: Represents multiple sequences (e.g., DNAStringSet, RNAStringSet).
The X stands for the type of sequence: DNA, RNA, amino acids (AA), or raw bases (B).
library(Biostrings)Structure
To create a DNAString object, we can use the DNAString() constructor function. To see how the function works, type ?DNAString.
(dna <- DNAString(x = "AGGCATAGA"))9-letter DNAString object
seq: AGGCATAGA
The classes automatically check if sequences are valid. For example, a DNA string can only contain valid DNA letters (including ambiguous bases like N, R, Y, etc., as defined by the IUPAC code.
alphabet(DNAString()) [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" "."
IUPAC_CODE_MAP A C G T M R W S Y K V
"A" "C" "G" "T" "AC" "AG" "AT" "CG" "CT" "GT" "ACG"
H D B N
"ACT" "AGT" "CGT" "ACGT"
## Invalid DNA string
DNAString(x = "EQI")Error in `.Call2()`:
! key 69 (char 'E') not in lookup table
alphabet(AAString()) [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y"
[20] "V" "U" "O" "B" "J" "Z" "X" "*" "-" "+" "."
AAString(x = "EQI")3-letter AAString object
seq: EQI
Operation
Once we have defined the object, we can perform operations like:
- Reverse complement of the DNA sequence
reverseComplement(dna)9-letter DNAString object
seq: TCTATGCCT
Not possible with a regular character string
reverseComplement("AGGCATAGA")Error:
! unable to find an inherited method for function 'reverseComplement' for signature 'x = "character"'
- Translating the DNA sequence to an amino acid sequence
translate(dna)3-letter AAString object
seq: RHR
We can not translate an amino acid sequence:
translate(AAString(x = "EQI"))Error:
! unable to find an inherited method for function 'translate' for signature 'x = "AAString"'
If we have more than one sequence, we can represent them in a DNAStringSet object, which is a collection of DNAString objects
(dna_multiple <- DNAStringSet(c(x = "AGGCATAGA", y = "GGTATGAG")))DNAStringSet object of length 2:
width seq names
[1] 9 AGGCATAGA x
[2] 8 GGTATGAG y
## Subsetting with [] gives back a DNAStringSet
dna_multiple[1]DNAStringSet object of length 1:
width seq names
[1] 9 AGGCATAGA x
## Subsetting with [[]] extracts the individual objects
dna_multiple[[1]]9-letter DNAString object
seq: AGGCATAGA
It is also easy to extract subsequences of either an XString or an XStringSet object
## Get the length of each string
width(dna_multiple)[1] 9 8
## Extract the first three positions in each string
Biostrings::subseq(dna_multiple, start = 1, end = 3)DNAStringSet object of length 2:
width seq names
[1] 3 AGG x
[2] 3 GGT y
## Extract subsequences of different length
Biostrings::subseq(dna_multiple, start = 1, end = c(2, 5))DNAStringSet object of length 2:
width seq names
[1] 2 AG x
[2] 5 GGTAT y
Biostrings::subseq(dna_multiple, start = 1,
end = width(dna_multiple) - 4)DNAStringSet object of length 2:
width seq names
[1] 5 AGGCA x
[2] 4 GGTA y
Exercises
- Can you tell whether
AATTGGCCRGGCCAATTis a valid DNA sequence?
Biostrings::DNAString("AATTGGCCRGGCCAATT")17-letter DNAString object
seq: AATTGGCCRGGCCAATT
Genomic sequence information is often stored in fasta files. DNAStringSet objects can be created from such files by the readDNAStringSet() function from the Biostrings package. Download the gencode.v28.transcripts.1.1.10M.fa file from the course web page and read it into a DNAStringSet object using this function. What do you think these sequences represent?
txs <- Biostrings::readDNAStringSet("data/02/gencode.v28.transcripts.1.1.10M.fa")- How many sequences are there in this file?
length(txs)[1] 1373
- What are the lengths of the shortest and longest sequence?
range(width(txs))[1] 59 11666
- Extract the first 10 bases of each sequence
Biostrings::subseq(txs, start = 1, end = 10)DNAStringSet object of length 1373:
width seq names
[1] 10 GTTAACTTGC ENST00000456328.2...
[2] 10 GTGTCTGACT ENST00000450305.2...
[3] 10 ATGGGAGCCG ENST00000488147.1...
[4] 10 TGTGGGAGAG ENST00000619216.1...
[5] 10 GTGCACACGG ENST00000473358.1...
... ... ...
[1369] 10 ATGATTCACT ENST00000636827.2...
[1370] 10 AGCATATTCT ENST00000578045.1...
[1371] 10 GCTAAAATAA ENST00000413148.1...
[1372] 10 GTCCCGGCCC ENST00000315901.4...
[1373] 10 GAGCCTCCGG ENST00000294435.7...
Annotation resources
Bioconductor provides tools to access different types of biological data:
- Gene information for specific organisms (e.g., OrgDb).
- Transcript locations for specific genome versions (e.g., TxDb, EnsDb).
- Genome sequences for specific genome versions (e.g., BSgenome).
How to Access These Resources * Annotation Packages: Updated every 6 months with the Bioconductor releases * Online Resources: pulling down information from online resourceslike AnnotationHub.
As an example lets have a look at a human transcript annotation package TxDb.Hsapiens.UCSC.hg38.knownGene. As the name suggest, the annotation is derived from UCSC “knownGene” track for the hg38 genome build.
library(TxDb.Hsapiens.UCSC.hg38.knownGene)Two accessor functions are transcripts() and transcriptsBy(). Lets compare their output:
# Extract all transcripts from the TxDb database as a single GRanges object
gr <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
grGRanges object with 412044 ranges and 2 metadata columns:
seqnames ranges strand | tx_id
<Rle> <IRanges> <Rle> | <integer>
[1] chr1 11121-14413 + | 1
[2] chr1 11125-14405 + | 2
[3] chr1 11410-14413 + | 3
[4] chr1 11411-14413 + | 4
[5] chr1 11426-14409 + | 5
... ... ... ... . ...
[412040] chrX_MU273397v1_alt 239036-260095 - | 412040
[412041] chrX_MU273397v1_alt 272358-282686 - | 412041
[412042] chrX_MU273397v1_alt 314193-316302 - | 412042
[412043] chrX_MU273397v1_alt 314813-315236 - | 412043
[412044] chrX_MU273397v1_alt 324527-324923 - | 412044
tx_name
<character>
[1] ENST00000832824.1
[2] ENST00000832825.1
[3] ENST00000832826.1
[4] ENST00000832827.1
[5] ENST00000832828.1
... ...
[412040] ENST00000710260.1
[412041] ENST00000710028.1
[412042] ENST00000710030.1
[412043] ENST00000710216.1
[412044] ENST00000710031.1
-------
seqinfo: 711 sequences (1 circular) from hg38 genome
# Extract transcripts grouped by gene as a GRangesList object
grl <- transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene, by = "gene")
grlGRangesList object of length 37525:
$`1`
GRanges object with 10 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr19 58345178-58347634 - | 349232 ENST00000596924.1
[2] chr19 58345183-58353492 - | 349233 ENST00000263100.8
[3] chr19 58345183-58353492 - | 349234 ENST00000850949.1
[4] chr19 58345183-58353492 - | 349235 ENST00000850950.1
[5] chr19 58346854-58356225 - | 349236 ENST00000600123.5
[6] chr19 58346858-58353491 - | 349237 ENST00000595014.1
[7] chr19 58346860-58347657 - | 349238 ENST00000598345.2
[8] chr19 58348466-58362751 - | 349239 ENST00000599109.5
[9] chr19 58350594-58353129 - | 349240 ENST00000600966.1
[10] chr19 58353021-58356083 - | 349241 ENST00000596636.1
-------
seqinfo: 711 sequences (1 circular) from hg38 genome
$`10`
GRanges object with 6 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr8 18386273-18388323 + | 166654 ENST00000660285.2
[2] chr8 18386315-18388318 + | 166655 ENST00000739947.1
[3] chr8 18386348-18388323 + | 166656 ENST00000739948.1
[4] chr8 18386783-18388318 + | 166659 ENST00000739951.1
[5] chr8 18391282-18401218 + | 166661 ENST00000286479.4
[6] chr8 18391287-18400993 + | 166662 ENST00000520116.1
-------
seqinfo: 711 sequences (1 circular) from hg38 genome
...
<37523 more elements>
Functions like
exons(),transcripts()orgenes()returnGRangesobjects with a set of ranges representing the level of detail of the query.The “xxxBy” functions, e.g.
transcriptsBy()orexonsBy(), returnGRangesListobjects. They enable grouping of the output in terms of another genomic feature, in our case genes.
Transcript annotations and their functionality play an important role when analyzing gene expression data. The coming lectures on RNA-Seq will introduce more functionality in the context of examples.
Exercises
The OrgDb package org.Hs.eg.db provides human gene annotation. It offers a simple interface to an underlying SQLite database. The main accessor functions for this (and other SQLite) annotation objects are select(), columns() and keys().
- Install the
org.Hs.eg.dbpackage, load it and check its help page. Detailed examples on how to work with these annotation objects can be found on the linkedAnnotationDbclass help page.
library(org.Hs.eg.db)
?org.Hs.eg.db- What kind of data can be retrieved from the
org.Hs.eg.dbobject? Check thecolumns()function!
columns(org.Hs.eg.db) [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
[16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
[21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
[26] "UNIPROT"
- Use the
select()function to retrieve the gene names as well as ENTREZ and ENSEMBL gene IDs for the three genes: RAF1, FLT3 and MAPK1.
select(org.Hs.eg.db, keys = c("RAF1", "FLT3", "MAPK1"),
keytype = "SYMBOL", columns = c("ENTREZID", "ENSEMBL", "GENENAME"))'select()' returned 1:1 mapping between keys and columns
SYMBOL ENTREZID ENSEMBL GENENAME
1 RAF1 5894 ENSG00000132155 Raf-1 proto-oncogene, serine/threonine kinase
2 FLT3 2322 ENSG00000122025 fms related receptor tyrosine kinase 3
3 MAPK1 5594 ENSG00000100030 mitogen-activated protein kinase 1
Further reading
- An overview of the Bioconductor project and its capabilities is given in Huber et al.: Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods 12:115–121 (2015).
- An introduction to the analysis of ranges in Bioconductor (particularly useful if you have experience with bedtools) is provided in the HelloRanges package
- Bioconductor provides a long list of material from previous courses, some videos and links to upcoming events
- A Bioconductor cheat sheet from Mike Love
- A lesson on the Bioconductor project (under development)