Introduction to Bioconductor and standard classes

Author

Author: Charlotte Soneson and Athimed El Taher

Published

March 4, 2026

What is Bioconductor?

  • Bioconductor is an R-based ecosystem of packages and workflows.
  • It support and disseminate free open source software that facilitates rigorous and reproducible analysis of data from current and emerging biological assays.
  • The project was started in 2001 (back then, mostly focused on microarray data)
  • Bioconductor updates its release version twice a year. For example, the latest release added 59 new software packages, 6 new experiment data packages, 2 new annotation packages, 1 new book, and many updates to existing packages.
  • Currently (release 3.22) hosts:

  • Covers a wide range of domains:
    • sequencing (RNA-seq, ChIP-seq, single-cell analysis, variant calling, …)
    • microarrays (methylation, gene expression, copy number, …)
    • flow and mass cytometry
    • proteomics
    • imaging
  • There is a core team which, among other things, oversees the project, coordinates review of new package submissions, develops and maintains core infrastructure and maintains daily building and checking of all packages. Most packages, however, are contributed by the community.
  • There is an active Bioconductor community, where you can find help, support and people with shared interests. For example, check the website, join the slack workspace, or attend one of the yearly conferences (BioC, EuroBioC, Bioc Asia).

How to install Bioconductor packages

  • Detailed instructions are available online.
  • There is a new Bioconductor release twice per year (typically April and October).
  • Between those times, development is going on in the parallel devel branch.
  • Do not mix packages from different releases!
  • The available Bioconductor releases also depend on your version of R - if you want the latest Bioconductor packages, you need the latest version of R (eg. Bioconductor 3.22 is compatible with R 4.5)
  • Bioconductor packages are installed using the BiocManager CRAN package (note that this works also for packages hosted on CRAN or GitHub):
install.packages("BiocManager")
BiocManager::install("limma")
  • We can check the validity of our Bioconductor installation with BiocManager (e.g., to see if there are outdated packages):
BiocManager::valid()
  • We can also see which version is currently installed:
BiocManager::version()
[1] '3.22'
  • BiocManager can be used to search for available packages. Here, we look for package names containing the word “Mmusculus” (assuming that we are interested in looking for packages containing annotation information for mouse):
BiocManager::available(pattern = "Mmusculus")
 [1] "BSgenome.Mmusculus.UCSC.mm10"        "BSgenome.Mmusculus.UCSC.mm10.masked"
 [3] "BSgenome.Mmusculus.UCSC.mm39"        "BSgenome.Mmusculus.UCSC.mm8"        
 [5] "BSgenome.Mmusculus.UCSC.mm8.masked"  "BSgenome.Mmusculus.UCSC.mm9"        
 [7] "BSgenome.Mmusculus.UCSC.mm9.masked"  "EnsDb.Mmusculus.v75"                
 [9] "EnsDb.Mmusculus.v79"                 "PWMEnrich.Mmusculus.background"     
[11] "TxDb.Mmusculus.UCSC.mm10.ensGene"    "TxDb.Mmusculus.UCSC.mm10.knownGene" 
[13] "TxDb.Mmusculus.UCSC.mm39.knownGene"  "TxDb.Mmusculus.UCSC.mm39.refGene"   
[15] "TxDb.Mmusculus.UCSC.mm9.knownGene"  
  • Each Bioconductor package has a landing page (e.g., https://bioconductor.org/packages/limma/) that contains relevant information, including installation instructions, vignettes and the reference manual:

How to get help with Bioconductor packages

  • Each function has a help page, detailing all arguments and outputs, and containing runnable examples.
library(limma)
?lmFit
  • Each package comes with at least one vignette, outlining the intended use of the functions in the package to analyze data. The vignettes can be browsed from the landing page, or directly in your R session, e.g. with
browseVignettes(package = "Biostrings")
  • If you need help with a Bioconductor package:
    • Check the function help pages and the vignette(s) to see whether the answer is there (some packages, like DESeq2, have FAQ sections in their vignette)
    • Go to https://support.bioconductor.org/
    • Search to see whether your question has already been asked (and answered)
    • If not, look through the Posting guide and post your question following these guidelines
    • If you follow these steps, you will often get a rapid and informative answer
    • The slack workspace has channels covering many topics related to Bioconductor
  • The Workflows are complete walkthroughs of specific types of analyses, typically making use of multiple (Bioconductor) packages

Standard Bioconductor data structures

  • Every object in R is of a particular data type.
  • You have already seen some data types, e.g. atomic vectors (numeric, character etc), list, matrix, data.frame, factor.
  • (Almost) all objects in R have a class. A class is a formal description of the data type, its properties, functionality and relation to other types .
  • The class function tells you which class(es) an object is of:
class(c("one", "two", "three"))
[1] "character"
class(matrix(1:6, 2, 3))
[1] "matrix" "array" 
  • The class of an object defines the set of allowed operations.
relevel(c("one", "two", "three"), ref = "one")
Error in `relevel.default()`:
! 'relevel' only for (unordered) factors
relevel(factor(c("one", "two", "three")), ref = "two")
[1] one   two   three
Levels: two one three
c("1", "2", "3") + 1
Error in `c("1", "2", "3") + 1`:
! non-numeric argument to binary operator
as.numeric(c("1", "2", "3")) + 1
[1] 2 3 4
  • When working with (biological) data, it often makes sense to define more complex data structures, with several building blocks, and to define appropriate methods for each such class.

What types of data do we have to deal with?

In bioinformatics, we work with different types of data.

  • Sequences (DNA/RNA)
  • Alignments (sequences lined up for comparison)
  • Rectangular data (Matrices of numeric values for features and samples)
  • Genomic ranges (eg. chromosome intervals)
  • Genomic variants (eg. DNA changes like SNPs or insertions)
  • Annotation databases (eg. information about genes and their functions)
  • Images (Microscopy images, for example.)

For each of these data types, Bioconductor provides standard object classes designed specifically to store and handle them properly.

There are often many ways of creating an object of a given class, including:

  1. Objects of these classes can be created in three main ways: Reading data from files (e.g., instrument output) using specialized import functions. This is the most common approach. It is very similar to creating a data.frame with read.delim().
  2. Loading processed data or annotations from Bioconductor packages
  3. Creating the object manually using a constructor function (usually with the same name as the class). This method requires providing all necessary information and is mainly used for small examples or teaching purposes.

Why use standardized classes?

  • Interoperability - When data is stored in standardized formats, different packages can work together seamlessly (eg. workflows). Developers do not need to reimplement the same data structures in multiple places. Functions can be written to operate on a common class and will automatically work across many datasets.
  • Recognition - Users do not need to learn a new data representation for every package. Once you understand a standard class, you can apply that knowledge across many tools. This makes it easier to move between workflows.
  • Robustness - Standard classes are carefully designed, optimized, and extensively tested. They often include built-in validity checks to ensure that the data has the correct structure and content. This reduces errors, improves reliability, and ensures that analyses are based on well-formed data objects.

As an example, all these Bioconductor packages make use of the SummarizedExperiment class:

In the following we will examine a few standard Bioconductor classes in more detail. Of course, there are many more, and some of these will be covered in later lectures.

GenomicRanges

  • The GenomicRanges package holds infrastructure for working with (genomic) ranges.
  • Genomic ranges are regions on a genome, like gene coordinates, ChIP-seq peaks, promoters, SNPs, CpG islands.
  • A GRanges object stores:
    • Chromosome (called sequence)
    • Interval (start and end)
    • Strand (either +, - or *)
    • Metadata (optional additional data)
  • The IRanges object represents just the intervals (start, end, width) without genomic context.
# Load the GenomicRanges package
# This also loads GenomicRanges package dependencies: IRanges and GenomeInfoDb.

library(GenomicRanges)

Creating an IRanges object

An IRanges object defines intervals on a number line. You can specify start and end, or start and width, and the third value is automatically calculated.

# Create an IRanges object using start and end positions
(ir <- IRanges::IRanges(start = c(3, 10, 50), 
               end = c(17, 28, 104)))
IRanges object with 3 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         3        17        15
  [2]        10        28        19
  [3]        50       104        55
# Equivalent: Create the same IRanges object using start and width
(ir <- IRanges(start = c(3, 10, 50), 
               width = c(15, 19, 55)))
IRanges object with 3 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         3        17        15
  [2]        10        28        19
  [3]        50       104        55

Creating a GRanges object

A GRanges object adds genomic context (chromosome, strand) to the intervals (= IRanges object).

# Create a GRanges object using the IRanges object 'ir'
(gr <- GenomicRanges::GRanges(
               seqnames = c("chr1", "chr1", "chr2"), 
               ranges = ir,
               strand = c("+", "+", "-")))
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  [3]     chr2    50-104      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Accessing Range Information

You can extract individual components of the object (start, end, and width) of each range using accessor functions. All these functions come from the IRanges package. They are automatically loaded when you load GenomicRanges.

# Get the start positions of each range
start(gr) 
[1]  3 10 50
# Get the end positions of each range
end(gr)
[1]  17  28 104
# Get the width (length) of each range
width(gr)
[1] 15 19 55

Working with chromosome naming conventions

Different databases use different naming styles for chromosomes (e.g., “1” vs “chr1”). You can check and change the reference sequences using function seqlevelsStyle().

# Check the current chromosome naming style
seqlevelsStyle(gr)
[1] "UCSC"
# Change to Ensembl style (e.g., "1", "2", ...)
seqlevelsStyle(gr) <- "Ensembl"
gr
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1      3-17      +
  [2]        1     10-28      +
  [3]        2    50-104      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
# Change back to UCSC style (e.g., "chr1", "chr2", ...)
seqlevelsStyle(gr) <- "UCSC"

Subsetting Ranges

There are many ways of subsetting ranges objects (check out methods(class = "GRanges")). You can subset ranges using indexing or by selecting specific chromosomes.

# Keep the first two ranges
gr[1:2]
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
# Subset using a logical condition
gr[seqnames(gr) == "chr1"]
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
# Remove ranges on chromosome "chr1"
dropSeqlevels(gr, "chr1",pruning.mode="coarse")
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2    50-104      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
# Keep only ranges on chromosome "chr1"
keepSeqlevels(gr, "chr1",pruning.mode="coarse")
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
# Keep only standard chromosomes.
keepStandardChromosomes(gr)
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  [3]     chr2    50-104      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Extending Ranges

You can extend ranges by a fixed number of bases. If no sequence lengths are set, negative coordinates are allowed.

# Extend each range by 5 bases in both directions
gr
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  [3]     chr2    50-104      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
gr + 5
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     -2-22      +
  [2]     chr1      5-33      +
  [3]     chr2    45-109      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Setting Sequence Lengths

To avoid negative coordinates or ranges exceeding chromosome boundaries, set sequence lengths. Now we get a warning that one of the ranges is extending outside of the reference sequences.

# Check current sequence lengths (NULL if not set)
seqlengths(gr)
chr1 chr2 
  NA   NA 
# Set sequence lengths for "chr1" and "chr2"
seqlengths(gr) <- c(chr1 = 100, chr2 = 200)
seqlengths(gr)
chr1 chr2 
 100  200 
# Now, extending ranges beyond boundaries will trigger a warning
gr + 5
Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence chr1. Note
  that ranges located on a sequence whose length is unknown (NA) or on a
  circular sequence are not considered out-of-bound (use seqlengths() and
  isCircular() to get the lengths and circularity flags of the underlying
  sequences). You can use trim() to trim these ranges. See
  ?`trim,GenomicRanges-method` for more information.
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     -2-22      +
  [2]     chr1      5-33      +
  [3]     chr2    45-109      -
  -------
  seqinfo: 2 sequences from an unspecified genome

Trimming Ranges

Use trim() to remove parts of ranges that fall outside sequence boundaries. This would not do anything if we hadn’t set the sequence lengths.

# Trim ranges to fit within sequence boundaries
trim(gr + 5)
Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence chr1. Note
  that ranges located on a sequence whose length is unknown (NA) or on a
  circular sequence are not considered out-of-bound (use seqlengths() and
  isCircular() to get the lengths and circularity flags of the underlying
  sequences). You can use trim() to trim these ranges. See
  ?`trim,GenomicRanges-method` for more information.
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      1-22      +
  [2]     chr1      5-33      +
  [3]     chr2    45-109      -
  -------
  seqinfo: 2 sequences from an unspecified genome

Setting the genome

You can assign a genome build (e.g., “hg19”) to the ranges for clarity. This can be useful e.g. when comparing different sets of ranges, making sure they all correspond to the same underlying genome.

# Assign the genome build "hg19" to the ranges
genome(gr) <- "hg19"
gr
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  [3]     chr2    50-104      -
  -------
  seqinfo: 2 sequences from hg19 genome

Other useful operations

  • shifting: Shift ranges by a fixed number of bases.
# Shift all ranges 5 bases to the right
shift(gr, 5)
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      8-22      +
  [2]     chr1     15-33      +
  [3]     chr2    55-109      -
  -------
  seqinfo: 2 sequences from hg19 genome
  • Flanking Regions: obtaining flanking sequence. Get flanking regions (e.g., upstream/downstream) of each range. Strand information is used to determine direction.
# Get 2 bases upstream of each range
flank(gr, 2, start = TRUE, both = FALSE) # Upstream has different meaning for +/- strand !  
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1       1-2      +
  [2]     chr1       8-9      +
  [3]     chr2   105-106      -
  -------
  seqinfo: 2 sequences from hg19 genome
  • Midpoints: finding the midpoint of each range
# Resize each range to width=1, centered at the midpoint
resize(gr, width = 1, fix = "center")
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1        10      +
  [2]     chr1        19      +
  [3]     chr2        77      -
  -------
  seqinfo: 2 sequences from hg19 genome
  • Overlaps: find overlaps between ranges. Overlaps are strand-specific unless you use ignore.strand = TRUE. Overlaps are only found between ranges on the same chromosome.
# Create a second GRanges object
gr2 <- GRanges(seqnames = c("chr2", "chr1"),
               ranges = IRanges(start = c(90, 5), 
                                end = c(100, 15)),
               strand = c("-", "+"))

# Find overlaps between 'gr' and 'gr2'
findOverlaps(query = gr, subject = gr2)
Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           2
  [2]         2           2
  [3]         3           1
  -------
  queryLength: 3 / subjectLength: 2

Metadata

GRanges objects can store metadata in the mcols() slot.

# Add a 'score' column to the metadata
mcols(gr)$score <- 1:3
gr
GRanges object with 3 ranges and 1 metadata column:
      seqnames    ranges strand |     score
         <Rle> <IRanges>  <Rle> | <integer>
  [1]     chr1      3-17      + |         1
  [2]     chr1     10-28      + |         2
  [3]     chr2    50-104      - |         3
  -------
  seqinfo: 2 sequences from hg19 genome
# Extract only the metadata
mcols(gr)
DataFrame with 3 rows and 1 column
      score
  <integer>
1         1
2         2
3         3
# Create a GRanges object with metadata directly
GRanges(seqnames = c("chr2", "chr1"),
        ranges = IRanges(start = c(90, 5), 
                         end = c(100, 15)),
        strand = c("-", "+"),
        extra_column = c("A", "B"))
GRanges object with 2 ranges and 1 metadata column:
      seqnames    ranges strand | extra_column
         <Rle> <IRanges>  <Rle> |  <character>
  [1]     chr2    90-100      - |            A
  [2]     chr1      5-15      + |            B
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Exercises

In practice, the locations of known genomic features (exons, genes, transcripts, UTRs, …) along the genome are often stored in so called gtf (gene transfer format) files. These files can be imported into a GRanges object using the import() function from the rtracklayer Bioconductor package.

  • Download the gencode.v28.annotation.1.1.10M.gtf file from the course web page, and read it into R in this way.
BiocManager::install("rtracklayer")
Bioconductor version 3.22 (BiocManager 1.30.27), R 4.5.2 (2025-10-31)
Warning: package(s) not installed when version(s) same as or greater than current; use
  `force = TRUE` to re-install: 'rtracklayer'
library(rtracklayer)

Attaching package: 'rtracklayer'
The following object is masked from 'package:igraph':

    blocks
gtf <- rtracklayer::import("data/02/gencode.v28.annotation.1.1.10M.gtf")
  • Confirm that the gtf object is indeed a GRanges object
class(gtf)
[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"
  • How many records are there in the object?
length(gtf)
[1] 17955
  • From which chromosomes do the records come?
seqlevels(gtf)
[1] "chr1"
  • Use the subset function to extract only the records for the CHD5 gene. How many annotated transcripts does this gene have?
gtfsub <- subset(gtf, gene_name == "CHD5")
length(subset(gtfsub, type == "transcript"))
[1] 6

Lists of genomic ranges

Sometimes, ranges belong together in groups (e.g., exons in a transcript). For this, use a GRangesList:

  • Each group is a GRanges object.
  • All groups share the same genome information.
# Create a GRangesList object to store multiple transcripts (or grouped genomic features)
grl <- GRangesList(
    ## Each transcript is a GRanges object with each exon as individual range:
    
    # Transcript tx1.1 of gene1
    tx1.1 = GRanges(seqnames = "chr1",
                    ranges = IRanges(start = c(50, 125, 188),
                                     end = c(75, 130, 257)),
                    strand = "+",
                    # Metadata columns:
                    symbol = "tx1.1", 
                    transcript = "tx1.1", 
                    gene = "gene1", 
                    exon = paste0("ex1.", 1:3)),
    
    # Transcript tx1.2 of gene1 
    # Alternative transcript of gene1 with a different exon structure
    tx1.2 = GRanges(seqnames = "chr1",
                    ranges = IRanges(start = c(50, 125, 200),
                                     end = c(75, 130, 227)),
                    strand = "+",
                    # Metadata columns:
                    symbol = "tx1.2", 
                    transcript = "tx1.2", 
                    gene = "gene1", 
                    exon = paste0("ex1.", c(1, 2, 4))),
    
    # Transcript tx2.1 of gene2 
    # Transcript from a different gene on a different chromosome
    tx2.1 = GRanges(seqnames = "chr2",
                    ranges = IRanges(start = c(289, 318),
                                     end = c(300, 350)),
                    strand = "-",
                    symbol = "tx2.1", transcript = "tx2.1",
                    gene = "gene2", exon = paste0("ex2.", 1:2)))


# Set the lengths of the chromosomes for boundary checks and operations
seqlengths(grl) <- c(chr1 = 1000, chr2 = 500)

grl
GRangesList object of length 3:
$tx1.1
GRanges object with 3 ranges and 4 metadata columns:
      seqnames    ranges strand |      symbol  transcript        gene
         <Rle> <IRanges>  <Rle> | <character> <character> <character>
  [1]     chr1     50-75      + |       tx1.1       tx1.1       gene1
  [2]     chr1   125-130      + |       tx1.1       tx1.1       gene1
  [3]     chr1   188-257      + |       tx1.1       tx1.1       gene1
             exon
      <character>
  [1]       ex1.1
  [2]       ex1.2
  [3]       ex1.3
  -------
  seqinfo: 2 sequences from an unspecified genome

$tx1.2
GRanges object with 3 ranges and 4 metadata columns:
      seqnames    ranges strand |      symbol  transcript        gene
         <Rle> <IRanges>  <Rle> | <character> <character> <character>
  [1]     chr1     50-75      + |       tx1.2       tx1.2       gene1
  [2]     chr1   125-130      + |       tx1.2       tx1.2       gene1
  [3]     chr1   200-227      + |       tx1.2       tx1.2       gene1
             exon
      <character>
  [1]       ex1.1
  [2]       ex1.2
  [3]       ex1.4
  -------
  seqinfo: 2 sequences from an unspecified genome

$tx2.1
GRanges object with 2 ranges and 4 metadata columns:
      seqnames    ranges strand |      symbol  transcript        gene
         <Rle> <IRanges>  <Rle> | <character> <character> <character>
  [1]     chr2   289-300      - |       tx2.1       tx2.1       gene2
  [2]     chr2   318-350      - |       tx2.1       tx2.1       gene2
             exon
      <character>
  [1]       ex2.1
  [2]       ex2.2
  -------
  seqinfo: 2 sequences from an unspecified genome

The Gviz package can be used to generate a visualization of the first two transcripts (which belong to the same gene):

library(Gviz)

# Create a GeneRegionTrack for transcript tx1.1
g1 <- GeneRegionTrack(
    grl$tx1.1,  
    showId = TRUE,  # Show exon IDs on the plot
    col = NULL,  # No outline color for exons (transparent borders)
    fill = "#377EB8",  # Fill color for exons (blue)
    name = "",  # No track name (empty string)
    background.title = "transparent",  # Transparent background for the track title
    col.title = "black"  # Color for the track title text
)

# Create a GeneRegionTrack for transcript tx1.2 
g2 <- GeneRegionTrack(
    grl$tx1.2, 
    showId = TRUE, 
    col = NULL,  
    fill = "#E41A1C", 
    name = "",  
    background.title = "transparent",  
    col.title = "black"  
)

# Create a GenomeAxisTrack 
# This adds a genomic axis (chromosome scale) to the plot
gtr <- GenomeAxisTrack()

# Combine the tracks into a list
# Order matters: The GenomeAxisTrack is plotted first (bottom), followed by g1 and g2
tracks <- c(gtr, g1, g2)

# Plot the tracks 
# This generates the visualization with all three tracks
plotTracks(tracks)

SummarizedExperiment

The SummarizedExperiment class (SummarizedExperiment package) stores:

  • Your experiment data in a ‘rectangular’ data.
  • Annotations and metadata for rows and columns.

It’s widely used in Bioconductor (e.g., for RNA-seq, ChIP-seq, and more) and is the foundation for specialized classes like SingleCellExperiment and DESeqDataSet.

From https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html ### Structure

  • The SummarizedExperiment object consists of three types of ‘tables’, denoted
    • assays (actually a list of matrices of the same size, containing the observed data, e.g., gene expression values)
    • rowData (annotations for the rows, i.e., the features). This can be replaced by rowRanges (a GRanges object) that can also hold genomic range information, e.g., the locations of genes
    • colData (annotations for the columns, i.e., the samples)
library(SummarizedExperiment)
se <- SummarizedExperiment(
  
        # Assays: The observed data
        assays = list(exprs = matrix(10 * runif(15), 5, 3)),
        
        # Column metadata
        colData = DataFrame(   sample =  paste0("Sample", 1:3), 
                            condition = c("A", "A", "B")),
        
        # Row metadata
        rowData = DataFrame(gene = paste0("G", 1:5)))

rownames(se) <- rowData(se)$gene
colnames(se) <- colData(se)$sample
se
class: SummarizedExperiment 
dim: 5 3 
metadata(0):
assays(1): exprs
rownames(5): G1 G2 G3 G4 G5
rowData names(1): gene
colnames(3): Sample1 Sample2 Sample3
colData names(2): sample condition

Accessor functions

The SummarizedExperiment class has accessor functions to extract the individual components.

# Vector of assay names 
assayNames(se) # Output: "exprs"
[1] "exprs"
# matrix of expression values
assay(se, "exprs")
    Sample1   Sample2  Sample3
G1 2.595406 0.3962741 7.027536
G2 1.411956 8.4105452 7.270798
G3 1.699197 2.8136119 4.505360
G4 9.351497 1.7157312 6.343644
G5 1.267353 7.6316145 5.252465
# column data (sample metadata)
colData(se)
DataFrame with 3 rows and 2 columns
             sample   condition
        <character> <character>
Sample1     Sample1           A
Sample2     Sample2           A
Sample3     Sample3           B
# row data (feature metadata)
rowData(se)
DataFrame with 5 rows and 1 column
          gene
   <character>
G1          G1
G2          G2
G3          G3
G4          G4
G5          G5

There are also convenient subsetting functions. When you subset a SummarizedExperiment object, all parts (assays, row data, and column data) are automatically subset together.

# Subset first 2 rows (genes)
sesub <- se[1:2, ]

# Expression matrix for subset
assay(sesub, "exprs")
    Sample1   Sample2  Sample3
G1 2.595406 0.3962741 7.027536
G2 1.411956 8.4105452 7.270798
# column metadata for subset
colData(sesub)
DataFrame with 3 rows and 2 columns
             sample   condition
        <character> <character>
Sample1     Sample1           A
Sample2     Sample2           A
Sample3     Sample3           B
# Row metadata for subset
rowData(sesub)
DataFrame with 2 rows and 1 column
          gene
   <character>
G1          G1
G2          G2

Exercises

Here we will explore a real RNA-seq data set, which is provided in the airway Bioconductor package and stored in a RangedSummarizedExperiment object.

  • Load the package and attach the data set.
library(airway)
data(airway)
airway
class: RangedSummarizedExperiment 
dim: 63677 8 
metadata(1): ''
assays(1): counts
rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
  ENSG00000273493
rowData names(10): gene_id gene_name ... seq_coord_system symbol
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample
  • What type of data value are available in this object? How many samples are there? How many features?
assayNames(airway)
[1] "counts"
dim(airway)
[1] 63677     8
  • What is the total number of sequencing reads assigned to genes for each sample?
colSums(assay(airway, "counts"))
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 
  20637971   18809481   25348649   15163415   24448408   30818215   19126151 
SRR1039521 
  21164133 

Class inheritance

Classes can build on top of other classes, like adding features to a basic model.

How it works: * Start with a basic class (e.g., SummarizedExperiment). * Extend it to create a new class with extra features (e.g., SingleCellExperiment).

The new class keeps all the features of the original class. You don’t have to rewrite what’s already there.

library(SingleCellExperiment)
getClass("SingleCellExperiment")
Class "SingleCellExperiment" [package "SingleCellExperiment"]

Slots:
                                                                
Name:           int_elementMetadata                  int_colData
Class:                    DataFrame                    DataFrame
                                                                
Name:                  int_metadata                    rowRanges
Class:                         list GenomicRanges_OR_GRangesList
                                                                
Name:                       colData                       assays
Class:                    DataFrame               Assays_OR_NULL
                                                                
Name:                         NAMES              elementMetadata
Class:            character_OR_NULL                    DataFrame
                                   
Name:                      metadata
Class:                         list

Extends: 
Class "RangedSummarizedExperiment", directly
Class "SummarizedExperiment", by class "RangedSummarizedExperiment", distance 2
Class "RectangularData", by class "RangedSummarizedExperiment", distance 3
Class "Vector", by class "RangedSummarizedExperiment", distance 3
Class "Annotated", by class "RangedSummarizedExperiment", distance 4
Class "vector_OR_Vector", by class "RangedSummarizedExperiment", distance 4

Due to inheritance it might not be immediately clear what methods are defined for a given class. For most Bioconductor classes (all so called S4 classes) one way to find out is via the showMethods() function. This lists all functions that can be used with SingleCellExperiment, including methods inherited from parent classes.

showMethods(class = "SingleCellExperiment")

Biostrings

Biostrings is a package containing infrastructure for dealing with biological sequences.

  • XString: Represents a single sequence (e.g., DNAString, RNAString).
  • XStringSet: Represents multiple sequences (e.g., DNAStringSet, RNAStringSet).

The X stands for the type of sequence: DNA, RNA, amino acids (AA), or raw bases (B).

library(Biostrings)

Structure

To create a DNAString object, we can use the DNAString() constructor function. To see how the function works, type ?DNAString.

(dna <- DNAString(x = "AGGCATAGA"))
9-letter DNAString object
seq: AGGCATAGA

The classes automatically check if sequences are valid. For example, a DNA string can only contain valid DNA letters (including ambiguous bases like N, R, Y, etc., as defined by the IUPAC code.

alphabet(DNAString())
 [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" "."
IUPAC_CODE_MAP
     A      C      G      T      M      R      W      S      Y      K      V 
   "A"    "C"    "G"    "T"   "AC"   "AG"   "AT"   "CG"   "CT"   "GT"  "ACG" 
     H      D      B      N 
 "ACT"  "AGT"  "CGT" "ACGT" 
## Invalid DNA string
DNAString(x = "EQI")
Error in `.Call2()`:
! key 69 (char 'E') not in lookup table
alphabet(AAString())
 [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y"
[20] "V" "U" "O" "B" "J" "Z" "X" "*" "-" "+" "."
AAString(x = "EQI")
3-letter AAString object
seq: EQI

Operation

Once we have defined the object, we can perform operations like:

  • Reverse complement of the DNA sequence
reverseComplement(dna)
9-letter DNAString object
seq: TCTATGCCT

Not possible with a regular character string

reverseComplement("AGGCATAGA")
Error:
! unable to find an inherited method for function 'reverseComplement' for signature 'x = "character"'
  • Translating the DNA sequence to an amino acid sequence
translate(dna)
3-letter AAString object
seq: RHR

We can not translate an amino acid sequence:

translate(AAString(x = "EQI"))
Error:
! unable to find an inherited method for function 'translate' for signature 'x = "AAString"'

If we have more than one sequence, we can represent them in a DNAStringSet object, which is a collection of DNAString objects

(dna_multiple <- DNAStringSet(c(x = "AGGCATAGA", y = "GGTATGAG")))
DNAStringSet object of length 2:
    width seq                                               names               
[1]     9 AGGCATAGA                                         x
[2]     8 GGTATGAG                                          y
## Subsetting with [] gives back a DNAStringSet
dna_multiple[1]
DNAStringSet object of length 1:
    width seq                                               names               
[1]     9 AGGCATAGA                                         x
## Subsetting with [[]] extracts the individual objects
dna_multiple[[1]]
9-letter DNAString object
seq: AGGCATAGA

It is also easy to extract subsequences of either an XString or an XStringSet object

## Get the length of each string
width(dna_multiple)
[1] 9 8
## Extract the first three positions in each string
Biostrings::subseq(dna_multiple, start = 1, end = 3)
DNAStringSet object of length 2:
    width seq                                               names               
[1]     3 AGG                                               x
[2]     3 GGT                                               y
## Extract subsequences of different length
Biostrings::subseq(dna_multiple, start = 1, end = c(2, 5))
DNAStringSet object of length 2:
    width seq                                               names               
[1]     2 AG                                                x
[2]     5 GGTAT                                             y
Biostrings::subseq(dna_multiple, start = 1, 
                   end = width(dna_multiple) - 4)
DNAStringSet object of length 2:
    width seq                                               names               
[1]     5 AGGCA                                             x
[2]     4 GGTA                                              y

Exercises

  • Can you tell whether AATTGGCCRGGCCAATT is a valid DNA sequence?
Biostrings::DNAString("AATTGGCCRGGCCAATT")
17-letter DNAString object
seq: AATTGGCCRGGCCAATT

Genomic sequence information is often stored in fasta files. DNAStringSet objects can be created from such files by the readDNAStringSet() function from the Biostrings package. Download the gencode.v28.transcripts.1.1.10M.fa file from the course web page and read it into a DNAStringSet object using this function. What do you think these sequences represent?

txs <- Biostrings::readDNAStringSet("data/02/gencode.v28.transcripts.1.1.10M.fa")
  • How many sequences are there in this file?
length(txs)
[1] 1373
  • What are the lengths of the shortest and longest sequence?
range(width(txs))
[1]    59 11666
  • Extract the first 10 bases of each sequence
Biostrings::subseq(txs, start = 1, end = 10)
DNAStringSet object of length 1373:
       width seq                                            names               
   [1]    10 GTTAACTTGC                                     ENST00000456328.2...
   [2]    10 GTGTCTGACT                                     ENST00000450305.2...
   [3]    10 ATGGGAGCCG                                     ENST00000488147.1...
   [4]    10 TGTGGGAGAG                                     ENST00000619216.1...
   [5]    10 GTGCACACGG                                     ENST00000473358.1...
   ...   ... ...
[1369]    10 ATGATTCACT                                     ENST00000636827.2...
[1370]    10 AGCATATTCT                                     ENST00000578045.1...
[1371]    10 GCTAAAATAA                                     ENST00000413148.1...
[1372]    10 GTCCCGGCCC                                     ENST00000315901.4...
[1373]    10 GAGCCTCCGG                                     ENST00000294435.7...

Annotation resources

Bioconductor provides tools to access different types of biological data:

  • Gene information for specific organisms (e.g., OrgDb).
  • Transcript locations for specific genome versions (e.g., TxDb, EnsDb).
  • Genome sequences for specific genome versions (e.g., BSgenome).

How to Access These Resources * Annotation Packages: Updated every 6 months with the Bioconductor releases * Online Resources: pulling down information from online resourceslike AnnotationHub.

As an example lets have a look at a human transcript annotation package TxDb.Hsapiens.UCSC.hg38.knownGene. As the name suggest, the annotation is derived from UCSC “knownGene” track for the hg38 genome build.

library(TxDb.Hsapiens.UCSC.hg38.knownGene)

Two accessor functions are transcripts() and transcriptsBy(). Lets compare their output:

# Extract all transcripts from the TxDb database as a single GRanges object
gr <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
gr
GRanges object with 412044 ranges and 2 metadata columns:
                      seqnames        ranges strand |     tx_id
                         <Rle>     <IRanges>  <Rle> | <integer>
       [1]                chr1   11121-14413      + |         1
       [2]                chr1   11125-14405      + |         2
       [3]                chr1   11410-14413      + |         3
       [4]                chr1   11411-14413      + |         4
       [5]                chr1   11426-14409      + |         5
       ...                 ...           ...    ... .       ...
  [412040] chrX_MU273397v1_alt 239036-260095      - |    412040
  [412041] chrX_MU273397v1_alt 272358-282686      - |    412041
  [412042] chrX_MU273397v1_alt 314193-316302      - |    412042
  [412043] chrX_MU273397v1_alt 314813-315236      - |    412043
  [412044] chrX_MU273397v1_alt 324527-324923      - |    412044
                     tx_name
                 <character>
       [1] ENST00000832824.1
       [2] ENST00000832825.1
       [3] ENST00000832826.1
       [4] ENST00000832827.1
       [5] ENST00000832828.1
       ...               ...
  [412040] ENST00000710260.1
  [412041] ENST00000710028.1
  [412042] ENST00000710030.1
  [412043] ENST00000710216.1
  [412044] ENST00000710031.1
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome
# Extract transcripts grouped by gene as a GRangesList object
grl <- transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene, by = "gene")
grl
GRangesList object of length 37525:
$`1`
GRanges object with 10 ranges and 2 metadata columns:
       seqnames            ranges strand |     tx_id           tx_name
          <Rle>         <IRanges>  <Rle> | <integer>       <character>
   [1]    chr19 58345178-58347634      - |    349232 ENST00000596924.1
   [2]    chr19 58345183-58353492      - |    349233 ENST00000263100.8
   [3]    chr19 58345183-58353492      - |    349234 ENST00000850949.1
   [4]    chr19 58345183-58353492      - |    349235 ENST00000850950.1
   [5]    chr19 58346854-58356225      - |    349236 ENST00000600123.5
   [6]    chr19 58346858-58353491      - |    349237 ENST00000595014.1
   [7]    chr19 58346860-58347657      - |    349238 ENST00000598345.2
   [8]    chr19 58348466-58362751      - |    349239 ENST00000599109.5
   [9]    chr19 58350594-58353129      - |    349240 ENST00000600966.1
  [10]    chr19 58353021-58356083      - |    349241 ENST00000596636.1
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

$`10`
GRanges object with 6 ranges and 2 metadata columns:
      seqnames            ranges strand |     tx_id           tx_name
         <Rle>         <IRanges>  <Rle> | <integer>       <character>
  [1]     chr8 18386273-18388323      + |    166654 ENST00000660285.2
  [2]     chr8 18386315-18388318      + |    166655 ENST00000739947.1
  [3]     chr8 18386348-18388323      + |    166656 ENST00000739948.1
  [4]     chr8 18386783-18388318      + |    166659 ENST00000739951.1
  [5]     chr8 18391282-18401218      + |    166661 ENST00000286479.4
  [6]     chr8 18391287-18400993      + |    166662 ENST00000520116.1
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

...
<37523 more elements>
  • Functions like exons(), transcripts() or genes() return GRanges objects with a set of ranges representing the level of detail of the query.

  • The “xxxBy” functions, e.g. transcriptsBy() or exonsBy(), return GRangesList objects. They enable grouping of the output in terms of another genomic feature, in our case genes.

Transcript annotations and their functionality play an important role when analyzing gene expression data. The coming lectures on RNA-Seq will introduce more functionality in the context of examples.

Exercises

The OrgDb package org.Hs.eg.db provides human gene annotation. It offers a simple interface to an underlying SQLite database. The main accessor functions for this (and other SQLite) annotation objects are select(), columns() and keys().

  • Install the org.Hs.eg.db package, load it and check its help page. Detailed examples on how to work with these annotation objects can be found on the linked AnnotationDb class help page.
library(org.Hs.eg.db)
?org.Hs.eg.db
  • What kind of data can be retrieved from the org.Hs.eg.db object? Check the columns() function!
columns(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
[21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[26] "UNIPROT"     
  • Use the select() function to retrieve the gene names as well as ENTREZ and ENSEMBL gene IDs for the three genes: RAF1, FLT3 and MAPK1.
select(org.Hs.eg.db, keys = c("RAF1", "FLT3", "MAPK1"), 
       keytype = "SYMBOL", columns = c("ENTREZID", "ENSEMBL", "GENENAME"))
'select()' returned 1:1 mapping between keys and columns
  SYMBOL ENTREZID         ENSEMBL                                      GENENAME
1   RAF1     5894 ENSG00000132155 Raf-1 proto-oncogene, serine/threonine kinase
2   FLT3     2322 ENSG00000122025        fms related receptor tyrosine kinase 3
3  MAPK1     5594 ENSG00000100030            mitogen-activated protein kinase 1

Further reading