Introduction to Bioconductor and standard classes

Author

Charlotte Soneson

Published

March 6, 2024

What is Bioconductor?

  • Covers a wide range of domains:
    • sequencing (RNA-seq, ChIP-seq, single-cell analysis, variant calling, …)
    • microarrays (methylation, gene expression, copy number, …)
    • flow and mass cytometry
    • proteomics
    • imaging
  • There is a core team which, among other things, oversees the project, coordinates review of new package submissions, develops and maintains core infrastructure and maintains daily building and checking of all packages. Most packages, however, are contributed by the community.
  • There is an active Bioconductor community, where you can find help, support and people with shared interests. For example, check the website, join the slack workspace, or attend one of the yearly conferences (BioC, EuroBioC, Bioc Asia).

How to install Bioconductor packages

  • Detailed instructions are available online.
  • There is a new Bioconductor release twice per year (typically April and October).
  • Between those times, development is going on in the parallel devel branch.
  • Do not mix packages from different releases!
  • The available Bioconductor releases also depend on your version of R - if you want the latest Bioconductor packages, you need the latest version of R.
  • Bioconductor packages are installed using the BiocManager CRAN package (note that this works also for packages hosted on CRAN or GitHub):
install.packages("BiocManager")
BiocManager::install("limma")
  • We can check the validity of our Bioconductor installation with BiocManager (e.g., to see if there are outdated packages):
BiocManager::valid()
  • We can also see which version is currently installed:
BiocManager::version()
[1] '3.18'
  • BiocManager can be used to search for available packages. Here, we look for package names containing the word “Mmusculus” (assuming that we are interested in looking for packages containing annotation information for mouse):
BiocManager::available(pattern = "Mmusculus")
 [1] "BSgenome.Mmusculus.UCSC.mm10"        "BSgenome.Mmusculus.UCSC.mm10.masked"
 [3] "BSgenome.Mmusculus.UCSC.mm39"        "BSgenome.Mmusculus.UCSC.mm8"        
 [5] "BSgenome.Mmusculus.UCSC.mm8.masked"  "BSgenome.Mmusculus.UCSC.mm9"        
 [7] "BSgenome.Mmusculus.UCSC.mm9.masked"  "EnsDb.Mmusculus.v75"                
 [9] "EnsDb.Mmusculus.v79"                 "PWMEnrich.Mmusculus.background"     
[11] "TxDb.Mmusculus.UCSC.mm10.ensGene"    "TxDb.Mmusculus.UCSC.mm10.knownGene" 
[13] "TxDb.Mmusculus.UCSC.mm39.knownGene"  "TxDb.Mmusculus.UCSC.mm39.refGene"   
[15] "TxDb.Mmusculus.UCSC.mm9.knownGene"  
  • Each Bioconductor package has a landing page (e.g., https://bioconductor.org/packages/limma/) that contains relevant information, including installation instructions, vignettes and the reference manual:

How to get help with Bioconductor packages

  • Each function has a help page, detailing all arguments and outputs, and containing runnable examples.
library(limma)
?lmFit
  • Each package comes with at least one vignette, outlining the intended use of the functions in the package to analyze data. The vignettes can be browsed from the landing page, or directly in your R session, e.g. with
browseVignettes(package = "Biostrings")
  • If you need help with a Bioconductor package:
    • Check the function help pages and the vignette(s) to see whether the answer is there (some packages, like DESeq2, have FAQ sections in their vignette)
    • Go to https://support.bioconductor.org/
    • Search to see whether your question has already been asked (and answered)
    • If not, look through the Posting guide and post your question following these guidelines
    • If you follow these steps, you will often get a rapid and informative answer
    • The slack workspace has channels covering many topics related to Bioconductor
  • The Workflows are complete walkthroughs of specific types of analyses, typically making use of multiple (Bioconductor) packages

Standard Bioconductor data structures

  • Every object in R is of a particular data type.
  • You have already seen some data types, e.g. atomic vectors (numeric, character etc), list, matrix, data.frame, factor.
  • (Almost) all objects in R have a class. A class is a formal description of the data type, its properties, functionality and relation to other types .
  • The class function tells you which class(es) an object is of:
class(c("one", "two", "three"))
[1] "character"
class(matrix(1:6, 2, 3))
[1] "matrix" "array" 
  • The class of an object defines the set of allowed operations.
relevel(c("one", "two", "three"), ref = "one")
Error in relevel.default(c("one", "two", "three"), ref = "one"): 'relevel' only for (unordered) factors
relevel(factor(c("one", "two", "three")), ref = "two")
[1] one   two   three
Levels: two one three
c("1", "2", "3") + 1
Error in c("1", "2", "3") + 1: non-numeric argument to binary operator
as.numeric(c("1", "2", "3")) + 1
[1] 2 3 4
  • When working with (biological) data, it often makes sense to define more complex data structures, with several building blocks, and to define appropriate methods for each such class.

What types of data do we have to deal with?

  • Sequences (strings)
  • Alignments
  • ‘Rectangular’ data (matrices of numeric values for a set of features and samples)
  • Genomic ranges (intervals)
  • Genomic variants
  • Annotation databases
  • Images

For each of these data types, Bioconductor provides standard classes.

There are often many ways of creating an object of a given class, including:

  • reading data from a file (e.g. the output from an instrument) using specialized import functions from Bioconductor or other sources. This is perhaps the most common way of creating objects, and we will see many examples during the course. For example, think of how you can generate a data.frame by reading data from a text file with read.delim().
  • loading processed data or annotations provided via (Bioconductor) packages.
  • directly creating the object using a constructor function (which often has the same name as the object class). In this case, all the information that is needed for a particular type of object must be provided manually. This is the easiest way to create small example objects, and we will see examples of this below. This is also how the small objects in the previous section were created.

Why use standardized classes?

  • Interoperability - having standard ways of storing and moving data between packages means we don’t have to implement the same thing in multiple places. Moreover, it allows us as developers to write functions that are applicable to a very broad set of data objects, since we can make assumptions about the content and structure of these objects (see for example the iSEE package).
  • Recognition - you don’t have to learn a new representation for each package.
  • Robustness - standard classes are efficiently implemented and well tested, and come with additional functions to check the validity of provided data.

As an example, all these Bioconductor packages make use of the SummarizedExperiment class:

In the following we will examine a few standard Bioconductor classes in more detail. Of course, there are many more, and some of these will be covered in later lectures.

GenomicRanges

  • The GenomicRanges package holds infrastructure for working with (genomic) ranges.
  • These can represent, for example, gene coordinates, ChIP-seq peaks, promoters, SNPs, CpG islands.
  • A GRanges (genomic ranges) object contains, for each range, information about
    • the chromosome (called sequence)
    • the interval (start and end)
    • the strand (either +, - or *)
  • The interval is, in turn, represented by a so called IRanges (interval ranges) object.
  • Many of the operations shown below apply equally well to IRanges objects.
library(GenomicRanges)
  • Let’s create a small IRanges object. We can specify any two of the start, end and width of the ranges, and the third one will be inferred.
(ir <- IRanges(start = c(3, 10, 50), 
               end = c(17, 28, 104)))
IRanges object with 3 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         3        17        15
  [2]        10        28        19
  [3]        50       104        55
## Equivalent
(ir <- IRanges(start = c(3, 10, 50), 
               width = c(15, 19, 55)))
IRanges object with 3 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         3        17        15
  [2]        10        28        19
  [3]        50       104        55
  • The IRanges object just represents ranges on the number line. There is no specifically genomic information in there - that’s added when we construct a GRanges object
(gr <- GRanges(seqnames = c("chr1", "chr1", "chr2"), 
               ranges = ir,
               strand = c("+", "+", "-")))
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  [3]     chr2    50-104      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
  • Many operations are defined on the ranges objects. As a first example, it is easy to extract the individual components of the object using accessor functions
start(gr)
[1]  3 10 50
end(gr)
[1]  17  28 104
width(gr)
[1] 15 19 55
  • We can also get and set information about the reference sequences - there are many different conventions for naming chromosomes!
seqlevelsStyle(gr)
[1] "UCSC"
seqlevelsStyle(gr) <- "Ensembl"
gr
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1      3-17      +
  [2]        1     10-28      +
  [3]        2    50-104      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
seqlevelsStyle(gr) <- "UCSC"
  • There are many ways of subsetting ranges objects (check out methods(class = "GRanges")). For example, we can use the regular [ function (the authors of the GRanges class has defined what this function means for such an object), or we can keep only records on specific chromosomes.
gr[1:2]
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
dropSeqlevels(gr, "chr1", pruning.mode = "coarse")
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2    50-104      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
keepSeqlevels(gr, "chr1", pruning.mode = "coarse")
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
gr[seqnames(gr) == "chr1"]
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
keepStandardChromosomes(gr, pruning.mode = "coarse")
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  [3]     chr2    50-104      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
  • Many other operations can be applied to each of the ranges in a GRanges or IRanges object, such as extending all ranges with a certain number of bases in each end
gr + 5
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     -2-22      +
  [2]     chr1      5-33      +
  [3]     chr2    45-109      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Note how this led to a negative start coordinate for the first range. This is not considered a violation since there is no information in the GRanges object which will tell R where the chromosomes (sequences) start and end. To accommodate this, let’s set the sequence lengths to 100 and 200 nt, respectively.

seqlengths(gr)
chr1 chr2 
  NA   NA 
seqlengths(gr) <- c(chr1 = 100, chr2 = 200)
seqlengths(gr)
chr1 chr2 
 100  200 
gr + 5
Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence chr1.
  Note that ranges located on a sequence whose length is unknown (NA) or
  on a circular sequence are not considered out-of-bound (use
  seqlengths() and isCircular() to get the lengths and circularity flags
  of the underlying sequences). You can use trim() to trim these ranges.
  See ?`trim,GenomicRanges-method` for more information.
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     -2-22      +
  [2]     chr1      5-33      +
  [3]     chr2    45-109      -
  -------
  seqinfo: 2 sequences from an unspecified genome

Note how we now get a warning that one of the ranges is extending outside of the reference sequences. We can further use the trim() function to remove all parts of ranges that fall outside the boundaries of the reference sequences (this would not do anything if we hadn’t set the sequence lengths).

trim(gr + 5)
Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence chr1.
  Note that ranges located on a sequence whose length is unknown (NA) or
  on a circular sequence are not considered out-of-bound (use
  seqlengths() and isCircular() to get the lengths and circularity flags
  of the underlying sequences). You can use trim() to trim these ranges.
  See ?`trim,GenomicRanges-method` for more information.
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      1-22      +
  [2]     chr1      5-33      +
  [3]     chr2    45-109      -
  -------
  seqinfo: 2 sequences from an unspecified genome

We can also set a name for the genome that the ranges are supposed to come from. This can be useful e.g. when comparing different sets of ranges, to make sure that they all correspond to the same underlying genome.

genome(gr) <- "hg19"
gr
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  [3]     chr2    50-104      -
  -------
  seqinfo: 2 sequences from hg19 genome

Other useful operations on single ranges include:

  • shifting
shift(gr, 5)
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      8-22      +
  [2]     chr1     15-33      +
  [3]     chr2    55-109      -
  -------
  seqinfo: 2 sequences from hg19 genome
  • obtaining flanking sequence. Note how the strand information is taken into account.
flank(gr, 2, start = TRUE, both = FALSE)
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1       1-2      +
  [2]     chr1       8-9      +
  [3]     chr2   105-106      -
  -------
  seqinfo: 2 sequences from hg19 genome
  • finding the midpoint of each range
resize(gr, width = 1, fix = "center")
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1        10      +
  [2]     chr1        19      +
  [3]     chr2        77      -
  -------
  seqinfo: 2 sequences from hg19 genome

In addition to the intra-range operations, there are also many useful operations that involve multiple ranges. For example, we can:

  • find overlaps between ranges
gr2 <- GRanges(seqnames = c("chr2", "chr1"),
               ranges = IRanges(start = c(90, 5), 
                                end = c(100, 15)),
               strand = c("-", "+"))
findOverlaps(query = gr, subject = gr2)
Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           2
  [2]         2           2
  [3]         3           1
  -------
  queryLength: 3 / subjectLength: 2
  • GRanges objects can have associated metadata. This is provided in the mcols() slot of the object.
mcols(gr)$score <- 1:3
gr
GRanges object with 3 ranges and 1 metadata column:
      seqnames    ranges strand |     score
         <Rle> <IRanges>  <Rle> | <integer>
  [1]     chr1      3-17      + |         1
  [2]     chr1     10-28      + |         2
  [3]     chr2    50-104      - |         3
  -------
  seqinfo: 2 sequences from hg19 genome
## Extract only the metadata
mcols(gr)
DataFrame with 3 rows and 1 column
      score
  <integer>
1         1
2         2
3         3
## We can also supply the metadata when creating the object
GRanges(seqnames = c("chr2", "chr1"),
        ranges = IRanges(start = c(90, 5), 
                         end = c(100, 15)),
        strand = c("-", "+"),
        extra_column = c("A", "B"))
GRanges object with 2 ranges and 1 metadata column:
      seqnames    ranges strand | extra_column
         <Rle> <IRanges>  <Rle> |  <character>
  [1]     chr2    90-100      - |            A
  [2]     chr1      5-15      + |            B
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Exercises

In practice, the locations of known genomic features (exons, genes, transcripts, UTRs, …) along the genome are often stored in so called gtf (gene transfer format) files. These files can be imported into a GRanges object using the import() function from the rtracklayer Bioconductor package.

  • Download the gencode.v28.annotation.1.1.10M.gtf file from the course web page, and read it into R in this way.
library(rtracklayer)

Attaching package: 'rtracklayer'
The following object is masked from 'package:igraph':

    blocks
gtf <- rtracklayer::import("data/02/gencode.v28.annotation.1.1.10M.gtf")
  • Confirm that the gtf object is indeed a GRanges object
class(gtf)
[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"
  • How many records are there in the object?
length(gtf)
[1] 17955
  • From which chromosomes do the records come?
seqlevels(gtf)
[1] "chr1"
  • Use the subset function to extract only the records for the CHD5 gene. How many annotated transcripts does this gene have?
gtfsub <- subset(gtf, gene_name == "CHD5")
length(subset(gtfsub, type == "transcript"))
[1] 6

Lists of genomic ranges

  • In some cases, ranges come in groups - for example, exons within a transcript.
  • In these situations, it often makes sense to represent each group of ranges as a GRanges object, and the full set of groups as a list of such ranges.
  • The GRangesList class covers these situations.
  • All ranges objects share the same genome information
grl <- GRangesList(
    tx1.1 = GRanges(seqnames = "chr1",
                    ranges = IRanges(start = c(50, 125, 188),
                                     end = c(75, 130, 257)),
                    strand = "+",
                    symbol = "tx1.1", transcript = "tx1.1", 
                    gene = "gene1", exon = paste0("ex1.", 1:3)),
    tx1.2 = GRanges(seqnames = "chr1",
                    ranges = IRanges(start = c(50, 125, 200),
                                     end = c(75, 130, 227)),
                    strand = "+",
                    symbol = "tx1.2", transcript = "tx1.2", 
                    gene = "gene1", exon = paste0("ex1.", c(1, 2, 4))),
    tx2.1 = GRanges(seqnames = "chr2",
                    ranges = IRanges(start = c(289, 318),
                                     end = c(300, 350)),
                    strand = "-",
                    symbol = "tx2.1", transcript = "tx2.1",
                    gene = "gene2", exon = paste0("ex2.", 1:2)))
seqlengths(grl) <- c(chr1 = 1000, chr2 = 500)
grl
GRangesList object of length 3:
$tx1.1
GRanges object with 3 ranges and 4 metadata columns:
      seqnames    ranges strand |      symbol  transcript        gene
         <Rle> <IRanges>  <Rle> | <character> <character> <character>
  [1]     chr1     50-75      + |       tx1.1       tx1.1       gene1
  [2]     chr1   125-130      + |       tx1.1       tx1.1       gene1
  [3]     chr1   188-257      + |       tx1.1       tx1.1       gene1
             exon
      <character>
  [1]       ex1.1
  [2]       ex1.2
  [3]       ex1.3
  -------
  seqinfo: 2 sequences from an unspecified genome

$tx1.2
GRanges object with 3 ranges and 4 metadata columns:
      seqnames    ranges strand |      symbol  transcript        gene
         <Rle> <IRanges>  <Rle> | <character> <character> <character>
  [1]     chr1     50-75      + |       tx1.2       tx1.2       gene1
  [2]     chr1   125-130      + |       tx1.2       tx1.2       gene1
  [3]     chr1   200-227      + |       tx1.2       tx1.2       gene1
             exon
      <character>
  [1]       ex1.1
  [2]       ex1.2
  [3]       ex1.4
  -------
  seqinfo: 2 sequences from an unspecified genome

$tx2.1
GRanges object with 2 ranges and 4 metadata columns:
      seqnames    ranges strand |      symbol  transcript        gene
         <Rle> <IRanges>  <Rle> | <character> <character> <character>
  [1]     chr2   289-300      - |       tx2.1       tx2.1       gene2
  [2]     chr2   318-350      - |       tx2.1       tx2.1       gene2
             exon
      <character>
  [1]       ex2.1
  [2]       ex2.2
  -------
  seqinfo: 2 sequences from an unspecified genome

The Gviz package can be used to generate a visualization of the first two transcripts (which belong to the same gene):

library(Gviz)
g1 <- GeneRegionTrack(grl$tx1.1, showId = TRUE, col = NULL,
                      fill = "#377EB8", name = "",
                      background.title = "transparent",
                      col.title = "black")
g2 <- GeneRegionTrack(grl$tx1.2, showId = TRUE, col = NULL,
                      fill = "#E41A1C", name = "",
                      background.title = "transparent",
                      col.title = "black")
gtr <- GenomeAxisTrack()
tracks <- c(gtr, g1, g2)
plotTracks(tracks)

SummarizedExperiment

  • The purpose of the SummarizedExperiment class (provided in the package with the same name) is to hold ‘rectangular’ data, together with annotations and metadata for the rows and columns. In other words, all the data associated with an experiment!
  • As we already saw above, this class (or extensions like SingleCellExperiment or DESeqDataSet) is widely used throughout the Bioconductor ecosystem, e.g., for representing RNA-seq, ChIP-seq, methylation and mass cytometry data.

From https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html
  • The SummarizedExperiment object consists of three types of ‘tables’, denoted
    • assays (actually a list of matrices of the same size, containing the observed data, e.g., gene expression values)
    • rowData (annotations for the rows, i.e., the features). This can be replaced by rowRanges (a GRanges object) that can also hold genomic range information, e.g., the locations of genes
    • colData (annotations for the columns, i.e., the samples)
library(SummarizedExperiment)
se <- SummarizedExperiment(
        assays = list(exprs = matrix(10 * runif(15), 5, 3)),
        colData = DataFrame(sample =  paste0("Sample", 1:3), 
                            condition = c("A", "A", "B")),
        rowData = DataFrame(gene = paste0("G", 1:5)))
rownames(se) <- rowData(se)$gene
colnames(se) <- colData(se)$sample
se
class: SummarizedExperiment 
dim: 5 3 
metadata(0):
assays(1): exprs
rownames(5): G1 G2 G3 G4 G5
rowData names(1): gene
colnames(3): Sample1 Sample2 Sample3
colData names(2): sample condition
  • The SummarizedExperiment class has accessor functions to extract the individual components.
assayNames(se)
[1] "exprs"
assay(se, "exprs")
     Sample1  Sample2  Sample3
G1 7.7808157 8.899195 5.204971
G2 0.2349146 3.069105 5.425215
G3 0.7296682 8.401116 2.993127
G4 3.3820006 3.421748 8.265988
G5 1.4642998 8.343076 2.130548
colData(se)
DataFrame with 3 rows and 2 columns
             sample   condition
        <character> <character>
Sample1     Sample1           A
Sample2     Sample2           A
Sample3     Sample3           B
rowData(se)
DataFrame with 5 rows and 1 column
          gene
   <character>
G1          G1
G2          G2
G3          G3
G4          G4
G5          G5
  • There are also convenient subsetting functions. Note how all the relevant parts of the objects are subset!
sesub <- se[1:2, ]
assay(sesub, "exprs")
     Sample1  Sample2  Sample3
G1 7.7808157 8.899195 5.204971
G2 0.2349146 3.069105 5.425215
colData(sesub)
DataFrame with 3 rows and 2 columns
             sample   condition
        <character> <character>
Sample1     Sample1           A
Sample2     Sample2           A
Sample3     Sample3           B
rowData(sesub)
DataFrame with 2 rows and 1 column
          gene
   <character>
G1          G1
G2          G2
  • If a GRanges object is used to represent the row annotations, the mcols() slot of the GRanges object is returned by rowData(se).
se <- SummarizedExperiment(
    assays = list(exprs = matrix(10 * runif(15), 5, 3)),
    colData = DataFrame(sample =  paste0("Sample", 1:3), 
                        condition = c("A", "A", "B")),
    rowRanges = GRanges(seqnames = "chr1",
                        ranges = IRanges(start = 100 * runif(5),
                                         width = 50),
                        strand = "+",
                        gene = paste0("G", 1:5)))
class(se)
[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"
rowData(se) 
DataFrame with 5 rows and 1 column
         gene
  <character>
1          G1
2          G2
3          G3
4          G4
5          G5
rowRanges(se)
GRanges object with 5 ranges and 1 metadata column:
      seqnames    ranges strand |        gene
         <Rle> <IRanges>  <Rle> | <character>
  [1]     chr1    55-104      + |          G1
  [2]     chr1    87-136      + |          G2
  [3]     chr1    89-138      + |          G3
  [4]     chr1    67-116      + |          G4
  [5]     chr1     25-74      + |          G5
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Exercises

Here we will explore a real RNA-seq data set, which is provided in the airway Bioconductor package and stored in a RangedSummarizedExperiment object.

  • Load the package and attach the data set.
library(airway)
data(airway)
airway
class: RangedSummarizedExperiment 
dim: 63677 8 
metadata(1): ''
assays(1): counts
rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
  ENSG00000273493
rowData names(10): gene_id gene_name ... seq_coord_system symbol
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample
  • What type of data value are available in this object? How many samples are there? How many features?
assayNames(airway)
[1] "counts"
dim(airway)
[1] 63677     8
  • What is the total number of sequencing reads assigned to genes for each sample?
colSums(assay(airway, "counts"))
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 
  20637971   18809481   25348649   15163415   24448408   30818215   19126151 
SRR1039521 
  21164133 

Class inheritance

  • Classes are usually organized into a hierarchy by class inheritance.

  • Basically, it involves taking a ‘standard’ base class that has some of the properties you need, and extending or building upon it to create a new class that fits your purpose.

  • The new class will inherit the attributes and methods from the original class, and thus you don’t need to redefine them for your new class.

  • One example from Bioconductor is the SingleCellExperiment class, which extends the SummarizedExperiment class by, among a few other things, allowing the inclusion of dimension reduction results (e.g. PCA, tSNE, or UMAP).

library(SingleCellExperiment)
getClass("SingleCellExperiment")
Class "SingleCellExperiment" [package "SingleCellExperiment"]

Slots:
                                                                
Name:           int_elementMetadata                  int_colData
Class:                    DataFrame                    DataFrame
                                                                
Name:                  int_metadata                    rowRanges
Class:                         list GenomicRanges_OR_GRangesList
                                                                
Name:                       colData                       assays
Class:                    DataFrame               Assays_OR_NULL
                                                                
Name:                         NAMES              elementMetadata
Class:            character_OR_NULL                    DataFrame
                                   
Name:                      metadata
Class:                         list

Extends: 
Class "RangedSummarizedExperiment", directly
Class "SummarizedExperiment", by class "RangedSummarizedExperiment", distance 2
Class "RectangularData", by class "RangedSummarizedExperiment", distance 3
Class "Vector", by class "RangedSummarizedExperiment", distance 3
Class "Annotated", by class "RangedSummarizedExperiment", distance 4
Class "vector_OR_Vector", by class "RangedSummarizedExperiment", distance 4
getClass("RangedSummarizedExperiment")
Class "RangedSummarizedExperiment" [package "SummarizedExperiment"]

Slots:
                                                                
Name:                     rowRanges                      colData
Class: GenomicRanges_OR_GRangesList                    DataFrame
                                                                
Name:                        assays                        NAMES
Class:               Assays_OR_NULL            character_OR_NULL
                                                                
Name:               elementMetadata                     metadata
Class:                    DataFrame                         list

Extends: 
Class "SummarizedExperiment", directly
Class "RectangularData", by class "SummarizedExperiment", distance 2
Class "Vector", by class "SummarizedExperiment", distance 2
Class "Annotated", by class "SummarizedExperiment", distance 3
Class "vector_OR_Vector", by class "SummarizedExperiment", distance 3
  • Due to inheritance it might not be immediately clear what methods are defined for a given class. For most Bioconductor classes (all so called S4 classes) one way to find out is via the showMethods() function.
showMethods(class = "SingleCellExperiment")
  • A more comprehensive output however is provided by the sloop package. Its s4_methods_class() function also lists the source of inherited methods (make sure that the SingleCellExperiment package has been loaded into your R session):
library(sloop)
s4_methods_class("SingleCellExperiment")
Methods of SingleCellExperiment
generic class visible source
!= SingleCellExperiment TRUE S4Vectors
!= SingleCellExperiment TRUE S4Vectors
!= SingleCellExperiment TRUE S4Vectors
[ SingleCellExperiment TRUE SingleCellExperiment
[[ SingleCellExperiment TRUE SummarizedExperiment
[[<- SingleCellExperiment TRUE SummarizedExperiment
[<- SingleCellExperiment TRUE SingleCellExperiment
[<- SingleCellExperiment TRUE S4Vectors
%in% SingleCellExperiment TRUE S4Vectors
%in% SingleCellExperiment TRUE S4Vectors
%in% SingleCellExperiment TRUE S4Vectors
< SingleCellExperiment TRUE S4Vectors
< SingleCellExperiment TRUE S4Vectors
< SingleCellExperiment TRUE S4Vectors
<= SingleCellExperiment TRUE S4Vectors
<= SingleCellExperiment TRUE S4Vectors
<= SingleCellExperiment TRUE S4Vectors
== SingleCellExperiment TRUE S4Vectors
== SingleCellExperiment TRUE S4Vectors
== SingleCellExperiment TRUE S4Vectors
> SingleCellExperiment TRUE S4Vectors
> SingleCellExperiment TRUE S4Vectors
> SingleCellExperiment TRUE S4Vectors
>= SingleCellExperiment TRUE S4Vectors
>= SingleCellExperiment TRUE S4Vectors
>= SingleCellExperiment TRUE S4Vectors
$ SingleCellExperiment TRUE SummarizedExperiment
$<- SingleCellExperiment TRUE SummarizedExperiment
aggregate SingleCellExperiment TRUE S4Vectors
altExp SingleCellExperiment TRUE SingleCellExperiment
altExp SingleCellExperiment TRUE SingleCellExperiment
altExp SingleCellExperiment TRUE SingleCellExperiment
altExp<- SingleCellExperiment TRUE SingleCellExperiment
altExp<- SingleCellExperiment TRUE SingleCellExperiment
altExp<- SingleCellExperiment TRUE SingleCellExperiment
altExpNames SingleCellExperiment TRUE SingleCellExperiment
altExpNames<- SingleCellExperiment TRUE SingleCellExperiment
altExps SingleCellExperiment TRUE SingleCellExperiment
altExps<- SingleCellExperiment TRUE SingleCellExperiment
anyDuplicated SingleCellExperiment TRUE S4Vectors
anyNA SingleCellExperiment TRUE S4Vectors
append SingleCellExperiment TRUE S4Vectors
as.character SingleCellExperiment TRUE S4Vectors
as.complex SingleCellExperiment TRUE S4Vectors
as.data.frame SingleCellExperiment TRUE S4Vectors
as.env SingleCellExperiment TRUE S4Vectors
as.integer SingleCellExperiment TRUE S4Vectors
as.list SingleCellExperiment TRUE S4Vectors
as.logical SingleCellExperiment TRUE S4Vectors
as.matrix SingleCellExperiment TRUE S4Vectors
as.numeric SingleCellExperiment TRUE S4Vectors
as.raw SingleCellExperiment TRUE S4Vectors
assay SingleCellExperiment TRUE SummarizedExperiment
assay SingleCellExperiment TRUE SummarizedExperiment
assay SingleCellExperiment TRUE SummarizedExperiment
assay<- SingleCellExperiment TRUE SummarizedExperiment
assay<- SingleCellExperiment TRUE SummarizedExperiment
assay<- SingleCellExperiment TRUE SummarizedExperiment
assayNames SingleCellExperiment TRUE SummarizedExperiment
assayNames<- SingleCellExperiment TRUE SummarizedExperiment
assays SingleCellExperiment TRUE SummarizedExperiment
assays<- SingleCellExperiment TRUE SummarizedExperiment
assays<- SingleCellExperiment TRUE SummarizedExperiment
bindROWS SingleCellExperiment TRUE S4Vectors
by SingleCellExperiment TRUE S4Vectors
c SingleCellExperiment TRUE S4Vectors
cbind SingleCellExperiment TRUE SingleCellExperiment
coerce SingleCellExperiment TRUE methods
coerce SingleCellExperiment TRUE methods
coerce SingleCellExperiment TRUE SummarizedExperiment
coerce SingleCellExperiment TRUE methods
coerce SingleCellExperiment TRUE methods
coerce SingleCellExperiment TRUE SummarizedExperiment
coerce SingleCellExperiment TRUE SingleCellExperiment
coerce SingleCellExperiment TRUE methods
coerce SingleCellExperiment TRUE methods
coerce SingleCellExperiment TRUE S4Vectors
coerce SingleCellExperiment TRUE S4Vectors
coerce SingleCellExperiment TRUE S4Vectors
coerce SingleCellExperiment TRUE S4Vectors
coerce SingleCellExperiment TRUE S4Vectors
coerce SingleCellExperiment TRUE S4Vectors
coerce SingleCellExperiment TRUE S4Vectors
coerce SingleCellExperiment TRUE S4Vectors
coerce SingleCellExperiment TRUE S4Vectors
coerce SingleCellExperiment TRUE S4Vectors
coerce SingleCellExperiment TRUE S4Vectors
coerce SingleCellExperiment TRUE IRanges
coerce<- SingleCellExperiment TRUE S4Vectors
coerce<- SingleCellExperiment TRUE S4Vectors
coerce<- SingleCellExperiment TRUE SummarizedExperiment
colData SingleCellExperiment TRUE SingleCellExperiment
colData<- SingleCellExperiment TRUE SummarizedExperiment
colData<- SingleCellExperiment TRUE SummarizedExperiment
colLabels SingleCellExperiment TRUE SingleCellExperiment
colLabels<- SingleCellExperiment TRUE SingleCellExperiment
colnames SingleCellExperiment TRUE SummarizedExperiment
colPair SingleCellExperiment TRUE SingleCellExperiment
colPair SingleCellExperiment TRUE SingleCellExperiment
colPair SingleCellExperiment TRUE SingleCellExperiment
colPair<- SingleCellExperiment TRUE SingleCellExperiment
colPair<- SingleCellExperiment TRUE SingleCellExperiment
colPair<- SingleCellExperiment TRUE SingleCellExperiment
colPairNames SingleCellExperiment TRUE SingleCellExperiment
colPairNames<- SingleCellExperiment TRUE SingleCellExperiment
colPairs SingleCellExperiment TRUE SingleCellExperiment
colPairs<- SingleCellExperiment TRUE SingleCellExperiment
combineCols SingleCellExperiment TRUE SingleCellExperiment
combineRows SingleCellExperiment TRUE SummarizedExperiment
Compare SingleCellExperiment TRUE SummarizedExperiment
Compare SingleCellExperiment TRUE SummarizedExperiment
Compare SingleCellExperiment TRUE SummarizedExperiment
countOverlaps SingleCellExperiment TRUE IRanges
countOverlaps SingleCellExperiment TRUE IRanges
countOverlaps SingleCellExperiment TRUE IRanges
counts SingleCellExperiment TRUE
counts<- SingleCellExperiment TRUE
coverage SingleCellExperiment TRUE SummarizedExperiment
cpm SingleCellExperiment TRUE
cpm<- SingleCellExperiment TRUE
dim SingleCellExperiment TRUE S4Vectors
dimnames SingleCellExperiment TRUE S4Vectors
dimnames<- SingleCellExperiment TRUE SummarizedExperiment
dimnames<- SingleCellExperiment TRUE S4Vectors
dimnames<- SingleCellExperiment TRUE SummarizedExperiment
disjointBins SingleCellExperiment TRUE SummarizedExperiment
distance SingleCellExperiment TRUE SummarizedExperiment
distance SingleCellExperiment TRUE SummarizedExperiment
distance SingleCellExperiment TRUE SummarizedExperiment
distanceToNearest SingleCellExperiment TRUE SummarizedExperiment
distanceToNearest SingleCellExperiment TRUE SummarizedExperiment
distanceToNearest SingleCellExperiment TRUE SummarizedExperiment
duplicated SingleCellExperiment TRUE SummarizedExperiment
elementMetadata SingleCellExperiment TRUE SummarizedExperiment
elementMetadata<- SingleCellExperiment TRUE SummarizedExperiment
end SingleCellExperiment TRUE SummarizedExperiment
end<- SingleCellExperiment TRUE SummarizedExperiment
eval SingleCellExperiment TRUE S4Vectors
eval SingleCellExperiment TRUE S4Vectors
expand SingleCellExperiment TRUE S4Vectors
expand.grid SingleCellExperiment TRUE S4Vectors
extractROWS SingleCellExperiment TRUE S4Vectors
FactorToClass SingleCellExperiment TRUE S4Vectors
findOverlaps SingleCellExperiment TRUE GenomicAlignments
findOverlaps SingleCellExperiment TRUE GenomicAlignments
findOverlaps SingleCellExperiment TRUE GenomicAlignments
findOverlaps SingleCellExperiment TRUE SummarizedExperiment
findOverlaps SingleCellExperiment TRUE GenomicAlignments
findOverlaps SingleCellExperiment TRUE GenomicAlignments
findOverlaps SingleCellExperiment TRUE GenomicAlignments
findOverlaps SingleCellExperiment TRUE IRanges
flank SingleCellExperiment TRUE SummarizedExperiment
follow SingleCellExperiment TRUE SummarizedExperiment
follow SingleCellExperiment TRUE SummarizedExperiment
follow SingleCellExperiment TRUE SummarizedExperiment
granges SingleCellExperiment TRUE SummarizedExperiment
head SingleCellExperiment TRUE utils
horizontal_slot_names SingleCellExperiment TRUE SummarizedExperiment
int_colData SingleCellExperiment TRUE SingleCellExperiment
int_colData<- SingleCellExperiment TRUE SingleCellExperiment
int_elementMetadata SingleCellExperiment TRUE SingleCellExperiment
int_elementMetadata<- SingleCellExperiment TRUE SingleCellExperiment
int_metadata SingleCellExperiment TRUE SingleCellExperiment
int_metadata<- SingleCellExperiment TRUE SingleCellExperiment
intersect SingleCellExperiment TRUE GenomicRanges
intersect SingleCellExperiment TRUE GenomicRanges
intersect SingleCellExperiment TRUE S4Vectors
is.na SingleCellExperiment TRUE S4Vectors
is.unsorted SingleCellExperiment TRUE SummarizedExperiment
isDisjoint SingleCellExperiment TRUE SummarizedExperiment
length SingleCellExperiment TRUE SummarizedExperiment
lengths SingleCellExperiment TRUE S4Vectors
logcounts SingleCellExperiment TRUE
logcounts<- SingleCellExperiment TRUE
mainExpName SingleCellExperiment TRUE SingleCellExperiment
mainExpName<- SingleCellExperiment TRUE SingleCellExperiment
match SingleCellExperiment TRUE S4Vectors
match SingleCellExperiment TRUE S4Vectors
match SingleCellExperiment TRUE Biostrings
match SingleCellExperiment TRUE Biostrings
mcols SingleCellExperiment TRUE SummarizedExperiment
mcols<- SingleCellExperiment TRUE SummarizedExperiment
merge SingleCellExperiment TRUE S4Vectors
mergeROWS SingleCellExperiment TRUE S4Vectors
metadata SingleCellExperiment TRUE S4Vectors
metadata<- SingleCellExperiment TRUE S4Vectors
mstack SingleCellExperiment TRUE S4Vectors
names SingleCellExperiment TRUE SummarizedExperiment
names<- SingleCellExperiment TRUE SummarizedExperiment
narrow SingleCellExperiment TRUE SummarizedExperiment
ncol SingleCellExperiment TRUE SummarizedExperiment
nearest SingleCellExperiment TRUE SummarizedExperiment
nearest SingleCellExperiment TRUE SummarizedExperiment
nearest SingleCellExperiment TRUE SummarizedExperiment
normcounts SingleCellExperiment TRUE
normcounts<- SingleCellExperiment TRUE
nrow SingleCellExperiment TRUE SummarizedExperiment
objectVersion SingleCellExperiment TRUE SingleCellExperiment
order SingleCellExperiment TRUE SummarizedExperiment
overlapsAny SingleCellExperiment TRUE IRanges
overlapsAny SingleCellExperiment TRUE IRanges
overlapsAny SingleCellExperiment TRUE IRanges
parallel_slot_names SingleCellExperiment TRUE SingleCellExperiment
pcompare SingleCellExperiment TRUE SummarizedExperiment
pcompare SingleCellExperiment TRUE SummarizedExperiment
pcompare SingleCellExperiment TRUE SummarizedExperiment
pcompare SingleCellExperiment TRUE Biostrings
pcompare SingleCellExperiment TRUE Biostrings
precede SingleCellExperiment TRUE SummarizedExperiment
precede SingleCellExperiment TRUE SummarizedExperiment
precede SingleCellExperiment TRUE SummarizedExperiment
promoters SingleCellExperiment TRUE SummarizedExperiment
ranges SingleCellExperiment TRUE SummarizedExperiment
ranges<- SingleCellExperiment TRUE SummarizedExperiment
rank SingleCellExperiment TRUE SummarizedExperiment
rbind SingleCellExperiment TRUE SingleCellExperiment
realize SingleCellExperiment TRUE SummarizedExperiment
reducedDim SingleCellExperiment TRUE SingleCellExperiment
reducedDim SingleCellExperiment TRUE SingleCellExperiment
reducedDim SingleCellExperiment TRUE SingleCellExperiment
reducedDim<- SingleCellExperiment TRUE SingleCellExperiment
reducedDim<- SingleCellExperiment TRUE SingleCellExperiment
reducedDim<- SingleCellExperiment TRUE SingleCellExperiment
reducedDimNames SingleCellExperiment TRUE SingleCellExperiment
reducedDimNames<- SingleCellExperiment TRUE SingleCellExperiment
reducedDims SingleCellExperiment TRUE SingleCellExperiment
reducedDims<- SingleCellExperiment TRUE SingleCellExperiment
relist SingleCellExperiment TRUE IRanges
rename SingleCellExperiment TRUE S4Vectors
rep SingleCellExperiment TRUE S4Vectors
rep.int SingleCellExperiment TRUE S4Vectors
replaceROWS SingleCellExperiment TRUE S4Vectors
resize SingleCellExperiment TRUE SummarizedExperiment
restrict SingleCellExperiment TRUE SummarizedExperiment
rev SingleCellExperiment TRUE S4Vectors
rowData SingleCellExperiment TRUE SingleCellExperiment
rowData<- SingleCellExperiment TRUE SummarizedExperiment
ROWNAMES SingleCellExperiment TRUE S4Vectors
rownames SingleCellExperiment TRUE SummarizedExperiment
ROWNAMES<- SingleCellExperiment TRUE S4Vectors
rowPair SingleCellExperiment TRUE SingleCellExperiment
rowPair SingleCellExperiment TRUE SingleCellExperiment
rowPair SingleCellExperiment TRUE SingleCellExperiment
rowPair<- SingleCellExperiment TRUE SingleCellExperiment
rowPair<- SingleCellExperiment TRUE SingleCellExperiment
rowPair<- SingleCellExperiment TRUE SingleCellExperiment
rowPairNames SingleCellExperiment TRUE SingleCellExperiment
rowPairNames<- SingleCellExperiment TRUE SingleCellExperiment
rowPairs SingleCellExperiment TRUE SingleCellExperiment
rowPairs<- SingleCellExperiment TRUE SingleCellExperiment
rowRanges SingleCellExperiment TRUE SummarizedExperiment
rowRanges<- SingleCellExperiment TRUE SummarizedExperiment
rowRanges<- SingleCellExperiment TRUE SummarizedExperiment
rowRanges<- SingleCellExperiment TRUE SummarizedExperiment
rowSubset SingleCellExperiment TRUE SingleCellExperiment
rowSubset<- SingleCellExperiment TRUE SingleCellExperiment
selfmatch SingleCellExperiment TRUE S4Vectors
seqinfo SingleCellExperiment TRUE SummarizedExperiment
seqinfo<- SingleCellExperiment TRUE SummarizedExperiment
seqlevelsInUse SingleCellExperiment TRUE GenomeInfoDb
seqnames SingleCellExperiment TRUE SummarizedExperiment
setdiff SingleCellExperiment TRUE GenomicRanges
setdiff SingleCellExperiment TRUE GenomicRanges
setdiff SingleCellExperiment TRUE S4Vectors
setequal SingleCellExperiment TRUE S4Vectors
shift SingleCellExperiment TRUE SummarizedExperiment
shiftApply SingleCellExperiment TRUE S4Vectors
show SingleCellExperiment TRUE SingleCellExperiment
showAsCell SingleCellExperiment TRUE SummarizedExperiment
sizeFactors SingleCellExperiment TRUE SingleCellExperiment
sizeFactors<- SingleCellExperiment TRUE SingleCellExperiment
sort SingleCellExperiment TRUE SummarizedExperiment
split SingleCellExperiment TRUE S4Vectors
split SingleCellExperiment TRUE S4Vectors
split SingleCellExperiment TRUE SummarizedExperiment
split SingleCellExperiment TRUE rtracklayer
split SingleCellExperiment TRUE S4Vectors
split<- SingleCellExperiment TRUE IRanges
start SingleCellExperiment TRUE SummarizedExperiment
start<- SingleCellExperiment TRUE SummarizedExperiment
strand SingleCellExperiment TRUE SummarizedExperiment
strand<- SingleCellExperiment TRUE SummarizedExperiment
subset SingleCellExperiment TRUE SummarizedExperiment
subsetByOverlaps SingleCellExperiment TRUE IRanges
summary SingleCellExperiment TRUE S4Vectors
table SingleCellExperiment TRUE S4Vectors
tail SingleCellExperiment TRUE utils
tapply SingleCellExperiment TRUE IRanges
tapply SingleCellExperiment TRUE IRanges
tapply SingleCellExperiment TRUE IRanges
tpm SingleCellExperiment TRUE
tpm<- SingleCellExperiment TRUE
transform SingleCellExperiment TRUE S4Vectors
trim SingleCellExperiment TRUE SummarizedExperiment
union SingleCellExperiment TRUE GenomicRanges
union SingleCellExperiment TRUE GenomicRanges
union SingleCellExperiment TRUE S4Vectors
unique SingleCellExperiment TRUE S4Vectors
updateObject SingleCellExperiment TRUE SingleCellExperiment
values SingleCellExperiment TRUE S4Vectors
values<- SingleCellExperiment TRUE S4Vectors
vertical_slot_names SingleCellExperiment TRUE SummarizedExperiment
weights SingleCellExperiment TRUE
weights<- SingleCellExperiment TRUE
width SingleCellExperiment TRUE SummarizedExperiment
width<- SingleCellExperiment TRUE SummarizedExperiment
window SingleCellExperiment TRUE S4Vectors
window<- SingleCellExperiment TRUE IRanges
with SingleCellExperiment TRUE S4Vectors
xtabs SingleCellExperiment TRUE S4Vectors
xtfrm SingleCellExperiment TRUE S4Vectors

Biostrings

  • Biostrings is a package containing infrastructure for dealing with biological sequences, or strings.
  • These are represented as so called XString objects (or XStringSet objects if you want to include multiple sequences), where X can be DNA, RNA, AA or B and signifies the type of sequence(s) in the object.
library(Biostrings)
  • To create a DNAString object, we can use the DNAString() constructor function. To see how the function works, type ?DNAString.
(dna <- DNAString(x = "AGGCATAGA"))
9-letter DNAString object
seq: AGGCATAGA
  • The classes come with useful validity checks. For example, not all characters are allowed in a DNA string (but note that it can contain other bases than ‘A’, ‘C’, ‘G’ and ‘T’, following the IUPAC code of ambiguous bases).
alphabet(DNAString())
 [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" "."
IUPAC_CODE_MAP
     A      C      G      T      M      R      W      S      Y      K      V 
   "A"    "C"    "G"    "T"   "AC"   "AG"   "AT"   "CG"   "CT"   "GT"  "ACG" 
     H      D      B      N 
 "ACT"  "AGT"  "CGT" "ACGT" 
## Invalid DNA string
DNAString(x = "EQI")
Error in .Call2("new_XString_from_CHARACTER", class(x0), string, start, : key 69 (char 'E') not in lookup table
alphabet(AAString())
 [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y"
[20] "V" "U" "O" "B" "J" "Z" "X" "*" "-" "+" "."
AAString(x = "EQI")
3-letter AAString object
seq: EQI

Once we have defined the object, we can perform operations like:

  • getting the reverse complement of the DNA sequence
reverseComplement(dna)
9-letter DNAString object
seq: TCTATGCCT

We could not do this with a regular character string

reverseComplement("AGGCATAGA")
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'reverseComplement' for signature '"character"'
  • translating the DNA sequence to an amino acid sequence
translate(dna)
3-letter AAString object
seq: RHR

We can not translate an amino acid sequence:

translate(AAString(x = "EQI"))
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'translate' for signature '"AAString"'
  • If we have more than one sequence, we can represent them in a DNAStringSet object, which is a collection of DNAString objects
(dna_multiple <- DNAStringSet(c(x = "AGGCATAGA", y = "GGTATGAG")))
DNAStringSet object of length 2:
    width seq                                               names               
[1]     9 AGGCATAGA                                         x
[2]     8 GGTATGAG                                          y
## Subsetting with [] gives back a DNAStringSet
dna_multiple[1]
DNAStringSet object of length 1:
    width seq                                               names               
[1]     9 AGGCATAGA                                         x
## Subsetting with [[]] extracts the individual objects
dna_multiple[[1]]
9-letter DNAString object
seq: AGGCATAGA
  • It is also easy to extract subsequences of either an XString or an XStringSet object
## Get the length of each string
width(dna_multiple)
[1] 9 8
## Extract the first three positions in each string
Biostrings::subseq(dna_multiple, start = 1, end = 3)
DNAStringSet object of length 2:
    width seq                                               names               
[1]     3 AGG                                               x
[2]     3 GGT                                               y
## Extract subsequences of different length
Biostrings::subseq(dna_multiple, start = 1, end = c(2, 5))
DNAStringSet object of length 2:
    width seq                                               names               
[1]     2 AG                                                x
[2]     5 GGTAT                                             y
Biostrings::subseq(dna_multiple, start = 1, 
                   end = width(dna_multiple) - 4)
DNAStringSet object of length 2:
    width seq                                               names               
[1]     5 AGGCA                                             x
[2]     4 GGTA                                              y

Exercises

  • Can you tell whether AATTGGCCRGGCCAATT is a valid DNA sequence?
Biostrings::DNAString("AATTGGCCRGGCCAATT")
17-letter DNAString object
seq: AATTGGCCRGGCCAATT

Genomic sequence information is often stored in fasta files. DNAStringSet objects can be created from such files by the readDNAStringSet() function from the Biostrings package. Download the gencode.v28.transcripts.1.1.10M.fa file from the course web page and read it into a DNAStringSet object using this function. What do you think these sequences represent?

txs <- Biostrings::readDNAStringSet("data/02/gencode.v28.transcripts.1.1.10M.fa")
  • How many sequences are there in this file?
length(txs)
[1] 1373
  • What are the lengths of the shortest and longest sequence?
range(width(txs))
[1]    59 11666
  • Extract the first 10 bases of each sequence
Biostrings::subseq(txs, start = 1, end = 10)
DNAStringSet object of length 1373:
       width seq                                            names               
   [1]    10 GTTAACTTGC                                     ENST00000456328.2...
   [2]    10 GTGTCTGACT                                     ENST00000450305.2...
   [3]    10 ATGGGAGCCG                                     ENST00000488147.1...
   [4]    10 TGTGGGAGAG                                     ENST00000619216.1...
   [5]    10 GTGCACACGG                                     ENST00000473358.1...
   ...   ... ...
[1369]    10 ATGATTCACT                                     ENST00000636827.2...
[1370]    10 AGCATATTCT                                     ENST00000578045.1...
[1371]    10 GCTAAAATAA                                     ENST00000413148.1...
[1372]    10 GTCCCGGCCC                                     ENST00000315901.4...
[1373]    10 GAGCCTCCGG                                     ENST00000294435.7...
  • Get the fraction of A, C, G and T bases in each sequence (hint: there is a function named alphabetFrequency). Plot a histogram of the fraction of Gs across all sequences
freqs <- alphabetFrequency(txs)
fracs <- sweep(freqs, MARGIN = 1, STATS = rowSums(freqs), FUN = "/")
hist(fracs[, "G"])

Annotation resources

Bioconductor provides a range of annotation resources, among them are:

  • Gene-based information for a given organism (OrgDb)
  • Transcript ranges for a given genome build (TxDb and EnsDb)
  • Genome sequence for a given genome build (BSgenome)

These annotation resources can be accessed in different ways:

  • Via annotation packages (updated every 6 months with the Bioconductor releases)
  • By pulling down information from online resources such as the Bioconductor AnnotationHub

As an example lets have a look at a human transcript annotation package TxDb.Hsapiens.UCSC.hg38.knownGene. As the name suggest, the annotation is derived from UCSC “knownGene” track for the hg38 genome build.

library(TxDb.Hsapiens.UCSC.hg38.knownGene)

?TxDb.Hsapiens.UCSC.hg38.knownGene

Two accessor functions are transcripts() and transcriptsBy(). Lets compare their output:

gr <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
gr
GRanges object with 276905 ranges and 2 metadata columns:
                      seqnames        ranges strand |     tx_id
                         <Rle>     <IRanges>  <Rle> | <integer>
       [1]                chr1   11869-14409      + |         1
       [2]                chr1   12010-13670      + |         2
       [3]                chr1   29554-31097      + |         3
       [4]                chr1   30267-31109      + |         4
       [5]                chr1   30366-30503      + |         5
       ...                 ...           ...    ... .       ...
  [276901] chrX_MU273397v1_alt 239036-260095      - |    276901
  [276902] chrX_MU273397v1_alt 272358-282686      - |    276902
  [276903] chrX_MU273397v1_alt 314193-316302      - |    276903
  [276904] chrX_MU273397v1_alt 314813-315236      - |    276904
  [276905] chrX_MU273397v1_alt 324527-324923      - |    276905
                     tx_name
                 <character>
       [1] ENST00000456328.2
       [2] ENST00000450305.2
       [3] ENST00000473358.1
       [4] ENST00000469289.1
       [5] ENST00000607096.1
       ...               ...
  [276901] ENST00000710260.1
  [276902] ENST00000710028.1
  [276903] ENST00000710030.1
  [276904] ENST00000710216.1
  [276905] ENST00000710031.1
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome
grl <- transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene, by = "gene")
grl
GRangesList object of length 32868:
$`1`
GRanges object with 8 ranges and 2 metadata columns:
      seqnames            ranges strand |     tx_id           tx_name
         <Rle>         <IRanges>  <Rle> | <integer>       <character>
  [1]    chr19 58345178-58347634      - |    228947 ENST00000596924.1
  [2]    chr19 58345183-58353492      - |    228948 ENST00000263100.8
  [3]    chr19 58346854-58356225      - |    228949 ENST00000600123.5
  [4]    chr19 58346858-58353491      - |    228950 ENST00000595014.1
  [5]    chr19 58346860-58347657      - |    228951 ENST00000598345.1
  [6]    chr19 58348466-58362751      - |    228952 ENST00000599109.5
  [7]    chr19 58350594-58353129      - |    228953 ENST00000600966.1
  [8]    chr19 58353021-58356083      - |    228954 ENST00000596636.1
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

$`10`
GRanges object with 3 ranges and 2 metadata columns:
      seqnames            ranges strand |     tx_id           tx_name
         <Rle>         <IRanges>  <Rle> | <integer>       <character>
  [1]     chr8 18386311-18388323      + |    104459 ENST00000660285.1
  [2]     chr8 18391282-18401218      + |    104461 ENST00000286479.4
  [3]     chr8 18391287-18400993      + |    104462 ENST00000520116.1
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

...
<32866 more elements>
  • Functions like exons(), transcripts() or genes() return GRanges objects with a set of ranges representing the level of detail of the query.

  • The “xxxBy” functions, e.g. transcriptsBy() or exonsBy(), return GRangesList objects. They enable grouping of the output in terms of another genomic feature, in our case genes.

Transcript annotations and their functionality play an important role when analyzing gene expression data. The coming lectures on RNA-Seq will introduce more functionality in the context of examples.

Exercises

The OrgDb package org.Hs.eg.db provides human gene annotation. It offers a simple interface to an underlying SQLite database. The main accessor functions for this (and other SQLite) annotation objects are select(), columns() and keys().

  • Install the org.Hs.eg.db package, load it and check its help page. Detailed examples on how to work with these annotation objects can be found on the linked AnnotationDb class help page.
library(org.Hs.eg.db)
?org.Hs.eg.db
  • What kind of data can be retrieved from the org.Hs.eg.db object? Check the columns() function!
columns(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
[21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[26] "UNIPROT"     
  • Use the select() function to retrieve the gene names as well as ENTREZ and ENSEMBL gene IDs for the three genes: RAF1, FLT3 and MAPK1.
select(org.Hs.eg.db, keys = c("RAF1", "FLT3", "MAPK1"), 
       keytype = "SYMBOL", columns = c("ENTREZID", "ENSEMBL", "GENENAME"))
'select()' returned 1:1 mapping between keys and columns
  SYMBOL ENTREZID         ENSEMBL                                      GENENAME
1   RAF1     5894 ENSG00000132155 Raf-1 proto-oncogene, serine/threonine kinase
2   FLT3     2322 ENSG00000122025        fms related receptor tyrosine kinase 3
3  MAPK1     5594 ENSG00000100030            mitogen-activated protein kinase 1

Further reading