Introduction to Bioconductor and standard classes

Author

Charlotte Soneson

Published

February 26, 2025

What is Bioconductor?

The project was started in 2001 (back then, mostly focused on microarray data)
Currently (release 3.20) hosts:
- 2289 software packages
- 928 annotation packages
- 431 experiment data packages
- 30 workflows

Covers a wide range of domains:
- sequencing (RNA-seq, ChIP-seq, single-cell analysis, variant calling, …)
- microarrays (methylation, gene expression, copy number, …)
- flow and mass cytometry
- proteomics
- imaging
- …
There is a core team which, among other things, oversees the project, coordinates review of new package submissions, develops and maintains core infrastructure and maintains daily building and checking of all packages. Most packages, however, are contributed by the community.
There is an active Bioconductor community, where you can find help, support and people with shared interests. For example, check the website, join the slack workspace, or attend one of the yearly conferences (BioC, EuroBioC, Bioc Asia).

How to install Bioconductor packages

Detailed instructions are available online.
There is a new Bioconductor release twice per year (typically April and October).
Between those times, development is going on in the parallel devel branch.
Do not mix packages from different releases!
The available Bioconductor releases also depend on your version of R - if you want the latest Bioconductor packages, you need the latest version of R.
Bioconductor packages are installed using the BiocManager CRAN package (note that this works also for packages hosted on CRAN or GitHub):

install.packages("BiocManager")
BiocManager::install("limma")

We can check the validity of our Bioconductor installation with BiocManager (e.g., to see if there are outdated packages):

BiocManager::valid()

We can also see which version is currently installed:

BiocManager::version()

[1] '3.20'

BiocManager can be used to search for available packages. Here, we look for package names containing the word “Mmusculus” (assuming that we are interested in looking for packages containing annotation information for mouse):

BiocManager::available(pattern = "Mmusculus")

 [1] "BSgenome.Mmusculus.UCSC.mm10"        "BSgenome.Mmusculus.UCSC.mm10.masked"
 [3] "BSgenome.Mmusculus.UCSC.mm39"        "BSgenome.Mmusculus.UCSC.mm8"        
 [5] "BSgenome.Mmusculus.UCSC.mm8.masked"  "BSgenome.Mmusculus.UCSC.mm9"        
 [7] "BSgenome.Mmusculus.UCSC.mm9.masked"  "EnsDb.Mmusculus.v75"                
 [9] "EnsDb.Mmusculus.v79"                 "PWMEnrich.Mmusculus.background"     
[11] "TxDb.Mmusculus.UCSC.mm10.ensGene"    "TxDb.Mmusculus.UCSC.mm10.knownGene" 
[13] "TxDb.Mmusculus.UCSC.mm39.knownGene"  "TxDb.Mmusculus.UCSC.mm39.refGene"   
[15] "TxDb.Mmusculus.UCSC.mm9.knownGene"

Each Bioconductor package has a landing page (e.g., https://bioconductor.org/packages/limma/) that contains relevant information, including installation instructions, vignettes and the reference manual:

How to get help with Bioconductor packages

Each function has a help page, detailing all arguments and outputs, and containing runnable examples.

library(limma)
?lmFit

Each package comes with at least one vignette, outlining the intended use of the functions in the package to analyze data. The vignettes can be browsed from the landing page, or directly in your R session, e.g. with

browseVignettes(package = "Biostrings")

If you need help with a Bioconductor package:
- Check the function help pages and the vignette(s) to see whether the answer is there (some packages, like DESeq2, have FAQ sections in their vignette)
- Go to https://support.bioconductor.org/
- Search to see whether your question has already been asked (and answered)
- If not, look through the Posting guide and post your question following these guidelines
- If you follow these steps, you will often get a rapid and informative answer
- The slack workspace has channels covering many topics related to Bioconductor
The Workflows are complete walkthroughs of specific types of analyses, typically making use of multiple (Bioconductor) packages

Standard Bioconductor data structures

Every object in R is of a particular data type.
You have already seen some data types, e.g. atomic vectors (numeric, character etc), list, matrix, data.frame, factor.
(Almost) all objects in R have a class. A class is a formal description of the data type, its properties, functionality and relation to other types .
The class function tells you which class(es) an object is of:

class(c("one", "two", "three"))

[1] "character"

class(matrix(1:6, 2, 3))

[1] "matrix" "array"

The class of an object defines the set of allowed operations.

relevel(c("one", "two", "three"), ref = "one")

Error in relevel.default(c("one", "two", "three"), ref = "one"): 'relevel' only for (unordered) factors

relevel(factor(c("one", "two", "three")), ref = "two")

[1] one   two   three
Levels: two one three

c("1", "2", "3") + 1

Error in c("1", "2", "3") + 1: non-numeric argument to binary operator

as.numeric(c("1", "2", "3")) + 1

[1] 2 3 4

When working with (biological) data, it often makes sense to define more complex data structures, with several building blocks, and to define appropriate methods for each such class.

What types of data do we have to deal with?

Sequences (strings)
Alignments
‘Rectangular’ data (matrices of numeric values for a set of features and samples)
Genomic ranges (intervals)
Genomic variants
Annotation databases
Images
…

For each of these data types, Bioconductor provides standard classes.

There are often many ways of creating an object of a given class, including:

reading data from a file (e.g. the output from an instrument) using specialized import functions from Bioconductor or other sources. This is perhaps the most common way of creating objects, and we will see many examples during the course. For example, think of how you can generate a data.frame by reading data from a text file with read.delim().
loading processed data or annotations provided via (Bioconductor) packages.
directly creating the object using a constructor function (which often has the same name as the object class). In this case, all the information that is needed for a particular type of object must be provided manually. This is the easiest way to create small example objects, and we will see examples of this below. This is also how the small objects in the previous section were created.

Why use standardized classes?

Interoperability - having standard ways of storing and moving data between packages means we don’t have to implement the same thing in multiple places. Moreover, it allows us as developers to write functions that are applicable to a very broad set of data objects, since we can make assumptions about the content and structure of these objects (see for example the iSEE package).
Recognition - you don’t have to learn a new representation for each package.
Robustness - standard classes are efficiently implemented and well tested, and come with additional functions to check the validity of provided data.

As an example, all these Bioconductor packages make use of the SummarizedExperiment class:

In the following we will examine a few standard Bioconductor classes in more detail. Of course, there are many more, and some of these will be covered in later lectures.

GenomicRanges

The GenomicRanges package holds infrastructure for working with (genomic) ranges.
These can represent, for example, gene coordinates, ChIP-seq peaks, promoters, SNPs, CpG islands.
A GRanges (genomic ranges) object contains, for each range, information about
- the chromosome (called sequence)
- the interval (start and end)
- the strand (either +, - or *)
The interval is, in turn, represented by a so called IRanges (interval ranges) object.
Many of the operations shown below apply equally well to IRanges objects.

library(GenomicRanges)

Let’s create a small IRanges object. We can specify any two of the start, end and width of the ranges, and the third one will be inferred.

(ir <- IRanges(start = c(3, 10, 50), 
               end = c(17, 28, 104)))

IRanges object with 3 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         3        17        15
  [2]        10        28        19
  [3]        50       104        55

## Equivalent
(ir <- IRanges(start = c(3, 10, 50), 
               width = c(15, 19, 55)))

IRanges object with 3 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         3        17        15
  [2]        10        28        19
  [3]        50       104        55

The IRanges object just represents ranges on the number line. There is no specifically genomic information in there - that’s added when we construct a GRanges object

(gr <- GRanges(seqnames = c("chr1", "chr1", "chr2"), 
               ranges = ir,
               strand = c("+", "+", "-")))

GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  [3]     chr2    50-104      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Many operations are defined on the ranges objects. As a first example, it is easy to extract the individual components of the object using accessor functions

start(gr)

[1]  3 10 50

end(gr)

[1]  17  28 104

width(gr)

[1] 15 19 55

We can also get and set information about the reference sequences - there are many different conventions for naming chromosomes!

seqlevelsStyle(gr)

[1] "UCSC"

seqlevelsStyle(gr) <- "Ensembl"
gr

GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1      3-17      +
  [2]        1     10-28      +
  [3]        2    50-104      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

seqlevelsStyle(gr) <- "UCSC"

There are many ways of subsetting ranges objects (check out methods(class = "GRanges")). For example, we can use the regular [ function (the authors of the GRanges class has defined what this function means for such an object), or we can keep only records on specific chromosomes.

gr[1:2]

GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

dropSeqlevels(gr, "chr1", pruning.mode = "coarse")

GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2    50-104      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

keepSeqlevels(gr, "chr1", pruning.mode = "coarse")

GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

gr[seqnames(gr) == "chr1"]

GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

keepStandardChromosomes(gr, pruning.mode = "coarse")

GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  [3]     chr2    50-104      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Many other operations can be applied to each of the ranges in a GRanges or IRanges object, such as extending all ranges with a certain number of bases in each end

gr + 5

GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     -2-22      +
  [2]     chr1      5-33      +
  [3]     chr2    45-109      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Note how this led to a negative start coordinate for the first range. This is not considered a violation since there is no information in the GRanges object which will tell R where the chromosomes (sequences) start and end. To accommodate this, let’s set the sequence lengths to 100 and 200 nt, respectively.

seqlengths(gr)

chr1 chr2 
  NA   NA

seqlengths(gr) <- c(chr1 = 100, chr2 = 200)
seqlengths(gr)

chr1 chr2 
 100  200

gr + 5

Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence chr1.
  Note that ranges located on a sequence whose length is unknown (NA) or
  on a circular sequence are not considered out-of-bound (use
  seqlengths() and isCircular() to get the lengths and circularity flags
  of the underlying sequences). You can use trim() to trim these ranges.
  See ?`trim,GenomicRanges-method` for more information.

GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     -2-22      +
  [2]     chr1      5-33      +
  [3]     chr2    45-109      -
  -------
  seqinfo: 2 sequences from an unspecified genome

Note how we now get a warning that one of the ranges is extending outside of the reference sequences. We can further use the trim() function to remove all parts of ranges that fall outside the boundaries of the reference sequences (this would not do anything if we hadn’t set the sequence lengths).

trim(gr + 5)

Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence chr1.
  Note that ranges located on a sequence whose length is unknown (NA) or
  on a circular sequence are not considered out-of-bound (use
  seqlengths() and isCircular() to get the lengths and circularity flags
  of the underlying sequences). You can use trim() to trim these ranges.
  See ?`trim,GenomicRanges-method` for more information.

GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      1-22      +
  [2]     chr1      5-33      +
  [3]     chr2    45-109      -
  -------
  seqinfo: 2 sequences from an unspecified genome

We can also set a name for the genome that the ranges are supposed to come from. This can be useful e.g. when comparing different sets of ranges, to make sure that they all correspond to the same underlying genome.

genome(gr) <- "hg19"
gr

GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  [3]     chr2    50-104      -
  -------
  seqinfo: 2 sequences from hg19 genome

Other useful operations on single ranges include:

shifting

shift(gr, 5)

GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      8-22      +
  [2]     chr1     15-33      +
  [3]     chr2    55-109      -
  -------
  seqinfo: 2 sequences from hg19 genome

obtaining flanking sequence. Note how the strand information is taken into account.

flank(gr, 2, start = TRUE, both = FALSE)

GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1       1-2      +
  [2]     chr1       8-9      +
  [3]     chr2   105-106      -
  -------
  seqinfo: 2 sequences from hg19 genome

finding the midpoint of each range

resize(gr, width = 1, fix = "center")

GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1        10      +
  [2]     chr1        19      +
  [3]     chr2        77      -
  -------
  seqinfo: 2 sequences from hg19 genome

In addition to the intra-range operations, there are also many useful operations that involve multiple ranges. For example, we can:

find overlaps between ranges

gr2 <- GRanges(seqnames = c("chr2", "chr1"),
               ranges = IRanges(start = c(90, 5), 
                                end = c(100, 15)),
               strand = c("-", "+"))
findOverlaps(query = gr, subject = gr2)

Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           2
  [2]         2           2
  [3]         3           1
  -------
  queryLength: 3 / subjectLength: 2

GRanges objects can have associated metadata. This is provided in the mcols() slot of the object.

mcols(gr)$score <- 1:3
gr

GRanges object with 3 ranges and 1 metadata column:
      seqnames    ranges strand |     score
         <Rle> <IRanges>  <Rle> | <integer>
  [1]     chr1      3-17      + |         1
  [2]     chr1     10-28      + |         2
  [3]     chr2    50-104      - |         3
  -------
  seqinfo: 2 sequences from hg19 genome

## Extract only the metadata
mcols(gr)

DataFrame with 3 rows and 1 column
      score
  <integer>
1         1
2         2
3         3

## We can also supply the metadata when creating the object
GRanges(seqnames = c("chr2", "chr1"),
        ranges = IRanges(start = c(90, 5), 
                         end = c(100, 15)),
        strand = c("-", "+"),
        extra_column = c("A", "B"))

GRanges object with 2 ranges and 1 metadata column:
      seqnames    ranges strand | extra_column
         <Rle> <IRanges>  <Rle> |  <character>
  [1]     chr2    90-100      - |            A
  [2]     chr1      5-15      + |            B
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Exercises

In practice, the locations of known genomic features (exons, genes, transcripts, UTRs, …) along the genome are often stored in so called gtf (gene transfer format) files. These files can be imported into a GRanges object using the import() function from the rtracklayer Bioconductor package.

Download the gencode.v28.annotation.1.1.10M.gtf file from the course web page, and read it into R in this way.

Solution

library(rtracklayer)


Attaching package: 'rtracklayer'

The following object is masked from 'package:igraph':

    blocks

gtf <- rtracklayer::import("data/02/gencode.v28.annotation.1.1.10M.gtf")

Confirm that the gtf object is indeed a GRanges object

Solution

class(gtf)

[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"

How many records are there in the object?

Solution

length(gtf)

[1] 17955

From which chromosomes do the records come?

Solution

seqlevels(gtf)

[1] "chr1"

Use the subset function to extract only the records for the CHD5 gene. How many annotated transcripts does this gene have?

Solution

gtfsub <- subset(gtf, gene_name == "CHD5")
length(subset(gtfsub, type == "transcript"))

[1] 6

Lists of genomic ranges

In some cases, ranges come in groups - for example, exons within a transcript.
In these situations, it often makes sense to represent each group of ranges as a GRanges object, and the full set of groups as a list of such ranges.
The GRangesList class covers these situations.
All ranges objects share the same genome information

grl <- GRangesList(
    tx1.1 = GRanges(seqnames = "chr1",
                    ranges = IRanges(start = c(50, 125, 188),
                                     end = c(75, 130, 257)),
                    strand = "+",
                    symbol = "tx1.1", transcript = "tx1.1", 
                    gene = "gene1", exon = paste0("ex1.", 1:3)),
    tx1.2 = GRanges(seqnames = "chr1",
                    ranges = IRanges(start = c(50, 125, 200),
                                     end = c(75, 130, 227)),
                    strand = "+",
                    symbol = "tx1.2", transcript = "tx1.2", 
                    gene = "gene1", exon = paste0("ex1.", c(1, 2, 4))),
    tx2.1 = GRanges(seqnames = "chr2",
                    ranges = IRanges(start = c(289, 318),
                                     end = c(300, 350)),
                    strand = "-",
                    symbol = "tx2.1", transcript = "tx2.1",
                    gene = "gene2", exon = paste0("ex2.", 1:2)))
seqlengths(grl) <- c(chr1 = 1000, chr2 = 500)
grl

GRangesList object of length 3:
$tx1.1
GRanges object with 3 ranges and 4 metadata columns:
      seqnames    ranges strand |      symbol  transcript        gene
         <Rle> <IRanges>  <Rle> | <character> <character> <character>
  [1]     chr1     50-75      + |       tx1.1       tx1.1       gene1
  [2]     chr1   125-130      + |       tx1.1       tx1.1       gene1
  [3]     chr1   188-257      + |       tx1.1       tx1.1       gene1
             exon
      <character>
  [1]       ex1.1
  [2]       ex1.2
  [3]       ex1.3
  -------
  seqinfo: 2 sequences from an unspecified genome

$tx1.2
GRanges object with 3 ranges and 4 metadata columns:
      seqnames    ranges strand |      symbol  transcript        gene
         <Rle> <IRanges>  <Rle> | <character> <character> <character>
  [1]     chr1     50-75      + |       tx1.2       tx1.2       gene1
  [2]     chr1   125-130      + |       tx1.2       tx1.2       gene1
  [3]     chr1   200-227      + |       tx1.2       tx1.2       gene1
             exon
      <character>
  [1]       ex1.1
  [2]       ex1.2
  [3]       ex1.4
  -------
  seqinfo: 2 sequences from an unspecified genome

$tx2.1
GRanges object with 2 ranges and 4 metadata columns:
      seqnames    ranges strand |      symbol  transcript        gene
         <Rle> <IRanges>  <Rle> | <character> <character> <character>
  [1]     chr2   289-300      - |       tx2.1       tx2.1       gene2
  [2]     chr2   318-350      - |       tx2.1       tx2.1       gene2
             exon
      <character>
  [1]       ex2.1
  [2]       ex2.2
  -------
  seqinfo: 2 sequences from an unspecified genome

The Gviz package can be used to generate a visualization of the first two transcripts (which belong to the same gene):

library(Gviz)
g1 <- GeneRegionTrack(grl$tx1.1, showId = TRUE, col = NULL,
                      fill = "#377EB8", name = "",
                      background.title = "transparent",
                      col.title = "black")
g2 <- GeneRegionTrack(grl$tx1.2, showId = TRUE, col = NULL,
                      fill = "#E41A1C", name = "",
                      background.title = "transparent",
                      col.title = "black")
gtr <- GenomeAxisTrack()
tracks <- c(gtr, g1, g2)
plotTracks(tracks)

SummarizedExperiment

The purpose of the SummarizedExperiment class (provided in the package with the same name) is to hold ‘rectangular’ data, together with annotations and metadata for the rows and columns. In other words, all the data associated with an experiment!
As we already saw above, this class (or extensions like SingleCellExperiment or DESeqDataSet) is widely used throughout the Bioconductor ecosystem, e.g., for representing RNA-seq, ChIP-seq, methylation and mass cytometry data.

The SummarizedExperiment object consists of three types of ‘tables’, denoted
- assays (actually a list of matrices of the same size, containing the observed data, e.g., gene expression values)
- rowData (annotations for the rows, i.e., the features). This can be replaced by rowRanges (a GRanges object) that can also hold genomic range information, e.g., the locations of genes
- colData (annotations for the columns, i.e., the samples)

library(SummarizedExperiment)

se <- SummarizedExperiment(
        assays = list(exprs = matrix(10 * runif(15), 5, 3)),
        colData = DataFrame(sample =  paste0("Sample", 1:3), 
                            condition = c("A", "A", "B")),
        rowData = DataFrame(gene = paste0("G", 1:5)))
rownames(se) <- rowData(se)$gene
colnames(se) <- colData(se)$sample
se

class: SummarizedExperiment 
dim: 5 3 
metadata(0):
assays(1): exprs
rownames(5): G1 G2 G3 G4 G5
rowData names(1): gene
colnames(3): Sample1 Sample2 Sample3
colData names(2): sample condition

The SummarizedExperiment class has accessor functions to extract the individual components.

assayNames(se)

[1] "exprs"

assay(se, "exprs")

    Sample1   Sample2  Sample3
G1 5.862581 8.7089339 0.199966
G2 7.627656 1.3942255 1.463375
G3 3.897536 7.9219906 3.722251
G4 5.430816 2.0393493 9.654674
G5 7.732064 0.5613301 1.532833

colData(se)

DataFrame with 3 rows and 2 columns
             sample   condition
        <character> <character>
Sample1     Sample1           A
Sample2     Sample2           A
Sample3     Sample3           B

rowData(se)

DataFrame with 5 rows and 1 column
          gene
   <character>
G1          G1
G2          G2
G3          G3
G4          G4
G5          G5

There are also convenient subsetting functions. Note how all the relevant parts of the objects are subset!

sesub <- se[1:2, ]
assay(sesub, "exprs")

    Sample1  Sample2  Sample3
G1 5.862581 8.708934 0.199966
G2 7.627656 1.394225 1.463375

colData(sesub)

DataFrame with 3 rows and 2 columns
             sample   condition
        <character> <character>
Sample1     Sample1           A
Sample2     Sample2           A
Sample3     Sample3           B

rowData(sesub)

DataFrame with 2 rows and 1 column
          gene
   <character>
G1          G1
G2          G2

If a GRanges object is used to represent the row annotations, the mcols() slot of the GRanges object is returned by rowData(se).

se <- SummarizedExperiment(
    assays = list(exprs = matrix(10 * runif(15), 5, 3)),
    colData = DataFrame(sample =  paste0("Sample", 1:3), 
                        condition = c("A", "A", "B")),
    rowRanges = GRanges(seqnames = "chr1",
                        ranges = IRanges(start = 100 * runif(5),
                                         width = 50),
                        strand = "+",
                        gene = paste0("G", 1:5)))
class(se)

[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"

rowData(se)

DataFrame with 5 rows and 1 column
         gene
  <character>
1          G1
2          G2
3          G3
4          G4
5          G5

rowRanges(se)

GRanges object with 5 ranges and 1 metadata column:
      seqnames    ranges strand |        gene
         <Rle> <IRanges>  <Rle> | <character>
  [1]     chr1    60-109      + |          G1
  [2]     chr1      3-52      + |          G2
  [3]     chr1    75-124      + |          G3
  [4]     chr1     28-77      + |          G4
  [5]     chr1    55-104      + |          G5
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Exercises

Here we will explore a real RNA-seq data set, which is provided in the airway Bioconductor package and stored in a RangedSummarizedExperiment object.

Load the package and attach the data set.

Solution

library(airway)
data(airway)
airway

class: RangedSummarizedExperiment 
dim: 63677 8 
metadata(1): ''
assays(1): counts
rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
  ENSG00000273493
rowData names(10): gene_id gene_name ... seq_coord_system symbol
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample

What type of data value are available in this object? How many samples are there? How many features?

Solution

assayNames(airway)

[1] "counts"

dim(airway)

[1] 63677     8

What is the total number of sequencing reads assigned to genes for each sample?

Solution

colSums(assay(airway, "counts"))

SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 
  20637971   18809481   25348649   15163415   24448408   30818215   19126151 
SRR1039521 
  21164133

Class inheritance

Classes are usually organized into a hierarchy by class inheritance.
Basically, it involves taking a ‘standard’ base class that has some of the properties you need, and extending or building upon it to create a new class that fits your purpose.
The new class will inherit the attributes and methods from the original class, and thus you don’t need to redefine them for your new class.
One example from Bioconductor is the SingleCellExperiment class, which extends the SummarizedExperiment class by, among a few other things, allowing the inclusion of dimension reduction results (e.g. PCA, tSNE, or UMAP).

library(SingleCellExperiment)

getClass("SingleCellExperiment")

Class "SingleCellExperiment" [package "SingleCellExperiment"]

Slots:
                                                                
Name:           int_elementMetadata                  int_colData
Class:                    DataFrame                    DataFrame
                                                                
Name:                  int_metadata                    rowRanges
Class:                         list GenomicRanges_OR_GRangesList
                                                                
Name:                       colData                       assays
Class:                    DataFrame               Assays_OR_NULL
                                                                
Name:                         NAMES              elementMetadata
Class:            character_OR_NULL                    DataFrame
                                   
Name:                      metadata
Class:                         list

Extends: 
Class "RangedSummarizedExperiment", directly
Class "SummarizedExperiment", by class "RangedSummarizedExperiment", distance 2
Class "RectangularData", by class "RangedSummarizedExperiment", distance 3
Class "Vector", by class "RangedSummarizedExperiment", distance 3
Class "Annotated", by class "RangedSummarizedExperiment", distance 4
Class "vector_OR_Vector", by class "RangedSummarizedExperiment", distance 4

getClass("RangedSummarizedExperiment")

Class "RangedSummarizedExperiment" [package "SummarizedExperiment"]

Slots:
                                                                
Name:                     rowRanges                      colData
Class: GenomicRanges_OR_GRangesList                    DataFrame
                                                                
Name:                        assays                        NAMES
Class:               Assays_OR_NULL            character_OR_NULL
                                                                
Name:               elementMetadata                     metadata
Class:                    DataFrame                         list

Extends: 
Class "SummarizedExperiment", directly
Class "RectangularData", by class "SummarizedExperiment", distance 2
Class "Vector", by class "SummarizedExperiment", distance 2
Class "Annotated", by class "SummarizedExperiment", distance 3
Class "vector_OR_Vector", by class "SummarizedExperiment", distance 3

Due to inheritance it might not be immediately clear what methods are defined for a given class. For most Bioconductor classes (all so called S4 classes) one way to find out is via the showMethods() function.

showMethods(class = "SingleCellExperiment")

A more comprehensive output however is provided by the sloop package. Its s4_methods_class() function also lists the source of inherited methods (make sure that the SingleCellExperiment package has been loaded into your R session):

library(sloop)
s4_methods_class("SingleCellExperiment")

Methods of SingleCellExperiment
generic	class	visible	source
!=	SingleCellExperiment	TRUE	S4Vectors
!=	SingleCellExperiment	TRUE	S4Vectors
!=	SingleCellExperiment	TRUE	S4Vectors
[	SingleCellExperiment	TRUE	SingleCellExperiment
[[	SingleCellExperiment	TRUE	SummarizedExperiment
[[<-	SingleCellExperiment	TRUE	SummarizedExperiment
[<-	SingleCellExperiment	TRUE	SingleCellExperiment
[<-	SingleCellExperiment	TRUE	S4Vectors
%in%	SingleCellExperiment	TRUE	S4Vectors
%in%	SingleCellExperiment	TRUE	S4Vectors
%in%	SingleCellExperiment	TRUE	S4Vectors
<	SingleCellExperiment	TRUE	S4Vectors
<	SingleCellExperiment	TRUE	S4Vectors
<	SingleCellExperiment	TRUE	S4Vectors
<=	SingleCellExperiment	TRUE	S4Vectors
<=	SingleCellExperiment	TRUE	S4Vectors
<=	SingleCellExperiment	TRUE	S4Vectors
==	SingleCellExperiment	TRUE	S4Vectors
==	SingleCellExperiment	TRUE	S4Vectors
==	SingleCellExperiment	TRUE	S4Vectors
>	SingleCellExperiment	TRUE	S4Vectors
>	SingleCellExperiment	TRUE	S4Vectors
>	SingleCellExperiment	TRUE	S4Vectors
>=	SingleCellExperiment	TRUE	S4Vectors
>=	SingleCellExperiment	TRUE	S4Vectors
>=	SingleCellExperiment	TRUE	S4Vectors
$	SingleCellExperiment	TRUE	SummarizedExperiment
$<-	SingleCellExperiment	TRUE	SummarizedExperiment
aggregate	SingleCellExperiment	TRUE	S4Vectors
altExp	SingleCellExperiment	TRUE	SingleCellExperiment
altExp	SingleCellExperiment	TRUE	SingleCellExperiment
altExp	SingleCellExperiment	TRUE	SingleCellExperiment
altExp<-	SingleCellExperiment	TRUE	SingleCellExperiment
altExp<-	SingleCellExperiment	TRUE	SingleCellExperiment
altExp<-	SingleCellExperiment	TRUE	SingleCellExperiment
altExpNames	SingleCellExperiment	TRUE	SingleCellExperiment
altExpNames<-	SingleCellExperiment	TRUE	SingleCellExperiment
altExps	SingleCellExperiment	TRUE	SingleCellExperiment
altExps<-	SingleCellExperiment	TRUE	SingleCellExperiment
anyDuplicated	SingleCellExperiment	TRUE	S4Vectors
anyNA	SingleCellExperiment	TRUE	S4Vectors
append	SingleCellExperiment	TRUE	S4Vectors
as.character	SingleCellExperiment	TRUE	S4Vectors
as.complex	SingleCellExperiment	TRUE	S4Vectors
as.data.frame	SingleCellExperiment	TRUE	S4Vectors
as.env	SingleCellExperiment	TRUE	S4Vectors
as.integer	SingleCellExperiment	TRUE	S4Vectors
as.list	SingleCellExperiment	TRUE	S4Vectors
as.logical	SingleCellExperiment	TRUE	S4Vectors
as.matrix	SingleCellExperiment	TRUE	S4Vectors
as.numeric	SingleCellExperiment	TRUE	S4Vectors
as.raw	SingleCellExperiment	TRUE	S4Vectors
assay	SingleCellExperiment	TRUE	SummarizedExperiment
assay	SingleCellExperiment	TRUE	SummarizedExperiment
assay	SingleCellExperiment	TRUE	SummarizedExperiment
assay<-	SingleCellExperiment	TRUE	SummarizedExperiment
assay<-	SingleCellExperiment	TRUE	SummarizedExperiment
assay<-	SingleCellExperiment	TRUE	SummarizedExperiment
assayNames	SingleCellExperiment	TRUE	SummarizedExperiment
assayNames<-	SingleCellExperiment	TRUE	SummarizedExperiment
assays	SingleCellExperiment	TRUE	SummarizedExperiment
assays<-	SingleCellExperiment	TRUE	SummarizedExperiment
assays<-	SingleCellExperiment	TRUE	SummarizedExperiment
bindROWS	SingleCellExperiment	TRUE	S4Vectors
by	SingleCellExperiment	TRUE	S4Vectors
c	SingleCellExperiment	TRUE	S4Vectors
cbind	SingleCellExperiment	TRUE	SingleCellExperiment
coerce	SingleCellExperiment	TRUE	methods
coerce	SingleCellExperiment	TRUE	methods
coerce	SingleCellExperiment	TRUE	SummarizedExperiment
coerce	SingleCellExperiment	TRUE	methods
coerce	SingleCellExperiment	TRUE	methods
coerce	SingleCellExperiment	TRUE	SummarizedExperiment
coerce	SingleCellExperiment	TRUE	SingleCellExperiment
coerce	SingleCellExperiment	TRUE	methods
coerce	SingleCellExperiment	TRUE	methods
coerce	SingleCellExperiment	TRUE	S4Vectors
coerce	SingleCellExperiment	TRUE	S4Vectors
coerce	SingleCellExperiment	TRUE	S4Vectors
coerce	SingleCellExperiment	TRUE	S4Vectors
coerce	SingleCellExperiment	TRUE	S4Vectors
coerce	SingleCellExperiment	TRUE	S4Vectors
coerce	SingleCellExperiment	TRUE	S4Vectors
coerce	SingleCellExperiment	TRUE	S4Vectors
coerce	SingleCellExperiment	TRUE	S4Vectors
coerce	SingleCellExperiment	TRUE	S4Vectors
coerce	SingleCellExperiment	TRUE	S4Vectors
coerce	SingleCellExperiment	TRUE	IRanges
coerce<-	SingleCellExperiment	TRUE	S4Vectors
coerce<-	SingleCellExperiment	TRUE	S4Vectors
coerce<-	SingleCellExperiment	TRUE	SummarizedExperiment
colData	SingleCellExperiment	TRUE	SingleCellExperiment
colData<-	SingleCellExperiment	TRUE	SummarizedExperiment
colData<-	SingleCellExperiment	TRUE	SummarizedExperiment
colLabels	SingleCellExperiment	TRUE	SingleCellExperiment
colLabels<-	SingleCellExperiment	TRUE	SingleCellExperiment
colnames	SingleCellExperiment	TRUE	SummarizedExperiment
colPair	SingleCellExperiment	TRUE	SingleCellExperiment
colPair	SingleCellExperiment	TRUE	SingleCellExperiment
colPair	SingleCellExperiment	TRUE	SingleCellExperiment
colPair<-	SingleCellExperiment	TRUE	SingleCellExperiment
colPair<-	SingleCellExperiment	TRUE	SingleCellExperiment
colPair<-	SingleCellExperiment	TRUE	SingleCellExperiment
colPairNames	SingleCellExperiment	TRUE	SingleCellExperiment
colPairNames<-	SingleCellExperiment	TRUE	SingleCellExperiment
colPairs	SingleCellExperiment	TRUE	SingleCellExperiment
colPairs<-	SingleCellExperiment	TRUE	SingleCellExperiment
combineCols	SingleCellExperiment	TRUE	SingleCellExperiment
combineRows	SingleCellExperiment	TRUE	SummarizedExperiment
Compare	SingleCellExperiment	TRUE	SummarizedExperiment
Compare	SingleCellExperiment	TRUE	SummarizedExperiment
Compare	SingleCellExperiment	TRUE	SummarizedExperiment
countOverlaps	SingleCellExperiment	TRUE	IRanges
countOverlaps	SingleCellExperiment	TRUE	IRanges
countOverlaps	SingleCellExperiment	TRUE	IRanges
counts	SingleCellExperiment	TRUE
counts<-	SingleCellExperiment	TRUE
coverage	SingleCellExperiment	TRUE	SummarizedExperiment
cpm	SingleCellExperiment	TRUE
cpm<-	SingleCellExperiment	TRUE
dim	SingleCellExperiment	TRUE	S4Vectors
dimnames	SingleCellExperiment	TRUE	S4Vectors
dimnames<-	SingleCellExperiment	TRUE	SummarizedExperiment
dimnames<-	SingleCellExperiment	TRUE	S4Vectors
dimnames<-	SingleCellExperiment	TRUE	SummarizedExperiment
disjointBins	SingleCellExperiment	TRUE	SummarizedExperiment
distance	SingleCellExperiment	TRUE	SummarizedExperiment
distance	SingleCellExperiment	TRUE	SummarizedExperiment
distance	SingleCellExperiment	TRUE	SummarizedExperiment
distanceToNearest	SingleCellExperiment	TRUE	SummarizedExperiment
distanceToNearest	SingleCellExperiment	TRUE	SummarizedExperiment
distanceToNearest	SingleCellExperiment	TRUE	SummarizedExperiment
duplicated	SingleCellExperiment	TRUE	SummarizedExperiment
elementMetadata	SingleCellExperiment	TRUE	SummarizedExperiment
elementMetadata<-	SingleCellExperiment	TRUE	SummarizedExperiment
end	SingleCellExperiment	TRUE	SummarizedExperiment
end<-	SingleCellExperiment	TRUE	SummarizedExperiment
eval	SingleCellExperiment	TRUE	S4Vectors
eval	SingleCellExperiment	TRUE	S4Vectors
expand	SingleCellExperiment	TRUE	S4Vectors
expand.grid	SingleCellExperiment	TRUE	S4Vectors
extractROWS	SingleCellExperiment	TRUE	S4Vectors
FactorToClass	SingleCellExperiment	TRUE	S4Vectors
findOverlaps	SingleCellExperiment	TRUE	GenomicAlignments
findOverlaps	SingleCellExperiment	TRUE	GenomicAlignments
findOverlaps	SingleCellExperiment	TRUE	GenomicAlignments
findOverlaps	SingleCellExperiment	TRUE	SummarizedExperiment
findOverlaps	SingleCellExperiment	TRUE	GenomicAlignments
findOverlaps	SingleCellExperiment	TRUE	GenomicAlignments
findOverlaps	SingleCellExperiment	TRUE	GenomicAlignments
findOverlaps	SingleCellExperiment	TRUE	IRanges
flank	SingleCellExperiment	TRUE	SummarizedExperiment
follow	SingleCellExperiment	TRUE	SummarizedExperiment
follow	SingleCellExperiment	TRUE	SummarizedExperiment
follow	SingleCellExperiment	TRUE	SummarizedExperiment
granges	SingleCellExperiment	TRUE	SummarizedExperiment
head	SingleCellExperiment	TRUE	utils
horizontal_slot_names	SingleCellExperiment	TRUE	SummarizedExperiment
int_colData	SingleCellExperiment	TRUE	SingleCellExperiment
int_colData<-	SingleCellExperiment	TRUE	SingleCellExperiment
int_elementMetadata	SingleCellExperiment	TRUE	SingleCellExperiment
int_elementMetadata<-	SingleCellExperiment	TRUE	SingleCellExperiment
int_metadata	SingleCellExperiment	TRUE	SingleCellExperiment
int_metadata<-	SingleCellExperiment	TRUE	SingleCellExperiment
intersect	SingleCellExperiment	TRUE	GenomicRanges
intersect	SingleCellExperiment	TRUE	GenomicRanges
intersect	SingleCellExperiment	TRUE	S4Vectors
is.na	SingleCellExperiment	TRUE	S4Vectors
is.unsorted	SingleCellExperiment	TRUE	SummarizedExperiment
isDisjoint	SingleCellExperiment	TRUE	SummarizedExperiment
length	SingleCellExperiment	TRUE	SummarizedExperiment
lengths	SingleCellExperiment	TRUE	S4Vectors
logcounts	SingleCellExperiment	TRUE
logcounts<-	SingleCellExperiment	TRUE
mainExpName	SingleCellExperiment	TRUE	SingleCellExperiment
mainExpName<-	SingleCellExperiment	TRUE	SingleCellExperiment
match	SingleCellExperiment	TRUE	S4Vectors
match	SingleCellExperiment	TRUE	S4Vectors
match	SingleCellExperiment	TRUE	Biostrings
match	SingleCellExperiment	TRUE	Biostrings
mcols	SingleCellExperiment	TRUE	SummarizedExperiment
mcols<-	SingleCellExperiment	TRUE	SummarizedExperiment
merge	SingleCellExperiment	TRUE	S4Vectors
mergeROWS	SingleCellExperiment	TRUE	S4Vectors
metadata	SingleCellExperiment	TRUE	S4Vectors
metadata<-	SingleCellExperiment	TRUE	S4Vectors
mstack	SingleCellExperiment	TRUE	S4Vectors
names	SingleCellExperiment	TRUE	SummarizedExperiment
names<-	SingleCellExperiment	TRUE	SummarizedExperiment
narrow	SingleCellExperiment	TRUE	SummarizedExperiment
ncol	SingleCellExperiment	TRUE	SummarizedExperiment
nearest	SingleCellExperiment	TRUE	SummarizedExperiment
nearest	SingleCellExperiment	TRUE	SummarizedExperiment
nearest	SingleCellExperiment	TRUE	SummarizedExperiment
normcounts	SingleCellExperiment	TRUE
normcounts<-	SingleCellExperiment	TRUE
nrow	SingleCellExperiment	TRUE	SummarizedExperiment
objectVersion	SingleCellExperiment	TRUE	SingleCellExperiment
order	SingleCellExperiment	TRUE	SummarizedExperiment
overlapsAny	SingleCellExperiment	TRUE	IRanges
overlapsAny	SingleCellExperiment	TRUE	IRanges
overlapsAny	SingleCellExperiment	TRUE	IRanges
parallel_slot_names	SingleCellExperiment	TRUE	SingleCellExperiment
pcompare	SingleCellExperiment	TRUE	SummarizedExperiment
pcompare	SingleCellExperiment	TRUE	SummarizedExperiment
pcompare	SingleCellExperiment	TRUE	SummarizedExperiment
pcompare	SingleCellExperiment	TRUE	Biostrings
pcompare	SingleCellExperiment	TRUE	Biostrings
precede	SingleCellExperiment	TRUE	SummarizedExperiment
precede	SingleCellExperiment	TRUE	SummarizedExperiment
precede	SingleCellExperiment	TRUE	SummarizedExperiment
promoters	SingleCellExperiment	TRUE	SummarizedExperiment
ranges	SingleCellExperiment	TRUE	SummarizedExperiment
ranges<-	SingleCellExperiment	TRUE	SummarizedExperiment
rank	SingleCellExperiment	TRUE	SummarizedExperiment
rbind	SingleCellExperiment	TRUE	SingleCellExperiment
realize	SingleCellExperiment	TRUE	SummarizedExperiment
reducedDim	SingleCellExperiment	TRUE	SingleCellExperiment
reducedDim	SingleCellExperiment	TRUE	SingleCellExperiment
reducedDim	SingleCellExperiment	TRUE	SingleCellExperiment
reducedDim<-	SingleCellExperiment	TRUE	SingleCellExperiment
reducedDim<-	SingleCellExperiment	TRUE	SingleCellExperiment
reducedDim<-	SingleCellExperiment	TRUE	SingleCellExperiment
reducedDimNames	SingleCellExperiment	TRUE	SingleCellExperiment
reducedDimNames<-	SingleCellExperiment	TRUE	SingleCellExperiment
reducedDims	SingleCellExperiment	TRUE	SingleCellExperiment
reducedDims<-	SingleCellExperiment	TRUE	SingleCellExperiment
relist	SingleCellExperiment	TRUE	IRanges
rename	SingleCellExperiment	TRUE	S4Vectors
rep	SingleCellExperiment	TRUE	S4Vectors
rep.int	SingleCellExperiment	TRUE	S4Vectors
replaceROWS	SingleCellExperiment	TRUE	S4Vectors
resize	SingleCellExperiment	TRUE	SummarizedExperiment
restrict	SingleCellExperiment	TRUE	SummarizedExperiment
rev	SingleCellExperiment	TRUE	S4Vectors
rowData	SingleCellExperiment	TRUE	SingleCellExperiment
rowData<-	SingleCellExperiment	TRUE	SummarizedExperiment
ROWNAMES	SingleCellExperiment	TRUE	S4Vectors
rownames	SingleCellExperiment	TRUE	SummarizedExperiment
ROWNAMES<-	SingleCellExperiment	TRUE	S4Vectors
rowPair	SingleCellExperiment	TRUE	SingleCellExperiment
rowPair	SingleCellExperiment	TRUE	SingleCellExperiment
rowPair	SingleCellExperiment	TRUE	SingleCellExperiment
rowPair<-	SingleCellExperiment	TRUE	SingleCellExperiment
rowPair<-	SingleCellExperiment	TRUE	SingleCellExperiment
rowPair<-	SingleCellExperiment	TRUE	SingleCellExperiment
rowPairNames	SingleCellExperiment	TRUE	SingleCellExperiment
rowPairNames<-	SingleCellExperiment	TRUE	SingleCellExperiment
rowPairs	SingleCellExperiment	TRUE	SingleCellExperiment
rowPairs<-	SingleCellExperiment	TRUE	SingleCellExperiment
rowRanges	SingleCellExperiment	TRUE	SummarizedExperiment
rowRanges<-	SingleCellExperiment	TRUE	SummarizedExperiment
rowRanges<-	SingleCellExperiment	TRUE	SummarizedExperiment
rowRanges<-	SingleCellExperiment	TRUE	SummarizedExperiment
rowSubset	SingleCellExperiment	TRUE	SingleCellExperiment
rowSubset<-	SingleCellExperiment	TRUE	SingleCellExperiment
saveRDS	SingleCellExperiment	TRUE	SummarizedExperiment
selfmatch	SingleCellExperiment	TRUE	S4Vectors
seqinfo	SingleCellExperiment	TRUE	SummarizedExperiment
seqinfo<-	SingleCellExperiment	TRUE	SummarizedExperiment
seqlevelsInUse	SingleCellExperiment	TRUE	GenomeInfoDb
seqnames	SingleCellExperiment	TRUE	SummarizedExperiment
setdiff	SingleCellExperiment	TRUE	GenomicRanges
setdiff	SingleCellExperiment	TRUE	GenomicRanges
setdiff	SingleCellExperiment	TRUE	S4Vectors
setequal	SingleCellExperiment	TRUE	S4Vectors
shift	SingleCellExperiment	TRUE	SummarizedExperiment
shiftApply	SingleCellExperiment	TRUE	S4Vectors
show	SingleCellExperiment	TRUE	SingleCellExperiment
showAsCell	SingleCellExperiment	TRUE	SummarizedExperiment
sizeFactors	SingleCellExperiment	TRUE	SingleCellExperiment
sizeFactors<-	SingleCellExperiment	TRUE	SingleCellExperiment
sort	SingleCellExperiment	TRUE	SummarizedExperiment
split	SingleCellExperiment	TRUE	S4Vectors
split	SingleCellExperiment	TRUE	S4Vectors
split	SingleCellExperiment	TRUE	SummarizedExperiment
split	SingleCellExperiment	TRUE	rtracklayer
split	SingleCellExperiment	TRUE	S4Vectors
split<-	SingleCellExperiment	TRUE	IRanges
start	SingleCellExperiment	TRUE	SummarizedExperiment
start<-	SingleCellExperiment	TRUE	SummarizedExperiment
strand	SingleCellExperiment	TRUE	SummarizedExperiment
strand<-	SingleCellExperiment	TRUE	SummarizedExperiment
subset	SingleCellExperiment	TRUE	SummarizedExperiment
subsetByOverlaps	SingleCellExperiment	TRUE	IRanges
summary	SingleCellExperiment	TRUE	S4Vectors
table	SingleCellExperiment	TRUE	S4Vectors
tail	SingleCellExperiment	TRUE	utils
tapply	SingleCellExperiment	TRUE	IRanges
tapply	SingleCellExperiment	TRUE	IRanges
tapply	SingleCellExperiment	TRUE	IRanges
terminators	SingleCellExperiment	TRUE	SummarizedExperiment
tpm	SingleCellExperiment	TRUE
tpm<-	SingleCellExperiment	TRUE
transform	SingleCellExperiment	TRUE	S4Vectors
trim	SingleCellExperiment	TRUE	SummarizedExperiment
union	SingleCellExperiment	TRUE	GenomicRanges
union	SingleCellExperiment	TRUE	GenomicRanges
union	SingleCellExperiment	TRUE	S4Vectors
unique	SingleCellExperiment	TRUE	S4Vectors
updateObject	SingleCellExperiment	TRUE	SingleCellExperiment
values	SingleCellExperiment	TRUE	S4Vectors
values<-	SingleCellExperiment	TRUE	S4Vectors
vertical_slot_names	SingleCellExperiment	TRUE	SummarizedExperiment
weights	SingleCellExperiment	TRUE
weights<-	SingleCellExperiment	TRUE
width	SingleCellExperiment	TRUE	SummarizedExperiment
width<-	SingleCellExperiment	TRUE	SummarizedExperiment
window	SingleCellExperiment	TRUE	S4Vectors
window<-	SingleCellExperiment	TRUE	IRanges
with	SingleCellExperiment	TRUE	S4Vectors
xtabs	SingleCellExperiment	TRUE	S4Vectors
xtfrm	SingleCellExperiment	TRUE	S4Vectors

Biostrings

Biostrings is a package containing infrastructure for dealing with biological sequences, or strings.
These are represented as so called XString objects (or XStringSet objects if you want to include multiple sequences), where X can be DNA, RNA, AA or B and signifies the type of sequence(s) in the object.

library(Biostrings)

To create a DNAString object, we can use the DNAString() constructor function. To see how the function works, type ?DNAString.

(dna <- DNAString(x = "AGGCATAGA"))

9-letter DNAString object
seq: AGGCATAGA

The classes come with useful validity checks. For example, not all characters are allowed in a DNA string (but note that it can contain other bases than ‘A’, ‘C’, ‘G’ and ‘T’, following the IUPAC code of ambiguous bases).

alphabet(DNAString())

 [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" "."

IUPAC_CODE_MAP

     A      C      G      T      M      R      W      S      Y      K      V 
   "A"    "C"    "G"    "T"   "AC"   "AG"   "AT"   "CG"   "CT"   "GT"  "ACG" 
     H      D      B      N 
 "ACT"  "AGT"  "CGT" "ACGT"

## Invalid DNA string
DNAString(x = "EQI")

Error in .Call2("new_XString_from_CHARACTER", class(x0), string, start, : key 69 (char 'E') not in lookup table

alphabet(AAString())

 [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y"
[20] "V" "U" "O" "B" "J" "Z" "X" "*" "-" "+" "."

AAString(x = "EQI")

3-letter AAString object
seq: EQI

Once we have defined the object, we can perform operations like:

getting the reverse complement of the DNA sequence

reverseComplement(dna)

9-letter DNAString object
seq: TCTATGCCT

We could not do this with a regular character string

reverseComplement("AGGCATAGA")

Error: unable to find an inherited method for function 'reverseComplement' for signature 'x = "character"'

translating the DNA sequence to an amino acid sequence

translate(dna)

3-letter AAString object
seq: RHR

We can not translate an amino acid sequence:

translate(AAString(x = "EQI"))

Error: unable to find an inherited method for function 'translate' for signature 'x = "AAString"'

If we have more than one sequence, we can represent them in a DNAStringSet object, which is a collection of DNAString objects

(dna_multiple <- DNAStringSet(c(x = "AGGCATAGA", y = "GGTATGAG")))

DNAStringSet object of length 2:
    width seq                                               names               
[1]     9 AGGCATAGA                                         x
[2]     8 GGTATGAG                                          y

## Subsetting with [] gives back a DNAStringSet
dna_multiple[1]

DNAStringSet object of length 1:
    width seq                                               names               
[1]     9 AGGCATAGA                                         x

## Subsetting with [[]] extracts the individual objects
dna_multiple[[1]]

9-letter DNAString object
seq: AGGCATAGA

It is also easy to extract subsequences of either an XString or an XStringSet object

## Get the length of each string
width(dna_multiple)

[1] 9 8

## Extract the first three positions in each string
Biostrings::subseq(dna_multiple, start = 1, end = 3)

DNAStringSet object of length 2:
    width seq                                               names               
[1]     3 AGG                                               x
[2]     3 GGT                                               y

## Extract subsequences of different length
Biostrings::subseq(dna_multiple, start = 1, end = c(2, 5))

DNAStringSet object of length 2:
    width seq                                               names               
[1]     2 AG                                                x
[2]     5 GGTAT                                             y

Biostrings::subseq(dna_multiple, start = 1, 
                   end = width(dna_multiple) - 4)

DNAStringSet object of length 2:
    width seq                                               names               
[1]     5 AGGCA                                             x
[2]     4 GGTA                                              y

Exercises

Can you tell whether AATTGGCCRGGCCAATT is a valid DNA sequence?

Solution

Biostrings::DNAString("AATTGGCCRGGCCAATT")

17-letter DNAString object
seq: AATTGGCCRGGCCAATT

Genomic sequence information is often stored in fasta files. DNAStringSet objects can be created from such files by the readDNAStringSet() function from the Biostrings package. Download the gencode.v28.transcripts.1.1.10M.fa file from the course web page and read it into a DNAStringSet object using this function. What do you think these sequences represent?

Solution

txs <- Biostrings::readDNAStringSet("data/02/gencode.v28.transcripts.1.1.10M.fa")

How many sequences are there in this file?

Solution

length(txs)

[1] 1373

What are the lengths of the shortest and longest sequence?

Solution

range(width(txs))

[1]    59 11666

Extract the first 10 bases of each sequence

Solution

Biostrings::subseq(txs, start = 1, end = 10)

DNAStringSet object of length 1373:
       width seq                                            names               
   [1]    10 GTTAACTTGC                                     ENST00000456328.2...
   [2]    10 GTGTCTGACT                                     ENST00000450305.2...
   [3]    10 ATGGGAGCCG                                     ENST00000488147.1...
   [4]    10 TGTGGGAGAG                                     ENST00000619216.1...
   [5]    10 GTGCACACGG                                     ENST00000473358.1...
   ...   ... ...
[1369]    10 ATGATTCACT                                     ENST00000636827.2...
[1370]    10 AGCATATTCT                                     ENST00000578045.1...
[1371]    10 GCTAAAATAA                                     ENST00000413148.1...
[1372]    10 GTCCCGGCCC                                     ENST00000315901.4...
[1373]    10 GAGCCTCCGG                                     ENST00000294435.7...

Get the fraction of A, C, G and T bases in each sequence (hint: there is a function named alphabetFrequency). Plot a histogram of the fraction of Gs across all sequences

Solution

freqs <- alphabetFrequency(txs)
fracs <- sweep(freqs, MARGIN = 1, STATS = rowSums(freqs), FUN = "/")
hist(fracs[, "G"])

Annotation resources

Bioconductor provides a range of annotation resources, among them are:

Gene-based information for a given organism (OrgDb)
Transcript ranges for a given genome build (TxDb and EnsDb)
Genome sequence for a given genome build (BSgenome)

These annotation resources can be accessed in different ways:

Via annotation packages (updated every 6 months with the Bioconductor releases)
By pulling down information from online resources such as the Bioconductor AnnotationHub

As an example lets have a look at a human transcript annotation package TxDb.Hsapiens.UCSC.hg38.knownGene. As the name suggest, the annotation is derived from UCSC “knownGene” track for the hg38 genome build.

library(TxDb.Hsapiens.UCSC.hg38.knownGene)

?TxDb.Hsapiens.UCSC.hg38.knownGene

Two accessor functions are transcripts() and transcriptsBy(). Lets compare their output:

gr <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
gr

GRanges object with 278220 ranges and 2 metadata columns:
                      seqnames        ranges strand |     tx_id
                         <Rle>     <IRanges>  <Rle> | <integer>
       [1]                chr1   11869-14409      + |         1
       [2]                chr1   12010-13670      + |         2
       [3]                chr1   29554-31097      + |         3
       [4]                chr1   30267-31109      + |         4
       [5]                chr1   30366-30503      + |         5
       ...                 ...           ...    ... .       ...
  [278216] chrX_MU273397v1_alt 239036-260095      - |    278216
  [278217] chrX_MU273397v1_alt 272358-282686      - |    278217
  [278218] chrX_MU273397v1_alt 314193-316302      - |    278218
  [278219] chrX_MU273397v1_alt 314813-315236      - |    278219
  [278220] chrX_MU273397v1_alt 324527-324923      - |    278220
                     tx_name
                 <character>
       [1] ENST00000456328.2
       [2] ENST00000450305.2
       [3] ENST00000473358.1
       [4] ENST00000469289.1
       [5] ENST00000607096.1
       ...               ...
  [278216] ENST00000710260.1
  [278217] ENST00000710028.1
  [278218] ENST00000710030.1
  [278219] ENST00000710216.1
  [278220] ENST00000710031.1
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

grl <- transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene, by = "gene")
grl

GRangesList object of length 33131:
$`1`
GRanges object with 8 ranges and 2 metadata columns:
      seqnames            ranges strand |     tx_id           tx_name
         <Rle>         <IRanges>  <Rle> | <integer>       <character>
  [1]    chr19 58345178-58347634      - |    230133 ENST00000596924.1
  [2]    chr19 58345183-58353492      - |    230134 ENST00000263100.8
  [3]    chr19 58346854-58356225      - |    230135 ENST00000600123.5
  [4]    chr19 58346858-58353491      - |    230136 ENST00000595014.1
  [5]    chr19 58346860-58347657      - |    230137 ENST00000598345.1
  [6]    chr19 58348466-58362751      - |    230138 ENST00000599109.5
  [7]    chr19 58350594-58353129      - |    230139 ENST00000600966.1
  [8]    chr19 58353021-58356083      - |    230140 ENST00000596636.1
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

$`10`
GRanges object with 3 ranges and 2 metadata columns:
      seqnames            ranges strand |     tx_id           tx_name
         <Rle>         <IRanges>  <Rle> | <integer>       <character>
  [1]     chr8 18386311-18388323      + |    104998 ENST00000660285.1
  [2]     chr8 18391282-18401218      + |    105000 ENST00000286479.4
  [3]     chr8 18391287-18400993      + |    105001 ENST00000520116.1
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

...
<33129 more elements>

Functions like exons(), transcripts() or genes() return GRanges objects with a set of ranges representing the level of detail of the query.
The “xxxBy” functions, e.g. transcriptsBy() or exonsBy(), return GRangesList objects. They enable grouping of the output in terms of another genomic feature, in our case genes.

Transcript annotations and their functionality play an important role when analyzing gene expression data. The coming lectures on RNA-Seq will introduce more functionality in the context of examples.

Exercises

The OrgDb package org.Hs.eg.db provides human gene annotation. It offers a simple interface to an underlying SQLite database. The main accessor functions for this (and other SQLite) annotation objects are select(), columns() and keys().

Install the org.Hs.eg.db package, load it and check its help page. Detailed examples on how to work with these annotation objects can be found on the linked AnnotationDb class help page.

Solution

library(org.Hs.eg.db)

?org.Hs.eg.db

What kind of data can be retrieved from the org.Hs.eg.db object? Check the columns() function!

Solution

columns(org.Hs.eg.db)

 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
[21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[26] "UNIPROT"

Use the select() function to retrieve the gene names as well as ENTREZ and ENSEMBL gene IDs for the three genes: RAF1, FLT3 and MAPK1.

Solution

select(org.Hs.eg.db, keys = c("RAF1", "FLT3", "MAPK1"), 
       keytype = "SYMBOL", columns = c("ENTREZID", "ENSEMBL", "GENENAME"))

'select()' returned 1:1 mapping between keys and columns

  SYMBOL ENTREZID         ENSEMBL                                      GENENAME
1   RAF1     5894 ENSG00000132155 Raf-1 proto-oncogene, serine/threonine kinase
2   FLT3     2322 ENSG00000122025        fms related receptor tyrosine kinase 3
3  MAPK1     5594 ENSG00000100030            mitogen-activated protein kinase 1

What is Bioconductor?

How to install Bioconductor packages

How to get help with Bioconductor packages

Standard Bioconductor data structures

What types of data do we have to deal with?

Why use standardized classes?

GenomicRanges

Exercises

Lists of genomic ranges

SummarizedExperiment

Exercises

Class inheritance

Biostrings

Exercises

Annotation resources

Exercises

Further reading