Introduction to Bioconductor and standard classes
What is Bioconductor?
- The project was started in 2001 (back then, mostly focused on microarray data)
- Currently (release 3.20) hosts:
- 2289 software packages
- 928 annotation packages
- 431 experiment data packages
- 30 workflows
 
- Covers a wide range of domains:
- sequencing (RNA-seq, ChIP-seq, single-cell analysis, variant calling, …)
- microarrays (methylation, gene expression, copy number, …)
- flow and mass cytometry
- proteomics
- imaging
- …
 
- There is a core team which, among other things, oversees the project, coordinates review of new package submissions, develops and maintains core infrastructure and maintains daily building and checking of all packages. Most packages, however, are contributed by the community.
- There is an active Bioconductor community, where you can find help, support and people with shared interests. For example, check the website, join the slack workspace, or attend one of the yearly conferences (BioC, EuroBioC, Bioc Asia).
How to install Bioconductor packages
- Detailed instructions are available online.
- There is a new Bioconductor release twice per year (typically April and October).
- Between those times, development is going on in the parallel devel branch.
- Do not mix packages from different releases!
- The available Bioconductor releases also depend on your version of R - if you want the latest Bioconductor packages, you need the latest version of R.
- Bioconductor packages are installed using the BiocManagerCRAN package (note that this works also for packages hosted on CRAN or GitHub):
install.packages("BiocManager")
BiocManager::install("limma")- We can check the validity of our Bioconductor installation with BiocManager(e.g., to see if there are outdated packages):
BiocManager::valid()- We can also see which version is currently installed:
BiocManager::version()[1] '3.20'- BiocManagercan be used to search for available packages. Here, we look for package names containing the word “Mmusculus” (assuming that we are interested in looking for packages containing annotation information for mouse):
BiocManager::available(pattern = "Mmusculus") [1] "BSgenome.Mmusculus.UCSC.mm10"        "BSgenome.Mmusculus.UCSC.mm10.masked"
 [3] "BSgenome.Mmusculus.UCSC.mm39"        "BSgenome.Mmusculus.UCSC.mm8"        
 [5] "BSgenome.Mmusculus.UCSC.mm8.masked"  "BSgenome.Mmusculus.UCSC.mm9"        
 [7] "BSgenome.Mmusculus.UCSC.mm9.masked"  "EnsDb.Mmusculus.v75"                
 [9] "EnsDb.Mmusculus.v79"                 "PWMEnrich.Mmusculus.background"     
[11] "TxDb.Mmusculus.UCSC.mm10.ensGene"    "TxDb.Mmusculus.UCSC.mm10.knownGene" 
[13] "TxDb.Mmusculus.UCSC.mm39.knownGene"  "TxDb.Mmusculus.UCSC.mm39.refGene"   
[15] "TxDb.Mmusculus.UCSC.mm9.knownGene"  - Each Bioconductor package has a landing page (e.g., https://bioconductor.org/packages/limma/) that contains relevant information, including installation instructions, vignettes and the reference manual:
How to get help with Bioconductor packages
- Each function has a help page, detailing all arguments and outputs, and containing runnable examples.
library(limma)
?lmFit- Each package comes with at least one vignette, outlining the intended use of the functions in the package to analyze data. The vignettes can be browsed from the landing page, or directly in your R session, e.g. with
browseVignettes(package = "Biostrings")- If you need help with a Bioconductor package:
- Check the function help pages and the vignette(s) to see whether the answer is there (some packages, like DESeq2, have FAQ sections in their vignette)
- Go to https://support.bioconductor.org/
- Search to see whether your question has already been asked (and answered)
- If not, look through the Posting guide and post your question following these guidelines
- If you follow these steps, you will often get a rapid and informative answer
- The slack workspace has channels covering many topics related to Bioconductor
 
- The Workflows are complete walkthroughs of specific types of analyses, typically making use of multiple (Bioconductor) packages
Standard Bioconductor data structures
- Every object in R is of a particular data type.
- You have already seen some data types, e.g. atomic vectors (numeric,characteretc),list,matrix,data.frame,factor.
- (Almost) all objects in R have a class. A class is a formal description of the data type, its properties, functionality and relation to other types .
- The classfunction tells you which class(es) an object is of:
class(c("one", "two", "three"))[1] "character"class(matrix(1:6, 2, 3))[1] "matrix" "array" - The class of an object defines the set of allowed operations.
relevel(c("one", "two", "three"), ref = "one")Error in relevel.default(c("one", "two", "three"), ref = "one"): 'relevel' only for (unordered) factorsrelevel(factor(c("one", "two", "three")), ref = "two")[1] one   two   three
Levels: two one threec("1", "2", "3") + 1Error in c("1", "2", "3") + 1: non-numeric argument to binary operatoras.numeric(c("1", "2", "3")) + 1[1] 2 3 4- When working with (biological) data, it often makes sense to define more complex data structures, with several building blocks, and to define appropriate methods for each such class.
What types of data do we have to deal with?
- Sequences (strings)
- Alignments
- ‘Rectangular’ data (matrices of numeric values for a set of features and samples)
- Genomic ranges (intervals)
- Genomic variants
- Annotation databases
- Images
- …
For each of these data types, Bioconductor provides standard classes.
There are often many ways of creating an object of a given class, including:
- reading data from a file (e.g. the output from an instrument) using specialized import functions from Bioconductor or other sources. This is perhaps the most common way of creating objects, and we will see many examples during the course. For example, think of how you can generate a data.frameby reading data from a text file withread.delim().
- loading processed data or annotations provided via (Bioconductor) packages.
- directly creating the object using a constructor function (which often has the same name as the object class). In this case, all the information that is needed for a particular type of object must be provided manually. This is the easiest way to create small example objects, and we will see examples of this below. This is also how the small objects in the previous section were created.
Why use standardized classes?
- Interoperability - having standard ways of storing and moving data between packages means we don’t have to implement the same thing in multiple places. Moreover, it allows us as developers to write functions that are applicable to a very broad set of data objects, since we can make assumptions about the content and structure of these objects (see for example the iSEE package).
- Recognition - you don’t have to learn a new representation for each package.
- Robustness - standard classes are efficiently implemented and well tested, and come with additional functions to check the validity of provided data.
As an example, all these Bioconductor packages make use of the SummarizedExperiment class:
In the following we will examine a few standard Bioconductor classes in more detail. Of course, there are many more, and some of these will be covered in later lectures.
GenomicRanges
- The GenomicRangespackage holds infrastructure for working with (genomic) ranges.
- These can represent, for example, gene coordinates, ChIP-seq peaks, promoters, SNPs, CpG islands.
- A GRanges(genomic ranges) object contains, for each range, information about- the chromosome (called sequence)
- the interval (start and end)
- the strand (either +,-or*)
 
- The interval is, in turn, represented by a so called IRanges(interval ranges) object.
- Many of the operations shown below apply equally well to IRangesobjects.
library(GenomicRanges)- Let’s create a small IRangesobject. We can specify any two of the start, end and width of the ranges, and the third one will be inferred.
(ir <- IRanges(start = c(3, 10, 50), 
               end = c(17, 28, 104)))IRanges object with 3 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         3        17        15
  [2]        10        28        19
  [3]        50       104        55## Equivalent
(ir <- IRanges(start = c(3, 10, 50), 
               width = c(15, 19, 55)))IRanges object with 3 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         3        17        15
  [2]        10        28        19
  [3]        50       104        55- The IRangesobject just represents ranges on the number line. There is no specifically genomic information in there - that’s added when we construct aGRangesobject
(gr <- GRanges(seqnames = c("chr1", "chr1", "chr2"), 
               ranges = ir,
               strand = c("+", "+", "-")))GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  [3]     chr2    50-104      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths- Many operations are defined on the ranges objects. As a first example, it is easy to extract the individual components of the object using accessor functions
start(gr)[1]  3 10 50end(gr)[1]  17  28 104width(gr)[1] 15 19 55- We can also get and set information about the reference sequences - there are many different conventions for naming chromosomes!
seqlevelsStyle(gr)[1] "UCSC"seqlevelsStyle(gr) <- "Ensembl"
grGRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1      3-17      +
  [2]        1     10-28      +
  [3]        2    50-104      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengthsseqlevelsStyle(gr) <- "UCSC"- There are many ways of subsetting ranges objects (check out methods(class = "GRanges")). For example, we can use the regular[function (the authors of theGRangesclass has defined what this function means for such an object), or we can keep only records on specific chromosomes.
gr[1:2]GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengthsdropSeqlevels(gr, "chr1", pruning.mode = "coarse")GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2    50-104      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengthskeepSeqlevels(gr, "chr1", pruning.mode = "coarse")GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengthsgr[seqnames(gr) == "chr1"]GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengthskeepStandardChromosomes(gr, pruning.mode = "coarse")GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  [3]     chr2    50-104      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths- Many other operations can be applied to each of the ranges in a GRangesorIRangesobject, such as extending all ranges with a certain number of bases in each end
gr + 5GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     -2-22      +
  [2]     chr1      5-33      +
  [3]     chr2    45-109      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengthsNote how this led to a negative start coordinate for the first range. This is not considered a violation since there is no information in the GRanges object which will tell R where the chromosomes (sequences) start and end. To accommodate this, let’s set the sequence lengths to 100 and 200 nt, respectively.
seqlengths(gr)chr1 chr2 
  NA   NA seqlengths(gr) <- c(chr1 = 100, chr2 = 200)
seqlengths(gr)chr1 chr2 
 100  200 gr + 5Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence chr1.
  Note that ranges located on a sequence whose length is unknown (NA) or
  on a circular sequence are not considered out-of-bound (use
  seqlengths() and isCircular() to get the lengths and circularity flags
  of the underlying sequences). You can use trim() to trim these ranges.
  See ?`trim,GenomicRanges-method` for more information.GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     -2-22      +
  [2]     chr1      5-33      +
  [3]     chr2    45-109      -
  -------
  seqinfo: 2 sequences from an unspecified genomeNote how we now get a warning that one of the ranges is extending outside of the reference sequences. We can further use the trim() function to remove all parts of ranges that fall outside the boundaries of the reference sequences (this would not do anything if we hadn’t set the sequence lengths).
trim(gr + 5)Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence chr1.
  Note that ranges located on a sequence whose length is unknown (NA) or
  on a circular sequence are not considered out-of-bound (use
  seqlengths() and isCircular() to get the lengths and circularity flags
  of the underlying sequences). You can use trim() to trim these ranges.
  See ?`trim,GenomicRanges-method` for more information.GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      1-22      +
  [2]     chr1      5-33      +
  [3]     chr2    45-109      -
  -------
  seqinfo: 2 sequences from an unspecified genomeWe can also set a name for the genome that the ranges are supposed to come from. This can be useful e.g. when comparing different sets of ranges, to make sure that they all correspond to the same underlying genome.
genome(gr) <- "hg19"
grGRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      3-17      +
  [2]     chr1     10-28      +
  [3]     chr2    50-104      -
  -------
  seqinfo: 2 sequences from hg19 genomeOther useful operations on single ranges include:
- shifting
shift(gr, 5)GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      8-22      +
  [2]     chr1     15-33      +
  [3]     chr2    55-109      -
  -------
  seqinfo: 2 sequences from hg19 genome- obtaining flanking sequence. Note how the strand information is taken into account.
flank(gr, 2, start = TRUE, both = FALSE)GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1       1-2      +
  [2]     chr1       8-9      +
  [3]     chr2   105-106      -
  -------
  seqinfo: 2 sequences from hg19 genome- finding the midpoint of each range
resize(gr, width = 1, fix = "center")GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1        10      +
  [2]     chr1        19      +
  [3]     chr2        77      -
  -------
  seqinfo: 2 sequences from hg19 genomeIn addition to the intra-range operations, there are also many useful operations that involve multiple ranges. For example, we can:
- find overlaps between ranges
gr2 <- GRanges(seqnames = c("chr2", "chr1"),
               ranges = IRanges(start = c(90, 5), 
                                end = c(100, 15)),
               strand = c("-", "+"))
findOverlaps(query = gr, subject = gr2)Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           2
  [2]         2           2
  [3]         3           1
  -------
  queryLength: 3 / subjectLength: 2- GRangesobjects can have associated metadata. This is provided in the- mcols()slot of the object.
mcols(gr)$score <- 1:3
grGRanges object with 3 ranges and 1 metadata column:
      seqnames    ranges strand |     score
         <Rle> <IRanges>  <Rle> | <integer>
  [1]     chr1      3-17      + |         1
  [2]     chr1     10-28      + |         2
  [3]     chr2    50-104      - |         3
  -------
  seqinfo: 2 sequences from hg19 genome## Extract only the metadata
mcols(gr)DataFrame with 3 rows and 1 column
      score
  <integer>
1         1
2         2
3         3## We can also supply the metadata when creating the object
GRanges(seqnames = c("chr2", "chr1"),
        ranges = IRanges(start = c(90, 5), 
                         end = c(100, 15)),
        strand = c("-", "+"),
        extra_column = c("A", "B"))GRanges object with 2 ranges and 1 metadata column:
      seqnames    ranges strand | extra_column
         <Rle> <IRanges>  <Rle> |  <character>
  [1]     chr2    90-100      - |            A
  [2]     chr1      5-15      + |            B
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengthsExercises
In practice, the locations of known genomic features (exons, genes, transcripts, UTRs, …) along the genome are often stored in so called gtf (gene transfer format) files. These files can be imported into a GRanges object using the import() function from the rtracklayer Bioconductor package.
- Download the gencode.v28.annotation.1.1.10M.gtffile from the course web page, and read it into R in this way.
library(rtracklayer)
Attaching package: 'rtracklayer'The following object is masked from 'package:igraph':
    blocksgtf <- rtracklayer::import("data/02/gencode.v28.annotation.1.1.10M.gtf")- Confirm that the gtfobject is indeed aGRangesobject
class(gtf)[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"- How many records are there in the object?
length(gtf)[1] 17955- From which chromosomes do the records come?
seqlevels(gtf)[1] "chr1"- Use the subsetfunction to extract only the records for theCHD5gene. How many annotated transcripts does this gene have?
gtfsub <- subset(gtf, gene_name == "CHD5")
length(subset(gtfsub, type == "transcript"))[1] 6Lists of genomic ranges
- In some cases, ranges come in groups - for example, exons within a transcript.
- In these situations, it often makes sense to represent each group of ranges as a GRangesobject, and the full set of groups as a list of such ranges.
- The GRangesListclass covers these situations.
- All ranges objects share the same genome information
grl <- GRangesList(
    tx1.1 = GRanges(seqnames = "chr1",
                    ranges = IRanges(start = c(50, 125, 188),
                                     end = c(75, 130, 257)),
                    strand = "+",
                    symbol = "tx1.1", transcript = "tx1.1", 
                    gene = "gene1", exon = paste0("ex1.", 1:3)),
    tx1.2 = GRanges(seqnames = "chr1",
                    ranges = IRanges(start = c(50, 125, 200),
                                     end = c(75, 130, 227)),
                    strand = "+",
                    symbol = "tx1.2", transcript = "tx1.2", 
                    gene = "gene1", exon = paste0("ex1.", c(1, 2, 4))),
    tx2.1 = GRanges(seqnames = "chr2",
                    ranges = IRanges(start = c(289, 318),
                                     end = c(300, 350)),
                    strand = "-",
                    symbol = "tx2.1", transcript = "tx2.1",
                    gene = "gene2", exon = paste0("ex2.", 1:2)))
seqlengths(grl) <- c(chr1 = 1000, chr2 = 500)
grlGRangesList object of length 3:
$tx1.1
GRanges object with 3 ranges and 4 metadata columns:
      seqnames    ranges strand |      symbol  transcript        gene
         <Rle> <IRanges>  <Rle> | <character> <character> <character>
  [1]     chr1     50-75      + |       tx1.1       tx1.1       gene1
  [2]     chr1   125-130      + |       tx1.1       tx1.1       gene1
  [3]     chr1   188-257      + |       tx1.1       tx1.1       gene1
             exon
      <character>
  [1]       ex1.1
  [2]       ex1.2
  [3]       ex1.3
  -------
  seqinfo: 2 sequences from an unspecified genome
$tx1.2
GRanges object with 3 ranges and 4 metadata columns:
      seqnames    ranges strand |      symbol  transcript        gene
         <Rle> <IRanges>  <Rle> | <character> <character> <character>
  [1]     chr1     50-75      + |       tx1.2       tx1.2       gene1
  [2]     chr1   125-130      + |       tx1.2       tx1.2       gene1
  [3]     chr1   200-227      + |       tx1.2       tx1.2       gene1
             exon
      <character>
  [1]       ex1.1
  [2]       ex1.2
  [3]       ex1.4
  -------
  seqinfo: 2 sequences from an unspecified genome
$tx2.1
GRanges object with 2 ranges and 4 metadata columns:
      seqnames    ranges strand |      symbol  transcript        gene
         <Rle> <IRanges>  <Rle> | <character> <character> <character>
  [1]     chr2   289-300      - |       tx2.1       tx2.1       gene2
  [2]     chr2   318-350      - |       tx2.1       tx2.1       gene2
             exon
      <character>
  [1]       ex2.1
  [2]       ex2.2
  -------
  seqinfo: 2 sequences from an unspecified genomeThe Gviz package can be used to generate a visualization of the first two transcripts (which belong to the same gene):
library(Gviz)
g1 <- GeneRegionTrack(grl$tx1.1, showId = TRUE, col = NULL,
                      fill = "#377EB8", name = "",
                      background.title = "transparent",
                      col.title = "black")
g2 <- GeneRegionTrack(grl$tx1.2, showId = TRUE, col = NULL,
                      fill = "#E41A1C", name = "",
                      background.title = "transparent",
                      col.title = "black")
gtr <- GenomeAxisTrack()
tracks <- c(gtr, g1, g2)
plotTracks(tracks)SummarizedExperiment
- The purpose of the SummarizedExperimentclass (provided in the package with the same name) is to hold ‘rectangular’ data, together with annotations and metadata for the rows and columns. In other words, all the data associated with an experiment!
- As we already saw above, this class (or extensions like SingleCellExperimentorDESeqDataSet) is widely used throughout the Bioconductor ecosystem, e.g., for representing RNA-seq, ChIP-seq, methylation and mass cytometry data.
- The SummarizedExperimentobject consists of three types of ‘tables’, denoted- assays (actually a list of matrices of the same size, containing the observed data, e.g., gene expression values)
- rowData (annotations for the rows, i.e., the features). This can be replaced by rowRanges (a GRangesobject) that can also hold genomic range information, e.g., the locations of genes
- colData (annotations for the columns, i.e., the samples)
 
library(SummarizedExperiment)se <- SummarizedExperiment(
        assays = list(exprs = matrix(10 * runif(15), 5, 3)),
        colData = DataFrame(sample =  paste0("Sample", 1:3), 
                            condition = c("A", "A", "B")),
        rowData = DataFrame(gene = paste0("G", 1:5)))
rownames(se) <- rowData(se)$gene
colnames(se) <- colData(se)$sample
seclass: SummarizedExperiment 
dim: 5 3 
metadata(0):
assays(1): exprs
rownames(5): G1 G2 G3 G4 G5
rowData names(1): gene
colnames(3): Sample1 Sample2 Sample3
colData names(2): sample condition- The SummarizedExperimentclass has accessor functions to extract the individual components.
assayNames(se)[1] "exprs"assay(se, "exprs")    Sample1   Sample2  Sample3
G1 5.862581 8.7089339 0.199966
G2 7.627656 1.3942255 1.463375
G3 3.897536 7.9219906 3.722251
G4 5.430816 2.0393493 9.654674
G5 7.732064 0.5613301 1.532833colData(se)DataFrame with 3 rows and 2 columns
             sample   condition
        <character> <character>
Sample1     Sample1           A
Sample2     Sample2           A
Sample3     Sample3           BrowData(se)DataFrame with 5 rows and 1 column
          gene
   <character>
G1          G1
G2          G2
G3          G3
G4          G4
G5          G5- There are also convenient subsetting functions. Note how all the relevant parts of the objects are subset!
sesub <- se[1:2, ]
assay(sesub, "exprs")    Sample1  Sample2  Sample3
G1 5.862581 8.708934 0.199966
G2 7.627656 1.394225 1.463375colData(sesub)DataFrame with 3 rows and 2 columns
             sample   condition
        <character> <character>
Sample1     Sample1           A
Sample2     Sample2           A
Sample3     Sample3           BrowData(sesub)DataFrame with 2 rows and 1 column
          gene
   <character>
G1          G1
G2          G2- If a GRangesobject is used to represent the row annotations, themcols()slot of theGRangesobject is returned byrowData(se).
se <- SummarizedExperiment(
    assays = list(exprs = matrix(10 * runif(15), 5, 3)),
    colData = DataFrame(sample =  paste0("Sample", 1:3), 
                        condition = c("A", "A", "B")),
    rowRanges = GRanges(seqnames = "chr1",
                        ranges = IRanges(start = 100 * runif(5),
                                         width = 50),
                        strand = "+",
                        gene = paste0("G", 1:5)))
class(se)[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"rowData(se) DataFrame with 5 rows and 1 column
         gene
  <character>
1          G1
2          G2
3          G3
4          G4
5          G5rowRanges(se)GRanges object with 5 ranges and 1 metadata column:
      seqnames    ranges strand |        gene
         <Rle> <IRanges>  <Rle> | <character>
  [1]     chr1    60-109      + |          G1
  [2]     chr1      3-52      + |          G2
  [3]     chr1    75-124      + |          G3
  [4]     chr1     28-77      + |          G4
  [5]     chr1    55-104      + |          G5
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengthsExercises
Here we will explore a real RNA-seq data set, which is provided in the airway Bioconductor package and stored in a RangedSummarizedExperiment object.
- Load the package and attach the data set.
library(airway)
data(airway)
airwayclass: RangedSummarizedExperiment 
dim: 63677 8 
metadata(1): ''
assays(1): counts
rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
  ENSG00000273493
rowData names(10): gene_id gene_name ... seq_coord_system symbol
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample- What type of data value are available in this object? How many samples are there? How many features?
assayNames(airway)[1] "counts"dim(airway)[1] 63677     8- What is the total number of sequencing reads assigned to genes for each sample?
colSums(assay(airway, "counts"))SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 
  20637971   18809481   25348649   15163415   24448408   30818215   19126151 
SRR1039521 
  21164133 Class inheritance
- Classes are usually organized into a hierarchy by class inheritance. 
- Basically, it involves taking a ‘standard’ base class that has some of the properties you need, and extending or building upon it to create a new class that fits your purpose. 
- The new class will inherit the attributes and methods from the original class, and thus you don’t need to redefine them for your new class. 
- One example from Bioconductor is the - SingleCellExperimentclass, which extends the- SummarizedExperimentclass by, among a few other things, allowing the inclusion of dimension reduction results (e.g. PCA, tSNE, or UMAP).
library(SingleCellExperiment)getClass("SingleCellExperiment")Class "SingleCellExperiment" [package "SingleCellExperiment"]
Slots:
                                                                
Name:           int_elementMetadata                  int_colData
Class:                    DataFrame                    DataFrame
                                                                
Name:                  int_metadata                    rowRanges
Class:                         list GenomicRanges_OR_GRangesList
                                                                
Name:                       colData                       assays
Class:                    DataFrame               Assays_OR_NULL
                                                                
Name:                         NAMES              elementMetadata
Class:            character_OR_NULL                    DataFrame
                                   
Name:                      metadata
Class:                         list
Extends: 
Class "RangedSummarizedExperiment", directly
Class "SummarizedExperiment", by class "RangedSummarizedExperiment", distance 2
Class "RectangularData", by class "RangedSummarizedExperiment", distance 3
Class "Vector", by class "RangedSummarizedExperiment", distance 3
Class "Annotated", by class "RangedSummarizedExperiment", distance 4
Class "vector_OR_Vector", by class "RangedSummarizedExperiment", distance 4getClass("RangedSummarizedExperiment")Class "RangedSummarizedExperiment" [package "SummarizedExperiment"]
Slots:
                                                                
Name:                     rowRanges                      colData
Class: GenomicRanges_OR_GRangesList                    DataFrame
                                                                
Name:                        assays                        NAMES
Class:               Assays_OR_NULL            character_OR_NULL
                                                                
Name:               elementMetadata                     metadata
Class:                    DataFrame                         list
Extends: 
Class "SummarizedExperiment", directly
Class "RectangularData", by class "SummarizedExperiment", distance 2
Class "Vector", by class "SummarizedExperiment", distance 2
Class "Annotated", by class "SummarizedExperiment", distance 3
Class "vector_OR_Vector", by class "SummarizedExperiment", distance 3- Due to inheritance it might not be immediately clear what methods are defined for a given class. For most Bioconductor classes (all so called S4 classes) one way to find out is via the showMethods()function.
showMethods(class = "SingleCellExperiment")- A more comprehensive output however is provided by the slooppackage. Itss4_methods_class()function also lists the source of inherited methods (make sure that theSingleCellExperimentpackage has been loaded into your R session):
library(sloop)
s4_methods_class("SingleCellExperiment")| generic | class | visible | source | 
|---|---|---|---|
| != | SingleCellExperiment | TRUE | S4Vectors | 
| != | SingleCellExperiment | TRUE | S4Vectors | 
| != | SingleCellExperiment | TRUE | S4Vectors | 
| [ | SingleCellExperiment | TRUE | SingleCellExperiment | 
| [[ | SingleCellExperiment | TRUE | SummarizedExperiment | 
| [[<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| [<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| [<- | SingleCellExperiment | TRUE | S4Vectors | 
| %in% | SingleCellExperiment | TRUE | S4Vectors | 
| %in% | SingleCellExperiment | TRUE | S4Vectors | 
| %in% | SingleCellExperiment | TRUE | S4Vectors | 
| < | SingleCellExperiment | TRUE | S4Vectors | 
| < | SingleCellExperiment | TRUE | S4Vectors | 
| < | SingleCellExperiment | TRUE | S4Vectors | 
| <= | SingleCellExperiment | TRUE | S4Vectors | 
| <= | SingleCellExperiment | TRUE | S4Vectors | 
| <= | SingleCellExperiment | TRUE | S4Vectors | 
| == | SingleCellExperiment | TRUE | S4Vectors | 
| == | SingleCellExperiment | TRUE | S4Vectors | 
| == | SingleCellExperiment | TRUE | S4Vectors | 
| > | SingleCellExperiment | TRUE | S4Vectors | 
| > | SingleCellExperiment | TRUE | S4Vectors | 
| > | SingleCellExperiment | TRUE | S4Vectors | 
| >= | SingleCellExperiment | TRUE | S4Vectors | 
| >= | SingleCellExperiment | TRUE | S4Vectors | 
| >= | SingleCellExperiment | TRUE | S4Vectors | 
| $ | SingleCellExperiment | TRUE | SummarizedExperiment | 
| $<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| aggregate | SingleCellExperiment | TRUE | S4Vectors | 
| altExp | SingleCellExperiment | TRUE | SingleCellExperiment | 
| altExp | SingleCellExperiment | TRUE | SingleCellExperiment | 
| altExp | SingleCellExperiment | TRUE | SingleCellExperiment | 
| altExp<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| altExp<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| altExp<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| altExpNames | SingleCellExperiment | TRUE | SingleCellExperiment | 
| altExpNames<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| altExps | SingleCellExperiment | TRUE | SingleCellExperiment | 
| altExps<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| anyDuplicated | SingleCellExperiment | TRUE | S4Vectors | 
| anyNA | SingleCellExperiment | TRUE | S4Vectors | 
| append | SingleCellExperiment | TRUE | S4Vectors | 
| as.character | SingleCellExperiment | TRUE | S4Vectors | 
| as.complex | SingleCellExperiment | TRUE | S4Vectors | 
| as.data.frame | SingleCellExperiment | TRUE | S4Vectors | 
| as.env | SingleCellExperiment | TRUE | S4Vectors | 
| as.integer | SingleCellExperiment | TRUE | S4Vectors | 
| as.list | SingleCellExperiment | TRUE | S4Vectors | 
| as.logical | SingleCellExperiment | TRUE | S4Vectors | 
| as.matrix | SingleCellExperiment | TRUE | S4Vectors | 
| as.numeric | SingleCellExperiment | TRUE | S4Vectors | 
| as.raw | SingleCellExperiment | TRUE | S4Vectors | 
| assay | SingleCellExperiment | TRUE | SummarizedExperiment | 
| assay | SingleCellExperiment | TRUE | SummarizedExperiment | 
| assay | SingleCellExperiment | TRUE | SummarizedExperiment | 
| assay<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| assay<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| assay<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| assayNames | SingleCellExperiment | TRUE | SummarizedExperiment | 
| assayNames<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| assays | SingleCellExperiment | TRUE | SummarizedExperiment | 
| assays<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| assays<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| bindROWS | SingleCellExperiment | TRUE | S4Vectors | 
| by | SingleCellExperiment | TRUE | S4Vectors | 
| c | SingleCellExperiment | TRUE | S4Vectors | 
| cbind | SingleCellExperiment | TRUE | SingleCellExperiment | 
| coerce | SingleCellExperiment | TRUE | methods | 
| coerce | SingleCellExperiment | TRUE | methods | 
| coerce | SingleCellExperiment | TRUE | SummarizedExperiment | 
| coerce | SingleCellExperiment | TRUE | methods | 
| coerce | SingleCellExperiment | TRUE | methods | 
| coerce | SingleCellExperiment | TRUE | SummarizedExperiment | 
| coerce | SingleCellExperiment | TRUE | SingleCellExperiment | 
| coerce | SingleCellExperiment | TRUE | methods | 
| coerce | SingleCellExperiment | TRUE | methods | 
| coerce | SingleCellExperiment | TRUE | S4Vectors | 
| coerce | SingleCellExperiment | TRUE | S4Vectors | 
| coerce | SingleCellExperiment | TRUE | S4Vectors | 
| coerce | SingleCellExperiment | TRUE | S4Vectors | 
| coerce | SingleCellExperiment | TRUE | S4Vectors | 
| coerce | SingleCellExperiment | TRUE | S4Vectors | 
| coerce | SingleCellExperiment | TRUE | S4Vectors | 
| coerce | SingleCellExperiment | TRUE | S4Vectors | 
| coerce | SingleCellExperiment | TRUE | S4Vectors | 
| coerce | SingleCellExperiment | TRUE | S4Vectors | 
| coerce | SingleCellExperiment | TRUE | S4Vectors | 
| coerce | SingleCellExperiment | TRUE | IRanges | 
| coerce<- | SingleCellExperiment | TRUE | S4Vectors | 
| coerce<- | SingleCellExperiment | TRUE | S4Vectors | 
| coerce<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| colData | SingleCellExperiment | TRUE | SingleCellExperiment | 
| colData<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| colData<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| colLabels | SingleCellExperiment | TRUE | SingleCellExperiment | 
| colLabels<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| colnames | SingleCellExperiment | TRUE | SummarizedExperiment | 
| colPair | SingleCellExperiment | TRUE | SingleCellExperiment | 
| colPair | SingleCellExperiment | TRUE | SingleCellExperiment | 
| colPair | SingleCellExperiment | TRUE | SingleCellExperiment | 
| colPair<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| colPair<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| colPair<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| colPairNames | SingleCellExperiment | TRUE | SingleCellExperiment | 
| colPairNames<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| colPairs | SingleCellExperiment | TRUE | SingleCellExperiment | 
| colPairs<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| combineCols | SingleCellExperiment | TRUE | SingleCellExperiment | 
| combineRows | SingleCellExperiment | TRUE | SummarizedExperiment | 
| Compare | SingleCellExperiment | TRUE | SummarizedExperiment | 
| Compare | SingleCellExperiment | TRUE | SummarizedExperiment | 
| Compare | SingleCellExperiment | TRUE | SummarizedExperiment | 
| countOverlaps | SingleCellExperiment | TRUE | IRanges | 
| countOverlaps | SingleCellExperiment | TRUE | IRanges | 
| countOverlaps | SingleCellExperiment | TRUE | IRanges | 
| counts | SingleCellExperiment | TRUE | |
| counts<- | SingleCellExperiment | TRUE | |
| coverage | SingleCellExperiment | TRUE | SummarizedExperiment | 
| cpm | SingleCellExperiment | TRUE | |
| cpm<- | SingleCellExperiment | TRUE | |
| dim | SingleCellExperiment | TRUE | S4Vectors | 
| dimnames | SingleCellExperiment | TRUE | S4Vectors | 
| dimnames<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| dimnames<- | SingleCellExperiment | TRUE | S4Vectors | 
| dimnames<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| disjointBins | SingleCellExperiment | TRUE | SummarizedExperiment | 
| distance | SingleCellExperiment | TRUE | SummarizedExperiment | 
| distance | SingleCellExperiment | TRUE | SummarizedExperiment | 
| distance | SingleCellExperiment | TRUE | SummarizedExperiment | 
| distanceToNearest | SingleCellExperiment | TRUE | SummarizedExperiment | 
| distanceToNearest | SingleCellExperiment | TRUE | SummarizedExperiment | 
| distanceToNearest | SingleCellExperiment | TRUE | SummarizedExperiment | 
| duplicated | SingleCellExperiment | TRUE | SummarizedExperiment | 
| elementMetadata | SingleCellExperiment | TRUE | SummarizedExperiment | 
| elementMetadata<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| end | SingleCellExperiment | TRUE | SummarizedExperiment | 
| end<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| eval | SingleCellExperiment | TRUE | S4Vectors | 
| eval | SingleCellExperiment | TRUE | S4Vectors | 
| expand | SingleCellExperiment | TRUE | S4Vectors | 
| expand.grid | SingleCellExperiment | TRUE | S4Vectors | 
| extractROWS | SingleCellExperiment | TRUE | S4Vectors | 
| FactorToClass | SingleCellExperiment | TRUE | S4Vectors | 
| findOverlaps | SingleCellExperiment | TRUE | GenomicAlignments | 
| findOverlaps | SingleCellExperiment | TRUE | GenomicAlignments | 
| findOverlaps | SingleCellExperiment | TRUE | GenomicAlignments | 
| findOverlaps | SingleCellExperiment | TRUE | SummarizedExperiment | 
| findOverlaps | SingleCellExperiment | TRUE | GenomicAlignments | 
| findOverlaps | SingleCellExperiment | TRUE | GenomicAlignments | 
| findOverlaps | SingleCellExperiment | TRUE | GenomicAlignments | 
| findOverlaps | SingleCellExperiment | TRUE | IRanges | 
| flank | SingleCellExperiment | TRUE | SummarizedExperiment | 
| follow | SingleCellExperiment | TRUE | SummarizedExperiment | 
| follow | SingleCellExperiment | TRUE | SummarizedExperiment | 
| follow | SingleCellExperiment | TRUE | SummarizedExperiment | 
| granges | SingleCellExperiment | TRUE | SummarizedExperiment | 
| head | SingleCellExperiment | TRUE | utils | 
| horizontal_slot_names | SingleCellExperiment | TRUE | SummarizedExperiment | 
| int_colData | SingleCellExperiment | TRUE | SingleCellExperiment | 
| int_colData<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| int_elementMetadata | SingleCellExperiment | TRUE | SingleCellExperiment | 
| int_elementMetadata<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| int_metadata | SingleCellExperiment | TRUE | SingleCellExperiment | 
| int_metadata<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| intersect | SingleCellExperiment | TRUE | GenomicRanges | 
| intersect | SingleCellExperiment | TRUE | GenomicRanges | 
| intersect | SingleCellExperiment | TRUE | S4Vectors | 
| is.na | SingleCellExperiment | TRUE | S4Vectors | 
| is.unsorted | SingleCellExperiment | TRUE | SummarizedExperiment | 
| isDisjoint | SingleCellExperiment | TRUE | SummarizedExperiment | 
| length | SingleCellExperiment | TRUE | SummarizedExperiment | 
| lengths | SingleCellExperiment | TRUE | S4Vectors | 
| logcounts | SingleCellExperiment | TRUE | |
| logcounts<- | SingleCellExperiment | TRUE | |
| mainExpName | SingleCellExperiment | TRUE | SingleCellExperiment | 
| mainExpName<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| match | SingleCellExperiment | TRUE | S4Vectors | 
| match | SingleCellExperiment | TRUE | S4Vectors | 
| match | SingleCellExperiment | TRUE | Biostrings | 
| match | SingleCellExperiment | TRUE | Biostrings | 
| mcols | SingleCellExperiment | TRUE | SummarizedExperiment | 
| mcols<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| merge | SingleCellExperiment | TRUE | S4Vectors | 
| mergeROWS | SingleCellExperiment | TRUE | S4Vectors | 
| metadata | SingleCellExperiment | TRUE | S4Vectors | 
| metadata<- | SingleCellExperiment | TRUE | S4Vectors | 
| mstack | SingleCellExperiment | TRUE | S4Vectors | 
| names | SingleCellExperiment | TRUE | SummarizedExperiment | 
| names<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| narrow | SingleCellExperiment | TRUE | SummarizedExperiment | 
| ncol | SingleCellExperiment | TRUE | SummarizedExperiment | 
| nearest | SingleCellExperiment | TRUE | SummarizedExperiment | 
| nearest | SingleCellExperiment | TRUE | SummarizedExperiment | 
| nearest | SingleCellExperiment | TRUE | SummarizedExperiment | 
| normcounts | SingleCellExperiment | TRUE | |
| normcounts<- | SingleCellExperiment | TRUE | |
| nrow | SingleCellExperiment | TRUE | SummarizedExperiment | 
| objectVersion | SingleCellExperiment | TRUE | SingleCellExperiment | 
| order | SingleCellExperiment | TRUE | SummarizedExperiment | 
| overlapsAny | SingleCellExperiment | TRUE | IRanges | 
| overlapsAny | SingleCellExperiment | TRUE | IRanges | 
| overlapsAny | SingleCellExperiment | TRUE | IRanges | 
| parallel_slot_names | SingleCellExperiment | TRUE | SingleCellExperiment | 
| pcompare | SingleCellExperiment | TRUE | SummarizedExperiment | 
| pcompare | SingleCellExperiment | TRUE | SummarizedExperiment | 
| pcompare | SingleCellExperiment | TRUE | SummarizedExperiment | 
| pcompare | SingleCellExperiment | TRUE | Biostrings | 
| pcompare | SingleCellExperiment | TRUE | Biostrings | 
| precede | SingleCellExperiment | TRUE | SummarizedExperiment | 
| precede | SingleCellExperiment | TRUE | SummarizedExperiment | 
| precede | SingleCellExperiment | TRUE | SummarizedExperiment | 
| promoters | SingleCellExperiment | TRUE | SummarizedExperiment | 
| ranges | SingleCellExperiment | TRUE | SummarizedExperiment | 
| ranges<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| rank | SingleCellExperiment | TRUE | SummarizedExperiment | 
| rbind | SingleCellExperiment | TRUE | SingleCellExperiment | 
| realize | SingleCellExperiment | TRUE | SummarizedExperiment | 
| reducedDim | SingleCellExperiment | TRUE | SingleCellExperiment | 
| reducedDim | SingleCellExperiment | TRUE | SingleCellExperiment | 
| reducedDim | SingleCellExperiment | TRUE | SingleCellExperiment | 
| reducedDim<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| reducedDim<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| reducedDim<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| reducedDimNames | SingleCellExperiment | TRUE | SingleCellExperiment | 
| reducedDimNames<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| reducedDims | SingleCellExperiment | TRUE | SingleCellExperiment | 
| reducedDims<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| relist | SingleCellExperiment | TRUE | IRanges | 
| rename | SingleCellExperiment | TRUE | S4Vectors | 
| rep | SingleCellExperiment | TRUE | S4Vectors | 
| rep.int | SingleCellExperiment | TRUE | S4Vectors | 
| replaceROWS | SingleCellExperiment | TRUE | S4Vectors | 
| resize | SingleCellExperiment | TRUE | SummarizedExperiment | 
| restrict | SingleCellExperiment | TRUE | SummarizedExperiment | 
| rev | SingleCellExperiment | TRUE | S4Vectors | 
| rowData | SingleCellExperiment | TRUE | SingleCellExperiment | 
| rowData<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| ROWNAMES | SingleCellExperiment | TRUE | S4Vectors | 
| rownames | SingleCellExperiment | TRUE | SummarizedExperiment | 
| ROWNAMES<- | SingleCellExperiment | TRUE | S4Vectors | 
| rowPair | SingleCellExperiment | TRUE | SingleCellExperiment | 
| rowPair | SingleCellExperiment | TRUE | SingleCellExperiment | 
| rowPair | SingleCellExperiment | TRUE | SingleCellExperiment | 
| rowPair<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| rowPair<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| rowPair<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| rowPairNames | SingleCellExperiment | TRUE | SingleCellExperiment | 
| rowPairNames<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| rowPairs | SingleCellExperiment | TRUE | SingleCellExperiment | 
| rowPairs<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| rowRanges | SingleCellExperiment | TRUE | SummarizedExperiment | 
| rowRanges<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| rowRanges<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| rowRanges<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| rowSubset | SingleCellExperiment | TRUE | SingleCellExperiment | 
| rowSubset<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| saveRDS | SingleCellExperiment | TRUE | SummarizedExperiment | 
| selfmatch | SingleCellExperiment | TRUE | S4Vectors | 
| seqinfo | SingleCellExperiment | TRUE | SummarizedExperiment | 
| seqinfo<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| seqlevelsInUse | SingleCellExperiment | TRUE | GenomeInfoDb | 
| seqnames | SingleCellExperiment | TRUE | SummarizedExperiment | 
| setdiff | SingleCellExperiment | TRUE | GenomicRanges | 
| setdiff | SingleCellExperiment | TRUE | GenomicRanges | 
| setdiff | SingleCellExperiment | TRUE | S4Vectors | 
| setequal | SingleCellExperiment | TRUE | S4Vectors | 
| shift | SingleCellExperiment | TRUE | SummarizedExperiment | 
| shiftApply | SingleCellExperiment | TRUE | S4Vectors | 
| show | SingleCellExperiment | TRUE | SingleCellExperiment | 
| showAsCell | SingleCellExperiment | TRUE | SummarizedExperiment | 
| sizeFactors | SingleCellExperiment | TRUE | SingleCellExperiment | 
| sizeFactors<- | SingleCellExperiment | TRUE | SingleCellExperiment | 
| sort | SingleCellExperiment | TRUE | SummarizedExperiment | 
| split | SingleCellExperiment | TRUE | S4Vectors | 
| split | SingleCellExperiment | TRUE | S4Vectors | 
| split | SingleCellExperiment | TRUE | SummarizedExperiment | 
| split | SingleCellExperiment | TRUE | rtracklayer | 
| split | SingleCellExperiment | TRUE | S4Vectors | 
| split<- | SingleCellExperiment | TRUE | IRanges | 
| start | SingleCellExperiment | TRUE | SummarizedExperiment | 
| start<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| strand | SingleCellExperiment | TRUE | SummarizedExperiment | 
| strand<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| subset | SingleCellExperiment | TRUE | SummarizedExperiment | 
| subsetByOverlaps | SingleCellExperiment | TRUE | IRanges | 
| summary | SingleCellExperiment | TRUE | S4Vectors | 
| table | SingleCellExperiment | TRUE | S4Vectors | 
| tail | SingleCellExperiment | TRUE | utils | 
| tapply | SingleCellExperiment | TRUE | IRanges | 
| tapply | SingleCellExperiment | TRUE | IRanges | 
| tapply | SingleCellExperiment | TRUE | IRanges | 
| terminators | SingleCellExperiment | TRUE | SummarizedExperiment | 
| tpm | SingleCellExperiment | TRUE | |
| tpm<- | SingleCellExperiment | TRUE | |
| transform | SingleCellExperiment | TRUE | S4Vectors | 
| trim | SingleCellExperiment | TRUE | SummarizedExperiment | 
| union | SingleCellExperiment | TRUE | GenomicRanges | 
| union | SingleCellExperiment | TRUE | GenomicRanges | 
| union | SingleCellExperiment | TRUE | S4Vectors | 
| unique | SingleCellExperiment | TRUE | S4Vectors | 
| updateObject | SingleCellExperiment | TRUE | SingleCellExperiment | 
| values | SingleCellExperiment | TRUE | S4Vectors | 
| values<- | SingleCellExperiment | TRUE | S4Vectors | 
| vertical_slot_names | SingleCellExperiment | TRUE | SummarizedExperiment | 
| weights | SingleCellExperiment | TRUE | |
| weights<- | SingleCellExperiment | TRUE | |
| width | SingleCellExperiment | TRUE | SummarizedExperiment | 
| width<- | SingleCellExperiment | TRUE | SummarizedExperiment | 
| window | SingleCellExperiment | TRUE | S4Vectors | 
| window<- | SingleCellExperiment | TRUE | IRanges | 
| with | SingleCellExperiment | TRUE | S4Vectors | 
| xtabs | SingleCellExperiment | TRUE | S4Vectors | 
| xtfrm | SingleCellExperiment | TRUE | S4Vectors | 
Biostrings
- Biostringsis a package containing infrastructure for dealing with biological sequences, or strings.
- These are represented as so called XStringobjects (orXStringSetobjects if you want to include multiple sequences), whereXcan beDNA,RNA,AAorBand signifies the type of sequence(s) in the object.
library(Biostrings)- To create a DNAStringobject, we can use theDNAString()constructor function. To see how the function works, type?DNAString.
(dna <- DNAString(x = "AGGCATAGA"))9-letter DNAString object
seq: AGGCATAGA- The classes come with useful validity checks. For example, not all characters are allowed in a DNA string (but note that it can contain other bases than ‘A’, ‘C’, ‘G’ and ‘T’, following the IUPAC code of ambiguous bases).
alphabet(DNAString()) [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" "."IUPAC_CODE_MAP     A      C      G      T      M      R      W      S      Y      K      V 
   "A"    "C"    "G"    "T"   "AC"   "AG"   "AT"   "CG"   "CT"   "GT"  "ACG" 
     H      D      B      N 
 "ACT"  "AGT"  "CGT" "ACGT" ## Invalid DNA string
DNAString(x = "EQI")Error in .Call2("new_XString_from_CHARACTER", class(x0), string, start, : key 69 (char 'E') not in lookup tablealphabet(AAString()) [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y"
[20] "V" "U" "O" "B" "J" "Z" "X" "*" "-" "+" "."AAString(x = "EQI")3-letter AAString object
seq: EQIOnce we have defined the object, we can perform operations like:
- getting the reverse complement of the DNA sequence
reverseComplement(dna)9-letter DNAString object
seq: TCTATGCCTWe could not do this with a regular character string
reverseComplement("AGGCATAGA")Error: unable to find an inherited method for function 'reverseComplement' for signature 'x = "character"'- translating the DNA sequence to an amino acid sequence
translate(dna)3-letter AAString object
seq: RHRWe can not translate an amino acid sequence:
translate(AAString(x = "EQI"))Error: unable to find an inherited method for function 'translate' for signature 'x = "AAString"'- If we have more than one sequence, we can represent them in a DNAStringSetobject, which is a collection ofDNAStringobjects
(dna_multiple <- DNAStringSet(c(x = "AGGCATAGA", y = "GGTATGAG")))DNAStringSet object of length 2:
    width seq                                               names               
[1]     9 AGGCATAGA                                         x
[2]     8 GGTATGAG                                          y## Subsetting with [] gives back a DNAStringSet
dna_multiple[1]DNAStringSet object of length 1:
    width seq                                               names               
[1]     9 AGGCATAGA                                         x## Subsetting with [[]] extracts the individual objects
dna_multiple[[1]]9-letter DNAString object
seq: AGGCATAGA- It is also easy to extract subsequences of either an XStringor anXStringSetobject
## Get the length of each string
width(dna_multiple)[1] 9 8## Extract the first three positions in each string
Biostrings::subseq(dna_multiple, start = 1, end = 3)DNAStringSet object of length 2:
    width seq                                               names               
[1]     3 AGG                                               x
[2]     3 GGT                                               y## Extract subsequences of different length
Biostrings::subseq(dna_multiple, start = 1, end = c(2, 5))DNAStringSet object of length 2:
    width seq                                               names               
[1]     2 AG                                                x
[2]     5 GGTAT                                             yBiostrings::subseq(dna_multiple, start = 1, 
                   end = width(dna_multiple) - 4)DNAStringSet object of length 2:
    width seq                                               names               
[1]     5 AGGCA                                             x
[2]     4 GGTA                                              yExercises
- Can you tell whether AATTGGCCRGGCCAATTis a valid DNA sequence?
Biostrings::DNAString("AATTGGCCRGGCCAATT")17-letter DNAString object
seq: AATTGGCCRGGCCAATTGenomic sequence information is often stored in fasta files. DNAStringSet objects can be created from such files by the readDNAStringSet() function from the Biostrings package. Download the gencode.v28.transcripts.1.1.10M.fa file from the course web page and read it into a DNAStringSet object using this function. What do you think these sequences represent?
txs <- Biostrings::readDNAStringSet("data/02/gencode.v28.transcripts.1.1.10M.fa")- How many sequences are there in this file?
length(txs)[1] 1373- What are the lengths of the shortest and longest sequence?
range(width(txs))[1]    59 11666- Extract the first 10 bases of each sequence
Biostrings::subseq(txs, start = 1, end = 10)DNAStringSet object of length 1373:
       width seq                                            names               
   [1]    10 GTTAACTTGC                                     ENST00000456328.2...
   [2]    10 GTGTCTGACT                                     ENST00000450305.2...
   [3]    10 ATGGGAGCCG                                     ENST00000488147.1...
   [4]    10 TGTGGGAGAG                                     ENST00000619216.1...
   [5]    10 GTGCACACGG                                     ENST00000473358.1...
   ...   ... ...
[1369]    10 ATGATTCACT                                     ENST00000636827.2...
[1370]    10 AGCATATTCT                                     ENST00000578045.1...
[1371]    10 GCTAAAATAA                                     ENST00000413148.1...
[1372]    10 GTCCCGGCCC                                     ENST00000315901.4...
[1373]    10 GAGCCTCCGG                                     ENST00000294435.7...- Get the fraction of A, C, G and T bases in each sequence (hint: there is a function named alphabetFrequency). Plot a histogram of the fraction of Gs across all sequences
freqs <- alphabetFrequency(txs)
fracs <- sweep(freqs, MARGIN = 1, STATS = rowSums(freqs), FUN = "/")
hist(fracs[, "G"])Annotation resources
Bioconductor provides a range of annotation resources, among them are:
- Gene-based information for a given organism (OrgDb)
- Transcript ranges for a given genome build (TxDb and EnsDb)
- Genome sequence for a given genome build (BSgenome)
These annotation resources can be accessed in different ways:
- Via annotation packages (updated every 6 months with the Bioconductor releases)
- By pulling down information from online resources such as the Bioconductor AnnotationHub
As an example lets have a look at a human transcript annotation package TxDb.Hsapiens.UCSC.hg38.knownGene. As the name suggest, the annotation is derived from UCSC “knownGene” track for the hg38 genome build.
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
?TxDb.Hsapiens.UCSC.hg38.knownGeneTwo accessor functions are transcripts() and transcriptsBy(). Lets compare their output:
gr <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
grGRanges object with 278220 ranges and 2 metadata columns:
                      seqnames        ranges strand |     tx_id
                         <Rle>     <IRanges>  <Rle> | <integer>
       [1]                chr1   11869-14409      + |         1
       [2]                chr1   12010-13670      + |         2
       [3]                chr1   29554-31097      + |         3
       [4]                chr1   30267-31109      + |         4
       [5]                chr1   30366-30503      + |         5
       ...                 ...           ...    ... .       ...
  [278216] chrX_MU273397v1_alt 239036-260095      - |    278216
  [278217] chrX_MU273397v1_alt 272358-282686      - |    278217
  [278218] chrX_MU273397v1_alt 314193-316302      - |    278218
  [278219] chrX_MU273397v1_alt 314813-315236      - |    278219
  [278220] chrX_MU273397v1_alt 324527-324923      - |    278220
                     tx_name
                 <character>
       [1] ENST00000456328.2
       [2] ENST00000450305.2
       [3] ENST00000473358.1
       [4] ENST00000469289.1
       [5] ENST00000607096.1
       ...               ...
  [278216] ENST00000710260.1
  [278217] ENST00000710028.1
  [278218] ENST00000710030.1
  [278219] ENST00000710216.1
  [278220] ENST00000710031.1
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genomegrl <- transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene, by = "gene")
grlGRangesList object of length 33131:
$`1`
GRanges object with 8 ranges and 2 metadata columns:
      seqnames            ranges strand |     tx_id           tx_name
         <Rle>         <IRanges>  <Rle> | <integer>       <character>
  [1]    chr19 58345178-58347634      - |    230133 ENST00000596924.1
  [2]    chr19 58345183-58353492      - |    230134 ENST00000263100.8
  [3]    chr19 58346854-58356225      - |    230135 ENST00000600123.5
  [4]    chr19 58346858-58353491      - |    230136 ENST00000595014.1
  [5]    chr19 58346860-58347657      - |    230137 ENST00000598345.1
  [6]    chr19 58348466-58362751      - |    230138 ENST00000599109.5
  [7]    chr19 58350594-58353129      - |    230139 ENST00000600966.1
  [8]    chr19 58353021-58356083      - |    230140 ENST00000596636.1
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome
$`10`
GRanges object with 3 ranges and 2 metadata columns:
      seqnames            ranges strand |     tx_id           tx_name
         <Rle>         <IRanges>  <Rle> | <integer>       <character>
  [1]     chr8 18386311-18388323      + |    104998 ENST00000660285.1
  [2]     chr8 18391282-18401218      + |    105000 ENST00000286479.4
  [3]     chr8 18391287-18400993      + |    105001 ENST00000520116.1
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome
...
<33129 more elements>- Functions like - exons(),- transcripts()or- genes()return- GRangesobjects with a set of ranges representing the level of detail of the query.
- The “xxxBy” functions, e.g. - transcriptsBy()or- exonsBy(), return- GRangesListobjects. They enable grouping of the output in terms of another genomic feature, in our case genes.
Transcript annotations and their functionality play an important role when analyzing gene expression data. The coming lectures on RNA-Seq will introduce more functionality in the context of examples.
Exercises
The OrgDb package org.Hs.eg.db provides human gene annotation. It offers a simple interface to an underlying SQLite database. The main accessor functions for this (and other SQLite) annotation objects are select(), columns() and keys().
- Install the org.Hs.eg.dbpackage, load it and check its help page. Detailed examples on how to work with these annotation objects can be found on the linkedAnnotationDbclass help page.
library(org.Hs.eg.db)?org.Hs.eg.db- What kind of data can be retrieved from the org.Hs.eg.dbobject? Check thecolumns()function!
columns(org.Hs.eg.db) [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
[21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[26] "UNIPROT"     - Use the select()function to retrieve the gene names as well as ENTREZ and ENSEMBL gene IDs for the three genes: RAF1, FLT3 and MAPK1.
select(org.Hs.eg.db, keys = c("RAF1", "FLT3", "MAPK1"), 
       keytype = "SYMBOL", columns = c("ENTREZID", "ENSEMBL", "GENENAME"))'select()' returned 1:1 mapping between keys and columns  SYMBOL ENTREZID         ENSEMBL                                      GENENAME
1   RAF1     5894 ENSG00000132155 Raf-1 proto-oncogene, serine/threonine kinase
2   FLT3     2322 ENSG00000122025        fms related receptor tyrosine kinase 3
3  MAPK1     5594 ENSG00000100030            mitogen-activated protein kinase 1Further reading
- An overview of the Bioconductor project and its capabilities is given in Huber et al.: Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods 12:115–121 (2015).
- An introduction to the analysis of ranges in Bioconductor (particularly useful if you have experience with bedtools) is provided in the HelloRanges package
- Bioconductor provides a long list of material from previous courses, some videos and links to upcoming events
- A Bioconductor cheat sheet from Mike Love
- A lesson on the Bioconductor project (under development)