Introduction to Bioconductor and standard classes
What is Bioconductor?
- The project was started in 2001 (back then, mostly focused on microarray data)
- Currently (release 3.18) hosts:
- 2266 software packages
- 920 annotation packages
- 429 experiment data packages
- 30 workflows
- Covers a wide range of domains:
- sequencing (RNA-seq, ChIP-seq, single-cell analysis, variant calling, …)
- microarrays (methylation, gene expression, copy number, …)
- flow and mass cytometry
- proteomics
- imaging
- …
- There is a core team which, among other things, oversees the project, coordinates review of new package submissions, develops and maintains core infrastructure and maintains daily building and checking of all packages. Most packages, however, are contributed by the community.
- There is an active Bioconductor community, where you can find help, support and people with shared interests. For example, check the website, join the slack workspace, or attend one of the yearly conferences (BioC, EuroBioC, Bioc Asia).
How to install Bioconductor packages
- Detailed instructions are available online.
- There is a new Bioconductor release twice per year (typically April and October).
- Between those times, development is going on in the parallel devel branch.
- Do not mix packages from different releases!
- The available Bioconductor releases also depend on your version of R - if you want the latest Bioconductor packages, you need the latest version of R.
- Bioconductor packages are installed using the
BiocManager
CRAN package (note that this works also for packages hosted on CRAN or GitHub):
install.packages("BiocManager")
::install("limma") BiocManager
- We can check the validity of our Bioconductor installation with
BiocManager
(e.g., to see if there are outdated packages):
::valid() BiocManager
- We can also see which version is currently installed:
::version() BiocManager
[1] '3.18'
BiocManager
can be used to search for available packages. Here, we look for package names containing the word “Mmusculus” (assuming that we are interested in looking for packages containing annotation information for mouse):
::available(pattern = "Mmusculus") BiocManager
[1] "BSgenome.Mmusculus.UCSC.mm10" "BSgenome.Mmusculus.UCSC.mm10.masked"
[3] "BSgenome.Mmusculus.UCSC.mm39" "BSgenome.Mmusculus.UCSC.mm8"
[5] "BSgenome.Mmusculus.UCSC.mm8.masked" "BSgenome.Mmusculus.UCSC.mm9"
[7] "BSgenome.Mmusculus.UCSC.mm9.masked" "EnsDb.Mmusculus.v75"
[9] "EnsDb.Mmusculus.v79" "PWMEnrich.Mmusculus.background"
[11] "TxDb.Mmusculus.UCSC.mm10.ensGene" "TxDb.Mmusculus.UCSC.mm10.knownGene"
[13] "TxDb.Mmusculus.UCSC.mm39.knownGene" "TxDb.Mmusculus.UCSC.mm39.refGene"
[15] "TxDb.Mmusculus.UCSC.mm9.knownGene"
- Each Bioconductor package has a landing page (e.g., https://bioconductor.org/packages/limma/) that contains relevant information, including installation instructions, vignettes and the reference manual:
How to get help with Bioconductor packages
- Each function has a help page, detailing all arguments and outputs, and containing runnable examples.
library(limma)
?lmFit
- Each package comes with at least one vignette, outlining the intended use of the functions in the package to analyze data. The vignettes can be browsed from the landing page, or directly in your R session, e.g. with
browseVignettes(package = "Biostrings")
- If you need help with a Bioconductor package:
- Check the function help pages and the vignette(s) to see whether the answer is there (some packages, like DESeq2, have FAQ sections in their vignette)
- Go to https://support.bioconductor.org/
- Search to see whether your question has already been asked (and answered)
- If not, look through the Posting guide and post your question following these guidelines
- If you follow these steps, you will often get a rapid and informative answer
- The slack workspace has channels covering many topics related to Bioconductor
- The Workflows are complete walkthroughs of specific types of analyses, typically making use of multiple (Bioconductor) packages
Standard Bioconductor data structures
- Every object in R is of a particular data type.
- You have already seen some data types, e.g. atomic vectors (
numeric
,character
etc),list
,matrix
,data.frame
,factor
. - (Almost) all objects in R have a class. A class is a formal description of the data type, its properties, functionality and relation to other types .
- The
class
function tells you which class(es) an object is of:
class(c("one", "two", "three"))
[1] "character"
class(matrix(1:6, 2, 3))
[1] "matrix" "array"
- The class of an object defines the set of allowed operations.
relevel(c("one", "two", "three"), ref = "one")
Error in relevel.default(c("one", "two", "three"), ref = "one"): 'relevel' only for (unordered) factors
relevel(factor(c("one", "two", "three")), ref = "two")
[1] one two three
Levels: two one three
c("1", "2", "3") + 1
Error in c("1", "2", "3") + 1: non-numeric argument to binary operator
as.numeric(c("1", "2", "3")) + 1
[1] 2 3 4
- When working with (biological) data, it often makes sense to define more complex data structures, with several building blocks, and to define appropriate methods for each such class.
What types of data do we have to deal with?
- Sequences (strings)
- Alignments
- ‘Rectangular’ data (matrices of numeric values for a set of features and samples)
- Genomic ranges (intervals)
- Genomic variants
- Annotation databases
- Images
- …
For each of these data types, Bioconductor provides standard classes.
There are often many ways of creating an object of a given class, including:
- reading data from a file (e.g. the output from an instrument) using specialized import functions from Bioconductor or other sources. This is perhaps the most common way of creating objects, and we will see many examples during the course. For example, think of how you can generate a
data.frame
by reading data from a text file withread.delim()
. - loading processed data or annotations provided via (Bioconductor) packages.
- directly creating the object using a constructor function (which often has the same name as the object class). In this case, all the information that is needed for a particular type of object must be provided manually. This is the easiest way to create small example objects, and we will see examples of this below. This is also how the small objects in the previous section were created.
Why use standardized classes?
- Interoperability - having standard ways of storing and moving data between packages means we don’t have to implement the same thing in multiple places. Moreover, it allows us as developers to write functions that are applicable to a very broad set of data objects, since we can make assumptions about the content and structure of these objects (see for example the iSEE package).
- Recognition - you don’t have to learn a new representation for each package.
- Robustness - standard classes are efficiently implemented and well tested, and come with additional functions to check the validity of provided data.
As an example, all these Bioconductor packages make use of the SummarizedExperiment
class:
In the following we will examine a few standard Bioconductor classes in more detail. Of course, there are many more, and some of these will be covered in later lectures.
GenomicRanges
- The
GenomicRanges
package holds infrastructure for working with (genomic) ranges. - These can represent, for example, gene coordinates, ChIP-seq peaks, promoters, SNPs, CpG islands.
- A
GRanges
(genomic ranges) object contains, for each range, information about- the chromosome (called sequence)
- the interval (start and end)
- the strand (either
+
,-
or*
)
- The interval is, in turn, represented by a so called
IRanges
(interval ranges) object. - Many of the operations shown below apply equally well to
IRanges
objects.
library(GenomicRanges)
- Let’s create a small
IRanges
object. We can specify any two of the start, end and width of the ranges, and the third one will be inferred.
<- IRanges(start = c(3, 10, 50),
(ir end = c(17, 28, 104)))
IRanges object with 3 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 3 17 15
[2] 10 28 19
[3] 50 104 55
## Equivalent
<- IRanges(start = c(3, 10, 50),
(ir width = c(15, 19, 55)))
IRanges object with 3 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 3 17 15
[2] 10 28 19
[3] 50 104 55
- The
IRanges
object just represents ranges on the number line. There is no specifically genomic information in there - that’s added when we construct aGRanges
object
<- GRanges(seqnames = c("chr1", "chr1", "chr2"),
(gr ranges = ir,
strand = c("+", "+", "-")))
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 3-17 +
[2] chr1 10-28 +
[3] chr2 50-104 -
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
- Many operations are defined on the ranges objects. As a first example, it is easy to extract the individual components of the object using accessor functions
start(gr)
[1] 3 10 50
end(gr)
[1] 17 28 104
width(gr)
[1] 15 19 55
- We can also get and set information about the reference sequences - there are many different conventions for naming chromosomes!
seqlevelsStyle(gr)
[1] "UCSC"
seqlevelsStyle(gr) <- "Ensembl"
gr
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 3-17 +
[2] 1 10-28 +
[3] 2 50-104 -
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
seqlevelsStyle(gr) <- "UCSC"
- There are many ways of subsetting ranges objects (check out
methods(class = "GRanges")
). For example, we can use the regular[
function (the authors of theGRanges
class has defined what this function means for such an object), or we can keep only records on specific chromosomes.
1:2] gr[
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 3-17 +
[2] chr1 10-28 +
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
dropSeqlevels(gr, "chr1", pruning.mode = "coarse")
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr2 50-104 -
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
keepSeqlevels(gr, "chr1", pruning.mode = "coarse")
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 3-17 +
[2] chr1 10-28 +
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
seqnames(gr) == "chr1"] gr[
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 3-17 +
[2] chr1 10-28 +
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
keepStandardChromosomes(gr, pruning.mode = "coarse")
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 3-17 +
[2] chr1 10-28 +
[3] chr2 50-104 -
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
- Many other operations can be applied to each of the ranges in a
GRanges
orIRanges
object, such as extending all ranges with a certain number of bases in each end
+ 5 gr
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 -2-22 +
[2] chr1 5-33 +
[3] chr2 45-109 -
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
Note how this led to a negative start coordinate for the first range. This is not considered a violation since there is no information in the GRanges
object which will tell R where the chromosomes (sequences) start and end. To accommodate this, let’s set the sequence lengths to 100 and 200 nt, respectively.
seqlengths(gr)
chr1 chr2
NA NA
seqlengths(gr) <- c(chr1 = 100, chr2 = 200)
seqlengths(gr)
chr1 chr2
100 200
+ 5 gr
Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence chr1.
Note that ranges located on a sequence whose length is unknown (NA) or
on a circular sequence are not considered out-of-bound (use
seqlengths() and isCircular() to get the lengths and circularity flags
of the underlying sequences). You can use trim() to trim these ranges.
See ?`trim,GenomicRanges-method` for more information.
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 -2-22 +
[2] chr1 5-33 +
[3] chr2 45-109 -
-------
seqinfo: 2 sequences from an unspecified genome
Note how we now get a warning that one of the ranges is extending outside of the reference sequences. We can further use the trim()
function to remove all parts of ranges that fall outside the boundaries of the reference sequences (this would not do anything if we hadn’t set the sequence lengths).
trim(gr + 5)
Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence chr1.
Note that ranges located on a sequence whose length is unknown (NA) or
on a circular sequence are not considered out-of-bound (use
seqlengths() and isCircular() to get the lengths and circularity flags
of the underlying sequences). You can use trim() to trim these ranges.
See ?`trim,GenomicRanges-method` for more information.
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 1-22 +
[2] chr1 5-33 +
[3] chr2 45-109 -
-------
seqinfo: 2 sequences from an unspecified genome
We can also set a name for the genome that the ranges are supposed to come from. This can be useful e.g. when comparing different sets of ranges, to make sure that they all correspond to the same underlying genome.
genome(gr) <- "hg19"
gr
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 3-17 +
[2] chr1 10-28 +
[3] chr2 50-104 -
-------
seqinfo: 2 sequences from hg19 genome
Other useful operations on single ranges include:
- shifting
shift(gr, 5)
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 8-22 +
[2] chr1 15-33 +
[3] chr2 55-109 -
-------
seqinfo: 2 sequences from hg19 genome
- obtaining flanking sequence. Note how the strand information is taken into account.
flank(gr, 2, start = TRUE, both = FALSE)
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 1-2 +
[2] chr1 8-9 +
[3] chr2 105-106 -
-------
seqinfo: 2 sequences from hg19 genome
- finding the midpoint of each range
resize(gr, width = 1, fix = "center")
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 10 +
[2] chr1 19 +
[3] chr2 77 -
-------
seqinfo: 2 sequences from hg19 genome
In addition to the intra-range operations, there are also many useful operations that involve multiple ranges. For example, we can:
- find overlaps between ranges
<- GRanges(seqnames = c("chr2", "chr1"),
gr2 ranges = IRanges(start = c(90, 5),
end = c(100, 15)),
strand = c("-", "+"))
findOverlaps(query = gr, subject = gr2)
Hits object with 3 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 2
[2] 2 2
[3] 3 1
-------
queryLength: 3 / subjectLength: 2
GRanges
objects can have associated metadata. This is provided in themcols()
slot of the object.
mcols(gr)$score <- 1:3
gr
GRanges object with 3 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <integer>
[1] chr1 3-17 + | 1
[2] chr1 10-28 + | 2
[3] chr2 50-104 - | 3
-------
seqinfo: 2 sequences from hg19 genome
## Extract only the metadata
mcols(gr)
DataFrame with 3 rows and 1 column
score
<integer>
1 1
2 2
3 3
## We can also supply the metadata when creating the object
GRanges(seqnames = c("chr2", "chr1"),
ranges = IRanges(start = c(90, 5),
end = c(100, 15)),
strand = c("-", "+"),
extra_column = c("A", "B"))
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | extra_column
<Rle> <IRanges> <Rle> | <character>
[1] chr2 90-100 - | A
[2] chr1 5-15 + | B
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
Exercises
In practice, the locations of known genomic features (exons, genes, transcripts, UTRs, …) along the genome are often stored in so called gtf
(gene transfer format) files. These files can be imported into a GRanges
object using the import()
function from the rtracklayer
Bioconductor package.
- Download the
gencode.v28.annotation.1.1.10M.gtf
file from the course web page, and read it into R in this way.
library(rtracklayer)
Attaching package: 'rtracklayer'
The following object is masked from 'package:igraph':
blocks
<- rtracklayer::import("data/02/gencode.v28.annotation.1.1.10M.gtf") gtf
- Confirm that the
gtf
object is indeed aGRanges
object
class(gtf)
[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"
- How many records are there in the object?
length(gtf)
[1] 17955
- From which chromosomes do the records come?
seqlevels(gtf)
[1] "chr1"
- Use the
subset
function to extract only the records for theCHD5
gene. How many annotated transcripts does this gene have?
<- subset(gtf, gene_name == "CHD5")
gtfsub length(subset(gtfsub, type == "transcript"))
[1] 6
Lists of genomic ranges
- In some cases, ranges come in groups - for example, exons within a transcript.
- In these situations, it often makes sense to represent each group of ranges as a
GRanges
object, and the full set of groups as a list of such ranges. - The
GRangesList
class covers these situations. - All ranges objects share the same genome information
<- GRangesList(
grl tx1.1 = GRanges(seqnames = "chr1",
ranges = IRanges(start = c(50, 125, 188),
end = c(75, 130, 257)),
strand = "+",
symbol = "tx1.1", transcript = "tx1.1",
gene = "gene1", exon = paste0("ex1.", 1:3)),
tx1.2 = GRanges(seqnames = "chr1",
ranges = IRanges(start = c(50, 125, 200),
end = c(75, 130, 227)),
strand = "+",
symbol = "tx1.2", transcript = "tx1.2",
gene = "gene1", exon = paste0("ex1.", c(1, 2, 4))),
tx2.1 = GRanges(seqnames = "chr2",
ranges = IRanges(start = c(289, 318),
end = c(300, 350)),
strand = "-",
symbol = "tx2.1", transcript = "tx2.1",
gene = "gene2", exon = paste0("ex2.", 1:2)))
seqlengths(grl) <- c(chr1 = 1000, chr2 = 500)
grl
GRangesList object of length 3:
$tx1.1
GRanges object with 3 ranges and 4 metadata columns:
seqnames ranges strand | symbol transcript gene
<Rle> <IRanges> <Rle> | <character> <character> <character>
[1] chr1 50-75 + | tx1.1 tx1.1 gene1
[2] chr1 125-130 + | tx1.1 tx1.1 gene1
[3] chr1 188-257 + | tx1.1 tx1.1 gene1
exon
<character>
[1] ex1.1
[2] ex1.2
[3] ex1.3
-------
seqinfo: 2 sequences from an unspecified genome
$tx1.2
GRanges object with 3 ranges and 4 metadata columns:
seqnames ranges strand | symbol transcript gene
<Rle> <IRanges> <Rle> | <character> <character> <character>
[1] chr1 50-75 + | tx1.2 tx1.2 gene1
[2] chr1 125-130 + | tx1.2 tx1.2 gene1
[3] chr1 200-227 + | tx1.2 tx1.2 gene1
exon
<character>
[1] ex1.1
[2] ex1.2
[3] ex1.4
-------
seqinfo: 2 sequences from an unspecified genome
$tx2.1
GRanges object with 2 ranges and 4 metadata columns:
seqnames ranges strand | symbol transcript gene
<Rle> <IRanges> <Rle> | <character> <character> <character>
[1] chr2 289-300 - | tx2.1 tx2.1 gene2
[2] chr2 318-350 - | tx2.1 tx2.1 gene2
exon
<character>
[1] ex2.1
[2] ex2.2
-------
seqinfo: 2 sequences from an unspecified genome
The Gviz
package can be used to generate a visualization of the first two transcripts (which belong to the same gene):
library(Gviz)
<- GeneRegionTrack(grl$tx1.1, showId = TRUE, col = NULL,
g1 fill = "#377EB8", name = "",
background.title = "transparent",
col.title = "black")
<- GeneRegionTrack(grl$tx1.2, showId = TRUE, col = NULL,
g2 fill = "#E41A1C", name = "",
background.title = "transparent",
col.title = "black")
<- GenomeAxisTrack()
gtr <- c(gtr, g1, g2)
tracks plotTracks(tracks)
SummarizedExperiment
- The purpose of the
SummarizedExperiment
class (provided in the package with the same name) is to hold ‘rectangular’ data, together with annotations and metadata for the rows and columns. In other words, all the data associated with an experiment! - As we already saw above, this class (or extensions like
SingleCellExperiment
orDESeqDataSet
) is widely used throughout the Bioconductor ecosystem, e.g., for representing RNA-seq, ChIP-seq, methylation and mass cytometry data.
- The
SummarizedExperiment
object consists of three types of ‘tables’, denoted- assays (actually a list of matrices of the same size, containing the observed data, e.g., gene expression values)
- rowData (annotations for the rows, i.e., the features). This can be replaced by rowRanges (a
GRanges
object) that can also hold genomic range information, e.g., the locations of genes - colData (annotations for the columns, i.e., the samples)
library(SummarizedExperiment)
<- SummarizedExperiment(
se assays = list(exprs = matrix(10 * runif(15), 5, 3)),
colData = DataFrame(sample = paste0("Sample", 1:3),
condition = c("A", "A", "B")),
rowData = DataFrame(gene = paste0("G", 1:5)))
rownames(se) <- rowData(se)$gene
colnames(se) <- colData(se)$sample
se
class: SummarizedExperiment
dim: 5 3
metadata(0):
assays(1): exprs
rownames(5): G1 G2 G3 G4 G5
rowData names(1): gene
colnames(3): Sample1 Sample2 Sample3
colData names(2): sample condition
- The
SummarizedExperiment
class has accessor functions to extract the individual components.
assayNames(se)
[1] "exprs"
assay(se, "exprs")
Sample1 Sample2 Sample3
G1 7.7808157 8.899195 5.204971
G2 0.2349146 3.069105 5.425215
G3 0.7296682 8.401116 2.993127
G4 3.3820006 3.421748 8.265988
G5 1.4642998 8.343076 2.130548
colData(se)
DataFrame with 3 rows and 2 columns
sample condition
<character> <character>
Sample1 Sample1 A
Sample2 Sample2 A
Sample3 Sample3 B
rowData(se)
DataFrame with 5 rows and 1 column
gene
<character>
G1 G1
G2 G2
G3 G3
G4 G4
G5 G5
- There are also convenient subsetting functions. Note how all the relevant parts of the objects are subset!
<- se[1:2, ]
sesub assay(sesub, "exprs")
Sample1 Sample2 Sample3
G1 7.7808157 8.899195 5.204971
G2 0.2349146 3.069105 5.425215
colData(sesub)
DataFrame with 3 rows and 2 columns
sample condition
<character> <character>
Sample1 Sample1 A
Sample2 Sample2 A
Sample3 Sample3 B
rowData(sesub)
DataFrame with 2 rows and 1 column
gene
<character>
G1 G1
G2 G2
- If a
GRanges
object is used to represent the row annotations, themcols()
slot of theGRanges
object is returned byrowData(se)
.
<- SummarizedExperiment(
se assays = list(exprs = matrix(10 * runif(15), 5, 3)),
colData = DataFrame(sample = paste0("Sample", 1:3),
condition = c("A", "A", "B")),
rowRanges = GRanges(seqnames = "chr1",
ranges = IRanges(start = 100 * runif(5),
width = 50),
strand = "+",
gene = paste0("G", 1:5)))
class(se)
[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"
rowData(se)
DataFrame with 5 rows and 1 column
gene
<character>
1 G1
2 G2
3 G3
4 G4
5 G5
rowRanges(se)
GRanges object with 5 ranges and 1 metadata column:
seqnames ranges strand | gene
<Rle> <IRanges> <Rle> | <character>
[1] chr1 55-104 + | G1
[2] chr1 87-136 + | G2
[3] chr1 89-138 + | G3
[4] chr1 67-116 + | G4
[5] chr1 25-74 + | G5
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Exercises
Here we will explore a real RNA-seq data set, which is provided in the airway
Bioconductor package and stored in a RangedSummarizedExperiment
object.
- Load the package and attach the data set.
library(airway)
data(airway)
airway
class: RangedSummarizedExperiment
dim: 63677 8
metadata(1): ''
assays(1): counts
rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
ENSG00000273493
rowData names(10): gene_id gene_name ... seq_coord_system symbol
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample
- What type of data value are available in this object? How many samples are there? How many features?
assayNames(airway)
[1] "counts"
dim(airway)
[1] 63677 8
- What is the total number of sequencing reads assigned to genes for each sample?
colSums(assay(airway, "counts"))
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
20637971 18809481 25348649 15163415 24448408 30818215 19126151
SRR1039521
21164133
Class inheritance
Classes are usually organized into a hierarchy by class inheritance.
Basically, it involves taking a ‘standard’ base class that has some of the properties you need, and extending or building upon it to create a new class that fits your purpose.
The new class will inherit the attributes and methods from the original class, and thus you don’t need to redefine them for your new class.
One example from Bioconductor is the
SingleCellExperiment
class, which extends theSummarizedExperiment
class by, among a few other things, allowing the inclusion of dimension reduction results (e.g. PCA, tSNE, or UMAP).
library(SingleCellExperiment)
getClass("SingleCellExperiment")
Class "SingleCellExperiment" [package "SingleCellExperiment"]
Slots:
Name: int_elementMetadata int_colData
Class: DataFrame DataFrame
Name: int_metadata rowRanges
Class: list GenomicRanges_OR_GRangesList
Name: colData assays
Class: DataFrame Assays_OR_NULL
Name: NAMES elementMetadata
Class: character_OR_NULL DataFrame
Name: metadata
Class: list
Extends:
Class "RangedSummarizedExperiment", directly
Class "SummarizedExperiment", by class "RangedSummarizedExperiment", distance 2
Class "RectangularData", by class "RangedSummarizedExperiment", distance 3
Class "Vector", by class "RangedSummarizedExperiment", distance 3
Class "Annotated", by class "RangedSummarizedExperiment", distance 4
Class "vector_OR_Vector", by class "RangedSummarizedExperiment", distance 4
getClass("RangedSummarizedExperiment")
Class "RangedSummarizedExperiment" [package "SummarizedExperiment"]
Slots:
Name: rowRanges colData
Class: GenomicRanges_OR_GRangesList DataFrame
Name: assays NAMES
Class: Assays_OR_NULL character_OR_NULL
Name: elementMetadata metadata
Class: DataFrame list
Extends:
Class "SummarizedExperiment", directly
Class "RectangularData", by class "SummarizedExperiment", distance 2
Class "Vector", by class "SummarizedExperiment", distance 2
Class "Annotated", by class "SummarizedExperiment", distance 3
Class "vector_OR_Vector", by class "SummarizedExperiment", distance 3
- Due to inheritance it might not be immediately clear what methods are defined for a given class. For most Bioconductor classes (all so called S4 classes) one way to find out is via the
showMethods()
function.
showMethods(class = "SingleCellExperiment")
- A more comprehensive output however is provided by the
sloop
package. Itss4_methods_class()
function also lists the source of inherited methods (make sure that theSingleCellExperiment
package has been loaded into your R session):
library(sloop)
s4_methods_class("SingleCellExperiment")
generic | class | visible | source |
---|---|---|---|
!= | SingleCellExperiment | TRUE | S4Vectors |
!= | SingleCellExperiment | TRUE | S4Vectors |
!= | SingleCellExperiment | TRUE | S4Vectors |
[ | SingleCellExperiment | TRUE | SingleCellExperiment |
[[ | SingleCellExperiment | TRUE | SummarizedExperiment |
[[<- | SingleCellExperiment | TRUE | SummarizedExperiment |
[<- | SingleCellExperiment | TRUE | SingleCellExperiment |
[<- | SingleCellExperiment | TRUE | S4Vectors |
%in% | SingleCellExperiment | TRUE | S4Vectors |
%in% | SingleCellExperiment | TRUE | S4Vectors |
%in% | SingleCellExperiment | TRUE | S4Vectors |
< | SingleCellExperiment | TRUE | S4Vectors |
< | SingleCellExperiment | TRUE | S4Vectors |
< | SingleCellExperiment | TRUE | S4Vectors |
<= | SingleCellExperiment | TRUE | S4Vectors |
<= | SingleCellExperiment | TRUE | S4Vectors |
<= | SingleCellExperiment | TRUE | S4Vectors |
== | SingleCellExperiment | TRUE | S4Vectors |
== | SingleCellExperiment | TRUE | S4Vectors |
== | SingleCellExperiment | TRUE | S4Vectors |
> | SingleCellExperiment | TRUE | S4Vectors |
> | SingleCellExperiment | TRUE | S4Vectors |
> | SingleCellExperiment | TRUE | S4Vectors |
>= | SingleCellExperiment | TRUE | S4Vectors |
>= | SingleCellExperiment | TRUE | S4Vectors |
>= | SingleCellExperiment | TRUE | S4Vectors |
$ | SingleCellExperiment | TRUE | SummarizedExperiment |
$<- | SingleCellExperiment | TRUE | SummarizedExperiment |
aggregate | SingleCellExperiment | TRUE | S4Vectors |
altExp | SingleCellExperiment | TRUE | SingleCellExperiment |
altExp | SingleCellExperiment | TRUE | SingleCellExperiment |
altExp | SingleCellExperiment | TRUE | SingleCellExperiment |
altExp<- | SingleCellExperiment | TRUE | SingleCellExperiment |
altExp<- | SingleCellExperiment | TRUE | SingleCellExperiment |
altExp<- | SingleCellExperiment | TRUE | SingleCellExperiment |
altExpNames | SingleCellExperiment | TRUE | SingleCellExperiment |
altExpNames<- | SingleCellExperiment | TRUE | SingleCellExperiment |
altExps | SingleCellExperiment | TRUE | SingleCellExperiment |
altExps<- | SingleCellExperiment | TRUE | SingleCellExperiment |
anyDuplicated | SingleCellExperiment | TRUE | S4Vectors |
anyNA | SingleCellExperiment | TRUE | S4Vectors |
append | SingleCellExperiment | TRUE | S4Vectors |
as.character | SingleCellExperiment | TRUE | S4Vectors |
as.complex | SingleCellExperiment | TRUE | S4Vectors |
as.data.frame | SingleCellExperiment | TRUE | S4Vectors |
as.env | SingleCellExperiment | TRUE | S4Vectors |
as.integer | SingleCellExperiment | TRUE | S4Vectors |
as.list | SingleCellExperiment | TRUE | S4Vectors |
as.logical | SingleCellExperiment | TRUE | S4Vectors |
as.matrix | SingleCellExperiment | TRUE | S4Vectors |
as.numeric | SingleCellExperiment | TRUE | S4Vectors |
as.raw | SingleCellExperiment | TRUE | S4Vectors |
assay | SingleCellExperiment | TRUE | SummarizedExperiment |
assay | SingleCellExperiment | TRUE | SummarizedExperiment |
assay | SingleCellExperiment | TRUE | SummarizedExperiment |
assay<- | SingleCellExperiment | TRUE | SummarizedExperiment |
assay<- | SingleCellExperiment | TRUE | SummarizedExperiment |
assay<- | SingleCellExperiment | TRUE | SummarizedExperiment |
assayNames | SingleCellExperiment | TRUE | SummarizedExperiment |
assayNames<- | SingleCellExperiment | TRUE | SummarizedExperiment |
assays | SingleCellExperiment | TRUE | SummarizedExperiment |
assays<- | SingleCellExperiment | TRUE | SummarizedExperiment |
assays<- | SingleCellExperiment | TRUE | SummarizedExperiment |
bindROWS | SingleCellExperiment | TRUE | S4Vectors |
by | SingleCellExperiment | TRUE | S4Vectors |
c | SingleCellExperiment | TRUE | S4Vectors |
cbind | SingleCellExperiment | TRUE | SingleCellExperiment |
coerce | SingleCellExperiment | TRUE | methods |
coerce | SingleCellExperiment | TRUE | methods |
coerce | SingleCellExperiment | TRUE | SummarizedExperiment |
coerce | SingleCellExperiment | TRUE | methods |
coerce | SingleCellExperiment | TRUE | methods |
coerce | SingleCellExperiment | TRUE | SummarizedExperiment |
coerce | SingleCellExperiment | TRUE | SingleCellExperiment |
coerce | SingleCellExperiment | TRUE | methods |
coerce | SingleCellExperiment | TRUE | methods |
coerce | SingleCellExperiment | TRUE | S4Vectors |
coerce | SingleCellExperiment | TRUE | S4Vectors |
coerce | SingleCellExperiment | TRUE | S4Vectors |
coerce | SingleCellExperiment | TRUE | S4Vectors |
coerce | SingleCellExperiment | TRUE | S4Vectors |
coerce | SingleCellExperiment | TRUE | S4Vectors |
coerce | SingleCellExperiment | TRUE | S4Vectors |
coerce | SingleCellExperiment | TRUE | S4Vectors |
coerce | SingleCellExperiment | TRUE | S4Vectors |
coerce | SingleCellExperiment | TRUE | S4Vectors |
coerce | SingleCellExperiment | TRUE | S4Vectors |
coerce | SingleCellExperiment | TRUE | IRanges |
coerce<- | SingleCellExperiment | TRUE | S4Vectors |
coerce<- | SingleCellExperiment | TRUE | S4Vectors |
coerce<- | SingleCellExperiment | TRUE | SummarizedExperiment |
colData | SingleCellExperiment | TRUE | SingleCellExperiment |
colData<- | SingleCellExperiment | TRUE | SummarizedExperiment |
colData<- | SingleCellExperiment | TRUE | SummarizedExperiment |
colLabels | SingleCellExperiment | TRUE | SingleCellExperiment |
colLabels<- | SingleCellExperiment | TRUE | SingleCellExperiment |
colnames | SingleCellExperiment | TRUE | SummarizedExperiment |
colPair | SingleCellExperiment | TRUE | SingleCellExperiment |
colPair | SingleCellExperiment | TRUE | SingleCellExperiment |
colPair | SingleCellExperiment | TRUE | SingleCellExperiment |
colPair<- | SingleCellExperiment | TRUE | SingleCellExperiment |
colPair<- | SingleCellExperiment | TRUE | SingleCellExperiment |
colPair<- | SingleCellExperiment | TRUE | SingleCellExperiment |
colPairNames | SingleCellExperiment | TRUE | SingleCellExperiment |
colPairNames<- | SingleCellExperiment | TRUE | SingleCellExperiment |
colPairs | SingleCellExperiment | TRUE | SingleCellExperiment |
colPairs<- | SingleCellExperiment | TRUE | SingleCellExperiment |
combineCols | SingleCellExperiment | TRUE | SingleCellExperiment |
combineRows | SingleCellExperiment | TRUE | SummarizedExperiment |
Compare | SingleCellExperiment | TRUE | SummarizedExperiment |
Compare | SingleCellExperiment | TRUE | SummarizedExperiment |
Compare | SingleCellExperiment | TRUE | SummarizedExperiment |
countOverlaps | SingleCellExperiment | TRUE | IRanges |
countOverlaps | SingleCellExperiment | TRUE | IRanges |
countOverlaps | SingleCellExperiment | TRUE | IRanges |
counts | SingleCellExperiment | TRUE | |
counts<- | SingleCellExperiment | TRUE | |
coverage | SingleCellExperiment | TRUE | SummarizedExperiment |
cpm | SingleCellExperiment | TRUE | |
cpm<- | SingleCellExperiment | TRUE | |
dim | SingleCellExperiment | TRUE | S4Vectors |
dimnames | SingleCellExperiment | TRUE | S4Vectors |
dimnames<- | SingleCellExperiment | TRUE | SummarizedExperiment |
dimnames<- | SingleCellExperiment | TRUE | S4Vectors |
dimnames<- | SingleCellExperiment | TRUE | SummarizedExperiment |
disjointBins | SingleCellExperiment | TRUE | SummarizedExperiment |
distance | SingleCellExperiment | TRUE | SummarizedExperiment |
distance | SingleCellExperiment | TRUE | SummarizedExperiment |
distance | SingleCellExperiment | TRUE | SummarizedExperiment |
distanceToNearest | SingleCellExperiment | TRUE | SummarizedExperiment |
distanceToNearest | SingleCellExperiment | TRUE | SummarizedExperiment |
distanceToNearest | SingleCellExperiment | TRUE | SummarizedExperiment |
duplicated | SingleCellExperiment | TRUE | SummarizedExperiment |
elementMetadata | SingleCellExperiment | TRUE | SummarizedExperiment |
elementMetadata<- | SingleCellExperiment | TRUE | SummarizedExperiment |
end | SingleCellExperiment | TRUE | SummarizedExperiment |
end<- | SingleCellExperiment | TRUE | SummarizedExperiment |
eval | SingleCellExperiment | TRUE | S4Vectors |
eval | SingleCellExperiment | TRUE | S4Vectors |
expand | SingleCellExperiment | TRUE | S4Vectors |
expand.grid | SingleCellExperiment | TRUE | S4Vectors |
extractROWS | SingleCellExperiment | TRUE | S4Vectors |
FactorToClass | SingleCellExperiment | TRUE | S4Vectors |
findOverlaps | SingleCellExperiment | TRUE | GenomicAlignments |
findOverlaps | SingleCellExperiment | TRUE | GenomicAlignments |
findOverlaps | SingleCellExperiment | TRUE | GenomicAlignments |
findOverlaps | SingleCellExperiment | TRUE | SummarizedExperiment |
findOverlaps | SingleCellExperiment | TRUE | GenomicAlignments |
findOverlaps | SingleCellExperiment | TRUE | GenomicAlignments |
findOverlaps | SingleCellExperiment | TRUE | GenomicAlignments |
findOverlaps | SingleCellExperiment | TRUE | IRanges |
flank | SingleCellExperiment | TRUE | SummarizedExperiment |
follow | SingleCellExperiment | TRUE | SummarizedExperiment |
follow | SingleCellExperiment | TRUE | SummarizedExperiment |
follow | SingleCellExperiment | TRUE | SummarizedExperiment |
granges | SingleCellExperiment | TRUE | SummarizedExperiment |
head | SingleCellExperiment | TRUE | utils |
horizontal_slot_names | SingleCellExperiment | TRUE | SummarizedExperiment |
int_colData | SingleCellExperiment | TRUE | SingleCellExperiment |
int_colData<- | SingleCellExperiment | TRUE | SingleCellExperiment |
int_elementMetadata | SingleCellExperiment | TRUE | SingleCellExperiment |
int_elementMetadata<- | SingleCellExperiment | TRUE | SingleCellExperiment |
int_metadata | SingleCellExperiment | TRUE | SingleCellExperiment |
int_metadata<- | SingleCellExperiment | TRUE | SingleCellExperiment |
intersect | SingleCellExperiment | TRUE | GenomicRanges |
intersect | SingleCellExperiment | TRUE | GenomicRanges |
intersect | SingleCellExperiment | TRUE | S4Vectors |
is.na | SingleCellExperiment | TRUE | S4Vectors |
is.unsorted | SingleCellExperiment | TRUE | SummarizedExperiment |
isDisjoint | SingleCellExperiment | TRUE | SummarizedExperiment |
length | SingleCellExperiment | TRUE | SummarizedExperiment |
lengths | SingleCellExperiment | TRUE | S4Vectors |
logcounts | SingleCellExperiment | TRUE | |
logcounts<- | SingleCellExperiment | TRUE | |
mainExpName | SingleCellExperiment | TRUE | SingleCellExperiment |
mainExpName<- | SingleCellExperiment | TRUE | SingleCellExperiment |
match | SingleCellExperiment | TRUE | S4Vectors |
match | SingleCellExperiment | TRUE | S4Vectors |
match | SingleCellExperiment | TRUE | Biostrings |
match | SingleCellExperiment | TRUE | Biostrings |
mcols | SingleCellExperiment | TRUE | SummarizedExperiment |
mcols<- | SingleCellExperiment | TRUE | SummarizedExperiment |
merge | SingleCellExperiment | TRUE | S4Vectors |
mergeROWS | SingleCellExperiment | TRUE | S4Vectors |
metadata | SingleCellExperiment | TRUE | S4Vectors |
metadata<- | SingleCellExperiment | TRUE | S4Vectors |
mstack | SingleCellExperiment | TRUE | S4Vectors |
names | SingleCellExperiment | TRUE | SummarizedExperiment |
names<- | SingleCellExperiment | TRUE | SummarizedExperiment |
narrow | SingleCellExperiment | TRUE | SummarizedExperiment |
ncol | SingleCellExperiment | TRUE | SummarizedExperiment |
nearest | SingleCellExperiment | TRUE | SummarizedExperiment |
nearest | SingleCellExperiment | TRUE | SummarizedExperiment |
nearest | SingleCellExperiment | TRUE | SummarizedExperiment |
normcounts | SingleCellExperiment | TRUE | |
normcounts<- | SingleCellExperiment | TRUE | |
nrow | SingleCellExperiment | TRUE | SummarizedExperiment |
objectVersion | SingleCellExperiment | TRUE | SingleCellExperiment |
order | SingleCellExperiment | TRUE | SummarizedExperiment |
overlapsAny | SingleCellExperiment | TRUE | IRanges |
overlapsAny | SingleCellExperiment | TRUE | IRanges |
overlapsAny | SingleCellExperiment | TRUE | IRanges |
parallel_slot_names | SingleCellExperiment | TRUE | SingleCellExperiment |
pcompare | SingleCellExperiment | TRUE | SummarizedExperiment |
pcompare | SingleCellExperiment | TRUE | SummarizedExperiment |
pcompare | SingleCellExperiment | TRUE | SummarizedExperiment |
pcompare | SingleCellExperiment | TRUE | Biostrings |
pcompare | SingleCellExperiment | TRUE | Biostrings |
precede | SingleCellExperiment | TRUE | SummarizedExperiment |
precede | SingleCellExperiment | TRUE | SummarizedExperiment |
precede | SingleCellExperiment | TRUE | SummarizedExperiment |
promoters | SingleCellExperiment | TRUE | SummarizedExperiment |
ranges | SingleCellExperiment | TRUE | SummarizedExperiment |
ranges<- | SingleCellExperiment | TRUE | SummarizedExperiment |
rank | SingleCellExperiment | TRUE | SummarizedExperiment |
rbind | SingleCellExperiment | TRUE | SingleCellExperiment |
realize | SingleCellExperiment | TRUE | SummarizedExperiment |
reducedDim | SingleCellExperiment | TRUE | SingleCellExperiment |
reducedDim | SingleCellExperiment | TRUE | SingleCellExperiment |
reducedDim | SingleCellExperiment | TRUE | SingleCellExperiment |
reducedDim<- | SingleCellExperiment | TRUE | SingleCellExperiment |
reducedDim<- | SingleCellExperiment | TRUE | SingleCellExperiment |
reducedDim<- | SingleCellExperiment | TRUE | SingleCellExperiment |
reducedDimNames | SingleCellExperiment | TRUE | SingleCellExperiment |
reducedDimNames<- | SingleCellExperiment | TRUE | SingleCellExperiment |
reducedDims | SingleCellExperiment | TRUE | SingleCellExperiment |
reducedDims<- | SingleCellExperiment | TRUE | SingleCellExperiment |
relist | SingleCellExperiment | TRUE | IRanges |
rename | SingleCellExperiment | TRUE | S4Vectors |
rep | SingleCellExperiment | TRUE | S4Vectors |
rep.int | SingleCellExperiment | TRUE | S4Vectors |
replaceROWS | SingleCellExperiment | TRUE | S4Vectors |
resize | SingleCellExperiment | TRUE | SummarizedExperiment |
restrict | SingleCellExperiment | TRUE | SummarizedExperiment |
rev | SingleCellExperiment | TRUE | S4Vectors |
rowData | SingleCellExperiment | TRUE | SingleCellExperiment |
rowData<- | SingleCellExperiment | TRUE | SummarizedExperiment |
ROWNAMES | SingleCellExperiment | TRUE | S4Vectors |
rownames | SingleCellExperiment | TRUE | SummarizedExperiment |
ROWNAMES<- | SingleCellExperiment | TRUE | S4Vectors |
rowPair | SingleCellExperiment | TRUE | SingleCellExperiment |
rowPair | SingleCellExperiment | TRUE | SingleCellExperiment |
rowPair | SingleCellExperiment | TRUE | SingleCellExperiment |
rowPair<- | SingleCellExperiment | TRUE | SingleCellExperiment |
rowPair<- | SingleCellExperiment | TRUE | SingleCellExperiment |
rowPair<- | SingleCellExperiment | TRUE | SingleCellExperiment |
rowPairNames | SingleCellExperiment | TRUE | SingleCellExperiment |
rowPairNames<- | SingleCellExperiment | TRUE | SingleCellExperiment |
rowPairs | SingleCellExperiment | TRUE | SingleCellExperiment |
rowPairs<- | SingleCellExperiment | TRUE | SingleCellExperiment |
rowRanges | SingleCellExperiment | TRUE | SummarizedExperiment |
rowRanges<- | SingleCellExperiment | TRUE | SummarizedExperiment |
rowRanges<- | SingleCellExperiment | TRUE | SummarizedExperiment |
rowRanges<- | SingleCellExperiment | TRUE | SummarizedExperiment |
rowSubset | SingleCellExperiment | TRUE | SingleCellExperiment |
rowSubset<- | SingleCellExperiment | TRUE | SingleCellExperiment |
selfmatch | SingleCellExperiment | TRUE | S4Vectors |
seqinfo | SingleCellExperiment | TRUE | SummarizedExperiment |
seqinfo<- | SingleCellExperiment | TRUE | SummarizedExperiment |
seqlevelsInUse | SingleCellExperiment | TRUE | GenomeInfoDb |
seqnames | SingleCellExperiment | TRUE | SummarizedExperiment |
setdiff | SingleCellExperiment | TRUE | GenomicRanges |
setdiff | SingleCellExperiment | TRUE | GenomicRanges |
setdiff | SingleCellExperiment | TRUE | S4Vectors |
setequal | SingleCellExperiment | TRUE | S4Vectors |
shift | SingleCellExperiment | TRUE | SummarizedExperiment |
shiftApply | SingleCellExperiment | TRUE | S4Vectors |
show | SingleCellExperiment | TRUE | SingleCellExperiment |
showAsCell | SingleCellExperiment | TRUE | SummarizedExperiment |
sizeFactors | SingleCellExperiment | TRUE | SingleCellExperiment |
sizeFactors<- | SingleCellExperiment | TRUE | SingleCellExperiment |
sort | SingleCellExperiment | TRUE | SummarizedExperiment |
split | SingleCellExperiment | TRUE | S4Vectors |
split | SingleCellExperiment | TRUE | S4Vectors |
split | SingleCellExperiment | TRUE | SummarizedExperiment |
split | SingleCellExperiment | TRUE | rtracklayer |
split | SingleCellExperiment | TRUE | S4Vectors |
split<- | SingleCellExperiment | TRUE | IRanges |
start | SingleCellExperiment | TRUE | SummarizedExperiment |
start<- | SingleCellExperiment | TRUE | SummarizedExperiment |
strand | SingleCellExperiment | TRUE | SummarizedExperiment |
strand<- | SingleCellExperiment | TRUE | SummarizedExperiment |
subset | SingleCellExperiment | TRUE | SummarizedExperiment |
subsetByOverlaps | SingleCellExperiment | TRUE | IRanges |
summary | SingleCellExperiment | TRUE | S4Vectors |
table | SingleCellExperiment | TRUE | S4Vectors |
tail | SingleCellExperiment | TRUE | utils |
tapply | SingleCellExperiment | TRUE | IRanges |
tapply | SingleCellExperiment | TRUE | IRanges |
tapply | SingleCellExperiment | TRUE | IRanges |
tpm | SingleCellExperiment | TRUE | |
tpm<- | SingleCellExperiment | TRUE | |
transform | SingleCellExperiment | TRUE | S4Vectors |
trim | SingleCellExperiment | TRUE | SummarizedExperiment |
union | SingleCellExperiment | TRUE | GenomicRanges |
union | SingleCellExperiment | TRUE | GenomicRanges |
union | SingleCellExperiment | TRUE | S4Vectors |
unique | SingleCellExperiment | TRUE | S4Vectors |
updateObject | SingleCellExperiment | TRUE | SingleCellExperiment |
values | SingleCellExperiment | TRUE | S4Vectors |
values<- | SingleCellExperiment | TRUE | S4Vectors |
vertical_slot_names | SingleCellExperiment | TRUE | SummarizedExperiment |
weights | SingleCellExperiment | TRUE | |
weights<- | SingleCellExperiment | TRUE | |
width | SingleCellExperiment | TRUE | SummarizedExperiment |
width<- | SingleCellExperiment | TRUE | SummarizedExperiment |
window | SingleCellExperiment | TRUE | S4Vectors |
window<- | SingleCellExperiment | TRUE | IRanges |
with | SingleCellExperiment | TRUE | S4Vectors |
xtabs | SingleCellExperiment | TRUE | S4Vectors |
xtfrm | SingleCellExperiment | TRUE | S4Vectors |
Biostrings
Biostrings
is a package containing infrastructure for dealing with biological sequences, or strings.- These are represented as so called
XString
objects (orXStringSet
objects if you want to include multiple sequences), whereX
can beDNA
,RNA
,AA
orB
and signifies the type of sequence(s) in the object.
library(Biostrings)
- To create a
DNAString
object, we can use theDNAString()
constructor function. To see how the function works, type?DNAString
.
<- DNAString(x = "AGGCATAGA")) (dna
9-letter DNAString object
seq: AGGCATAGA
- The classes come with useful validity checks. For example, not all characters are allowed in a DNA string (but note that it can contain other bases than ‘A’, ‘C’, ‘G’ and ‘T’, following the IUPAC code of ambiguous bases).
alphabet(DNAString())
[1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+" "."
IUPAC_CODE_MAP
A C G T M R W S Y K V
"A" "C" "G" "T" "AC" "AG" "AT" "CG" "CT" "GT" "ACG"
H D B N
"ACT" "AGT" "CGT" "ACGT"
## Invalid DNA string
DNAString(x = "EQI")
Error in .Call2("new_XString_from_CHARACTER", class(x0), string, start, : key 69 (char 'E') not in lookup table
alphabet(AAString())
[1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y"
[20] "V" "U" "O" "B" "J" "Z" "X" "*" "-" "+" "."
AAString(x = "EQI")
3-letter AAString object
seq: EQI
Once we have defined the object, we can perform operations like:
- getting the reverse complement of the DNA sequence
reverseComplement(dna)
9-letter DNAString object
seq: TCTATGCCT
We could not do this with a regular character string
reverseComplement("AGGCATAGA")
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'reverseComplement' for signature '"character"'
- translating the DNA sequence to an amino acid sequence
translate(dna)
3-letter AAString object
seq: RHR
We can not translate an amino acid sequence:
translate(AAString(x = "EQI"))
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'translate' for signature '"AAString"'
- If we have more than one sequence, we can represent them in a
DNAStringSet
object, which is a collection ofDNAString
objects
<- DNAStringSet(c(x = "AGGCATAGA", y = "GGTATGAG"))) (dna_multiple
DNAStringSet object of length 2:
width seq names
[1] 9 AGGCATAGA x
[2] 8 GGTATGAG y
## Subsetting with [] gives back a DNAStringSet
1] dna_multiple[
DNAStringSet object of length 1:
width seq names
[1] 9 AGGCATAGA x
## Subsetting with [[]] extracts the individual objects
1]] dna_multiple[[
9-letter DNAString object
seq: AGGCATAGA
- It is also easy to extract subsequences of either an
XString
or anXStringSet
object
## Get the length of each string
width(dna_multiple)
[1] 9 8
## Extract the first three positions in each string
::subseq(dna_multiple, start = 1, end = 3) Biostrings
DNAStringSet object of length 2:
width seq names
[1] 3 AGG x
[2] 3 GGT y
## Extract subsequences of different length
::subseq(dna_multiple, start = 1, end = c(2, 5)) Biostrings
DNAStringSet object of length 2:
width seq names
[1] 2 AG x
[2] 5 GGTAT y
::subseq(dna_multiple, start = 1,
Biostringsend = width(dna_multiple) - 4)
DNAStringSet object of length 2:
width seq names
[1] 5 AGGCA x
[2] 4 GGTA y
Exercises
- Can you tell whether
AATTGGCCRGGCCAATT
is a valid DNA sequence?
::DNAString("AATTGGCCRGGCCAATT") Biostrings
17-letter DNAString object
seq: AATTGGCCRGGCCAATT
Genomic sequence information is often stored in fasta
files. DNAStringSet
objects can be created from such files by the readDNAStringSet()
function from the Biostrings
package. Download the gencode.v28.transcripts.1.1.10M.fa
file from the course web page and read it into a DNAStringSet
object using this function. What do you think these sequences represent?
<- Biostrings::readDNAStringSet("data/02/gencode.v28.transcripts.1.1.10M.fa") txs
- How many sequences are there in this file?
length(txs)
[1] 1373
- What are the lengths of the shortest and longest sequence?
range(width(txs))
[1] 59 11666
- Extract the first 10 bases of each sequence
::subseq(txs, start = 1, end = 10) Biostrings
DNAStringSet object of length 1373:
width seq names
[1] 10 GTTAACTTGC ENST00000456328.2...
[2] 10 GTGTCTGACT ENST00000450305.2...
[3] 10 ATGGGAGCCG ENST00000488147.1...
[4] 10 TGTGGGAGAG ENST00000619216.1...
[5] 10 GTGCACACGG ENST00000473358.1...
... ... ...
[1369] 10 ATGATTCACT ENST00000636827.2...
[1370] 10 AGCATATTCT ENST00000578045.1...
[1371] 10 GCTAAAATAA ENST00000413148.1...
[1372] 10 GTCCCGGCCC ENST00000315901.4...
[1373] 10 GAGCCTCCGG ENST00000294435.7...
- Get the fraction of A, C, G and T bases in each sequence (hint: there is a function named
alphabetFrequency
). Plot a histogram of the fraction of Gs across all sequences
<- alphabetFrequency(txs)
freqs <- sweep(freqs, MARGIN = 1, STATS = rowSums(freqs), FUN = "/")
fracs hist(fracs[, "G"])
Annotation resources
Bioconductor provides a range of annotation resources, among them are:
- Gene-based information for a given organism (OrgDb)
- Transcript ranges for a given genome build (TxDb and EnsDb)
- Genome sequence for a given genome build (BSgenome)
These annotation resources can be accessed in different ways:
- Via annotation packages (updated every 6 months with the Bioconductor releases)
- By pulling down information from online resources such as the Bioconductor AnnotationHub
As an example lets have a look at a human transcript annotation package TxDb.Hsapiens.UCSC.hg38.knownGene
. As the name suggest, the annotation is derived from UCSC “knownGene” track for the hg38 genome build.
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
?TxDb.Hsapiens.UCSC.hg38.knownGene
Two accessor functions are transcripts()
and transcriptsBy()
. Lets compare their output:
<- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
gr gr
GRanges object with 276905 ranges and 2 metadata columns:
seqnames ranges strand | tx_id
<Rle> <IRanges> <Rle> | <integer>
[1] chr1 11869-14409 + | 1
[2] chr1 12010-13670 + | 2
[3] chr1 29554-31097 + | 3
[4] chr1 30267-31109 + | 4
[5] chr1 30366-30503 + | 5
... ... ... ... . ...
[276901] chrX_MU273397v1_alt 239036-260095 - | 276901
[276902] chrX_MU273397v1_alt 272358-282686 - | 276902
[276903] chrX_MU273397v1_alt 314193-316302 - | 276903
[276904] chrX_MU273397v1_alt 314813-315236 - | 276904
[276905] chrX_MU273397v1_alt 324527-324923 - | 276905
tx_name
<character>
[1] ENST00000456328.2
[2] ENST00000450305.2
[3] ENST00000473358.1
[4] ENST00000469289.1
[5] ENST00000607096.1
... ...
[276901] ENST00000710260.1
[276902] ENST00000710028.1
[276903] ENST00000710030.1
[276904] ENST00000710216.1
[276905] ENST00000710031.1
-------
seqinfo: 711 sequences (1 circular) from hg38 genome
<- transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene, by = "gene")
grl grl
GRangesList object of length 32868:
$`1`
GRanges object with 8 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr19 58345178-58347634 - | 228947 ENST00000596924.1
[2] chr19 58345183-58353492 - | 228948 ENST00000263100.8
[3] chr19 58346854-58356225 - | 228949 ENST00000600123.5
[4] chr19 58346858-58353491 - | 228950 ENST00000595014.1
[5] chr19 58346860-58347657 - | 228951 ENST00000598345.1
[6] chr19 58348466-58362751 - | 228952 ENST00000599109.5
[7] chr19 58350594-58353129 - | 228953 ENST00000600966.1
[8] chr19 58353021-58356083 - | 228954 ENST00000596636.1
-------
seqinfo: 711 sequences (1 circular) from hg38 genome
$`10`
GRanges object with 3 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr8 18386311-18388323 + | 104459 ENST00000660285.1
[2] chr8 18391282-18401218 + | 104461 ENST00000286479.4
[3] chr8 18391287-18400993 + | 104462 ENST00000520116.1
-------
seqinfo: 711 sequences (1 circular) from hg38 genome
...
<32866 more elements>
Functions like
exons()
,transcripts()
orgenes()
returnGRanges
objects with a set of ranges representing the level of detail of the query.The “xxxBy” functions, e.g.
transcriptsBy()
orexonsBy()
, returnGRangesList
objects. They enable grouping of the output in terms of another genomic feature, in our case genes.
Transcript annotations and their functionality play an important role when analyzing gene expression data. The coming lectures on RNA-Seq will introduce more functionality in the context of examples.
Exercises
The OrgDb package org.Hs.eg.db
provides human gene annotation. It offers a simple interface to an underlying SQLite database. The main accessor functions for this (and other SQLite) annotation objects are select()
, columns()
and keys()
.
- Install the
org.Hs.eg.db
package, load it and check its help page. Detailed examples on how to work with these annotation objects can be found on the linkedAnnotationDb
class help page.
library(org.Hs.eg.db)
?org.Hs.eg.db
- What kind of data can be retrieved from the
org.Hs.eg.db
object? Check thecolumns()
function!
columns(org.Hs.eg.db)
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
[16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
[21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
[26] "UNIPROT"
- Use the
select()
function to retrieve the gene names as well as ENTREZ and ENSEMBL gene IDs for the three genes: RAF1, FLT3 and MAPK1.
select(org.Hs.eg.db, keys = c("RAF1", "FLT3", "MAPK1"),
keytype = "SYMBOL", columns = c("ENTREZID", "ENSEMBL", "GENENAME"))
'select()' returned 1:1 mapping between keys and columns
SYMBOL ENTREZID ENSEMBL GENENAME
1 RAF1 5894 ENSG00000132155 Raf-1 proto-oncogene, serine/threonine kinase
2 FLT3 2322 ENSG00000122025 fms related receptor tyrosine kinase 3
3 MAPK1 5594 ENSG00000100030 mitogen-activated protein kinase 1
Further reading
- An overview of the Bioconductor project and its capabilities is given in Huber et al.: Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods 12:115–121 (2015).
- An introduction to the analysis of ranges in Bioconductor (particularly useful if you have experience with bedtools) is provided in the HelloRanges package
- Bioconductor provides a long list of material from previous courses, some videos and links to upcoming events
- A Bioconductor cheat sheet from Mike Love
- A lesson on the Bioconductor project (under development)