How to access annotation resources via Bioconductor?

There are many types of objects used to store annotation information in Bioconductor. We will look at the following main classes:

  • OrgDb (gene-based information for a given organism)
  • TxDb (transcript ranges for a given annotation resource/version)
  • EnsDb (transcript ranges for a given Ensembl release)
  • BSgenome (genome sequence for a given genome build)

These annotation resources can be accessed in different ways:

  • Via annotation packages (updated every 6 months with the Bioconductor releases), e.g.
    • TxDb.Hsapiens.UCSC.hg38.knownGene
    • BSgenome.Hsapiens.UCSC.hg38
    • org.Hs.eg.db
    • GO.db
  • By pulling down information from online resources such as

The AnnotationHub

  • The AnnotationHub represents a relatively recent addition to the Bioconductor infrastructure.
  • The annotation resources are provided as standard Bioconductor objects (such as those covered in the previous lectures) - GRanges, TxDb, etc.
  • The first time you access an annotation resource from the AnnotationHub it will be downloaded and cached on your computer. In that way, the next time you need it you won’t have to download it again.
  • Let’s first explore the content of the AnnotationHub. We need to load the package and create an AnnotationHub instance. Note that creating the instance will also print out the ‘snapshot date’.

    library(AnnotationHub)
    library(GenomicFeatures)
    (ah <- AnnotationHub())
    ## snapshotDate(): 2018-10-24
    ## AnnotationHub with 47474 records
    ## # snapshotDate(): 2018-10-24 
    ## # $dataprovider: BroadInstitute, Ensembl, UCSC, ftp://ftp.ncbi.nlm.nih.gov/g...
    ## # $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Bos taurus,...
    ## # $rdataclass: GRanges, BigWigFile, FaFile, TwoBitFile, OrgDb, Rle, ChainFil...
    ## # additional mcols(): taxonomyid, genome, description,
    ## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
    ## #   rdatapath, sourceurl, sourcetype 
    ## # retrieve records with, e.g., 'object[["AH2"]]' 
    ## 
    ##             title                                               
    ##   AH2     | Ailuropoda_melanoleuca.ailMel1.69.dna.toplevel.fa   
    ##   AH3     | Ailuropoda_melanoleuca.ailMel1.69.dna_rm.toplevel.fa
    ##   AH4     | Ailuropoda_melanoleuca.ailMel1.69.dna_sm.toplevel.fa
    ##   AH5     | Ailuropoda_melanoleuca.ailMel1.69.ncrna.fa          
    ##   AH6     | Ailuropoda_melanoleuca.ailMel1.69.pep.all.fa        
    ##   ...       ...                                                 
    ##   AH67895 | org.Vibrio_vulnificus.eg.sqlite                     
    ##   AH67896 | org.Achromobacter_group_B.eg.sqlite                 
    ##   AH67897 | org.Achromobacter_group_E.eg.sqlite                 
    ##   AH67898 | org.Pannonibacter_phragmitetus.eg.sqlite            
    ##   AH67899 | org.Salinispora_arenicola_CNS-205.eg.sqlite
    class(ah)
    ## [1] "AnnotationHub"
    ## attr(,"package")
    ## [1] "AnnotationHub"
    length(ah)
    ## [1] 47474

    The AnnotationHub instance can be subset using the [ operator

    ah[1:3]
    ## AnnotationHub with 3 records
    ## # snapshotDate(): 2018-10-24 
    ## # $dataprovider: Ensembl
    ## # $species: Ailuropoda melanoleuca
    ## # $rdataclass: FaFile
    ## # additional mcols(): taxonomyid, genome, description,
    ## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
    ## #   rdatapath, sourceurl, sourcetype 
    ## # retrieve records with, e.g., 'object[["AH2"]]' 
    ## 
    ##         title                                               
    ##   AH2 | Ailuropoda_melanoleuca.ailMel1.69.dna.toplevel.fa   
    ##   AH3 | Ailuropoda_melanoleuca.ailMel1.69.dna_rm.toplevel.fa
    ##   AH4 | Ailuropoda_melanoleuca.ailMel1.69.dna_sm.toplevel.fa
    ah[1]
    ## AnnotationHub with 1 record
    ## # snapshotDate(): 2018-10-24 
    ## # names(): AH2
    ## # $dataprovider: Ensembl
    ## # $species: Ailuropoda melanoleuca
    ## # $rdataclass: FaFile
    ## # $rdatadateadded: 2013-03-19
    ## # $title: Ailuropoda_melanoleuca.ailMel1.69.dna.toplevel.fa
    ## # $description: FASTA DNA sequence for Ailuropoda melanoleuca
    ## # $taxonomyid: 9646
    ## # $genome: ailMel1
    ## # $sourcetype: FASTA
    ## # $sourceurl: ftp://ftp.ensembl.org/pub/release-69/fasta/ailuropoda_melanole...
    ## # $sourcesize: 693412448
    ## # $tags: c("FASTA", "ensembl", "sequence") 
    ## # retrieve record with 'object[["AH2"]]'

    As usual in R, subsetting a list-like object with [ will return an object of the same class, while the contents of the list can be extracted using the [[ operator, using either a numeric index or the AnnotationHub ID.

    (cb <- ah[[1]])
    ## downloading 0 resources
    ## loading from cache 
    ##     '/Users/charlotte//.AnnotationHub/2'
    ##     '/Users/charlotte//.AnnotationHub/13963'
    ## require("Rsamtools")
    ## class: FaFile 
    ## path: /Users/charlotte//.AnnotationHub/2
    ## index: /Users/charlotte//.AnnotationHub/13963
    ## isOpen: FALSE 
    ## yieldSize: NA
    seqinfo(cb)
    ## Seqinfo object with 81467 sequences from an unspecified genome:
    ##   seqnames   seqlengths isCircular genome
    ##   GL273092.1        100       <NA>   <NA>
    ##   GL272756.1        100       <NA>   <NA>
    ##   GL273466.1        100       <NA>   <NA>
    ##   GL272527.1        100       <NA>   <NA>
    ##   GL273354.1        100       <NA>   <NA>
    ##   ...               ...        ...    ...
    ##   GL192342.1    5474287       <NA>   <NA>
    ##   GL192341.1    5547311       <NA>   <NA>
    ##   GL192340.1    5628269       <NA>   <NA>
    ##   GL192339.1    5850765       <NA>   <NA>
    ##   GL192338.1    6047896       <NA>   <NA>
    ah[["AH5012"]]
    ## downloading 0 resources
    ## loading from cache 
    ##     '/Users/charlotte//.AnnotationHub/5012'
    ## UCSC track 'cytoBand'
    ## UCSCData object with 862 ranges and 1 metadata column:
    ##         seqnames            ranges strand |        name
    ##            <Rle>         <IRanges>  <Rle> | <character>
    ##     [1]     chr1         1-2300000      * |      p36.33
    ##     [2]     chr1   2300001-5400000      * |      p36.32
    ##     [3]     chr1   5400001-7200000      * |      p36.31
    ##     [4]     chr1   7200001-9200000      * |      p36.23
    ##     [5]     chr1  9200001-12700000      * |      p36.22
    ##     ...      ...               ...    ... .         ...
    ##   [858]    chr22 37600001-41000000      * |       q13.1
    ##   [859]    chr22 41000001-44200000      * |       q13.2
    ##   [860]    chr22 44200001-48400000      * |      q13.31
    ##   [861]    chr22 48400001-49400000      * |      q13.32
    ##   [862]    chr22 49400001-51304566      * |      q13.33
    ##   -------
    ##   seqinfo: 93 sequences (1 circular) from hg19 genome

    We can list the different data types represented in the AnnotationHub object (do ah$<tab> to see a list of all columns).

    unique(ah$rdataclass)
    ##  [1] "FaFile"                            "GRanges"                          
    ##  [3] "data.frame"                        "Inparanoid8Db"                    
    ##  [5] "TwoBitFile"                        "ChainFile"                        
    ##  [7] "SQLiteConnection"                  "biopax"                           
    ##  [9] "BigWigFile"                        "AAStringSet"                      
    ## [11] "MSnSet"                            "mzRpwiz"                          
    ## [13] "mzRident"                          "list"                             
    ## [15] "TxDb"                              "Rle"                              
    ## [17] "EnsDb"                             "VcfFile"                          
    ## [19] "igraph"                            "data.frame, DNAStringSet, GRanges"
    ## [21] "OrgDb"

    How many species are covered by the resources?

    length(unique(ah$species))
    ## [1] 2042

    The AnnotationHub object can be queried in different ways: - The query() function will search in all the fields in the data base for the provided query - The subset() function can also be used, to subset the object based on one or several of its annotation columns.

    Let’s use these approaches to query the AnnotationHub for annotation resources for mouse.

    subset(ah, species == "Mus musculus")
    ## AnnotationHub with 2653 records
    ## # snapshotDate(): 2018-10-24 
    ## # $dataprovider: Ensembl, Haemcode, UCSC, Gencode, BioMart, KEGG, Inparanoid...
    ## # $species: Mus musculus
    ## # $rdataclass: GRanges, TwoBitFile, BigWigFile, FaFile, ChainFile, Rle, EnsD...
    ## # additional mcols(): taxonomyid, genome, description,
    ## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
    ## #   rdatapath, sourceurl, sourcetype 
    ## # retrieve records with, e.g., 'object[["AH187"]]' 
    ## 
    ##             title                                              
    ##   AH187   | Mus_musculus.GRCm38.69.cdna.all.fa                 
    ##   AH188   | Mus_musculus.GRCm38.69.dna.toplevel.fa             
    ##   AH189   | Mus_musculus.GRCm38.69.dna_rm.toplevel.fa          
    ##   AH190   | Mus_musculus.GRCm38.69.dna_sm.toplevel.fa          
    ##   AH191   | Mus_musculus.GRCm38.69.ncrna.fa                    
    ##   ...       ...                                                
    ##   AH65899 | Mus_musculus_wsbeij.WSB_EiJ_v1.dna_rm.toplevel.2bit
    ##   AH65900 | Mus_musculus_wsbeij.WSB_EiJ_v1.dna_sm.toplevel.2bit
    ##   AH65901 | Mus_musculus_wsbeij.WSB_EiJ_v1.ncrna.2bit          
    ##   AH66157 | org.Mm.eg.db.sqlite                                
    ##   AH66175 | TxDb.Mmusculus.UCSC.mm10.knownGene.sqlite
    query(ah, "musculus")
    ## AnnotationHub with 2653 records
    ## # snapshotDate(): 2018-10-24 
    ## # $dataprovider: Ensembl, Haemcode, UCSC, Gencode, BioMart, KEGG, Inparanoid...
    ## # $species: Mus musculus
    ## # $rdataclass: GRanges, TwoBitFile, BigWigFile, FaFile, ChainFile, Rle, EnsD...
    ## # additional mcols(): taxonomyid, genome, description,
    ## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
    ## #   rdatapath, sourceurl, sourcetype 
    ## # retrieve records with, e.g., 'object[["AH187"]]' 
    ## 
    ##             title                                              
    ##   AH187   | Mus_musculus.GRCm38.69.cdna.all.fa                 
    ##   AH188   | Mus_musculus.GRCm38.69.dna.toplevel.fa             
    ##   AH189   | Mus_musculus.GRCm38.69.dna_rm.toplevel.fa          
    ##   AH190   | Mus_musculus.GRCm38.69.dna_sm.toplevel.fa          
    ##   AH191   | Mus_musculus.GRCm38.69.ncrna.fa                    
    ##   ...       ...                                                
    ##   AH65899 | Mus_musculus_wsbeij.WSB_EiJ_v1.dna_rm.toplevel.2bit
    ##   AH65900 | Mus_musculus_wsbeij.WSB_EiJ_v1.dna_sm.toplevel.2bit
    ##   AH65901 | Mus_musculus_wsbeij.WSB_EiJ_v1.ncrna.2bit          
    ##   AH66157 | org.Mm.eg.db.sqlite                                
    ##   AH66175 | TxDb.Mmusculus.UCSC.mm10.knownGene.sqlite

OrgDb objects

  • An OrgDb object is basically a SQLite database containing information for a given species (organism)
  • The AnnotationHub contains a large number of such objects:

    (orgs <- subset(ah, ah$rdataclass == "OrgDb"))
    ## AnnotationHub with 2002 records
    ## # snapshotDate(): 2018-10-24 
    ## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, FungiDB, TriTrypDB, ...
    ## # $species: Toxoplasma gondii, Fusarium oxysporum, Entamoeba histolytica, Hi...
    ## # $rdataclass: OrgDb
    ## # additional mcols(): taxonomyid, genome, description,
    ## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
    ## #   rdatapath, sourceurl, sourcetype 
    ## # retrieve records with, e.g., 'object[["AH65001"]]' 
    ## 
    ##             title                                                              
    ##   AH65001 | Edhazardia aedis USNM 41457 genome wide annotations                
    ##   AH65002 | Aspergillus versicolor CBS 583.65 genome wide annotations          
    ##   AH65003 | Aspergillus sydowii CBS 593.65 genome wide annotations             
    ##   AH65004 | Phytophthora cinnamomi var. cinnamomi CBS 144.22 genome wide ann...
    ##   AH65005 | Trypanosoma vivax Y486 genome wide annotations                     
    ##   ...       ...                                                                
    ##   AH67895 | org.Vibrio_vulnificus.eg.sqlite                                    
    ##   AH67896 | org.Achromobacter_group_B.eg.sqlite                                
    ##   AH67897 | org.Achromobacter_group_E.eg.sqlite                                
    ##   AH67898 | org.Pannonibacter_phragmitetus.eg.sqlite                           
    ##   AH67899 | org.Salinispora_arenicola_CNS-205.eg.sqlite
  • As expected, there is only one OrgDb object for human:

    (orghs <- subset(ah, ah$rdataclass == "OrgDb" & ah$species == "Homo sapiens"))
    ## AnnotationHub with 1 record
    ## # snapshotDate(): 2018-10-24 
    ## # names(): AH66156
    ## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
    ## # $species: Homo sapiens
    ## # $rdataclass: OrgDb
    ## # $rdatadateadded: 2018-10-22
    ## # $title: org.Hs.eg.db.sqlite
    ## # $description: NCBI gene ID based annotations about Homo sapiens
    ## # $taxonomyid: 9606
    ## # $genome: NCBI genomes
    ## # $sourcetype: NCBI/ensembl
    ## # $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.ensembl.org/p...
    ## # $sourcesize: NA
    ## # $tags: c("NCBI", "Gene", "Annotation") 
    ## # retrieve record with 'object[["AH66156"]]'
  • We download this object and explore it a bit.

    org <- ah[["AH66156"]]
    ## downloading 0 resources
    ## loading from cache 
    ##     '/Users/charlotte//.AnnotationHub/72902'
    class(org)
    ## [1] "OrgDb"
    ## attr(,"package")
    ## [1] "AnnotationDbi"
    methods(class = "OrgDb")
    ##  [1] $                      $<-                    annotatedDataFrameFrom
    ##  [4] assayData              assayData<-            coerce                
    ##  [7] columns                combine                contents              
    ## [10] dbconn                 dbfile                 dbInfo                
    ## [13] dbmeta                 dbschema               ExpressionSet         
    ## [16] featureNames           featureNames<-         initialize            
    ## [19] isNA                   keys                   keytypes              
    ## [22] mapIds                 mappedkeys             metadata              
    ## [25] nhit                   revmap                 sample                
    ## [28] sampleNames            sampleNames<-          saveDb                
    ## [31] select                 show                   species               
    ## [34] storageMode            storageMode<-          taxonomyId            
    ## [37] updateObject          
    ## see '?methods' for accessing help and source code
    • The columns() function lists all the available columns in the OrgDb object. These columns represent all the different types of information that you can retrieve for the included records.
    • The keytypes() function is similar, but lists all the types of IDs that can be used as input, to retrieve records in the object.
    • The keys() function returns all the available IDs (keys) for a given keytype
    • The general strategy is to provide a set of search keys, indicate what type of information these represent via the keytype (e.g., are they gene symbols, Entrez IDs, GO terms, …), and ask to get other types of information (specified as the columns) about these entries.

      columns(org)
      ##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
      ##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
      ## [11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"        
      ## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
      ## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"       "UNIGENE"     
      ## [26] "UNIPROT"
      keytypes(org)
      ##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
      ##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
      ## [11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"        
      ## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
      ## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"       "UNIGENE"     
      ## [26] "UNIPROT"
      head(keys(org, keytype = "SYMBOL"))
      ## [1] "A1BG"  "A2M"   "A2MP1" "NAT1"  "NAT2"  "NATP"
  • If we want to know what these columns represent, we can use the corresponding help page

    help("SYMBOL")
  • The main function for extracting information in the way described above is called select().
  • This function expects you to provide a set of keys, the keytype, and the list of columns that you would like to extract
  • Here, we are retrieving the gene symbol and the GO categories for the gene with Entrez ID 1
  • Note that each provided key can be represented by multiple rows in the output data frame, if there is not a 1-1 mapping between the key and some of the columns.

    select(org, keys = c("1"), 
       columns = c("SYMBOL", "GO"), keytype = "ENTREZID")
    ## 'select()' returned 1:many mapping between keys and columns
    ##    ENTREZID SYMBOL         GO EVIDENCE ONTOLOGY
    ## 1         1   A1BG GO:0002576      TAS       BP
    ## 2         1   A1BG GO:0003674       ND       MF
    ## 3         1   A1BG GO:0005576      HDA       CC
    ## 4         1   A1BG GO:0005576      IDA       CC
    ## 5         1   A1BG GO:0005576      TAS       CC
    ## 6         1   A1BG GO:0005615      HDA       CC
    ## 7         1   A1BG GO:0008150       ND       BP
    ## 8         1   A1BG GO:0031093      TAS       CC
    ## 9         1   A1BG GO:0034774      TAS       CC
    ## 10        1   A1BG GO:0043312      TAS       BP
    ## 11        1   A1BG GO:0062023      HDA       CC
    ## 12        1   A1BG GO:0070062      HDA       CC
    ## 13        1   A1BG GO:0072562      HDA       CC
    ## 14        1   A1BG GO:1904813      TAS       CC
  • There is also a function called mapIds(), which extracts one column at the time (e.g. as a named vector), and provides different ways of handling one-to-many mappings (the default is to return only the first match)

    mapIds(org, keys = c("1"), 
       column = "GO", keytype = "ENTREZID", multiVals = "first")
    ## 'select()' returned 1:many mapping between keys and columns
    ##            1 
    ## "GO:0002576"
    mapIds(org, keys = c("1"), 
       column = "GO", keytype = "ENTREZID", multiVals = "list")
    ## 'select()' returned 1:many mapping between keys and columns
    ## $`1`
    ##  [1] "GO:0002576" "GO:0003674" "GO:0005576" "GO:0005576" "GO:0005576"
    ##  [6] "GO:0005615" "GO:0008150" "GO:0031093" "GO:0034774" "GO:0043312"
    ## [11] "GO:0062023" "GO:0070062" "GO:0072562" "GO:1904813"
  • Instead of getting the OrgDb object from the AnnotationHub, we could have installed a species-specific Bioconductor package with this information
  • In this case, the package is called org.Hs.eg.db (Hs for Homo sapiens, eg for Entrez Gene, the main identifier that the package is based on)

    library(org.Hs.eg.db)
    class(org.Hs.eg.db)
    ## [1] "OrgDb"
    ## attr(,"package")
    ## [1] "AnnotationDbi"
    columns(org.Hs.eg.db)
    ##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
    ##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
    ## [11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"        
    ## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
    ## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"       "UNIGENE"     
    ## [26] "UNIPROT"
    keytypes(org.Hs.eg.db)
    ##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
    ##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
    ## [11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"        
    ## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
    ## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"       "UNIGENE"     
    ## [26] "UNIPROT"
    select(org.Hs.eg.db, keys = c("1"), 
       columns = c("SYMBOL", "GO"), keytype = "ENTREZID")
    ## 'select()' returned 1:many mapping between keys and columns
    ##    ENTREZID SYMBOL         GO EVIDENCE ONTOLOGY
    ## 1         1   A1BG GO:0002576      TAS       BP
    ## 2         1   A1BG GO:0003674       ND       MF
    ## 3         1   A1BG GO:0005576      HDA       CC
    ## 4         1   A1BG GO:0005576      IDA       CC
    ## 5         1   A1BG GO:0005576      TAS       CC
    ## 6         1   A1BG GO:0005615      HDA       CC
    ## 7         1   A1BG GO:0008150       ND       BP
    ## 8         1   A1BG GO:0031093      TAS       CC
    ## 9         1   A1BG GO:0034774      TAS       CC
    ## 10        1   A1BG GO:0043312      TAS       BP
    ## 11        1   A1BG GO:0062023      HDA       CC
    ## 12        1   A1BG GO:0070062      HDA       CC
    ## 13        1   A1BG GO:0072562      HDA       CC
    ## 14        1   A1BG GO:1904813      TAS       CC

Exercise

Load the OrgDb object for mouse. Use the columns() function to find out what types of annotations that can be retrieved from the object. Use the keytypes() function to find the types that can be used to query the database.

library(org.Mm.eg.db)
## 
columns(org.Mm.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GO"           "GOALL"        "IPI"          "MGI"          "ONTOLOGY"    
## [16] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROSITE"     
## [21] "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"
keytypes(org.Mm.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GO"           "GOALL"        "IPI"          "MGI"          "ONTOLOGY"    
## [16] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROSITE"     
## [21] "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"

Use the keys() function to retrieve all the Ensembl gene IDs. Then extract the GO annotation information and the gene symbol for each of these keys.

ks <- keys(org.Mm.eg.db, keytype = "ENSEMBL")
res <- select(org.Mm.eg.db, keys = ks, keytype = "ENSEMBL", columns = c("SYMBOL", "GO"))
## 'select()' returned 1:many mapping between keys and columns
head(res)
##              ENSEMBL SYMBOL         GO EVIDENCE ONTOLOGY
## 1 ENSMUSG00000030359    Pzp GO:0004866      IDA       MF
## 2 ENSMUSG00000030359    Pzp GO:0004867      IEA       MF
## 3 ENSMUSG00000030359    Pzp GO:0005576      TAS       CC
## 4 ENSMUSG00000030359    Pzp GO:0005615      IEA       CC
## 5 ENSMUSG00000030359    Pzp GO:0007566      IGI       BP
## 6 ENSMUSG00000030359    Pzp GO:0010466      IEA       BP

Exercise

Next we will see how to add annotation information to a data set. Let’s first generate a small artificial SummarizedExperiment object with five mouse genes and four samples.

library(SummarizedExperiment)
se <- SummarizedExperiment(
    assays = list(counts = matrix(round(100 * runif(20)), nrow = 5)),
    colData = DataFrame(sample = 1:4, condition = rep(c("A", "B"), each = 2)),
    rowData = DataFrame(gene = c("ENSMUSG00000059430", "ENSMUSG00000070979", 
                                 "ENSMUSG00000070980", "ENSMUSG00000052374",
                                 "ENSMUSG00000006457"))
)
rownames(se) <- rowData(se)$gene
colnames(se) <- colData(se)$sample
se
## class: SummarizedExperiment 
## dim: 5 4 
## metadata(0):
## assays(1): counts
## rownames(5): ENSMUSG00000059430 ENSMUSG00000070979 ENSMUSG00000070980
##   ENSMUSG00000052374 ENSMUSG00000006457
## rowData names(1): gene
## colnames(4): 1 2 3 4
## colData names(2): sample condition

Next, retrieve information about the Entrez Gene ID and the gene symbol from the appropriate annotation database, and add that information to the rowData of the object.

res <- select(org.Mm.eg.db, keys = rowData(se)$gene, keytype = "ENSEMBL", 
              columns = c("ENTREZID", "SYMBOL"))
## 'select()' returned 1:1 mapping between keys and columns
## In this case, there was a 1:1 mapping between the ENSEMBL ID and the other two columns
rowData(se)$ENTREZID <- res$ENTREZID[match(rowData(se)$gene, res$ENSEMBL)]
rowData(se)$SYMBOL <- res$SYMBOL[match(rowData(se)$gene, res$ENSEMBL)]
se
## class: SummarizedExperiment 
## dim: 5 4 
## metadata(0):
## assays(1): counts
## rownames(5): ENSMUSG00000059430 ENSMUSG00000070979 ENSMUSG00000070980
##   ENSMUSG00000052374 ENSMUSG00000006457
## rowData names(3): gene ENTREZID SYMBOL
## colnames(4): 1 2 3 4
## colData names(2): sample condition
rowData(se)
## DataFrame with 5 rows and 3 columns
##                                  gene    ENTREZID      SYMBOL
##                           <character> <character> <character>
## ENSMUSG00000059430 ENSMUSG00000059430       11468       Actg2
## ENSMUSG00000070979 ENSMUSG00000070979       11470      Actl7a
## ENSMUSG00000070980 ENSMUSG00000070980       11471      Actl7b
## ENSMUSG00000052374 ENSMUSG00000052374       11472       Actn2
## ENSMUSG00000006457 ENSMUSG00000006457       11474       Actn3

Next, do the same with the ALIAS column.

(res <- select(org.Mm.eg.db, keys = rowData(se)$gene, keytype = "ENSEMBL", 
               columns = c("ALIAS")))
## 'select()' returned 1:many mapping between keys and columns
##               ENSEMBL         ALIAS
## 1  ENSMUSG00000059430         ACTA3
## 2  ENSMUSG00000059430         Act-4
## 3  ENSMUSG00000059430          Act4
## 4  ENSMUSG00000059430          SMGA
## 5  ENSMUSG00000059430         Actg2
## 6  ENSMUSG00000070979         Tact2
## 7  ENSMUSG00000070979        Actl7a
## 8  ENSMUSG00000070980       Gm26657
## 9  ENSMUSG00000070980         Tact1
## 10 ENSMUSG00000070980        Actl7b
## 11 ENSMUSG00000052374 1110008F24Rik
## 12 ENSMUSG00000052374         Actn2
## 13 ENSMUSG00000006457         Actn3
## Now, there's a 1:many mapping between the ENSEMBL ID and the alias. 
## One way to make sure that we only get one value per key is to use `mapIds()` instead of `select()`:
(ids <- mapIds(org.Mm.eg.db, keys = rowData(se)$gene, keytype = "ENSEMBL",
               column = "ALIAS", multiVals = "first"))
## 'select()' returned 1:many mapping between keys and columns
## ENSMUSG00000059430 ENSMUSG00000070979 ENSMUSG00000070980 ENSMUSG00000052374 
##            "ACTA3"            "Tact2"          "Gm26657"    "1110008F24Rik" 
## ENSMUSG00000006457 
##            "Actn3"
## We could also get a list, and combine the entries for each key into a string
ids <- mapIds(org.Mm.eg.db, keys = rowData(se)$gene, keytype = "ENSEMBL",
              column = "ALIAS", multiVals = "list")
## 'select()' returned 1:many mapping between keys and columns
(ids <- vapply(ids, function(x) paste(x, collapse = ";"), ""))
##            ENSMUSG00000059430            ENSMUSG00000070979 
## "ACTA3;Act-4;Act4;SMGA;Actg2"                "Tact2;Actl7a" 
##            ENSMUSG00000070980            ENSMUSG00000052374 
##        "Gm26657;Tact1;Actl7b"         "1110008F24Rik;Actn2" 
##            ENSMUSG00000006457 
##                       "Actn3"
## Now, we can add the information to the `rowData`
rowData(se)$ALIAS <- ids[rowData(se)$gene]
rowData(se)
## DataFrame with 5 rows and 4 columns
##                                  gene    ENTREZID      SYMBOL
##                           <character> <character> <character>
## ENSMUSG00000059430 ENSMUSG00000059430       11468       Actg2
## ENSMUSG00000070979 ENSMUSG00000070979       11470      Actl7a
## ENSMUSG00000070980 ENSMUSG00000070980       11471      Actl7b
## ENSMUSG00000052374 ENSMUSG00000052374       11472       Actn2
## ENSMUSG00000006457 ENSMUSG00000006457       11474       Actn3
##                                          ALIAS
##                                    <character>
## ENSMUSG00000059430 ACTA3;Act-4;Act4;SMGA;Actg2
## ENSMUSG00000070979                Tact2;Actl7a
## ENSMUSG00000070980        Gm26657;Tact1;Actl7b
## ENSMUSG00000052374         1110008F24Rik;Actn2
## ENSMUSG00000006457                       Actn3

TxDb objects

  • A TxDb object is, again, basically a SQLite database, containing information about the transcriptome features for a given annotation
  • As opposed to the OrgDb, there are multiple TxDb objects for a given organism, corresponding to different reference annotations (different sources and/or different versions)
  • Let’s list all the TxDb objects for human in the AnnotationHub:

    subset(ah, ah$rdataclass == "TxDb" & ah$species == "Homo sapiens")
    ## AnnotationHub with 5 records
    ## # snapshotDate(): 2018-10-24 
    ## # $dataprovider: UCSC
    ## # $species: Homo sapiens
    ## # $rdataclass: TxDb
    ## # additional mcols(): taxonomyid, genome, description,
    ## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
    ## #   rdatapath, sourceurl, sourcetype 
    ## # retrieve records with, e.g., 'object[["AH52256"]]' 
    ## 
    ##             title                                             
    ##   AH52256 | TxDb.Hsapiens.BioMart.igis.sqlite                 
    ##   AH52257 | TxDb.Hsapiens.UCSC.hg18.knownGene.sqlite          
    ##   AH52258 | TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite          
    ##   AH52259 | TxDb.Hsapiens.UCSC.hg19.lincRNAsTranscripts.sqlite
    ##   AH52260 | TxDb.Hsapiens.UCSC.hg38.knownGene.sqlite
  • We pull down the version corresponding to hg38 (the current version of the human reference genome), and the known gene database from the UCSC:

    txdb <- ah[["AH52260"]]
    ## downloading 0 resources
    ## loading from cache 
    ##     '/Users/charlotte//.AnnotationHub/58998'
  • The structure of the TxDb object is very similar to the OrgDb one, and we interact with it in the same way, even if the content is different.
  • For example, the columns() and keytype() functions work in the same way, and we extract information using select() or mapIds()

    columns(txdb)
    ##  [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSSTART"  
    ##  [6] "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"     "EXONNAME"  
    ## [11] "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"   
    ## [16] "TXEND"      "TXID"       "TXNAME"     "TXSTART"    "TXSTRAND"  
    ## [21] "TXTYPE"
    keytypes(txdb)
    ## [1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME" "GENEID"   "TXID"     "TXNAME"
    head(keys(txdb, keytype = "GENEID"))
    ## [1] "1"         "10"        "100"       "1000"      "10000"     "100008587"
    select(txdb, keys = c("1", "100"), 
       keytype = "GENEID", columns = c("TXNAME"))
    ## 'select()' returned 1:many mapping between keys and columns
    ##    GENEID     TXNAME
    ## 1       1 uc061drj.1
    ## 2       1 uc002qsd.5
    ## 3       1 uc061drk.1
    ## 4       1 uc061drl.1
    ## 5       1 uc061drm.1
    ## 6       1 uc061drs.1
    ## 7       1 uc061drt.1
    ## 8       1 uc061drv.1
    ## 9     100 uc061xfj.1
    ## 10    100 uc002xmj.4
    ## 11    100 uc061xfl.1
  • Just as for the OrgDb object above, we can retrieve many TxDb objects from the corresponding Bioconductor packages.
  • For the object above, the package is called TxDb.Hsapiens.UCSC.hg38.knownGene

    library(TxDb.Hsapiens.UCSC.hg38.knownGene)
    class(TxDb.Hsapiens.UCSC.hg38.knownGene)
    ## [1] "TxDb"
    ## attr(,"package")
    ## [1] "GenomicFeatures"
    columns(TxDb.Hsapiens.UCSC.hg38.knownGene)
    ##  [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSSTART"  
    ##  [6] "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"     "EXONNAME"  
    ## [11] "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"   
    ## [16] "TXEND"      "TXID"       "TXNAME"     "TXSTART"    "TXSTRAND"  
    ## [21] "TXTYPE"
    keytypes(TxDb.Hsapiens.UCSC.hg38.knownGene)
    ## [1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME" "GENEID"   "TXID"     "TXNAME"
    head(keys(TxDb.Hsapiens.UCSC.hg38.knownGene, keytype = "GENEID"))
    ## [1] "1"         "10"        "100"       "1000"      "10000"     "100008587"
    select(TxDb.Hsapiens.UCSC.hg38.knownGene, keys = c("1", "100"), 
       keytype = "GENEID", columns = c("TXNAME"))
    ## 'select()' returned 1:many mapping between keys and columns
    ##    GENEID     TXNAME
    ## 1       1 uc061drj.1
    ## 2       1 uc002qsd.5
    ## 3       1 uc061drk.1
    ## 4       1 uc061drl.1
    ## 5       1 uc061drm.1
    ## 6       1 uc061drs.1
    ## 7       1 uc061drt.1
    ## 8       1 uc061drv.1
    ## 9     100 uc061xfj.1
    ## 10    100 uc002xmj.4
    ## 11    100 uc061xfl.1
  • The GenomicFeatures package contains many useful functions for generating and interacting with TxDb objects. For example, we can
    • extract all exons from the database into a GRanges object

      library(GenomicFeatures)
      exons(txdb)
      ## GRanges object with 581036 ranges and 1 metadata column:
      ##                    seqnames        ranges strand |   exon_id
      ##                       <Rle>     <IRanges>  <Rle> | <integer>
      ##        [1]             chr1   29554-30039      + |         1
      ##        [2]             chr1   30267-30667      + |         2
      ##        [3]             chr1   30366-30503      + |         3
      ##        [4]             chr1   30564-30667      + |         4
      ##        [5]             chr1   30976-31097      + |         5
      ##        ...              ...           ...    ... .       ...
      ##   [581032] chrUn_KI270750v1 148668-148843      + |    581032
      ##   [581033] chrUn_KI270752v1       144-268      + |    581033
      ##   [581034] chrUn_KI270752v1   21813-21944      + |    581034
      ##   [581035] chrUn_KI270752v1     3497-3623      - |    581035
      ##   [581036] chrUn_KI270752v1    9943-10067      - |    581036
      ##   -------
      ##   seqinfo: 455 sequences (1 circular) from hg38 genome
    • extract all transcripts into a GRanges object

      transcripts(txdb)
      ## GRanges object with 197782 ranges and 2 metadata columns:
      ##                    seqnames        ranges strand |     tx_id     tx_name
      ##                       <Rle>     <IRanges>  <Rle> | <integer> <character>
      ##        [1]             chr1   29554-31097      + |         1  uc057aty.1
      ##        [2]             chr1   30267-31109      + |         2  uc057atz.1
      ##        [3]             chr1   30366-30503      + |         3  uc031tlb.1
      ##        [4]             chr1   69091-70008      + |         4  uc001aal.1
      ##        [5]             chr1 160446-161525      + |         5  uc057aum.1
      ##        ...              ...           ...    ... .       ...         ...
      ##   [197778] chrUn_KI270750v1 148668-148843      + |    197778  uc064xrp.1
      ##   [197779] chrUn_KI270752v1       144-268      + |    197779  uc064xrq.1
      ##   [197780] chrUn_KI270752v1   21813-21944      + |    197780  uc064xrt.1
      ##   [197781] chrUn_KI270752v1     3497-3623      - |    197781  uc064xrr.1
      ##   [197782] chrUn_KI270752v1    9943-10067      - |    197782  uc064xrs.1
      ##   -------
      ##   seqinfo: 455 sequences (1 circular) from hg38 genome
    • extract all exons and group them by transcript (represented as a GRangesList object)

      exonsBy(txdb, by = "tx")
      ## GRangesList object of length 197782:
      ## $1 
      ## GRanges object with 3 ranges and 3 metadata columns:
      ##       seqnames      ranges strand |   exon_id   exon_name exon_rank
      ##          <Rle>   <IRanges>  <Rle> | <integer> <character> <integer>
      ##   [1]     chr1 29554-30039      + |         1        <NA>         1
      ##   [2]     chr1 30564-30667      + |         4        <NA>         2
      ##   [3]     chr1 30976-31097      + |         5        <NA>         3
      ## 
      ## $2 
      ## GRanges object with 2 ranges and 3 metadata columns:
      ##       seqnames      ranges strand | exon_id exon_name exon_rank
      ##   [1]     chr1 30267-30667      + |       2      <NA>         1
      ##   [2]     chr1 30976-31109      + |       6      <NA>         2
      ## 
      ## $3 
      ## GRanges object with 1 range and 3 metadata columns:
      ##       seqnames      ranges strand | exon_id exon_name exon_rank
      ##   [1]     chr1 30366-30503      + |       3      <NA>         1
      ## 
      ## ...
      ## <197779 more elements>
      ## -------
      ## seqinfo: 455 sequences (1 circular) from hg38 genome
    • similarly, extract all transcripts (including both exons and introns) and group them by gene

      transcriptsBy(txdb, by = "gene")
      ## GRangesList object of length 25221:
      ## $1 
      ## GRanges object with 8 ranges and 2 metadata columns:
      ##       seqnames            ranges strand |     tx_id     tx_name
      ##          <Rle>         <IRanges>  <Rle> | <integer> <character>
      ##   [1]    chr19 58345178-58347634      - |    166436  uc061drj.1
      ##   [2]    chr19 58346850-58353499      - |    166437  uc002qsd.5
      ##   [3]    chr19 58346854-58356225      - |    166438  uc061drk.1
      ##   [4]    chr19 58346858-58353491      - |    166439  uc061drl.1
      ##   [5]    chr19 58346860-58347657      - |    166440  uc061drm.1
      ##   [6]    chr19 58348466-58362751      - |    166441  uc061drs.1
      ##   [7]    chr19 58350594-58353129      - |    166442  uc061drt.1
      ##   [8]    chr19 58353021-58356083      - |    166443  uc061drv.1
      ## 
      ## $10 
      ## GRanges object with 2 ranges and 2 metadata columns:
      ##       seqnames            ranges strand | tx_id    tx_name
      ##   [1]     chr8 18391245-18401218      + | 72647 uc003wyw.2
      ##   [2]     chr8 18391287-18400993      + | 72648 uc064kqw.1
      ## 
      ## $100 
      ## GRanges object with 3 ranges and 2 metadata columns:
      ##       seqnames            ranges strand |  tx_id    tx_name
      ##   [1]    chr20 44619522-44626491      - | 169786 uc061xfj.1
      ##   [2]    chr20 44619522-44651742      - | 169787 uc002xmj.4
      ##   [3]    chr20 44619810-44651691      - | 169789 uc061xfl.1
      ## 
      ## ...
      ## <25218 more elements>
      ## -------
      ## seqinfo: 455 sequences (1 circular) from hg38 genome
    • retrieve all promoters (a given number of bases upstream and downstream from each annotated transcription start site)

      promoters(txdb, upstream = 2000, downstream = 200)
      ## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 174 out-of-bound ranges located on sequences
      ##   chr1_GL383518v1_alt, chr1_KI270762v1_alt, chr2_GL383522v1_alt,
      ##   chr2_KI270774v1_alt, chr3_KI270777v1_alt, chr3_KI270781v1_alt,
      ##   chr4_KI270788v1_alt, chr5_GL339449v2_alt, chr5_KI270795v1_alt,
      ##   chr5_KI270898v1_alt, chr6_GL000254v2_alt, chr6_KI270797v1_alt,
      ##   chr6_KI270798v1_alt, chr6_KI270801v1_alt, chr7_GL383534v2_alt,
      ##   chr7_KI270803v1_alt, chr7_KI270806v1_alt, chr7_KI270809v1_alt,
      ##   chr8_KI270815v1_alt, chr8_KI270821v1_alt, chr9_GL383540v1_alt,
      ##   chr9_GL383541v1_alt, chr11_KI270831v1_alt, chr11_KI270902v1_alt,
      ##   chr12_GL383551v1_alt, chr12_GL383553v2_alt, chr12_KI270834v1_alt,
      ##   chr13_KI270838v1_alt, chr14_KI270847v1_alt, chr15_KI270850v1_alt,
      ##   chr16_GL383557v1_alt, chr16_KI270854v1_alt, chr17_JH159146v1_alt,
      ##   chr17_JH159147v1_alt, chr17_KI270857v1_alt, chr17_KI270860v1_alt,
      ##   chr17_KI270910v1_alt, chr19_GL383575v2_alt, chr19_KI270866v1_alt,
      ##   chr19_KI270868v1_alt, chr19_KI270884v1_alt, chr19_KI270885v1_alt,
      ##   chr19_KI270889v1_alt, chr19_KI270890v1_alt, chr19_KI270891v1_alt,
      ##   chr19_KI270915v1_alt, chr19_KI270916v1_alt, chr19_KI270919v1_alt,
      ##   chr19_KI270922v1_alt, chr19_KI270929v1_alt, chr19_KI270930v1_alt,
      ##   chr19_KI270931v1_alt, chr19_KI270932v1_alt, chr19_KI270933v1_alt,
      ##   chr20_KI270869v1_alt, chr21_GL383581v2_alt, chr21_KI270872v1_alt,
      ##   chr22_KB663609v1_alt, chr22_KI270876v1_alt, chr22_KI270879v1_alt,
      ##   chrX_KI270880v1_alt, chrUn_KI270750v1, and chrUn_KI270752v1. Note that
      ##   ranges located on a sequence whose length is unknown (NA) or on a
      ##   circular sequence are not considered out-of-bound (use seqlengths() and
      ##   isCircular() to get the lengths and circularity flags of the underlying
      ##   sequences). You can use trim() to trim these ranges. See
      ##   ?`trim,GenomicRanges-method` for more information.
      ## GRanges object with 197782 ranges and 2 metadata columns:
      ##                      seqnames        ranges strand |     tx_id     tx_name
      ##                         <Rle>     <IRanges>  <Rle> | <integer> <character>
      ##   uc057aty.1             chr1   27554-29753      + |         1  uc057aty.1
      ##   uc057atz.1             chr1   28267-30466      + |         2  uc057atz.1
      ##   uc031tlb.1             chr1   28366-30565      + |         3  uc031tlb.1
      ##   uc001aal.1             chr1   67091-69290      + |         4  uc001aal.1
      ##   uc057aum.1             chr1 158446-160645      + |         5  uc057aum.1
      ##          ...              ...           ...    ... .       ...         ...
      ##   uc064xrp.1 chrUn_KI270750v1 146668-148867      + |    197778  uc064xrp.1
      ##   uc064xrq.1 chrUn_KI270752v1     -1856-343      + |    197779  uc064xrq.1
      ##   uc064xrt.1 chrUn_KI270752v1   19813-22012      + |    197780  uc064xrt.1
      ##   uc064xrr.1 chrUn_KI270752v1     3424-5623      - |    197781  uc064xrr.1
      ##   uc064xrs.1 chrUn_KI270752v1    9868-12067      - |    197782  uc064xrs.1
      ##   -------
      ##   seqinfo: 455 sequences (1 circular) from hg38 genome

      Note the warning, since some of the regions extend beyond the borders of the chromosomes. This can be fixed using the trim() function, which truncates the ranges to the allowed ranges

      trim(promoters(txdb, upstream = 2000, downstream = 200))
      ## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 174 out-of-bound ranges located on sequences
      ##   chr1_GL383518v1_alt, chr1_KI270762v1_alt, chr2_GL383522v1_alt,
      ##   chr2_KI270774v1_alt, chr3_KI270777v1_alt, chr3_KI270781v1_alt,
      ##   chr4_KI270788v1_alt, chr5_GL339449v2_alt, chr5_KI270795v1_alt,
      ##   chr5_KI270898v1_alt, chr6_GL000254v2_alt, chr6_KI270797v1_alt,
      ##   chr6_KI270798v1_alt, chr6_KI270801v1_alt, chr7_GL383534v2_alt,
      ##   chr7_KI270803v1_alt, chr7_KI270806v1_alt, chr7_KI270809v1_alt,
      ##   chr8_KI270815v1_alt, chr8_KI270821v1_alt, chr9_GL383540v1_alt,
      ##   chr9_GL383541v1_alt, chr11_KI270831v1_alt, chr11_KI270902v1_alt,
      ##   chr12_GL383551v1_alt, chr12_GL383553v2_alt, chr12_KI270834v1_alt,
      ##   chr13_KI270838v1_alt, chr14_KI270847v1_alt, chr15_KI270850v1_alt,
      ##   chr16_GL383557v1_alt, chr16_KI270854v1_alt, chr17_JH159146v1_alt,
      ##   chr17_JH159147v1_alt, chr17_KI270857v1_alt, chr17_KI270860v1_alt,
      ##   chr17_KI270910v1_alt, chr19_GL383575v2_alt, chr19_KI270866v1_alt,
      ##   chr19_KI270868v1_alt, chr19_KI270884v1_alt, chr19_KI270885v1_alt,
      ##   chr19_KI270889v1_alt, chr19_KI270890v1_alt, chr19_KI270891v1_alt,
      ##   chr19_KI270915v1_alt, chr19_KI270916v1_alt, chr19_KI270919v1_alt,
      ##   chr19_KI270922v1_alt, chr19_KI270929v1_alt, chr19_KI270930v1_alt,
      ##   chr19_KI270931v1_alt, chr19_KI270932v1_alt, chr19_KI270933v1_alt,
      ##   chr20_KI270869v1_alt, chr21_GL383581v2_alt, chr21_KI270872v1_alt,
      ##   chr22_KB663609v1_alt, chr22_KI270876v1_alt, chr22_KI270879v1_alt,
      ##   chrX_KI270880v1_alt, chrUn_KI270750v1, and chrUn_KI270752v1. Note that
      ##   ranges located on a sequence whose length is unknown (NA) or on a
      ##   circular sequence are not considered out-of-bound (use seqlengths() and
      ##   isCircular() to get the lengths and circularity flags of the underlying
      ##   sequences). You can use trim() to trim these ranges. See
      ##   ?`trim,GenomicRanges-method` for more information.
      ## GRanges object with 197782 ranges and 2 metadata columns:
      ##                      seqnames        ranges strand |     tx_id     tx_name
      ##                         <Rle>     <IRanges>  <Rle> | <integer> <character>
      ##   uc057aty.1             chr1   27554-29753      + |         1  uc057aty.1
      ##   uc057atz.1             chr1   28267-30466      + |         2  uc057atz.1
      ##   uc031tlb.1             chr1   28366-30565      + |         3  uc031tlb.1
      ##   uc001aal.1             chr1   67091-69290      + |         4  uc001aal.1
      ##   uc057aum.1             chr1 158446-160645      + |         5  uc057aum.1
      ##          ...              ...           ...    ... .       ...         ...
      ##   uc064xrp.1 chrUn_KI270750v1 146668-148850      + |    197778  uc064xrp.1
      ##   uc064xrq.1 chrUn_KI270752v1         1-343      + |    197779  uc064xrq.1
      ##   uc064xrt.1 chrUn_KI270752v1   19813-22012      + |    197780  uc064xrt.1
      ##   uc064xrr.1 chrUn_KI270752v1     3424-5623      - |    197781  uc064xrr.1
      ##   uc064xrs.1 chrUn_KI270752v1    9868-12067      - |    197782  uc064xrs.1
      ##   -------
      ##   seqinfo: 455 sequences (1 circular) from hg38 genome
  • The GenomicFeatures package can be used to generate TxDb objects from, e.g., gtf files or GRanges objects, using the makeTxDbFromGFF() and makeTxDbFromGRanges() functions

EnsDb objects

  • EnsDb objects are similar to TxDb objects, but are specifically focused on annotations from Ensembl
  • The same columns() and keytypes() functions as we used for the OrgDb and TxDb objects can be used to interrogate EnsDb objects

    subset(ah, ah$rdataclass == "EnsDb" & ah$species == "Homo sapiens")
    ## AnnotationHub with 8 records
    ## # snapshotDate(): 2018-10-24 
    ## # $dataprovider: Ensembl
    ## # $species: Homo sapiens
    ## # $rdataclass: EnsDb
    ## # additional mcols(): taxonomyid, genome, description,
    ## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
    ## #   rdatapath, sourceurl, sourcetype 
    ## # retrieve records with, e.g., 'object[["AH53211"]]' 
    ## 
    ##             title                            
    ##   AH53211 | Ensembl 87 EnsDb for Homo Sapiens
    ##   AH53715 | Ensembl 88 EnsDb for Homo Sapiens
    ##   AH56681 | Ensembl 89 EnsDb for Homo Sapiens
    ##   AH57757 | Ensembl 90 EnsDb for Homo Sapiens
    ##   AH60773 | Ensembl 91 EnsDb for Homo Sapiens
    ##   AH60977 | Ensembl 92 EnsDb for Homo Sapiens
    ##   AH64446 | Ensembl 93 EnsDb for Homo Sapiens
    ##   AH64923 | Ensembl 94 EnsDb for Homo sapiens
    ensdb <- ah[["AH64923"]]
    ## downloading 0 resources
    ## loading from cache 
    ##     '/Users/charlotte//.AnnotationHub/71669'
    ## require("ensembldb")
    columns(ensdb)
    ##  [1] "DESCRIPTION"         "ENTREZID"            "EXONID"             
    ##  [4] "EXONIDX"             "EXONSEQEND"          "EXONSEQSTART"       
    ##  [7] "GENEBIOTYPE"         "GENEID"              "GENEIDVERSION"      
    ## [10] "GENENAME"            "GENESEQEND"          "GENESEQSTART"       
    ## [13] "INTERPROACCESSION"   "ISCIRCULAR"          "PROTDOMEND"         
    ## [16] "PROTDOMSTART"        "PROTEINDOMAINID"     "PROTEINDOMAINSOURCE"
    ## [19] "PROTEINID"           "PROTEINSEQUENCE"     "SEQCOORDSYSTEM"     
    ## [22] "SEQLENGTH"           "SEQNAME"             "SEQSTRAND"          
    ## [25] "SYMBOL"              "TXBIOTYPE"           "TXCDSSEQEND"        
    ## [28] "TXCDSSEQSTART"       "TXID"                "TXIDVERSION"        
    ## [31] "TXNAME"              "TXSEQEND"            "TXSEQSTART"         
    ## [34] "TXSUPPORTLEVEL"      "UNIPROTDB"           "UNIPROTID"          
    ## [37] "UNIPROTMAPPINGTYPE"
    keytypes(ensdb)
    ##  [1] "ENTREZID"            "EXONID"              "GENEBIOTYPE"        
    ##  [4] "GENEID"              "GENENAME"            "PROTDOMID"          
    ##  [7] "PROTEINDOMAINID"     "PROTEINDOMAINSOURCE" "PROTEINID"          
    ## [10] "SEQNAME"             "SEQSTRAND"           "SYMBOL"             
    ## [13] "TXBIOTYPE"           "TXID"                "TXNAME"             
    ## [16] "UNIPROTID"
    head(keys(ensdb, keytype = "GENEID"))
    ## [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
    ## [5] "ENSG00000000460" "ENSG00000000938"
    select(ensdb, keys = c("ENSG00000000003", "ENSG00000000005"),
       columns = c("SYMBOL", "TXID", "DESCRIPTION"), keytype = "GENEID")
    ##            GENEID SYMBOL            TXID
    ## 1 ENSG00000000003 TSPAN6 ENST00000373020
    ## 2 ENSG00000000003 TSPAN6 ENST00000496771
    ## 3 ENSG00000000003 TSPAN6 ENST00000494424
    ## 4 ENSG00000000003 TSPAN6 ENST00000612152
    ## 5 ENSG00000000003 TSPAN6 ENST00000614008
    ## 6 ENSG00000000005   TNMD ENST00000373031
    ## 7 ENSG00000000005   TNMD ENST00000485971
    ##                                         DESCRIPTION
    ## 1 tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
    ## 2 tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
    ## 3 tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
    ## 4 tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
    ## 5 tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
    ## 6   tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]
    ## 7   tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]
  • Just as for the other types of objects, some of the EnsDb objects can also be retrieved from dedicated Bioconductor packages

    library(ensembldb)
    library(EnsDb.Hsapiens.v86)

Other types of AnnotationDb objects

Both the OrgDb and TxDb classes are special cases of the AnnotationDb class. There are also other types of AnnotationDb objects. The most important ones may be the ChipDb classes, which hold information about te probes on specific microarray chips. These objects can be interacted with using the same syntax as we have seen for the OrgDb and TxDb objects.

BSgenome objects

  • BSgenome objects are different from the objects we have looked at above
  • They do not contain databases with lookup tables
  • Instead, their main purpose is to store genome sequence information and make sequence extraction efficient
  • Bsgenome objects are available in dedicated Bioconductor packages, and can also be generated manually starting from a genome sequence
  • The BSgenome package contains functions for interacting with BSgenome objects

    library(BSgenome.Hsapiens.UCSC.hg38)
  • The BSgenome object is named in the same way as the package, but often also has a ‘short-hand’ name (for the human genome packages, it is called Hsapiens)

    BSgenome.Hsapiens.UCSC.hg38
    ## Human genome:
    ## # organism: Homo sapiens (Human)
    ## # provider: UCSC
    ## # provider version: hg38
    ## # release date: Dec. 2013
    ## # release name: Genome Reference Consortium GRCh38
    ## # 455 sequences:
    ## #   chr1                    chr2                    chr3                   
    ## #   chr4                    chr5                    chr6                   
    ## #   chr7                    chr8                    chr9                   
    ## #   chr10                   chr11                   chr12                  
    ## #   chr13                   chr14                   chr15                  
    ## #   ...                     ...                     ...                    
    ## #   chrUn_KI270744v1        chrUn_KI270745v1        chrUn_KI270746v1       
    ## #   chrUn_KI270747v1        chrUn_KI270748v1        chrUn_KI270749v1       
    ## #   chrUn_KI270750v1        chrUn_KI270751v1        chrUn_KI270752v1       
    ## #   chrUn_KI270753v1        chrUn_KI270754v1        chrUn_KI270755v1       
    ## #   chrUn_KI270756v1        chrUn_KI270757v1                               
    ## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
    ## # to access a given sequence)
    Hsapiens
    ## Human genome:
    ## # organism: Homo sapiens (Human)
    ## # provider: UCSC
    ## # provider version: hg38
    ## # release date: Dec. 2013
    ## # release name: Genome Reference Consortium GRCh38
    ## # 455 sequences:
    ## #   chr1                    chr2                    chr3                   
    ## #   chr4                    chr5                    chr6                   
    ## #   chr7                    chr8                    chr9                   
    ## #   chr10                   chr11                   chr12                  
    ## #   chr13                   chr14                   chr15                  
    ## #   ...                     ...                     ...                    
    ## #   chrUn_KI270744v1        chrUn_KI270745v1        chrUn_KI270746v1       
    ## #   chrUn_KI270747v1        chrUn_KI270748v1        chrUn_KI270749v1       
    ## #   chrUn_KI270750v1        chrUn_KI270751v1        chrUn_KI270752v1       
    ## #   chrUn_KI270753v1        chrUn_KI270754v1        chrUn_KI270755v1       
    ## #   chrUn_KI270756v1        chrUn_KI270757v1                               
    ## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
    ## # to access a given sequence)
  • When combining annotation objects, it is important to make sure that they correspond to the same genome and annotation version
  • For example, if the chromosomes are called ‘chr1’ and ‘chr2’ in the genome sequence, but ‘1’ and ‘2’ in the TxDb, R will not be able to match them up
  • A given gene may not have the same genomic coordinates in different versions of the human genome - neither the genome sequence nor the gene annotations are static between versions
  • Let’s combine the genomic locations of the promoters extracted above from the TxDb object corresponding to the TxDb.Hsapiens.UCSC.hg38.knownGene.sqlite AnnotationHub record (stored in the variable txdb above), with the genome sequence provided in the BSgenome.Hsapiens.UCSC.hg38 package, and extract the sequence of all the promoters
    • First check that the genome and the chromosome names are the same in the two annotation objects

      all(seqlevels(Hsapiens) %in% seqlevels(txdb))
      ## [1] TRUE
      all(seqlevels(txdb) %in% seqlevels(Hsapiens))
      ## [1] TRUE
      unique(genome(Hsapiens))
      ## [1] "hg38"
      unique(genome(txdb))
      ## [1] "hg38"
    • Next, extract the promoter locations and place them in a GRanges object

      prom <- promoters(txdb)
      ## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 174 out-of-bound ranges located on sequences
      ##   chr1_GL383518v1_alt, chr1_KI270762v1_alt, chr2_GL383522v1_alt,
      ##   chr2_KI270774v1_alt, chr3_KI270777v1_alt, chr3_KI270781v1_alt,
      ##   chr4_KI270788v1_alt, chr5_GL339449v2_alt, chr5_KI270795v1_alt,
      ##   chr5_KI270898v1_alt, chr6_GL000254v2_alt, chr6_KI270797v1_alt,
      ##   chr6_KI270798v1_alt, chr6_KI270801v1_alt, chr7_GL383534v2_alt,
      ##   chr7_KI270803v1_alt, chr7_KI270806v1_alt, chr7_KI270809v1_alt,
      ##   chr8_KI270815v1_alt, chr8_KI270821v1_alt, chr9_GL383540v1_alt,
      ##   chr9_GL383541v1_alt, chr11_KI270831v1_alt, chr11_KI270902v1_alt,
      ##   chr12_GL383551v1_alt, chr12_GL383553v2_alt, chr12_KI270834v1_alt,
      ##   chr13_KI270838v1_alt, chr14_KI270847v1_alt, chr15_KI270850v1_alt,
      ##   chr16_GL383557v1_alt, chr16_KI270854v1_alt, chr17_JH159146v1_alt,
      ##   chr17_JH159147v1_alt, chr17_KI270857v1_alt, chr17_KI270860v1_alt,
      ##   chr17_KI270910v1_alt, chr19_GL383575v2_alt, chr19_KI270866v1_alt,
      ##   chr19_KI270868v1_alt, chr19_KI270884v1_alt, chr19_KI270885v1_alt,
      ##   chr19_KI270889v1_alt, chr19_KI270890v1_alt, chr19_KI270891v1_alt,
      ##   chr19_KI270915v1_alt, chr19_KI270916v1_alt, chr19_KI270919v1_alt,
      ##   chr19_KI270922v1_alt, chr19_KI270929v1_alt, chr19_KI270930v1_alt,
      ##   chr19_KI270931v1_alt, chr19_KI270932v1_alt, chr19_KI270933v1_alt,
      ##   chr20_KI270869v1_alt, chr21_GL383581v2_alt, chr21_KI270872v1_alt,
      ##   chr22_KB663609v1_alt, chr22_KI270876v1_alt, chr22_KI270879v1_alt,
      ##   chrX_KI270880v1_alt, chrUn_KI270750v1, and chrUn_KI270752v1. Note that
      ##   ranges located on a sequence whose length is unknown (NA) or on a
      ##   circular sequence are not considered out-of-bound (use seqlengths() and
      ##   isCircular() to get the lengths and circularity flags of the underlying
      ##   sequences). You can use trim() to trim these ranges. See
      ##   ?`trim,GenomicRanges-method` for more information.
      (prom <- trim(prom))
      ## GRanges object with 197782 ranges and 2 metadata columns:
      ##                      seqnames        ranges strand |     tx_id     tx_name
      ##                         <Rle>     <IRanges>  <Rle> | <integer> <character>
      ##   uc057aty.1             chr1   27554-29753      + |         1  uc057aty.1
      ##   uc057atz.1             chr1   28267-30466      + |         2  uc057atz.1
      ##   uc031tlb.1             chr1   28366-30565      + |         3  uc031tlb.1
      ##   uc001aal.1             chr1   67091-69290      + |         4  uc001aal.1
      ##   uc057aum.1             chr1 158446-160645      + |         5  uc057aum.1
      ##          ...              ...           ...    ... .       ...         ...
      ##   uc064xrp.1 chrUn_KI270750v1 146668-148850      + |    197778  uc064xrp.1
      ##   uc064xrq.1 chrUn_KI270752v1         1-343      + |    197779  uc064xrq.1
      ##   uc064xrt.1 chrUn_KI270752v1   19813-22012      + |    197780  uc064xrt.1
      ##   uc064xrr.1 chrUn_KI270752v1     3424-5623      - |    197781  uc064xrr.1
      ##   uc064xrs.1 chrUn_KI270752v1    9868-12067      - |    197782  uc064xrs.1
      ##   -------
      ##   seqinfo: 455 sequences (1 circular) from hg38 genome
    • Finally, use the getSeq function to extract the sequence of the first five promoter regions from the genome sequence

      (seqs <- getSeq(Hsapiens, prom[1:5]))
      ##   A DNAStringSet instance of length 5
      ##     width seq                                               names               
      ## [1]  2200 TTTTCTTTCCTCTCACTCAGCGT...CGCCCCGGGGCAGGACCCCCAGC uc057aty.1
      ## [2]  2200 AAATAAGAAAACCCAGGCACAAA...CATACTTATGCTAAAAAACATTA uc057atz.1
      ## [3]  2200 TGGACAGACAGTAACTAGTTGAG...AACACTCTCTCCCTCTCCCAGTT uc031tlb.1
      ## [4]  2200 TTAAGGTCTATTCTAAATTGCAC...TCACTCATTGATCTGTCTCTGTC uc001aal.1
      ## [5]  2200 CATAAAATAACTATCTAATCCAA...ACCATTCAGAAGATGCTAAGACC uc057aum.1
    • We can also calculate the GC content of these promoters

      gcProm <- letterFrequency(seqs, "GC", as.prob = TRUE)
      head(gcProm)
      ##            G|C
      ## [1,] 0.5327273
      ## [2,] 0.5631818
      ## [3,] 0.5636364
      ## [4,] 0.3386364
      ## [5,] 0.4445455

Exercise

Extract the sequence of the full gene locus (exons and introns) of the human PTEN gene

(txby <- transcriptsBy(txdb, by = "gene"))
## GRangesList object of length 25221:
## $1 
## GRanges object with 8 ranges and 2 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]    chr19 58345178-58347634      - |    166436  uc061drj.1
##   [2]    chr19 58346850-58353499      - |    166437  uc002qsd.5
##   [3]    chr19 58346854-58356225      - |    166438  uc061drk.1
##   [4]    chr19 58346858-58353491      - |    166439  uc061drl.1
##   [5]    chr19 58346860-58347657      - |    166440  uc061drm.1
##   [6]    chr19 58348466-58362751      - |    166441  uc061drs.1
##   [7]    chr19 58350594-58353129      - |    166442  uc061drt.1
##   [8]    chr19 58353021-58356083      - |    166443  uc061drv.1
## 
## $10 
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames            ranges strand | tx_id    tx_name
##   [1]     chr8 18391245-18401218      + | 72647 uc003wyw.2
##   [2]     chr8 18391287-18400993      + | 72648 uc064kqw.1
## 
## $100 
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames            ranges strand |  tx_id    tx_name
##   [1]    chr20 44619522-44626491      - | 169786 uc061xfj.1
##   [2]    chr20 44619522-44651742      - | 169787 uc002xmj.4
##   [3]    chr20 44619810-44651691      - | 169789 uc061xfl.1
## 
## ...
## <25218 more elements>
## -------
## seqinfo: 455 sequences (1 circular) from hg38 genome
select(org.Hs.eg.db, keys = "PTEN", keytype = "SYMBOL", columns = "ENTREZID")
## 'select()' returned 1:1 mapping between keys and columns
##   SYMBOL ENTREZID
## 1   PTEN     5728
(pten <- txby[["5728"]])
## GRanges object with 6 ranges and 2 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]    chr10 87863113-87971930      + |     87010  uc001kfb.4
##   [2]    chr10 87863438-87942691      + |     87011  uc057ush.1
##   [3]    chr10 87864449-87867049      + |     87012  uc057usi.1
##   [4]    chr10 87864468-87894326      + |     87013  uc057usj.1
##   [5]    chr10 87925523-87933487      + |     87016  uc057usm.1
##   [6]    chr10 87952199-87961309      + |     87017  uc057usn.1
##   -------
##   seqinfo: 455 sequences (1 circular) from hg38 genome
(pten <- reduce(pten))
## GRanges object with 1 range and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]    chr10 87863113-87971930      +
##   -------
##   seqinfo: 455 sequences (1 circular) from hg38 genome
res <- getSeq(Hsapiens, pten)
res
##   A DNAStringSet instance of length 1
##      width seq
## [1] 108818 GGTAACCTCAGACTCGAGTCAGTGACACTGCTC...TTAAAAAATTAAAAAGGTTAATTATTTTGACTA

Other types of ‘annotations’

  • AnnotationHub also contains processed data from certain big consortia such as the Roadmap Epigenomics project
  • This type of files can often be useful for interpreting your own data
  • For more details on how this type of data can be explored and used, see e.g. this vignette for the AnnotationHub package
query(ah, "EpigenomeRoadmap")
## AnnotationHub with 18248 records
## # snapshotDate(): 2018-10-24 
## # $dataprovider: BroadInstitute
## # $species: Homo sapiens
## # $rdataclass: BigWigFile, GRanges, data.frame
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH28856"]]' 
## 
##             title                                 
##   AH28856 | E001-H3K4me1.broadPeak.gz             
##   AH28857 | E001-H3K4me3.broadPeak.gz             
##   AH28858 | E001-H3K9ac.broadPeak.gz              
##   AH28859 | E001-H3K9me3.broadPeak.gz             
##   AH28860 | E001-H3K27me3.broadPeak.gz            
##   ...       ...                                   
##   AH49540 | E058_mCRF_FractionalMethylation.bigwig
##   AH49541 | E059_mCRF_FractionalMethylation.bigwig
##   AH49542 | E061_mCRF_FractionalMethylation.bigwig
##   AH49543 | E081_mCRF_FractionalMethylation.bigwig
##   AH49544 | E082_mCRF_FractionalMethylation.bigwig

Further reading