1 Introduction to Genomic Ranges

1.1 Features and question regarding genomic data

  • Genomic data, such as mapped reads, gene annotations, single nucleotide polymorphisms (SNPs) or conservation scores, are ranged (=interval based) data represented in a genomic coordinate system
  • Typical questions to such data are:
    • What is the nearest TSS to a given CpG island or enhancer element?
    • What is the overlap of peaks identified from two ChIP-Seq runs?
    • What number of reads support a particular SNP?
  • The GenomicRanges packages of Bioconductor defines classes and functions to handle genomic ranged data

1.2 Creating a GRanges object

  • Essential ingredients: chromosome name, start and end coordinates, strand
  • Additional data (metadata): range-specific data such as feature names, scores or read counts
  • Contextual data: genome information

  • Creation from a data frame using makeGRangesFromDataFrame():

library(GenomicRanges)
df <- data.frame(
    seqnames = c("chr1", "chr1", "chr1", "chr1", "chr1"),
    start = c(1, 13, 5, 17, 7),
    end = c(4, 19, 10, 22, 10),
    strand = c("+", "+", "+", "-", "+"),
    score = 1:5,
    GC = c(0.3,0.32,0.47,0.41,0.6),
    row.names = head(letters, 5))
df
##   seqnames start end strand score   GC
## a     chr1     1   4      +     1 0.30
## b     chr1    13  19      +     2 0.32
## c     chr1     5  10      +     3 0.47
## d     chr1    17  22      -     4 0.41
## e     chr1     7  10      +     5 0.60
gr <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE)
gr
## GRanges object with 5 ranges and 2 metadata columns:
##     seqnames    ranges strand |     score        GC
##        <Rle> <IRanges>  <Rle> | <integer> <numeric>
##   a     chr1       1-4      + |         1       0.3
##   b     chr1     13-19      + |         2      0.32
##   c     chr1      5-10      + |         3      0.47
##   d     chr1     17-22      - |         4      0.41
##   e     chr1      7-10      + |         5       0.6
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
  • Creation from a character vector:
my_ranges <- c('chr1:1-4:+', 'chr1:13-19:+', 'chr1:5-10:+', 
               'chr1:17-22:-', 'chr1:7-10:+')
as(my_ranges,'GRanges')
## GRanges object with 5 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1       1-4      +
##   [2]     chr1     13-19      +
##   [3]     chr1      5-10      +
##   [4]     chr1     17-22      -
##   [5]     chr1      7-10      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
  • By using the GRanges constructor:
GRanges(seqnames=c('chr1','chr1','chr1','chr1','chr1'),  
        ranges=IRanges(start=c(1, 13, 5, 17, 7), end=c(4, 19, 10, 22, 10)), 
        strand=c('+','+','+','-','+'), 
        mcols=DataFrame(score=1:5, GC=c(0.3,0.32,0.47,0.41,0.6)))
## GRanges object with 5 ranges and 2 metadata columns:
##       seqnames    ranges strand | mcols.score  mcols.GC
##          <Rle> <IRanges>  <Rle> |   <integer> <numeric>
##   [1]     chr1       1-4      + |           1       0.3
##   [2]     chr1     13-19      + |           2      0.32
##   [3]     chr1      5-10      + |           3      0.47
##   [4]     chr1     17-22      - |           4      0.41
##   [5]     chr1      7-10      + |           5       0.6
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

  • Re-create the above GRanges object in your R session with your method of choice and store it as gr for future use..
  • Create the following set of ranges with your method of choice and annotate them with the feature name:
    • Feature A on the plus strand: chr3:2-13, chr3:24-27
    • Feature B on the minus strand: chr3:1-5, chr3:10-25
ex1 <- c('chr3:2-13:+', 'chr3:24-27:+', 'chr3:1-5:-', 'chr3:10-25:-')
ex1 <- as(ex1, 'GRanges')
mcols(ex1) <- DataFrame(name=c('A','A','B','B'))
ex1
## GRanges object with 4 ranges and 1 metadata column:
##       seqnames    ranges strand |        name
##          <Rle> <IRanges>  <Rle> | <character>
##   [1]     chr3      2-13      + |           A
##   [2]     chr3     24-27      + |           A
##   [3]     chr3       1-5      - |           B
##   [4]     chr3     10-25      - |           B
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

1.3 Accessing components of GRanges

## GRanges object with 5 ranges and 2 metadata columns:
##     seqnames    ranges strand |     score        GC
##        <Rle> <IRanges>  <Rle> | <integer> <numeric>
##   a     chr1       1-4      + |         1       0.3
##   b     chr1     13-19      + |         2      0.32
##   c     chr1      5-10      + |         3      0.47
##   d     chr1     17-22      - |         4      0.41
##   e     chr1      7-10      + |         5       0.6
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
  • GRanges objects are themself compound objects with components of class:
    • IRanges (=integer ranges) storing the start/end coordinates
    • Rle efficient run-length encoded vectors storing chromosome names and strand information
    • DataFrame storing the metadata
  • There are specific accessor functions for each component of a GRanges object
seqnames(gr)
## factor-Rle of length 5 with 1 run
##   Lengths:    5
##   Values : chr1
## Levels(1): chr1
start(gr)
## [1]  1 13  5 17  7
end(gr)
## [1]  4 19 10 22 10
width(gr)
## [1] 4 7 6 6 4
strand(gr)
## factor-Rle of length 5 with 3 runs
##   Lengths: 3 1 1
##   Values : + - +
## Levels(3): + - *
  • The metadata is accessible via:
mcols(gr)
## DataFrame with 5 rows and 2 columns
##       score        GC
##   <integer> <numeric>
## a         1       0.3
## b         2      0.32
## c         3      0.47
## d         4      0.41
## e         5       0.6

  • Access the feature names of gr
  • Add the names of gr as a new slot called “FeatureName” to the metadata of gr
names(gr)
## [1] "a" "b" "c" "d" "e"
mcols(gr)$FeatureName <- names(gr)
gr
## GRanges object with 5 ranges and 3 metadata columns:
##     seqnames    ranges strand |     score        GC FeatureName
##        <Rle> <IRanges>  <Rle> | <integer> <numeric> <character>
##   a     chr1       1-4      + |         1       0.3           a
##   b     chr1     13-19      + |         2      0.32           b
##   c     chr1      5-10      + |         3      0.47           c
##   d     chr1     17-22      - |         4      0.41           d
##   e     chr1      7-10      + |         5       0.6           e
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

1.4 Contextual information

  • GRanges also allow to store information about the genome where the ranges are defined
  • This contextual information is very usefull when manipulating ranges or comparing between different sets of ranges
genome(gr)
## chr1 
##   NA
seqinfo(gr)
## Seqinfo object with 1 sequence from an unspecified genome; no seqlengths:
##   seqnames seqlengths isCircular genome
##   chr1             NA         NA   <NA>
  • We can supply missing genome information by an on-line fetch from the UCSC Genome Browser archive for a given genome version:
genome(gr) <- 'hg38'
ucsc_seqinfo <- fetchExtendedChromInfoFromUCSC(genome(gr)) # supply information on chromosome lengths etc.
head(ucsc_seqinfo[,1:4])
##   UCSC_seqlevel UCSC_seqlength circular NCBI_seqlevel
## 1          chr1      248956422    FALSE             1
## 2          chr2      242193529    FALSE             2
## 3          chr3      198295559    FALSE             3
## 4          chr4      190214555    FALSE             4
## 5          chr5      181538259    FALSE             5
## 6          chr6      170805979    FALSE             6
seqinfo(gr) <- Seqinfo(seqnames=ucsc_seqinfo$UCSC_seqlevel,
                       seqlength=ucsc_seqinfo$UCSC_seqlength,
                       isCircular=ucsc_seqinfo$circular, 
                       genome=genome(gr))
gr
## GRanges object with 5 ranges and 3 metadata columns:
##     seqnames    ranges strand |     score        GC FeatureName
##        <Rle> <IRanges>  <Rle> | <integer> <numeric> <character>
##   a     chr1       1-4      + |         1       0.3           a
##   b     chr1     13-19      + |         2      0.32           b
##   c     chr1      5-10      + |         3      0.47           c
##   d     chr1     17-22      - |         4      0.41           d
##   e     chr1      7-10      + |         5       0.6           e
##   -------
##   seqinfo: 455 sequences (1 circular) from hg38 genome

1.5 Basic access and other useful functions

  • Elements of a GRanges object are accessible just as vectors by single brackets
gr[2:3] # access by index
## GRanges object with 2 ranges and 3 metadata columns:
##     seqnames    ranges strand |     score        GC FeatureName
##        <Rle> <IRanges>  <Rle> | <integer> <numeric> <character>
##   b     chr1     13-19      + |         2      0.32           b
##   c     chr1      5-10      + |         3      0.47           c
##   -------
##   seqinfo: 455 sequences (1 circular) from hg38 genome
gr["d"] # access by name
## GRanges object with 1 range and 3 metadata columns:
##     seqnames    ranges strand |     score        GC FeatureName
##        <Rle> <IRanges>  <Rle> | <integer> <numeric> <character>
##   d     chr1     17-22      - |         4      0.41           d
##   -------
##   seqinfo: 455 sequences (1 circular) from hg38 genome
gr[mcols(gr)$score < 3] # access by logical indexing
## GRanges object with 2 ranges and 3 metadata columns:
##     seqnames    ranges strand |     score        GC FeatureName
##        <Rle> <IRanges>  <Rle> | <integer> <numeric> <character>
##   a     chr1       1-4      + |         1       0.3           a
##   b     chr1     13-19      + |         2      0.32           b
##   -------
##   seqinfo: 455 sequences (1 circular) from hg38 genome
  • Also, filtering by the subset() command as known from data.frames is possible
subset(gr, strand == "+")
## GRanges object with 4 ranges and 3 metadata columns:
##     seqnames    ranges strand |     score        GC FeatureName
##        <Rle> <IRanges>  <Rle> | <integer> <numeric> <character>
##   a     chr1       1-4      + |         1       0.3           a
##   b     chr1     13-19      + |         2      0.32           b
##   c     chr1      5-10      + |         3      0.47           c
##   e     chr1      7-10      + |         5       0.6           e
##   -------
##   seqinfo: 455 sequences (1 circular) from hg38 genome
  • By default, the sort function will order a GRanges object by chromosome, then by start position then by strand
sort(gr)
## GRanges object with 5 ranges and 3 metadata columns:
##     seqnames    ranges strand |     score        GC FeatureName
##        <Rle> <IRanges>  <Rle> | <integer> <numeric> <character>
##   a     chr1       1-4      + |         1       0.3           a
##   c     chr1      5-10      + |         3      0.47           c
##   e     chr1      7-10      + |         5       0.6           e
##   b     chr1     13-19      + |         2      0.32           b
##   d     chr1     17-22      - |         4      0.41           d
##   -------
##   seqinfo: 455 sequences (1 circular) from hg38 genome

  • Filter the above gr object for ranges with a GC-content higher than 40% and order the output by score
res <- subset(gr, GC > 0.4)
res[order(mcols(res)$score)]
## GRanges object with 3 ranges and 3 metadata columns:
##     seqnames    ranges strand |     score        GC FeatureName
##        <Rle> <IRanges>  <Rle> | <integer> <numeric> <character>
##   c     chr1      5-10      + |         3      0.47           c
##   d     chr1     17-22      - |         4      0.41           d
##   e     chr1      7-10      + |         5       0.6           e
##   -------
##   seqinfo: 455 sequences (1 circular) from hg38 genome

2 Working with genomic ranges

2.1 Intra-range methods

  • Intra-range methods operate on each element of a GRanges object independent of the other ranges in the object
?`intra-range-methods`
  • Example: resize the set of ranges by fixing the width and center
resize(gr, width=3, fix='center')
## GRanges object with 5 ranges and 3 metadata columns:
##     seqnames    ranges strand |     score        GC FeatureName
##        <Rle> <IRanges>  <Rle> | <integer> <numeric> <character>
##   a     chr1       1-3      + |         1       0.3           a
##   b     chr1     15-17      + |         2      0.32           b
##   c     chr1       6-8      + |         3      0.47           c
##   d     chr1     18-20      - |         4      0.41           d
##   e     chr1       7-9      + |         5       0.6           e
##   -------
##   seqinfo: 455 sequences (1 circular) from hg38 genome
  • Lets use the helpful plotting function plotRanges.R to better grasp the changes to our GRanges object
source('plotRanges.R')
## Loading required package: Gviz