Chromatin Immuno-Precipitation and Sequencing (ChIP-seq) 2, data analysis and interpretation

Author

Michael Stadler, michael.stadler@fmi.ch

Published

May 8, 2024

Processing a ChIP-seq Dataset from Raw Reads

The main steps in a ChIP-seq workflow are similar to what we would do for RNA-seq (apart from peak finding - genes are the features of interest) or ATAC-seq:

  1. align reads to reference genome
  2. quality control
  3. identify regions of interest (peak finding)
  4. quantify regions of interest
  5. compare samples (IP vs. input, IP1 vs. IP2)

A large number of tools are available for these steps. In this example, we will use QuasR (Gaidatzis et al. 2015) for alignment and quality control, and csaw (Lun and Smyth 2014) for peak finding.


Prepare: Copy example data to temporary directory

We use a small example dataset that comes with the QuasR package, containing reads from a histone 3 lysine 4 trimethyl (H3K4me3) ChIP-seq experiment. We start the example workflow by copying those files into a temporary directory td, into a subfolder called extdata:

library(QuasR)
td <- tempdir()
file.copy(system.file(package="QuasR", "extdata"), td, recursive=TRUE)
Sys.chmod(file.path(td, "extdata")) # make sure "extdata" is writable

Step 1: Align reads using the qAlign function

As we have seen in the session about RNA-seq, QuasR needs a text file that lists all raw sequence files, and the reference genome sequence (here a fasta file). qAlign will take care of the rest and store newly generated bam files with the raw sequence files:

sampleFile <- file.path(td, "extdata", "samples_chip_single.txt")
genomeFile <- file.path(td, "extdata", "hg19sub.fa")

proj <- qAlign(sampleFile, genome=genomeFile)
proj
Project: qProject
 Options   : maxHits         : 1
             paired          : no
             splicedAlignment: FALSE
             bisulfite       : no
             snpFile         : none
             geneAnnotation  : none
 Aligner   : Rbowtie v1.42.0 (parameters: -m 1 --best --strata)
 Genome    : /private/var/folders/qm/f4lwc7c10193k2zfy4smk2.../hg19sub.fa (file)

 Reads     : 2 files, 2 samples (fastq format):
   1. chip_1_1.fq.bz2  Sample1 (phred33)
   2. chip_2_1.fq.bz2  Sample2 (phred33)

 Genome alignments: directory: same as reads
   1. chip_1_1_a8ad7322799c.bam
   2. chip_2_1_a8ad5c5e2bd1.bam

 Aux. alignments: none

Step 2: Quality Control

In addition to assessing technical quality of sequencing data, quality control in ChIP-seq experiments typically checks for:

  • reproducibiliy (consistency between replicate experiments)
  • spikyness (defined genomic regions of increased read densities)

These features can be assessed in many ways, for example using QuasR’s qQCReport function, tools like FastQC, and by visualizing the alignment density along the genome (for example by creating a wiggle file and upload it to UCSC’s genome browser or look at the bam files directly using IGV. The plot below was generated using the Gviz package that allows you to create genome browser like plots within R (it has a detailed vignette with many examples and is highly recommended).

Genome browser view (created using Gviz).

Step 2a: Quality Control: GC bias

  • alignment density often depends on sequence composition (e.g. GC percentage)
  • this may lead to false positive/negative results, especially when comparing datasets with different degrees of GC bias
  • GC-bias can be assessed by plotting %GC in genomic tiles (e.g. windows of 5kb) versus the number of reads
  • in practice, you want to compare samples whith little GC bias, and importantly, similar GC bias

Three examples of GC-bias plots.

We can create such a plot for our data easily ourselves (even though with the small dataset we have here, it will not be very informative):

# load the genome sequence
library(Biostrings)
Loading required package: XVector

Attaching package: 'Biostrings'
The following object is masked from 'package:grid':

    pattern
The following object is masked from 'package:base':

    strsplit
gnm <- readDNAStringSet(genomeFile)
chrs <- seqlengths(gnm)

# tile genome into bins
bins <- tileGenome(chrs, tilewidth = 1000)
bins <- unlist(bins)

# get sequences of bins
binSeqs <- getSeq(gnm, bins)
binBases <- oligonucleotideFrequency(binSeqs, width = 1)

# get percent G+C values for bins
percGC <- rowSums(binBases[, c("C","G")]) /
    rowSums(binBases) * 100

# get ChIP-seq read counts for bins
binCounts <- qCount(proj, bins)
counting alignments...
done
# visualize
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ lubridate::%within%() masks IRanges::%within%()
✖ dplyr::collapse()     masks Biostrings::collapse(), IRanges::collapse()
✖ dplyr::combine()      masks BiocGenerics::combine()
✖ purrr::compact()      masks XVector::compact()
✖ dplyr::desc()         masks IRanges::desc()
✖ tidyr::expand()       masks S4Vectors::expand()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::first()        masks S4Vectors::first()
✖ dplyr::lag()          masks stats::lag()
✖ ggplot2::Position()   masks BiocGenerics::Position(), base::Position()
✖ purrr::reduce()       masks GenomicRanges::reduce(), IRanges::reduce()
✖ dplyr::rename()       masks S4Vectors::rename()
✖ lubridate::second()   masks S4Vectors::second()
✖ lubridate::second<-() masks S4Vectors::second<-()
✖ dplyr::slice()        masks XVector::slice(), IRanges::slice()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- data.frame(percGC = percGC,
                 log2count = log2(binCounts[, -1] + 1)) |>
    pivot_longer(cols = starts_with("log2count."),
                 names_to = "sample", values_to = "log2count") |>
    mutate(sample = sub("log2count.", "", sample))
ggplot(df, aes(percGC, log2count, colour = sample)) +
    geom_point() +
    geom_smooth(se = FALSE) +
    labs(x = "Percent G+C",
         y = "Reads per tile (log2)")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

GC bias plot for our (small) example data.

Step 2b: Reproducibility

  • the similarity between replicates is a good quality measure
  • for an unbiased comparison, counts in tiling windows along the genome can be used
  • problematic windows (e.g. alignment artefacts, repeat regions) may have to be manually removed (for a detailed discussion see (Amemiya, Kundaje, and Boyle 2019))
  • if only few windows contain signal (few bindings sites), peak regions should be used instead of tiling windows (see below for identification of peaks)
# re-use binCounts from above
lcnt <- log2(binCounts[, -1] + 1)
ggplot(lcnt, aes(Sample1, Sample2)) +
    geom_point() +
    geom_abline(slope = 1, intercept = 0, colour = "red") +
    annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.5,
             label = sprintf("R = %.3f", cor(lcnt[, "Sample1"],
                                             lcnt[, "Sample2"])), bty = "n")
Warning in annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.5, :
Ignoring unknown parameters: `bty`

Comparison of read counts in genomic tiles.

Step 2c: Variable IP enrichment factors

Assuming that the IP efficiency and the resulting enrichment factor was not the same across your experiments:

  • How would that show itself in the data?
  • What problems would result from that?
  • How could you correct for it?

Let’s use last week’s ChIP-model to simulate artificial ChIP data, one non-enriched and two enriched samples with different IP enrichment factors:

N <- 10000   # number of genomic bins
f <- 0.05    # fraction of enriched bins
M <- 12      # average number of reads per non-enriched bin
e <- c(3, 6) # IP enrichment factors

set.seed(42) # reproducible randomness

# enriched bins
isEnriched <- sample(x = c(TRUE, FALSE), size = N,
                     replace = TRUE, prob = c(f, 1 - f))

# counts
cnt <- cbind(input = rpois(N, lambda = M),
             IP_1  = ifelse(isEnriched, rpois(N, e[1] * M), rpois(N, M)),
             IP_2  = ifelse(isEnriched, rpois(N, e[2] * M), rpois(N, M)))

# normalized counts
lcpm <- log2(t(t(cnt) / colSums(cnt)) * 1e6 + 1)
cor(lcpm)
             input         IP_1         IP_2
input  1.000000000 -0.001684349 -0.002014354
IP_1  -0.001684349  1.000000000  0.494626911
IP_2  -0.002014354  0.494626911  1.000000000

Now we create MA-plot of all pairs:

plotMA <- function(x, ...) { # x: two-column matrix with log counts
    df <- data.frame(M = x[,2] - x[,1],
                     A = (x[,1] + x[,2]) / 2)
    ggplot(df, aes(A, M)) +
        geom_point(position = position_jitter(width = 0.1, height = 0.1),
                   colour = alpha("gray30", 0.3)) +
        labs(x = "A",
             y = paste0("M (", colnames(x)[2], " - ",
                        colnames(x)[1], ")")) +
        geom_hline(yintercept = 0, linetype = "dashed")
}

plotMA(lcpm[, c(1, 2)], ylim = c(-4, 4))

plotMA(lcpm[, c(1, 3)], ylim = c(-4, 4))

plotMA(lcpm[, c(2, 3)], ylim = c(-4, 4))

Notice the clearer signal in IP_2 (mid) compared to IP_1 (top), and the apparent “differential binding” between IP_2 vs. IP_1 (bottom).

Let’s normalize this effect using loess normalization (see ?limma::normalizeBetweenArrays for details):

library(limma)

Attaching package: 'limma'
The following object is masked _by_ '.GlobalEnv':

    plotMA
The following object is masked from 'package:BiocGenerics':

    plotMA
# loess normalize IP samples (replicates)
normLcpm <- lcpm
normLcpm[, 2:3] <- normalizeBetweenArrays(lcpm[, 2:3], method = "cyclicloess")

plotMA(normLcpm[, c(1, 2)], ylim = c(-4, 4))

plotMA(normLcpm[, c(1, 3)], ylim = c(-4, 4))

plotMA(normLcpm[, c(2, 3)], ylim = c(-4, 4))


Step 3: Identifying Regions of Interest

In contrast to typical RNA-seq, the regions of interest are often not a priori known in ChIP-seq. One might:

  • use pre-defined regions in which to quantify the binding, such as:
    promoter regions of known genes, CpG islands, predicted transcription factor binding sites, etc.
  • identify regions as peaks from the data, using tools such as MACS (Zhang et al. 2008), MACS2, MACS3 or PeakSeq
  • identify differentially bound windows based on genome tiling windows, using for example csaw

Because csaw works well with R and is related to edgeR that you already know, we will use it here.


Step 3a: Identifying Peaks with csaw

  • Identifying differentially bound peaks
    The csaw package was primarily designed to identify differentially bound regions between groups of replicated ChIP samples (e.g. for different treatment conditions or genotypes). An excellent description can be opened using the csawUsersGuide() function.
  • Identifying peaks relative to local background (emulating MACS)
    csaw is however quite flexible and can also be used to identify windows of high alignment density relative to a local neighborhood, similar to the approach implemented in MACS (Zhang et al. 2008). This will be illustrated below.

Estimate average fragment length

The average ChIP-seq fragment size can be estimated from the data, as plus- and minus-strand alignments mark the starts and ends of ChIP fragments, respectively (illustration from (Wilbanks and Facciotti 2010)).

ChIP-seq reads on opposite strands are separated by the average fragment length.

Estimate average fragment length using Autocorrelation

# load library
library(csaw)

# get bam files
bam.files <- alignments(proj)$genome$FileName

# obtain average fragment length
# from cross-correlation of plus/minus strand alignments
x <- correlateReads(bam.files, max.dist = 500)

# visualize (raw and smoothed)
df <- data.frame(delay = 0:500,
                 ccf = x,
                 ccf_smooth = runmean(Rle(x), k = 101, endrule = "constant"))
frag_len <- df[which.max(df$ccf_smooth), "delay"]
ggplot(df, aes(delay)) +
    geom_point(aes(y = ccf)) +
    geom_line(aes(y = ccf_smooth), colour = "red") +
    geom_vline(xintercept = frag_len, colour = "blue") +
    labs(x = "Delay (base pairs)",
         y = "Cross-correlation function") +
    annotate("text", x = Inf, y = Inf, hjust = 1.1, vjust = 1.5,
             label = paste0("fragment length = ", frag_len, "bp"))

Autocorrelation between plus- and minus-strand read densities along the genome.

Count alignments per window along the genome

As before, we could use qCount from the QuasR package to obtain counts for windows, but csaw also provides a window counting function, which returns a SummarizedExperiment object. We will use it here because it simplifies the subsequent peak finding using csaw:

data <- windowCounts(bam.files, ext = frag_len)
head(assays(data)$counts) # counts
     [,1] [,2]
[1,]    4    8
[2,]    3   13
[3,]    3   13
[4,]    6   11
[5,]    5   11
[6,]    4    8
head(rowRanges(data)) # ... for windows
GRanges object with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1 1801-1850      *
  [2]     chr1 1851-1900      *
  [3]     chr1 1901-1950      *
  [4]     chr1 1951-2000      *
  [5]     chr1 2001-2050      *
  [6]     chr1 8651-8700      *
  -------
  seqinfo: 3 sequences from an unspecified genome

Please note that windowCounts() counts fragments overlapping each window. In the example here, many fragments are counted several times if they overlap more than a single window. Be careful to avoid this redundant use of your data:

# get the total number of alignments
# ... from the QuasR qProject object
alignmentStats(proj)[, "mapped"]
Sample1:genome Sample2:genome 
          2339           3609 
# ... from the SummarizedExperiment colData
colData(data)[, "totals"]
[1] 2339 3609
# ... and compare to the total counts
colSums(assays(data)$counts)
[1]  7509 14341

Let’s quickly repeat what a SummarizedExperiment is: It is essentially just a fancy matrix (2D array), one that can store more than a single version of the matrix (assays, for example raw and normalized), and that can also store metadata on the rows (rowData()) or columns (colData()). It is also the parent class of the SingleCellExperiment that we used to store single cell data. In summary, it stores feature-by-sample measurements with additional annotation for features (rows) and samples (columns).

You might wonder why we don’t use a simple matrix for that purpose? The specialized classes have the advantage, that they keep annotation (e.g. gene symbols or sample descriptions) together with your quantitative data. If you filter your genes based on the counts, the gene annotation is filtered at the same time, and there is no danger of merge or reorder mistakes. Furthermore, it become very easy to save and share the data as a single R object (for example using saveRDS()).

Below is a graphical representation of the object and the accessors for the different data slots (see (Huber et al. 2015) or the documentation ?"SummarizedExperiment-class" in the SummarizedExperiment package for more details):

Schematic overview of the SummarizedExperiment class.

Strategies to Identify ChIP-seq Peaks

Depending on the chipped feature, the ChIP-seq signal and thus the strategy to identify peaks has to be adjusted (see for example the figure below from (Pepke, Wold, and Mortazavi 2009) and also (Wilbanks and Facciotti 2010))

Types of ChIP-seq peaks (from Pepke et al., 2009).

Identify peaks (enriched regions over local background)

# identify peak windows (3-fold over background)
surrounds <- 5000 # background from local neighborhood
neighbour <- resize(rowRanges(data), width = surrounds, fix = "center")
wider <- regionCounts(bam.files, regions = neighbour, ext = frag_len)
enr.over.neighbour <- filterWindowsLocal(data, wider)$filter
peaks.windows <- rowRanges(data)[enr.over.neighbour > log2(3)]

# merge peak windows into peaks
maxdist <- 1000 # fuse windows closer than this
peaks <- mergeWindows(peaks.windows, tol = maxdist)
peaks
$ids
 [1] 1 1 1 1 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5
[39] 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6

$regions
GRanges object with 6 ranges and 0 metadata columns:
      seqnames      ranges strand
         <Rle>   <IRanges>  <Rle>
  [1]     chr1   1851-2050      *
  [2]     chr1 13101-13300      *
  [3]     chr1 31551-32250      *
  [4]     chr2   3401-3800      *
  [5]     chr3   2451-3150      *
  [6]     chr3 28551-29650      *
  -------
  seqinfo: 3 sequences from an unspecified genome

Export peak coordinages as BED file

# exporting a BED file with peak coordinates
export(peaks$region, file.path(td, "peaks.bed"))

cat(paste(readLines(file.path(td, "peaks.bed"), n = 6), collapse = "\n"))
chr1    1850    2050    .   0   .
chr1    13100   13300   .   0   .
chr1    31550   32250   .   0   .
chr2    3400    3800    .   0   .
chr3    2450    3150    .   0   .
chr3    28550   29650   .   0   .

Visualize peaks with Gviz

Peaks plotted using Gviz.

Visualize peaks with simple R

df <- data.frame(chr = seqnames(rowRanges(data)),
                 pos = start(rowRanges(data)),
                 count = rowSums(assay(data, "counts"))) |>
    filter(chr == "chr3")

j <- seqnames(peaks$region) == "chr3"
ggplot() +
    geom_rect(inherit.aes = FALSE,
              data = data.frame(xmin = start(peaks$region[j]),
                                xmax = end(peaks$region[j]),
                                ymin = 0, ymax = 1.04 * max(df$count)),
              mapping = aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
              fill = alpha("red", 0.2)) +
    geom_line(data = df, inherit.aes = FALSE,
              mapping = aes(pos, count)) +
    scale_y_continuous(expand = expansion(mult = 0)) +
    labs(x = "Position on chr3",
         y = "Alignments per window")

Peaks plotted using ggplot2 graphics.

Step 4: Quantify regions of interest

Once regions of interest (peaks) are identified, it is often useful to go back to the raw data and quantify the signal per region and individual sample, for example to compare replicates. We will do this using qCount from the QuasR package, which does not count reads redundantly:

cnt <- qCount(proj, peaks$region)
counting alignments...done
head(cnt)
  width Sample1 Sample2
1   200       3      14
2   200       7      14
3   700     161     534
4   400      78     531
5   700     229     556
6  1100     135     410

We can now repeat our comparison of replicates, this time using the counts in peaks (rather than in genomic tiles, which often gives a better comparison as most genomic tiles may only show noise rather than signal):

# calculate counts per million (cpm)
Ntot <- alignmentStats(proj)[, "mapped"]
cpm <- t(t(cnt[, c("Sample1","Sample2")]) / Ntot * 1e6)

df <- data.frame(cpm)
ggplot(df, aes(Sample1, Sample2)) +
    geom_point() +
    geom_abline(slope = 1, intercept = 0, col = "red") +
    scale_x_log10() +
    scale_y_log10() +
    annotate("text", x = min(cpm), y = max(cpm), hjust = 0, vjust = 1,
             label = sprintf("R = %.3f",
                             cor(log2(df$Sample1),
                                 log2(df$Sample2))))

Comparison of read counts in peaks.

Step 5: Compare Samples (IP vs. input, IP1 vs. IP2)

Finally, alignment counts in regions of interest can be compared between samples, similarly as we have done it for RNA-seq counts. The promoter counts can be downloaded from here: ChIP_promoter_counts.rds, and I saved them into the folder data/ChIP/. A useful measure are log-ratios of IP counts over input counts, also referred to as enrichment values:

# load count table:
#    regions: promoter regions (TSS +/- 1kb, mm10)
#    mouse ES cells, H3K4me2 ChIP (GSM686995)
#    mouse ES cells, input chromatin (GSM686996)
cnt <- readRDS("data/ChIP/ChIP_promoter_counts.rds")
head(cnt)
  seqnames     start       end width strand EntrezGeneId
1     chr9  21066093  21068092  2000      -    100009600
2     chr7  84963009  84965008  2000      -    100009609
3    chr10  77710457  77712456  2000      +    100009614
4    chr11  45807083  45809082  2000      +    100009664
5     chr4 144161651 144163650  2000      -       100012
6     chr4 134767004 134769003  2000      -       100017
  raw.GSM686995_MmES_H3K4me2_a raw.GSM747545_MmES_Input_a
1                          195                         35
2                            0                          3
3                            8                         16
4                            0                          8
5                            1                          7
6                          993                         11
  cpm.GSM686995_MmES_H3K4me2_a cpm.GSM747545_MmES_Input_a
1                  12.67881436                  3.2062052
2                   0.00000000                  0.2748176
3                   0.52015649                  1.4656938
4                   0.00000000                  0.7328469
5                   0.06501956                  0.6412410
6                  64.56442388                  1.0076645
# calculate enrichments
df <- cnt |>
    as.data.frame() |>
    mutate(enr = log2((cpm.GSM686995_MmES_H3K4me2_a + 1) /
            (cpm.GSM747545_MmES_Input_a + 1)))

ggplot(df, aes(enr)) +
    geom_histogram(bins = 100) +
    labs(x = "Enrichment (log2 H3K4me2 / input)",
         y = "Number of promoters")

Enrichment histogram (log2 H3K4me2/Input).

Do you remember why we add a constant to the counts?


Compare samples directly

Another useful plot to compare ChIP-seq samples is the MA-plot, that compares mean abundance (A) with the log2 fold-changes (or enrichment, M) between a pair of samples:

df <- df |>
    mutate(A = log2(cpm.GSM686995_MmES_H3K4me2_a + cpm.GSM747545_MmES_Input_a + 1))

ggplot(df, aes(A, enr)) +
    geom_jitter(width = 0.1, height = 0.1, colour = alpha("gray30", 0.2)) +
    geom_hline(yintercept = 0, linetype = "dashed") +
    labs(x = "A", y = "M")

MA-plot of H3K4me2 ChIP and Input read counts in promoters.

Because of the very strong signal-to-noise ratio in this sample, A and M are strongly correlated (there are no non-enriched promoters with large A values).

Finally, a simple scatter plot is always a powerful way to study the relationship between two ChIP-seq sampls:

ggplot(df, aes(cpm.GSM686995_MmES_H3K4me2_a + 1, cpm.GSM747545_MmES_Input_a + 1)) +
    geom_jitter(width = 0.01, height = 0.01, colour = alpha("gray30", 0.2)) +
    geom_abline(slope = 1, intercept = 0) +
    labs(x = "Counts per million in H3K4me2 (log scale)",
         y = "Counts per million in Input (log scale)") +
    scale_x_log10(limits = c(1, 200)) +
    scale_y_log10(limits = c(1, 10))
Warning: Removed 2174 rows containing missing values or values outside the scale range
(`geom_point()`).

Scatter plot of H3K4me2 ChIP and Input read counts in promoters.

Compare ChIP data to other types of data

Much can be learned by comparing ChIP data with other types of data, such as expression (RNA-seq) or DNA sequence features (transcription factor binding sites, sequence composition). Here is an example that calculates the CpG observed-over-expect ratio for promoters and compares that to the H3K4me2 ChIP enrichments calculated above (the code below will require the genome sequence in the form of a BSgenome package to be installed):

# create GRanges object for promoter regions
head(cnt)
  seqnames     start       end width strand EntrezGeneId
1     chr9  21066093  21068092  2000      -    100009600
2     chr7  84963009  84965008  2000      -    100009609
3    chr10  77710457  77712456  2000      +    100009614
4    chr11  45807083  45809082  2000      +    100009664
5     chr4 144161651 144163650  2000      -       100012
6     chr4 134767004 134769003  2000      -       100017
  raw.GSM686995_MmES_H3K4me2_a raw.GSM747545_MmES_Input_a
1                          195                         35
2                            0                          3
3                            8                         16
4                            0                          8
5                            1                          7
6                          993                         11
  cpm.GSM686995_MmES_H3K4me2_a cpm.GSM747545_MmES_Input_a
1                  12.67881436                  3.2062052
2                   0.00000000                  0.2748176
3                   0.52015649                  1.4656938
4                   0.00000000                  0.7328469
5                   0.06501956                  0.6412410
6                  64.56442388                  1.0076645
cntgr <- as(cnt[,1:5], "GRanges")

# get corresponding sequences
library("BSgenome.Mmusculus.UCSC.mm10")
cntseq <- getSeq(BSgenome.Mmusculus.UCSC.mm10, cntgr)

# calculate mono-/di-nucleotide frequencies
f.mono <- oligonucleotideFrequency(cntseq, width = 1, as.prob = TRUE)
f.di   <- oligonucleotideFrequency(cntseq, width = 2, as.prob = TRUE)

# calculate CpG observed/expected ratio
df$CpGexp <- f.mono[,"C"] * f.mono[,"G"]
df$CpGoe <- f.di[,"CG"] / df$CpGexp

# create plot
ggplot(df, aes(CpGoe, enr)) +
    geom_jitter(height = 0.1, colour = alpha("gray30", 0.2)) +
    labs(x = "CpG obs/expected",
         y = "H3K4me2 (log2 IP/input)") +
    geom_hline(yintercept = 0.5, linetype = "dashed") + # H3K4me2 enrichment threshold
    geom_vline(xintercept = 0.4, linetype = "dashed")   # high-CpG promoter threshold

CpG content (observed-over-expected ratio) and H3K4me2 enrichments in promoters.

Exercises

  1. Export the list of identified peaks as a bed file, but just containing the peaks on chromosome 3.
selPeaks <- peaks$regions[seqnames(peaks$regions) == "chr3"]
selPeaks
GRanges object with 2 ranges and 0 metadata columns:
      seqnames      ranges strand
         <Rle>   <IRanges>  <Rle>
  [1]     chr3   2451-3150      *
  [2]     chr3 28551-29650      *
  -------
  seqinfo: 3 sequences from an unspecified genome
tf <- file.path(tempdir(), "peaks_chr3.bed")
export(selPeaks, tf)
cat(paste(readLines(tf), collapse = "\n"))
chr3    2450    3150    .   0   .
chr3    28550   29650   .   0   .
  1. What is the impact of increasing or decreasing the genomic neighborhood (surrounds variable) on the resulting peaks? Try to identify a pattern by rerunning peak finding using different values and visualizing the results using the code above.
# helper function to call peaks given `surrounds`
# expects to find the following in the global environment:
# data, bam.files, frag_len, maxdist
get_peaks <- function(surrounds, cutoff = log2(3)) {
    suppressWarnings({
        neighbour <- resize(rowRanges(data), width = surrounds, fix = "center")
    })
    wider <- regionCounts(bam.files, regions = neighbour, ext = frag_len)
    enr.over.neighbour <- filterWindowsLocal(data, wider)$filter
    peaks.windows <- rowRanges(data)[enr.over.neighbour > cutoff]
    mergeWindows(peaks.windows, tol = maxdist)
}

surrounds_values <- c(1000, 2500, 5000, 50000)
peak_list <- lapply(surrounds_values, get_peaks)

# visualize
selchr <- "chr3"
peak_list_selchr <- lapply(peak_list, function(x) x$regions[seqnames(x$regions) == selchr])
cumnumpeaks <- cumsum(lengths(peak_list_selchr))
# ... extract alignments
df1 <- data.frame(chr = seqnames(rowRanges(data)),
                  pos = start(rowRanges(data)),
                  count = rowSums(assay(data, "counts"))) |>
     filter(chr == selchr)
# ... extract peaks (for different `surrounds_values`)
dy <- 0.08 * max(df1$count)
df2 <- do.call(rbind, lapply(seq_along(surrounds_values), function(i) {
    # o <- cumnumpeaks[i] + seq.int(lengths(peak_list_selchr)[i]) - 1
    o <- i
    data.frame(surrounds = surrounds_values[i],
               parameter_set = factor(surrounds_values[i], levels = surrounds_values),
               xmin = start(peak_list_selchr[[i]]),
               xmax = end(peak_list_selchr[[i]]),
               ymin = -o * dy, ymax = -o * dy + 0.75 * dy)
}))

ggplot() +
    geom_rect(data = df2, linewidth = 2,
              mapping = aes(xmin = xmin, xmax = xmax,
                            ymin = ymin, ymax = ymax,
                            colour = parameter_set,
                            fill = parameter_set)) +
    geom_polygon(data = df1, inherit.aes = FALSE,
                 mapping = aes(pos, count),
                 fill = "gray50") +
    labs(x = "Position on chr3",
         y = "Alignments per window",
         fill = "`surrounds` value",
         colour = "`surrounds` value")

  1. Compare the promoter H3K4me2 enrichments with the gene expression in ES cells (using EntrezID to link the two sets, using the RNA-seq counts form RNA_gene_counts.rds)
rna <- readRDS("data/ChIP/RNA_gene_counts.rds")

df <- left_join(x = df,
                y = rna |> select(starts_with(c("EntrezGeneId", "log2rpkm"))),
                by = join_by("EntrezGeneId"))

ggplot(df, aes((log2rpkm.ES.RNA.a + log2rpkm.ES.RNA.b) / 2, enr)) +
    geom_jitter(height = 0.1, colour = alpha("gray30", 0.2)) +
    geom_hline(yintercept = 0.5, linetype = "dashed") +
    geom_vline(xintercept = 0.5, linetype = "dashed") +
    labs(x = "Mean RNA level (log2 RPKM)",
         y = "Promoter H3K4me2 (log2 IP/input)")
Warning: Removed 1227 rows containing missing values or values outside the scale range
(`geom_point()`).

References

Amemiya, H. M., A. Kundaje, and A. P. Boyle. 2019. The ENCODE Blacklist: Identification of Problematic Regions of the Genome.” Sci Rep 9 (1): 9354. https://doi.org/10.1038/s41598-019-45839-z.
Gaidatzis, Dimos, Anita Lerch, Florian Hahne, and Michael B. Stadler. 2015. “QuasR: Quantification and Annotation of Short Reads in r.” Bioinformatics 31 (7): 1130–32. https://doi.org/10.1093/bioinformatics/btu781.
Huber, W, V Carey, R Gentleman, S Anders, M Carlson, B Carvalho, HC Bravo, et al. 2015. “COrchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12: 115–21. https://doi.org/10.1038/nmeth.3252.
Lun, Aaron T L., and Gordon K. Smyth. 2014. “De Novo Detection of Differentially Bound Regions for ChIP-Seq Data Using Peaks and Windows: Controlling Error Rates Correctly.” Nucleic Acids Res 42 (11): e95. https://doi.org/10.1093/nar/gku351.
Pepke, Shirley, Barbara Wold, and Ali Mortazavi. 2009. “Computation for ChIP-Seq and RNA-Seq Studies.” Nature Methods 6: S22–32. https://doi.org/10.1038/nmeth.1371.
Wilbanks, Elizabeth G, and Marc T Facciotti. 2010. “Evaluation of Algorithm Performance in ChIP-Seq Peak Detection.” PLOS ONE 5: e11471. https://doi.org/10.1371/journal.pone.0011471.
Zhang, Yong, Tao Liu, Clifford A. Meyer, Jérôme Eeckhoute, David S. Johnson, Bradley E. Bernstein, Chad Nusbaum, et al. 2008. “Model-Based Analysis of ChIP-Seq (MACS).” Genome Biol 9 (9): R137. https://doi.org/10.1186/gb-2008-9-9-r137.