Chromatin Immuno-Precipitation and Sequencing (ChIP-seq) 1: Introduction

Author

Michael Stadler (michael.stadler@fmi.ch)

Published

April 24, 2024

What is ChIPseq?

Chromatin immuno-precipitation is used to experimentally determine the genomic location of specific proteins or protein variants (see figure below). Detection can be done by PCR (ChIP-qPCR), hybridization (ChIP-chip, hardly used anymore) or sequencing (ChIP-seq). The process consists of:

chemically cross-link chromatin (proteins and DNA) [optional in “native ChIP”]
fragment chromatin (< 1kb fragments, determines resolution)
enrichment of fragments of interest (e.g. using antibodies)
purify DNA and sequence
alignment density is a measure of abundance

ChIPseq_overview (Figure from (Szalkowski and Schmid 2011))

How many reads are needed?

We will use this question to think about the process that generates ChIP-seq data, which will give us the basis for its analysis.

The required sequence coverage depends on many factors:

size of the genome (yeast vs. human)
abundance of precipitated protein (fraction of genome, transcription factor vs. abundant histone mark)
enrichment achieved by immuno-precipitation (good vs. bad antibody, stringency of washing)
desired resolution (base-pairs, promoters, domains)
desired sensitivity (minimal detected fold-enrichment)

A simple theoretical model of ChIPseq data

A simple theoretical model can help to think about this process (described in (Mikkelsen et al. 2007)):

genome is divided into \(N\) bins of size \(b\)
fraction \(f\) of these bins contain feature of interest
uniform sampling of reads: number of reads per bin \(\sim\) Poisson\((\lambda)\)
immuno-precipitation enriches by factor \(e\)
(on average, bins with/without features have \(eM\) and \(M\) reads):
\(\lambda_{background} = M\) and \(\lambda_{enriched} = eM\)
when sequencing a total of \(R\) reads: \(M = \frac{R}{N(ef+(1-f))}\)

A simple theoretical model of ChIPseq data

immuno-precipitation enriches, but does not (perfectly) filter
fold-enrichment (\(e\)) is hard to control
typically, most fragments do not contain the feature of interest

Guided exercise: Let’s derive this model step-by-step

I would recommended that before looking at the solution below, you read a the first item from the instructions below and give yourself a couple of minutest to think about it and possibly write down a formal description (variables or formulas) before going to the next step.

Here are the instructions:

Assume we are working with mouse data (genome size ~3e9 bp) and a desired resolution (bin size) of 1kb (\(b\)). Store the bin and genome size in variables (\(b\) and \(G\)) and calculate the number of bins \(N\).
Assume you would sample reads from the genome (without any enrichment). All \(N\) bins have equal probability of containing (or producing) a given read. What distribution would describe the number of reads per bin?
Assume that our protein of interest (a transcription factor) has 10,000 binding sites in the genome. For simplicity, let’s assume that they all occur in different bins (no bin has more than one binding site). Calculate the fraction of bins \(f\) that are bound by the transcription factor.
Let \(M\) and \(e*M\) be the average number of reads in a non-bound and bound bin, respectively (\(e\) is the IP-enrichment factor). Write down an expression to calculate the total number of sequenced reads \(R\) as a function of \(N\), \(f\), \(M\) and \(e\) (this is probably easiest using pen and paper).
The interesting unknown quantity in the above expression is \(M\). Solve the expression for M (again, probably easiest using pen and paper).

This last expression is all we need to go from a given ChIP experiment to average read counts in bound and non-bound bins.

Example: Transcription Factor with 6000 Binding Sites

This model allows estimating the number of reads per bin for given experimental parameters:

Solution

b <- 1000  # 1kb bins
N <- 3e6   # human genome
f <- 0.002 # ~6000 sites
e <- c(2,5,10) # enrichments
R <- 20e6  # 20 Mio. reads
M <- R / (N * (e * f + (1 - f)))
cbind(enr = e, reads = M, reads.enriched = M * e)

     enr    reads reads.enriched
[1,]   2 6.653360       13.30672
[2,]   5 6.613757       33.06878
[3,]  10 6.548788       65.48788

Example: Abundant Histone Mark (e.g. H3K9me2)

An abundant protein or histone mark is more difficult to see above the background:

Solution

f <- 0.3   # 30% of genome
M <- R / (N * (e * f + (1 - f)))
cbind(enr = e, reads = M, reads.enriched = M * e)

     enr    reads reads.enriched
[1,]   2 5.128205       10.25641
[2,]   5 3.030303       15.15152
[3,]  10 1.801802       18.01802

Code: Plotting Distributions of Reads per Bin

The following code can be used to plot the estimated distributions of counts per bin, illustrating the variance due to sampling:

Solution

# helper function
plot_bin_densities <- function(xs, e, M) {
    require(tidyverse)
    
    # prepare plot data
    df <- data.frame(reads_per_bin = xs,
                     e = rep(e, each = length(xs)),
                     M = rep(M, each = length(xs))) |>
        mutate(dens_background = dpois(reads_per_bin, lambda = M),
               dens_foreground = dpois(reads_per_bin, lambda = e * M)) |>
        pivot_longer(cols = starts_with("dens_"),
                     names_to = "type",
                     values_to = "density") |>
        mutate(type = factor(sub("dens_", "", as.character(type)),
                             levels = sub("dens_", "", unique(type))),
               e_label = factor(paste0(e, "-fold enriched"),
                                levels = unique(paste0(e, "-fold enriched"))))
    
    # plot
    ggplot(df, aes(reads_per_bin, density, fill = type)) +
        geom_col(position = position_dodge()) +
        scale_fill_manual(values = c(foreground = "green3", background = "red")) +
        labs(x = "Reads per genomic bin",
             y = "Density of genomic bins",
             fill = "Bin type") +
        facet_wrap(~ e_label, ncol = 1)
}

plot_bin_densities(xs = 0:40, e = e, M = M)

Loading required package: tidyverse

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Read Distribution: Transcription Factor

\(f = 0.002\) (~6000 binding sites in the human genome)

Solution

b <- 1000  # 1kb bins
N <- 3e6   # human genome
f <- 0.002 # ~6000 sites
e <- c(2,5,10) # enrichments
R <- 20e6  # 20 Mio. reads
M <- R / (N * (e * f + (1 - f)))

plot_bin_densities(xs = 0:90, e = e, M = M)

Read Distribution: Abundant Histone Mark (e.g. H3K9me2)

\(f = 0.3\) (~30% of genome is marked)

Solution

b <- 1000  # 1kb bins
N <- 3e6   # human genome
f <- 0.3   # 30% of genome
e <- c(2,5,10) # enrichments
R <- 20e6  # 20 Mio. reads
M <- R / (N * (e * f + (1 - f)))

plot_bin_densities(xs = 0:35, e = e, M = M)

Limitations of this simple model for ChIPseq

In practice, these estimates should be taken with a grain of salt, because:

real data does not follow a Poisson distribution
(e.g. read density often depends on the underlying sequence composition)
the genome is not homogeneous
(real assemblies contain gaps, errors, unmappable regions, copy number variations compared to the experimental system, etc.)
the IP enrichment factor is hard to estimate and may vary along the genome
enriched bins (peaks) are not all equal
(e.g. variable binding affinity can produce a range of enrichments)
etc.

Outlook: Processing a ChIPseq Dataset from Raw Reads

The main steps in a ChIPseq workflow are similar to what we have seen for ATACseq or RNAseq:

align reads to reference genome
quality control
identify regions of interest (peak finding)
quantify regions of interest
compare samples (IP vs. input, IP1 vs. IP2)

A large number of tools are available for these steps. In this example, we will use QuasR (Gaidatzis et al. 2015) for alignment and quality control, and csaw (Lun and Smyth 2014) for peak finding.

Exercises

Using the above simple model for ChIPseq and the transcription factor example (\(e=5\)), plot the reads per bin distributions for different sequencing depths \(R\) (between 0.1 and 100 Mio. reads).
Calculate the fraction of reads in enriched bins (peaks) for each of these examples. Can you explain these results?
Explore the relationship between feature coverage, IP enrichment and sequencing depth using either one of the following options:
- the Shiny application available from 09_ChIPseq_coverage_shinyApp.R
- the interactive Rmarkdown file available from 09_ChIPseq_coverage_interactive.Rmd
- the code above

If you have never used an interactive Rmarkdown document or a shiny app before, here is what you have to do:

[optional, only needed when working on your laptop] First install the shiny package by running install.packages("shiny")
Download 09_ChIPseq_coverage_interactive.Rmd or 09_ChIPseq_coverage_shinyApp.R
Open one of the downloaded 09_ChIPseq_coverage_interactive.Rmd or 09_ChIPseq_coverage_shinyApp.R files in RStudio and click “Run document” or “Run app” at the top

References

Gaidatzis, Dimos, Anita Lerch, Florian Hahne, and Michael B. Stadler. 2015. “QuasR: Quantification and Annotation of Short Reads in r.” Bioinformatics 31 (7): 1130–32. https://doi.org/10.1093/bioinformatics/btu781.

Lun, Aaron T L., and Gordon K. Smyth. 2014. “De Novo Detection of Differentially Bound Regions for ChIP-Seq Data Using Peaks and Windows: Controlling Error Rates Correctly.” Nucleic Acids Res 42 (11): e95. https://doi.org/10.1093/nar/gku351.

Mikkelsen, Tarjei S., Manching Ku, David B. Jaffe, Biju Issac, Erez Lieberman, Georgia Giannoukos, Pablo Alvarez, et al. 2007. “Genome-Wide Maps of Chromatin State in Pluripotent and Lineage-Committed Cells.” Nature 448 (7153): 553–60. https://doi.org/10.1038/nature06008.

Szalkowski, Adam M., and Christoph D. Schmid. 2011. “Rapid Innovation in ChIP-Seq Peak-Calling Algorithms Is Outdistancing Benchmarking Efforts.” Brief Bioinform 12 (6): 626–33. https://doi.org/10.1093/bib/bbq068.