Chromatin Immuno-Precipitation and Sequencing (ChIP-seq) 1: Introduction
Author
Michael Stadler (michael.stadler@fmi.ch)
Published
April 24, 2024
What is ChIPseq?
Chromatin immuno-precipitation is used to experimentally determine the genomic location of specific proteins or protein variants (see figure below). Detection can be done by PCR (ChIP-qPCR), hybridization (ChIP-chip, hardly used anymore) or sequencing (ChIP-seq). The process consists of:
chemically cross-link chromatin (proteins and DNA) [optional in “native ChIP”]
fragment chromatin (< 1kb fragments, determines resolution)
enrichment of fragments of interest (e.g. using antibodies)
A simple theoretical model can help to think about this process (described in (Mikkelsen et al. 2007)):
genome is divided into \(N\) bins of size \(b\)
fraction \(f\) of these bins contain feature of interest
uniform sampling of reads: number of reads per bin \(\sim\) Poisson\((\lambda)\)
immuno-precipitation enriches by factor \(e\)
(on average, bins with/without features have \(eM\) and \(M\) reads): \(\lambda_{background} = M\) and \(\lambda_{enriched} = eM\)
when sequencing a total of \(R\) reads: \(M = \frac{R}{N(ef+(1-f))}\)
A simple theoretical model of ChIPseq data
immuno-precipitation enriches, but does not (perfectly) filter
fold-enrichment (\(e\)) is hard to control
typically, most fragments do not contain the feature of interest
Guided exercise: Let’s derive this model step-by-step
I would recommended that before looking at the solution below, you read a the first item from the instructions below and give yourself a couple of minutest to think about it and possibly write down a formal description (variables or formulas) before going to the next step.
Here are the instructions:
Assume we are working with mouse data (genome size ~3e9 bp) and a desired resolution (bin size) of 1kb (\(b\)). Store the bin and genome size in variables (\(b\) and \(G\)) and calculate the number of bins \(N\).
Assume you would sample reads from the genome (without any enrichment). All \(N\) bins have equal probability of containing (or producing) a given read. What distribution would describe the number of reads per bin?
Assume that our protein of interest (a transcription factor) has 10,000 binding sites in the genome. For simplicity, let’s assume that they all occur in different bins (no bin has more than one binding site). Calculate the fraction of bins \(f\) that are bound by the transcription factor.
Let \(M\) and \(e*M\) be the average number of reads in a non-bound and bound bin, respectively (\(e\) is the IP-enrichment factor). Write down an expression to calculate the total number of sequenced reads \(R\) as a function of \(N\), \(f\), \(M\) and \(e\) (this is probably easiest using pen and paper).
The interesting unknown quantity in the above expression is \(M\). Solve the expression for M (again, probably easiest using pen and paper).
This last expression is all we need to go from a given ChIP experiment to average read counts in bound and non-bound bins.
Example: Transcription Factor with 6000 Binding Sites
This model allows estimating the number of reads per bin for given experimental parameters:
Solution
b <-1000# 1kb binsN <-3e6# human genomef <-0.002# ~6000 sitese <-c(2,5,10) # enrichmentsR <-20e6# 20 Mio. readsM <- R / (N * (e * f + (1- f)))cbind(enr = e, reads = M, reads.enriched = M * e)
The following code can be used to plot the estimated distributions of counts per bin, illustrating the variance due to sampling:
Solution
# helper functionplot_bin_densities <-function(xs, e, M) {require(tidyverse)# prepare plot data df <-data.frame(reads_per_bin = xs,e =rep(e, each =length(xs)),M =rep(M, each =length(xs))) |>mutate(dens_background =dpois(reads_per_bin, lambda = M),dens_foreground =dpois(reads_per_bin, lambda = e * M)) |>pivot_longer(cols =starts_with("dens_"),names_to ="type",values_to ="density") |>mutate(type =factor(sub("dens_", "", as.character(type)),levels =sub("dens_", "", unique(type))),e_label =factor(paste0(e, "-fold enriched"),levels =unique(paste0(e, "-fold enriched"))))# plotggplot(df, aes(reads_per_bin, density, fill = type)) +geom_col(position =position_dodge()) +scale_fill_manual(values =c(foreground ="green3", background ="red")) +labs(x ="Reads per genomic bin",y ="Density of genomic bins",fill ="Bin type") +facet_wrap(~ e_label, ncol =1)}plot_bin_densities(xs =0:40, e = e, M = M)
Loading required package: tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Read Distribution: Transcription Factor
\(f = 0.002\) (~6000 binding sites in the human genome)
Solution
b <-1000# 1kb binsN <-3e6# human genomef <-0.002# ~6000 sitese <-c(2,5,10) # enrichmentsR <-20e6# 20 Mio. readsM <- R / (N * (e * f + (1- f)))plot_bin_densities(xs =0:90, e = e, M = M)
Read Distribution: Abundant Histone Mark (e.g. H3K9me2)
\(f = 0.3\) (~30% of genome is marked)
Solution
b <-1000# 1kb binsN <-3e6# human genomef <-0.3# 30% of genomee <-c(2,5,10) # enrichmentsR <-20e6# 20 Mio. readsM <- R / (N * (e * f + (1- f)))plot_bin_densities(xs =0:35, e = e, M = M)
Limitations of this simple model for ChIPseq
In practice, these estimates should be taken with a grain of salt, because:
real data does not follow a Poisson distribution
(e.g. read density often depends on the underlying sequence composition)
the genome is not homogeneous
(real assemblies contain gaps, errors, unmappable regions, copy number variations compared to the experimental system, etc.)
the IP enrichment factor is hard to estimate and may vary along the genome
enriched bins (peaks) are not all equal
(e.g. variable binding affinity can produce a range of enrichments)
etc.
Outlook: Processing a ChIPseq Dataset from Raw Reads
The main steps in a ChIPseq workflow are similar to what we have seen for ATACseq or RNAseq:
align reads to reference genome
quality control
identify regions of interest (peak finding)
quantify regions of interest
compare samples (IP vs. input, IP1 vs. IP2)
A large number of tools are available for these steps. In this example, we will use QuasR(Gaidatzis et al. 2015) for alignment and quality control, and csaw(Lun and Smyth 2014) for peak finding.
Exercises
Using the above simple model for ChIPseq and the transcription factor example (\(e=5\)), plot the reads per bin distributions for different sequencing depths \(R\) (between 0.1 and 100 Mio. reads).
Calculate the fraction of reads in enriched bins (peaks) for each of these examples. Can you explain these results?
Explore the relationship between feature coverage, IP enrichment and sequencing depth using either one of the following options:
Open one of the downloaded 09_ChIPseq_coverage_interactive.Rmd or 09_ChIPseq_coverage_shinyApp.R files in RStudio and click “Run document” or “Run app” at the top
References
Gaidatzis, Dimos, Anita Lerch, Florian Hahne, and Michael B. Stadler. 2015. “QuasR: Quantification and Annotation of Short Reads in r.”Bioinformatics 31 (7): 1130–32. https://doi.org/10.1093/bioinformatics/btu781.
Lun, Aaron T L., and Gordon K. Smyth. 2014. “De Novo Detection of Differentially Bound Regions for ChIP-Seq Data Using Peaks and Windows: Controlling Error Rates Correctly.”Nucleic Acids Res 42 (11): e95. https://doi.org/10.1093/nar/gku351.
Mikkelsen, Tarjei S., Manching Ku, David B. Jaffe, Biju Issac, Erez Lieberman, Georgia Giannoukos, Pablo Alvarez, et al. 2007. “Genome-Wide Maps of Chromatin State in Pluripotent and Lineage-Committed Cells.”Nature 448 (7153): 553–60. https://doi.org/10.1038/nature06008.
Szalkowski, Adam M., and Christoph D. Schmid. 2011. “Rapid Innovation in ChIP-Seq Peak-Calling Algorithms Is Outdistancing Benchmarking Efforts.”Brief Bioinform 12 (6): 626–33. https://doi.org/10.1093/bib/bbq068.