1 Processing a ChIP-seq Dataset from Raw Reads

The main steps in a ChIP-seq workflow are similar to what we have seen for RNA-seq:

  1. align reads to reference genome
  2. quality control
  3. identify regions of interest (peak finding)
  4. quantify regions of interest
  5. compare samples (IP vs. input, IP1 vs. IP2)

A large number of tools are available for these steps. In this example, we will use QuasR (Gaidatzis et al. 2015) for alignment and quality control, and csaw (Lun and Smyth 2014) for peak finding.

1.1 Prepare: Copy example data to temporary directory

We use a small example dataset that comes with the QuasR package, containing reads from a histone 3 lysine 4 trimethyl (H3K4me3) ChIP-seq experiment. We start the example workflow by copying those files into a temporary directory td, into a subfolder called extdata:

td <- tempdir()
file.copy(system.file(package="QuasR", "extdata"), td, recursive=TRUE)

1.2 Step 1: Align reads using the qAlign function

As we have seen in the session about RNA-seq, QuasR needs a text file that lists all raw sequence files, and the reference genome sequence (here a fasta file). qAlign will take care of the rest and store newly generated bam files with the raw sequence files:

sampleFile <- file.path(td, "extdata", "samples_chip_single.txt")
genomeFile <- file.path(td, "extdata", "hg19sub.fa")

proj <- qAlign(sampleFile, genome=genomeFile)
## Project: qProject
##  Options   : maxHits         : 1
##              paired          : no
##              splicedAlignment: FALSE
##              bisulfite       : no
##              snpFile         : none
##              geneAnnotation  : none
##  Aligner   : Rbowtie v1.24.0 (parameters: -m 1 --best --strata)
##  Genome    : /private/var/folders/qm/f4lwc7c10193k2zfy4smk22c0000gp/T/RtmpdYVwR.../hg19sub.fa (file)
##  Reads     : 2 files, 2 samples (fastq format):
##    1. chip_1_1.fq.bz2  Sample1 (phred33)
##    2. chip_2_1.fq.bz2  Sample2 (phred33)
##  Genome alignments: directory: same as reads
##    1. chip_1_1_5f4642c4d587.bam
##    2. chip_2_1_5f4689eb450.bam 
##  Aux. alignments: none

1.3 Step 2: Quality Control

In addition to assessing technical quality of sequencing data, quality control in ChIP-seq experiments typically checks for:

  • reproducibiliy (consistency between replicate experiments)
  • spikyness (defined genomic regions of increased read densities)

These features can be assessed in many ways, for example using QuasR’s qQCReport function, tools like FastQC, and by visualizing the alignment density along the genome (for example by creating a wiggle file and upload it to UCSC’s genome browser or look at the bam files directly using IGV. The plot below was generated using the Gviz package that allows you to create genome browser like plots within R (it has a detayled vignette with many examples and is highly recommended).

1.4 Step 2a: Quality Control: GC bias

  • alignment density often depends on sequence composition (e.g. GC percentage)
  • this may lead to false positive/negative results, especially when comparing datasets with different degrees of GC bias
  • GC-bias can be assessed by plotting %GC in genomic tiles (e.g. windows of 5kb) versus the number of reads