1 General info

Course website and material

2 Tiling window analysis of ChIP-seq data

In the last lectures, you have seen ways of determining RNA-seq and Chip-seq counts in predefined regions. Today, we want to look at these count data from different angles and try to interpret what we observe. So this lecture will be less focused on R commands and much more on data interpretation. To this end, we will use a “real-word” data set: in a first step, we will analyze published ChIP-seq data of a number of chromatin marks for mouse ES cells and, in a second step, see how they relate to gene expression as measured by RNA-seq.

The lecture is intended to illustrate the importance of visualizing and exploring data in detail, particularly in a “raw” form (as close to the raw data as possible), to understand its structure and recognizing potential pitfalls in the analysis. There are unfortunately no short-cuts to insights in data analysis. Standard workflows (some of which you have seen in the last lectures) may or may not work and simply using existing tools without a detailed look at the data could be dangerous - there are often surprises to be found in the data.

The ChIP-seq data we will use consists of two euchromatic histone marks (H3K4me2 and H3K36me3) and two heterochromatic marks (H3K27me3 and H3K9me2) measured in mouse ES cells. The first three marks were generated in the Schübeler lab at FMI (publicly available from GEO under accessions GSM747543, GSM747544, GSM801982, GSM801983, GSM747539, GSM747541), the last one is from the Zhu lab (GSM1314605). From both labs, we have one input control sample (GSM747546 and GSM1314606, respectively). The RNA-seq data that we will look at also comes from the Schübeler lab at FMI (GSM778487, GSM778488).

In a first step, before we can do any comparison with the RNA-seq data, we need to understand the ChIP-seq data. With ChIP-seq, if the technical quality is fine (good alignment rate, low sequencing error rate, etc.), one is tempted to directly move on to peak finding and carry on the analysis from there on using either read counts in peaks or doing simple overlap statistics. However, it turns out that it is usually very instructive to look at the data first in an annotation-free manner, i.e. without directly focusing in on specific parts of the genome (so in a “raw” form, see above). This you can do in a least two ways: Creating wiggle files and looking at the data in genome browser (you have seen this in previous lectures - we won’t do it today), or looking at ChIP-seq counts in tiling windows along the genome, typically with tiles of a size around 1kb in length (if the number of reads is large enough). The latter approach turns out to be very helpful because it shows you the structure of the data not only in regions that are enriched for a particular chromatin mark or bound transcription factor, but also in the background regions of the genome. These latter regions allow you to identify general biases, in particular the GC bias that you have heard about in the last lecture.

2.1 GC-bias

Using qCount from the QuasR package and the tileGenome function from the GenomicRanges package, we have pre-calculated the levels (log2 library-size scaled counts, with a pseudo-count of \(8\)) of the four chromatin marks and the two inputs in 2kb tiling windows along chromosome 19 (to keep the file small). Let’s read in the data:

DL <- readRDS("data/14/ChIPdata_2kb_chr19.rds")
head(DL)
##   H3K4me2_a H3K4me2_b H3K36me3_a H3K36me3_b H3K27me3_a H3K27me3_b H3K9me2_a H3K9me2_b  Input_a
## 1  4.023915  3.252330   4.251767   3.000000   4.416472   4.535984  4.605221  4.857063 3.682214
## 2  3.000000  3.000000   3.170663   3.697906   3.638250   3.560681  4.779422  5.074005 3.588645
## 3  3.000000  3.467028   3.587175   3.582767   4.416472   3.560681  4.979474  4.705289 3.852905
## 4  3.000000  3.000000   3.000000   3.320611   3.000000   3.776060  4.470980  4.291016 4.005520
## 5  3.000000  3.000000   3.323256   3.582767   3.638250   3.000000  4.647328  4.489930 3.852905
## 6  3.000000  3.000000   3.323256   3.000000   3.638250   3.307408  4.766744  4.601624 3.264877
##    Input_b GCcontent   chr   start     end
## 1 4.483901    0.4620 chr19 3168001 3170000
## 2 4.183805    0.4590 chr19 3170001 3172000
## 3 4.670883    0.3865 chr19 3182001 3184000
## 4 4.580419    0.3540 chr19 3184001 3186000
## 5 4.333751    0.3730 chr19 3186001 3188000
## 6 4.498091    0.3325 chr19 3188001 3190000

To asses wether the number of reads in a tile depend on the GC content of the tile (column GCcontent), we can simply plot the two measures against each other.

2.1.1 Exercise 1

Read in the data and look at the relationship of GC content and log2 read counts for the input control samples. What do you observe?

sel.input <- grep("Input", colnames(DL))
ra <- range(DL[, sel.input])
par(mfrow = c(1, 2))
for (sn in sort(colnames(DL)[sel.input])) {
    smoothScatter(DL$GCcontent, DL[, sn], xlab = "GC content", ylab = sprintf("log2 %s", 
        sn), ylim = ra)
}

- The first input sample shows a clear correlation with GC content and varies roughly 4-fold as a function of GC content.
- The second input sample show very little variation as a function of GC content.

The differences seen here between the two samples are unfortunately quite common and do not only happen between different labs, but can also occur in samples from the same lab. This is especially true when comparing old and new data, which may differ in many technical aspects including chromatin fragmentation method, library generation protocols and sequencing chemistry. Although “no GC bias” represents the ideal case, a GC bias of small to moderate strength (as the ones here) can be dealt with if the ChIP-seq samples to be compared show a similar amount of GC bias.

2.1.2 Exercise 2

Plot the GC content versus counts for IP samples to check if they suffer from any GC bias.

Can you observe any GC bias in these samples?
Can you observe any signal in these samples? It helps to think about what fraction of tiles you would expect to be enriched for a given histone modification.

# index for IP samples
sel <- 1:8
ra <- range(DL[, sel])
par(mfrow = c(2, 4))
for (sn in colnames(DL[, sel])) {
    smoothScatter(DL$GCcontent, DL[, sn], xlab = "GC content", ylab = sprintf("log2 %s", 
        sn), ylim = ra)
}