library(QuasR)
<- tempdir()
td file.copy(system.file(package="QuasR", "extdata"), td, recursive=TRUE)
Sys.chmod(file.path(td, "extdata")) # make sure "extdata" is writable
Chromatin Immuno-Precipitation and Sequencing (ChIP-seq) 2, data analysis and interpretation
Processing a ChIP-seq Dataset from Raw Reads
The main steps in a ChIP-seq workflow are similar to what we would do for RNA-seq (apart from peak finding - genes
are the features of interest) or ATAC-seq:
- align reads to reference genome
- quality control
- identify regions of interest (peak finding)
- quantify regions of interest
- compare samples (IP vs. input, IP1 vs. IP2)
A large number of tools are available for these steps. In this example, we will use QuasR (Gaidatzis et al. 2015) for alignment and quality control, and csaw (Lun and Smyth 2014) for peak finding.
Prepare: Copy example data to temporary directory
We use a small example dataset that comes with the QuasR package, containing reads from a histone 3 lysine 4 trimethyl (H3K4me3) ChIP-seq experiment. We start the example workflow by copying those files into a temporary directory td, into a subfolder called extdata:
Step 1: Align reads using the qAlign function
As we have seen in the session about RNA-seq, QuasR needs a text file that lists all raw sequence files, and the reference genome sequence (here a fasta file). qAlign will take care of the rest and store newly generated bam files with the raw sequence files:
<- file.path(td, "extdata", "samples_chip_single.txt")
sampleFile <- file.path(td, "extdata", "hg19sub.fa")
genomeFile
<- qAlign(sampleFile, genome=genomeFile) proj
proj
Project: qProject
Options : maxHits : 1
paired : no
splicedAlignment: FALSE
bisulfite : no
snpFile : none
geneAnnotation : none
Aligner : Rbowtie v1.42.0 (parameters: -m 1 --best --strata)
Genome : /private/var/folders/qm/f4lwc7c10193k2zfy4smk2.../hg19sub.fa (file)
Reads : 2 files, 2 samples (fastq format):
1. chip_1_1.fq.bz2 Sample1 (phred33)
2. chip_2_1.fq.bz2 Sample2 (phred33)
Genome alignments: directory: same as reads
1. chip_1_1_a8ad7322799c.bam
2. chip_2_1_a8ad5c5e2bd1.bam
Aux. alignments: none
Step 2: Quality Control
In addition to assessing technical quality of sequencing data, quality control in ChIP-seq experiments typically checks for:
- reproducibiliy (consistency between replicate experiments)
- spikyness (defined genomic regions of increased read densities)
These features can be assessed in many ways, for example using QuasR’s qQCReport function, tools like FastQC, and by visualizing the alignment density along the genome (for example by creating a wiggle file and upload it to UCSC’s genome browser or look at the bam files directly using IGV. The plot below was generated using the Gviz package that allows you to create genome browser like plots within R (it has a detailed vignette with many examples and is highly recommended).
Step 2a: Quality Control: GC bias
- alignment density often depends on sequence composition (e.g. GC percentage)
- this may lead to false positive/negative results, especially when comparing datasets with different degrees of GC bias
- GC-bias can be assessed by plotting %GC in genomic tiles (e.g. windows of 5kb) versus the number of reads
- in practice, you want to compare samples whith little GC bias, and importantly, similar GC bias
We can create such a plot for our data easily ourselves (even though with the small dataset we have here, it will not be very informative):
# load the genome sequence
library(Biostrings)
Loading required package: XVector
Attaching package: 'Biostrings'
The following object is masked from 'package:grid':
pattern
The following object is masked from 'package:base':
strsplit
<- readDNAStringSet(genomeFile)
gnm <- seqlengths(gnm)
chrs
# tile genome into bins
<- tileGenome(chrs, tilewidth = 1000)
bins <- unlist(bins)
bins
# get sequences of bins
<- getSeq(gnm, bins)
binSeqs <- oligonucleotideFrequency(binSeqs, width = 1)
binBases
# get percent G+C values for bins
<- rowSums(binBases[, c("C","G")]) /
percGC rowSums(binBases) * 100
# get ChIP-seq read counts for bins
<- qCount(proj, bins) binCounts
counting alignments...
done
# visualize
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ lubridate::%within%() masks IRanges::%within%()
✖ dplyr::collapse() masks Biostrings::collapse(), IRanges::collapse()
✖ dplyr::combine() masks BiocGenerics::combine()
✖ purrr::compact() masks XVector::compact()
✖ dplyr::desc() masks IRanges::desc()
✖ tidyr::expand() masks S4Vectors::expand()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::first() masks S4Vectors::first()
✖ dplyr::lag() masks stats::lag()
✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
✖ purrr::reduce() masks GenomicRanges::reduce(), IRanges::reduce()
✖ dplyr::rename() masks S4Vectors::rename()
✖ lubridate::second() masks S4Vectors::second()
✖ lubridate::second<-() masks S4Vectors::second<-()
✖ dplyr::slice() masks XVector::slice(), IRanges::slice()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
<- data.frame(percGC = percGC,
df log2count = log2(binCounts[, -1] + 1)) |>
pivot_longer(cols = starts_with("log2count."),
names_to = "sample", values_to = "log2count") |>
mutate(sample = sub("log2count.", "", sample))
ggplot(df, aes(percGC, log2count, colour = sample)) +
geom_point() +
geom_smooth(se = FALSE) +
labs(x = "Percent G+C",
y = "Reads per tile (log2)")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Step 2b: Reproducibility
- the similarity between replicates is a good quality measure
- for an unbiased comparison, counts in tiling windows along the genome can be used
- problematic windows (e.g. alignment artefacts, repeat regions) may have to be manually removed (for a detailed discussion see (Amemiya, Kundaje, and Boyle 2019))
- if only few windows contain signal (few bindings sites), peak regions should be used instead of tiling windows (see below for identification of peaks)
# re-use binCounts from above
<- log2(binCounts[, -1] + 1)
lcnt ggplot(lcnt, aes(Sample1, Sample2)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, colour = "red") +
annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.5,
label = sprintf("R = %.3f", cor(lcnt[, "Sample1"],
"Sample2"])), bty = "n") lcnt[,
Warning in annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.5, :
Ignoring unknown parameters: `bty`
Step 2c: Variable IP enrichment factors
Assuming that the IP efficiency and the resulting enrichment factor was not the same across your experiments:
- How would that show itself in the data?
- What problems would result from that?
- How could you correct for it?
Let’s use last week’s ChIP-model to simulate artificial ChIP data, one non-enriched and two enriched samples with different IP enrichment factors:
<- 10000 # number of genomic bins
N <- 0.05 # fraction of enriched bins
f <- 12 # average number of reads per non-enriched bin
M <- c(3, 6) # IP enrichment factors
e
set.seed(42) # reproducible randomness
# enriched bins
<- sample(x = c(TRUE, FALSE), size = N,
isEnriched replace = TRUE, prob = c(f, 1 - f))
# counts
<- cbind(input = rpois(N, lambda = M),
cnt IP_1 = ifelse(isEnriched, rpois(N, e[1] * M), rpois(N, M)),
IP_2 = ifelse(isEnriched, rpois(N, e[2] * M), rpois(N, M)))
# normalized counts
<- log2(t(t(cnt) / colSums(cnt)) * 1e6 + 1)
lcpm cor(lcpm)
input IP_1 IP_2
input 1.000000000 -0.001684349 -0.002014354
IP_1 -0.001684349 1.000000000 0.494626911
IP_2 -0.002014354 0.494626911 1.000000000
Now we create MA-plot of all pairs:
<- function(x, ...) { # x: two-column matrix with log counts
plotMA <- data.frame(M = x[,2] - x[,1],
df A = (x[,1] + x[,2]) / 2)
ggplot(df, aes(A, M)) +
geom_point(position = position_jitter(width = 0.1, height = 0.1),
colour = alpha("gray30", 0.3)) +
labs(x = "A",
y = paste0("M (", colnames(x)[2], " - ",
colnames(x)[1], ")")) +
geom_hline(yintercept = 0, linetype = "dashed")
}
plotMA(lcpm[, c(1, 2)], ylim = c(-4, 4))
plotMA(lcpm[, c(1, 3)], ylim = c(-4, 4))
plotMA(lcpm[, c(2, 3)], ylim = c(-4, 4))
Notice the clearer signal in IP_2
(mid) compared to IP_1
(top), and the apparent “differential binding” between IP_2
vs. IP_1
(bottom).
Let’s normalize this effect using loess normalization (see ?limma::normalizeBetweenArrays
for details):
library(limma)
Attaching package: 'limma'
The following object is masked _by_ '.GlobalEnv':
plotMA
The following object is masked from 'package:BiocGenerics':
plotMA
# loess normalize IP samples (replicates)
<- lcpm
normLcpm 2:3] <- normalizeBetweenArrays(lcpm[, 2:3], method = "cyclicloess")
normLcpm[,
plotMA(normLcpm[, c(1, 2)], ylim = c(-4, 4))
plotMA(normLcpm[, c(1, 3)], ylim = c(-4, 4))
plotMA(normLcpm[, c(2, 3)], ylim = c(-4, 4))
Step 3: Identifying Regions of Interest
In contrast to typical RNA-seq, the regions of interest are often not a priori known in ChIP-seq. One might:
- use pre-defined regions in which to quantify the binding, such as:
promoter regions of known genes, CpG islands, predicted transcription factor binding sites, etc. - identify regions as peaks from the data, using tools such as MACS (Zhang et al. 2008), MACS2, MACS3 or PeakSeq
- identify differentially bound windows based on genome tiling windows, using for example csaw
Because csaw works well with R and is related to edgeR that you already know, we will use it here.
Step 3a: Identifying Peaks with csaw
- Identifying differentially bound peaks
The csaw package was primarily designed to identify differentially bound regions between groups of replicated ChIP samples (e.g. for different treatment conditions or genotypes). An excellent description can be opened using the csawUsersGuide() function.
- Identifying peaks relative to local background (emulating MACS)
csaw is however quite flexible and can also be used to identify windows of high alignment density relative to a local neighborhood, similar to the approach implemented in MACS (Zhang et al. 2008). This will be illustrated below.
Estimate average fragment length
The average ChIP-seq fragment size can be estimated from the data, as plus- and minus-strand alignments mark the starts and ends of ChIP fragments, respectively (illustration from (Wilbanks and Facciotti 2010)).
Estimate average fragment length using Autocorrelation
# load library
library(csaw)
# get bam files
<- alignments(proj)$genome$FileName
bam.files
# obtain average fragment length
# from cross-correlation of plus/minus strand alignments
<- correlateReads(bam.files, max.dist = 500)
x
# visualize (raw and smoothed)
<- data.frame(delay = 0:500,
df ccf = x,
ccf_smooth = runmean(Rle(x), k = 101, endrule = "constant"))
<- df[which.max(df$ccf_smooth), "delay"]
frag_len ggplot(df, aes(delay)) +
geom_point(aes(y = ccf)) +
geom_line(aes(y = ccf_smooth), colour = "red") +
geom_vline(xintercept = frag_len, colour = "blue") +
labs(x = "Delay (base pairs)",
y = "Cross-correlation function") +
annotate("text", x = Inf, y = Inf, hjust = 1.1, vjust = 1.5,
label = paste0("fragment length = ", frag_len, "bp"))
Count alignments per window along the genome
As before, we could use qCount
from the QuasR package to obtain counts for windows, but csaw also provides a window counting function, which returns a SummarizedExperiment
object. We will use it here because it simplifies the subsequent peak finding using csaw:
<- windowCounts(bam.files, ext = frag_len)
data head(assays(data)$counts) # counts
[,1] [,2]
[1,] 4 8
[2,] 3 13
[3,] 3 13
[4,] 6 11
[5,] 5 11
[6,] 4 8
head(rowRanges(data)) # ... for windows
GRanges object with 6 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 1801-1850 *
[2] chr1 1851-1900 *
[3] chr1 1901-1950 *
[4] chr1 1951-2000 *
[5] chr1 2001-2050 *
[6] chr1 8651-8700 *
-------
seqinfo: 3 sequences from an unspecified genome
Please note that windowCounts()
counts fragments overlapping each window. In the example here, many fragments are counted several times if they overlap more than a single window. Be careful to avoid this redundant use of your data:
# get the total number of alignments
# ... from the QuasR qProject object
alignmentStats(proj)[, "mapped"]
Sample1:genome Sample2:genome
2339 3609
# ... from the SummarizedExperiment colData
colData(data)[, "totals"]
[1] 2339 3609
# ... and compare to the total counts
colSums(assays(data)$counts)
[1] 7509 14341
Let’s quickly repeat what a SummarizedExperiment
is: It is essentially just a fancy matrix (2D array), one that can store more than a single version of the matrix (assays
, for example raw and normalized), and that can also store metadata on the rows (rowData()
) or columns (colData()
). It is also the parent class of the SingleCellExperiment
that we used to store single cell data. In summary, it stores feature-by-sample measurements with additional annotation for features (rows) and samples (columns).
You might wonder why we don’t use a simple matrix for that purpose? The specialized classes have the advantage, that they keep annotation (e.g. gene symbols or sample descriptions) together with your quantitative data. If you filter your genes based on the counts, the gene annotation is filtered at the same time, and there is no danger of merge or reorder mistakes. Furthermore, it become very easy to save and share the data as a single R object (for example using saveRDS()
).
Below is a graphical representation of the object and the accessors for the different data slots (see (Huber et al. 2015) or the documentation ?"SummarizedExperiment-class"
in the SummarizedExperiment package for more details):
Strategies to Identify ChIP-seq Peaks
Depending on the chipped feature, the ChIP-seq signal and thus the strategy to identify peaks has to be adjusted (see for example the figure below from (Pepke, Wold, and Mortazavi 2009) and also (Wilbanks and Facciotti 2010))
Identify peaks (enriched regions over local background)
# identify peak windows (3-fold over background)
<- 5000 # background from local neighborhood
surrounds <- resize(rowRanges(data), width = surrounds, fix = "center")
neighbour <- regionCounts(bam.files, regions = neighbour, ext = frag_len)
wider <- filterWindowsLocal(data, wider)$filter
enr.over.neighbour <- rowRanges(data)[enr.over.neighbour > log2(3)]
peaks.windows
# merge peak windows into peaks
<- 1000 # fuse windows closer than this
maxdist <- mergeWindows(peaks.windows, tol = maxdist)
peaks peaks
$ids
[1] 1 1 1 1 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5
[39] 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
$regions
GRanges object with 6 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 1851-2050 *
[2] chr1 13101-13300 *
[3] chr1 31551-32250 *
[4] chr2 3401-3800 *
[5] chr3 2451-3150 *
[6] chr3 28551-29650 *
-------
seqinfo: 3 sequences from an unspecified genome
Export peak coordinages as BED file
# exporting a BED file with peak coordinates
export(peaks$region, file.path(td, "peaks.bed"))
cat(paste(readLines(file.path(td, "peaks.bed"), n = 6), collapse = "\n"))
chr1 1850 2050 . 0 .
chr1 13100 13300 . 0 .
chr1 31550 32250 . 0 .
chr2 3400 3800 . 0 .
chr3 2450 3150 . 0 .
chr3 28550 29650 . 0 .
Visualize peaks with Gviz
Visualize peaks with simple R
<- data.frame(chr = seqnames(rowRanges(data)),
df pos = start(rowRanges(data)),
count = rowSums(assay(data, "counts"))) |>
filter(chr == "chr3")
<- seqnames(peaks$region) == "chr3"
j ggplot() +
geom_rect(inherit.aes = FALSE,
data = data.frame(xmin = start(peaks$region[j]),
xmax = end(peaks$region[j]),
ymin = 0, ymax = 1.04 * max(df$count)),
mapping = aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
fill = alpha("red", 0.2)) +
geom_line(data = df, inherit.aes = FALSE,
mapping = aes(pos, count)) +
scale_y_continuous(expand = expansion(mult = 0)) +
labs(x = "Position on chr3",
y = "Alignments per window")
Step 4: Quantify regions of interest
Once regions of interest (peaks) are identified, it is often useful to go back to the raw data and quantify the signal per region and individual sample, for example to compare replicates. We will do this using qCount
from the QuasR package, which does not count reads redundantly:
<- qCount(proj, peaks$region) cnt
counting alignments...done
head(cnt)
width Sample1 Sample2
1 200 3 14
2 200 7 14
3 700 161 534
4 400 78 531
5 700 229 556
6 1100 135 410
We can now repeat our comparison of replicates, this time using the counts in peaks (rather than in genomic tiles, which often gives a better comparison as most genomic tiles may only show noise rather than signal):
# calculate counts per million (cpm)
<- alignmentStats(proj)[, "mapped"]
Ntot <- t(t(cnt[, c("Sample1","Sample2")]) / Ntot * 1e6)
cpm
<- data.frame(cpm)
df ggplot(df, aes(Sample1, Sample2)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, col = "red") +
scale_x_log10() +
scale_y_log10() +
annotate("text", x = min(cpm), y = max(cpm), hjust = 0, vjust = 1,
label = sprintf("R = %.3f",
cor(log2(df$Sample1),
log2(df$Sample2))))
Step 5: Compare Samples (IP vs. input, IP1 vs. IP2)
Finally, alignment counts in regions of interest can be compared between samples, similarly as we have done it for RNA-seq counts. The promoter counts can be downloaded from here: ChIP_promoter_counts.rds, and I saved them into the folder data/ChIP/
. A useful measure are log-ratios of IP counts over input counts, also referred to as enrichment values:
# load count table:
# regions: promoter regions (TSS +/- 1kb, mm10)
# mouse ES cells, H3K4me2 ChIP (GSM686995)
# mouse ES cells, input chromatin (GSM686996)
<- readRDS("data/ChIP/ChIP_promoter_counts.rds")
cnt head(cnt)
seqnames start end width strand EntrezGeneId
1 chr9 21066093 21068092 2000 - 100009600
2 chr7 84963009 84965008 2000 - 100009609
3 chr10 77710457 77712456 2000 + 100009614
4 chr11 45807083 45809082 2000 + 100009664
5 chr4 144161651 144163650 2000 - 100012
6 chr4 134767004 134769003 2000 - 100017
raw.GSM686995_MmES_H3K4me2_a raw.GSM747545_MmES_Input_a
1 195 35
2 0 3
3 8 16
4 0 8
5 1 7
6 993 11
cpm.GSM686995_MmES_H3K4me2_a cpm.GSM747545_MmES_Input_a
1 12.67881436 3.2062052
2 0.00000000 0.2748176
3 0.52015649 1.4656938
4 0.00000000 0.7328469
5 0.06501956 0.6412410
6 64.56442388 1.0076645
# calculate enrichments
<- cnt |>
df as.data.frame() |>
mutate(enr = log2((cpm.GSM686995_MmES_H3K4me2_a + 1) /
+ 1)))
(cpm.GSM747545_MmES_Input_a
ggplot(df, aes(enr)) +
geom_histogram(bins = 100) +
labs(x = "Enrichment (log2 H3K4me2 / input)",
y = "Number of promoters")
Do you remember why we add a constant to the counts?
Compare samples directly
Another useful plot to compare ChIP-seq samples is the MA-plot, that compares mean abundance (A) with the log2 fold-changes (or enrichment, M) between a pair of samples:
<- df |>
df mutate(A = log2(cpm.GSM686995_MmES_H3K4me2_a + cpm.GSM747545_MmES_Input_a + 1))
ggplot(df, aes(A, enr)) +
geom_jitter(width = 0.1, height = 0.1, colour = alpha("gray30", 0.2)) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(x = "A", y = "M")
Because of the very strong signal-to-noise ratio in this sample, A and M are strongly correlated (there are no non-enriched promoters with large A values).
Finally, a simple scatter plot is always a powerful way to study the relationship between two ChIP-seq sampls:
ggplot(df, aes(cpm.GSM686995_MmES_H3K4me2_a + 1, cpm.GSM747545_MmES_Input_a + 1)) +
geom_jitter(width = 0.01, height = 0.01, colour = alpha("gray30", 0.2)) +
geom_abline(slope = 1, intercept = 0) +
labs(x = "Counts per million in H3K4me2 (log scale)",
y = "Counts per million in Input (log scale)") +
scale_x_log10(limits = c(1, 200)) +
scale_y_log10(limits = c(1, 10))
Warning: Removed 2174 rows containing missing values or values outside the scale range
(`geom_point()`).
Compare ChIP data to other types of data
Much can be learned by comparing ChIP data with other types of data, such as expression (RNA-seq) or DNA sequence features (transcription factor binding sites, sequence composition). Here is an example that calculates the CpG observed-over-expect ratio for promoters and compares that to the H3K4me2 ChIP enrichments calculated above (the code below will require the genome sequence in the form of a BSgenome
package to be installed):
# create GRanges object for promoter regions
head(cnt)
seqnames start end width strand EntrezGeneId
1 chr9 21066093 21068092 2000 - 100009600
2 chr7 84963009 84965008 2000 - 100009609
3 chr10 77710457 77712456 2000 + 100009614
4 chr11 45807083 45809082 2000 + 100009664
5 chr4 144161651 144163650 2000 - 100012
6 chr4 134767004 134769003 2000 - 100017
raw.GSM686995_MmES_H3K4me2_a raw.GSM747545_MmES_Input_a
1 195 35
2 0 3
3 8 16
4 0 8
5 1 7
6 993 11
cpm.GSM686995_MmES_H3K4me2_a cpm.GSM747545_MmES_Input_a
1 12.67881436 3.2062052
2 0.00000000 0.2748176
3 0.52015649 1.4656938
4 0.00000000 0.7328469
5 0.06501956 0.6412410
6 64.56442388 1.0076645
<- as(cnt[,1:5], "GRanges")
cntgr
# get corresponding sequences
library("BSgenome.Mmusculus.UCSC.mm10")
<- getSeq(BSgenome.Mmusculus.UCSC.mm10, cntgr)
cntseq
# calculate mono-/di-nucleotide frequencies
<- oligonucleotideFrequency(cntseq, width = 1, as.prob = TRUE)
f.mono <- oligonucleotideFrequency(cntseq, width = 2, as.prob = TRUE)
f.di
# calculate CpG observed/expected ratio
$CpGexp <- f.mono[,"C"] * f.mono[,"G"]
df$CpGoe <- f.di[,"CG"] / df$CpGexp
df
# create plot
ggplot(df, aes(CpGoe, enr)) +
geom_jitter(height = 0.1, colour = alpha("gray30", 0.2)) +
labs(x = "CpG obs/expected",
y = "H3K4me2 (log2 IP/input)") +
geom_hline(yintercept = 0.5, linetype = "dashed") + # H3K4me2 enrichment threshold
geom_vline(xintercept = 0.4, linetype = "dashed") # high-CpG promoter threshold
Exercises
- Export the list of identified peaks as a bed file, but just containing the peaks on chromosome 3.
<- peaks$regions[seqnames(peaks$regions) == "chr3"]
selPeaks selPeaks
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr3 2451-3150 *
[2] chr3 28551-29650 *
-------
seqinfo: 3 sequences from an unspecified genome
<- file.path(tempdir(), "peaks_chr3.bed")
tf export(selPeaks, tf)
cat(paste(readLines(tf), collapse = "\n"))
chr3 2450 3150 . 0 .
chr3 28550 29650 . 0 .
- What is the impact of increasing or decreasing the genomic neighborhood (surrounds variable) on the resulting peaks? Try to identify a pattern by rerunning peak finding using different values and visualizing the results using the code above.
# helper function to call peaks given `surrounds`
# expects to find the following in the global environment:
# data, bam.files, frag_len, maxdist
<- function(surrounds, cutoff = log2(3)) {
get_peaks suppressWarnings({
<- resize(rowRanges(data), width = surrounds, fix = "center")
neighbour
})<- regionCounts(bam.files, regions = neighbour, ext = frag_len)
wider <- filterWindowsLocal(data, wider)$filter
enr.over.neighbour <- rowRanges(data)[enr.over.neighbour > cutoff]
peaks.windows mergeWindows(peaks.windows, tol = maxdist)
}
<- c(1000, 2500, 5000, 50000)
surrounds_values <- lapply(surrounds_values, get_peaks)
peak_list
# visualize
<- "chr3"
selchr <- lapply(peak_list, function(x) x$regions[seqnames(x$regions) == selchr])
peak_list_selchr <- cumsum(lengths(peak_list_selchr))
cumnumpeaks # ... extract alignments
<- data.frame(chr = seqnames(rowRanges(data)),
df1 pos = start(rowRanges(data)),
count = rowSums(assay(data, "counts"))) |>
filter(chr == selchr)
# ... extract peaks (for different `surrounds_values`)
<- 0.08 * max(df1$count)
dy <- do.call(rbind, lapply(seq_along(surrounds_values), function(i) {
df2 # o <- cumnumpeaks[i] + seq.int(lengths(peak_list_selchr)[i]) - 1
<- i
o data.frame(surrounds = surrounds_values[i],
parameter_set = factor(surrounds_values[i], levels = surrounds_values),
xmin = start(peak_list_selchr[[i]]),
xmax = end(peak_list_selchr[[i]]),
ymin = -o * dy, ymax = -o * dy + 0.75 * dy)
}))
ggplot() +
geom_rect(data = df2, linewidth = 2,
mapping = aes(xmin = xmin, xmax = xmax,
ymin = ymin, ymax = ymax,
colour = parameter_set,
fill = parameter_set)) +
geom_polygon(data = df1, inherit.aes = FALSE,
mapping = aes(pos, count),
fill = "gray50") +
labs(x = "Position on chr3",
y = "Alignments per window",
fill = "`surrounds` value",
colour = "`surrounds` value")
- Compare the promoter H3K4me2 enrichments with the gene expression in ES cells (using EntrezID to link the two sets, using the RNA-seq counts form RNA_gene_counts.rds)
<- readRDS("data/ChIP/RNA_gene_counts.rds")
rna
<- left_join(x = df,
df y = rna |> select(starts_with(c("EntrezGeneId", "log2rpkm"))),
by = join_by("EntrezGeneId"))
ggplot(df, aes((log2rpkm.ES.RNA.a + log2rpkm.ES.RNA.b) / 2, enr)) +
geom_jitter(height = 0.1, colour = alpha("gray30", 0.2)) +
geom_hline(yintercept = 0.5, linetype = "dashed") +
geom_vline(xintercept = 0.5, linetype = "dashed") +
labs(x = "Mean RNA level (log2 RPKM)",
y = "Promoter H3K4me2 (log2 IP/input)")
Warning: Removed 1227 rows containing missing values or values outside the scale range
(`geom_point()`).