## misc.
library(BiocParallel) # parallel computation
library(
# to manipulate [Ranged]SummarizedExperiment objects
SummarizedExperiment) library(stringr) # manipulating character strings
## plotting
library(ggplot2) # for plotting
## genomes
# Mouse Genome
library(BSgenome.Mmusculus.UCSC.mm10)
# Human genome
# NB: if using FMI RStudio server, load a different version
library(BSgenome.Hsapiens.UCSC.hg38)
<- BSgenome.Hsapiens.UCSC.hg38
BSgen.Hu # library(BSgenome.Hsapiens.GENCODE.GRCh38.gencode38)
# BSgen.Hu <- BSgenome.Hsapiens.GENCODE.GRCh38.gencode38
## pacakges w/ sequence motif-related functions
library(Biostrings) # manipualting sequences & motif representations
library(universalmotif) # manipulating motif representations
library(MotifDb) # accessing motif DBs
library(TFBSTools) # accessing motif DBs
library(motifmatchr) # scanning sequences for motif occurences
library(rtracklayer) # import/export from BED files & more
library(JASPAR2018) # JASPAR 2018 database
Sequence motifs: from position weight matrix to motif enrichment
Set Up
Goal
Learn:
- what a sequence motif is, how it is represented, how to retrieve a known motif, and how one can be built from scratch;
- how to score a sequence against a motif, how to consider scores as markers of potential binding sites, and how to search DNA for potential Transcription Factor (TF) binding sites;
- how to test if certain genomic regions are enriched for specific TF binding sites.
The focus in this session will be on the concepts, models, and processes, rather than the actual code.
NB: this handout is voluntarily made more extensive than what will be seen in class, in particular when considering the content of the advanced document, so as to serve as a potential reference for the interested. It will combine more theoretical sections with more research-like sections. Exam questions will only relate to things covered in class!
Required packages
We will use the following set of packages during the lecture
(notice that the Human genome package differs if youare working on the FMI’s RStudio server):
Example data
A quick overview of data used in this session.
Human GATA2 motifs
We will illustrate motif creation, searching and scoring, using data and database entries for Human transcription factor GATA2, amongst others.
Human K562 TF ChIP-seq
We will illustrate de novo motif calling on some transcription factor ChIP-seq data obtained on Human K562 cells for GATA2 and MYC:
GATA2:
Original link: ENCSR000EWG
MYC:
Original link: ENCSR744JJU
Immune cell accessible chromatin regions
We will illustrate analysing the enrichment of regions of interest in motif occurrences using an example dataset from an ATAC-seq experiment that determined accessible chromatin regions in different Mouse immune cells.
Künzli et al. 2020,
“Long-lived T follicular helper cells retain plasticity and help sustain humoral immunity”
PAPERThe authors identify a new role of T follicular helper (TFH) cells during viral infection. During viral infection, antigen-specific CD4+ T cells undergo proliferative expansion and differentiate into several subtypes among which TFH stay out as long-lived, highly plastic and can give rise to all subtypes after rechallenge.
Data processing in brief:
Reads were mapped to Mouse (mm10) using bowtie2
Reads were counted in genomic windows of 150bp
Differential accessibility was tested using the csaw/edgeR framework
Neighbouring differentially accessible windows were merged to form differentially accessible regions.
Raw data deposited in GEO: GSE139423
Processed data: merged_peaks_TFH_TH1.tsv.gz
Establishing, representing & accessing sequence motifs
Historical background
A motif is a repeating pattern, with possible small variations. In genomics, motifs are sequences that repeat in DNA or RNA, with possible nucleotide variations. Such motifs are typically associated with some sort of function, e.g. DNA binding sites for transcription factors or transcription initiation, RNA binding sites for translation initiation or micro-RNA response elements, etc. In practice, typical DNA motifs are 6-20 nucleotides long.
Historically, DNA sequence motifs were first defined from sets of experimentally-identified transcription factor binding sites, e.g. using electrophoretic mobility shift assay (EMSA), protein binding arrays, ChIP-Seq or high-throughput SELEX.
These experiments yield sets of sequences with high affinity for a given TF under study. For example, let’s take a look at some sequences identified as binding the human transcription factor GATA2 (hGATA2):
M. Merika & S. H. Orkin 1993,
“DNA-binding specificity of GATA family transcription factors” PAPERThis is just a subset of the GATA-bound sequences, and it is already a lot of raw information! We need to find a more convenient way of representing a sequence motif and its variations than using full sequence alignments like this (especially as multiple sequence alignments quickly become computationally costly to calculate). Obviously, if there were no variations, the sequence of the exact match would suffice.
These 19 sequences are aligned around the core “GATA” motif that is recognised by the hGATA2 TF. Considering all these sequences aligned in this way, we can derive a “consensus” sequence for the binding site, which here is “SGATAN”. This indicates a highly conserved “GATA” sequence in the middle, with a preceding position that is either G or C, and a trailing position that can be any nucleotide. However, this consensus representation loses a lot of information about the precise variability at each position. Can we do better programmatically?
Computationally representing sequence motifs
Motif packages in R
There are quite a number of different Bioconductor packages and classes designed to computationally represent, manipulate and use sequence motifs. Many are built off of base functionality defined in the Biostrings package. However, others (such as TFBSTools) define their own classes, hindering inter-operability and leading to redundancy. The universalmotif package aims at providing a uniform interface to all these different classes. Despite this, we still rely on all these different packages in order to use all required functionality, e.g. fast motif searches.
Note that R packages are born, used, and some fall into disuse, for lack of interest, or more commonly, a maintainer. Some of the above packages may no longer work in a few years, forcing one to search for alternatives.
The PCM
- Looking at the sequence alignment above, by using the core motif alignment and treating positions independently, we can derive a compact representation for it, i.e. a table counting the occurrences for each nucleotide (here, DNA, so A, C, T, G) in each position. This is what is known as a position count matrix (PCM), or as a position frequency matrix (PFM):
- Some hGATA2 example sequences (restricted to the core of the binding site) have been prepared for you to load into R. Once loaded, we can view them in a pretty fashion by showing a
DNAStringSet
object made from them using package Biostrings:
<- readRDS(
hGATA2_motif_seqs 'data/Motifs/hGATA2_binding_sites.rds')
show(DNAStringSet(hGATA2_motif_seqs))
DNAStringSet object of length 19:
width seq
[1] 6 CGATAG
[2] 6 CGATAT
[3] 6 GGATAG
[4] 6 GGATAG
[5] 6 GGATAC
... ... ...
[15] 6 CGATAT
[16] 6 CGATAG
[17] 6 GGATAT
[18] 6 GGATAC
[19] 6 GGATAC
- From these sequences, we can create a position count matrix (PCM) using the
create_motif
function from universalmotif:
<- create_motif(
hGATA2_motif type="PCM", name = "GATA2")
hGATA2_motif_seqs, hGATA2_motif
Motif name: GATA2
Alphabet: DNA
Type: PCM
Strands: +-
Total IC: 9.19
Pseudocount: 0
Consensus: SGATAN
Target sites: 19
S G A T A N
A 0 0 19 0 19 2
C 9 0 0 0 0 6
G 10 19 0 0 0 8
T 0 0 0 19 0 3
- This type of representation entirely depends on the number of sequences that were initially aligned. We need something that is more generally applicable.
The PPM
- Another representation, which abstracts from the total number of counts, is the position probability matrix (PPM). It normalizes the count matrix by the number of observations, resulting in an estimate for the probability of observing each letter at a given position. We can use functions
convert_type
andview_motifs
from package universalmotif to convert our previous motif object, and show a graphical representation of it:
convert_type(hGATA2_motif, type="PPM")
Motif name: GATA2
Alphabet: DNA
Type: PPM
Strands: +-
Total IC: 9.19
Pseudocount: 0
Consensus: SGATAN
Target sites: 19
S G A T A N
A 0.00 0 1 0 1 0.11
C 0.47 0 0 0 0 0.32
G 0.53 1 0 0 0 0.42
T 0.00 0 0 1 0 0.16
view_motifs(hGATA2_motif, use.type = "PPM")
- This motif representation no longer depends on the number of sequences aligned. However, it does not give us any idea of how “surprising” it would be to observe any given sequence that matches the motif. How informative is the motif? How can we use this to actually find other occurrences of it in genomic sequences? We need some idea of what sequences we can expect to find by chance.
The PWM
- Closely related to the previous is the position weight matrix (PWM), which uses the PPM and a background model to get an idea of “expected” sequences, and thereby can assign weights to each letter at each position. In this context, “background” refers here to the base composition at non-specific sites (i.e. here, sequences that do not necessarily bind the TF). The background is often assumed to be either uniform (i.e. \(1/4\) for each of A, T, C, and G), or to be the genome-wide nucleotide frequencies. PWM scores are defined as the log-ratio of the PPM and the background frequencies:
\[ {\rm PWM}_{ij} = \log_2\left(\frac{\rm PPM_{ij}}{\rm bg_{j}}\right) \] where \({\rm bg}_{j}\) is the background frequency of base \(j\).
Note: PWM scores are negative if the probability for a given letter at a given position is below the background probability.
In cases where relatively few sequences were used to derive a motif, or when a position was very highly conserved, there can be 0 values in the PCM. In order to confine the derived PWM scores to finite values, one has to make use of pseudo-counts:
## no pseudo-counts:
convert_type(hGATA2_motif, type="PWM")
Motif name: GATA2
Alphabet: DNA
Type: PWM
Strands: +-
Total IC: 9.19
Pseudocount: 0
Consensus: SGATAN
Target sites: 19
S G A T A N
A -Inf -Inf 2 -Inf 2 -1.25
C 0.92 -Inf -Inf -Inf -Inf 0.34
G 1.07 2 -Inf -Inf -Inf 0.75
T -Inf -Inf -Inf 2 -Inf -0.66
## pseudocount of 1:
convert_type(hGATA2_motif, type="PWM", pseudocount = 1)
Motif name: GATA2
Alphabet: DNA
Type: PWM
Strands: +-
Total IC: 9.19
Pseudocount: 0
Consensus: SGATAN
Target sites: 19
S G A T A N
A -4.32 -4.32 1.94 -4.32 1.94 -1.15
C 0.89 -4.32 -4.32 -4.32 -4.32 0.32
G 1.04 1.94 -4.32 -4.32 -4.32 0.72
T -4.32 -4.32 -4.32 1.94 -4.32 -0.62
# possible bug: pseudocount in printed object is still shown as 0 when converting
## compare to manually specifying uniform background frequencies:
<- create_motif(hGATA2_motif_seqs, type="PPM", name = "GATA2", pseudocount = 1, bkg = rep(0.25, 4))
ppm convert_type(ppm, type="PWM")
Motif name: GATA2
Alphabet: DNA
Type: PWM
Strands: +-
Total IC: 7.84
Pseudocount: 1
Consensus: SGATAN
Target sites: 19
S G A T A N
A -4.32 -4.32 1.94 -4.32 1.94 -1.15
C 0.89 -4.32 -4.32 -4.32 -4.32 0.32
G 1.04 1.94 -4.32 -4.32 -4.32 0.72
T -4.32 -4.32 -4.32 1.94 -4.32 -0.62
# => same, but PC is shown properly
- The example below uses as a background frequency the occurrence of A, C, G, and T on chromosome 1 of the Human genome (not uniform!).
<- oligonucleotideFrequency(BSgen.Hu$chr1, width=1)
nt_cnts nt_cnts
A C G T
67070277 48055043 48111528 67244164
<- nt_cnts / sum(nt_cnts)
nt_freqs nt_freqs
A C G T
0.2910013 0.2084989 0.2087440 0.2917558
=> notice the background frequencies across chromosome 1 are not uniform, there is a clear depletion in GC. This is common in vertebrates, as we will see later.
<- create_motif(
pwm type="PWM", name="GATA2", pseudocount = 1, bkg = nt_freqs)
hGATA2_motif_seqs,
print(pwm)
Motif name: GATA2
Alphabet: DNA
Type: PWM
Strands: +-
Total IC: 7.84
Pseudocount: 1
Consensus: SGATAN
Target sites: 19
S G A T A N
A -4.54 -4.54 1.73 -4.54 1.73 -1.37
C 1.15 -4.06 -4.06 -4.06 -4.06 0.58
G 1.30 2.21 -4.06 -4.06 -4.06 0.98
T -4.54 -4.54 -4.54 1.72 -4.54 -0.84
If you compare this to the PWM with PWM obtained above with uniform background probabilities, notice that the scores for Gs and Cs increase, and those for As and Ts decrease. This reflects the non-uniform background scores, where seeing a G or a C is more suprising than seeing an A or a T.
PWMs are the representation typically used to scan sequences and identify matches, e.g. potential TF binding sites. It however lacks a “nice” graphical representation:
view_motifs(pwm, use.type = "PWM")
=> For each position, positive weights are shown above the horizontal axis, negative below.
The ICM
- The final representation of a sequence motif uses information content as a means to highlight conserved over non-conserved sites. The information content matrix (ICM) is a scaled version of the PPM, where each position gets scaled by its information content (IC) measured in bits. Most conserved sites have an IC of 2 bits. This form can be visualised as a “sequence logo”, the most common graphical representation of a motif:
view_motifs(hGATA2_motif, use.type = "ICM")
- Note: For the curious, IC can be represented using Shannon’s entropy, which does not take background frequencies into account, or using relative entropy, the latter being better suited to non-uniform background frequencies. Each leads to different logos.
Conclusion
- We can now build a compact representation of a motif using a set of aligned sequences that contain it. However, we would not want to do this for every motif we want to work with! Many motifs have already been experimentally or computationally examined in many genomes, and their ready-made representations have been collated in databases, many of which can be accessed from within R.
Motif databases
There are several resources providing PWM matrices, some are free (i.e., Jaspar), some require a license (i.e., Transfac).
Many such resources are directly available through R packages, e.g. the MotifDb, TFBSTools and JASPAR2018 packages.
All these databases and packages collect the PWMs from different primary sources and there is a lot of redundancy in them. Even a single resource might have multiple PWMs for the same TF (for example acquired in different species, by different techniques).
Iteratively refine a MotifDB query
For example, lets check the human GATA2 transcription factor in the MotifDb package, which compiles access to several databases. The
query()
function builds a query to resources the package has access to, and fetches a list of motif representations with requested strings in the representation’s metadata.The
query()
function takes as arguments strings that should (or should not) match to any field in the underlying database. It is not always clear what strings to search for. We can iteratively define, examine, and then refine our query to the database until we are happy with the results.
query(MotifDb, andStrings=c("GATA2"))
MotifDb object of length 16
| Created from downloaded public sources, last update: 2022-Mar-04
| 16 position frequency matrices from 10 sources:
| HOCOMOCOv10: 2
| HOCOMOCOv11-core-A: 1
| HOCOMOCOv11-secondary-A: 1
| HOMER: 1
| JASPAR_2014: 1
| JASPAR_CORE: 1
| jaspar2016: 2
| jaspar2018: 4
| jaspar2022: 2
| SwissRegulon: 1
| 4 organism/s
| Hsapiens: 12
| Athaliana: 2
| Mmusculus: 1
| other: 1
Hsapiens-HOCOMOCOv10-GATA2_HUMAN.H10MO.A
Mmusculus-HOCOMOCOv10-GATA2_MOUSE.H10MO.A
Hsapiens-HOCOMOCOv11-core-A-GATA2_HUMAN.H11MO.0.A
Hsapiens-HOCOMOCOv11-secondary-A-GATA2_HUMAN.H11MO.1.A
NA-HOMER-Gata2(Zf)/K562-GATA2-ChIP-Seq(GSE18829)/Homer
...
Hsapiens-jaspar2018-GATA2-MA0036.3
Athaliana-jaspar2018-GATA20-MA1324.1
Hsapiens-jaspar2022-GATA2-MA0036.3
Athaliana-jaspar2022-GATA20-MA1324.1
Hsapiens-SwissRegulon-GATA2.SwissRegulon
=> 16 motif representations, across many resources & organisms.
Let’s focus on JASPAR 2018 entries:
query(MotifDb, andStrings=c("GATA2", "JASPAR2018"))
MotifDb object of length 4
| Created from downloaded public sources, last update: 2022-Mar-04
| 4 position frequency matrices from 1 source:
| jaspar2018: 4
| 2 organism/s
| Hsapiens: 3
| Athaliana: 1
Hsapiens-jaspar2018-GATA2-MA0036.1
Hsapiens-jaspar2018-GATA2-MA0036.2
Hsapiens-jaspar2018-GATA2-MA0036.3
Athaliana-jaspar2018-GATA20-MA1324.1
=> 4 motif representations for 2 organisms.
Let’s focus on entries for Human:
query(MotifDb, andStrings=c("GATA2", "JASPAR2018", "Hsapiens"))
MotifDb object of length 3
| Created from downloaded public sources, last update: 2022-Mar-04
| 3 position frequency matrices from 1 source:
| jaspar2018: 3
| 1 organism/s
| Hsapiens: 3
Hsapiens-jaspar2018-GATA2-MA0036.1
Hsapiens-jaspar2018-GATA2-MA0036.2
Hsapiens-jaspar2018-GATA2-MA0036.3
=> 3 motif representations in Human.
Visualise redundant motifs
- This final set of human GATA2 motifs from the Jaspar2018 database is quite redundant, as the following picture shows. Here, we add our own matrix to the end of the list. We use universalmotif’s
convert_motifs
to convert the motif representations for our needs.
<- query(MotifDb, andStrings=c("GATA2", "JASPAR2018", "Hsapiens"))
res <- convert_motifs(res)
res 4]] <- hGATA2_motif
res[[ res
[[1]]
Motif name: GATA2
Alternate name: MA0036.1
Family: GATA-type zinc fingers
Organism: Hsapiens
Alphabet: DNA
Type: PPM
Strands: +-
Total IC: 5.35
Pseudocount: 1
Consensus: NGATR
Target sites: 53
Extra info: [dataSource] jaspar2018
N G A T R
A 0.25 0.00 0.98 0 0.47
C 0.25 0.09 0.00 0 0.13
G 0.34 0.91 0.02 0 0.28
T 0.17 0.00 0.00 1 0.11
[[2]]
Motif name: GATA2
Alternate name: MA0036.2
Family: GATA-type zinc fingers
Organism: Hsapiens
Alphabet: DNA
Type: PPM
Strands: +-
Total IC: 13.41
Pseudocount: 1
Consensus: NNNTTCTTATCWSH
Target sites: 4380
Extra info: [dataSource] jaspar2018
N N N T T C T T A T C W S H
A 0.36 0.26 0.36 0.24 0.01 0.00 0.00 0.01 1 0 0 0.28 0.02 0.29
C 0.13 0.18 0.17 0.12 0.24 0.69 0.04 0.00 0 0 1 0.00 0.40 0.27
G 0.17 0.39 0.28 0.10 0.17 0.21 0.00 0.00 0 0 0 0.00 0.45 0.05
T 0.34 0.18 0.20 0.54 0.58 0.10 0.96 0.99 0 1 0 0.72 0.12 0.39
[[3]]
Motif name: GATA2
Alternate name: MA0036.3
Family: GATA-type zinc fingers
Organism: Hsapiens
Alphabet: DNA
Type: PPM
Strands: +-
Total IC: 10.8
Pseudocount: 1
Consensus: NNCTTATCTBN
Target sites: 14213
Extra info: [dataSource] jaspar2018
N N C T T A T C T B N
A 0.23 0.14 0.07 0.03 0.03 0.99 0.01 0.01 0.17 0.20 0.23
C 0.21 0.29 0.75 0.02 0.00 0.00 0.01 0.97 0.03 0.26 0.27
G 0.23 0.21 0.09 0.01 0.00 0.00 0.00 0.01 0.03 0.26 0.16
T 0.33 0.36 0.10 0.94 0.97 0.00 0.98 0.01 0.77 0.28 0.34
[[4]]
Motif name: GATA2
Alphabet: DNA
Type: PCM
Strands: +-
Total IC: 9.19
Pseudocount: 0
Consensus: SGATAN
Target sites: 19
S G A T A N
A 0 0 19 0 19 2
C 9 0 0 0 0 6
G 10 19 0 0 0 8
T 0 0 0 19 0 3
view_motifs(res, dedup.names = TRUE)
=> Notice that the 2nd and 3rd representations were reverse-complemented. Indeed, function view_motifs
automatically aligns and reverse-complements if necessary to obtain the best visual representation of a set of motifs.
This redundancy complicates the interpretation of motif analysis, and its not immediately clear which motif(s) is(are) superior in a redundant set. Often the motif with he highest information content is only reflecting the consensus site or the best scoring site, and will not match relevant sites with lower affinity. We will see later about scoring motif matches in sequences. Collapsing results across similar motifs is beyond the scope of this session.
For now, with the above, we have access to known motifs, but how can we find motifs de novo?
De novo motif identification
We might have access to some data we want to mine for evidence of new, unknown binding site motifs, such as ChIP-seq or ATAC-seq data in a non-model organism, or for a newly-discovered TF. We essentially want to find short sequences that are over-represented in a set of larger sequences of interest (e.g. ChIP-seq or ATAC-seq peaks) when compared to a background set of sequences (most naively, the whole genome, or all promoters, etc.).
Tools such as HOMER, MEME, or RSAT can be used to do this, and describing each of them in detail is beyond the scope of this course. Several start by finding over-represented k-mers (i.e. subsequences of length k) in sequences of interest, so we will explain this approach.
In some Human cell line ChIP-seq data from ENCODE, we will manually look for k-mers enriched in GATA2 ChIP-seq peaks compared to a selected background. As a quick-n-dirty approach to selecting a background, we will use ChIP-seq peaks from another TF called MYC, which we’d expect to bind to the same sort of genomic region (promoters), but not likely at exactly the same regions.
ChIP-seq peak regions for GATA2 on Human K562 cell cultures (GRCh38 genome build): download from course page or c.f. ENCFF075FIE
ChIP-seq peak regions for MYC on Human K562 cell cultures (GRCh38 genome build): download from course page or c.f. ENCFF462OYX
Preparing the target & background sequences
- First, let’s get the ChIP-seq peak sequences for both TFs, and as a sanity check, view the distribution of the lengths of the peaks:
# grab peak BED files & import to GRanges
<- rtracklayer::import("./data/Motifs/ENCFF075FIE.bigBed")
GATA2_gr <- rtracklayer::import("./data/Motifs/ENCFF462OYX.bigBed")
MYC_gr
# grab peak sequences using these & the human genome (GRCh38)
<- getSeq(BSgen.Hu, GATA2_gr)
GATA2_peak_seq <- getSeq(BSgen.Hu, MYC_gr)
MYC_peak_seq
# look at peak width distributions
hist(width(GATA2_peak_seq), sub=length(GATA2_peak_seq), breaks=64)
hist(width(MYC_peak_seq), sub=length(MYC_peak_seq), breaks=64)
=> GATA2 peaks are typically ~300nts wide, while MYC peaks are ~445.
- We will homogenise the sequence lengths for each TF. Ideally, we’d also want the background sequences to have similar lengths to the foreground ones too, but this is good enough for this example.
<- function(seqs, min_len, max_len) {
subset_by_length <- width(seqs)
lens return(seqs[ !is.na(lens) & lens>=min_len & lens<=max_len ])
}
print(length( GATA2_peak_seq <- subset_by_length(GATA2_peak_seq, 290, 310) ))
[1] 3244
print(length(MYC_peak_seq <- subset_by_length(MYC_peak_seq, 435, 455) ))
[1] 7115
- In this example, we already know what the hGATA2 motif looks like. Out of interest, we can also look up what the known human MYC binding site motif looks like:
<- query(MotifDb, andStrings=c("MYC", "JASPAR2018", "Hsapiens"))
res view_motifs(convert_motifs(res))
# clean up
rm(res)
=> Interestingly, MYC binds to a core CACGTG motif, that is its own reverse-complement.
We can also note that it does not resemble GATA in any way!
Building k-mer frequencies
We want to establish the frequencies of k-mers across both sets of sequences.
How to choose k? Increasing k means the number of possible k-mers increases, and the number of occurrences of each will decrease. Longer k-mers can capture longer motifs and potentially more variation therein, but with too few counts, one might not be able to test for enrichment. One typically has to try several values of k and select one that offers a good trade-off between motif representation, counts per k-mer, and analysis complexity.
Here we will try with k=6, for a total of 4^6=4096 different k-mers, which we know should encapsulate the GATA motif nicely.
Let’s count the occurrences of each k-mer across all sequences of both TFs. We can thus check to see if there are any k-mers with near-0 counts.
NB: counting from one strand only, RCs are thus counted independently.
NB:
oligonucleotideFrequency
generates counts for each input sequence, so we’ll sum across these to get total counts.
# GATA2
<- Biostrings::oligonucleotideFrequency(
GATA2_6mer_mat width = 6, as.prob = F) ## using k=6
GATA2_peak_seq, hist(colSums(GATA2_6mer_mat), breaks=128,
sub=ifelse(!any(colSums(GATA2_6mer_mat)<=5), "OK", "low counts found"))
# MYC
<- Biostrings::oligonucleotideFrequency(
MYC_6mer_mat width = 6, as.prob = F)
MYC_peak_seq, hist(colSums(MYC_6mer_mat), breaks=128,
sub=ifelse(!any(colSums(MYC_6mer_mat)<=5), "OK", "low counts found"))
=> No k-mers with very low counts found, we can continue with our chosen value of k.
- Convert to frequencies:
<- colSums(GATA2_6mer_mat) / sum(GATA2_6mer_mat)
GATA2_6mer_freq <- colSums(MYC_6mer_mat) / sum(MYC_6mer_mat)
MYC_6mer_freq
rm(GATA2_6mer_mat, MYC_6mer_mat)
Calculating k-mer enrichment
Now, we can make an M-A plot showing the enrichment of various k-mers in the GATA peaks compared to in the MYC peaks.
NB: For illustration’s sake, we’ll highlight which k-mers perfectly match each known motif (though of course this would not be known in a de novo motif search setting!).
We’ll also highlight effects of nucleotide composition.
# get single nucleotide frequencies for each k-mer,
# so we can look at potential %GC bias
<- Biostrings::oligonucleotideFrequency(
kmer_descr DNAStringSet(names(GATA2_6mer_freq)), width=1)
# create an MA plot
<- data.frame(
dat row.names = names(GATA2_6mer_freq),
"pGC" = 100 * rowSums(kmer_descr[,c("C","G")]) / nchar(names(GATA2_6mer_freq))[1],
"A" = 0.5*(log2(GATA2_6mer_freq) + log2(MYC_6mer_freq)), # average
"M" = log2(GATA2_6mer_freq) - log2(MYC_6mer_freq) # diff
)
# highlight k-mers containing GATA/MYC motifs & RCs thereof, as well as CpGs
$detected <- ifelse(
dat# note: the GATA motif can match many 6-mers
str_detect(rownames(dat), "GATA|TATC"), "GATA",
ifelse(
# note: the CACGTG motif can only exactly match 1 6-mer
# (due to it being its own RC)
str_detect(rownames(dat), "CACGTG"), "MYC",
ifelse(
# note: CpGs directional
str_detect(rownames(dat), "CG") , "CpG", "none")) )
# plot
ggplot(dat, aes(x=.data[["A"]], y=.data[["M"]])) +
geom_hline(yintercept = 0) +
geom_point(aes(color=.data[["detected"]]), size=0.65)
ggplot(dat, aes(x=.data[["A"]], y=.data[["M"]])) +
geom_hline(yintercept = 0) +
geom_point(aes(color=.data[["pGC"]]), size=0.65)
=> As you can see, many of the GATA k-mers (green) are enriched in the GATA peak sequences relative to the MYC peak sequences. Conversely, the MYC k-mer (blue) is depleted.
=> Not all GATA k-mers are enriched, suggesting there is some biologically-relevant sequence specificity outside of the core GATA sequence. We’ll have a closer look below.
=> What are all the red motifs? These motifs all contain “CG”, aka CpGs. Due to the mechanisms of DNA nucleotide methylation, mutation and repair, CpGs are typically being lost in vertebrates, and so motifs containing this dinucleotide are typically under-represented in genomic sequences (cf for example this paper). Furthermore, there is a general %GC bias in the k-mer abundances and differential abundances, visible in the 2nd plot. Correcting for these biases would require using more advanced methodology (eg that implemented in monaLisa::getKmerFreq
).
- As a sanity check, let’s have a look at some of the most enriched and most depleted GATA-containing k-mers:
<- dat[dat$detected=="GATA",] # select GATA kmers
tmp <- tmp[order(tmp$M, decreasing=TRUE),] # order by enrichment
tmp head(tmp, 10) # top enriched GATA kmers
pGC A M detected
GATAAG 33.33333 -11.40214 3.100328 GATA
CTTATC 33.33333 -11.40132 3.036949 GATA
TTATCT 16.66667 -11.06607 2.817302 GATA
AGATAA 16.66667 -11.00824 2.740202 GATA
TTATCA 16.66667 -11.62906 2.381462 GATA
GTTATC 33.33333 -12.31875 2.354227 GATA
TGATAA 16.66667 -11.60033 2.239393 GATA
CAGATA 33.33333 -11.66407 2.226085 GATA
TATCTC 33.33333 -11.61006 2.219941 GATA
GAGATA 33.33333 -11.48220 2.091920 GATA
tail(tmp, 10) # top depleted GATA kmers
pGC A M detected
GGGATA 50.00000 -12.77993 0.09306402 GATA
GTATCC 50.00000 -13.36322 0.08972058 GATA
CGATAG 50.00000 -14.51261 0.06395115 GATA
ATATCG 33.33333 -14.92645 0.01539685 GATA
TATCGG 50.00000 -14.64366 -0.03968622 GATA
TATCCG 50.00000 -14.46358 -0.24794580 GATA
CGGATA 50.00000 -14.46812 -0.31380230 GATA
GGATAC 50.00000 -13.58191 -0.40715278 GATA
CGATAC 50.00000 -15.14177 -0.41523751 GATA
GTATCG 50.00000 -15.28722 -0.52760403 GATA
rm(tmp)
=> Many of those depleted seem to contain CGs, and have a low overall frequency, indicating that they might be being lost over evolution due to the same mechanism discussed above. Though as mentioned, there might also be some level of biologically-relevant sequence specificity of human GATA2 binding outside of the core GATA sequence.
- In the de novo identification setting, the set of all k-mers above a certain enrichment threshold here could be explored for evidence of a new TF binding site motif.
<- dat[dat$M>=1.25,] # select enriched motifs above an arbitrary threshold
tmp head(tmp[order(tmp$M, decreasing = TRUE),], 20)
pGC A M detected
GATAAG 33.33333 -11.40214 3.100328 GATA
CTTATC 33.33333 -11.40132 3.036949 GATA
TTATCT 16.66667 -11.06607 2.817302 GATA
AGATAA 16.66667 -11.00824 2.740202 GATA
TTATCA 16.66667 -11.62906 2.381462 GATA
GTTATC 33.33333 -12.31875 2.354227 GATA
TGATAA 16.66667 -11.60033 2.239393 GATA
CAGATA 33.33333 -11.66407 2.226085 GATA
TATCTC 33.33333 -11.61006 2.219941 GATA
GAGATA 33.33333 -11.48220 2.091920 GATA
GATAAC 33.33333 -12.32396 2.079452 GATA
TAGATA 16.66667 -12.66735 2.056033 GATA
ATAAGC 33.33333 -12.44507 2.038398 none
TATCTG 33.33333 -11.64060 2.031143 GATA
TATCTA 16.66667 -12.78907 2.028302 GATA
TATCAG 33.33333 -11.87700 2.008089 GATA
GCTTAT 33.33333 -12.41862 1.968923 none
TATCTT 16.66667 -11.95283 1.955444 GATA
TCTTAT 16.66667 -12.07541 1.934653 none
CCTTAT 33.33333 -11.79616 1.929732 none
# => a lot of GATA-related k-mers
# clean up
rm(tmp,dat)
- This exploration would likely involve testing other k-mer lengths, aligning match sequences, extending match sequences outside the k-mer hits, and more. Those interested can look up the processes in the tools mentioned earlier, such as HOMER.
# clean up
rm(GATA2_gr, MYC_gr,
GATA2_peak_seq, MYC_peak_seq,
GATA2_6mer_freq, MYC_6mer_freq)
Summary
In this section, we have seen:
- How to represent, visualise & manipulate motifs programmatically.
- How to grab known motifs from databases.
- How to derive simple motifs by identifying enriched k-mers in ChIP-seq peak sequences.
However, we have not yet seen how to score the matching of a sequence to a given motif, nor used this to find actual motif occurrences in the genome.
Exercise 1: when to use motif representations
- Imagine you are working in a lab that deals specifically with proteins that recognise specific patterns in mRNA molecules (e.g. polyadenylation signals). The protein you are working on only recognises a specific nucleotide sequence; any variations mean there is no binding. Is it useful to build a motif representation, such as a PWM, in this case?
Not really. It would enable you to use the same tools as for other motifs (visual representation, sequence scoring, occurrence identification), however as the protein brooks no variation, there is no nuance that needs encoding programmatically. Simply stating the specific sequence that is bound, and locating exact sequence matches thereof, will be sufficient.
Exercise 2: create a PPM
- Using the simulation function below, create a set of sequences, derive then illustrate a Positional Probability Matrix for them:
<- function( ... ) { # takes any number of arguments, all ignored
sim_motif_seq <- sample(DNA_BASES, size = 8, replace = TRUE)
res c(3,4,6)] <- c("C", "T", "G")
res[DNAString( paste0(res, collapse="") ) # returns 1 DNAString
}
print(sim_motif_seqs <- DNAStringSet( lapply( 1:20, sim_motif_seq ) ))
DNAStringSet object of length 20:
width seq
[1] 8 AACTCGCG
[2] 8 TCCTTGGT
[3] 8 TGCTGGCA
[4] 8 CACTCGCA
[5] 8 AGCTGGGC
... ... ...
[16] 8 TCCTGGTT
[17] 8 CGCTAGCA
[18] 8 CGCTAGTA
[19] 8 TTCTAGTA
[20] 8 TGCTAGAT
<- create_motif(sim_motif_seqs, type="PCM", name = "Sim'd motif")
sim_motif view_motifs(sim_motif, use.type = "PPM")
Sequence scoring and matching
A sequences’ score for a given motif represents how well the sequence matches the motif, accounting for motif length, position conservation, and the expected background. We will see how to calculate this score for a given sequence/motif pair, how to determine how good a score actually is, and then use this to find motif occurrences.
Scoring a single sequence
- First, how can we calculate the score of a given sequence for a specific motif? The score of a sequence is simply calculated by summing the score of the matching letter for each position of the PWM. Let’s look at these positional scores for our hGATA2 motif:
<- convert_type(hGATA2_motif, type="PWM", pseudocount = 1)
hGATA2_motif 'motif'] hGATA2_motif[
S G A T A N
A -4.3219281 -4.321928 1.944858 -4.321928 1.944858 -1.1520031
C 0.8875253 -4.321928 -4.321928 -4.321928 -4.321928 0.3219281
G 1.0356239 1.944858 -4.321928 -4.321928 -4.321928 0.7224660
T -4.3219281 -4.321928 -4.321928 1.944858 -4.321928 -0.6214884
- We can now score a randomly-chosen sequence for this motif:
<- "CAAGAT"
test_seq score_match(hGATA2_motif, test_seq)
[1] -4.488
# same, manually:
0.8875253 + (-4.321928) + 1.944858 + (-4.321928) + 1.944858 + (-0.6214884)
[1] -4.488103
=> not a particularly good match!
- We can also choose a known hGATA2 binding site sequence to see an example of a good score:
1] # should be a good hit, by definition! hGATA2_motif_seqs[
[1] "CGATAG"
score_match(hGATA2_motif, hGATA2_motif_seqs[1])
[1] 9.385
# manually
0.8875253 + 1.944858 + 1.944858 + 1.944858 + 1.944858 + 0.7224660
[1] 9.389423
# manually, with rounding
0.887 + 1.944 + 1.944 + 1.944 + 1.944 + 0.722
[1] 9.385
=> A good match, by design.
=> Notice that score_match
works on truncated decimal scores. This is to improve calculation speed.
Score calculation pitfalls
Background frequencies: The definition of background frequencies can have an impact on motif scores and thus their detection in given sequences. However, this impact will be limited unless the sequences where you are searching for motif matches have nucleotide frequencies far from uniformity. In practice, using uniform background frequencies in all searches makes the search results more comparable. More advanced motif methods might determine background frequencies using their own approaches, e.g. monaLisa.
Affinity vs score: In many cases, PWM-derived match scores are a good proxy for the TF’s binding affinity for each site, but there are also limitations to the PWM model:
- They use single letter frequencies, i.e., they do not capture dependencies between different sites.
- Independent of affinity, TF concentration is a major determinant for binding site occupancy. While at low TF concentrations the PWM score for a given site reflects the probability of it being bound, at very high TF concentration non-specific binding will dominate.
- The binding preference is also determined by other important factors, most notably chromatin state or co-factors.
Zero is too low: With all this, can we determine what range of scores actually marks a motif occurrence, and which are too low to be relevant? Typically, potential target sites of TFs (e.g., genomic locations potentially bound by the TF) will show scores greater than 0, and non-specific sites will show scores below 0. However, a score threshold of 0 will in almost all cases (except very polarized motifs) pick far too many non-relevant sites, e.g. when applied to a set of peak or promoter sequences.
Score thresholds: Ideally, each PWM would be accompanied by a score threshold which is derived experimentally, e.g. using ChIP-Seq data, by contrasting true binding sites with non-functional sites. This approach was used in the HOMER database, but doesn’t guarantee an optimal score threshold when applied to other sets of sequences. How can we derive score thresholds even in the absence of experimental data?
Thresholding scores to detect motif occurences
Retrieve 200 example motifs from JASPAR2018
- We will illustrate approaches for deriving score thresholds (as well as other operations in sections below) using a set of vertebrate PWMs from the “core” of the Jaspar 2018 database. We will use the
getMatrixSet()
function from TFBSTools to query the database this time. Note: as the number of mouse-specific motifs is quite limited, we are selecting all vertebrate motifs. This is often done in practice when working on mouse.
# query options
<- list(
opts tax_group = "vertebrates",
collection = "CORE",
matrixtype = "PWM",
all_versions = FALSE)
# run query
<- getMatrixSet(JASPAR2018, opts) # may take a few seconds
jaspar_pwms print(length(jaspar_pwms))
[1] 579
# clean up
rm(opts)
=> As of 22/05/2023, 579 motifs found!
# show distribution of motif lengths
<- unlist(lapply(jaspar_pwms, length))
tmp hist(tmp, breaks=64)
=> let’s keep 200 random motifs of length 8+, also ensuring we keep the known motif for hGATA2.
# set random number generator seed so we always get the same "random" selection
set.seed(42)
# select based on length, then random sub-selection thereof
<- (tmp>=8)
ok_length <- c(
ok_ids which(names(tmp) == "MA0036.3"), # hGATA2
sample( (1:length(tmp))[ok_length], size=199, replace = F )
)
# apply filtering
<- jaspar_pwms[ok_ids]
jaspar_pwms print(length(jaspar_pwms))
[1] 200
# clean up
rm(tmp, ok_length, ok_ids)
- We can also prepare some things for convenience later.
# convert to unviersalmotif representation, maintaining names
<- setNames(
jaspar_upwms convert_motifs(jaspar_pwms),
names(jaspar_pwms))
Score thresholding approaches
We will now consider several alternatives to deriving score thresholds. These approaches have been implemented in different Bioconductor packages:
The simplest approach is to use a constant score threshold across all motifs. This ensures comparability of motif matching results across motifs. There is no threshold to calculate, so this is the fastest approach. In practice, common arbitrary thresholds are 6, 7 and 8.
Basing a threshold on a percentage of the maximal possible score, e.g. 80% of the highest possible PWM score. This is namely the approach implemented (with default options) in function
matchPWM
from Biostrings. It is very fast to calculate, and simple to intuit.Calculate scores corresponding to a certain significance level (p-value), and use that as a threshold. The p-value of a score corresponds to the probability of observing a sequence with that score or larger, by chance given the background nucleotide frequencies. Common significance thresholds used in practice are 1e-5 and 5e-5. But how is significance determined?
Here, consider a motif \(M\) of length \(m\), and a score threshold to be determined \(T\). Let’s write as \(S(T)\) the set (of size \(n(T)\)) that contains all possible sequences of length \(m\) that would have a score for \(M\) that is \(\ge T\).
This approach seeks to choose a \(T\) such that the probability that the background distribution would generate \(S(T)\) is less likely than the requested p-value.
This would require enumerating all sequences in \(S(T)\), but (assuming a uniform background), the probability of this set being generated under the background is simply \(P_{bg}(T) = n(T) \times 0.25^{m}\).
With this, we can try different values of \(T\) to get the corresponding \(n(T)\) values, until the calculated probability \(P_{bg}(T)\) is within significance.
With a good implementation, this approach is relatively fast to calculate.
Comparing score thresholding approaches
We will compare the latter two approaches to see how they relate to each other.
We’ll also consider the widths of the motifs (in nucleotides).
We’ll also calculate the total Information Content for each PWM, as this indicates how fast \(n(T)\) grows with lowering of \(T\). Indeed, for motifs with high IC, \(n(T)\) grows more slowly as a function of the relative decrease in \(T\).
## preparations
# For the % max score approach:
# for each motif, return quantiles of all possible scores
# 0 => minimum, 1 => maximum, 0.8 => 80% maximum
<- sapply(
quantilesPWMscore threshold=c(0, 0.8, 1))
jaspar_upwms, motif_score,
# For the p-value score threshold approach:
# for each motif, return the score corresponding
# to a given value
<-
pvaluePWMscore motif_pvalue(jaspar_upwms, pvalue = 1e-5, k=10) # may take a couple of secs
# For the total IC:
# for each motif, retrieve total information content
# as already available in a slot within the universalmotif PWM objects
<- sapply(jaspar_upwms, function(upwm) {upwm@icscore})
PWM_total_IC
# For the motif width
<- sapply(jaspar_upwms, ncol)
PWM_widths
# Assemble!
<- data.frame(
pwm_score_info maxPWMscore = quantilesPWMscore["100%",], # keep the max
pvaluePWMscore = pvaluePWMscore,
PWM_total_IC = PWM_total_IC,
PWM_widths = PWM_widths
)
# clean up
rm(quantilesPWMscore, pvaluePWMscore, PWM_total_IC, PWM_widths)
## plot
ggplot(data = pwm_score_info,
mapping = aes(x=maxPWMscore, y=pvaluePWMscore)) +
xlim(c(0, max(pwm_score_info$maxPWMscore))) +
ylim(c(0, max(pwm_score_info$pvaluePWMscore))) +
geom_point(aes(color=PWM_total_IC, size=PWM_widths)) +
scale_size_continuous(range=c(0.25, 3)) +
geom_abline(intercept=0, slope=1) + # solid diagonal
geom_abline(intercept=0, slope=0.8, linetype="dashed") + # dashed diagonal 1
geom_abline(intercept=0, slope=0.5, linetype="dashed") + # dashed diagonal 2
labs(x="Maximal PWM Score", y="PWM Score Threshold for P-Value=1e-5") +
scale_color_gradientn(colours = rainbow(5)) +
theme_bw() +
annotate(geom="text", x=15.00, y=14.5, label="y=100% x",color="black", hjust=0)+
annotate(geom="text", x=18.75, y=14.0, label="y=80% x", color="black", hjust=0)+
annotate(geom="text", x=30.00, y=14.5, label="y=50% x", color="black", hjust=0)
=> We can observe the following:
In the top-left, we can see that short motifs, and/or or motifs with low total information content, require a PWM score threshold very close to their maximal PWM score (ie a very stringent threshold), for a match to reach the targeted significance level of 1e-5. As such, only sites close to perfect matches will be found for such motifs.
In the middle, we see that motifs with average total information content will require a threshold 50-80% that of their maximal score to reach a significance of 1e-5.
Finally, on the right we see that motifs with high total IC will only require a very lenient threshold (<50% maximum) to reach a significance of 1e-5. As such there will likely be many sequences that match, and there will be a great amount of variability in these sequences.
We can conclude, if the objective is indeed to select matches that attain a certain significance, that using a threshold based on a percentage of the maximal score, though much faster to calculate, will be too permissive for short/low IC motifs, and much too stringent for longer/high IC motifs.
# clean up
rm(jaspar_upwms, pwm_score_info)
Finding motif occurences in sequences
Now that we can score sequence matches to motifs, and pick apart which scores mark likely binding sites, we can try finding all occurrences (or hits) of given motifs in a set of sequences, e.g. in ChIP-seq peaks, promoter sequences, or the whole genome.
Obviously, the sequences we will be scanning are much longer than the motifs themselves. We will simply scan along these sequences, scoring each individual sub-sequence of same length as the motif.
Preparing the sequences
Dataset: ATAC-seq-based accessible chromatin regions for Mouse immune cells.
Künzli et al. 2020,
“Long-lived T follicular helper cells retain plasticity and help sustain humoral immunity”
PAPERThe table merged_peaks_TFH_TH1.tsv.gz contains the results of a differential chromatin accessibility analysis contrasting two cell populations. We can load and parse the peak genomic regions, ready for scanning.
Differential accessibility details:
The differential accessibility test was done for 150bp genomic tiles contrasting (Tfh - TH1).
Tiles were then merged into larger regions. Hence:
- “up” contained only tiles more accessible in Tfh than TH1
- “down” contained only tiles more accessible in TH1 than Tfh
- “mixed” contained tiles with a mix of both directions
- the significance of the differential accessibility (after correcting for multiple testing) is provided in the “FDR” field
- anything not significantly differentially accessible is still an ATAC-seq peak, and thus accessible, and can thus be considered as “common” to both cell lines.
# regions of interest: get peak region from last week's lecture
<- read.table(gzfile("data/Motifs/merged_peaks_TFH_TH1.tsv.gz"),
peaks header=TRUE, stringsAsFactors=FALSE, sep="\t")
<- GRanges(peaks, seqinfo=seqinfo(BSgenome.Mmusculus.UCSC.mm10))
peaks_gr names(peaks_gr) <- as.character(peaks_gr)
# exclude "mixed" peaks for simplicity
<- peaks_gr[ peaks_gr$direction != "mixed" ]
peaks_gr
# exclude peaks on sex chromosomes
<- peaks_gr[ ! seqnames(peaks_gr) %in% c("chrX", "chrY", "chrM" )]
peaks_gr
# get peak sequences, we'll need this later
<- getSeq(BSgenome.Mmusculus.UCSC.mm10, peaks_gr) # <2s
peak_seqs
# exclude peaks with any N's, to avoid warnings further down
<- (alphabetFrequency(peak_seqs)[,"N"]>0)
hasNs addmargins(table(hasNs)) # => only drops a few sequences
hasNs
FALSE TRUE Sum
49392 7 49399
<- peak_seqs[!hasNs]
peak_seqs <- peaks_gr[!hasNs]
peaks_gr
# clean up
rm(peaks ,hasNs)
Filtering the peaks
- As a sanity check, let’s have a look at the lengths of the ATAC-seq peaks we’re dealing with:
hist(width(peak_seqs), breaks=64, sub=paste(range(width(peak_seqs)), collapse="-"))
=> Most under 3kb, but some up to 14kb! The smallest is 150bp.
- We should restrict our sequence lengths to something less extreme. Many motif-related analyses, especially when considering a background, will assume that the sequences they’re looking at are of comparable lengths. Here, let’s restrict to 3kb.
# exclude peaks longer than 4kb
<- width(peak_seqs) > 3000
tooLong addmargins(table(tooLong)) # => only drops a few hundred of ~50k sequences
tooLong
FALSE TRUE Sum
48871 521 49392
<- peak_seqs[!tooLong]
peak_seqs <- peaks_gr[!tooLong]
peaks_gr
# clean up
rm(tooLong)
Categorising the peaks
- Now let’s classify peaks into “Tfh>TH1”, “TH1>Tfh”, and “common” (i.e. open but not more open in either T cell sub-population):
# rename so it's clear to interpret
$state <- c(
peaks_gr"up" = "Tfh>TH1", "down" = "TH1>Tfh") [ peaks_gr$direction ]
# take significance into account
$state[peaks_gr$FDR >= 0.1] <- "common"
peaks_gr
# report
addmargins(table(peaks_gr$state))
common Tfh>TH1 TH1>Tfh Sum
35556 6255 7060 48871
Stratifying the peaks
This is still too many sequences to run the commands below in a reasonable time. We also want to ensure that we have roughly the same number of peaks in each of the analyses, ensuring comparable statistical power for both, and more peaks in the background.
To achieve this, we are going to randomly sub-select sequences in a way that considers the various “types” of ATAC-seq peaks (ie the strata).
# setting the seed to always get the same "random" selection
set.seed(42)
# function to sub-sample (names of) a stratum
<- function(size, stratum) {
rnd_subsample_stratum <- peaks_gr[ peaks_gr$state == stratum ]
stratum_peaks_gr stopifnot( length(stratum_peaks_gr)>=size )
return(sample(names(stratum_peaks_gr), size=size, replace=F))
}
# get sub-sampled names for each stratum
<- c(
sel_peak_names rnd_subsample_stratum(25000, "common"),
rnd_subsample_stratum(2500, "TH1>Tfh"),
rnd_subsample_stratum(2500, "Tfh>TH1")
)
# apply
<- peaks_gr[sel_peak_names]
peaks_gr_sel print(addmargins(table(peaks_gr_sel$state)))
common Tfh>TH1 TH1>Tfh Sum
25000 2500 2500 30000
# enforce same selection on sequences
<- peak_seqs[sel_peak_names]
peak_seqs_sel
# clean up
rm(rnd_subsample_stratum, sel_peak_names)
=> This will be faster to run!
Efficient scanning in R using motifmatchr
Function
matchMotifs
in motifmatchr package implements very efficient PWM scanning across a set of genomic regions. It uses as input motif classes as defined in the TFBSTools package. It also implements a selection of target sites using the p-value approach seen above.Once prepared, we can scan the ATAC-seq peak sequences for occurrences of the motif set we derived from JASPAR earlier.
If necessary, you can load the pre-prepared results of this (“jaspar2018_matches_rse.rds”) from the course page.
# run: 30k sequences, 200 motifs
system.time(
<- matchMotifs(
matches_rse genome = BSgenome.Mmusculus.UCSC.mm10,
jaspar_pwms, peaks_gr_sel, p.cutoff = 5e-05, # using a stringent p-value
out="scores"
# <10s on my laptop ))
user system elapsed
8.451 0.285 8.752
# overview
matches_rse
class: RangedSummarizedExperiment
dim: 30000 200
metadata(0):
assays(3): motifScores motifMatches motifCounts
rownames(30000): chr9:41181301-41181750 chr12:116048701-116049150 ...
chr1:94006201-94006350 chr12:3279151-3279450
rowData names(12): num.tests num.up.logFC ... right state
colnames(200): MA0036.3 MA1421.1 ... MA1100.1 MA0835.1
colData names(1): name
matches_rse
is a RangedSummarizedExperiment object and contains information on matches in its assays slot. We can have a quick look at this object:
# check out the object
dim(matches_rse) # 30k peaks in rows, 200 motifs in columns
[1] 30000 200
assayNames(matches_rse) # scores, binary presence/abscence, counts
[1] "motifScores" "motifMatches" "motifCounts"
head(as.data.frame(rowData(matches_rse))) # the original info from the peaks GR
num.tests num.up.logFC num.down.logFC PValue FDR direction
chr2:127792201-127792800 4 0 0 0.46094644 0.6977457 up
chr8:83694001-83694750 4 0 0 0.05340523 0.1810975 down
chr13:12275251-12275400 1 0 0 0.13729721 0.3444052 down
chr7:3290101-3290550 3 0 0 0.70945801 0.8654646 up
chr2:164798701-164798850 1 0 0 0.82985083 0.9292285 up
chr7:114157951-114158250 2 0 0 0.67818764 0.8482615 down
rep.test rep.logFC overlap left
chr2:127792201-127792800 135006 0.62661028 Mtln:-:PE Nphp1:-:3304,Mtln:-:369
chr8:83694001-83694750 215918 -1.09188404 Pkn1:-:PE Pkn1:-:1065
chr13:12275251-12275400 60047 -0.80291154 Actn2:-:PI Actn2:-:27
chr7:3290101-3290550 195022 0.41156221 Myadm:+:PE Myadm:+:776
chr2:164798701-164798850 138518 0.09148081 Acot8:-:I Snx21:+:4885,Acot8:-:2212
chr7:114157951-114158250 206423 -0.62894701
right state
chr2:127792201-127792800 Bub1:-:8322 common
chr8:83694001-83694750 Pkn1:-:3258 common
chr13:12275251-12275400 Actn2:-:963 common
chr7:3290101-3290550 Myadm:+:25 common
chr2:164798701-164798850 Acot8:-:708,Zswim3:+:6248 common
chr7:114157951-114158250 common
head(as.data.frame(colData(matches_rse))) # basic info about motifs
name
MA0036.3 GATA2
MA1421.1 TCF7L1
MA0770.1 HSF2
MA0636.1 BHLHE41
MA0488.1 JUN
MA0145.3 TFCP2
# distribution of max nb occurrences per peak, across all motifs
barplot(
sort(apply(assay(matches_rse, "motifCounts"), 2, max)),
las=3, cex.names=0.5, log="y")
abline(h=20, col="blue")
=> some motifs occur over 20x in a given peak! Not surprising given the lengths of some of these peaks.
- We can check to see if any motifs do NOT have any occurrences, or if any peaks have no motif occurrences:
<- assay(matches_rse, "motifMatches")
binary
any(colSums(binary) == 0 | colSums(binary) == nrow(binary))
## -> FALSE, so all motifs have at least 1 target and none matches all peaks
# clean up
rm(binary)
- However, this
matches_rse
object does not provide any localisation of the matches found. We can get this using a different argument tomatchMotifs
:
# run
system.time(
<- matchMotifs(
matches_grl genome = BSgenome.Mmusculus.UCSC.mm10,
jaspar_pwms, peaks_gr_sel, p.cutoff = 5e-05, out="positions")
# <10s on my laptop )
user system elapsed
8.707 0.371 9.085
# quick look
matches_grl
GRangesList object of length 200:
$MA0036.3
GRanges object with 1802 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <numeric>
[1] chr5 24800629-24800639 + | 13.4745
[2] chr18 56733837-56733847 - | 13.5414
[3] chr10 59970193-59970203 - | 12.4041
[4] chr6 142809492-142809502 - | 13.2129
[5] chr8 72226782-72226792 - | 11.5336
... ... ... ... . ...
[1798] chr2 32039075-32039085 - | 13.2062
[1799] chr10 68673910-68673920 + | 12.8671
[1800] chr1 82973913-82973923 + | 13.8709
[1801] chr13 44799593-44799603 - | 12.1015
[1802] chr12 15998756-15998766 - | 13.6934
-------
seqinfo: 239 sequences from an unspecified genome; no seqlengths
...
<199 more elements>
# how many hits?
addmargins(cbind(
"nb" = sort(unlist( lapply(matches_grl, length) ))),
margin = 1)
nb
MA0777.1 376
MA0836.1 458
MA0635.1 473
MA0838.1 519
MA0875.1 614
MA0887.1 617
MA0024.3 628
MA0900.1 669
MA0876.1 671
MA0666.1 671
MA0679.1 682
MA0709.1 686
MA0879.1 708
MA0818.1 740
MA0886.1 746
MA0873.1 746
MA0894.1 757
MA0612.1 769
MA0843.1 807
MA0748.1 821
MA0027.2 822
MA0906.1 835
MA0644.1 855
MA0817.1 858
MA0639.1 884
MA0726.1 886
MA0014.3 905
MA0019.1 920
MA0145.3 934
MA0125.1 940
MA0132.2 948
MA0075.2 957
MA0069.1 985
MA0630.1 987
MA0720.1 987
MA0895.1 987
MA0654.1 987
MA0722.1 997
MA0907.1 1026
MA0611.1 1041
MA0083.3 1065
MA0905.1 1099
MA0706.1 1108
MA0752.1 1109
MA0723.1 1128
MA0737.1 1137
MA0602.1 1157
MA1143.1 1165
MA0610.1 1208
MA0622.1 1212
MA0669.1 1215
MA0108.2 1225
MA0608.1 1272
MA1117.1 1309
MA0702.1 1324
MA0682.1 1330
MA0594.1 1336
MA0485.1 1337
MA0861.1 1339
MA1151.1 1348
MA0636.1 1354
MA0863.1 1374
MA0831.2 1375
MA0804.1 1378
MA0713.1 1441
MA0865.1 1442
MA0122.2 1467
MA0090.2 1472
MA1150.1 1481
MA0106.3 1489
MA0766.1 1507
MA0739.1 1538
MA0828.1 1548
MA0124.2 1567
MA0663.1 1577
MA0037.3 1580
MA0633.1 1607
MA0672.1 1628
MA0140.2 1679
MA1108.1 1719
MA0113.3 1720
MA0100.3 1729
MA0092.1 1744
MA0829.1 1760
MA1153.1 1780
MA1105.1 1790
MA0842.1 1793
MA0738.1 1799
MA0036.3 1802
MA0511.2 1834
MA0835.1 1868
MA1154.1 1872
MA0878.1 1894
MA0114.3 1898
MA1136.1 1901
MA0071.1 1940
MA0715.1 1967
MA0840.1 1989
MA0800.1 1993
MA0910.1 2005
MA0592.2 2007
MA0764.1 2023
MA0761.1 2031
MA1122.1 2066
MA0482.1 2081
MA0692.1 2092
MA0801.1 2095
MA0162.3 2098
MA0899.1 2098
MA0660.1 2099
MA0869.1 2121
MA0909.1 2127
MA0101.1 2139
MA0867.1 2144
MA0757.1 2152
MA0135.1 2163
MA0913.1 2172
MA0614.1 2187
MA0859.1 2188
MA0770.1 2212
MA0119.1 2221
MA0038.1 2224
MA1139.1 2245
MA0115.1 2255
MA0689.1 2264
MA0143.3 2277
MA0751.1 2295
MA1131.1 2317
MA0060.3 2351
MA0488.1 2425
MA0154.3 2442
MA0035.3 2443
MA0787.1 2448
MA0657.1 2462
MA1148.1 2472
MA0088.2 2498
MA0625.1 2500
MA0677.1 2518
MA1113.1 2522
MA0093.2 2525
MA0773.1 2530
MA0788.1 2535
MA0784.1 2576
MA1123.1 2597
MA0029.1 2624
MA0786.1 2662
MA0503.1 2675
MA1141.1 2682
MA0481.2 2688
MA0258.2 2699
MA0872.1 2718
MA1109.1 2811
MA0155.1 2815
MA0591.1 2822
MA0799.1 2831
MA0796.1 2843
MA0814.1 2879
MA0600.2 2885
MA0095.2 2912
MA0524.2 2929
MA0812.1 2933
MA0652.1 2973
MA0509.1 3106
MA0047.2 3109
MA0495.2 3128
MA0496.2 3129
MA1144.1 3223
MA0665.1 3232
MA0520.1 3251
MA0598.2 3282
MA0845.1 3345
MA0103.3 3517
MA0490.1 3589
MA0484.1 3612
MA1114.1 3644
MA0515.1 3687
MA0593.1 3844
MA0477.1 3874
MA1421.1 4011
MA0505.1 4114
MA0762.1 4174
MA0076.2 4255
MA0653.1 4397
MA0065.2 4414
MA0146.2 4944
MA0742.1 5029
MA0687.1 5159
MA0499.1 5186
MA0816.1 5327
MA0846.1 5447
MA1100.1 5817
MA0002.2 6260
MA1102.1 7113
MA1107.1 9302
MA0139.1 9868
MA0753.1 10079
MA0746.1 11919
MA0741.1 14604
MA0149.1 17975
MA0528.1 53147
Sum 531354
=> ~0.5M significant motif hits in total, from 0.3k to 54k per motif, for 200 motifs across 30k sequences of 0.2-3kb.
matches_grl
is a list of GRanges objects, one for each motif. Applyingunlist
to it will return a single GRanges object, with a name field identifying the original list element, i.e. here the motif name, ready for exporting to a BED file.
export.bed(
unlist(matches_grl),
file.path(".","data", "Motifs", "jaspar2018_matches.bed.gz"))
# note: export.bed can automatically gzip the file
=> This BED file can now be loaded into a genome browser.
# clean up
rm(matches_grl)
Summary
In this section, we saw how to:
- score a single sequence for a given motif;
- different approaches for selecting a score threshold for determining hits;
- how to scan for motif hits using package
motifmatchr
.
We are now ready to perform analyses linking motif hits to genomic features of interest.
Exercise 3: scoring a sequence
Using the MYC PWM prepared below, score the test sequence provided below, either manually or programmatically.
Do you think it is a good hit, or not?
# grab & show MYC PWM, assuming uniform BG & 100 nsites
<- query(MotifDb, andStrings=c("MYC", "JASPAR2018", "Hsapiens"))
res <- convert_type(convert_motifs(res[[2]]), "PWM") myc_motif
Note: motif [motif] has an empty nsites slot, using 100.
# test sequence
<- "GGCAATGTGCAC" test_seq
# view motif
view_motifs(myc_motif, use.type = "PWM") # graphically
print(myc_motif) # weights of PWM
Motif name: motif
Alphabet: DNA
Type: PWM
Strands: +-
Total IC: 11.27
Pseudocount: 0
Consensus: NNCCACGTGCNN
N N C C A C G T G C N N
A -0.25 -0.28 -1.50 -4.81 1.90 -4.52 -2.87 -4.39 -5.75 -1.56 -0.36 -0.71
C 0.12 0.06 1.60 1.97 -4.39 1.87 -2.98 -2.93 -4.46 1.18 0.30 0.53
G 0.42 0.51 -1.39 -5.32 -2.86 -4.14 1.89 -4.29 1.96 -0.04 -0.25 0.13
T -0.44 -0.49 -2.11 -5.52 -3.45 -2.08 -5.02 1.91 -4.75 -1.24 0.20 -0.23
# auto scoring
score_match(myc_motif, test_seq)
[1] 4.647
# percentile score threshold:
motif_score(myc_motif, threshold = 0.8)
80%
5.7812
# p-value score threshold:
motif_pvalue(myc_motif, pvalue = 1e-5, k=10)
[1] 13.954
Exercise 4: playing around with a list of PWMs
We want to grab the complete set of “CORE” vertebrate sequence motifs from Jaspar 2018. In this session, we saw 2 different functions for querying motif databases. One of these is better suited to this objective. Consult their help files and test them out. Once you have a list of motifs you are happy with, identify amongst them:
- the shortest and the longest motifs (function
ncol()
on a PWM can help) - the motifs with the lowest and highest possible log-odds scores (function
motif_score()
can help) - the motifs with the lowest and the highest information contents (for
universalmotif
PWMs, the IC score is containted in the sloticscore
)
- the shortest and the longest motifs (function
# the getMatrixSet() function allows us to grab all CORE vertebrate PWMs easily
<- getMatrixSet(
exo_jaspar_pwms
JASPAR2018, list(
tax_group = "vertebrates",
collection = "CORE",
matrixtype = "PWM",
all_versions = FALSE)) # a few seconds
# => 579 motifs
# ID mapper for convenience
<- sapply(exo_jaspar_pwms, function(pwm) {pwm@name}) exo_motif2tf
# the query() function does not handle taxonomy very well
<- query(
exo_jaspar_pwms
MotifDb, andStrings=c("JASPAR_CORE"),
orStrings = c("Hsapiens","Mmusculus","Rnorvegicus","Vertebrata")) # missing a lot
<- convert_motifs(exo_jaspar_pwms) # conversion necessary
exo_jaspar_pwms # => 125 motifs
# ID mapper for convenience
<- sapply(exo_jaspar_pwms, function(pwm) {pwm@name}) exo_motif2tf
# Lengths
<- sapply(jaspar_pwms, ncol)
PWMlength range(PWMlength)
[1] 8 21
which.min(PWMlength)] # shortest motif exo_motif2tf[
[1] "FOXD1"
which.max(PWMlength)] # longest motif exo_motif2tf[
[1] "HNF1B"
# Max log-odds scores
<- sapply(exo_jaspar_pwms, motif_score)["100%",]
maxPWMscore range(maxPWMscore)
[1] 7.105 35.133
which.min(maxPWMscore)] # motif w/ smallest max score exo_motif2tf[
[1] "GATA2"
which.max(maxPWMscore)] # motif w/ highest max score exo_motif2tf[
[1] "EWSR1-FLI1"
# ICs
<- convert_motifs(exo_jaspar_pwms) # convert to get easy access to IC
jaspar_upwms <- sapply(jaspar_upwms, function(pwm) {pwm@icscore})
ICscore range(ICscore)
[1] 4.900548 31.838563
which.min(ICscore)] # motif w/ smallest IC exo_motif2tf[
[1] "En1"
which.max(ICscore)] # motif w/ highest IC exo_motif2tf[
[1] "EWSR1-FLI1"
Enrichment analysis for motifs
Now that we have our motifs (drawn from databases or experimental data), and have located occurrences of them, we can try and determine if the presence of these occurrences associates with some relevant biology. Using TF binding sites for example:
does a particular TF bind more often near genes annotated with a certain function? This could suggest involvement of the TF in that function.
does a particular TF usually bind more often near genes whose expression is dysregulated upon knock-down of the TF itself? This could indicate which genes are “targeted” by the TF.
does a particular TF bind more often in open chromatin regions? This could reveal TFs that are chromatin remodellers.
etc.
In our ATAC-seq dataset, we obtained differentially open chromatin regions between the Tfh and TH1 T cell populations. Above, we found motif occurrences for many TFs that were located in these regions. Now, we would like to infer which of these motifs may be regulators of the chromatin state. To this end, we will investigate the relation between the occurrence of (potential) transcription factor binding sites and differential chromatin accessibility.
Popular methods for investigating these relations are:
Over-Representation Analysis: a contingency matrix counts occurrences of an annotation (e.g. a TF motif occurrence) amongst two categories of objects (e.g. open chromatin regions vs closed chromatin regions), and a test evaluates if there is a statistical association between annotations and categories.
(Gene) Set Enrichment Analysis: objects (e.g. chromatin regions) are ranked using a continuous score (e.g. accessibility), and a statistical test evaluates whether annotated regions (e.g. those containing a TF motif occurrence) have, on average, higher (or lower) scores.
PWM Regression: objects (e.g. chromatin regions) have a continuous outcome measured (e.g. accessibility), and this is modelled as a linear function of the number of annotations (e.g. motif occurrences, for all motifs).
R package
monaLisa
contains several advanced implementations of these. But here, we will look into the simplest approach, so we can do it manually: ORA for TF binding sites.
Over-representation analysis (ORA) for PWMs
What is ORA?
Over-representation analysis (ORA) generally looks to see whether a level of a certain categorical variable is more frequently observed than expected by chance (i.e. over-represented, or enriched) within one subpopulation, compared to a background population (typically another subpopulation, or the entire population). Such tests can be applied in a wide variety of biological problems. For example, this could be over-representation of a given gene annotation amongst a list of differentially-expressed genes, compared to a background of all tested genes. In our setting, we’d look for TF binding sites being enriched in one set of regions (“TH1>Tf1” or the other), compared to a background (e.g. the rest of the genome, or the “common” peaks).
There are several types of basic statistical test that can be used to investigate over-representation: Fisher’s exact test, the Chi-square test, the hypergeometric test, the proportions test… These all differ in their modelling choices and assumptions, including the choice of “background”. Some are more commonly used in certain types of analyses than others (gene enrichments, TFBS enrichments…).
Here, we will consider Fisher’s exact test for examining enrichment of motif occurrences in our open chromatin regions.
ORA for our ATAC-seq
There is a pitfall when naively comparing different sets of genomic regions, such as promoter vs intergenic regions. Differences in global sequence composition (e.g. GC and CpG content) or level of conservation can easily bias the analysis. Unless these sources of biases are controlled for, the results can be meaningless.
For our analysis, such a pitfall would be to contrast ATAC-seq peaks with regions never accessible (genomic background) since these are expected to have different sequence composition (e.g. different GC content).
Instead, we will focus only on peak regions, and differentiate:
- regions which are specifically open in only one of the two cell populations (specific).
- regions which are accessible in both (common). These regions will serve here as a background. We assume that they carry similar regulatory functions as the first set, and thus share the same sequence composition and biases. We can interrogate this assumption later if need be.
Given these two non-overlapping sets of open regions, we subdivide each into those overlapping a target site for each TF of interest:
- regions with a TF target site
- regions without (non-target)
The outcome of these 2 categorisations is a 2-by-2 contingency table, one per TF motif and cell population, enumerating the four possible cases:
common | specific | |
---|---|---|
target | common.target | specific.target |
non-target | common.non-target | specific.non-target |
Fisher’s exact test is suited to looking for statistically significant association between these 2 binary categorisations of regions.
We can now address the question of an association between the occurrence of TF target sites and the T cell population-specific DNA accessibility patterns.
The test statistic which quantifies the association is the following odds-ratio (note the symmetry of the odds-ratio):
\[ {\rm OR} = \frac{ \rm (nb.specific.target) / (nb.specific.non.target) }{ \rm (nb.common.target) / (nb.common.non.target) } \]
\[ {\rm OR} = \frac{ \rm (nb.specific.target) * (nb.common.non.target) }{ \rm (nb.common.target) * (nb.specific.non.target) } \]
\[ {\rm OR} = \frac{ \rm (nb.specific.target) / (nb.common.target) }{ \rm (nb.specific.non.target) / (nb.common.non.target) } \]
There is an association if the OR > 1. An OR < 1 indicates a negative association (“anti-correlation”, “repulsion”, “depletion”). Since we generally assume that the binding of a TF to a DNA sequence opens up the chromatin and/or maintains it open, we do not focus on negative associations here.
Fisher’s exact test checks for significance of the association. For details, check the help page of
?fisher.test
, and your favourite statistics resources, e.g. for Wikipedia:
Categorising accessible chromatin peaks
- To perform this test on our ATAC-seq dataset, we first need to categorize regions into three classes: open in both T cell populations, open in Tfh only, and open in TH1 only. Here, we consider peaks with a differential openness FDR below 10% to differ across populations, and all others to be open in both (“common”).
# Note: already done above, doing again just in case
# base categorisation
$state <- peaks_gr$direction
peaks_gr$state[peaks_gr$FDR >= 0.1] <- "common"
peaks_gr
# setting the seed to always get the same "random" selection
set.seed(42)
# stratification
<- function(size, stratum) {
rnd_subsample_stratum <- peaks_gr[ peaks_gr$state == stratum ]
stratum_peaks_gr stopifnot( length(stratum_peaks_gr)>=size )
return(sample(names(stratum_peaks_gr), size=size, replace=F))
}
<- c(
sel_peak_names rnd_subsample_stratum(25000, "common"),
rnd_subsample_stratum(2500, "down"),
rnd_subsample_stratum(2500, "up")
)
<- peaks_gr[sel_peak_names]
peaks_gr
# report
addmargins(table(peaks_gr$state))
common down up Sum
25000 2500 2500 30000
ORA for an example TF (hGATA2) in TH1
- As an example, let us focus on the overlap of the TH1-specific regions (state is “TH1>Tfh”) with predicted sites for one of the TFs. Using the
xtabs
function on the binary motif match matrix and the state established above, we can generate a contingency table, that can be subset to a 2x2 table.
# if necessary:
# matches_rse <- readRDS("data/Motifs/jaspar2018_matches_rse.rds")
<- "MA0036.3" # (human) GATA2
motif_id <- xtabs(
xt ~ state + sites,
data.frame(
state=peaks_gr_sel$state,
sites=assay(matches_rse, "motifMatches")[names(peaks_gr_sel),motif_id]
))<- xt[c("common", "TH1>Tfh"),] # down = TH1 specific peaks
xt addmargins(xt)
sites
state FALSE TRUE Sum
common 23734 1266 25000
TH1>Tfh 2238 262 2500
Sum 25972 1528 27500
- We can calculate the odds-ratio (OR) manually, indicating how much more likely it is to observe target sites for this TF in TH1-specific peaks compared to common peaks.
<- (xt[1,1]*xt[2,2])/(xt[2,1]*xt[1,2])
OR OR
[1] 2.194717
=> A more than a 2x fold enrichment of hGATA2 binding sites amongst “TH1>Tfh” peaks compared to “common” peaks! This is a strong enrichment, but is it significant?
- We can perform Fisher’s test (which calculates the OR itself). Remember that we are only interested in positive associations (OR > 1), so we specify
alternative="greater"
to perform a one-sided test.
fisher.test(xt, alternative="greater")
Fisher's Exact Test for Count Data
data: xt
p-value < 2.2e-16
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
1.944818 Inf
sample estimates:
odds ratio
2.19475
=> This indicates that the strong 2-fold enrichment in hGATA2 target sites amongst TH1-specific peaks, relative to peaks open in both TH1 and Tfh, is indeed (highly) statistically significant.
# clean up
rm(motif_id, xt, OR)
Fisher-based ORA across all TFs and both cell populations
- We can do the same, iterating over all TFs of interest:
<- list()
ora_res system.time( {
for (motif_id in colnames(matches_rse)) {
<- xtabs(
xt ~ state + sites,
data.frame(
state=peaks_gr_sel$state,
sites=assay(matches_rse, "motifMatches")[
names(peaks_gr_sel),motif_id]
))<- fisher.test( xt[c("common","TH1>Tfh"),], alternative="greater")
ft.TH1 <- fisher.test( xt[c("common","Tfh>TH1"),], alternative="greater")
ft.Tfh <- data.frame(
ora_res[[motif_id]] row.names = NULL,
"Motif.ID" = motif_id, "TF.name" = colData(matches_rse)[motif_id,"name"],
"OR.TH1" = unname(ft.TH1$estimate), "pv.TH1" = ft.TH1$p.value,
"OR.Tfh" = unname(ft.Tfh$estimate), "pv.Tfh" = ft.Tfh$p.value
)
} # <15s on my laptop })
user system elapsed
8.222 0.459 8.688
<- as.data.frame(do.call("rbind",ora_res))
ora_res rownames(ora_res) <- NULL
# adjust p-values for multiple testing
$fdr.TH1 <- p.adjust(ora_res$pv.TH1, method="BH")
ora_res$fdr.Tfh <- p.adjust(ora_res$pv.Tfh, method="BH")
ora_res
# show (some)
print(head(tibble::column_to_rownames(ora_res, "Motif.ID")[,-c(3,5)], 20))
TF.name OR.TH1 OR.Tfh fdr.TH1 fdr.Tfh
MA0036.3 GATA2 2.1947504 1.290410 1.350614e-23 2.888763e-03
MA1421.1 TCF7L1 1.7952502 1.870206 3.104983e-22 2.034747e-25
MA0770.1 HSF2 2.1250370 1.381160 2.855073e-23 1.055822e-04
MA0636.1 BHLHE41 0.9236761 1.437457 7.644386e-01 3.003552e-03
MA0488.1 JUN 1.8078733 1.621747 2.660687e-15 7.130000e-10
MA0145.3 TFCP2 1.6739737 1.261494 7.390401e-05 6.281313e-02
MA0630.1 SHOX 1.5050123 1.569682 3.561512e-04 8.494469e-05
MA0146.2 Zfx 0.8936772 1.204154 9.793243e-01 2.732522e-03
MA0612.1 EMX1 1.6585375 2.288034 1.156616e-04 2.590676e-11
MA0757.1 ONECUT3 1.6704196 1.532446 5.266953e-10 5.766395e-07
MA0092.1 Hand1::Tcf3 1.7714225 1.280782 4.385537e-12 3.756361e-03
MA0800.1 EOMES 3.5109350 1.539566 5.726196e-74 6.033619e-07
MA0644.1 ESX1 1.5792975 1.561667 1.157016e-04 1.905015e-04
MA1131.1 FOSL2::JUN(var.2) 1.5051414 1.592882 5.348344e-06 2.492116e-07
MA0846.1 FOXC2 1.7093246 1.466996 5.120520e-21 1.423056e-10
MA0088.2 ZNF143 1.6439309 1.233937 5.833836e-12 4.174523e-03
MA0737.1 GLIS3 1.4111390 1.398283 8.864801e-04 1.220692e-03
MA1123.1 TWIST1 1.7806855 1.314523 3.704567e-16 2.484158e-04
MA0682.1 Pitx1 1.6250740 1.169396 2.305851e-07 7.397834e-02
MA0720.1 Shox2 1.5050123 1.569682 3.561512e-04 8.494469e-05
# clean up
rm(motif_id, xt, ft.TH1, ft.Tfh)
# keep for later use
saveRDS(ora_res, file='data/Motifs/ATAC_ORA.rds')
- We can extract the TFs most significantly associated to regions specifically open in TH1 samples. Here, we not only filter based on the significance, but also on the effect size (odds ratio):
head(subset(
order(ora_res$pv.TH1),],
ora_res[< 0.01 & OR.TH1 > 2), 10) # quite a stringent filter fdr.TH1
Motif.ID TF.name OR.TH1 pv.TH1 OR.Tfh pv.Tfh fdr.TH1 fdr.Tfh
45 MA0002.2 RUNX1 3.777648 1.558764e-179 1.222642 1.747404e-04 3.117528e-177 2.627676e-04
12 MA0800.1 EOMES 3.510935 5.726196e-76 1.539566 2.141935e-07 5.726196e-74 6.033619e-07
141 MA0689.1 TBX20 3.146719 4.869246e-68 1.486339 4.181152e-07 3.246164e-66 1.100303e-06
61 MA0801.1 MGA 3.122758 4.398276e-63 1.493900 6.886255e-07 2.199138e-61 1.679574e-06
46 MA1144.1 FOSL2::JUND 2.764599 1.525370e-47 2.261137 2.218020e-28 6.101479e-46 7.393398e-27
28 MA1141.1 FOS::JUND 2.602834 1.845132e-47 2.240753 5.294007e-32 6.150440e-46 2.647004e-30
103 MA0490.1 JUNB 2.550341 2.629518e-46 2.214168 8.948733e-32 7.512908e-45 3.579493e-30
69 MA0477.1 FOSL1 2.468310 1.913646e-43 2.229434 4.977903e-33 4.784116e-42 3.318602e-31
36 MA0511.2 RUNX2 2.357028 4.808296e-30 1.409425 4.553645e-05 9.616592e-29 7.784008e-05
60 MA0071.1 RORA 2.284347 7.028427e-29 1.427101 1.462307e-05 1.277896e-27 2.812128e-05
=> Plenty of enriched TFs.
- Similarly for the TFs most associated to regions specifically open in Tfh samples:
head(subset(
order(ora_res$pv.Tfh),],
ora_res[< 0.01 & OR.Tfh > 2), 10) fdr.Tfh
Motif.ID TF.name OR.TH1 pv.TH1 OR.Tfh pv.Tfh fdr.TH1 fdr.Tfh
108 MA0625.1 NFATC3 1.581686 3.670051e-10 2.403526 9.301669e-40 8.843496e-10 1.860334e-37
88 MA0786.1 POU3F1 1.807866 4.794184e-17 2.262524 3.244028e-34 2.338626e-16 3.244028e-32
69 MA0477.1 FOSL1 2.468310 1.913646e-43 2.229434 4.977903e-33 4.784116e-42 3.318602e-31
28 MA1141.1 FOS::JUND 2.602834 1.845132e-47 2.240753 5.294007e-32 6.150440e-46 2.647004e-30
103 MA0490.1 JUNB 2.550341 2.629518e-46 2.214168 8.948733e-32 7.512908e-45 3.579493e-30
46 MA1144.1 FOSL2::JUND 2.764599 1.525370e-47 2.261137 2.218020e-28 6.101479e-46 7.393398e-27
147 MA0787.1 POU3F2 1.672904 2.262084e-12 2.145987 2.619483e-28 6.854799e-12 7.484238e-27
98 MA0788.1 POU3F3 1.609643 7.136315e-11 2.097486 4.721601e-27 1.903017e-10 1.180400e-25
94 MA0784.1 POU1F1 1.671012 1.056211e-12 2.011368 4.809562e-24 3.580376e-12 9.619123e-23
57 MA0495.2 MAFF 2.022288 1.109122e-20 2.074152 2.448551e-22 7.922300e-20 4.451910e-21
=> A different set of TFs.
- Surprisingly, many of the above TFs seem to be significant for both populations. Are there many in this case? We’ll relax the significance & effect size thresholds to get a broader picture:
addmargins(table(
"signif TH1" = (ora_res$OR.TH1 >= 1.5 & ora_res$fdr.TH1 <= 0.05),
"signif Tfh" = (ora_res$OR.Tfh >= 1.5 & ora_res$fdr.Tfh <= 0.05)
))
signif Tfh
signif TH1 FALSE TRUE Sum
FALSE 49 18 67
TRUE 60 73 133
Sum 109 91 200
=> Many TFs are over-represented in both chromatin that is open specifically in TH1 cells AND in chromatin that is open specifically in Tfh cells! This is not necessarily expected.
- This might be due to an issue with our background selection. We should investigate.
Investigating the background
- We can check to see whether the up/down peaks have a similar sequence composition compared to our selected background of “common” peaks. Let’s do this manually:
# calculate 1-nt and 2-nt frequencies across all peak types
<- function(seqs) {
calc_avg_1nt2nt_freqs # helper function to calculate mono- and di-nucleotide frequencies
# across a set of sequences
c("n"=length(seqs),
round(100 * colMeans(cbind(
oligonucleotideFrequency(seqs, width=1, as.prob = T), # mononucleotide frequencies
oligonucleotideFrequency(seqs, width=2, as.prob = T) # dinucleotide freqs
2))
)),
}<- rbind(
avg_1nt2nt_freqs "all" = calc_avg_1nt2nt_freqs(peak_seqs_sel), # calculate for ALL sequences
do.call(rbind, lapply(
split(peak_seqs_sel, peaks_gr_sel$state),
# calculate for each state
calc_avg_1nt2nt_freqs))
)
# visualise using heatmap and table
<- heatmap(
h scale = "col", Colv=NA,
avg_1nt2nt_freqs, col = colorRampPalette(c("blue","black","red"))(16))
as.data.frame(avg_1nt2nt_freqs[
rev(h$rowInd) # this ensures the same row order as apparent in the heatmap
,])
n A C G T AA AC AG AT CA CC CG CT GA GC GG GT
all 30000 24.63 25.38 25.38 24.61 6.54 5.54 8.12 4.44 7.73 7.14 2.38 8.13 6.22 6.47 7.17 5.53
common 25000 24.50 25.51 25.51 24.47 6.48 5.51 8.12 4.39 7.69 7.22 2.47 8.14 6.22 6.56 7.24 5.49
up 2500 24.80 25.02 25.17 25.02 6.68 5.46 8.08 4.58 7.79 6.86 2.29 8.09 6.19 6.48 6.96 5.55
down 2500 25.80 24.34 24.31 25.55 7.02 5.92 8.11 4.75 8.11 6.65 1.54 8.03 6.28 5.55 6.63 5.86
TA TC TG TT
all 4.14 6.23 7.72 6.52
common 4.11 6.23 7.68 6.45
up 4.13 6.23 7.85 6.80
down 4.39 6.22 8.03 6.90
# clean up
rm(calc_avg_1nt2nt_freqs, avg_1nt2nt_freqs, h)
=> There are clear mono- and di-nucleotide frequency differences between peak “states”, in particular “TH1>Tfh” compared to the rest. The heatmap strongly emphasises the differences, but the actual differences shown in the table are generally within 1-2%.
Conclusion
The mononucleotide and dinucleotide frequency differences between the peak sets suggest that using one of these (common) as a background may introduce biases into the analysis. Options would be:
Continue and knowingly and explicitly accept the potential biases.
Choose a more appropriate background.
Use a motif ORA method that accounts for composition biases. For example, consider
monaLisa
’scalcBinnedMotifEnrR()
function.
Summary
We have seen in this section:
How to perform Fisher test-based ORA of motif occurrences amongst target sequences.
Possibilities for detecting issues with background selections in ORA.
Exercise 5: finding TF BS enrichments within promoter regions
Using the same ATAC-seq data from above, we would now like to perform a Fisher-based ORA, testing to see if there are any motifs enriched amongst the ATAC-seq peaks that were determined to be promoter regions.
Promoter-overlapping regions were flagged in the original data we loaded. From the overlap column on the
peaks_gr_sel
ranges that we built above, you can identify promoter-overlapping peaks with:
<- grepl(':P',mcols(peaks_gr_sel)$overlap)
isPromoter addmargins(table(isPromoter))
isPromoter
FALSE TRUE Sum
23299 6701 30000
=> ~6k promoter regions out of 30k kept.
- You can load the
matches_rse
object from earlier from the downloadable datasets if necessary:
# load from HDD if necessary
<- readRDS("data/Motifs/jaspar2018_matches_rse.rds") matches_rse
- Using ORA, build for each motif the underlying contingency table, and test (with
fisher.test()
) if there is an association between motif occurrence and promoter overlap. As we are working across a large number of related tests, consider using multiple testing correction (withp.adjust()
). Which motifs show the strongest enrichment to promoters? Show the top 10.
promoter | non-promoter | |
---|---|---|
target | promoter.target | non-promoter.target |
non-target | promoter.non-target | non-promoter.non-target |
<- list()
results print(system.time({
for (motif_id in colnames(matches_rse)) {
<- xtabs(
xt ~ isPromoter + sites,
data.frame(
isPromoter=isPromoter,
sites=assay(matches_rse, "motifMatches")[
names(peaks_gr_sel),motif_id]
))
<- fisher.test( xt, alternative="greater")
ft <- data.frame(
results[[motif_id]] row.names = NULL,
"Motif.ID" = motif_id,
"TF.name" = colData(matches_rse)[motif_id,"name"],
"OR" = unname(ft$estimate), "pv" = ft$p.value)
} #<10s }))
user system elapsed
7.909 0.473 8.440
rm(motif_id, xt, ft)
<- as.data.frame(do.call("rbind",results))
results $fdr <- p.adjust(results$pv, method="BH") # adjust p-value for multiple testing
results
# show top hits
order(results$OR, decreasing = TRUE)[1:10],] results[
Motif.ID TF.name OR pv fdr
MA1122.1 MA1122.1 TFDP1 15.287403 0.000000e+00 0.000000e+00
MA0162.3 MA0162.3 EGR1 11.779478 0.000000e+00 0.000000e+00
MA0146.2 MA0146.2 Zfx 9.715685 0.000000e+00 0.000000e+00
MA0748.1 MA0748.1 YY2 9.228412 2.866234e-189 2.605667e-188
MA0014.3 MA0014.3 PAX5 7.401562 5.935830e-173 5.161591e-172
MA0872.1 MA0872.1 TFAP2A(var.3) 6.729756 5.753480e-245 1.438370e-243
MA0746.1 MA0746.1 SP3 5.991699 0.000000e+00 0.000000e+00
MA0741.1 MA0741.1 KLF16 5.617325 0.000000e+00 0.000000e+00
MA0024.3 MA0024.3 E2F1 5.406678 7.398366e-55 2.901320e-54
MA0742.1 MA0742.1 Klf12 5.374739 0.000000e+00 0.000000e+00
Wrap Up
What now?
For the interested, a lot more content can be found for this topic in the vignette of the monaLisa
package.
Exam is coming up! Regarding this lesson, no questions will be asked on things not covered during the oral presentation.
Questions regarding the session can be sent to me by email at: alex.smith@fmi.ch, though I may field them off to Michael Stadler or Florian Geier.
Questions regarding the course in general should be sent to Robert Ivanek.
References
- Merika M, Orkin SH. (1993) DNA-binding specificity of GATA family transcription factors. Mol Cell Biol. 1993;13(7):3999-4010
- Stormo GD (2005) Modeling the specificity of protein-DNA interactions. Quant Biol 1(2) 115–130
- Ruan S and Stormo GD (2017) Inherent limitations of probabilistic models for protein-DNA binding specificity. PLoS Comput Biol 13(7): e1005638
- Subramanian A et al (2005) Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102 (43) 15545-15550
- Kuenzli M et al. (2020) Long-lived T follicular helper cells retain plasticity and help sustain humoral immunity. Science Immunology Vol. 5, Issue 45
- Machlab D, Burger L, Soneson C, Rijli FM, Schübeler D, Stadler MB (2022) monaLisa: an R/Bioconductor package for identifying regulatory motifs. Bioinformatics, 38(9), 2624-2625
Alternative & extended reading
(List available as of 11/05/2022)
Transcription factors:
Motif bioinformatics:
Statistics:
Session info
sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur/Monterey 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] glmnet_4.1-8 Matrix_1.6-4
[3] limma_3.52.0 monaLisa_1.2.0
[5] JASPAR2018_1.1.1 motifmatchr_1.18.0
[7] TFBSTools_1.34.0 MotifDb_1.38.0
[9] universalmotif_1.14.0 BSgenome.Hsapiens.UCSC.hg38_1.4.4
[11] BSgenome.Mmusculus.UCSC.mm10_1.4.3 BSgenome_1.64.0
[13] rtracklayer_1.56.0 Biostrings_2.64.0
[15] XVector_0.36.0 ggdendro_0.2.0
[17] ggplot2_3.5.0 stringr_1.5.1
[19] SummarizedExperiment_1.26.1 Biobase_2.56.0
[21] GenomicRanges_1.48.0 GenomeInfoDb_1.32.1
[23] IRanges_2.30.0 S4Vectors_0.34.0
[25] BiocGenerics_0.42.0 MatrixGenerics_1.8.0
[27] matrixStats_1.2.0 BiocParallel_1.30.0
[29] BiocPkgTools_1.14.0 htmlwidgets_1.6.4
[31] BiocStyle_2.24.0 magick_2.8.3
[33] png_0.1-8 knitr_1.46
loaded via a namespace (and not attached):
[1] circlize_0.4.16 sm_2.2-5.7.1 BiocFileCache_2.4.0
[4] plyr_1.8.9 igraph_1.5.1 splines_4.2.0
[7] digest_0.6.35 foreach_1.5.2 htmltools_0.5.8.1
[10] GO.db_3.15.0 fansi_1.0.6 magrittr_2.0.3
[13] memoise_2.0.1 cluster_2.1.6 doParallel_1.0.17
[16] tzdb_0.4.0 ComplexHeatmap_2.12.0 readr_2.1.5
[19] annotate_1.74.0 R.utils_2.12.3 colorspace_2.1-0
[22] blob_1.2.4 rvest_1.0.4 rappdirs_0.3.3
[25] xfun_0.43 dplyr_1.1.4 crayon_1.5.2
[28] RCurl_1.98-1.14 jsonlite_1.8.8 graph_1.74.0
[31] TFMPvalue_0.0.9 zoo_1.8-12 survival_3.5-8
[34] iterators_1.0.14 glue_1.7.0 gtable_0.3.4
[37] zlibbioc_1.42.0 GetoptLong_1.0.5 DelayedArray_0.22.0
[40] shape_1.4.6.1 scales_1.3.0 DBI_1.2.2
[43] Rcpp_1.0.12 xtable_1.8-4 clue_0.3-65
[46] bit_4.0.5 stabs_0.6-4 DT_0.33
[49] httr_1.4.7 RColorBrewer_1.1-3 farver_2.1.1
[52] pkgconfig_2.0.3 XML_3.99-0.16.1 R.methodsS3_1.8.2
[55] dbplyr_2.5.0 utf8_1.2.4 labeling_0.4.3
[58] tidyselect_1.2.1 rlang_1.1.3 reshape2_1.4.4
[61] AnnotationDbi_1.58.0 munsell_0.5.1 biocViews_1.64.0
[64] tools_4.2.0 cachem_1.0.8 cli_3.6.2
[67] DirichletMultinomial_1.38.0 generics_0.1.3 RSQLite_2.3.6
[70] evaluate_0.23 fastmap_1.1.1 yaml_2.3.8
[73] bit64_4.0.5 caTools_1.18.2 purrr_1.0.2
[76] KEGGREST_1.36.0 splitstackshape_1.4.8 gh_1.4.1
[79] RBGL_1.72.0 R.oo_1.26.0 poweRlaw_0.80.0
[82] pracma_2.4.4 xml2_1.3.6 compiler_4.2.0
[85] rstudioapi_0.16.0 filelock_1.0.3 curl_5.2.1
[88] tibble_3.2.1 stringi_1.8.3 lattice_0.22-6
[91] CNEr_1.32.0 vctrs_0.6.5 pillar_1.9.0
[94] lifecycle_1.0.4 RUnit_0.4.33 BiocManager_1.30.22
[97] GlobalOptions_0.1.2 data.table_1.15.4 bitops_1.0-7
[100] R6_2.5.1 BiocIO_1.6.0 codetools_0.2-20
[103] MASS_7.3-60.0.1 gtools_3.9.5 seqLogo_1.62.0
[106] rjson_0.2.21 withr_3.0.0 GenomicAlignments_1.32.0
[109] Rsamtools_2.12.0 GenomeInfoDbData_1.2.8 parallel_4.2.0
[112] hms_1.1.3 tidyr_1.3.1 rmarkdown_2.26
[115] vioplot_0.4.0 restfulr_0.0.15