Sequence motifs: from position weight matrix to motif enrichment

Author

Florian GEIER, Alex SMITH

Published

May 22, 2024

Set Up

Goal

Learn:

what a sequence motif is, how it is represented, how to retrieve a known motif, and how one can be built from scratch;
how to score a sequence against a motif, how to consider scores as markers of potential binding sites, and how to search DNA for potential Transcription Factor (TF) binding sites;
how to test if certain genomic regions are enriched for specific TF binding sites.

The focus in this session will be on the concepts, models, and processes, rather than the actual code.

NB: this handout is voluntarily made more extensive than what will be seen in class, in particular when considering the content of the advanced document, so as to serve as a potential reference for the interested. It will combine more theoretical sections with more research-like sections. Exam questions will only relate to things covered in class!

Required packages

We will use the following set of packages during the lecture
(notice that the Human genome package differs if youare working on the FMI’s RStudio server):

## misc.
library(BiocParallel)    # parallel computation
library(
  SummarizedExperiment)  # to manipulate [Ranged]SummarizedExperiment objects
library(stringr)         # manipulating character strings


## plotting
library(ggplot2)         # for plotting


## genomes
# Mouse Genome
library(BSgenome.Mmusculus.UCSC.mm10) 

# Human genome
# NB: if using FMI RStudio server, load a different version
library(BSgenome.Hsapiens.UCSC.hg38)  
BSgen.Hu <- BSgenome.Hsapiens.UCSC.hg38
# library(BSgenome.Hsapiens.GENCODE.GRCh38.gencode38)   
# BSgen.Hu <- BSgenome.Hsapiens.GENCODE.GRCh38.gencode38


## pacakges w/ sequence motif-related functions
library(Biostrings)      # manipualting sequences & motif representations
library(universalmotif)  # manipulating motif representations
library(MotifDb)         # accessing motif DBs
library(TFBSTools)       # accessing motif DBs
library(motifmatchr)     # scanning sequences for motif occurences
library(rtracklayer)     # import/export from BED files & more
library(JASPAR2018)      # JASPAR 2018 database

Example data

A quick overview of data used in this session.

Human GATA2 motifs

We will illustrate motif creation, searching and scoring, using data and database entries for Human transcription factor GATA2, amongst others.

M. Merika & S. H. Orkin 1993,
“DNA-binding specificity of GATA family transcription factors”
PMC359949
RDS file

Human K562 TF ChIP-seq

We will illustrate de novo motif calling on some transcription factor ChIP-seq data obtained on Human K562 cells for GATA2 and MYC:

GATA2:
- Original link: ENCSR000EWG
- BED file
MYC:
- Original link: ENCSR744JJU
- BED file

Immune cell accessible chromatin regions

We will illustrate analysing the enrichment of regions of interest in motif occurrences using an example dataset from an ATAC-seq experiment that determined accessible chromatin regions in different Mouse immune cells.

Künzli et al. 2020,
“Long-lived T follicular helper cells retain plasticity and help sustain humoral immunity”
PAPER
The authors identify a new role of T follicular helper (TFH) cells during viral infection. During viral infection, antigen-specific CD4+ T cells undergo proliferative expansion and differentiate into several subtypes among which TFH stay out as long-lived, highly plastic and can give rise to all subtypes after rechallenge.
Data processing in brief:
- Reads were mapped to Mouse (mm10) using bowtie2
- Reads were counted in genomic windows of 150bp
- Differential accessibility was tested using the csaw/edgeR framework
- Neighbouring differentially accessible windows were merged to form differentially accessible regions.
- Raw data deposited in GEO: GSE139423
- Processed data: merged_peaks_TFH_TH1.tsv.gz

Establishing, representing & accessing sequence motifs

Historical background

A motif is a repeating pattern, with possible small variations. In genomics, motifs are sequences that repeat in DNA or RNA, with possible nucleotide variations. Such motifs are typically associated with some sort of function, e.g. DNA binding sites for transcription factors or transcription initiation, RNA binding sites for translation initiation or micro-RNA response elements, etc. In practice, typical DNA motifs are 6-20 nucleotides long.
Historically, DNA sequence motifs were first defined from sets of experimentally-identified transcription factor binding sites, e.g. using electrophoretic mobility shift assay (EMSA), protein binding arrays, ChIP-Seq or high-throughput SELEX.
These experiments yield sets of sequences with high affinity for a given TF under study. For example, let’s take a look at some sequences identified as binding the human transcription factor GATA2 (hGATA2):

Example hGATA2-bound sequences. Source: Merika M & Orkin SH (1993)

M. Merika & S. H. Orkin 1993,
“DNA-binding specificity of GATA family transcription factors” PAPER
This is just a subset of the GATA-bound sequences, and it is already a lot of raw information! We need to find a more convenient way of representing a sequence motif and its variations than using full sequence alignments like this (especially as multiple sequence alignments quickly become computationally costly to calculate). Obviously, if there were no variations, the sequence of the exact match would suffice.
These 19 sequences are aligned around the core “GATA” motif that is recognised by the hGATA2 TF. Considering all these sequences aligned in this way, we can derive a “consensus” sequence for the binding site, which here is “SGATAN”. This indicates a highly conserved “GATA” sequence in the middle, with a preceding position that is either G or C, and a trailing position that can be any nucleotide. However, this consensus representation loses a lot of information about the precise variability at each position. Can we do better programmatically?

Computationally representing sequence motifs

Motif packages in R

There are quite a number of different Bioconductor packages and classes designed to computationally represent, manipulate and use sequence motifs. Many are built off of base functionality defined in the Biostrings package. However, others (such as TFBSTools) define their own classes, hindering inter-operability and leading to redundancy. The universalmotif package aims at providing a uniform interface to all these different classes. Despite this, we still rely on all these different packages in order to use all required functionality, e.g. fast motif searches.
Note that R packages are born, used, and some fall into disuse, for lack of interest, or more commonly, a maintainer. Some of the above packages may no longer work in a few years, forcing one to search for alternatives.

The PCM

Looking at the sequence alignment above, by using the core motif alignment and treating positions independently, we can derive a compact representation for it, i.e. a table counting the occurrences for each nucleotide (here, DNA, so A, C, T, G) in each position. This is what is known as a position count matrix (PCM), or as a position frequency matrix (PFM):

Some hGATA2 example sequences (restricted to the core of the binding site) have been prepared for you to load into R. Once loaded, we can view them in a pretty fashion by showing a DNAStringSet object made from them using package Biostrings:

hGATA2_motif_seqs <- readRDS(
  'data/Motifs/hGATA2_binding_sites.rds')
show(DNAStringSet(hGATA2_motif_seqs))

DNAStringSet object of length 19:
     width seq
 [1]     6 CGATAG
 [2]     6 CGATAT
 [3]     6 GGATAG
 [4]     6 GGATAG
 [5]     6 GGATAC
 ...   ... ...
[15]     6 CGATAT
[16]     6 CGATAG
[17]     6 GGATAT
[18]     6 GGATAC
[19]     6 GGATAC

From these sequences, we can create a position count matrix (PCM) using the create_motif function from universalmotif:

hGATA2_motif <- create_motif(
  hGATA2_motif_seqs, type="PCM", name = "GATA2")
hGATA2_motif


       Motif name:   GATA2
         Alphabet:   DNA
             Type:   PCM
          Strands:   +-
         Total IC:   9.19
      Pseudocount:   0
        Consensus:   SGATAN
     Target sites:   19

   S  G  A  T  A N
A  0  0 19  0 19 2
C  9  0  0  0  0 6
G 10 19  0  0  0 8
T  0  0  0 19  0 3

This type of representation entirely depends on the number of sequences that were initially aligned. We need something that is more generally applicable.

The PPM

Another representation, which abstracts from the total number of counts, is the position probability matrix (PPM). It normalizes the count matrix by the number of observations, resulting in an estimate for the probability of observing each letter at a given position. We can use functions convert_type and view_motifs from package universalmotif to convert our previous motif object, and show a graphical representation of it:

convert_type(hGATA2_motif, type="PPM")


       Motif name:   GATA2
         Alphabet:   DNA
             Type:   PPM
          Strands:   +-
         Total IC:   9.19
      Pseudocount:   0
        Consensus:   SGATAN
     Target sites:   19

     S G A T A    N
A 0.00 0 1 0 1 0.11
C 0.47 0 0 0 0 0.32
G 0.53 1 0 0 0 0.42
T 0.00 0 0 1 0 0.16

view_motifs(hGATA2_motif, use.type = "PPM")

This motif representation no longer depends on the number of sequences aligned. However, it does not give us any idea of how “surprising” it would be to observe any given sequence that matches the motif. How informative is the motif? How can we use this to actually find other occurrences of it in genomic sequences? We need some idea of what sequences we can expect to find by chance.

The PWM

Closely related to the previous is the position weight matrix (PWM), which uses the PPM and a background model to get an idea of “expected” sequences, and thereby can assign weights to each letter at each position. In this context, “background” refers here to the base composition at non-specific sites (i.e. here, sequences that do not necessarily bind the TF). The background is often assumed to be either uniform (i.e. \(1/4\) for each of A, T, C, and G), or to be the genome-wide nucleotide frequencies. PWM scores are defined as the log-ratio of the PPM and the background frequencies:

\[ {\rm PWM}_{ij} = \log_2\left(\frac{\rm PPM_{ij}}{\rm bg_{j}}\right) \] where \({\rm bg}_{j}\) is the background frequency of base \(j\).

Note: PWM scores are negative if the probability for a given letter at a given position is below the background probability.
In cases where relatively few sequences were used to derive a motif, or when a position was very highly conserved, there can be 0 values in the PCM. In order to confine the derived PWM scores to finite values, one has to make use of pseudo-counts:

## no pseudo-counts:
convert_type(hGATA2_motif, type="PWM")


       Motif name:   GATA2
         Alphabet:   DNA
             Type:   PWM
          Strands:   +-
         Total IC:   9.19
      Pseudocount:   0
        Consensus:   SGATAN
     Target sites:   19

     S    G    A    T    A     N
A -Inf -Inf    2 -Inf    2 -1.25
C 0.92 -Inf -Inf -Inf -Inf  0.34
G 1.07    2 -Inf -Inf -Inf  0.75
T -Inf -Inf -Inf    2 -Inf -0.66

## pseudocount of 1:
convert_type(hGATA2_motif, type="PWM", pseudocount = 1)


       Motif name:   GATA2
         Alphabet:   DNA
             Type:   PWM
          Strands:   +-
         Total IC:   9.19
      Pseudocount:   0
        Consensus:   SGATAN
     Target sites:   19

      S     G     A     T     A     N
A -4.32 -4.32  1.94 -4.32  1.94 -1.15
C  0.89 -4.32 -4.32 -4.32 -4.32  0.32
G  1.04  1.94 -4.32 -4.32 -4.32  0.72
T -4.32 -4.32 -4.32  1.94 -4.32 -0.62

# possible bug: pseudocount in printed object is still shown as 0 when converting

## compare to manually specifying uniform background frequencies:
ppm <- create_motif(hGATA2_motif_seqs, type="PPM", name = "GATA2", pseudocount = 1, bkg = rep(0.25, 4))
convert_type(ppm, type="PWM")


       Motif name:   GATA2
         Alphabet:   DNA
             Type:   PWM
          Strands:   +-
         Total IC:   7.84
      Pseudocount:   1
        Consensus:   SGATAN
     Target sites:   19

      S     G     A     T     A     N
A -4.32 -4.32  1.94 -4.32  1.94 -1.15
C  0.89 -4.32 -4.32 -4.32 -4.32  0.32
G  1.04  1.94 -4.32 -4.32 -4.32  0.72
T -4.32 -4.32 -4.32  1.94 -4.32 -0.62

# => same, but PC is shown properly

The example below uses as a background frequency the occurrence of A, C, G, and T on chromosome 1 of the Human genome (not uniform!).

nt_cnts <- oligonucleotideFrequency(BSgen.Hu$chr1, width=1)
nt_cnts

       A        C        G        T 
67070277 48055043 48111528 67244164

nt_freqs <- nt_cnts / sum(nt_cnts)
nt_freqs

        A         C         G         T 
0.2910013 0.2084989 0.2087440 0.2917558

=> notice the background frequencies across chromosome 1 are not uniform, there is a clear depletion in GC. This is common in vertebrates, as we will see later.

pwm <- create_motif(
  hGATA2_motif_seqs, type="PWM", name="GATA2", pseudocount = 1, bkg = nt_freqs)

print(pwm)


       Motif name:   GATA2
         Alphabet:   DNA
             Type:   PWM
          Strands:   +-
         Total IC:   7.84
      Pseudocount:   1
        Consensus:   SGATAN
     Target sites:   19

      S     G     A     T     A     N
A -4.54 -4.54  1.73 -4.54  1.73 -1.37
C  1.15 -4.06 -4.06 -4.06 -4.06  0.58
G  1.30  2.21 -4.06 -4.06 -4.06  0.98
T -4.54 -4.54 -4.54  1.72 -4.54 -0.84

If you compare this to the PWM with PWM obtained above with uniform background probabilities, notice that the scores for Gs and Cs increase, and those for As and Ts decrease. This reflects the non-uniform background scores, where seeing a G or a C is more suprising than seeing an A or a T.
PWMs are the representation typically used to scan sequences and identify matches, e.g. potential TF binding sites. It however lacks a “nice” graphical representation:

view_motifs(pwm, use.type = "PWM")

=> For each position, positive weights are shown above the horizontal axis, negative below.

The ICM

The final representation of a sequence motif uses information content as a means to highlight conserved over non-conserved sites. The information content matrix (ICM) is a scaled version of the PPM, where each position gets scaled by its information content (IC) measured in bits. Most conserved sites have an IC of 2 bits. This form can be visualised as a “sequence logo”, the most common graphical representation of a motif:

view_motifs(hGATA2_motif, use.type = "ICM")

Note: For the curious, IC can be represented using Shannon’s entropy, which does not take background frequencies into account, or using relative entropy, the latter being better suited to non-uniform background frequencies. Each leads to different logos.

Conclusion

We can now build a compact representation of a motif using a set of aligned sequences that contain it. However, we would not want to do this for every motif we want to work with! Many motifs have already been experimentally or computationally examined in many genomes, and their ready-made representations have been collated in databases, many of which can be accessed from within R.

Motif databases

There are several resources providing PWM matrices, some are free (i.e., Jaspar), some require a license (i.e., Transfac).
Many such resources are directly available through R packages, e.g. the MotifDb, TFBSTools and JASPAR2018 packages.
All these databases and packages collect the PWMs from different primary sources and there is a lot of redundancy in them. Even a single resource might have multiple PWMs for the same TF (for example acquired in different species, by different techniques).

Iteratively refine a MotifDB query

For example, lets check the human GATA2 transcription factor in the MotifDb package, which compiles access to several databases. The query() function builds a query to resources the package has access to, and fetches a list of motif representations with requested strings in the representation’s metadata.
The query() function takes as arguments strings that should (or should not) match to any field in the underlying database. It is not always clear what strings to search for. We can iteratively define, examine, and then refine our query to the database until we are happy with the results.

query(MotifDb, andStrings=c("GATA2"))

MotifDb object of length 16
| Created from downloaded public sources, last update: 2022-Mar-04
| 16 position frequency matrices from 10 sources:
|        HOCOMOCOv10:    2
| HOCOMOCOv11-core-A:    1
| HOCOMOCOv11-secondary-A:    1
|              HOMER:    1
|        JASPAR_2014:    1
|        JASPAR_CORE:    1
|         jaspar2016:    2
|         jaspar2018:    4
|         jaspar2022:    2
|       SwissRegulon:    1
| 4 organism/s
|           Hsapiens:   12
|          Athaliana:    2
|          Mmusculus:    1
|              other:    1
Hsapiens-HOCOMOCOv10-GATA2_HUMAN.H10MO.A 
Mmusculus-HOCOMOCOv10-GATA2_MOUSE.H10MO.A 
Hsapiens-HOCOMOCOv11-core-A-GATA2_HUMAN.H11MO.0.A 
Hsapiens-HOCOMOCOv11-secondary-A-GATA2_HUMAN.H11MO.1.A 
NA-HOMER-Gata2(Zf)/K562-GATA2-ChIP-Seq(GSE18829)/Homer 
...
Hsapiens-jaspar2018-GATA2-MA0036.3 
Athaliana-jaspar2018-GATA20-MA1324.1 
Hsapiens-jaspar2022-GATA2-MA0036.3 
Athaliana-jaspar2022-GATA20-MA1324.1 
Hsapiens-SwissRegulon-GATA2.SwissRegulon

=> 16 motif representations, across many resources & organisms.

Let’s focus on JASPAR 2018 entries:

query(MotifDb, andStrings=c("GATA2", "JASPAR2018"))

MotifDb object of length 4
| Created from downloaded public sources, last update: 2022-Mar-04
| 4 position frequency matrices from 1 source:
|         jaspar2018:    4
| 2 organism/s
|           Hsapiens:    3
|          Athaliana:    1
Hsapiens-jaspar2018-GATA2-MA0036.1 
Hsapiens-jaspar2018-GATA2-MA0036.2 
Hsapiens-jaspar2018-GATA2-MA0036.3 
Athaliana-jaspar2018-GATA20-MA1324.1

=> 4 motif representations for 2 organisms.

Let’s focus on entries for Human:

query(MotifDb, andStrings=c("GATA2", "JASPAR2018", "Hsapiens"))

MotifDb object of length 3
| Created from downloaded public sources, last update: 2022-Mar-04
| 3 position frequency matrices from 1 source:
|         jaspar2018:    3
| 1 organism/s
|           Hsapiens:    3
Hsapiens-jaspar2018-GATA2-MA0036.1 
Hsapiens-jaspar2018-GATA2-MA0036.2 
Hsapiens-jaspar2018-GATA2-MA0036.3

=> 3 motif representations in Human.

Visualise redundant motifs

This final set of human GATA2 motifs from the Jaspar2018 database is quite redundant, as the following picture shows. Here, we add our own matrix to the end of the list. We use universalmotif’s convert_motifs to convert the motif representations for our needs.

res <- query(MotifDb, andStrings=c("GATA2", "JASPAR2018", "Hsapiens"))
res <- convert_motifs(res)
res[[4]] <- hGATA2_motif
res

[[1]]

       Motif name:   GATA2
   Alternate name:   MA0036.1
           Family:   GATA-type zinc fingers
         Organism:   Hsapiens
         Alphabet:   DNA
             Type:   PPM
          Strands:   +-
         Total IC:   5.35
      Pseudocount:   1
        Consensus:   NGATR
     Target sites:   53
       Extra info:   [dataSource] jaspar2018 

     N    G    A T    R
A 0.25 0.00 0.98 0 0.47
C 0.25 0.09 0.00 0 0.13
G 0.34 0.91 0.02 0 0.28
T 0.17 0.00 0.00 1 0.11

[[2]]

       Motif name:   GATA2
   Alternate name:   MA0036.2
           Family:   GATA-type zinc fingers
         Organism:   Hsapiens
         Alphabet:   DNA
             Type:   PPM
          Strands:   +-
         Total IC:   13.41
      Pseudocount:   1
        Consensus:   NNNTTCTTATCWSH
     Target sites:   4380
       Extra info:   [dataSource] jaspar2018 

     N    N    N    T    T    C    T    T A T C    W    S    H
A 0.36 0.26 0.36 0.24 0.01 0.00 0.00 0.01 1 0 0 0.28 0.02 0.29
C 0.13 0.18 0.17 0.12 0.24 0.69 0.04 0.00 0 0 1 0.00 0.40 0.27
G 0.17 0.39 0.28 0.10 0.17 0.21 0.00 0.00 0 0 0 0.00 0.45 0.05
T 0.34 0.18 0.20 0.54 0.58 0.10 0.96 0.99 0 1 0 0.72 0.12 0.39

[[3]]

       Motif name:   GATA2
   Alternate name:   MA0036.3
           Family:   GATA-type zinc fingers
         Organism:   Hsapiens
         Alphabet:   DNA
             Type:   PPM
          Strands:   +-
         Total IC:   10.8
      Pseudocount:   1
        Consensus:   NNCTTATCTBN
     Target sites:   14213
       Extra info:   [dataSource] jaspar2018 

     N    N    C    T    T    A    T    C    T    B    N
A 0.23 0.14 0.07 0.03 0.03 0.99 0.01 0.01 0.17 0.20 0.23
C 0.21 0.29 0.75 0.02 0.00 0.00 0.01 0.97 0.03 0.26 0.27
G 0.23 0.21 0.09 0.01 0.00 0.00 0.00 0.01 0.03 0.26 0.16
T 0.33 0.36 0.10 0.94 0.97 0.00 0.98 0.01 0.77 0.28 0.34

[[4]]

       Motif name:   GATA2
         Alphabet:   DNA
             Type:   PCM
          Strands:   +-
         Total IC:   9.19
      Pseudocount:   0
        Consensus:   SGATAN
     Target sites:   19

   S  G  A  T  A N
A  0  0 19  0 19 2
C  9  0  0  0  0 6
G 10 19  0  0  0 8
T  0  0  0 19  0 3

view_motifs(res, dedup.names = TRUE)

=> Notice that the 2nd and 3rd representations were reverse-complemented. Indeed, function view_motifs automatically aligns and reverse-complements if necessary to obtain the best visual representation of a set of motifs.

This redundancy complicates the interpretation of motif analysis, and its not immediately clear which motif(s) is(are) superior in a redundant set. Often the motif with he highest information content is only reflecting the consensus site or the best scoring site, and will not match relevant sites with lower affinity. We will see later about scoring motif matches in sequences. Collapsing results across similar motifs is beyond the scope of this session.
For now, with the above, we have access to known motifs, but how can we find motifs de novo?

De novo motif identification

We might have access to some data we want to mine for evidence of new, unknown binding site motifs, such as ChIP-seq or ATAC-seq data in a non-model organism, or for a newly-discovered TF. We essentially want to find short sequences that are over-represented in a set of larger sequences of interest (e.g. ChIP-seq or ATAC-seq peaks) when compared to a background set of sequences (most naively, the whole genome, or all promoters, etc.).
Tools such as HOMER, MEME, or RSAT can be used to do this, and describing each of them in detail is beyond the scope of this course. Several start by finding over-represented k-mers (i.e. subsequences of length k) in sequences of interest, so we will explain this approach.
In some Human cell line ChIP-seq data from ENCODE, we will manually look for k-mers enriched in GATA2 ChIP-seq peaks compared to a selected background. As a quick-n-dirty approach to selecting a background, we will use ChIP-seq peaks from another TF called MYC, which we’d expect to bind to the same sort of genomic region (promoters), but not likely at exactly the same regions.
- ChIP-seq peak regions for GATA2 on Human K562 cell cultures (GRCh38 genome build): download from course page or c.f. ENCFF075FIE
- ChIP-seq peak regions for MYC on Human K562 cell cultures (GRCh38 genome build): download from course page or c.f. ENCFF462OYX

Preparing the target & background sequences

First, let’s get the ChIP-seq peak sequences for both TFs, and as a sanity check, view the distribution of the lengths of the peaks:

# grab peak BED files & import to GRanges
GATA2_gr <- rtracklayer::import("./data/Motifs/ENCFF075FIE.bigBed")
MYC_gr   <- rtracklayer::import("./data/Motifs/ENCFF462OYX.bigBed")

# grab peak sequences using these & the human genome (GRCh38)
GATA2_peak_seq <- getSeq(BSgen.Hu, GATA2_gr)
MYC_peak_seq   <- getSeq(BSgen.Hu, MYC_gr)

# look at peak width distributions
hist(width(GATA2_peak_seq), sub=length(GATA2_peak_seq), breaks=64)

hist(width(MYC_peak_seq),   sub=length(MYC_peak_seq),   breaks=64)

=> GATA2 peaks are typically ~300nts wide, while MYC peaks are ~445.

We will homogenise the sequence lengths for each TF. Ideally, we’d also want the background sequences to have similar lengths to the foreground ones too, but this is good enough for this example.

subset_by_length <- function(seqs, min_len, max_len) {
  lens <- width(seqs)
  return(seqs[ !is.na(lens) & lens>=min_len & lens<=max_len ])
}

print(length( GATA2_peak_seq <- subset_by_length(GATA2_peak_seq, 290, 310) ))

[1] 3244

print(length(MYC_peak_seq   <- subset_by_length(MYC_peak_seq,   435, 455) ))

[1] 7115

In this example, we already know what the hGATA2 motif looks like. Out of interest, we can also look up what the known human MYC binding site motif looks like:

res <- query(MotifDb, andStrings=c("MYC", "JASPAR2018", "Hsapiens"))
view_motifs(convert_motifs(res))

# clean up
rm(res)

=> Interestingly, MYC binds to a core CACGTG motif, that is its own reverse-complement.
We can also note that it does not resemble GATA in any way!

Building k-mer frequencies

We want to establish the frequencies of k-mers across both sets of sequences.
- How to choose k? Increasing k means the number of possible k-mers increases, and the number of occurrences of each will decrease. Longer k-mers can capture longer motifs and potentially more variation therein, but with too few counts, one might not be able to test for enrichment. One typically has to try several values of k and select one that offers a good trade-off between motif representation, counts per k-mer, and analysis complexity.
- Here we will try with k=6, for a total of 4^6=4096 different k-mers, which we know should encapsulate the GATA motif nicely.
Let’s count the occurrences of each k-mer across all sequences of both TFs. We can thus check to see if there are any k-mers with near-0 counts.
- NB: counting from one strand only, RCs are thus counted independently.
- NB: oligonucleotideFrequency generates counts for each input sequence, so we’ll sum across these to get total counts.

# GATA2
GATA2_6mer_mat <- Biostrings::oligonucleotideFrequency(
  GATA2_peak_seq, width = 6, as.prob = F) ## using k=6
hist(colSums(GATA2_6mer_mat), breaks=128,
     sub=ifelse(!any(colSums(GATA2_6mer_mat)<=5), "OK", "low counts found"))

# MYC
MYC_6mer_mat <- Biostrings::oligonucleotideFrequency(
  MYC_peak_seq, width = 6, as.prob = F)
hist(colSums(MYC_6mer_mat), breaks=128,
     sub=ifelse(!any(colSums(MYC_6mer_mat)<=5), "OK", "low counts found"))

=> No k-mers with very low counts found, we can continue with our chosen value of k.

Convert to frequencies:

GATA2_6mer_freq <- colSums(GATA2_6mer_mat) / sum(GATA2_6mer_mat)
MYC_6mer_freq   <- colSums(MYC_6mer_mat)   / sum(MYC_6mer_mat)

rm(GATA2_6mer_mat, MYC_6mer_mat)

Calculating k-mer enrichment

Now, we can make an M-A plot showing the enrichment of various k-mers in the GATA peaks compared to in the MYC peaks.
- NB: For illustration’s sake, we’ll highlight which k-mers perfectly match each known motif (though of course this would not be known in a de novo motif search setting!).
- We’ll also highlight effects of nucleotide composition.

# get single nucleotide frequencies for each k-mer, 
# so we can look at potential %GC bias
kmer_descr <- Biostrings::oligonucleotideFrequency(
    DNAStringSet(names(GATA2_6mer_freq)), width=1)

# create an MA plot
dat <- data.frame(
  row.names = names(GATA2_6mer_freq),
  "pGC" = 100 * rowSums(kmer_descr[,c("C","G")]) / nchar(names(GATA2_6mer_freq))[1], 
  "A" = 0.5*(log2(GATA2_6mer_freq) + log2(MYC_6mer_freq)), # average
  "M" =      log2(GATA2_6mer_freq) - log2(MYC_6mer_freq)  # diff
)

# highlight k-mers containing GATA/MYC motifs & RCs thereof, as well as CpGs
dat$detected <- ifelse(
  # note: the GATA motif can match many 6-mers
  str_detect(rownames(dat), "GATA|TATC"), "GATA",  
  ifelse(
    # note: the CACGTG motif can only exactly match 1 6-mer 
    #       (due to it being its own RC)
    str_detect(rownames(dat), "CACGTG"), "MYC", 
    ifelse( 
      # note: CpGs directional
      str_detect(rownames(dat), "CG") , "CpG", "none")) )

# plot
ggplot(dat, aes(x=.data[["A"]], y=.data[["M"]])) +
  geom_hline(yintercept = 0) +
  geom_point(aes(color=.data[["detected"]]), size=0.65)

ggplot(dat, aes(x=.data[["A"]], y=.data[["M"]])) +
  geom_hline(yintercept = 0) +
  geom_point(aes(color=.data[["pGC"]]), size=0.65)

=> As you can see, many of the GATA k-mers (green) are enriched in the GATA peak sequences relative to the MYC peak sequences. Conversely, the MYC k-mer (blue) is depleted.
=> Not all GATA k-mers are enriched, suggesting there is some biologically-relevant sequence specificity outside of the core GATA sequence. We’ll have a closer look below.
=> What are all the red motifs? These motifs all contain “CG”, aka CpGs. Due to the mechanisms of DNA nucleotide methylation, mutation and repair, CpGs are typically being lost in vertebrates, and so motifs containing this dinucleotide are typically under-represented in genomic sequences (cf for example this paper). Furthermore, there is a general %GC bias in the k-mer abundances and differential abundances, visible in the 2nd plot. Correcting for these biases would require using more advanced methodology (eg that implemented in monaLisa::getKmerFreq).

As a sanity check, let’s have a look at some of the most enriched and most depleted GATA-containing k-mers:

tmp <- dat[dat$detected=="GATA",] # select GATA kmers
tmp <- tmp[order(tmp$M, decreasing=TRUE),] # order by enrichment
head(tmp, 10) # top enriched GATA kmers

            pGC         A        M detected
GATAAG 33.33333 -11.40214 3.100328     GATA
CTTATC 33.33333 -11.40132 3.036949     GATA
TTATCT 16.66667 -11.06607 2.817302     GATA
AGATAA 16.66667 -11.00824 2.740202     GATA
TTATCA 16.66667 -11.62906 2.381462     GATA
GTTATC 33.33333 -12.31875 2.354227     GATA
TGATAA 16.66667 -11.60033 2.239393     GATA
CAGATA 33.33333 -11.66407 2.226085     GATA
TATCTC 33.33333 -11.61006 2.219941     GATA
GAGATA 33.33333 -11.48220 2.091920     GATA

tail(tmp, 10) # top depleted GATA kmers

            pGC         A           M detected
GGGATA 50.00000 -12.77993  0.09306402     GATA
GTATCC 50.00000 -13.36322  0.08972058     GATA
CGATAG 50.00000 -14.51261  0.06395115     GATA
ATATCG 33.33333 -14.92645  0.01539685     GATA
TATCGG 50.00000 -14.64366 -0.03968622     GATA
TATCCG 50.00000 -14.46358 -0.24794580     GATA
CGGATA 50.00000 -14.46812 -0.31380230     GATA
GGATAC 50.00000 -13.58191 -0.40715278     GATA
CGATAC 50.00000 -15.14177 -0.41523751     GATA
GTATCG 50.00000 -15.28722 -0.52760403     GATA

rm(tmp)

=> Many of those depleted seem to contain CGs, and have a low overall frequency, indicating that they might be being lost over evolution due to the same mechanism discussed above. Though as mentioned, there might also be some level of biologically-relevant sequence specificity of human GATA2 binding outside of the core GATA sequence.

In the de novo identification setting, the set of all k-mers above a certain enrichment threshold here could be explored for evidence of a new TF binding site motif.

tmp <- dat[dat$M>=1.25,] # select enriched motifs above an arbitrary threshold
head(tmp[order(tmp$M, decreasing = TRUE),], 20)

            pGC         A        M detected
GATAAG 33.33333 -11.40214 3.100328     GATA
CTTATC 33.33333 -11.40132 3.036949     GATA
TTATCT 16.66667 -11.06607 2.817302     GATA
AGATAA 16.66667 -11.00824 2.740202     GATA
TTATCA 16.66667 -11.62906 2.381462     GATA
GTTATC 33.33333 -12.31875 2.354227     GATA
TGATAA 16.66667 -11.60033 2.239393     GATA
CAGATA 33.33333 -11.66407 2.226085     GATA
TATCTC 33.33333 -11.61006 2.219941     GATA
GAGATA 33.33333 -11.48220 2.091920     GATA
GATAAC 33.33333 -12.32396 2.079452     GATA
TAGATA 16.66667 -12.66735 2.056033     GATA
ATAAGC 33.33333 -12.44507 2.038398     none
TATCTG 33.33333 -11.64060 2.031143     GATA
TATCTA 16.66667 -12.78907 2.028302     GATA
TATCAG 33.33333 -11.87700 2.008089     GATA
GCTTAT 33.33333 -12.41862 1.968923     none
TATCTT 16.66667 -11.95283 1.955444     GATA
TCTTAT 16.66667 -12.07541 1.934653     none
CCTTAT 33.33333 -11.79616 1.929732     none

# => a lot of GATA-related k-mers

# clean up
rm(tmp,dat)

This exploration would likely involve testing other k-mer lengths, aligning match sequences, extending match sequences outside the k-mer hits, and more. Those interested can look up the processes in the tools mentioned earlier, such as HOMER.

# clean up
rm(GATA2_gr, MYC_gr, 
   GATA2_peak_seq, MYC_peak_seq,
   GATA2_6mer_freq,
   MYC_6mer_freq)

Summary

In this section, we have seen:

How to represent, visualise & manipulate motifs programmatically.
How to grab known motifs from databases.
How to derive simple motifs by identifying enriched k-mers in ChIP-seq peak sequences.

However, we have not yet seen how to score the matching of a sequence to a given motif, nor used this to find actual motif occurrences in the genome.

Exercise 1: when to use motif representations

Imagine you are working in a lab that deals specifically with proteins that recognise specific patterns in mRNA molecules (e.g. polyadenylation signals). The protein you are working on only recognises a specific nucleotide sequence; any variations mean there is no binding. Is it useful to build a motif representation, such as a PWM, in this case?

Solution

Not really. It would enable you to use the same tools as for other motifs (visual representation, sequence scoring, occurrence identification), however as the protein brooks no variation, there is no nuance that needs encoding programmatically. Simply stating the specific sequence that is bound, and locating exact sequence matches thereof, will be sufficient.

Exercise 2: create a PPM

Using the simulation function below, create a set of sequences, derive then illustrate a Positional Probability Matrix for them:

sim_motif_seq <- function( ... ) { # takes any number of arguments, all ignored
  res <- sample(DNA_BASES, size = 8, replace = TRUE)
  res[c(3,4,6)] <- c("C", "T", "G")
  DNAString( paste0(res, collapse="") ) # returns 1 DNAString
}

Solution

print(sim_motif_seqs <- DNAStringSet( lapply( 1:20, sim_motif_seq ) ))

DNAStringSet object of length 20:
     width seq
 [1]     8 AACTCGCG
 [2]     8 TCCTTGGT
 [3]     8 TGCTGGCA
 [4]     8 CACTCGCA
 [5]     8 AGCTGGGC
 ...   ... ...
[16]     8 TCCTGGTT
[17]     8 CGCTAGCA
[18]     8 CGCTAGTA
[19]     8 TTCTAGTA
[20]     8 TGCTAGAT

sim_motif <- create_motif(sim_motif_seqs, type="PCM", name = "Sim'd motif")
view_motifs(sim_motif, use.type = "PPM")

Sequence scoring and matching

A sequences’ score for a given motif represents how well the sequence matches the motif, accounting for motif length, position conservation, and the expected background. We will see how to calculate this score for a given sequence/motif pair, how to determine how good a score actually is, and then use this to find motif occurrences.

Scoring a single sequence

First, how can we calculate the score of a given sequence for a specific motif? The score of a sequence is simply calculated by summing the score of the matching letter for each position of the PWM. Let’s look at these positional scores for our hGATA2 motif:

hGATA2_motif <- convert_type(hGATA2_motif, type="PWM", pseudocount = 1)
hGATA2_motif['motif']

           S         G         A         T         A          N
A -4.3219281 -4.321928  1.944858 -4.321928  1.944858 -1.1520031
C  0.8875253 -4.321928 -4.321928 -4.321928 -4.321928  0.3219281
G  1.0356239  1.944858 -4.321928 -4.321928 -4.321928  0.7224660
T -4.3219281 -4.321928 -4.321928  1.944858 -4.321928 -0.6214884

We can now score a randomly-chosen sequence for this motif:

test_seq <- "CAAGAT"
score_match(hGATA2_motif, test_seq)

[1] -4.488

# same, manually:
0.8875253 + (-4.321928) + 1.944858 + (-4.321928) + 1.944858 + (-0.6214884)

[1] -4.488103

=> not a particularly good match!

We can also choose a known hGATA2 binding site sequence to see an example of a good score:

hGATA2_motif_seqs[1] # should be a good hit, by definition!

[1] "CGATAG"

score_match(hGATA2_motif, hGATA2_motif_seqs[1])

[1] 9.385

# manually
0.8875253 + 1.944858 + 1.944858 + 1.944858 + 1.944858 + 0.7224660

[1] 9.389423

# manually, with rounding
0.887     + 1.944    + 1.944    + 1.944    + 1.944    + 0.722

[1] 9.385

=> A good match, by design.
=> Notice that score_match works on truncated decimal scores. This is to improve calculation speed.

Score calculation pitfalls

Background frequencies: The definition of background frequencies can have an impact on motif scores and thus their detection in given sequences. However, this impact will be limited unless the sequences where you are searching for motif matches have nucleotide frequencies far from uniformity. In practice, using uniform background frequencies in all searches makes the search results more comparable. More advanced motif methods might determine background frequencies using their own approaches, e.g. monaLisa.
Affinity vs score: In many cases, PWM-derived match scores are a good proxy for the TF’s binding affinity for each site, but there are also limitations to the PWM model:
- They use single letter frequencies, i.e., they do not capture dependencies between different sites.
- Independent of affinity, TF concentration is a major determinant for binding site occupancy. While at low TF concentrations the PWM score for a given site reflects the probability of it being bound, at very high TF concentration non-specific binding will dominate.
- The binding preference is also determined by other important factors, most notably chromatin state or co-factors.
Zero is too low: With all this, can we determine what range of scores actually marks a motif occurrence, and which are too low to be relevant? Typically, potential target sites of TFs (e.g., genomic locations potentially bound by the TF) will show scores greater than 0, and non-specific sites will show scores below 0. However, a score threshold of 0 will in almost all cases (except very polarized motifs) pick far too many non-relevant sites, e.g. when applied to a set of peak or promoter sequences.
Score thresholds: Ideally, each PWM would be accompanied by a score threshold which is derived experimentally, e.g. using ChIP-Seq data, by contrasting true binding sites with non-functional sites. This approach was used in the HOMER database, but doesn’t guarantee an optimal score threshold when applied to other sets of sequences. How can we derive score thresholds even in the absence of experimental data?

Thresholding scores to detect motif occurences

Retrieve 200 example motifs from JASPAR2018

We will illustrate approaches for deriving score thresholds (as well as other operations in sections below) using a set of vertebrate PWMs from the “core” of the Jaspar 2018 database. We will use the getMatrixSet() function from TFBSTools to query the database this time. Note: as the number of mouse-specific motifs is quite limited, we are selecting all vertebrate motifs. This is often done in practice when working on mouse.

# query options
opts <- list(
  tax_group    = "vertebrates",
  collection   = "CORE",
  matrixtype   = "PWM",
  all_versions = FALSE)

# run query
jaspar_pwms <- getMatrixSet(JASPAR2018, opts) # may take a few seconds
print(length(jaspar_pwms))

[1] 579

# clean up
rm(opts)

=> As of 22/05/2023, 579 motifs found!

# show distribution of motif lengths
tmp <- unlist(lapply(jaspar_pwms, length))
hist(tmp, breaks=64)

=> let’s keep 200 random motifs of length 8+, also ensuring we keep the known motif for hGATA2.

# set random number generator seed so we always get the same "random" selection
set.seed(42)

# select based on length, then random sub-selection thereof
ok_length <- (tmp>=8)
ok_ids <- c(
  which(names(tmp) == "MA0036.3"), # hGATA2
  sample( (1:length(tmp))[ok_length], size=199, replace = F )
)

# apply filtering
jaspar_pwms <- jaspar_pwms[ok_ids]
print(length(jaspar_pwms))

[1] 200

# clean up
rm(tmp, ok_length, ok_ids)

We can also prepare some things for convenience later.

# convert to unviersalmotif representation, maintaining names
jaspar_upwms <- setNames(
  convert_motifs(jaspar_pwms),
  names(jaspar_pwms))

Score thresholding approaches

We will now consider several alternatives to deriving score thresholds. These approaches have been implemented in different Bioconductor packages:
- The simplest approach is to use a constant score threshold across all motifs. This ensures comparability of motif matching results across motifs. There is no threshold to calculate, so this is the fastest approach. In practice, common arbitrary thresholds are 6, 7 and 8.
- Basing a threshold on a percentage of the maximal possible score, e.g. 80% of the highest possible PWM score. This is namely the approach implemented (with default options) in function matchPWM from Biostrings. It is very fast to calculate, and simple to intuit.
- Calculate scores corresponding to a certain significance level (p-value), and use that as a threshold. The p-value of a score corresponds to the probability of observing a sequence with that score or larger, by chance given the background nucleotide frequencies. Common significance thresholds used in practice are 1e-5 and 5e-5. But how is significance determined?
  - Here, consider a motif \(M\) of length \(m\), and a score threshold to be determined \(T\). Let’s write as \(S(T)\) the set (of size \(n(T)\)) that contains all possible sequences of length \(m\) that would have a score for \(M\) that is \(\ge T\).
  - This approach seeks to choose a \(T\) such that the probability that the background distribution would generate \(S(T)\) is less likely than the requested p-value.
  - This would require enumerating all sequences in \(S(T)\), but (assuming a uniform background), the probability of this set being generated under the background is simply \(P_{bg}(T) = n(T) \times 0.25^{m}\).
  - With this, we can try different values of \(T\) to get the corresponding \(n(T)\) values, until the calculated probability \(P_{bg}(T)\) is within significance.
  - With a good implementation, this approach is relatively fast to calculate.

Comparing score thresholding approaches

We will compare the latter two approaches to see how they relate to each other.
- We’ll also consider the widths of the motifs (in nucleotides).
- We’ll also calculate the total Information Content for each PWM, as this indicates how fast \(n(T)\) grows with lowering of \(T\). Indeed, for motifs with high IC, \(n(T)\) grows more slowly as a function of the relative decrease in \(T\).

## preparations

# For the % max score approach:
# for each motif, return quantiles of all possible scores
# 0 => minimum, 1 => maximum, 0.8 => 80% maximum
quantilesPWMscore <- sapply(
  jaspar_upwms, motif_score, threshold=c(0, 0.8, 1))

# For the p-value score threshold approach:
# for each motif, return the score corresponding
# to a given value
pvaluePWMscore <-
  motif_pvalue(jaspar_upwms, pvalue = 1e-5, k=10) # may take a couple of secs

# For the total IC:
# for each motif, retrieve total information content
# as already available in a slot within the universalmotif PWM objects
PWM_total_IC <- sapply(jaspar_upwms, function(upwm) {upwm@icscore})

# For the motif width
PWM_widths <- sapply(jaspar_upwms, ncol)

# Assemble!
pwm_score_info <- data.frame(
  maxPWMscore       = quantilesPWMscore["100%",], # keep the max
  pvaluePWMscore    = pvaluePWMscore,
  PWM_total_IC      = PWM_total_IC,
  PWM_widths        = PWM_widths
) 

# clean up
rm(quantilesPWMscore, pvaluePWMscore, PWM_total_IC, PWM_widths)

## plot
ggplot(data    = pwm_score_info,
       mapping = aes(x=maxPWMscore, y=pvaluePWMscore)) +
  xlim(c(0, max(pwm_score_info$maxPWMscore))) +
  ylim(c(0, max(pwm_score_info$pvaluePWMscore))) +
  geom_point(aes(color=PWM_total_IC, size=PWM_widths)) +
  scale_size_continuous(range=c(0.25, 3)) +
  geom_abline(intercept=0, slope=1) + # solid diagonal
  geom_abline(intercept=0, slope=0.8, linetype="dashed") + # dashed diagonal 1
  geom_abline(intercept=0, slope=0.5, linetype="dashed") + # dashed diagonal 2
  labs(x="Maximal PWM Score", y="PWM Score Threshold for P-Value=1e-5") +
  scale_color_gradientn(colours = rainbow(5)) +
  theme_bw()  + 
  annotate(geom="text", x=15.00, y=14.5, label="y=100% x",color="black", hjust=0)+ 
  annotate(geom="text", x=18.75, y=14.0, label="y=80% x", color="black", hjust=0)+ 
  annotate(geom="text", x=30.00, y=14.5, label="y=50% x", color="black", hjust=0)

=> We can observe the following:

In the top-left, we can see that short motifs, and/or or motifs with low total information content, require a PWM score threshold very close to their maximal PWM score (ie a very stringent threshold), for a match to reach the targeted significance level of 1e-5. As such, only sites close to perfect matches will be found for such motifs.
In the middle, we see that motifs with average total information content will require a threshold 50-80% that of their maximal score to reach a significance of 1e-5.
Finally, on the right we see that motifs with high total IC will only require a very lenient threshold (<50% maximum) to reach a significance of 1e-5. As such there will likely be many sequences that match, and there will be a great amount of variability in these sequences.
We can conclude, if the objective is indeed to select matches that attain a certain significance, that using a threshold based on a percentage of the maximal score, though much faster to calculate, will be too permissive for short/low IC motifs, and much too stringent for longer/high IC motifs.

# clean up
rm(jaspar_upwms, pwm_score_info)

Finding motif occurences in sequences

Now that we can score sequence matches to motifs, and pick apart which scores mark likely binding sites, we can try finding all occurrences (or hits) of given motifs in a set of sequences, e.g. in ChIP-seq peaks, promoter sequences, or the whole genome.

Obviously, the sequences we will be scanning are much longer than the motifs themselves. We will simply scan along these sequences, scoring each individual sub-sequence of same length as the motif.

Preparing the sequences

Dataset: ATAC-seq-based accessible chromatin regions for Mouse immune cells.
Künzli et al. 2020,
“Long-lived T follicular helper cells retain plasticity and help sustain humoral immunity”
PAPER
The table merged_peaks_TFH_TH1.tsv.gz contains the results of a differential chromatin accessibility analysis contrasting two cell populations. We can load and parse the peak genomic regions, ready for scanning.
Differential accessibility details:
- The differential accessibility test was done for 150bp genomic tiles contrasting (Tfh - TH1).
- Tiles were then merged into larger regions. Hence:
  - “up” contained only tiles more accessible in Tfh than TH1
  - “down” contained only tiles more accessible in TH1 than Tfh
  - “mixed” contained tiles with a mix of both directions
  - the significance of the differential accessibility (after correcting for multiple testing) is provided in the “FDR” field
  - anything not significantly differentially accessible is still an ATAC-seq peak, and thus accessible, and can thus be considered as “common” to both cell lines.

# regions of interest: get peak region from last week's lecture
peaks <- read.table(gzfile("data/Motifs/merged_peaks_TFH_TH1.tsv.gz"), 
                    header=TRUE, stringsAsFactors=FALSE, sep="\t")
peaks_gr <- GRanges(peaks, seqinfo=seqinfo(BSgenome.Mmusculus.UCSC.mm10))
names(peaks_gr) <- as.character(peaks_gr)


# exclude "mixed" peaks for simplicity
peaks_gr <- peaks_gr[ peaks_gr$direction != "mixed" ]


# exclude peaks on sex chromosomes
peaks_gr <- peaks_gr[ ! seqnames(peaks_gr) %in% c("chrX", "chrY", "chrM" )]


# get peak sequences, we'll need this later
peak_seqs <- getSeq(BSgenome.Mmusculus.UCSC.mm10, peaks_gr) # <2s


# exclude peaks with any N's, to avoid warnings further down
hasNs    <- (alphabetFrequency(peak_seqs)[,"N"]>0)
addmargins(table(hasNs)) # => only drops a few sequences

hasNs
FALSE  TRUE   Sum 
49392     7 49399

peak_seqs <- peak_seqs[!hasNs]
peaks_gr  <- peaks_gr[!hasNs]


# clean up
rm(peaks ,hasNs)

Filtering the peaks

As a sanity check, let’s have a look at the lengths of the ATAC-seq peaks we’re dealing with:

hist(width(peak_seqs), breaks=64, sub=paste(range(width(peak_seqs)), collapse="-"))

=> Most under 3kb, but some up to 14kb! The smallest is 150bp.

We should restrict our sequence lengths to something less extreme. Many motif-related analyses, especially when considering a background, will assume that the sequences they’re looking at are of comparable lengths. Here, let’s restrict to 3kb.

# exclude peaks longer than 4kb
tooLong <- width(peak_seqs) > 3000
addmargins(table(tooLong)) # => only drops a few hundred of ~50k sequences

tooLong
FALSE  TRUE   Sum 
48871   521 49392

peak_seqs <- peak_seqs[!tooLong]
peaks_gr  <- peaks_gr[!tooLong]

# clean up
rm(tooLong)

Categorising the peaks

Now let’s classify peaks into “Tfh>TH1”, “TH1>Tfh”, and “common” (i.e. open but not more open in either T cell sub-population):

# rename so it's clear to interpret
peaks_gr$state <- c(
  "up" = "Tfh>TH1", "down" = "TH1>Tfh") [ peaks_gr$direction ]

# take significance into account
peaks_gr$state[peaks_gr$FDR >= 0.1] <- "common"

# report
addmargins(table(peaks_gr$state))


 common Tfh>TH1 TH1>Tfh     Sum 
  35556    6255    7060   48871

Stratifying the peaks

This is still too many sequences to run the commands below in a reasonable time. We also want to ensure that we have roughly the same number of peaks in each of the analyses, ensuring comparable statistical power for both, and more peaks in the background.
To achieve this, we are going to randomly sub-select sequences in a way that considers the various “types” of ATAC-seq peaks (ie the strata).

# setting the seed to always get the same "random" selection
set.seed(42)

# function to sub-sample (names of) a stratum
rnd_subsample_stratum <- function(size, stratum) {
  stratum_peaks_gr <- peaks_gr[ peaks_gr$state == stratum ]
  stopifnot( length(stratum_peaks_gr)>=size )
  return(sample(names(stratum_peaks_gr), size=size, replace=F))
}

# get sub-sampled names for each stratum
sel_peak_names <- c(
  rnd_subsample_stratum(25000, "common"),
  rnd_subsample_stratum(2500, "TH1>Tfh"),
  rnd_subsample_stratum(2500, "Tfh>TH1")
)

# apply
peaks_gr_sel <- peaks_gr[sel_peak_names]
print(addmargins(table(peaks_gr_sel$state)))


 common Tfh>TH1 TH1>Tfh     Sum 
  25000    2500    2500   30000

# enforce same selection on sequences
peak_seqs_sel <- peak_seqs[sel_peak_names]

# clean up
rm(rnd_subsample_stratum, sel_peak_names)

=> This will be faster to run!

Efficient scanning in R using `motifmatchr`

Function matchMotifs in motifmatchr package implements very efficient PWM scanning across a set of genomic regions. It uses as input motif classes as defined in the TFBSTools package. It also implements a selection of target sites using the p-value approach seen above.
Once prepared, we can scan the ATAC-seq peak sequences for occurrences of the motif set we derived from JASPAR earlier.
If necessary, you can load the pre-prepared results of this (“jaspar2018_matches_rse.rds”) from the course page.

# run: 30k sequences, 200 motifs
system.time(
  matches_rse <- matchMotifs(
    jaspar_pwms, peaks_gr_sel, genome = BSgenome.Mmusculus.UCSC.mm10, 
    p.cutoff = 5e-05, # using a stringent p-value
    out="scores"
))  # <10s on my laptop

   user  system elapsed 
  8.451   0.285   8.752

# overview
matches_rse

class: RangedSummarizedExperiment 
dim: 30000 200 
metadata(0):
assays(3): motifScores motifMatches motifCounts
rownames(30000): chr9:41181301-41181750 chr12:116048701-116049150 ...
  chr1:94006201-94006350 chr12:3279151-3279450
rowData names(12): num.tests num.up.logFC ... right state
colnames(200): MA0036.3 MA1421.1 ... MA1100.1 MA0835.1
colData names(1): name

matches_rse is a RangedSummarizedExperiment object and contains information on matches in its assays slot. We can have a quick look at this object:

# check out the object
dim(matches_rse) # 30k peaks in rows, 200 motifs in columns

[1] 30000   200

assayNames(matches_rse) # scores, binary presence/abscence, counts

[1] "motifScores"  "motifMatches" "motifCounts"

head(as.data.frame(rowData(matches_rse))) # the original info from the peaks GR

                         num.tests num.up.logFC num.down.logFC     PValue       FDR direction
chr2:127792201-127792800         4            0              0 0.46094644 0.6977457        up
chr8:83694001-83694750           4            0              0 0.05340523 0.1810975      down
chr13:12275251-12275400          1            0              0 0.13729721 0.3444052      down
chr7:3290101-3290550             3            0              0 0.70945801 0.8654646        up
chr2:164798701-164798850         1            0              0 0.82985083 0.9292285        up
chr7:114157951-114158250         2            0              0 0.67818764 0.8482615      down
                         rep.test   rep.logFC    overlap                      left
chr2:127792201-127792800   135006  0.62661028  Mtln:-:PE   Nphp1:-:3304,Mtln:-:369
chr8:83694001-83694750     215918 -1.09188404  Pkn1:-:PE               Pkn1:-:1065
chr13:12275251-12275400     60047 -0.80291154 Actn2:-:PI                Actn2:-:27
chr7:3290101-3290550       195022  0.41156221 Myadm:+:PE               Myadm:+:776
chr2:164798701-164798850   138518  0.09148081  Acot8:-:I Snx21:+:4885,Acot8:-:2212
chr7:114157951-114158250   206423 -0.62894701                                     
                                             right  state
chr2:127792201-127792800               Bub1:-:8322 common
chr8:83694001-83694750                 Pkn1:-:3258 common
chr13:12275251-12275400                Actn2:-:963 common
chr7:3290101-3290550                    Myadm:+:25 common
chr2:164798701-164798850 Acot8:-:708,Zswim3:+:6248 common
chr7:114157951-114158250                           common

head(as.data.frame(colData(matches_rse))) # basic info about motifs

            name
MA0036.3   GATA2
MA1421.1  TCF7L1
MA0770.1    HSF2
MA0636.1 BHLHE41
MA0488.1     JUN
MA0145.3   TFCP2

# distribution of max nb occurrences per peak, across all motifs
barplot(
  sort(apply(assay(matches_rse, "motifCounts"), 2, max)), 
  las=3, cex.names=0.5, log="y")
abline(h=20, col="blue")

=> some motifs occur over 20x in a given peak! Not surprising given the lengths of some of these peaks.

We can check to see if any motifs do NOT have any occurrences, or if any peaks have no motif occurrences:

binary <- assay(matches_rse, "motifMatches")

any(colSums(binary) == 0 | colSums(binary) == nrow(binary))
## -> FALSE, so all motifs have at least 1 target and none matches all peaks

# clean up
rm(binary)

However, this matches_rse object does not provide any localisation of the matches found. We can get this using a different argument to matchMotifs:

# run
system.time(
  matches_grl <- matchMotifs(
    jaspar_pwms, peaks_gr_sel, genome = BSgenome.Mmusculus.UCSC.mm10, 
    p.cutoff = 5e-05, out="positions")
) # <10s on my laptop

   user  system elapsed 
  8.707   0.371   9.085

# quick look
matches_grl

GRangesList object of length 200:
$MA0036.3
GRanges object with 1802 ranges and 1 metadata column:
         seqnames              ranges strand |     score
            <Rle>           <IRanges>  <Rle> | <numeric>
     [1]     chr5   24800629-24800639      + |   13.4745
     [2]    chr18   56733837-56733847      - |   13.5414
     [3]    chr10   59970193-59970203      - |   12.4041
     [4]     chr6 142809492-142809502      - |   13.2129
     [5]     chr8   72226782-72226792      - |   11.5336
     ...      ...                 ...    ... .       ...
  [1798]     chr2   32039075-32039085      - |   13.2062
  [1799]    chr10   68673910-68673920      + |   12.8671
  [1800]     chr1   82973913-82973923      + |   13.8709
  [1801]    chr13   44799593-44799603      - |   12.1015
  [1802]    chr12   15998756-15998766      - |   13.6934
  -------
  seqinfo: 239 sequences from an unspecified genome; no seqlengths

...
<199 more elements>

# how many hits?
addmargins(cbind(
  "nb" = sort(unlist( lapply(matches_grl, length) ))), 
  margin = 1)

             nb
MA0777.1    376
MA0836.1    458
MA0635.1    473
MA0838.1    519
MA0875.1    614
MA0887.1    617
MA0024.3    628
MA0900.1    669
MA0876.1    671
MA0666.1    671
MA0679.1    682
MA0709.1    686
MA0879.1    708
MA0818.1    740
MA0886.1    746
MA0873.1    746
MA0894.1    757
MA0612.1    769
MA0843.1    807
MA0748.1    821
MA0027.2    822
MA0906.1    835
MA0644.1    855
MA0817.1    858
MA0639.1    884
MA0726.1    886
MA0014.3    905
MA0019.1    920
MA0145.3    934
MA0125.1    940
MA0132.2    948
MA0075.2    957
MA0069.1    985
MA0630.1    987
MA0720.1    987
MA0895.1    987
MA0654.1    987
MA0722.1    997
MA0907.1   1026
MA0611.1   1041
MA0083.3   1065
MA0905.1   1099
MA0706.1   1108
MA0752.1   1109
MA0723.1   1128
MA0737.1   1137
MA0602.1   1157
MA1143.1   1165
MA0610.1   1208
MA0622.1   1212
MA0669.1   1215
MA0108.2   1225
MA0608.1   1272
MA1117.1   1309
MA0702.1   1324
MA0682.1   1330
MA0594.1   1336
MA0485.1   1337
MA0861.1   1339
MA1151.1   1348
MA0636.1   1354
MA0863.1   1374
MA0831.2   1375
MA0804.1   1378
MA0713.1   1441
MA0865.1   1442
MA0122.2   1467
MA0090.2   1472
MA1150.1   1481
MA0106.3   1489
MA0766.1   1507
MA0739.1   1538
MA0828.1   1548
MA0124.2   1567
MA0663.1   1577
MA0037.3   1580
MA0633.1   1607
MA0672.1   1628
MA0140.2   1679
MA1108.1   1719
MA0113.3   1720
MA0100.3   1729
MA0092.1   1744
MA0829.1   1760
MA1153.1   1780
MA1105.1   1790
MA0842.1   1793
MA0738.1   1799
MA0036.3   1802
MA0511.2   1834
MA0835.1   1868
MA1154.1   1872
MA0878.1   1894
MA0114.3   1898
MA1136.1   1901
MA0071.1   1940
MA0715.1   1967
MA0840.1   1989
MA0800.1   1993
MA0910.1   2005
MA0592.2   2007
MA0764.1   2023
MA0761.1   2031
MA1122.1   2066
MA0482.1   2081
MA0692.1   2092
MA0801.1   2095
MA0162.3   2098
MA0899.1   2098
MA0660.1   2099
MA0869.1   2121
MA0909.1   2127
MA0101.1   2139
MA0867.1   2144
MA0757.1   2152
MA0135.1   2163
MA0913.1   2172
MA0614.1   2187
MA0859.1   2188
MA0770.1   2212
MA0119.1   2221
MA0038.1   2224
MA1139.1   2245
MA0115.1   2255
MA0689.1   2264
MA0143.3   2277
MA0751.1   2295
MA1131.1   2317
MA0060.3   2351
MA0488.1   2425
MA0154.3   2442
MA0035.3   2443
MA0787.1   2448
MA0657.1   2462
MA1148.1   2472
MA0088.2   2498
MA0625.1   2500
MA0677.1   2518
MA1113.1   2522
MA0093.2   2525
MA0773.1   2530
MA0788.1   2535
MA0784.1   2576
MA1123.1   2597
MA0029.1   2624
MA0786.1   2662
MA0503.1   2675
MA1141.1   2682
MA0481.2   2688
MA0258.2   2699
MA0872.1   2718
MA1109.1   2811
MA0155.1   2815
MA0591.1   2822
MA0799.1   2831
MA0796.1   2843
MA0814.1   2879
MA0600.2   2885
MA0095.2   2912
MA0524.2   2929
MA0812.1   2933
MA0652.1   2973
MA0509.1   3106
MA0047.2   3109
MA0495.2   3128
MA0496.2   3129
MA1144.1   3223
MA0665.1   3232
MA0520.1   3251
MA0598.2   3282
MA0845.1   3345
MA0103.3   3517
MA0490.1   3589
MA0484.1   3612
MA1114.1   3644
MA0515.1   3687
MA0593.1   3844
MA0477.1   3874
MA1421.1   4011
MA0505.1   4114
MA0762.1   4174
MA0076.2   4255
MA0653.1   4397
MA0065.2   4414
MA0146.2   4944
MA0742.1   5029
MA0687.1   5159
MA0499.1   5186
MA0816.1   5327
MA0846.1   5447
MA1100.1   5817
MA0002.2   6260
MA1102.1   7113
MA1107.1   9302
MA0139.1   9868
MA0753.1  10079
MA0746.1  11919
MA0741.1  14604
MA0149.1  17975
MA0528.1  53147
Sum      531354

=> ~0.5M significant motif hits in total, from 0.3k to 54k per motif, for 200 motifs across 30k sequences of 0.2-3kb.

matches_grl is a list of GRanges objects, one for each motif. Applying unlist to it will return a single GRanges object, with a name field identifying the original list element, i.e. here the motif name, ready for exporting to a BED file.

export.bed(
  unlist(matches_grl), 
  file.path(".","data", "Motifs", "jaspar2018_matches.bed.gz"))
# note: export.bed can automatically gzip the file

=> This BED file can now be loaded into a genome browser.

# clean up
rm(matches_grl)

Summary

In this section, we saw how to:

score a single sequence for a given motif;
different approaches for selecting a score threshold for determining hits;
how to scan for motif hits using package motifmatchr.

We are now ready to perform analyses linking motif hits to genomic features of interest.

Exercise 3: scoring a sequence

Using the MYC PWM prepared below, score the test sequence provided below, either manually or programmatically.
Do you think it is a good hit, or not?

# grab & show MYC PWM, assuming uniform BG & 100 nsites
res <- query(MotifDb, andStrings=c("MYC", "JASPAR2018", "Hsapiens"))
myc_motif <- convert_type(convert_motifs(res[[2]]), "PWM")

Note: motif [motif] has an empty nsites slot, using 100.

# test sequence
test_seq <- "GGCAATGTGCAC"

Solution

# view motif
view_motifs(myc_motif, use.type = "PWM") # graphically

print(myc_motif) # weights of PWM


       Motif name:   motif
         Alphabet:   DNA
             Type:   PWM
          Strands:   +-
         Total IC:   11.27
      Pseudocount:   0
        Consensus:   NNCCACGTGCNN

      N     N     C     C     A     C     G     T     G     C     N     N
A -0.25 -0.28 -1.50 -4.81  1.90 -4.52 -2.87 -4.39 -5.75 -1.56 -0.36 -0.71
C  0.12  0.06  1.60  1.97 -4.39  1.87 -2.98 -2.93 -4.46  1.18  0.30  0.53
G  0.42  0.51 -1.39 -5.32 -2.86 -4.14  1.89 -4.29  1.96 -0.04 -0.25  0.13
T -0.44 -0.49 -2.11 -5.52 -3.45 -2.08 -5.02  1.91 -4.75 -1.24  0.20 -0.23

# auto scoring
score_match(myc_motif, test_seq)

[1] 4.647

Solution

# percentile score threshold:
motif_score(myc_motif, threshold = 0.8)

   80% 
5.7812

# p-value score threshold:
motif_pvalue(myc_motif, pvalue = 1e-5, k=10)

[1] 13.954

Exercise 4: playing around with a list of PWMs

We want to grab the complete set of “CORE” vertebrate sequence motifs from Jaspar 2018. In this session, we saw 2 different functions for querying motif databases. One of these is better suited to this objective. Consult their help files and test them out. Once you have a list of motifs you are happy with, identify amongst them:
- the shortest and the longest motifs (function ncol() on a PWM can help)
- the motifs with the lowest and highest possible log-odds scores (function motif_score() can help)
- the motifs with the lowest and the highest information contents (for universalmotif PWMs, the IC score is containted in the slot icscore)

Solution part I

# the getMatrixSet() function allows us to grab all CORE vertebrate PWMs easily
exo_jaspar_pwms <- getMatrixSet(
  JASPAR2018, 
  list(
    tax_group    = "vertebrates",
    collection   = "CORE",
    matrixtype   = "PWM",
    all_versions = FALSE)) # a few seconds
# => 579 motifs

# ID mapper for convenience
exo_motif2tf <- sapply(exo_jaspar_pwms, function(pwm) {pwm@name})

Solution part I (alt)

# the query() function does not handle taxonomy very well
exo_jaspar_pwms <- query(
  MotifDb, 
  andStrings=c("JASPAR_CORE"), 
  orStrings = c("Hsapiens","Mmusculus","Rnorvegicus","Vertebrata")) # missing a lot
exo_jaspar_pwms <- convert_motifs(exo_jaspar_pwms) # conversion necessary
# => 125 motifs

# ID mapper for convenience
exo_motif2tf <- sapply(exo_jaspar_pwms, function(pwm) {pwm@name})

Solution part II

# Lengths
PWMlength <- sapply(jaspar_pwms, ncol)
range(PWMlength)

[1]  8 21

exo_motif2tf[which.min(PWMlength)] # shortest motif

[1] "FOXD1"

exo_motif2tf[which.max(PWMlength)] # longest motif

[1] "HNF1B"

Solution part III

# Max log-odds scores
maxPWMscore <- sapply(exo_jaspar_pwms, motif_score)["100%",]
range(maxPWMscore)

[1]  7.105 35.133

exo_motif2tf[which.min(maxPWMscore)] # motif w/ smallest max score

[1] "GATA2"

exo_motif2tf[which.max(maxPWMscore)] # motif w/ highest max score

[1] "EWSR1-FLI1"

Solution part IV

# ICs
jaspar_upwms <- convert_motifs(exo_jaspar_pwms) # convert to get easy access to IC
ICscore <- sapply(jaspar_upwms, function(pwm) {pwm@icscore})
range(ICscore)

[1]  4.900548 31.838563

exo_motif2tf[which.min(ICscore)] # motif w/ smallest IC

[1] "En1"

exo_motif2tf[which.max(ICscore)] # motif w/ highest IC

[1] "EWSR1-FLI1"

Enrichment analysis for motifs

Now that we have our motifs (drawn from databases or experimental data), and have located occurrences of them, we can try and determine if the presence of these occurrences associates with some relevant biology. Using TF binding sites for example:
- does a particular TF bind more often near genes annotated with a certain function? This could suggest involvement of the TF in that function.
- does a particular TF usually bind more often near genes whose expression is dysregulated upon knock-down of the TF itself? This could indicate which genes are “targeted” by the TF.
- does a particular TF bind more often in open chromatin regions? This could reveal TFs that are chromatin remodellers.
- etc.
In our ATAC-seq dataset, we obtained differentially open chromatin regions between the Tfh and TH1 T cell populations. Above, we found motif occurrences for many TFs that were located in these regions. Now, we would like to infer which of these motifs may be regulators of the chromatin state. To this end, we will investigate the relation between the occurrence of (potential) transcription factor binding sites and differential chromatin accessibility.
Popular methods for investigating these relations are:
- Over-Representation Analysis: a contingency matrix counts occurrences of an annotation (e.g. a TF motif occurrence) amongst two categories of objects (e.g. open chromatin regions vs closed chromatin regions), and a test evaluates if there is a statistical association between annotations and categories.
- (Gene) Set Enrichment Analysis: objects (e.g. chromatin regions) are ranked using a continuous score (e.g. accessibility), and a statistical test evaluates whether annotated regions (e.g. those containing a TF motif occurrence) have, on average, higher (or lower) scores.
- PWM Regression: objects (e.g. chromatin regions) have a continuous outcome measured (e.g. accessibility), and this is modelled as a linear function of the number of annotations (e.g. motif occurrences, for all motifs).
R package monaLisa contains several advanced implementations of these. But here, we will look into the simplest approach, so we can do it manually: ORA for TF binding sites.

Over-representation analysis (ORA) for PWMs

What is ORA?

Over-representation analysis (ORA) generally looks to see whether a level of a certain categorical variable is more frequently observed than expected by chance (i.e. over-represented, or enriched) within one subpopulation, compared to a background population (typically another subpopulation, or the entire population). Such tests can be applied in a wide variety of biological problems. For example, this could be over-representation of a given gene annotation amongst a list of differentially-expressed genes, compared to a background of all tested genes. In our setting, we’d look for TF binding sites being enriched in one set of regions (“TH1>Tf1” or the other), compared to a background (e.g. the rest of the genome, or the “common” peaks).
There are several types of basic statistical test that can be used to investigate over-representation: Fisher’s exact test, the Chi-square test, the hypergeometric test, the proportions test… These all differ in their modelling choices and assumptions, including the choice of “background”. Some are more commonly used in certain types of analyses than others (gene enrichments, TFBS enrichments…).
Here, we will consider Fisher’s exact test for examining enrichment of motif occurrences in our open chromatin regions.

ORA for our ATAC-seq

There is a pitfall when naively comparing different sets of genomic regions, such as promoter vs intergenic regions. Differences in global sequence composition (e.g. GC and CpG content) or level of conservation can easily bias the analysis. Unless these sources of biases are controlled for, the results can be meaningless.
For our analysis, such a pitfall would be to contrast ATAC-seq peaks with regions never accessible (genomic background) since these are expected to have different sequence composition (e.g. different GC content).
Instead, we will focus only on peak regions, and differentiate:
- regions which are specifically open in only one of the two cell populations (specific).
- regions which are accessible in both (common). These regions will serve here as a background. We assume that they carry similar regulatory functions as the first set, and thus share the same sequence composition and biases. We can interrogate this assumption later if need be.
Given these two non-overlapping sets of open regions, we subdivide each into those overlapping a target site for each TF of interest:
- regions with a TF target site
- regions without (non-target)
The outcome of these 2 categorisations is a 2-by-2 contingency table, one per TF motif and cell population, enumerating the four possible cases:

	common	specific
target	common.target	specific.target
non-target	common.non-target	specific.non-target

Fisher’s exact test is suited to looking for statistically significant association between these 2 binary categorisations of regions.
We can now address the question of an association between the occurrence of TF target sites and the T cell population-specific DNA accessibility patterns.
The test statistic which quantifies the association is the following odds-ratio (note the symmetry of the odds-ratio):

\[ {\rm OR} = \frac{ \rm (nb.specific.target) / (nb.specific.non.target) }{ \rm (nb.common.target) / (nb.common.non.target) } \]

\[ {\rm OR} = \frac{ \rm (nb.specific.target) * (nb.common.non.target) }{ \rm (nb.common.target) * (nb.specific.non.target) } \]

\[ {\rm OR} = \frac{ \rm (nb.specific.target) / (nb.common.target) }{ \rm (nb.specific.non.target) / (nb.common.non.target) } \]

There is an association if the OR > 1. An OR < 1 indicates a negative association (“anti-correlation”, “repulsion”, “depletion”). Since we generally assume that the binding of a TF to a DNA sequence opens up the chromatin and/or maintains it open, we do not focus on negative associations here.
Fisher’s exact test checks for significance of the association. For details, check the help page of ?fisher.test, and your favourite statistics resources, e.g. for Wikipedia:
- Fisher’s_exact_test, and
- Odds_ratio.

Categorising accessible chromatin peaks

To perform this test on our ATAC-seq dataset, we first need to categorize regions into three classes: open in both T cell populations, open in Tfh only, and open in TH1 only. Here, we consider peaks with a differential openness FDR below 10% to differ across populations, and all others to be open in both (“common”).

# Note: already done above, doing again just in case


# base categorisation
peaks_gr$state <- peaks_gr$direction
peaks_gr$state[peaks_gr$FDR >= 0.1] <- "common"


# setting the seed to always get the same "random" selection
set.seed(42)


# stratification
rnd_subsample_stratum <- function(size, stratum) {
  stratum_peaks_gr <- peaks_gr[ peaks_gr$state == stratum ]
  stopifnot( length(stratum_peaks_gr)>=size )
  return(sample(names(stratum_peaks_gr), size=size, replace=F))
}

sel_peak_names <- c(
  rnd_subsample_stratum(25000, "common"),
  rnd_subsample_stratum(2500, "down"),
  rnd_subsample_stratum(2500, "up")
)

peaks_gr <- peaks_gr[sel_peak_names]


# report
addmargins(table(peaks_gr$state))


common   down     up    Sum 
 25000   2500   2500  30000

ORA for an example TF (hGATA2) in TH1

As an example, let us focus on the overlap of the TH1-specific regions (state is “TH1>Tfh”) with predicted sites for one of the TFs. Using the xtabs function on the binary motif match matrix and the state established above, we can generate a contingency table, that can be subset to a 2x2 table.

# if necessary:
# matches_rse <- readRDS("data/Motifs/jaspar2018_matches_rse.rds")

motif_id <- "MA0036.3" # (human) GATA2
xt <- xtabs(
  ~ state + sites, 
  data.frame(
    state=peaks_gr_sel$state, 
    sites=assay(matches_rse, "motifMatches")[names(peaks_gr_sel),motif_id]
))
xt <- xt[c("common", "TH1>Tfh"),] # down = TH1 specific peaks 
addmargins(xt)

         sites
state     FALSE  TRUE   Sum
  common  23734  1266 25000
  TH1>Tfh  2238   262  2500
  Sum     25972  1528 27500

We can calculate the odds-ratio (OR) manually, indicating how much more likely it is to observe target sites for this TF in TH1-specific peaks compared to common peaks.

OR <- (xt[1,1]*xt[2,2])/(xt[2,1]*xt[1,2])
OR

[1] 2.194717

=> A more than a 2x fold enrichment of hGATA2 binding sites amongst “TH1>Tfh” peaks compared to “common” peaks! This is a strong enrichment, but is it significant?

We can perform Fisher’s test (which calculates the OR itself). Remember that we are only interested in positive associations (OR > 1), so we specify alternative="greater" to perform a one-sided test.

fisher.test(xt, alternative="greater")


    Fisher's Exact Test for Count Data

data:  xt
p-value < 2.2e-16
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 1.944818      Inf
sample estimates:
odds ratio 
   2.19475

=> This indicates that the strong 2-fold enrichment in hGATA2 target sites amongst TH1-specific peaks, relative to peaks open in both TH1 and Tfh, is indeed (highly) statistically significant.

# clean up
rm(motif_id, xt, OR)

Fisher-based ORA across all TFs and both cell populations

We can do the same, iterating over all TFs of interest:

ora_res <- list()
system.time( {
  for (motif_id in colnames(matches_rse)) {
    xt <- xtabs( 
      ~ state + sites, 
      data.frame(
        state=peaks_gr_sel$state, 
        sites=assay(matches_rse, "motifMatches")[
          names(peaks_gr_sel),motif_id]
    ))
    ft.TH1 <- fisher.test( xt[c("common","TH1>Tfh"),], alternative="greater") 
    ft.Tfh <- fisher.test( xt[c("common","Tfh>TH1"),], alternative="greater") 
    ora_res[[motif_id]] <- data.frame(
      row.names = NULL,
      "Motif.ID" = motif_id, "TF.name" = colData(matches_rse)[motif_id,"name"],  
      "OR.TH1"   = unname(ft.TH1$estimate), "pv.TH1" = ft.TH1$p.value,
      "OR.Tfh"   = unname(ft.Tfh$estimate), "pv.Tfh" = ft.Tfh$p.value
    )
  } 
}) # <15s on my laptop

   user  system elapsed 
  8.222   0.459   8.688

ora_res <- as.data.frame(do.call("rbind",ora_res))
rownames(ora_res) <- NULL

# adjust p-values for multiple testing
ora_res$fdr.TH1 <- p.adjust(ora_res$pv.TH1, method="BH")
ora_res$fdr.Tfh <- p.adjust(ora_res$pv.Tfh, method="BH") 

# show (some)
print(head(tibble::column_to_rownames(ora_res, "Motif.ID")[,-c(3,5)], 20))

                   TF.name    OR.TH1   OR.Tfh      fdr.TH1      fdr.Tfh
MA0036.3             GATA2 2.1947504 1.290410 1.350614e-23 2.888763e-03
MA1421.1            TCF7L1 1.7952502 1.870206 3.104983e-22 2.034747e-25
MA0770.1              HSF2 2.1250370 1.381160 2.855073e-23 1.055822e-04
MA0636.1           BHLHE41 0.9236761 1.437457 7.644386e-01 3.003552e-03
MA0488.1               JUN 1.8078733 1.621747 2.660687e-15 7.130000e-10
MA0145.3             TFCP2 1.6739737 1.261494 7.390401e-05 6.281313e-02
MA0630.1              SHOX 1.5050123 1.569682 3.561512e-04 8.494469e-05
MA0146.2               Zfx 0.8936772 1.204154 9.793243e-01 2.732522e-03
MA0612.1              EMX1 1.6585375 2.288034 1.156616e-04 2.590676e-11
MA0757.1           ONECUT3 1.6704196 1.532446 5.266953e-10 5.766395e-07
MA0092.1       Hand1::Tcf3 1.7714225 1.280782 4.385537e-12 3.756361e-03
MA0800.1             EOMES 3.5109350 1.539566 5.726196e-74 6.033619e-07
MA0644.1              ESX1 1.5792975 1.561667 1.157016e-04 1.905015e-04
MA1131.1 FOSL2::JUN(var.2) 1.5051414 1.592882 5.348344e-06 2.492116e-07
MA0846.1             FOXC2 1.7093246 1.466996 5.120520e-21 1.423056e-10
MA0088.2            ZNF143 1.6439309 1.233937 5.833836e-12 4.174523e-03
MA0737.1             GLIS3 1.4111390 1.398283 8.864801e-04 1.220692e-03
MA1123.1            TWIST1 1.7806855 1.314523 3.704567e-16 2.484158e-04
MA0682.1             Pitx1 1.6250740 1.169396 2.305851e-07 7.397834e-02
MA0720.1             Shox2 1.5050123 1.569682 3.561512e-04 8.494469e-05

# clean up 
rm(motif_id, xt, ft.TH1, ft.Tfh)

# keep for later use
saveRDS(ora_res, file='data/Motifs/ATAC_ORA.rds')

We can extract the TFs most significantly associated to regions specifically open in TH1 samples. Here, we not only filter based on the significance, but also on the effect size (odds ratio):

head(subset(
  ora_res[order(ora_res$pv.TH1),], 
  fdr.TH1 < 0.01 & OR.TH1 > 2), 10) # quite a stringent filter

    Motif.ID     TF.name   OR.TH1        pv.TH1   OR.Tfh       pv.Tfh       fdr.TH1      fdr.Tfh
45  MA0002.2       RUNX1 3.777648 1.558764e-179 1.222642 1.747404e-04 3.117528e-177 2.627676e-04
12  MA0800.1       EOMES 3.510935  5.726196e-76 1.539566 2.141935e-07  5.726196e-74 6.033619e-07
141 MA0689.1       TBX20 3.146719  4.869246e-68 1.486339 4.181152e-07  3.246164e-66 1.100303e-06
61  MA0801.1         MGA 3.122758  4.398276e-63 1.493900 6.886255e-07  2.199138e-61 1.679574e-06
46  MA1144.1 FOSL2::JUND 2.764599  1.525370e-47 2.261137 2.218020e-28  6.101479e-46 7.393398e-27
28  MA1141.1   FOS::JUND 2.602834  1.845132e-47 2.240753 5.294007e-32  6.150440e-46 2.647004e-30
103 MA0490.1        JUNB 2.550341  2.629518e-46 2.214168 8.948733e-32  7.512908e-45 3.579493e-30
69  MA0477.1       FOSL1 2.468310  1.913646e-43 2.229434 4.977903e-33  4.784116e-42 3.318602e-31
36  MA0511.2       RUNX2 2.357028  4.808296e-30 1.409425 4.553645e-05  9.616592e-29 7.784008e-05
60  MA0071.1        RORA 2.284347  7.028427e-29 1.427101 1.462307e-05  1.277896e-27 2.812128e-05

=> Plenty of enriched TFs.

Similarly for the TFs most associated to regions specifically open in Tfh samples:

head(subset(
  ora_res[order(ora_res$pv.Tfh),], 
  fdr.Tfh < 0.01 & OR.Tfh > 2), 10)

    Motif.ID     TF.name   OR.TH1       pv.TH1   OR.Tfh       pv.Tfh      fdr.TH1      fdr.Tfh
108 MA0625.1      NFATC3 1.581686 3.670051e-10 2.403526 9.301669e-40 8.843496e-10 1.860334e-37
88  MA0786.1      POU3F1 1.807866 4.794184e-17 2.262524 3.244028e-34 2.338626e-16 3.244028e-32
69  MA0477.1       FOSL1 2.468310 1.913646e-43 2.229434 4.977903e-33 4.784116e-42 3.318602e-31
28  MA1141.1   FOS::JUND 2.602834 1.845132e-47 2.240753 5.294007e-32 6.150440e-46 2.647004e-30
103 MA0490.1        JUNB 2.550341 2.629518e-46 2.214168 8.948733e-32 7.512908e-45 3.579493e-30
46  MA1144.1 FOSL2::JUND 2.764599 1.525370e-47 2.261137 2.218020e-28 6.101479e-46 7.393398e-27
147 MA0787.1      POU3F2 1.672904 2.262084e-12 2.145987 2.619483e-28 6.854799e-12 7.484238e-27
98  MA0788.1      POU3F3 1.609643 7.136315e-11 2.097486 4.721601e-27 1.903017e-10 1.180400e-25
94  MA0784.1      POU1F1 1.671012 1.056211e-12 2.011368 4.809562e-24 3.580376e-12 9.619123e-23
57  MA0495.2        MAFF 2.022288 1.109122e-20 2.074152 2.448551e-22 7.922300e-20 4.451910e-21

=> A different set of TFs.

Surprisingly, many of the above TFs seem to be significant for both populations. Are there many in this case? We’ll relax the significance & effect size thresholds to get a broader picture:

addmargins(table(
  "signif TH1" = (ora_res$OR.TH1 >= 1.5 & ora_res$fdr.TH1 <= 0.05), 
  "signif Tfh" = (ora_res$OR.Tfh   >= 1.5 & ora_res$fdr.Tfh   <= 0.05)
))

          signif Tfh
signif TH1 FALSE TRUE Sum
     FALSE    49   18  67
     TRUE     60   73 133
     Sum     109   91 200

=> Many TFs are over-represented in both chromatin that is open specifically in TH1 cells AND in chromatin that is open specifically in Tfh cells! This is not necessarily expected.

This might be due to an issue with our background selection. We should investigate.

Investigating the background

We can check to see whether the up/down peaks have a similar sequence composition compared to our selected background of “common” peaks. Let’s do this manually:

# calculate 1-nt and 2-nt frequencies across all peak types
calc_avg_1nt2nt_freqs <- function(seqs) {
  # helper function to calculate mono- and di-nucleotide frequencies
  # across a set of sequences
  c("n"=length(seqs), 
    round(100 * colMeans(cbind(
      oligonucleotideFrequency(seqs, width=1, as.prob = T), # mononucleotide frequencies
      oligonucleotideFrequency(seqs, width=2, as.prob = T)  # dinucleotide freqs
  )), 2))
}
avg_1nt2nt_freqs <- rbind(
  "all" = calc_avg_1nt2nt_freqs(peak_seqs_sel), # calculate for ALL sequences
  do.call(rbind, lapply(
    split(peak_seqs_sel, peaks_gr_sel$state), 
    calc_avg_1nt2nt_freqs))          # calculate for each state
)


# visualise using heatmap and table
h <- heatmap(
  avg_1nt2nt_freqs, scale = "col", Colv=NA,
  col = colorRampPalette(c("blue","black","red"))(16))

as.data.frame(avg_1nt2nt_freqs[
  rev(h$rowInd) # this ensures the same row order as apparent in the heatmap
,])

           n     A     C     G     T   AA   AC   AG   AT   CA   CC   CG   CT   GA   GC   GG   GT
all    30000 24.63 25.38 25.38 24.61 6.54 5.54 8.12 4.44 7.73 7.14 2.38 8.13 6.22 6.47 7.17 5.53
common 25000 24.50 25.51 25.51 24.47 6.48 5.51 8.12 4.39 7.69 7.22 2.47 8.14 6.22 6.56 7.24 5.49
up      2500 24.80 25.02 25.17 25.02 6.68 5.46 8.08 4.58 7.79 6.86 2.29 8.09 6.19 6.48 6.96 5.55
down    2500 25.80 24.34 24.31 25.55 7.02 5.92 8.11 4.75 8.11 6.65 1.54 8.03 6.28 5.55 6.63 5.86
         TA   TC   TG   TT
all    4.14 6.23 7.72 6.52
common 4.11 6.23 7.68 6.45
up     4.13 6.23 7.85 6.80
down   4.39 6.22 8.03 6.90

# clean up
rm(calc_avg_1nt2nt_freqs, avg_1nt2nt_freqs, h)

=> There are clear mono- and di-nucleotide frequency differences between peak “states”, in particular “TH1>Tfh” compared to the rest. The heatmap strongly emphasises the differences, but the actual differences shown in the table are generally within 1-2%.

Conclusion

The mononucleotide and dinucleotide frequency differences between the peak sets suggest that using one of these (common) as a background may introduce biases into the analysis. Options would be:

Continue and knowingly and explicitly accept the potential biases.
Choose a more appropriate background.
Use a motif ORA method that accounts for composition biases. For example, consider monaLisa’s calcBinnedMotifEnrR() function.

Summary

We have seen in this section:

How to perform Fisher test-based ORA of motif occurrences amongst target sequences.
Possibilities for detecting issues with background selections in ORA.

Exercise 5: finding TF BS enrichments within promoter regions

Using the same ATAC-seq data from above, we would now like to perform a Fisher-based ORA, testing to see if there are any motifs enriched amongst the ATAC-seq peaks that were determined to be promoter regions.
Promoter-overlapping regions were flagged in the original data we loaded. From the overlap column on the peaks_gr_sel ranges that we built above, you can identify promoter-overlapping peaks with:

isPromoter <- grepl(':P',mcols(peaks_gr_sel)$overlap)
addmargins(table(isPromoter))

isPromoter
FALSE  TRUE   Sum 
23299  6701 30000

=> ~6k promoter regions out of 30k kept.

You can load the matches_rse object from earlier from the downloadable datasets if necessary:

# load from HDD if necessary
matches_rse <- readRDS("data/Motifs/jaspar2018_matches_rse.rds")

Using ORA, build for each motif the underlying contingency table, and test (with fisher.test()) if there is an association between motif occurrence and promoter overlap. As we are working across a large number of related tests, consider using multiple testing correction (with p.adjust()). Which motifs show the strongest enrichment to promoters? Show the top 10.

	promoter	non-promoter
target	promoter.target	non-promoter.target
non-target	promoter.non-target	non-promoter.non-target

Solution

results <- list()
print(system.time({
  for (motif_id in colnames(matches_rse)) {
    
    xt <- xtabs(
      ~ isPromoter + sites, 
      data.frame(
        isPromoter=isPromoter, 
        sites=assay(matches_rse, "motifMatches")[
          names(peaks_gr_sel),motif_id]
    ))
      
    ft <- fisher.test( xt, alternative="greater")
    results[[motif_id]] <- data.frame(
      row.names = NULL,
      "Motif.ID"  = motif_id, 
      "TF.name"   = colData(matches_rse)[motif_id,"name"], 
      "OR" = unname(ft$estimate), "pv" = ft$p.value)
    
  } 
})) #<10s

   user  system elapsed 
  7.909   0.473   8.440

rm(motif_id, xt, ft)

results <- as.data.frame(do.call("rbind",results))
results$fdr <- p.adjust(results$pv, method="BH") # adjust p-value for multiple testing

# show top hits
results[order(results$OR, decreasing = TRUE)[1:10],]

         Motif.ID       TF.name        OR            pv           fdr
MA1122.1 MA1122.1         TFDP1 15.287403  0.000000e+00  0.000000e+00
MA0162.3 MA0162.3          EGR1 11.779478  0.000000e+00  0.000000e+00
MA0146.2 MA0146.2           Zfx  9.715685  0.000000e+00  0.000000e+00
MA0748.1 MA0748.1           YY2  9.228412 2.866234e-189 2.605667e-188
MA0014.3 MA0014.3          PAX5  7.401562 5.935830e-173 5.161591e-172
MA0872.1 MA0872.1 TFAP2A(var.3)  6.729756 5.753480e-245 1.438370e-243
MA0746.1 MA0746.1           SP3  5.991699  0.000000e+00  0.000000e+00
MA0741.1 MA0741.1         KLF16  5.617325  0.000000e+00  0.000000e+00
MA0024.3 MA0024.3          E2F1  5.406678  7.398366e-55  2.901320e-54
MA0742.1 MA0742.1         Klf12  5.374739  0.000000e+00  0.000000e+00

Wrap Up

What now?

For the interested, a lot more content can be found for this topic in the vignette of the monaLisa package.

Exam is coming up! Regarding this lesson, no questions will be asked on things not covered during the oral presentation.

Questions regarding the session can be sent to me by email at: alex.smith@fmi.ch, though I may field them off to Michael Stadler or Florian Geier.
Questions regarding the course in general should be sent to Robert Ivanek.

References

Merika M, Orkin SH. (1993) DNA-binding specificity of GATA family transcription factors. Mol Cell Biol. 1993;13(7):3999-4010
Stormo GD (2005) Modeling the specificity of protein-DNA interactions. Quant Biol 1(2) 115–130
Ruan S and Stormo GD (2017) Inherent limitations of probabilistic models for protein-DNA binding specificity. PLoS Comput Biol 13(7): e1005638
Subramanian A et al (2005) Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102 (43) 15545-15550
Kuenzli M et al. (2020) Long-lived T follicular helper cells retain plasticity and help sustain humoral immunity. Science Immunology Vol. 5, Issue 45
Machlab D, Burger L, Soneson C, Rijli FM, Schübeler D, Stadler MB (2022) monaLisa: an R/Bioconductor package for identifying regulatory motifs. Bioinformatics, 38(9), 2624-2625

Alternative & extended reading

(List available as of 11/05/2022)

Transcription factors:

The Human Transcription Factors

Motif bioinformatics:

Statistics:

Session info

sessionInfo()

R version 4.2.0 (2022-04-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur/Monterey 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] glmnet_4.1-8                       Matrix_1.6-4                      
 [3] limma_3.52.0                       monaLisa_1.2.0                    
 [5] JASPAR2018_1.1.1                   motifmatchr_1.18.0                
 [7] TFBSTools_1.34.0                   MotifDb_1.38.0                    
 [9] universalmotif_1.14.0              BSgenome.Hsapiens.UCSC.hg38_1.4.4 
[11] BSgenome.Mmusculus.UCSC.mm10_1.4.3 BSgenome_1.64.0                   
[13] rtracklayer_1.56.0                 Biostrings_2.64.0                 
[15] XVector_0.36.0                     ggdendro_0.2.0                    
[17] ggplot2_3.5.0                      stringr_1.5.1                     
[19] SummarizedExperiment_1.26.1        Biobase_2.56.0                    
[21] GenomicRanges_1.48.0               GenomeInfoDb_1.32.1               
[23] IRanges_2.30.0                     S4Vectors_0.34.0                  
[25] BiocGenerics_0.42.0                MatrixGenerics_1.8.0              
[27] matrixStats_1.2.0                  BiocParallel_1.30.0               
[29] BiocPkgTools_1.14.0                htmlwidgets_1.6.4                 
[31] BiocStyle_2.24.0                   magick_2.8.3                      
[33] png_0.1-8                          knitr_1.46                        

loaded via a namespace (and not attached):
  [1] circlize_0.4.16             sm_2.2-5.7.1                BiocFileCache_2.4.0        
  [4] plyr_1.8.9                  igraph_1.5.1                splines_4.2.0              
  [7] digest_0.6.35               foreach_1.5.2               htmltools_0.5.8.1          
 [10] GO.db_3.15.0                fansi_1.0.6                 magrittr_2.0.3             
 [13] memoise_2.0.1               cluster_2.1.6               doParallel_1.0.17          
 [16] tzdb_0.4.0                  ComplexHeatmap_2.12.0       readr_2.1.5                
 [19] annotate_1.74.0             R.utils_2.12.3              colorspace_2.1-0           
 [22] blob_1.2.4                  rvest_1.0.4                 rappdirs_0.3.3             
 [25] xfun_0.43                   dplyr_1.1.4                 crayon_1.5.2               
 [28] RCurl_1.98-1.14             jsonlite_1.8.8              graph_1.74.0               
 [31] TFMPvalue_0.0.9             zoo_1.8-12                  survival_3.5-8             
 [34] iterators_1.0.14            glue_1.7.0                  gtable_0.3.4               
 [37] zlibbioc_1.42.0             GetoptLong_1.0.5            DelayedArray_0.22.0        
 [40] shape_1.4.6.1               scales_1.3.0                DBI_1.2.2                  
 [43] Rcpp_1.0.12                 xtable_1.8-4                clue_0.3-65                
 [46] bit_4.0.5                   stabs_0.6-4                 DT_0.33                    
 [49] httr_1.4.7                  RColorBrewer_1.1-3          farver_2.1.1               
 [52] pkgconfig_2.0.3             XML_3.99-0.16.1             R.methodsS3_1.8.2          
 [55] dbplyr_2.5.0                utf8_1.2.4                  labeling_0.4.3             
 [58] tidyselect_1.2.1            rlang_1.1.3                 reshape2_1.4.4             
 [61] AnnotationDbi_1.58.0        munsell_0.5.1               biocViews_1.64.0           
 [64] tools_4.2.0                 cachem_1.0.8                cli_3.6.2                  
 [67] DirichletMultinomial_1.38.0 generics_0.1.3              RSQLite_2.3.6              
 [70] evaluate_0.23               fastmap_1.1.1               yaml_2.3.8                 
 [73] bit64_4.0.5                 caTools_1.18.2              purrr_1.0.2                
 [76] KEGGREST_1.36.0             splitstackshape_1.4.8       gh_1.4.1                   
 [79] RBGL_1.72.0                 R.oo_1.26.0                 poweRlaw_0.80.0            
 [82] pracma_2.4.4                xml2_1.3.6                  compiler_4.2.0             
 [85] rstudioapi_0.16.0           filelock_1.0.3              curl_5.2.1                 
 [88] tibble_3.2.1                stringi_1.8.3               lattice_0.22-6             
 [91] CNEr_1.32.0                 vctrs_0.6.5                 pillar_1.9.0               
 [94] lifecycle_1.0.4             RUnit_0.4.33                BiocManager_1.30.22        
 [97] GlobalOptions_0.1.2         data.table_1.15.4           bitops_1.0-7               
[100] R6_2.5.1                    BiocIO_1.6.0                codetools_0.2-20           
[103] MASS_7.3-60.0.1             gtools_3.9.5                seqLogo_1.62.0             
[106] rjson_0.2.21                withr_3.0.0                 GenomicAlignments_1.32.0   
[109] Rsamtools_2.12.0            GenomeInfoDbData_1.2.8      parallel_4.2.0             
[112] hms_1.1.3                   tidyr_1.3.1                 rmarkdown_2.26             
[115] vioplot_0.4.0               restfulr_0.0.15

Set Up

Goal

Required packages

Example data

Human GATA2 motifs

Human K562 TF ChIP-seq

Immune cell accessible chromatin regions

Establishing, representing & accessing sequence motifs

Historical background

Computationally representing sequence motifs

Motif packages in R

The PCM

The PPM

The PWM

The ICM

Conclusion

Motif databases

Iteratively refine a MotifDB query

Visualise redundant motifs

De novo motif identification

Preparing the target & background sequences

Building k-mer frequencies

Calculating k-mer enrichment

Summary

Exercise 1: when to use motif representations

Exercise 2: create a PPM

Sequence scoring and matching

Scoring a single sequence

Score calculation pitfalls

Thresholding scores to detect motif occurences

Retrieve 200 example motifs from JASPAR2018

Score thresholding approaches

Comparing score thresholding approaches

Finding motif occurences in sequences

Preparing the sequences

Filtering the peaks

Categorising the peaks

Stratifying the peaks

Efficient scanning in R using motifmatchr

Summary

Exercise 3: scoring a sequence

Exercise 4: playing around with a list of PWMs

Enrichment analysis for motifs

Over-representation analysis (ORA) for PWMs

What is ORA?

ORA for our ATAC-seq

Categorising accessible chromatin peaks

ORA for an example TF (hGATA2) in TH1

Fisher-based ORA across all TFs and both cell populations

Investigating the background

Conclusion

Summary

Exercise 5: finding TF BS enrichments within promoter regions

Wrap Up

What now?

References

Alternative & extended reading

Session info

Efficient scanning in R using `motifmatchr`