Sequence motifs: from position weigth matrices to motif enrichments

Author

Michal Kloc, Florian Geier

Published

May 21, 2025

Goal

  • Understand what a sequence motifs is and how it is represented
  • Learn how to search DNA for potential TF binding sites using sequence motifs
  • Learn how to test if differentially accessible peaks (as identified in the previous lecture) are enriched for specific transcription factor binding sites

Required packages

## parallel computation
library(BiocParallel)

## plotting
library(ggplot2)
library(ggdendro)

## sequence motif specifc packages
library(universalmotif)
library(MotifDb)
library(TFBSTools)
library(motifmatchr)
library(BSgenome.Mmusculus.UCSC.mm10)
library(rtracklayer)
library(JASPAR2022)

## binned enrichment
library(monaLisa)
library(SummarizedExperiment)

##
library(ComplexHeatmap)
library(circlize)

Sequence motifs

Representing sequence motifs

  • DNA sequence motifs where first defined from sets of experimentally identified transcription factor binding sites as e.g. studied in EMSA, protein binding arrays, ChIP-Seq or high-throughput SELEX

  • These experiments yield sets of sequences with high affinity for the TF under study. Lets take a look at such as set identified from human GATA2 binding using classical electrophoretic mobility shift assay (EMSA).

Source: Merika M & Orkin SH et al. 1993
  • By aligning the sequences and treating positions independently we can derive a compact representation of the GATA2 binding site in form of a position count matrix (PCM), also called position frequency matrix (PFM): at each position we tabulate the occurrence of all possible letters (here: A,C,G,T).

  • There are quite a number of different Bioconductor packages and classes to represent sequence motifs. The base functionality for handling sequence motifs is defined in the Biostrings package, but it is enhanced by many other packages such MotifDb and TFBSTools. However, these packages define their own classes which hinders inter-operability. The universalmotif package aims at providing a uniform interface to all these different classes. Nevertheless to be able to use all functionality, e.g. fast motif searches, we still rely on all these different packages, even though some of the functionality is redundant.

  • Given the GATA2 example sequences we can create a position count matrix (PCM) using the create_motif function from universalmotif by:

seqs <- readRDS('data/Motifs/hGATA2_binding_sites.rds')

gata2 <- create_motif(seqs, type="PCM", name = "GATA2")
gata2

       Motif name:   GATA2
         Alphabet:   DNA
             Type:   PCM
          Strands:   +-
         Total IC:   9.19
      Pseudocount:   0
        Consensus:   SGATAN
     Target sites:   19

   S  G  A  T  A N
A  0  0 19  0 19 2
C  9  0  0  0  0 6
G 10 19  0  0  0 8
T  0  0  0 19  0 3
  • Another representation which abstracts from the total number of counts is the position probability matrix (PPM). It normalizes the count matrix by the number of observations, resulting in an estimate for the probability of a observing each letter at a given position. Let \(i \in \{1, 2, 3, 4 \}\) label the bases (vertical direction in the motif matrix) and \(j\) the positin in the motif (horizontal direction), then elements of PPM are simply given as

\[ {\rm PPM}_{ij} = \frac{\rm counts_{ij}}{\rm \sum_{i^{\prime}} counts_{i^{\prime}j}} . \]

convert_type(gata2, type="PPM")

       Motif name:   GATA2
         Alphabet:   DNA
             Type:   PPM
          Strands:   +-
         Total IC:   9.19
      Pseudocount:   0
        Consensus:   SGATAN
     Target sites:   19

     S G A T A    N
A 0.00 0 1 0 1 0.11
C 0.47 0 0 0 0 0.32
G 0.53 1 0 0 0 0.42
T 0.00 0 0 1 0 0.16
view_motifs(gata2, use.type = "PPM")

  • Closely related is the position weight matrix (PWM), which uses the PPM and a background model to assigns a weight to each letter at each position. Background refers here to the base composition at non-specific sites. Often it is assumed to be either uniform (i.e. \(1/4\) for all A,T,C,G) or the genome-wide nucleotide frequencies. PWM scores are the log-ratio of the PPM and the background frequencies:

\[ {\rm PWM}_{ij} = \log_2\left(\frac{\rm PPM_{ij}}{\rm bg_{i}}\right) \] where \({\rm bg}_{i}\) is the background frequency of base \(i\).

  • PWM scores are negative if the probability for letter at a given position is below the background probability.

  • In order to confine scores to finite values one has to make use of pseudo-counts which will make probabilities greater 0.

  • The example below uses as a background frequency the occurrence of A,C,G,T on chromosome 1 of the mouse genome.
freq <- oligonucleotideFrequency(BSgenome.Mmusculus.UCSC.mm10$chr1, width=1)
freq
       A        C        G        T 
56530182 39495313 39467408 56416289 
freq/sum(freq)
        A         C         G         T 
0.2945673 0.2058021 0.2056567 0.2939739 
create_motif(seqs, type="PWM", name="GATA2", pseudocount = 1, bkg = freq/sum(freq))

       Motif name:   GATA2
         Alphabet:   DNA
             Type:   PWM
          Strands:   +-
         Total IC:   7.84
      Pseudocount:   1
        Consensus:   SGATAN
     Target sites:   19

      S     G     A     T     A     N
A -4.56 -4.56  1.71 -4.56  1.71 -1.39
C  1.17 -4.04 -4.04 -4.04 -4.04  0.60
G  1.32  2.23 -4.04 -4.04 -4.04  1.00
T -4.56 -4.56 -4.56  1.71 -4.56 -0.86
  • Yet the final representation of a sequence motif uses information content (IC) as a means to highlight conserved over non conserved sites. The information content matrix (ICM) is a scaled version of the PPM where each position gets scaled by its information content (IC) measured in bits. Most conserved sites have an IC of 2 bits. This type of visualization is sometimes called sequence logo.
view_motifs(gata2, use.type = "ICM")

convert_type(gata2, type="ICM")

       Motif name:   GATA2
         Alphabet:   DNA
             Type:   ICM
          Strands:   +-
         Total IC:   9.19
      Pseudocount:   0
        Consensus:   SGATAN
     Target sites:   19

     S G A T A    N
A 0.00 0 2 0 2 0.02
C 0.47 0 0 0 0 0.06
G 0.53 2 0 0 0 0.08
T 0.00 0 0 2 0 0.03

A detailed derivation of ICM based on Shannon entropy can be found here. The idea is to quantify how much each site is informative about its composition compared to the random background. Long or polarized motifs have generally high IC and can thus be more uniquely matched to the genome.

Some limitations of sequence motifs

  • PWMs are the ones used to scan sequences and identify matches e.g. potential TF binding sites

  • In many cases they are a good proxy for the TFs binding affinity, but there are also limitations to the PWM model:

    • they use single letter frequencies, i.e., they do not capture dependencies between different sites.
    • Independent of affinity, TF concentration is a major determinant for binding site occupancy. While at low TF concentrations the PWM score for a given site reflects the probability of it being bound, at very high TF concentration unspecific binding will dominate.
    • The binding preference is also determined by other factors such as chromatin state or co-factors.

Motif databases

  • There are several resources providing PWM matrices, some are free (i.e., Jaspar), some require a license (i.e., Transfac).

  • Many resources are directly available through R packages e.g. the MotifDb, TFBSTools and JASPAR2022 packages.

  • All these databases and packages collect the PWMs from different primary sources and there is a lot of redundancy in them. Even a single resource might have multiple PWMs for the same TF (acquired in different species, by different techniques).

  • For example lets check the GATA2 transcription factor in the MotifDb package:

library(MotifDb)
query(MotifDb, andStrings=(c("GATA2")))
MotifDb object of length 19
| Created from downloaded public sources, last update: 2022-Mar-04
| 19 position frequency matrices from 12 sources:
|        HOCOMOCOv10:    2
| HOCOMOCOv11-core-A:    1
| HOCOMOCOv11-secondary-A:    1
|        HOCOMOCOv13:    1
|              HOMER:    1
|        JASPAR_2014:    1
|        JASPAR_CORE:    1
|         jaspar2016:    2
|         jaspar2018:    4
|         jaspar2022:    2
|         jaspar2024:    2
|       SwissRegulon:    1
| 4 organism/s
|           Hsapiens:   14
|          Athaliana:    3
|          Mmusculus:    1
|              other:    1
Hsapiens-HOCOMOCOv13-GATA2.H13CORE.1.P.B 
Hsapiens-HOCOMOCOv10-GATA2_HUMAN.H10MO.A 
Mmusculus-HOCOMOCOv10-GATA2_MOUSE.H10MO.A 
Hsapiens-HOCOMOCOv11-core-A-GATA2_HUMAN.H11MO.0.A 
Hsapiens-HOCOMOCOv11-secondary-A-GATA2_HUMAN.H11MO.1.A 
...
Hsapiens-jaspar2022-GATA2-MA0036.3 
Athaliana-jaspar2022-GATA20-MA1324.1 
Hsapiens-jaspar2024-GATA2-MA0036.4 
Athaliana-jaspar2024-GATA20-MA1324.1 
Hsapiens-SwissRegulon-GATA2.SwissRegulon 
query(MotifDb, andStrings=(c("GATA2", "JASPAR2018")))
MotifDb object of length 4
| Created from downloaded public sources, last update: 2022-Mar-04
| 4 position frequency matrices from 1 source:
|         jaspar2018:    4
| 2 organism/s
|           Hsapiens:    3
|          Athaliana:    1
Hsapiens-jaspar2018-GATA2-MA0036.1 
Hsapiens-jaspar2018-GATA2-MA0036.2 
Hsapiens-jaspar2018-GATA2-MA0036.3 
Athaliana-jaspar2018-GATA20-MA1324.1 
query(MotifDb, andStrings=(c("GATA2", "JASPAR2018", "Hsapiens")))
MotifDb object of length 3
| Created from downloaded public sources, last update: 2022-Mar-04
| 3 position frequency matrices from 1 source:
|         jaspar2018:    3
| 1 organism/s
|           Hsapiens:    3
Hsapiens-jaspar2018-GATA2-MA0036.1 
Hsapiens-jaspar2018-GATA2-MA0036.2 
Hsapiens-jaspar2018-GATA2-MA0036.3 
  • This set of final GATA2 motifs from the Jaspar2022 database is quite redundant as the following picture shows. Here, we add our matrix to the end of the list.
res <- query(MotifDb, andStrings=(c("GATA2", "JASPAR2018", "Hsapiens")))
res <- convert_motifs(res)
res[[4]] <- gata2
view_motifs(res, dedup.names = TRUE)

  • This redundancy complicates the interpretation of motif analysis, however its not clear from the beginning which motifs are superior in a redundant motif set. Often the motif with he highest information content is only reflecting the consensus site or the best scoring site and will not match relevant sites with lower affinity.

  • For the following analysis we will select PWMs from the “core” of the Jaspar 2022 database using TFBSTools. As the number of mouse specific motifs is quite limited, we will select all vertebrate motifs.

pwms <- getMatrixSet(JASPAR2022,
                     opts = list(matrixtype = "PWM",
                                 collection="CORE",
                                 tax_group = "vertebrates"))

tfNames <- sapply(pwms, name) #collect names

Sequence matching and scoring

  • The score of a sequence is calculated by summing the score of the matching letter for each position of the PWM
gata2PWM <- convert_type(gata2, type="PWM", pseudocount = 1)
gata2PWM['motif']
           S         G         A         T         A          N
A -4.3219281 -4.321928  1.944858 -4.321928  1.944858 -1.1520031
C  0.8875253 -4.321928 -4.321928 -4.321928 -4.321928  0.3219281
G  1.0356239  1.944858 -4.321928 -4.321928 -4.321928  0.7224660
T -4.3219281 -4.321928 -4.321928  1.944858 -4.321928 -0.6214884
score_match(gata2PWM, "CAAGAT")
[1] -4.488
0.8875253  + (-4.321928) + 1.944858 + (-4.321928) + 1.944858 + (-0.6214884)
[1] -4.488103
  • We can also assess the minimal and maximal score of the motif.
motif_range(gata2PWM)
    min     max 
-22.757   9.533 
#by hand
mot.max <- 1.0356239  + 1.944858 + 1.944858 + 1.944858 + 1.944858 + 0.7224660
mot.min <- - 4.3219281 - 4.321928 - 4.321928 - 4.321928 - 4.321928 - 1.1520031

sprintf("Maximal score of the motif: %s", mot.max)
[1] "Maximal score of the motif: 9.5375219"
sprintf("Minimal score of the motif: %s", mot.min)
[1] "Minimal score of the motif: -22.7616432"
  • Potential target sites of TFs (e.g., genomic locations potentially bound by the TF) are showing scores greater than 0, and non-specific sites are showing scores below 0.

  • However a score threshold of 0 will in almost all cases (except very polarized motifs) pick far too many non-relevant sites when e.g. applied to a set of peak or promoter sequences.

  • Ideally, each PWM would be accompanied by a score threshold which is derived experimentally e.g. using ChIP-Seq data by contrasting true binding sites with non-functional sites. Such an absolute score would reflect directly the critical binding affinity for each TF. This approach was used in the HOMER database, but doesn’t guarantee an optimal score threshold when applied to other sets of sequences. This approach using an absolute score is also implemented in monalisa package as the default motif scoring.

  • There are other two alternatives implemented in different Bioconductor packages:

    • Use a simple similarity cutoff, e.g., 80% of the highest possible PWM score (matchPWM approach form Biostrings). This approach tends to favor short motifs.

    • Calculate p-values per motif and threshold them. Here, first a sequence background needs to be estimated for each motif. This process generates random sequences of the same length as the motif sequence and scores them. The threshold \(T\) is indirectly chosen such, that the probability, that the background distribution would generate the set of sequences all with score \(\ge T\) is less likely than the p-value. Usually a relatively stringent p-value needs to be chosen. Then short motifs or motifs with low information content get a threshold which is close or equal to their maximal score (i.e. only sites close to perfect matches are predicted). On the other hand motifs with high information content are treated more liberal, such that more sites that deviate from the optimal are retained.

Binned motif enrichment analysis

  • In the previous lecture we identified differentially open chromatin regions between the doubly-positive (DP) and CD4+/CD8+ T cell populations. Now we would like to infer underlying regulators. To this end we will investigate the relation between the occurrence of (potential) transcription factor binding sites and differential accessibility.

  • The basic idea would to identify overrepresented motifs in the differentially regulated regions and thus link them to activities of transcription factors. We could generate a contingency table per TF motif that will split the chromatin regions according to their accessibility (specific vs. common) and if the TF motif is identified (target vs. non-target). Then, Fisher’s test can be applied.

  common specific (up or down)
target common.target specific.target
non-target common.non-target specific.non-target
  • We will elaborate on this basic idea and implement a binned version of enrichment analysis using monalisa package. The word “binned” refers to the approach where we consider a continuous score per analysed region (like \(\rm{log}_2 FC\)) and group the regions based on this score into several bins. Using Fisher’s test, we subsequently test overrepresentation of the motifs in each bin (foreground) with respect to all the remaining regions (background). Compared to the approach using the contingency table above, we will obtain a finer resolution of the associations between the motifs and the observed changes in accessibility.

  • The package performs the motif matching as well as the Fisher’s test. It also provides useful diagnostic tools and has a built-in sequence composition correction for GC and k-mer differences between the foreground and background.

Region binning

First, we read the data from the last week’s lecture. We will will try to identify TFs that are driving the differentiation of T cells towards CD4+ helper cells.

peaks <- read.table(gzfile("data/Motifs/peaks_CD4_vs_DP.tsv.gz"), header=TRUE, stringsAsFactors=FALSE, sep="\t")
peaks.gr <- GRanges(peaks, seqinfo=seqinfo(BSgenome.Mmusculus.UCSC.mm10))
names(peaks.gr) <-as.character(peaks.gr)

peaks.gr
GRanges object with 48744 ranges and 14 metadata columns:
                       seqnames          ranges strand |      overlap
                          <Rle>       <IRanges>  <Rle> |  <character>
  chr1:3042901-3043350     chr1 3042901-3043350      * |             
  chr1:4258201-4258500     chr1 4258201-4258500      * |      Rp1:-:I
  chr1:4496251-4496850     chr1 4496251-4496850      * |   Sox17:-:PE
  chr1:4771051-4771200     chr1 4771051-4771200      * |             
  chr1:4780201-4780350     chr1 4780201-4780350      * |   Mrpl15:-:I
                   ...      ...             ...    ... .          ...
    chrY:897451-897600     chrY   897451-897600      * |    Kdm5d:+:P
  chrY:1010251-1010700     chrY 1010251-1010700      * | Eif2s3y:+:PE
  chrY:1245451-1245900     chrY 1245451-1245900      * |     Uty:-:PE
  chrY:2224951-2225100     chrY 2224951-2225100      * |             
  chrY:2225401-2225550     chrY 2225401-2225550      * |             
                                left         right num.tests num.up.logFC
                         <character>   <character> <integer>    <integer>
  chr1:3042901-3043350                                     3            0
  chr1:4258201-4258500                  Rp1:-:3027         2            0
  chr1:4496251-4496850   Sox17:-:309                       4            1
  chr1:4771051-4771200               Mrpl15:-:2006         1            0
  chr1:4780201-4780350 Mrpl15:-:2553  Mrpl15:-:871         1            0
                   ...           ...           ...       ...          ...
    chrY:897451-897600                 Kdm5d:+:188         1            0
  chrY:1010251-1010700               Eif2s3y:+:560         3            0
  chrY:1245451-1245900     Uty:-:232                       3            1
  chrY:2224951-2225100                                     1            0
  chrY:2225401-2225550                                     1            0
                       num.down.logFC      PValue         FDR   direction
                            <integer>   <numeric>   <numeric> <character>
  chr1:3042901-3043350              0 1.96259e-01 0.414670778        down
  chr1:4258201-4258500              2 3.55200e-02 0.141788961        down
  chr1:4496251-4496850              0 5.06666e-06 0.000198688          up
  chr1:4771051-4771200              0 2.84625e-01 0.513425190        down
  chr1:4780201-4780350              0 3.98520e-01 0.618252846          up
                   ...            ...         ...         ...         ...
    chrY:897451-897600              0   0.1663729    0.375639          up
  chrY:1010251-1010700              0   0.1824989    0.397326          up
  chrY:1245451-1245900              0   0.0398650    0.152983          up
  chrY:2224951-2225100              0   0.0880565    0.254419        down
  chrY:2225401-2225550              0   0.0513747    0.180484        down
                        rep.test rep.logFC rep.logCPM    Contrast
                       <integer> <numeric>  <numeric> <character>
  chr1:3042901-3043350         3 -0.776602   0.116788    CD4 - DP
  chr1:4258201-4258500         5 -1.115939  -0.661670    CD4 - DP
  chr1:4496251-4496850         7  2.978558   0.121192    CD4 - DP
  chr1:4771051-4771200        10 -0.732725  -1.027829    CD4 - DP
  chr1:4780201-4780350        11  0.504890  -0.618404    CD4 - DP
                   ...       ...       ...        ...         ...
    chrY:897451-897600    125829  0.867217  -0.817155    CD4 - DP
  chrY:1010251-1010700    125832  0.876682   0.315328    CD4 - DP
  chrY:1245451-1245900    125833  1.636117  -0.948253    CD4 - DP
  chrY:2224951-2225100    125836 -1.034671  -0.894518    CD4 - DP
  chrY:2225401-2225550    125837 -0.990185  -0.498133    CD4 - DP
                              name
                       <character>
  chr1:3042901-3043350      Common
  chr1:4258201-4258500      Common
  chr1:4496251-4496850         CD4
  chr1:4771051-4771200      Common
  chr1:4780201-4780350      Common
                   ...         ...
    chrY:897451-897600      Common
  chrY:1010251-1010700      Common
  chrY:1245451-1245900      Common
  chrY:2224951-2225100      Common
  chrY:2225401-2225550      Common
  -------
  seqinfo: 239 sequences (1 circular) from mm10 genome

Now, we will bin the regions according to the reported log-fold-change. First, let’s have a look how are these values distributed.

hist(peaks.gr$rep.logFC, 100, col = "gray", main = "",
       xlab = "Change of accesibility (log2FC)", ylab = "frequency")

The distribution is fairly symmetric and centered around 0. We will use the bin function from the package to split the regions into groups of an equal size of 2000 and by minAbsX = 1

bins <- monaLisa::bin(x = peaks.gr$rep.logFC, binmode = "equalN", nElement = 2000,
              minAbsX = 1)
plotBinDensity(peaks.gr$rep.logFC, bins, legend = "topleft", xlab = "log2FC")
Warning: `legend` is deprecated and ignored.
ℹ You can use `legendPosition` to control legend drawing and position

Note that the bin breaks around the no-change bin are not exactly -1 to 1. They have been adjusted to have the required 2000 per bin below and above it.

The size of the identified regions from ATACseq can vary substantially. The very long regions have naturally higher chance to be randomly matched by a TF motif.

summary(width(peaks.gr))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  150.0   150.0   300.0   399.9   450.0  3900.0 
hist(width(peaks.gr))
abline(v = median(width(peaks.gr)), col="red") #add median value
abline(v =  quantile(width(peaks.gr),0.95), col="blue", lwd=3, lty=2)

At this point it is a question if we ignore this bias or trim the sequences, which on the other hand leads to a loss of regions that we identified as accessible. One solution could be to split the very long regions and analyse them separately. For simplicity, here we will trim only the very long sequences (above the 95th quantile of the sequence width distribution).

width.for.resize <- quantile(width(peaks.gr),0.95)
peaks.gr[width(peaks.gr) > width.for.resize ] <-  trim( resize(peaks.gr[width(peaks.gr) > width.for.resize ], width =  width.for.resize, fix = "center"))
summary(width( peaks.gr))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  150.0   150.0   300.0   384.1   450.0  1200.0 

Sequence compositions

So far, our GRanges object contains only the genomic coordinates of the regions. Let’s extract the respective sequences using Biostrings package and turn the variable into a DNAStringSet object.

peaks.gr <- getSeq(BSgenome.Mmusculus.UCSC.mm10, peaks.gr)

Now we can plot the CG content and dinucleotide frequencies across the bins.

plotBinDiagnostics(seqs = peaks.gr, bins = bins, aspect = "GCfrac")

monaLisa::plotBinDiagnostics(seqs = peaks.gr, bins = bins, aspect = "dinucfreq")

Apparently, the CG content is generally correlated with identified differential accessibility. This would mean that any GC-rich motif is likely to be associated with the bins with negative log-fold-change. As already mentioned, monalisa takes the GC content into account. Alternatively, one could re-sample the regions with more consistent GC content.

Enrichment test

The enrichment test is performed using the function calcBinnedMotifEnrR. The function performs several steps:

  • It matches the motifs to the sequences of the studied regions using the findMotifHits function. The default setting implements matchPWM function from the Biostrings package with a fixed empiric score min.score = 10.

  • It performs adjustments for different k-mer and GC compositions between the foreground and the background. The default maximal k-mer length is set maxKmerSize = 3L.

  • Fisher’s test is performed for each bin vs. the rest. Benjamini-Hochberg correction for multiple testing is applied.

suppressWarnings({
MotifEnrichment <- calcBinnedMotifEnrR(seqs = peaks.gr ,
                                      pwmL = pwms ,
                                      bins = bins ,
                                      BPPARAM = BiocParallel::SerialParam(RNGseed = 42),
                                      verbose = TRUE)

})
ℹ Filtering sequences ...
ℹ in total filtering out 0 of 48744 sequences (0%)
✔ in total filtering out 0 of 48744 sequences (0%) [7ms]
ℹ Filtering sequences ...
✔ Filtering sequences ... [60ms]

ℹ Scanning sequences for motif hits...
✔ Scanning sequences for motif hits... [4m 41.1s]

ℹ Create motif hit matrix...
ℹ starting analysis of bin [-7.97,-1.88]
✔ starting analysis of bin [-7.97,-1.88] [4ms]

ℹ Create motif hit matrix...
ℹ Defining background sequence set (otherBins)...
✔ Defining background sequence set (otherBins)... [73ms]

ℹ Create motif hit matrix...
ℹ Correcting for GC differences to the background sequences...
ℹ 7 of 9 GC-bins used (have both fore- and background sequences) 12 of 487…
✔ 7 of 9 GC-bins used (have both fore- and background sequences) 12 of 487…

ℹ Correcting for GC differences to the background sequences...
✔ Correcting for GC differences to the background sequences... [66ms]

ℹ Create motif hit matrix...
ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ starting iterative adjustment for k-mer composition (up to 160 iteration…
✔ starting iterative adjustment for k-mer composition (up to 160 iteration…

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 40 of 160 iterations done
✔ 80 of 160 iterations done [3s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 80 of 160 iterations done
✔ 120 of 160 iterations done [2.7s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 120 of 160 iterations done
✔ 160 of 160 iterations done [2.6s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 160 of 160 iterations done
ℹ     iterations finished
ℹ 160 of 160 iterations done
✔ 160 of 160 iterations done [14ms]

ℹ Correcting for k-mer differences between fore- and background sequences.…
✔ Correcting for k-mer differences between fore- and background sequences.…

ℹ Create motif hit matrix...
ℹ Calculating motif enrichment...
ℹ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…
✔ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…

ℹ Calculating motif enrichment...
✔ Calculating motif enrichment... [1.2s]

ℹ Create motif hit matrix...
ℹ starting analysis of bin (-1.88,-1.45]
✔ starting analysis of bin (-1.88,-1.45] [4ms]

ℹ Create motif hit matrix...
ℹ Defining background sequence set (otherBins)...
✔ Defining background sequence set (otherBins)... [67ms]

ℹ Create motif hit matrix...
ℹ Correcting for GC differences to the background sequences...
ℹ 8 of 9 GC-bins used (have both fore- and background sequences) 0 of 4874…
✔ 8 of 9 GC-bins used (have both fore- and background sequences) 0 of 4874…

ℹ Correcting for GC differences to the background sequences...
✔ Correcting for GC differences to the background sequences... [42ms]

ℹ Create motif hit matrix...
ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ starting iterative adjustment for k-mer composition (up to 160 iteration…
✔ starting iterative adjustment for k-mer composition (up to 160 iteration…

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 40 of 160 iterations done
✔ 80 of 160 iterations done [2.7s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 80 of 160 iterations done
✔ 120 of 160 iterations done [3s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 120 of 160 iterations done
✔ 160 of 160 iterations done [3.2s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 160 of 160 iterations done
ℹ     iterations finished
ℹ 160 of 160 iterations done
✔ 160 of 160 iterations done [21ms]

ℹ Correcting for k-mer differences between fore- and background sequences.…
✔ Correcting for k-mer differences between fore- and background sequences.…

ℹ Create motif hit matrix...
ℹ Calculating motif enrichment...
ℹ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…
✔ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…

ℹ Calculating motif enrichment...
✔ Calculating motif enrichment... [1.1s]

ℹ Create motif hit matrix...
ℹ starting analysis of bin (-1.45,-1.18]
✔ starting analysis of bin (-1.45,-1.18] [4ms]

ℹ Create motif hit matrix...
ℹ Defining background sequence set (otherBins)...
✔ Defining background sequence set (otherBins)... [70ms]

ℹ Create motif hit matrix...
ℹ Correcting for GC differences to the background sequences...
ℹ 7 of 9 GC-bins used (have both fore- and background sequences) 12 of 487…
✔ 7 of 9 GC-bins used (have both fore- and background sequences) 12 of 487…

ℹ Correcting for GC differences to the background sequences...
✔ Correcting for GC differences to the background sequences... [39ms]

ℹ Create motif hit matrix...
ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ starting iterative adjustment for k-mer composition (up to 160 iteration…
✔ starting iterative adjustment for k-mer composition (up to 160 iteration…

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 40 of 160 iterations done
✔ 80 of 160 iterations done [2.7s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 80 of 160 iterations done
✔ 120 of 160 iterations done [3s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 120 of 160 iterations done
✔ 160 of 160 iterations done [2.8s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 160 of 160 iterations done
ℹ     iterations finished
ℹ 160 of 160 iterations done
✔ 160 of 160 iterations done [16ms]

ℹ Correcting for k-mer differences between fore- and background sequences.…
✔ Correcting for k-mer differences between fore- and background sequences.…

ℹ Create motif hit matrix...
ℹ Calculating motif enrichment...
ℹ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…
✔ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…

ℹ Calculating motif enrichment...
✔ Calculating motif enrichment... [1.2s]

ℹ Create motif hit matrix...
ℹ starting analysis of bin (-1.18,-0.989]
✔ starting analysis of bin (-1.18,-0.989] [5ms]

ℹ Create motif hit matrix...
ℹ Defining background sequence set (otherBins)...
✔ Defining background sequence set (otherBins)... [78ms]

ℹ Create motif hit matrix...
ℹ Correcting for GC differences to the background sequences...
ℹ 7 of 9 GC-bins used (have both fore- and background sequences) 12 of 487…
✔ 7 of 9 GC-bins used (have both fore- and background sequences) 12 of 487…

ℹ Correcting for GC differences to the background sequences...
✔ Correcting for GC differences to the background sequences... [56ms]

ℹ Create motif hit matrix...
ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ starting iterative adjustment for k-mer composition (up to 160 iteration…
✔ starting iterative adjustment for k-mer composition (up to 160 iteration…

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 40 of 160 iterations done
✔ 80 of 160 iterations done [2.9s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 80 of 160 iterations done
✔ 120 of 160 iterations done [3.4s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 120 of 160 iterations done
✔ 160 of 160 iterations done [2.9s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 160 of 160 iterations done
ℹ     iterations finished
ℹ 160 of 160 iterations done
✔ 160 of 160 iterations done [18ms]

ℹ Correcting for k-mer differences between fore- and background sequences.…
✔ Correcting for k-mer differences between fore- and background sequences.…

ℹ Create motif hit matrix...
ℹ Calculating motif enrichment...
ℹ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…
✔ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…

ℹ Calculating motif enrichment...
✔ Calculating motif enrichment... [1.2s]

ℹ Create motif hit matrix...
ℹ starting analysis of bin (-0.989,1]
✔ starting analysis of bin (-0.989,1] [4ms]

ℹ Create motif hit matrix...
ℹ Defining background sequence set (otherBins)...
✔ Defining background sequence set (otherBins)... [71ms]

ℹ Create motif hit matrix...
ℹ Correcting for GC differences to the background sequences...
ℹ 8 of 9 GC-bins used (have both fore- and background sequences) 0 of 4874…
✔ 8 of 9 GC-bins used (have both fore- and background sequences) 0 of 4874…

ℹ Correcting for GC differences to the background sequences...
✔ Correcting for GC differences to the background sequences... [41ms]

ℹ Create motif hit matrix...
ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ starting iterative adjustment for k-mer composition (up to 160 iteration…
✔ starting iterative adjustment for k-mer composition (up to 160 iteration…

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 40 of 160 iterations done
✔ 80 of 160 iterations done [1.6s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 80 of 160 iterations done
✔ 120 of 160 iterations done [1.6s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 120 of 160 iterations done
✔ 160 of 160 iterations done [1.9s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 160 of 160 iterations done
ℹ     iterations finished
ℹ 160 of 160 iterations done
✔ 160 of 160 iterations done [15ms]

ℹ Correcting for k-mer differences between fore- and background sequences.…
✔ Correcting for k-mer differences between fore- and background sequences.…

ℹ Create motif hit matrix...
ℹ Calculating motif enrichment...
ℹ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…
✔ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…

ℹ Calculating motif enrichment...
✔ Calculating motif enrichment... [2.3s]

ℹ Create motif hit matrix...
ℹ starting analysis of bin (1,1.16]
✔ starting analysis of bin (1,1.16] [4ms]

ℹ Create motif hit matrix...
ℹ Defining background sequence set (otherBins)...
✔ Defining background sequence set (otherBins)... [86ms]

ℹ Create motif hit matrix...
ℹ Correcting for GC differences to the background sequences...
ℹ 8 of 9 GC-bins used (have both fore- and background sequences) 0 of 4874…
✔ 8 of 9 GC-bins used (have both fore- and background sequences) 0 of 4874…

ℹ Correcting for GC differences to the background sequences...
✔ Correcting for GC differences to the background sequences... [55ms]

ℹ Create motif hit matrix...
ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ starting iterative adjustment for k-mer composition (up to 160 iteration…
✔ starting iterative adjustment for k-mer composition (up to 160 iteration…

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 40 of 160 iterations done
✔ 80 of 160 iterations done [3.1s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 80 of 160 iterations done
✔ 120 of 160 iterations done [3.2s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 120 of 160 iterations done
✔ 160 of 160 iterations done [2.9s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 160 of 160 iterations done
ℹ     iterations finished
ℹ 160 of 160 iterations done
✔ 160 of 160 iterations done [15ms]

ℹ Correcting for k-mer differences between fore- and background sequences.…
✔ Correcting for k-mer differences between fore- and background sequences.…

ℹ Create motif hit matrix...
ℹ Calculating motif enrichment...
ℹ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…
✔ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…

ℹ Calculating motif enrichment...
✔ Calculating motif enrichment... [1.1s]

ℹ Create motif hit matrix...
ℹ starting analysis of bin (1.16,1.36]
✔ starting analysis of bin (1.16,1.36] [4ms]

ℹ Create motif hit matrix...
ℹ Defining background sequence set (otherBins)...
✔ Defining background sequence set (otherBins)... [70ms]

ℹ Create motif hit matrix...
ℹ Correcting for GC differences to the background sequences...
ℹ 7 of 9 GC-bins used (have both fore- and background sequences) 12 of 487…
✔ 7 of 9 GC-bins used (have both fore- and background sequences) 12 of 487…

ℹ Correcting for GC differences to the background sequences...
✔ Correcting for GC differences to the background sequences... [41ms]

ℹ Create motif hit matrix...
ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ starting iterative adjustment for k-mer composition (up to 160 iteration…
✔ starting iterative adjustment for k-mer composition (up to 160 iteration…

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 40 of 160 iterations done
✔ 80 of 160 iterations done [2.7s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 80 of 160 iterations done
✔ 120 of 160 iterations done [2.7s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 120 of 160 iterations done
✔ 160 of 160 iterations done [2.7s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 160 of 160 iterations done
ℹ     iterations finished
ℹ 160 of 160 iterations done
✔ 160 of 160 iterations done [16ms]

ℹ Correcting for k-mer differences between fore- and background sequences.…
✔ Correcting for k-mer differences between fore- and background sequences.…

ℹ Create motif hit matrix...
ℹ Calculating motif enrichment...
ℹ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…
✔ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…

ℹ Calculating motif enrichment...
✔ Calculating motif enrichment... [1s]

ℹ Create motif hit matrix...
ℹ starting analysis of bin (1.36,1.6]
✔ starting analysis of bin (1.36,1.6] [4ms]

ℹ Create motif hit matrix...
ℹ Defining background sequence set (otherBins)...
✔ Defining background sequence set (otherBins)... [67ms]

ℹ Create motif hit matrix...
ℹ Correcting for GC differences to the background sequences...
ℹ 8 of 9 GC-bins used (have both fore- and background sequences) 0 of 4874…
✔ 8 of 9 GC-bins used (have both fore- and background sequences) 0 of 4874…

ℹ Correcting for GC differences to the background sequences...
✔ Correcting for GC differences to the background sequences... [39ms]

ℹ Create motif hit matrix...
ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ starting iterative adjustment for k-mer composition (up to 160 iteration…
✔ starting iterative adjustment for k-mer composition (up to 160 iteration…

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 40 of 160 iterations done
✔ 80 of 160 iterations done [2.7s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 80 of 160 iterations done
✔ 120 of 160 iterations done [2.7s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 120 of 160 iterations done
✔ 160 of 160 iterations done [3s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 160 of 160 iterations done
ℹ     iterations finished
ℹ 160 of 160 iterations done
✔ 160 of 160 iterations done [13ms]

ℹ Correcting for k-mer differences between fore- and background sequences.…
✔ Correcting for k-mer differences between fore- and background sequences.…

ℹ Create motif hit matrix...
ℹ Calculating motif enrichment...
ℹ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…
✔ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…

ℹ Calculating motif enrichment...
✔ Calculating motif enrichment... [1.1s]

ℹ Create motif hit matrix...
ℹ starting analysis of bin (1.6,1.93]
✔ starting analysis of bin (1.6,1.93] [5ms]

ℹ Create motif hit matrix...
ℹ Defining background sequence set (otherBins)...
✔ Defining background sequence set (otherBins)... [71ms]

ℹ Create motif hit matrix...
ℹ Correcting for GC differences to the background sequences...
ℹ 7 of 9 GC-bins used (have both fore- and background sequences) 12 of 487…
✔ 7 of 9 GC-bins used (have both fore- and background sequences) 12 of 487…

ℹ Correcting for GC differences to the background sequences...
✔ Correcting for GC differences to the background sequences... [40ms]

ℹ Create motif hit matrix...
ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ starting iterative adjustment for k-mer composition (up to 160 iteration…
✔ starting iterative adjustment for k-mer composition (up to 160 iteration…

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 40 of 160 iterations done
✔ 80 of 160 iterations done [3.3s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 80 of 160 iterations done
✔ 120 of 160 iterations done [2.6s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 120 of 160 iterations done
✔ 160 of 160 iterations done [2.7s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 160 of 160 iterations done
ℹ     iterations finished
ℹ 160 of 160 iterations done
✔ 160 of 160 iterations done [13ms]

ℹ Correcting for k-mer differences between fore- and background sequences.…
✔ Correcting for k-mer differences between fore- and background sequences.…

ℹ Create motif hit matrix...
ℹ Calculating motif enrichment...
ℹ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…
✔ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…

ℹ Calculating motif enrichment...
✔ Calculating motif enrichment... [1s]

ℹ Create motif hit matrix...
ℹ starting analysis of bin (1.93,2.48]
✔ starting analysis of bin (1.93,2.48] [4ms]

ℹ Create motif hit matrix...
ℹ Defining background sequence set (otherBins)...
✔ Defining background sequence set (otherBins)... [67ms]

ℹ Create motif hit matrix...
ℹ Correcting for GC differences to the background sequences...
ℹ 7 of 9 GC-bins used (have both fore- and background sequences) 12 of 487…
✔ 7 of 9 GC-bins used (have both fore- and background sequences) 12 of 487…

ℹ Correcting for GC differences to the background sequences...
✔ Correcting for GC differences to the background sequences... [40ms]

ℹ Create motif hit matrix...
ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ starting iterative adjustment for k-mer composition (up to 160 iteration…
✔ starting iterative adjustment for k-mer composition (up to 160 iteration…

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 40 of 160 iterations done
✔ 80 of 160 iterations done [2.6s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 80 of 160 iterations done
✔ 120 of 160 iterations done [2.7s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 120 of 160 iterations done
✔ 160 of 160 iterations done [2.8s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 160 of 160 iterations done
ℹ     iterations finished
ℹ 160 of 160 iterations done
✔ 160 of 160 iterations done [13ms]

ℹ Correcting for k-mer differences between fore- and background sequences.…
✔ Correcting for k-mer differences between fore- and background sequences.…

ℹ Create motif hit matrix...
ℹ Calculating motif enrichment...
ℹ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…
✔ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…

ℹ Calculating motif enrichment...
✔ Calculating motif enrichment... [1.1s]

ℹ Create motif hit matrix...
ℹ starting analysis of bin (2.48,8.36]
✔ starting analysis of bin (2.48,8.36] [4ms]

ℹ Create motif hit matrix...
ℹ Defining background sequence set (otherBins)...
✔ Defining background sequence set (otherBins)... [66ms]

ℹ Create motif hit matrix...
ℹ Correcting for GC differences to the background sequences...
ℹ 8 of 9 GC-bins used (have both fore- and background sequences) 0 of 4874…
✔ 8 of 9 GC-bins used (have both fore- and background sequences) 0 of 4874…

ℹ Correcting for GC differences to the background sequences...
✔ Correcting for GC differences to the background sequences... [39ms]

ℹ Create motif hit matrix...
ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ starting iterative adjustment for k-mer composition (up to 160 iteration…
✔ starting iterative adjustment for k-mer composition (up to 160 iteration…

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 40 of 160 iterations done
✔ 80 of 160 iterations done [2.7s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 80 of 160 iterations done
✔ 120 of 160 iterations done [2.9s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 120 of 160 iterations done
✔ 160 of 160 iterations done [2.9s]

ℹ Correcting for k-mer differences between fore- and background sequences.…
ℹ 160 of 160 iterations done
ℹ     iterations finished
ℹ 160 of 160 iterations done
✔ 160 of 160 iterations done [16ms]

ℹ Correcting for k-mer differences between fore- and background sequences.…
✔ Correcting for k-mer differences between fore- and background sequences.…

ℹ Create motif hit matrix...
ℹ Calculating motif enrichment...
ℹ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…
✔ using Fisher's exact test (one-sided) to calculate log(p-values) for enr…

ℹ Calculating motif enrichment...
✔ Calculating motif enrichment... [985ms]

ℹ Create motif hit matrix...
✔ Create motif hit matrix... [2m 20.5s]
# saveRDS(MotifEnrichment, file = "data/Motifs/MotEnrichment.RDS")

We will filter those transcription factors, with \(-{\rm log} (p.adj) > 10\) in at least one bin.

#MotifEnrichment <- readRDS("data/Motifs/MotEnrichment.RDS")
sel  <- apply(assay(MotifEnrichment, "negLog10Padj"), 1, 
             function(x) max(abs(x), 0, na.rm = TRUE)) > 10.0

sum(sel, na.rm = TRUE) #78
[1] 78
MotifEnrichmentSel <- MotifEnrichment[sel, ]

We can visually represent the results in a heatmap using plotMotifHeatmaps function. Rows can be clustered according to motif similarity.

SimMatSel <- monaLisa::motifSimilarity(rowData(MotifEnrichmentSel)$motif.pfm)
range(SimMatSel)
[1] 0.04543382 1.00000000
hcl <- hclust(as.dist(1 - SimMatSel), method = "average")
monaLisa::plotMotifHeatmaps(x = MotifEnrichmentSel, which.plots = c("log2enr", "negLog10Padj"), 
                  width = 2.1, cluster = hcl, maxEnr = 2, maxSig = 15,
                  show_dendrogram = TRUE, show_seqlogo = TRUE, show_motif_GC = TRUE,
                  width.seqlogo = 1.5)

  • In the heatmap we can identify several TFs whose motifs correlate with the observed differences in accessibility and thus are more likely to drive the differentiation process towards CD4+ T cells. The prioritization of the TF can be done on multiple criteria. The strength of enrichment, but mainly from the correlation of the significance along the binning.

  • Another challenge becomes apparent in the motif enrichment analysis: many transcription factors share highly similar motifs, making it difficult to confidently identify the true regulatory factor. This problem is often referred to as co-linearity problem because the matrix of regions vs. transcription binding sites (site-count matrix) has many nearly identical columns. At this stage it is difficult to clearly assess the right candidate. ATACseq data are often accompanied with RNAseq and one can prioritize the TFs based on their expression levels or additional functional knowledge.

  • We should be also careful about the motifs with high GC content. Despite the built-in correction, their significance in the analysis might be influenced by this lower-order nucleotide content.

  • One can also run an “unbiased” k-mer enrichment analysis calcBinnedKmerEnr which searches for enriched k-mers of a length kmerLen in the bins.

# suppressWarnings({kmerEnrichment <- calcBinnedKmerEnr(seqs = peaks.gr ,
#                                       bins = bins,
#                                       kmerLen = 6,
#                                       includeRevComp = TRUE,
#                                       BPPARAM = BiocParallel::SerialParam(RNGseed = 42),
#                                       verbose = FALSE)
# })
#saveRDS(kmerEnrichment, file = "data/Motifs/kmerEnrichment.RDS")
kmerEnrichment <- readRDS("data/Motifs/kmerEnrichment.RDS")
selkm <- apply(assay(kmerEnrichment, "negLog10Padj"), 1, 
               function(x) max(abs(x), 0, na.rm = TRUE)) > 10
sum(selkm)
[1] 52
#> [1] 52
sekmSel <- kmerEnrichment[selkm, ]

Here, we can plot a heatmap that links the identified overrepresented k-mers in the motifs from the previous analysis. It shows that scoring of several transcription factors can be linked with their similarity to the same enriched k-mer. On the other hand, we can also identify k-mers that are not associated with the annotated transcription factors.

pfmSel <- rowData(MotifEnrichmentSel)$motif.pfm
sims <- motifKmerSimilarity(x = pfmSel,
                            kmers = rownames(sekmSel),
                            includeRevComp = TRUE)
dim(sims)
[1] 78 52
maxwidth <- max(sapply(TFBSTools::Matrix(pfmSel), ncol))
seqlogoGrobs <- lapply(pfmSel, seqLogoGrob, xmax = maxwidth)
hmSeqlogo <- rowAnnotation(logo = annoSeqlogo(seqlogoGrobs, which = "row"),
                           annotation_width = unit(1.5, "inch"), 
                           show_annotation_name = FALSE
)


Heatmap(sims, 
        show_row_names = TRUE, row_names_gp = gpar(fontsize = 8),
        show_column_names = TRUE, column_names_gp = gpar(fontsize = 8),
        name = "Similarity", column_title = "Selected TFs and enriched k-mers",
        col = colorRamp2(c(0, 1), c("white", "red")), 
        right_annotation = hmSeqlogo)

Further references

  • You can further explore the options of the monalisa package according to the vignette.

  • To read more about computational tools to identify novel motifs (novel DNA sequences with regulatory functions), you can have a look here. A nice presentation about the Gibbs sampling algorithm can be found here

Session info

sessionInfo()
R version 4.5.0 Patched (2025-05-01 r88190)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Prague
tzcode source: internal

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] circlize_0.4.16                    ComplexHeatmap_2.24.0             
 [3] SummarizedExperiment_1.38.1        Biobase_2.68.0                    
 [5] MatrixGenerics_1.20.0              matrixStats_1.5.0                 
 [7] monaLisa_1.14.0                    JASPAR2022_0.99.8                 
 [9] BiocFileCache_2.16.0               dbplyr_2.5.0                      
[11] BSgenome.Mmusculus.UCSC.mm10_1.4.3 BSgenome_1.76.0                   
[13] rtracklayer_1.68.0                 BiocIO_1.18.0                     
[15] motifmatchr_1.30.0                 TFBSTools_1.46.0                  
[17] MotifDb_1.50.0                     Biostrings_2.76.0                 
[19] XVector_0.48.0                     GenomicRanges_1.60.0              
[21] GenomeInfoDb_1.44.0                IRanges_2.42.0                    
[23] S4Vectors_0.46.0                   BiocGenerics_0.54.0               
[25] generics_0.1.4                     universalmotif_1.26.2             
[27] ggdendro_0.2.0                     ggplot2_3.5.2                     
[29] BiocParallel_1.42.0                BiocPkgTools_1.26.0               
[31] htmlwidgets_1.6.4                  BiocStyle_2.36.0                  
[33] magick_2.8.6                       png_0.1-8                         
[35] knitr_1.50                        

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3          shape_1.4.6.1              
  [3] rstudioapi_0.17.1           jsonlite_2.0.0             
  [5] magrittr_2.0.3              farver_2.1.2               
  [7] rmarkdown_2.29              GlobalOptions_0.1.2        
  [9] vctrs_0.6.5                 memoise_2.0.1              
 [11] Rsamtools_2.24.0            RCurl_1.98-1.17            
 [13] gh_1.4.1                    htmltools_0.5.8.1          
 [15] S4Arrays_1.8.0              curl_6.2.2                 
 [17] SparseArray_1.8.0           lubridate_1.9.4            
 [19] cachem_1.1.0                GenomicAlignments_1.44.0   
 [21] whisker_0.4.1               igraph_2.1.4               
 [23] lifecycle_1.0.4             iterators_1.0.14           
 [25] pkgconfig_2.0.3             Matrix_1.7-3               
 [27] R6_2.6.1                    fastmap_1.2.0              
 [29] clue_0.3-66                 GenomeInfoDbData_1.2.14    
 [31] digest_0.6.37               colorspace_2.1-1           
 [33] TFMPvalue_0.0.9             stabs_0.6-4                
 [35] RSQLite_2.3.11              labeling_0.4.3             
 [37] seqLogo_1.74.0              filelock_1.0.3             
 [39] timechange_0.3.0            httr_1.4.7                 
 [41] abind_1.4-8                 compiler_4.5.0             
 [43] bit64_4.6.0-1               withr_3.0.2                
 [45] doParallel_1.0.17           biocViews_1.76.0           
 [47] DBI_1.2.3                   fauxpas_0.5.2              
 [49] MASS_7.3-65                 DelayedArray_0.34.1        
 [51] rjson_0.2.23                gtools_3.9.5               
 [53] caTools_1.18.3              tools_4.5.0                
 [55] splitstackshape_1.4.8       glue_1.8.0                 
 [57] rorcid_0.7.0                restfulr_0.0.15            
 [59] cluster_2.1.8.1             gtable_0.3.6               
 [61] tzdb_0.5.0                  tidyr_1.3.1                
 [63] data.table_1.17.2           hms_1.1.3                  
 [65] xml2_1.3.8                  foreach_1.5.2              
 [67] pillar_1.10.2               stringr_1.5.1              
 [69] splines_4.5.0               dplyr_1.1.4                
 [71] lattice_0.22-7              survival_3.8-3             
 [73] bit_4.6.0                   tidyselect_1.2.1           
 [75] DirichletMultinomial_1.50.0 RBGL_1.84.0                
 [77] crul_1.5.0                  xfun_0.52                  
 [79] DT_0.33                     stringi_1.8.7              
 [81] UCSC.utils_1.4.0            yaml_2.3.10                
 [83] evaluate_1.0.3              codetools_0.2-20           
 [85] httpcode_0.3.0              tibble_3.2.1               
 [87] BiocManager_1.30.25         graph_1.86.0               
 [89] cli_3.6.5                   Rcpp_1.0.14                
 [91] XML_3.99-0.18               RUnit_0.4.33               
 [93] parallel_4.5.0              readr_2.1.5                
 [95] blob_1.2.4                  bitops_1.0-9               
 [97] glmnet_4.1-8                pwalign_1.4.0              
 [99] scales_1.4.0                purrr_1.0.4                
[101] crayon_1.5.3                GetoptLong_1.0.5           
[103] rlang_1.1.6                 rvest_1.0.4