Transcription factors (TFs) are the managers of the cellular factory, controlling everything from cellular identity to response to external stimuli [1
]. Because of their central importance in interpreting the genome, millions of people are affected by mutations residing within TFs [2
], causing a wide variety of symptoms (see Table 1
). For example, over half of all cancers have a mutation in the TF TP53 [3
Moreover, most disease-causing mutations are found in regulatory regions [14
], e.g., enhancers, which are dense with TF binding sites [16
]. A startling 60–76.5% of disease-associated single nucleotide polymorphisms (SNPs) are in enhancers [17
], which are short regulatory regions densely bound by TFs [21
]. In fact, the well-known program HaploReg now lists all TFs that bind over each SNP, a useful piece of information for understanding the impact of a SNP [22
The relationship between many diseases and transcription factors has led to tremendous interest in global investigations of transcription factor activity. Deciphering transcription factor activity requires understanding the two major functions of a transcription factor: binding to DNA and modification of transcription. Transcription factors bind to specific DNA sequences, known as TF recognition motifs. A number of techniques have been utilized to identify and characterize such recognition motifs [23
]. However, because most genomic instances of a motif are not actually bound, the presence of a recognition motif alone is insufficient to establish binding. Protein–DNA interactions can be measured genome-wide using chromatin immunoprecipitation followed by sequencing (ChIP-seq) [23
]. Unfortunately, numerous lines of evidence indicate that not all binding events influence transcription [26
]. Conceptually, this is akin to saying that someone merely standing in a lab (TF binding) may not be conducting an experiment (altering transcription). Therefore, distinct assays are necessary to identify the locations where a TF is bound to DNA and determine whether that DNA binding leads to altered transcription nearby. A number of high-throughput assays are available to interrogate these two key functions.
Extensive attention has focused on determining where in the genome transcription factors bind [23
]. The ENCODE project included approximately 2000 TF ChIP-seq experiments, including 180 TFs in K562 (myeloid leukemia) cells alone [29
]. Large regulation projects such as ENCODE and Roadmap Epigenomics have been invaluable to our understanding of TF binding. However, there are an estimated 1600 TFs in the human genome, and many do not have a reliable antibody for ChIP-seq [23
]. Even when antibodies are available, individual transcription factors can have distinct profiles of binding locations across cell types and conditions. Consequently, the cost of individually profiling every TF in each cell type is enormous, let alone across different conditions [31
]. Finally, if the effect of a particular perturbation is unknown, profiling assorted TFs by ChIP is prohibitively expensive.
An alternative approach to detecting individual protein–DNA binding locations is to infer a large collection of binding events via DNA footprinting [32
]. Dense mapping of DNase I cleavage sites identifies small regions protected from cleavage by the presence of a bound transcription factor [32
]. While early footprinting studies identified a large repertoire of previously uncharacterized motifs protected from cleavage, suggesting many novel transcription factors [34
], subsequent work indicates these regions likely reflect sequence-based cleavage bias of the DNase I enzyme [35
]. It is also now clear that most TFs (80%) do not show a measurable footprint [36
], thereby limiting the effectiveness of this approach.
Despite these limitations, DNA footprinting assays uncovered a distinct function for transcription factors: altering DNA accessibility. When chromatin accessibility data are considered in the context of known TF sequence motifs [37
], one can reasonably infer transcription factor binding profiles [41
]. When accessibility profiles are then compared to ChIP in the context of perturbations, transcription factors can be classified as “pioneer” or “settler” depending on whether they open chromatin or require accessible, exposed DNA to bind [42
]. Whether alterations of local chromatin accessibility reflect a byproduct of the TF’s DNA binding or its altering of transcription remains unclear.
Altering transcription is the second major function of transcription factors [23
]. Because TFs alter transcription, some of the earliest studies of TFs as regulators were based on expression data. For nearly twenty years, large compendiums of expression data have been utilized to infer gene regulatory networks [43
]. Typically these approaches search for modules, collections of co-regulated genes across distinct conditions. The identification of nearby TF recognition motifs [45
] or co-regulated transcription factors [43
] links particular TFs to the module of genes they regulate. For instance, ISMARA (Integrated System for Motif Activity Response Analysis) [47
] models gene expression in terms of TF sequence motifs within proximal promoters. Gene regulatory network methods have been instrumental for understanding large-scale regulatory networks but are inherently limited by their dependence on steady-state expression data. Steady-state expression assays (microarray or RNA-seq) reflect not only transcription but also RNA processing, maturation, and stability. Hence, they are an indirect readout of the effect of perturbations to transcription factors. Additionally, they are generally incapable of reliably detecting small changes at short time points without an impractical number of replicates [48
Nascent transcription assays (GRO-seq and PRO-seq) directly profile RNA associated with engaged cellular polymerases [49
]. Consequently, nascent assays are a direct readout of changes to transcription induced by perturbations [21
]. Interestingly, an additional feature of nascent transcription data is the identification of short unstable transcripts immediately proximal to sites of transcription factor binding [52
]. Importantly, these transcripts, now known as eRNAs, can be employed as markers of TF activity [58
]. The change in patterns of eRNA usage, genome-wide relative to TF recognition motifs, allows one to determine which transcription factors are altered by a perturbation with no a priori information. Unfortunately, nascent transcription protocols [49
] are onerous, are time-consuming, and require large numbers of cells as input. Consequently, these experimental assays are predominantly used on cultured cell lines and have not yet been widely adopted. Therefore, we sought a simpler, easy-to-use approach to inferring differential transcription factor activity.
The Assay for Transposase-Accessible Chromatin followed by sequencing (ATAC-seq), a method for identifying regions of open chromatin, is particularly attractive because it is quick, easy, inexpensive, and deployable in small cell count samples. Additionally, recent work has shown that changes in chromatin accessibility can inform on TF activity. Specifically, BagFoot [36
] combined footprinting with differential accessibility to identify TFs associated with altered chromatin accessibility profiles in the presence of a perturbation. They predominantly focused on DNase I hypersensitivity data, but also examined a small number of ATAC-seq datasets. Here, we seek to confirm and extend their results in two ways. First, we investigated whether an alternative approach, namely the motif displacement statistic [58
], developed initially for nascent transcription analysis, can infer differential TF activity from ATAC-seq datasets. Second, we sought to construct an easy-to-use pipeline specific to the analysis of differential ATAC-seq data.
We introduce here a tool, the Differential ATAC-seq toolkit (DAStk), developed with simplicity and ease of implementation in mind, focused around inferring changes in TF activity from ATAC-seq data. Using nascent transcription data, we had previously developed the motif displacement score (MD-score), a metric that assesses TF-associated transcriptional activity. Specifically, the MD-score reflects the enrichment of a TF sequence motif within a small radius (150 bp) of enhancer RNA (eRNA) origins relative to a larger local window (1500 bp) [58
]. While ATAC-seq does not directly provide information on eRNAs, most sites of eRNA activity reside within open chromatin [59
]. Therefore, we utilize the midpoint of detected ATAC-seq peaks (rather than the eRNA origin) as a frame of reference for calculating MD-scores. Then, given two distinct biological conditions, we compare the ratio of MD-scores across the conditions and identify statistically significant changes by a two-proportion Z-test. Using public ATAC-seq data from a variety of human and mouse cell lines (IMR90, H524, NJH29, and BRG1fl/fl
) and perturbations (Nutlin-3a, doxycycline, and tamoxifen), we assessed changes in accessibility over all putative TF sequence recognition motifs (for all motifs within the HOCOMOCO database [38
Given our familiarity with TP53 activation [55
], we first examined this approach on ATAC-seq data gathered before and 6 h after Nutlin-3a exposure on IMR90 cells [61
]. Nutlin-3a is an exquisitely specific activator of TP53. As expected, we found that TP53 displayed the most significant change (p
) in MD-score (Figure 1
a, in red) of all motifs within the HOCOMOCO database [38
]. Relaxing the p
-value cutoff (p
), we subsequently identified altered activity in TP63 and TP73 (Figure 1
a, in maroon), likely reflecting the fact that these two proteins have nearly identical sequence recognition motifs to TP53.
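As a rough illustration of the calculation behind comparisons like this one, the following Python sketch computes an MD-score from ATAC-seq peak midpoints and motif positions, then compares two conditions with a two-proportion Z-test. It is a minimal sketch under stated assumptions, not the DAStk implementation: the function names are hypothetical, coordinates are assumed to lie on a single chromosome, and a motif site that falls near two peaks is counted once per peak. Only the 150 bp and 1500 bp radii and the choice of test come from the text above.

```python
import bisect
import math


def count_within(sorted_sites, center, radius):
    """Count motif positions in [center - radius, center + radius]."""
    lo = bisect.bisect_left(sorted_sites, center - radius)
    hi = bisect.bisect_right(sorted_sites, center + radius)
    return hi - lo


def md_counts(peak_midpoints, motif_sites, small_radius=150, large_radius=1500):
    """Return (n_small, n_large): motif hits near peak midpoints.

    Assumption: all coordinates are on one chromosome; a real pipeline
    would repeat this per chromosome and sum the counts.
    """
    sites = sorted(motif_sites)
    n_small = sum(count_within(sites, m, small_radius) for m in peak_midpoints)
    n_large = sum(count_within(sites, m, large_radius) for m in peak_midpoints)
    return n_small, n_large


def md_score(n_small, n_large):
    """Fraction of motif hits in the large window that sit in the small one."""
    return n_small / n_large if n_large else float("nan")


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided two-proportion Z-test with a pooled standard error."""
    if n1 == 0 or n2 == 0:
        raise ValueError("both conditions need at least one motif hit")
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0  # identical degenerate proportions: no evidence of change
    z = (x1 / n1 - x2 / n2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Toy example with made-up coordinates (not real data).
control = md_counts([1000, 5000], [900, 1100, 2000, 5100, 6400, 9000])
print(control, md_score(*control))
```

In this framing, each condition contributes one pair of counts per motif, so the comparison for a given TF reduces to calling `two_proportion_z_test` on the control and treatment counts.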
Interestingly, Nutlin-3a has also been analyzed using nascent transcription data albeit in a different cell line (HCT116) at a shorter time point (1 h) [55
]. The MD-score analysis of the nascent data [58
] obtained very similar results (Figure 1
b). Unfortunately, a direct comparison of individual genomic loci between the two data sets is not feasible because different cell lines and drug exposure times were used. However, two observations concerning the overall MD-score trends are nonetheless noteworthy. First, the co-localization of the TP53 motif with ATAC-seq peak midpoints is far less striking than the co-localization of motifs with the eRNA origins (observed in the heatmap histograms). This observation, combined with the relatively lower magnitude of
MD-scores (y-axis), suggests that the eRNA origin (obtained from nascent transcription) is a far more precise reference point for localizing and detecting changes in TF activity. Second, despite this lack of precision, ATAC-seq correctly identifies TP53 as the motif with the most dramatically altered MD-score, whereas the best scoring motif with nascent transcription is TP63. Why this discrepancy exists is unclear, but given the similarity of these two TF motifs it may simply be coincidental.
We next analyzed differential ATAC-seq data gathered by Denny et al. to examine whether Nfib promotes metastasis by increasing chromatin accessibility. To address this question, they examined two human small cell lung carcinoma (SCLC) cell lines (H524 and NJH29), profiling by ATAC-seq before and four hours after doxycycline treatment. Using the MD-score approach, we detected changes in TF activity for multiple members of the NFI family (Figure 2
a,b). An increase in NFIA (two different motifs) and NFIC was detected in both cell types (p
for H524s; p
for NJH29s). As further confirmation of the NFI signal, we tested one of their mouse samples (KP22 cells) and found an increase of NFIA (p
), consistent with the human results. We next investigated whether our results were sensitive to the particular peaks utilized. To this end, we sub-sampled peaks from the NJH29 data and re-ran our analysis. Both NFIA and NFIC are detectable as significant (p
) even when using only half of the ATAC-seq peaks, suggesting the signal is reasonably robust.
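A sub-sampling check of this kind can be sketched as follows. This is a hypothetical illustration rather than the procedure actually used: the function names, the number of trials, and the fixed random seed are assumptions, and coordinates are again assumed to lie on a single chromosome. It repeatedly draws a random fraction of the peak midpoints and recomputes the MD-score, so the spread of the resulting scores indicates how sensitive the statistic is to the particular peaks retained.

```python
import bisect
import random


def md_score_for_peaks(peak_midpoints, sorted_sites, small_radius=150, large_radius=1500):
    """MD-score for one set of peak midpoints against sorted motif positions."""
    def hits(center, radius):
        lo = bisect.bisect_left(sorted_sites, center - radius)
        hi = bisect.bisect_right(sorted_sites, center + radius)
        return hi - lo

    n_small = sum(hits(m, small_radius) for m in peak_midpoints)
    n_large = sum(hits(m, large_radius) for m in peak_midpoints)
    return n_small / n_large if n_large else float("nan")


def subsampled_md_scores(peak_midpoints, motif_sites, fraction=0.5, n_trials=100, seed=0):
    """Recompute the MD-score on random subsets of the peaks.

    `fraction`, `n_trials`, and `seed` are illustrative defaults only.
    """
    rng = random.Random(seed)           # fixed seed keeps the sketch reproducible
    peaks = list(peak_midpoints)
    sites = sorted(motif_sites)
    k = max(1, int(len(peaks) * fraction))
    return [md_score_for_peaks(rng.sample(peaks, k), sites) for _ in range(n_trials)]


# Toy example with made-up coordinates (not real data).
peaks = [10000 * i for i in range(1, 21)]
motifs = [p + 50 for p in peaks] + [p + 1000 for p in peaks]
scores = subsampled_md_scores(peaks, motifs, fraction=0.5, n_trials=10)
print(min(scores), max(scores))
```

A fuller version would run the two-proportion Z-test on each sub-sample for both conditions; the sketch stops at the MD-score to stay short.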
We then sought to determine how the
MD-score approach compared to the BagFoot [36
] approach at identifying differential TF activity. BagFoot also identified NFIA and NFIC within the SCLC differential ATAC-seq data [36
]. However, they additionally reported HNF6 as potentially altered in the SCLC data. Importantly, Baek et al. noted that the HNF6 result did not hold when their approach utilized bias-corrected data (based on naked DNA digested with Tn5). The fact that our MD-score approach does not identify HNF6 as altered further supports the idea that this result reflects a data artifact rather than a true biological phenomenon. Interestingly, the MD-score approach and BagFoot obtained nearly identical results on a second differential ATAC-seq dataset. In this case, King and Klose [62
] showed BRG1, essential for pluripotency-related chromatin modifications, is required to make chromatin accessible at OCT4 target sites. To this end, they treated BRG1fl/fl
mouse embryonic stem cells (ESCs) with tamoxifen for 72 h to ablate BRG1 expression. When compared to the unperturbed mouse ESC control, we observed lowered MD-scores for SOX2, POU5F1 (Oct4), and NANOG in the BRG1-depleted cells (p
; Figure 3
a), directly confirming the BagFoot findings.
Finally, we examined a differential ATAC-seq dataset obtained for decidualized and undecidualized human endometrium cells [63
]. Spontaneous decidualization occurs in response to progesterone signaling (i.e., by an implanted embryo at the early stages of pregnancy). Using our MD-score approach, we found the CEBP family of transcription factors had increased activity in decidualized cells, consistent with the authors’ conclusion (Figure 3
b). Additionally, we found significantly lowered MD-scores for the KLF16 motif (a TF known to be involved in regulatory uterine cell biology [64
]) and TFDP1 (a known target of the estrogen receptor
present in all endometrial cell types [65
] of lower activity during the secretory phase, in concert with the decidualization process). In all cases, the magnitude of MD-score alterations was relatively small, and yet the transcription factors uncovered can be linked to the underlying decidualization process.
We sought to identify changes in TF activity across differential ATAC-seq datasets, as this protocol is inexpensive, is simple, and requires relatively small cell counts. Here, we demonstrate two important results. First, using a simple statistic (the motif displacement score) as a co-localization measure of ATAC-seq peak midpoints to TF sequence motif sites across the genome, we correctly detect changes in TF activity. Second, our approach independently confirms the results obtained by BagFoot [36
], as the two analysis techniques are distinct in their approach to quantifying differences in chromatin accessibility across conditions. Arguably, regardless of which analysis technique is preferred, differential ATAC-seq is a relatively simple and inexpensive way to assess changes in TF activity induced by perturbations.
We believe there are two distinct advantages to the MD-score approach to assessing TF activity. First, the MD-score is calculated relative to a local background window. Consequently, it cleanly accounts for the localized sequence bias observed at promoters and enhancers [58
], which likely reduces false positives. Second, the statistic is relatively simple to implement and naturally accommodates multiprocessing for faster computations. DAStk can easily be incorporated at the tail-end of a traditional processing pipeline for ATAC-seq data, in that MD-scores are calculated directly from called peaks and genomic sequence.
Our MD-score statistic was originally developed for analysis of nascent transcription data [58
] and focused on enhancer RNA co-localization with motifs. Given that most eRNAs originate from areas of open chromatin [21
] and many transcription factors can alter chromatin accessibility [42
], it is perhaps unsurprising that differential chromatin accessibility can be used to infer changes in TF activity. However, it remains unclear whether the observed alterations of chromatin reflect a distinct functional activity of transcription factors or are simply a side effect of DNA binding and/or altering transcription. While a careful examination of the two Nutlin-3a datasets (Figure 1
) yields the identification of several genomic regions that are uniquely altered in only one of the two datasets (ATAC-seq or nascent), the lack of matched data makes interpretation of these differences difficult. Do they reflect differences of cell type or distinct functional activities of TP53? A careful comparison of chromatin accessibility and nascent transcription data in the context of a perturbation will be necessary to fully address this question.