Improved Prediction of Regulatory Element Using Hybrid Abelian Complexity Features with DNA Sequences

Deciphering the code of cis-regulatory element (CRE) is one of the core issues of current biology. As an important category of CRE, enhancers play crucial roles in gene transcriptional regulations in a distant manner. Further, the disruption of an enhancer can cause abnormal transcription and, thus, trigger human diseases, which means that its accurate identification is currently of broad interest. Here, we introduce an innovative concept, i.e., abelian complexity function (ACF), which is a more complex extension of the classic subword complexity function, for a new coding of DNA sequences. After feature selection by an upper bound estimation and integration with DNA composition features, we developed an enhancer prediction model with hybrid abelian complexity features (HACF). Compared with existing methods, HACF shows consistently superior performance on three sources of enhancer datasets. We tested the generalization ability of HACF by scanning human chromosome 22 to validate previously reported super-enhancers. Meanwhile, we identified novel candidate enhancers which have supports from enhancer-related ENCODE ChIP-seq signals. In summary, HACF improves current enhancer prediction and may be beneficial for further prioritization of functional noncoding variants.


Introduction
The development and cellular differentiation in eukaryotic organisms requires precise regulation of gene expression, which is governed by the orchestration of various cis-regulatory elements (CRE) [1]. Enhancer is one category of CRE that regulates the transcription of target gene in a distal manner [2][3][4]. On the one hand, the ENCODE project identified >500,000 putative enhancers using histone modifications indirectly [5,6] and their total length might reach 12% of the human genome [7]. On the other hand, genome-wide association studies (GWAS) in the past decade identified over 55% of the disease-associated SNPs in the non-coding DNA [8] and some of them were located exactly in the enhancer regions [9]. Therefore, accurate identification of enhancer region across the whole genome is currently of broad interest. It is urgent that we systematically discover active enhancers in specific cells or tissues for a better understanding of the functional roles of non-coding variants.
However, this is a challenging problem because an enhancer acts in its regulating role with both a distal and bidirectional manner. In contrast to promoter, distal enhancers are typically located more than 10 kb away from the gene it regulates [4], and some extreme examples showed that the distance might be farther away, such as 1 Mb [10,11]. Such a long distance makes precise positionings of enhancers difficult. Also, enhancers perform diverse regulatory functions during different tissues and cell lines. For instance, the genomic region of chr11:110,716,521-110,832,306 in diffuse large B-cell lymphoma (DLBCL) acts as an enhancer that specifically regulates POU2AF1 [12], and the genomic region of chr22:27,552,905-27,512,763 acts as an enhancer with activating expression of XBP1 in the multiple myeloma cell line (MM1.S) [13]. The specificities of enhancers among different tissue and cell lines make it harder to discover novel enhancers. Moreover, unlike a promoter that is located in the upstream of its target gene, an enhancer can function bidirectionally, which doubles the searching difficulty [3].
In the past two decades, researchers developed several distinct experimental strategies from different viewpoints for inferring the locations of active enhancers, such as Vista enhancer browser using transgenic mouse assay [14], EP300 (histone acetyltransferase, an ubiquitous activator) enhancer using ChIP-seq of EP300 [15], ENCODE enhancer inferred from chromatin features of ENCODE data [5,6,[16][17][18], experimental-validated enhancer by massively parallel report assay (MPRA) [18][19][20], STARR-seq using self-transcribing transcripts [21] and FANTOM enhancer identified with cap analysis of gene expression (CAGE) of enhancer RNA (eRNA) [22]. Among them, VISTA enhancer browser [14] used a transgenic mouse assay to experimentally validate activities of candidate enhancers (human or mouse noncoding fragments supported with their extreme conservation or putative enhancer markers) in vivo. ChromHMM [5] or Segway [6] adopted an unsupervised clustering approach for annotating genome-wide enhancers based on a variety of chromatin features (such as chromatin accessibility with DNase I Hypersensitive Sites (DHSs), covalent histone modifications including monomethylation of lysine 4 of histone H3 (H3K4me1) and acetylation of lysine 27 of histone H3 (H3K27ac) from ENCODE data. However, these annotations are not satisfactory since an example of a study using MPRA demonstrated that only around 20% of ENCODE annotated enhancers in K562 cell lines drove significant activity of a reporter gene [20]. More recently, enhancers were found to be acting as transcription units and to be producing eRNA in a widespread manner [23]. Based on this, the FANTOM consortium [22] employed CAGE to analyze eRNA peaks around enhancers and found a universal pattern that enhancer exhibits a bidirectional transcription both in the forward and reverse strands. Taking this as a criterion, they identified the exact location of enhancer to be the interval between the two CAGE peaks (FANTOM5 enhancer data). Although a golden criterion for accurate enhancer identification is still obscure, nowadays the strategy employing eRNA has received increasing acceptances [3,24,25] and FANTOM5 enhancer data has become the state-of-the-art data resource for discovering enhancers [1].
Another possible strategy for enhancer identification is by a computational method from the DNA sequence only. The rationale behind this lies in two aspects: (1) enhancer sequence contains combinations of short sequence motifs known as transcriptional factor binding sites (TFBSs) and disrupting their core motifs substantially reduced the activities of the enhancers [26], suggesting that these motif features are necessary for enhancer activity; (2) even when being removed from its endogenous context and moved to upstream of a reporter gene, enhancer sequence can still drive gene expressions [27], suggesting that enhancer DNA is sufficient for enhancer activity [28]. The above two points make the sequence-based computational mean an alternative strategy for enhancer identification.
The computational strategy aims to learn cis-regulatory code (CRC) [29] hidden in trustable enhancer samples and then to build a reliable prediction model both for evaluating existing enhancers and for discovering novel enhancers [26,[30][31][32][33][34]. A pioneer finding is that k-mer features of varying length (3-10 bp) (an optimal k is 6) are predictive sequence features for discriminative prediction of enhancer when analyzing ChIP-seq data of EP300 [30]. Notably, a motif discovery of STARR-seq enhancer data found that known TF motifs as well as dinucleotide repeat motifs (DRMs) are enriched in enhancer sequences [26]. Further, a support vector machine (SVM) model using these differential motif distributions was sufficient (AUC = 0.90) to discriminate enhancers from negative controls. Moreover, DEEP integrates three resources of enhancer data, ENCODE, FANTOM5 and VISTA to develop an ensemble prediction framework by using 351 attributes derived from the sequences including 1-4 mers and other 11 combinatoric features [31]. iEnhancer-2L proposed to use pseudo k-tuple nucleotide composition features for identifying enhancers and their strength [32]. More recently, more advanced tools that can automatically learn CRC, including deep learning [33] and gkm-SVM [34], were applied for predicting enhancers.
Although different tools adopted distinct technologies, a consensus is that one should try to learn combinatoric properties of DNA motifs or k-mer to be predictive sequence features. Notably, DRMs like "GCGCGC" and "GAGAGA" were proven to be necessarily required for active enhancers and a high occurrence number of DRM was found to be a main feature of broadly active enhancers [26]. In mathematics, subword complexity function (SCF) and topological entropy (TE) characterize complexity of DNA sequence by counting the number of different k-mers (termed as "subwords" in mathematics) appearing in the DNA sequence. This enables SCF or TE to be the most appropriate tools for characterizing repeat sequence [35] because repeat sequence like "GCGCGC" uses less subwords (only two distinct subwords with length 2-5). Naturally, we tried to apply TE for discriminating enhancers from other types of DNA elements. As expected, TE is sufficient for discriminating enhancers from promoters. Unexpectedly, when discriminating enhancers from non-enhancers, TE is no longer efficient, suggesting that a new method is required. Here, we introduce a novel sequence complexity concept, i.e., abelian complexity function (ACF), for a new coding of DNA sequences. After feature selection supported with an upper bound theorem and integration with 4-mer motif features, hybrid abelian complexity features (HACF) are proven to be predictive for enhancer prediction.

Topological Entropy of Enhancer is Larger than that of Exon, Intron or Promoter
Different functional genomic regulatory elements have distinct TEs: it was reported that H top (intron) > H top (exon) > H top (promoter)(H top denotes the topological entropy) [36,37], leading us to naturally ask about TE of enhancer. Toward this end, we applied the formula (4) in corollary 1 to compute TEs of four categories of genomic elements: exon, intron, promoter and enhancer. To be similar to a previous study which employed TE for discriminating intron and exon [36], we chose the longest 200 introns and exons, respectively, based on the GRCh37 reference genome. For promoter and enhancer data, we chose the longest 200 promoters from FANTOM5 promoter data [38] and the longest 200 enhancers from the dataset Enhancer4589, respectively. To be consistent with previous findings, the relationship of H top (intron) > H top (exon) > H top (promoter)still holds. And, as expected, H top (enhancer) is significantly larger than that of exon, intron or promoter ( Figure 1A), suggesting that enhancer sequence uses more distinct subwords and is more random than other genomic elements.
It is noteworthy that the difference of TE between promoter and enhancer is large ( Figure 1B), leading us to ask whether we could distinguish enhancer from promoter with TE. We then ran a SVM classifier on the above longest 200 enhancers (positives) and the longest 200 promoters (negatives), and the resulting AUC was 0.854 ( Figure 1C), suggesting that TE is a good predictor for discriminating enhancer from promoter. However, when the negatives were replaced to random intergenic regions without any annotations, the resulting AUC dropped down to 0.698 ( Figure 1C), suggesting that TE is not sufficient for predicting enhancer in the genome-wide scale and that a new method is required. The p-value of T test between AUC of HACF and SACF209 is 8.1 × 10 −3 ; the p-value of T test between AUC of SACF209 and tetra-nucleotide frequency is 2.5 × 10 −3 ; the p-value of T test between AUC of tetra-nucleotide frequency and SACF484 is 4.9 × 10 −5 .
It is noteworthy that the difference of TE between promoter and enhancer is large ( Figure 1B), leading us to ask whether we could distinguish enhancer from promoter with TE. We then ran a SVM classifier on the above longest 200 enhancers (positives) and the longest 200 promoters (negatives), and the resulting AUC was 0.854 ( Figure 1C), suggesting that TE is a good predictor for discriminating enhancer from promoter. However, when the negatives were replaced to random intergenic regions without any annotations, the resulting AUC dropped down to 0.698 ( Figure 1C), suggesting that TE is not sufficient for predicting enhancer in the genome-wide scale and that a new method is required. The p-value of T test between AUC of HACF and SACF209 is 8.1 × 10 −3 ; the p-value of T test between AUC of SACF209 and tetra-nucleotide frequency is 2.5 × 10 −3 ; the p-value of T test between AUC of tetra-nucleotide frequency and SACF484 is 4.9 × 10 −5 .

Hybrid Abelian Complexity Features are Predictive Features for Enhancers
Next, we proposed a new concept of sequence complexity-abelian complexity to construct a prediction model of enhancer based on the dataset of Enhancer4589. We first applied SACF484 (Methods 2.4) to run a SVM classifier and found a predicting accuracy of AUC of 0.877 ( Figure 1D and Table 1). We further selected 209 abelian complexity features (SACF209, Supplementary Table S2) among 484 features by a statistical test between positive and negative samples (Wilcoxon-test, p-value <1.0 × 10 −20 , light blue points in Figure 2C). When using SACF209 for prediction, a significant increase was achieved with AUC of 0.933 ( Figure 1D and Table 1).
DNA composition features are fundamental features for representing DNA sequences and tetra-nucleotide frequency features (TNF, 256 4-mer features by counting frequencies of four continuous bases) were proven as predictive features for identifying enhancers in DEEP [31]. Finally, when integrating SACF209 and TNF to construct a combined feature set with 465 features in total (hybrid abelian complexity features, HACF), the predicting accuracy further achieved a slight promotion with AUC of 0.955 (Table 1). Notably, when comparing SACF209 with TNF, AUC of 0.933 of the former is better than AUC of 0.912 of the later. Moreover, we ran a feature importance program to rank all the 465 features by each feature's contribution to prediction and found that there are 88 features out of SACF209 (dark blue points in Figure 2D, Supplementary Table S3) among the top 100 important features. These findings all support the conclusion that SACF209 are predictive features and HACF is a more optimal combined feature set for enhancer prediction.

Comparisons with Other Existing Methods
To comprehensively compare with existing methods, we chose three state-of-the-art models, EnhancerFinder [39], DEEP [31] and iEnhancer-2L [32] for comparisons. EnhancerFinder constructed a dataset including 717 VISTA human enhancer sequences and 717 random genomic regions, however it did not provide the dataset in their publication [39]. We turned to the latest version of the VISTA database (February 08, 2018) and downloaded 959 human enhancer sequences and we randomly chose 959 intergenic regions with the same length distribution of positive samples to construct the negative sample set. When running our model (HACF) on the newly-constructed VISTA enhancer dataset using 10-fold cross validation, the AUC of 0.962 outperformed the AUC of 0.960 of EnhancerFinder (Table 2). DEEP [31] performed tissue-specific enhancer predictions based on FANTOM5 enhancer data, however it only provided the detailed prediction results on three specific tissues: heart, liver and brain, which were chosen for comparisons. The same enhancer data of three tissues of DEEP was used as positive samples and the negative samples were chosen from random intergenic regions with 10 times the number of positive samples of each tissue (Supplementary Table S4). When performing the optimal testing strategy (40% for training and 60% for testing) of DEEP, the ACC and AUC values of HACF outperform those of DEEP while the GM values are relatively lower (Table 2). This suggests that HACF is generally comparable with DEEP. On the one hand, prediction performance of HACF on some indexes like AUC are superior to those of DEEP. On the other hand, DEEP has greater GM values and is more balanced between positive predictions and negative predictions.
iEnhancer-2L [32] constructed a two-layer predictor where 742 strong enhancers and 742 weak enhancers annotated by ChromHMM [5] were combined as positive samples in the first layer (1484 non-enhancers as negative samples). And the second layer was further designed to distinguish 742 strong enhancers and 742 weak enhancers. After downloading the provided dataset of iEnhancer-2L (http://bioinformatics.hitsz.edu.cn/iEnhancer-2L/data/) (access on 5 May 2018), we ran the HACF model and listed a detailed comparison based on a leave-one-out test in Table 2. An AUC of 0.907 of the first layer and an AUC of 0.789 of the second layer outperform 0.850 and 0.660 of iEnhancer-2L, respectively, suggesting that HACF is superior to iEnhancer-2L in both layers.
In summary, comprehensive comparisons with three state-of-the-art methods might lead us to conclude that our HACF approach is a more accurate and robust model for enhancer prediction.

Scanning Chromosome 22 Identifies Known Disease-Related Super-Enhancers as Well as Novel Candidate Enhancers
Finally, we applied the HACF model to scan the whole DNA sequence of human chromosome 22 for validating its generalization ability. We adopted a scanning strategy by sliding a 1 kb-window with step size 10 bp across human chromosome 22 (34,885,300 bp in total), by which 3,488,530 DNA fragments were obtained and then were put into the HACF model for prediction. To control the false positive, we adopted a similar analysis strategy with ChIP-seq for the quality-control task: (1) each 1 kb DNA fragment that had a predicted value larger than 0.5 was regarded as a candidate read; (2) all candidate reads along with their predicted values were put into MACS2 [41] to perform peak calling. As per the output, MACS2 gave 7045 peaks that were considered as the predicted enhancers (Supplementary Table S5). As a result, 869 out of 1151 (75.5%) FANTOM5 enhancers within chromosome 22 are consistent with our predicted enhancers (with overlap), which further validated our model. For example, we listed two representative regions validated in GM12878 cell lines (Figure 2A,B) where the regions that are covered by gray color represent the overlapping enhancers with FANTOM5 enhancers. Notably, these regions are also supported with neighboring enhancer-related epigenomic markers of H3K4me1, H3K27ac, EP300 and H3K4me3 (Figure 2A,B), suggesting that the HACF model is an accurate model for enhancer identification.
Meanwhile, we tested the generalization ability of the HACF model on two known super-enhancers (SE) located within chromosome 22 which were reported to be related with MYC oncogene in multiple myeloma [13]. Lovén et al. used the ChIP-seq signals of coactivator BRD4 and MED1 to identify the locations of SE and showed that SE are key oncogenic drivers in many tumor cells [13]. They totally identified 308 SE of the multiple myeloma cell line (MM1.S), six of which are located within chromosome 22. We chose two of them (chr22:27516134-27555928, 3.9 kb and chr22:35940694-35972007, 3.1 kb) to perform two case studies. We used the same strategy above to scan these two regions and identified enhancer peaks ( Figure 2C,D), most of which are quite consistent with ChIP-seq peaks of BRD4 and MED1, suggesting that the HACF model has a powerful ability of identifying disease-related super-enhancers.
They totally identified 308 SE of the multiple myeloma cell line (MM1.S), six of which are located within chromosome 22. We chose two of them (chr22:27516134-27555928, 3.9 kb and chr22:35940694-35972007, 3.1 kb) to perform two case studies. We used the same strategy above to scan these two regions and identified enhancer peaks ( Figure 2C, 2D), most of which are quite consistent with ChIPseq peaks of BRD4 and MED1, suggesting that the HACF model has a powerful ability of identifying disease-related super-enhancers. In addition, our scanning approach simultaneously identified some novel candidate enhancers, some of which are covered by a blue color (Figure 2A, 2B)  In addition, our scanning approach simultaneously identified some novel candidate enhancers, some of which are covered by a blue color (Figure 2A,B), representing those candidate enhancers that have no overlaps with FANTOM5 enhancers yet are supported by at least one of ChIP-seq peaks of H3K4me1, H3K27ac, EP300 and H3K4me3 markers (downloaded from https://www.ncbi. nlm.nih.gov/geo/roadmap/epigenomics/) access on June 2, 2018. To comprehensively evaluate validities of novel candidate enhancers, we made comparisons between 7045 candidate enhancer regions (±2Kbcentered at the predicted enhancers) and the same number of random intergenic regions (background) on their average signal intensities of H3K4me1 and H3K27ac in two cell lines or tissues (iPS-20b and kidney). It was found that both H3K4me1 (red curves) and H3K27ac (purple curves) signals on candidate enhancer regions generated obvious peaks around the locations of predicted enhancers which are significantly larger than the signals on the random regions (blue and green curves) ( Figure 2E,F), supporting that the enhancers predicted by the HACF model have the potential to be bona fide enhancers.

Dataset Preparation
In this study, we constructed a large-scale enhancer dataset based on FANTOM5 enhancer data [22], which can be downloaded at: http://fantom.gsc.riken.jp/5/datafiles/latest/extra/ Enhancers/ (access on January 11, 2018). The original data was presented as a matrix E 65423×1829 with 65,423 rows and 1829 columns, where each row represents a specific enhancer, each column represents a specific tissue or cell line in human and element in ith row and jth column e ij epresents the relative log expression (RLE)-normalized expression value (presented as tags per million mapped reads, TPM) of CAGE of ith enhancer in jth tissue or cell line.
To construct a high-quality enhancer dataset, the following four major steps of data preprocessing were involved. First, Andersson et al. [22] found that bidirectional capped RNA is a signature feature of active enhancers, which implies that larger TPM ensures greater active intensity of enhancer. Using this principle, we adopted a cut-off criterion of 'TPM rowmin ≥ α' (TPM rowmin represents the minimal nonzero value of TPM across all tissues and cell lines of a given enhancer, α is a given positive parameter) to select most active enhancers. When α was determined as 0.08 (we showed that the prediction performance was not sensitively dependent on the selection of parameter α in the supplementary information and α = 0.08 is an appropriate choice), only 5386 enhancers passed this criterion. Next, although some existing works used 2 kb as the length of enhancer sequence [42,43], we fixed all the input sequences to a length of 1 kb because FANTOM enhancer sequence has a shorter length (282 bp on average) compared with ENCODE enhancer sequence. Since a previous study reported that too short a sequence would lead to failed feature extraction [44], we therefore excluded those enhancers with lengths shorter than 100 bp and obtained 4667 enhancers. Meanwhile, for sequences that were larger or shorter than 1 kb, they were truncated at the 5' and 3' ends or were expanded to 1 kb based on the human reference genome (hg19), respectively, centering at the middle points. In the third step, we employed a redundancy reduction procedure CD-HIT [45] (http://www.bioinformatics.org/cd-hit/) (access on February 12, 2018). with a cutoff threshold of 0.8 to remove the redundancy between enhancer sequences, meaning 4653 enhancers remained. In the last step, for a fair evaluation of scanning chromosome 22, we discarded those enhancers located within chromosome 22 and obtained a total of 4589 enhancers as the final positive sample dataset for training (Enhancer4589, Supplementary Table S1).
To help the discrimination of enhancers, a negative sample dataset is required. Recent studies reached a consensus of using ten times the number [31][32][33] or seven times the number [43] of positives from promoter regions or randomly chosen intergenic regions to construct the negatives. The rationale behind this lies in that enhancers account for around 12% of the human genome [7] and the training model should adopt a similar ratio (1:10 or 1:7) of positives and negatives, otherwise the trained model will predict a large number of false-positives when scanning genome, resulting in an invalid prediction model. Therefore, we then randomly selected 10 times (45,890) the amount of 1 kb DNA sequences as negative samples (Enhancer4589, Supplementary Table S1) from non-enhancer intergenic regions, which are defined 1 kb regions based on the GRCh37 reference genome by excluding exon, intron and known enhancers.

Subword Complexity Function and Topological Entropy
Subword complexity function (SCF) studies combinatorial features of sequence by counting the number of different subwords appearing in the sequence and topological entropy (TE) is a mathematical index for quantitatively measuring the asymptotic exponential increasing speed of SCF [35]. They were originally proposed to study problems in the areas of natural language processing, pattern matching and coding theory [35], and were transferred to study DNA sequences in the past two decades [36,[46][47][48][49]. More precisely, let Ω = {A,C,G,T} be an alphabet with four letters, the SCF of a given DNA sequence ω over Ω with a finite length |ω| is defined as: Definition 1. (Subword Complexity Function, SCF [35,36,49]) For a DNA sequence ω over Ω, the subword complexity function p ω : N→N is given by where Card denotes the number of elements of a set and µ < ω represents that µ is a subword (or a factor) of ω.
Mathematically, the nature of 1 ≤ p ω (n) ≤ 4 n holds for an arbitrary DNA sequence, and p ω (n) = 1 represents a complete repeat sequence, whereas p ω (n) = 4 n represents the sequence containing all subwords with full complexity. In early works, Kirillova [47] and Colosimo and De Luca [46] studied DNA sequences by using the entire SCF, which suffers a high-dimensionality problem when the studied sequence is very long [36]. To grab the most significant complexity features, Koslicki [36] noted that the SCF is roughly strictly increasing, nondecreasing and linear decreasing with a slope of -1 on three disjoint intervals, the third of which carries no information. For the first two parts, Koslicki tried to use a unique index, i.e., TE, to quantitatively measure the complexity of a DNA sequence. The classic definition of TE of infinite ω over Ω is defined as: Definition 2. (Topological Entropy, TE [35]) For an infinite sequence ω over Ω, the topological entropy H to is defined by The direct computing of TE is difficult because the classic definition of TE is based on infinite sequences equipped with the limit operation, which requires us to give an asymptotic algorithm to estimate the value of TE. A rational estimation of TE of ω was computed at the maximal value of n such that 4 n + n -1 ≤ |ω| based on the following theorem and corollary [36]: Theorem 1. (Existence of sequence with full complexity [50]) For all n ≥ 0 there exists a word ω over Ω with |ω| = 4 n + n -1 such that p ω (i) = 4 i , i = 1, . . . , n. (3)

Corollary 1. (Koslicki's algorithm of TE [36])
Let ω be a finite sequence of length |ω|, and let n 0 be the unique integer such that4 n 0 + n 0 -1 < |ω| ≤ 4 n 0 + 1 + (n 0 + 1) − 1, then Using the above algorithm, Koslicki gave a unique value of TE at n 0 instead of the entire SCF for measuring the complexity of a DNA sequence. By applying this to the intron and exon elements of the human genome, he found that the complexities of introns are larger than those of exons, suggesting that intron is more random and is under less selective pressure. After that, Jin et al. [37] found that neighbouring points around n 0 should be integrated together to give a more accurate estimation of TE. By applying it to three categories of genomic elements, they found that intron > exon > promoter of their complexities. Recently, Wu et al. [49] found that the neighbouring complexity features around n 0 are good predictors for predicting DNA methylation. These related works altogether demonstrate that specific positions of SCF where it may reach full complexities are intrinsic features as well as good predictors for discriminating distinct functional genomic elements.

Abelian Complexity Function and Its computing Resource Consumption
Recently, Richomme et al. [40] comprehensively studied the general theory of abelian complexity function (ACF) and emphasized the importance of characterizing sequences by ACF and SCF simultaneously. Here, we introduce the concept of ACF to study DNA sequences: Definition 3. (Parikh vector, PV [40]) For a finite sequence ω over Ω, the Parikh vector of ω is defined as the functionψ: Ω |ω| →N 4 , given by ψ(ω) = (|ω| A , |ω| C , |ω| G , |ω| T ), (5) where |ω| A denotes the number of occurrences of the letter A Ω in the finite sequence ω.

Definition 4.
(Abelian Complexity Function, ACF [40]) For a finite sequence ω over Ω, the abelian complexity function ρ ab : N→N is defined as: where ψ(ω) is the Parikh vector of subword u of the finite sequence ω.
Note that for the finite sequence ω, if n > |ω|, then ρ ab ω (n) = 0. We always assume that n < |ω| when computing ρ ab ω (n) = 0. For better understanding of ACF, we showed the detailed computing process of it of an example of a given DNA fragment in Figure 3, from which we could find that ACF only counts the number of different PVs, differing from SCF which counts the number of all different subwords. This means 1 ≤ ρ ab ω (n) ≤ ρ ω (n) for all n ≥ 1. Another marked difference between ACF and SCF was that ACF usually showed a fluctuation while SCF strictly increased in the first part. For example, for the given DNA fragment ω=AAGCAGTCGG used in Figure 3, its ACF is, from which we could find that ρ ab ω (2) = 7 > 5 = ρ ab ω (3). We drew a graph of ACF of an enhancer sequence with 1 kb in Figure 2A, which helps us to understand the fluctuation property of ACF.

Abelian Complexity Function and Its computing Resource Consumption
Recently, Richomme et al. [40] comprehensively studied the general theory of abelian complexity function (ACF) and emphasized the importance of characterizing sequences by ACF and SCF simultaneously. Here, we introduce the concept of ACF to study DNA sequences: Definition 3. (Parikh vector, PV [40]) For a finite sequence ω over Ω, the Parikh vector of ω is defined as the functionψ: Ω |ω| →N 4 , given by ( ) = (| | , | | , | | , | | ), where |ω| A denotes the number of occurrences of the letter A ϵ Ω in the finite sequenceω. Definition 4. (Abelian Complexity Function, ACF [40]) For a finite sequence ω over Ω,the abelian complexity function ρ ab : N→N is defined as: where ψ(ω) is the Parikh vector of subword u of the finite sequence ω.
Note that for the finite sequenceω, if n > |ω|, then ( ) = 0. We always assume that n < |ω| when computing ( ) = 0. For better understanding of ACF, we showed the detailed computing process of it of an example of a given DNA fragment in Figure 3, from which we could find that ACF only counts the number of different PVs, differing from SCF which counts the number of all different subwords. This means 1 ≤ ( ) ≤ ( ) for all n ≥ 1. Another marked difference between ACF and SCF was that ACF usually showed a fluctuation while SCF strictly increased in the first part. For example, for the given DNA fragment ω=AAGCAGTCGG used in Figure 3, its ACF is, from which we could find that (2) = 7 > 5 = (3). We drew a graph of ACF of an enhancer sequence with 1 kb in Figure 2A, which helps us to understand the fluctuation property of ACF. We next investigated the resource consumption of computation of ACF by running our script on a test computer with Ubuntu 18.04 on Intel Xeon 2.10 GHz processors and 256 GB RAM. When

Upper Bound Theorem and Feature Selection
For a long DNA sequence, we now face the same problem of high-dimensionality as with the subword complexity situation. The main experience learnt from previous studies [36,37,49] is that we should select those positions where ACF might reach its maximums (full complexity). Therefore, we are required to give the upper bound estimation of ACF by the following theorem: Theorem 2. (Upper bound of ACF) For every finite word ω and every positive integer satisfying 1 ≤ n ≤ |ω|, the absolute upper bound of ACF is given by where C means the binomial coefficient. Proof. We gave the proof of theorem 2 in the supplementary information.
Under the guidance of the main idea of computing TE, we employed the upper bound of ACF in theorem 2 to estimate the minimal length required for reaching full abelian complexity. When determining the most appropriate position for computing TE, one chose the largest integer n 0 such that 4 n 0 + n -1 ≤ |ω| because the minimal length required for reaching full subword complexity 4 n 0 is 4 n 0 + n -1 (theorem 1). Here, we determine such a length for full abelian complexity by the following theorem: Proof. Suppose there exists a word ω with full abelian complexity, which means ρ ab ω (i) = C 3 i+3 , i = 1, . . . , m (such word must exist because full subword complexity guarantees full abelian complexity). Then it must have ρ ab ω (m) = C 3 m+3 . Therefore, there must exist a subword µ* of ω such that it has at least C 3 m+3 different factors, which requires that µ* at least has the length of C 3 m+3 + m − 1 to place all the different factors. This immediately leads Based on our experience of determining intrinsic features, we should focus on the middle region with high abelian complexity by discarding both ends of the ACF. For the head end, based on the above theorem 3, if |ω| = 1000, we determined the integer m 0 = 16 such that 1000 = |ω| ≥ C 3 m 0 +3 + m 0 − 1. The abelian complexity features that are smaller than 17 were considered as irrelevant features that were discarded in the further studies. For the back end, we discarded the back-end region which had abelian complexities smaller than ρ ab ω (17) (under the dashed line in Figure 4B). The selected abelian complexity features are ρ ab ω (i), 17 ≤ i ≤ 500} (484 features, denoted by SACF484) in the case of |ω|=1000 ( Figure 4B).

Evaluation of the Prediction Performance
Generally, the performance of a prediction method is measured by sensitivity (Sens or recall), specificity (Spec), precision, accuracy (ACC), Matthew's correlation coefficient (MCC) value and GM value, which are calculated as: where TP represents the number of true positive (enhancers predicted as enhancers), TN represents the number of true negative (non-enhancers predicted as non-enhancers), FP represents the number of false positive (non-enhancers predicted as enhancers) and FN represents the number of false negative (enhancers predicted as non-enhancers). Additionally, to test the balance between true positive and false positive rates, another evaluating index is the Area Under the ROC Curve (AUC).

Discussion
An enhancer is one class of the most important CRE and plays significant roles in the regulation of gene expressions. Identification of the locations of enhancers is currently of broad interest because it is a prerequisite for a better understanding of complex mechanisms of transcriptional regulations and for evaluating non-coding DNA variants. Subword complexity function and topological entropy were proven useful tools for discriminating distinct classes of CREs, such as intron, exon and promoter. However, it is no longer valid for the identification of enhancers, which is why we developed a new mathematical tool termed "abelian complexity function" to try achieve this task. Under an estimation of the upper bound, a feature selection approach finally determines 209 abelian complexity features that outperformed three existing methods in fair comparisons. Moreover, in addition, to be consistent with existing enhancers of FANTOM enhancers and known disease-related

Evaluation of the Prediction Performance
Generally, the performance of a prediction method is measured by sensitivity (Sens or recall), specificity (Spec), precision, accuracy (ACC), Matthew's correlation coefficient (MCC) value and GM value, which are calculated as: where TP represents the number of true positive (enhancers predicted as enhancers), TN represents the number of true negative (non-enhancers predicted as non-enhancers), FP represents the number of false positive (non-enhancers predicted as enhancers) and FN represents the number of false negative (enhancers predicted as non-enhancers). Additionally, to test the balance between true positive and false positive rates, another evaluating index is the Area Under the ROC Curve (AUC).

Discussion
An enhancer is one class of the most important CRE and plays significant roles in the regulation of gene expressions. Identification of the locations of enhancers is currently of broad interest because it is a prerequisite for a better understanding of complex mechanisms of transcriptional regulations and for evaluating non-coding DNA variants. Subword complexity function and topological entropy were proven useful tools for discriminating distinct classes of CREs, such as intron, exon and promoter. However, it is no longer valid for the identification of enhancers, which is why we developed a new mathematical tool termed "abelian complexity function" to try achieve this task. Under an estimation of the upper bound, a feature selection approach finally determines 209 abelian complexity features that outperformed three existing methods in fair comparisons. Moreover, in addition, to be consistent with existing enhancers of FANTOM enhancers and known disease-related super-enhancers, scanning human chromosome 22 with our model finds some novel candidate enhancers with supports of histone modification markers of H3K4me1 and H3K27ac.
On the basis of current efforts, we conclude that our method is an accurate and robust model for enhancer identification and we summarize that our work brings some new contributions into the area of enhancer biology: (1) New contribution to prediction methodology: for the first time, we proposed an innovative method, i.e., selected abelian complexity features together with DNA composition for accurate enhancer identification. In comparison with three state-of-the-art enhancer prediction tools, our method shows greater accuracy and more robust performances on imbalanced datasets. (2) New contribution to the prediction of novel candidate enhancers: when we applied our trained model to scan human chromosome 22, 7045 genomic regions were identified as enhancers. Some of them are consistent with known FANTOM5 enhancer, and the majority of them are considered as candidate enhancers by the evidences of surrounding histone modification markers including H3K4me1 and H3K27ac. This demonstrates that the predicted enhancers by our scanning approach have the potential to be true enhancers.
Finally, we list two drawbacks or limitations of our model which should be addressed and overcome in future works. The first limitation is the problem of cell-specificity of the enhancer. Our enhancer dataset (Enhancer4589) was constructed by the criteria of "TPM min ≥ 0.08", implying that the enhancers we chose for training are actually ubiquitously active across all tissues and cell lines (termed as "housekeeping enhancers" [51]). A rational inference is that our predicted enhancers are all "housekeeping enhancers", suggesting that our model has the limitation of predicting cell-specific enhancers. A possible solution is to collect cell-specific enhancers of some particular tissues or cell lines and then construct a prediction model for learning their specific sequence features. Another limitation of current work is the authenticities of the predicted enhancers, which are usually measured by the abilities of driving target gene expression. At this point, further biological experiments such as MPRA are required for a comprehensive evaluation. Author Contributions: X.H. and J.C. conceived this project and designed the methodology, C.W. implemented the algorithm using R and C.W. and Y.L. performed the statistical analysis. X.H. drafted the manuscript. All authors have read and approved the final manuscript.