Enhancer-LSTMAtt: A Bi-LSTM and Attention-Based Deep Learning Method for Enhancer Recognition

Enhancers are short DNA segments that play a key role in biological processes, such as accelerating the transcription of target genes. Because an enhancer may reside anywhere in a genome sequence, it is difficult to identify enhancers precisely. We present a bi-directional long short-term memory (Bi-LSTM) and attention-based deep learning method (Enhancer-LSTMAtt) for enhancer recognition. Enhancer-LSTMAtt is an end-to-end deep learning model that consists mainly of a deep residual neural network, Bi-LSTM, and feed-forward attention. We extensively compared Enhancer-LSTMAtt with 19 state-of-the-art methods by 5-fold cross validation, 10-fold cross validation, and independent test. Enhancer-LSTMAtt achieved competitive performance, especially in the independent test. We implemented Enhancer-LSTMAtt as a user-friendly web application. Enhancer-LSTMAtt is applicable not only to recognizing enhancers but also to distinguishing strong enhancers from weak enhancers. We believe Enhancer-LSTMAtt will become a promising tool for identifying enhancers.


Introduction
Enhancers are short pieces of DNA of 50 to 1500 bp that can accelerate the transcription of target genes by binding transcription factors [1,2]. Unlike promoters, enhancers are located upstream or downstream of, or even within, the genes they regulate and do not need to be close to transcription start sites [2][3][4]. Increasing evidence indicates that enhancers play a critical role in gene regulation [4,5]. Enhancers control the expression of genes involved in cell differentiation [6,7] and are responsible for morphological changes in threespine stickleback fish [8]. Enhancers orchestrate critical cellular events such as differentiation [9,10], maintenance of cell identity [11,12], and response to stimuli [13][14][15] by binding to transcription factors [16]. Enhancers are also closely related to inflammation and cancer [17]. Therefore, precisely detecting enhancers in DNA sequences is critical for further investigating their functions and roles in cellular processes.
The methods or techniques used to identify enhancers fall into two categories: high-throughput experimental technologies and computational methods [5,18]. The former include chromatin immunoprecipitation followed by deep sequencing (ChIP-seq) [19,20], protein-binding microarrays (PBMs) [21], systematic evolution of ligands by exponential enrichment (SELEX) [22], yeast one-hybrid (Y1H) [23], and bacterial one-hybrid [24]. The main idea behind these technologies is to identify enhancers by recognizing properties of enhancer-binding interactors [16]. There are generally four experimental approaches. The first is to identify enhancers by the binding sites of specific transcription factors (TFs) with the help of ChIP-seq [13,25]; these techniques are restricted to cell-type- or tissue-specific TFs. The second is to detect enhancers by recognizing the binding sites of transcriptional co-activators, such as CBP (also known as CREB-binding protein or CREBBP) and P300 (also called EP300 or E1A binding protein p300), recruited by the TFs [12,13,26]. However, not all enhancers are characterized by co-activators, and ChIP-grade antibodies are not always available. The third is to identify nucleosome-depleted open regions of DNase I hypersensitivity [27]. However, open regions also include other DNA elements, such as promoters, silencers/repressors, insulators, and other sequences of unknown function [28,29]. The modifications of histones in the flanking nucleosomes are a characteristic signature of enhancers; for example, histones flanking active enhancers are typically marked by H3 mono-methylated at lysine 4 (H3K4me1), while histones flanking active promoters are marked by H3K4me3 [13]. Therefore, the fourth approach is genome-wide mapping of histone modifications. In spite of their great success in identifying enhancers, high-throughput experimental technologies have two drawbacks: they are time-consuming and expensive. It therefore remains a challenging task to identify all enhancers across thousands of tissues or cell types.
Computational methods have been developed over the recent decade to complement high-throughput experimental technologies [18,30,31]. They include genomics comparison-based methods and machine learning-based methods. Because enhancers reside in any region of the genome, it is very difficult to find linear motifs of enhancers intuitively by genomics comparison-based methods. Machine learning-based methods build a classification model that fits known enhancers and then predicts new enhancers; furthermore, they are capable of discovering non-linear hidden motifs of enhancers. To date, at least twenty machine learning-based methods have been developed for enhancer prediction [16], such as iEnhancer-2L [40], iEnhancer-PsedeKNC [41], EnhancerPred [42], and EnhancerPred2.0 [43]. The general workflow of these methods is first to compute representations of sequences, such as pseudo k-tuple nucleotide composition, nucleotide binary profiles, and accumulated nucleotide frequency; then to learn a classifier with a machine learning algorithm such as a support vector machine or random forest; and finally to predict unknown sequences.
The aforementioned machine learning-based methods require sophisticated design of representations as well as careful selection of conventional machine learning algorithms. In practice, no single representation characterizes enhancers well, while a combination of diverse representations has the potential to improve performance but reduces the generalization ability of the methods. Deep learning methods developed in recent decades have proven to be good at addressing complex issues, including protein structure prediction, which is thought to be one of the most challenging tasks [64,65]. Yao et al. [60] presented a word embedding-based deep learning method named iEnhancer-GAN to detect enhancers; to make up for the insufficient number of training samples, iEnhancer-GAN [60] used a sequence generative adversarial net [66] to augment the training samples. Min et al. [33] developed a deep convolutional neural network (CNN)-based method for distinguishing enhancers from non-enhancers, which required only primary sequences as input. Khanal et al. [52] exploited word embedding from the field of natural language processing together with a CNN to construct a method named iEnhancer-CNN. Nguyen et al. [50] integrated multiple CNNs into iEnhancer-ECNN. The CNN is capable of characterizing local properties [67] but is insufficient to represent semantic relationships between words in the context of sequences. Tan et al. [48] exploited recurrent neural networks (RNNs) and integrated the outputs of both the RNN and the CNN for the final decision. Le et al. [55] applied BERT [68] to capture the semantics of DNA sequences. On the basis of an analysis of the published works and methods for detecting enhancers, we present a bi-directional long short-term memory (Bi-LSTM) and attention-based deep learning method for enhancer recognition, called Enhancer-LSTMAtt.
Another dataset, S_i, taken from reference [46], was used for the independent test. S_i contained 100 strong enhancers, 100 weak enhancers, and 100 non-enhancers. The sequences were processed by CD-HIT [71][72][73] so that the sequence identity between any two enhancers was no more than 0.8.

Methods
As shown in Figure 1, the proposed method comprised mainly input, embedding, 1D CNN, residual neural network (ResNet), Bi-LSTM, attention, dropout, flatten, and fully connected layers. The input was a DNA segment of 200 bp, which was transformed into a number sequence by mapping each nucleotide to an integer, where N denoted an unknown nucleotide. The embedding of the number sequence was fed into both the convolution module and the LSTM module. The convolution module consisted mainly of the 1D CNN and ResNet [81][82][83], while the LSTM module comprised mainly Bi-LSTM [84,85] and feed-forward attention [86,87]. The concatenation of the outputs of the two modules was fed into the fully connected layer. The fully connected layer was followed by the final layer, which contained one neuron representing the probability of belonging to enhancers. We set the threshold to 0.5, so an output greater than 0.5 indicated that the corresponding input was predicted to be positive and otherwise negative. The number of parameters and the shape of the output of each layer of Enhancer-LSTMAtt are listed in Table 1.
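The input transformation described above can be sketched as follows. The exact nucleotide-to-integer mapping used by Enhancer-LSTMAtt is an assumption here; only the special treatment of the unknown nucleotide N is from the text.

```python
# Hypothetical integer encoding for 200-bp DNA segments; the exact
# mapping is an assumption, not taken from the paper.
MAPPING = {"A": 1, "C": 2, "G": 3, "T": 4, "N": 0}

def encode(seq, length=200):
    """Map a DNA string to a fixed-length integer sequence (0 for N/unknown)."""
    ids = [MAPPING.get(base.upper(), 0) for base in seq[:length]]
    return ids + [0] * (length - len(ids))  # pad short segments with 0

x = encode("ACGTN")
```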

Layer    Shape of Output    Number of Parameters
Input    (None, 200)        0

Embedding Layer
The embedding is generally the first layer of a deep neural network; its role is to map categorical (discrete) variables to continuous vectors (https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526 (accessed on 3 March 2022)) [88]. Traditional one-hot encoding suffers from two drawbacks: it is not capable of expressing similarities between representations, and the representation is sparse when the vocabulary is large. The embedding solves both issues well and thus is widely applied in natural language processing. Embeddings can be used alone, as in word2vec and GloVe, or fused into the deep neural network as its first layer.
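An embedding layer is essentially a lookup table whose rows are learned during training. The following is a toy numpy sketch (random weights, illustrative sizes only), not the embedding used by Enhancer-LSTMAtt:

```python
import numpy as np

# A lookup-table sketch of an embedding layer: each integer token
# selects one row of a learnable weight matrix.
vocab_size, embed_dim = 5, 8          # 5 tokens (e.g. N, A, C, G, T); toy dimension
rng = np.random.default_rng(0)
weights = rng.normal(size=(vocab_size, embed_dim))  # learned in a real network

def embed(token_ids):
    """Map integer tokens to dense continuous vectors."""
    return weights[np.asarray(token_ids)]

vectors = embed([1, 2, 3, 4, 0])      # one 8-d vector per nucleotide
```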

CNN
CNN is one of the most popular neural network architectures used to construct deep neural networks [67,89,90]. The main characteristic of the CNN is that it captures local hidden structure by using convolutional kernels, or filters. As shown in Figure 2A, the input is divided into patches, which are convolved into the feature map by the convolutional kernel. The patches are allowed to overlap, and the interval between adjacent patches is called the stride. All of the patches in the same input share the convolutional kernel, whose entries are learnable parameters. To keep the size of the input unchanged, the input is sometimes padded. To increase the non-linear capacity of the CNN, an activation function is applied to the feature map; common activation functions include ReLU, sigmoid, tanh, leaky ReLU, and ELU. The pooling in the CNN is a non-linear down-sampling, whose role is to reduce the dimensionality of the representations and to speed up the computation. In addition, pooling is able to avoid or reduce over-fitting.
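The convolution–activation–pooling pipeline described above can be sketched in a few lines of numpy (toy values; a real 1D CNN layer handles many kernels and channels at once):

```python
import numpy as np

def conv1d(x, kernel, stride=1):
    """Valid 1D convolution: the kernel is shared across all patches."""
    k = len(kernel)
    n = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], kernel) for i in range(n)])

def relu(v):
    """Activation adds non-linearity to the feature map."""
    return np.maximum(v, 0.0)

def max_pool(v, size=2):
    """Non-overlapping max pooling: non-linear down-sampling."""
    return np.array([v[i:i + size].max() for i in range(0, len(v) - size + 1, size)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
kernel = np.array([-1.0, 0.0, 1.0])    # learnable in a real network
feature_map = relu(conv1d(x, kernel))  # [2, 2, 2, 2]
pooled = max_pool(feature_map)         # [2, 2]
```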

ResNet
As the number of stacked layers in a deep neural network increases, three issues arise: information loss, gradient vanishing or exploding, and network degradation, which result in worse performance of the deep neural network [89]. He et al. [81] presented ResNet to address these issues. The basic building block of ResNet [81] is composed of a residual mapping F(x) and an identity mapping x, as shown in Figure 2B. The identity mapping ensures no loss of input information in spite of increasing layers, while the residual mapping is viewed as a learnable residual function and may consist of conventional convolutions. ResNet enables the neural network to go deeper without network degradation. He et al. [81] used ResNet to construct a 152-layer deep network, which reduced the top-5 error rate of image recognition to 5.71% on ImageNet.

Bi-LSTM
Long short-term memory (LSTM) [90] is a type of recurrent neural network (RNN) [91,92]. The RNN is especially suited to time-series problems because of its architecture: it shares weights across all of the time steps. The RNN has been applied to a wide range of fields, including speech recognition [93], continuous B-cell epitope prediction [94], sentiment analysis [95], and action recognition [96]. The major drawback of the RNN is that it is prone to gradient vanishing or exploding when analyzing long sequences; therefore, the RNN is restricted to short sequences [97,98]. The LSTM [90] employs a gate mechanism to control the flow of information, including selective addition of new information and removal of previously accumulated information. The LSTM is able to capture the relationship of earlier words with later ones but is not able to characterize the relationship of later words with earlier ones. The Bi-LSTM [84,85] addresses this issue well. As shown in Figure 3, the Bi-LSTM is made up of two LSTMs, one running from forward to backward and the other from backward to forward. The two LSTMs share the embedding of words but are independent of each other in terms of learnable parameters. The concatenation of the hidden states of both LSTMs constitutes the output of the Bi-LSTM.
In Figure 3, the symbols denote the input, the forward hidden state, the backward hidden state, and the bi-directional hidden state at time step i, respectively.
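The bi-directional scheme can be sketched with a plain tanh RNN standing in for the LSTM cell (a simplification: a real LSTM adds gates, but the forward/backward runs, the independent parameter sets, and the concatenation are the same):

```python
import numpy as np

def rnn_states(seq, W_x, W_h):
    """A plain tanh RNN: the same weights are shared at every time step."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in seq:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return states

rng = np.random.default_rng(1)
T, d_in, d_h = 5, 4, 3                # toy sizes
seq = [rng.normal(size=d_in) for _ in range(T)]

# Two independent parameter sets, as in the Bi-LSTM description.
Wx_f, Wh_f = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
Wx_b, Wh_b = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

forward = rnn_states(seq, Wx_f, Wh_f)                # left to right
backward = rnn_states(seq[::-1], Wx_b, Wh_b)[::-1]   # right to left, re-aligned
bi = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
```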

Feed-Forward Attention
Attention mechanisms are increasingly becoming a hot topic in the field of deep learning. An attention mechanism is a scheme for allocating weights, much like the way one assigns different focus to different parts of an object when watching it. There are many attention schemes, including feed-forward attention [99] and self-attention [100]. The feed-forward attention is intended to make up for the deficiency of the LSTM in long-term dependency. Assume that the hidden state at time step t in the LSTM is h_t. The context vector c generated by the feed-forward attention is computed by

c = Σ_t α_t h_t,

where α_t is the attention weight of the hidden state h_t, defined by

α_t = exp(e_t) / Σ_k exp(e_k),

where e_t = δ(h_t) and δ is a learnable function.
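A minimal numpy sketch of the feed-forward attention above. A tanh score is assumed for the learnable function δ (one common choice, not necessarily the one used in Enhancer-LSTMAtt), and the weights α_t are its softmax:

```python
import numpy as np

def feed_forward_attention(states, w, b=0.0):
    """Context vector as the attention-weighted sum of hidden states.
    e_t = tanh(w . h_t + b) plays the role of the learnable delta."""
    e = np.array([np.tanh(w @ h + b) for h in states])  # one score per time step
    alpha = np.exp(e) / np.exp(e).sum()                 # softmax weights
    context = (alpha[:, None] * np.asarray(states)).sum(axis=0)
    return context, alpha

rng = np.random.default_rng(2)
H = rng.normal(size=(6, 3))   # 6 time steps, hidden size 3 (toy values)
w = rng.normal(size=3)
context, alpha = feed_forward_attention(H, w)
```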

Dropout Layer
Dropout, proposed by Hinton et al. [101], is a technique for training deep neural networks. During training, a certain proportion of neurons are randomly dropped out, while during prediction all of the neurons are used as usual [102]. Dropout serves two functions: speeding up training of the deep neural network and reducing over-fitting.
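The train-versus-predict behavior described above can be sketched with "inverted" dropout, the variant most frameworks use (an assumption here; the paper does not state which variant it uses):

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: during training, randomly zero a `rate` fraction
    of activations and rescale the survivors; at prediction time all
    neurons are used as usual, with no rescaling needed."""
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(3)
activations = np.ones(8)
train_out = dropout(activations, rate=0.5, rng=rng)                   # some entries zeroed
predict_out = dropout(activations, rate=0.5, rng=rng, training=False)  # unchanged
```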




Flatten Layer and Fully Connected Layer
The flatten layer converts the shape of the data so that it can be conveniently linked to the next layer; it has no learnable parameters. The fully connected layer is identical to the hidden layer in a multilayer perceptron: each neuron is connected to all of the neurons in the previous layer.
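As a sketch (illustrative shapes, not the actual layer sizes of Enhancer-LSTMAtt):

```python
import numpy as np

# Flatten has no learnable parameters: it only reshapes the data.
feature_maps = np.arange(24.0).reshape(6, 4)  # e.g. 6 positions x 4 channels
flat = feature_maps.reshape(-1)               # shape (24,)

# A fully connected layer: every neuron is connected to all of the
# inputs from the previous layer, as in a multilayer perceptron.
rng = np.random.default_rng(4)
W = rng.normal(size=(10, flat.size))          # 10 neurons, learnable weights
b = np.zeros(10)                              # learnable biases
out = W @ flat + b                            # shape (10,)
```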

Cross Validation and Evaluation Metrics
To examine the predictive performance of the presented method, we used n-fold cross validation and the independent test. In the n-fold cross validation, the training dataset was divided into n parts of equal or approximately equal size, of which n − 1 parts were used to train the model and the remaining part was used to test the model; this process was repeated n times. In the independent test, the training dataset was used to train the model, and the independent dataset was used to test the model. Since this is a binary classification problem, we used common metrics to evaluate the predictive performance, including sensitivity (SN), specificity (SP), accuracy (ACC), and Matthews correlation coefficient (MCC), which are defined as

SN = TP / (TP + FN),
SP = TN / (TN + FP),
ACC = (TP + TN) / (TP + TN + FP + FN),
MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)),

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. We also used the area under the ROC curve (AUC), which ranges from 0 to 1. An AUC of 1 indicates a perfect prediction, an AUC of 0.5 indicates a random prediction, and an AUC of 0 indicates a completely opposite prediction.
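These definitions can be computed directly from the confusion-matrix counts. The counts below are toy values for illustration, not results from the paper:

```python
import math

def metrics(tp, tn, fp, fn):
    """SN, SP, ACC, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)    # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom        # Matthews correlation coefficient
    return sn, sp, acc, mcc

sn, sp, acc, mcc = metrics(tp=80, tn=70, fp=30, fn=20)  # toy counts
```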
We used the Python programming language along with the deep learning toolkit TensorFlow (version 2.0) to implement Enhancer-LSTMAtt. We conducted 5-fold cross validation, 10-fold cross validation, and the independent test on the Microsoft Windows 10 operating system, installed on a notebook computer with 32 GB of RAM and 6 CPU cores of 2.60 GHz each. Each epoch took about 25 s in the training process, while prediction of each sample took no more than 2 s using the trained Enhancer-LSTMAtt. The code along with the datasets is available at GitHub: https://github.com/feng-123/Enhancer-LSTMAtt.

Results
We tested the Enhancer-LSTMAtt for its ability to not only distinguish between enhancers and non-enhancers, but also discriminate strong enhancers from weak enhancers. The process of distinguishing between enhancers and non-enhancers was called the first stage, where all of the enhancers, including weak enhancers, were positive samples. The process of discriminating strong from weak enhancers was called the second stage, where the strong enhancers were positive and the weak enhancers were negative samples. We conducted 5-fold cross validation in dataset S. Figure 4 shows the ROC curve of each fold, and Table 2 lists the evaluation of performance. We obtained an average AUC of 0.8259 in the first stage and an average AUC of 0.6439 in the second stage. We achieved an average SN of 0.7304, an average SP of 0.8006, an average ACC of 0.7655, and an average MCC of 0.5339 in the first stage and an average SN of 0.6765, an average SP of 0.6024, an average ACC of 0.6395, and an average MCC of 0.2804 in the second stage. Obviously, the predictive performance in the first stage was much better than that in the second stage, indicating that it was more difficult to discriminate strong enhancers from weak enhancers than to discriminate enhancers from non-enhancers.


Comparison with State-of-the-Art Methods
As mentioned in the introduction, no fewer than 20 computational methods have been developed for predicting enhancers. Some methods were tested by the jackknife test, some by 5-fold cross validation, some by 10-fold cross validation, and some by the independent test. Some methods distinguished enhancers from non-enhancers, while some discriminated strong from weak enhancers. Table 3 summarizes these methods. Since the jackknife test is too time-consuming for deep learning methods, we conducted 5-fold cross validation, 10-fold cross validation, and the independent test to compare with these state-of-the-art methods. Tables 4 and 5 list the evaluations of performance. Different indices evaluate different aspects of performance. For instance, SN is the ratio of the number of correctly predicted positive samples to the total number of positive samples, while SP is the ratio of the number of correctly predicted negative samples to the total number of negative samples. Sometimes the two indices do not move in step, making it difficult to judge a method as good or bad; in this case, overall indices such as ACC and MCC can be used. In the 5-fold cross validation, Enhancer-LSTMAtt was superior to Enhancer-BERT [55], DeployEnhancer [48], and iEnhancer-RF [57] in terms of ACC and MCC in the first stage and exceeded iEnhancer-PsedeKNC [41], DeployEnhancer [48], EnhancerP-2L [51], and iEnhancer-RF [57] in terms of MCC in the second stage. In the 10-fold cross validation, Enhancer-LSTMAtt reached competitive performance with ES-ARCNN [49], iEnhancer-XG [53], and iEnhancer-MFGBDT [63] in the second stage. Table 6 lists the evaluations of performance of all 19 methods on the independent test. To the best of our knowledge, nearly all of the methods used the same independent dataset S_i for the independent test, and no other published enhancers were collected as a second independent dataset.
Obviously, Enhancer-LSTMAtt achieved competitive performance with these state-of-the-art methods. In the first stage, Enhancer-LSTMAtt reached the best SP (0.8150), the best ACC (0.8050), and the best MCC (0.6101); achieved the second-best AUC (0.8588), below only that of iEnhancer-RF [57]; and obtained a competitive SN (0.7950), which was less than the SN of iEnhancer-GAN [60], spEnhancer [58], iEnhancer-5Step [47], piEnPred [61], iEnhancer-RD [62], and iEnhancer-BERT [55]. In the second stage, Enhancer-LSTMAtt reached the best SN, ACC, and MCC, an AUC second only to that of iEnhancer-RF [57], and an SP second only to that of Enhancer-DRRNN [54]. These results indicate that Enhancer-LSTMAtt is a competitive method for recognizing enhancers. It must be pointed out that we did not re-run the cross validations and independent tests for the 19 methods; the evaluations of their performance were taken directly from their published papers. Figure 5 shows the ROC curves of the independent test. Table 6. Comparison with state-of-the-art methods by independent test.

Method                  SN      SP      ACC     MCC     AUC
Second stage
iEnhancer-2L [40]       0.4700  0.7400  0.6050  0.2181  0.6678
EnhancerPred [42]       0.4500  0.6500  0.5500  0.1020  0.5790
iEnhancer-EL [46]       0.5400  0.6800  0.6100  0.2222  0.6801
iEnhancer-5Step [47]    0.7400  0.5300  0.6350  0.2800  -
DeployEnhancer [48]     0

Enhancer-LSTMAtt Webserver
We implemented Enhancer-LSTMAtt as a user-friendly web application, freely available to all researchers at http://www.biolscience.cn/Enhancer-LSTMAtt/ (accessed on 20 May 2022). The web application is easy to use: users only need to submit DNA sequences in FASTA format, either by pasting them into the textbox or by uploading a file, and click the "submit" button; the web application then returns predictions as a 7-column table. The first column is the name of the input sequence, the second is the range of the enhancer, the third and fourth are the probabilities of being predicted as an enhancer and a non-enhancer, respectively, the fifth and sixth are the probabilities of being predicted as a strong and a weak enhancer, and the seventh is the predicted result.

Discussion
We investigated the effect of different non-enhancers on the method. Because alternative non-enhancers were not available, we used a sampling-and-mutation strategy to generate new non-enhancers. We randomly selected 30%, 40%, and 50% of the samples in the non-enhancer set S_non and mutated them. The mutated non-enhancers together with the non-mutated ones constituted three new non-enhancer sets, which along with the enhancers comprised three new training sets, respectively. We used the independent test to examine the performance of the proposed method trained on the new training sets. As shown in Figure 6, the non-enhancers have a certain influence on the performance of the method, but the influence is small.
Figure 6. The ROC curves by the independent test over different non-enhancer sets. N_10, N_30, and N_50 denote the training sets in which 10%, 30%, and 50% of non-enhancers were formed by mutation from the original non-enhancers, respectively. The original data denotes the training set made up of the enhancers and the original non-enhancers. The purple line is the baseline of the ROC curve, i.e., the random guessing line.
Enhancer-LSTMAtt has up to 250,921 trainable parameters. The more trainable parameters there are, the more a deep learning model tends to overfit. We used dropout and batch normalization to reduce overfitting and investigated the roles of both techniques. As shown in Figure 7, the training loss descended rapidly in the beginning stage and then slowly declined to a stable level as the epochs increased, while the loss of the independent test declined rapidly in the beginning stage and then fluctuated within a certain range. The AUC of the independent test ascended rapidly in the beginning stage and then tended to stabilize with increasing epochs. Therefore, there is no remarkable overfitting issue for Enhancer-LSTMAtt.
Enhancer-LSTMAtt is a deep learning-based, end-to-end method that does not require any feature design, which avoids artificial interference and sophisticated feature extraction or selection. From this viewpoint, Enhancer-LSTMAtt is easier to apply than feature-based methods. Most feature-based methods performed well in cross validation but badly in the independent test, indicating weak generalization ability. For example, EnhancerP-2L [51] achieved MCCs of 0.8340 and 0.8398 in the 5-fold and 10-fold cross validations, respectively, but reached only an MCC of 0.5907 in the independent test, a decrease of more than 0.24. piEnPred [61] substantially decreased its MCC from 0.7660 in the 5-fold cross validation to 0.6099 in the independent test.
Enhancer-LSTMAtt did not reduce the MCC in the independent test and instead increased it by at least 0.07. Thus, Enhancer-LSTMAtt generalizes better to the independent test than the feature-based methods. Most deep learning-based methods utilize the CNN, the LSTM, or their combination for enhancer recognition. For example, both iEnhancer-ECNN [50] and iEnhancer-CNN [52] exploited the CNN, iEnhancer-EBLSTM [59] used the Bi-LSTM, and DeployEnhancer [48] sequentially combined the CNN and Bi-LSTM. The CNN and Bi-LSTM are two popular neural network architectures that capture different kinds of information. A sequential combination of the CNN and Bi-LSTM is disadvantageous for fully exploiting these two different types of information, whereas stacking the CNN and Bi-LSTM in parallel exploits their respective representations. In addition, we also used the residual network and the attention mechanism to improve the representation. These are two potential reasons why Enhancer-LSTMAtt is superior in the independent test to other deep learning-based methods, such as iEnhancer-ECNN [50], iEnhancer-CNN [52], iEnhancer-EBLSTM [59], and DeployEnhancer [48]. On the other hand, the inclusion of the residual neural network and feed-forward attention and the parallel stacking of the CNN and Bi-LSTM added a certain amount of complexity, which in turn increased the computing cost.

Conclusions
Identifying enhancers is key to uncovering their roles in the regulation of transcription of target genes. We employed multiple deep learning techniques (i.e., Bi-LSTM, CNN, residual network, and feed-forward attention) to construct Enhancer-LSTMAtt for enhancer recognition. Enhancer-LSTMAtt has the following advantages over the state-of-the-art methods: (1) it stacks the CNN and the LSTM in parallel rather than in series, which allows stacking diverse representations; (2) it utilizes the residual neural network, which allows the construction of deeper neural networks without loss of information; and (3) it employs the attention mechanism, which allows focusing on key information. A comprehensive comparison with state-of-the-art methods suggested that Enhancer-LSTMAtt is not only a stable tool but also an effective and efficient tool for enhancer identification.