A Novel Position-Specific Encoding Algorithm (SeqPose) of Nucleotide Sequences and Its Application for Detecting Enhancers

Enhancers are short genomic regions exerting tissue-specific regulatory roles, usually for remote coding regions. Enhancers are observed in both prokaryotic and eukaryotic genomes, and their detections facilitate a better understanding of the transcriptional regulation mechanism. The accurate detection and transcriptional regulation strength evaluation of the enhancers remain a major bioinformatics challenge. Most of the current studies utilized the statistical features of short fixed-length nucleotide sequences. This study introduces the location information of each k-mer (SeqPose) into the encoding strategy of a DNA sequence and employs the attention mechanism in the two-layer bi-directional long-short term memory (BD-LSTM) model (spEnhancer) for the enhancer detection problem. The first layer of the delivered classifier discriminates between enhancers and non-enhancers, and the second layer evaluates the transcriptional regulation strength of the detected enhancer. The SeqPose-encoded features are selected by the Chi-squared test, and 45 positions are removed from further analysis. The existing studies may focus on selecting the statistical DNA sequence descriptors with large contributions to the prediction models. This study does not utilize these statistical DNA sequence descriptors. Then the word vector of the SeqPose-encoded features is obtained by using the word embedding layer. This study hypothesizes that different word vector features may contribute differently to the enhancer detection model, and assigns different weights to these word vectors through the attention mechanism in the BD-LSTM model. The previous study generously provided the training and independent test datasets, and the proposed spEnhancer is compared with the three existing state-of-the-art studies using the same experimental procedure. 
The leave-one-out validation on the training dataset shows that the proposed spEnhancer achieves detection performances similar to those of the three existing studies, while spEnhancer achieves the best overall performance metric MCC for both binary classification problems on the independent test dataset. The experimental data shows that the strategy of removing redundant positions (SeqPose) may help improve DNA sequence-based prediction models. spEnhancer may serve well as a complementary model to the existing studies, especially for novel query enhancers that are not included in the training dataset.


Introduction
Innovative technologies and comprehensive biological investigations show that the non-coding genomic regions are not functionally inactive, as previously hypothesized, but play essential roles in transcriptional regulation [1]. An enhancer is a small genomic region that binds to transcription factors and exerts its regulatory role on the target genes [2,3]. Unlike promoters, which are always upstream of their target genes, enhancers may even reside in introns [4]. The burst frequency of a target gene may be significantly increased by enhancers [5]. Therefore, the functional investigation of enhancers will improve our understanding of the transcriptional regulation mechanism [6].
Enhancers may be detected through in vivo animal experiments. Heintzman and Ren identified novel enhancers through the binding affinities to the transcription factors like P300 [7]. Boyle et al. detected enhancers by investigating the DNaseI hypersensitivity [8]. However, the wet-lab experiments are time-consuming and labor-intensive, and many enhancers cannot be detected in this way due to their condition-specific activities [9].
Various machine learning methods have been proposed for the enhancer detection and evaluation problem. Firpi et al. proposed the artificial neural network-based algorithm CSI-ANN to efficiently extract the enhancers' sequence features and accurately detect novel ones [10]. The tool EnhancerFinder combined multiple learning kernels based on the evolutionary conservation patterns, sequence motifs, and cell type-specific functional information for the detection and genomic distribution characterization of enhancers [3]. A random forest (RF) model was trained using the chromatin status to construct the enhancer maps in multiple cell types [11]. Deep learning is another powerful tool to detect enhancers, and the tool EnhancerDBN achieved the enhancer detection task using a deep belief network (DBN) [12].
The strength type is another important feature of enhancers. Liu et al. utilized the pseudo k-tuple nucleotide composition (PseKNC) algorithm [13] as the sequence features and proposed a two-layer classifier iEnhancer-2L for both the enhancer detection and the enhancer strength type determination [14]. Jia et al. utilized a two-step wrapper feature selection algorithm to find the best features from both bi-profile Bayes and PseKNC, and their model EnhancerPred outperformed iEnhancer-2L by 0.01 and 0.12 in the Matthews Correlation Coefficient (MCC) for the two layers, respectively [15]. Nguyen et al. combined the one-hot encoding of each nucleotide with the statistical k-mer descriptors of its neighborhood, and trained convolutional neural network (CNN) models for the enhancer detection problem (iEnhancer-ECNN) [16]. Their ensembled models demonstrated much improved prediction performance on the independent test dataset. However, both the ensembled prediction models and the CNN classifiers are known for their high computational requirements, which might be the reason that the leave-one-out validation was not carried out.
This study proposes a novel position-specific encoding algorithm (SeqPose) of nucleotide sequences and selects a subset of the SeqPose features to build the two-layer enhancer classification model. The first layer separates the enhancers from the non-enhancers, and the second layer determines whether an enhancer is strong or weak. The experimental data shows that the position-specific patterns contribute useful information to the accurate enhancer detection and strength evaluation problem.
There are two major contributions of this study. Firstly, the DNA sequence encoding strategy in this study utilizes the location information of each k-mer through a novel sequence preprocessing strategy (SeqPose). We map the original DNA sequences into numerical sequences by using the encoding rules (SeqPoseDict), then use the Chi-squared test to identify the redundant positions that have low correlation with the sample label, and remove them from all the sequences. In other words, we treat the different positions of the encoded sequence as features and delete the unimportant positions from the samples, a step that was ignored by most of the existing studies. The correlations between different positions in a DNA sequence also show phenotype associations. This is different from the previous position-specific encoding strategy of DNA sequences, which calculates the one-hot encoding of the nucleotide A/C/G/T in each position together with the other statistical descriptors of its neighborhood, and arranges these engineered features in the same order as their corresponding nucleotides [16].
Secondly, the attention mechanism is widely used in deep learning-based text classification studies, and this study hypothesizes that the attention mechanism might be able to highlight the nucleotide 'words' with large contributions to the enhancer prediction models. Our experimental data in the following sections demonstrates that the proposed SeqPose features achieve leave-one-out validation and independent test performances similar to those of the existing studies. The feature selection strategy and the attention mechanism ensure that the training and prediction of the proposed models may be completed within a reasonable time, and the models have the potential to be deployed in situations with limited computing power.

Results and Discussion
This section evaluates the parameters of the proposed model spEnhancer, and then compares spEnhancer with the existing studies on the same datasets. In the first three subsections, we hold out 10% of the 2968 training DNA sequences as the test set, then hold out 10% of the remaining sequences as the validation set, and use all the rest as the training set. All experiments in this study use the random number seed 75. The results of the first layer of the model on the test set are used as the criteria for parameter selection.
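The nested hold-out split described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_indices`, the use of Python's `random` module, and the rounding of the subset sizes are not taken from the paper, and the text does not say whether the split was stratified by class.

```python
import random


def split_indices(n_samples, test_frac=0.1, val_frac=0.1, seed=75):
    """Shuffle the sample indices, then carve off the test and validation subsets.

    Assumption: subset sizes are rounded to the nearest integer and the split
    is not stratified; the paper does not specify either detail.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    n_test = round(n_samples * test_frac)            # 10% of all sequences
    n_val = round((n_samples - n_test) * val_frac)   # 10% of the remainder

    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(2968)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed makes the three subsets reproducible across runs, which is what allows the later parameter comparisons to be made on the same test set.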

Evaluating the Length of K-mers
It is anticipated that different k-mer lengths produce different SeqPose features and may have large impacts on the final prediction models. This section investigates the binary classification between enhancers and non-enhancers. The three parameters are initially set as pBatchSize = 100, pLSTMSize = 128, and pDropoutRatio = 0.2.
The enhancer detection model using 7-mers achieves the worst prediction accuracy (Acc), as shown in Table 1. The model using 6-mers performs reasonably well on the metric specificity (Sp), but its sensitivity (Sn) is only 0.6338, which is at least 0.0775 worse than the models using the other k-mers. Therefore, 6-mers are excluded from further evaluation. The model using 2-mers achieves the best Acc = 0.8047 in Table 1, so the following sections focus on the models using 2-mers.

Selecting the Best SeqPose Features
We hypothesize that some of the SeqPose features make no contribution to the classification problem in this study. We use the Chi2 measurement to evaluate the features and calculate the prediction performances of the binary classification problem between enhancers and non-enhancers on the test dataset after removing some features, as shown in Figure 1. Firstly, we carry out a coarse-grained feature selection procedure by removing the five features with the largest p-values in each iteration, as shown in Figure 1a. There are 398 2-mer SeqPose features constructed from each 200-bp nucleotide sequence. The best classification accuracy is Acc = 0.8249 after removing 45 features ("−45"). At the same time, the classification model based on this feature list also obtains the best threshold-independent metric AUC = 0.8986.
Secondly, we carry out a fine-grain feature selection procedure by removing one feature in each iteration, as shown in Figure 1b. The experimental data suggests that the classification model with 45 removed features is not improved with more features being removed. The experimental data of Table 1 and Figure 1 show that 45 of the SeqPose features do not contribute to the enhancer prediction models. The best prediction model in Table 1 was substantially improved for Acc (from 0.8047 to 0.8249), MCC (from 0.5673 to 0.5936), and AUC (from 0.8781 to 0.8986). Therefore, the removal of the redundant positions in the SeqPose encoding features may improve the DNA sequence-based prediction models like the enhancer predictions. The following section will use the 2-mer SeqPose features after the removal of those 45 position-specific features.
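The coarse-then-fine search over the number of removed positions can be sketched as below. The callback `evaluate` is hypothetical: it stands for training the model with the k positions of largest p-value removed and returning the test accuracy. The fine stage is assumed here to scan one-position increments around the coarse optimum; the paper only states that one feature is removed per iteration.

```python
def coarse_to_fine(evaluate, max_removed, step=5):
    """Return the removal count with the best score, plus the fine-stage scores.

    evaluate(k) is a hypothetical callback returning the accuracy obtained
    after removing the k positions with the largest p-values.
    """
    # Coarse stage: remove `step` more positions in each iteration.
    coarse = {k: evaluate(k) for k in range(0, max_removed + 1, step)}
    best_coarse = max(coarse, key=coarse.get)

    # Fine stage (assumed): one-position increments around the coarse optimum.
    low = max(0, best_coarse - step + 1)
    high = min(max_removed, best_coarse + step - 1)
    fine = {k: coarse[k] if k in coarse else evaluate(k)
            for k in range(low, high + 1)}
    best_fine = max(fine, key=fine.get)
    return best_fine, fine


def toy_accuracy(k):
    """Synthetic score that peaks at 45 removed positions (not real data)."""
    return 0.82 - 0.001 * abs(k - 45)


best_k, fine_scores = coarse_to_fine(toy_accuracy, max_removed=100)
print(best_k)
```

The coarse stage keeps the number of model trainings small, and the fine stage only spends additional trainings near the most promising removal count.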

Symbolic Interpretation of SeqPoseDict
This section demonstrates through experimental comparison that SeqPoseDict only converts DNA sequences into machine-readable numerical representations, and that its construction does not have a decisive impact on the final experimental results (see Table 2). SeqPoseDict is constructed in turn from (1) all of the data (Res1); (2) 95% of the data (Res2); (3) 90% of the data (Res3); (4) 85% of the data (Res4); and (5) 80% of the data (Res5). The data shows that, under the framework proposed in this paper, the size of SeqPoseDict does not have a decisive influence on the prediction performance of the model. When the amount of data used to construct SeqPoseDict varies within 20%, the range of the Acc values does not exceed 0.03, which is acceptable.

Optimizing the Best Choices of the Three Parameters
The three parameters pLSTMSize, pDropoutRatio, and pBatchSize are optimized through eight-fold cross-validation, as shown in Table 3. The overall prediction accuracy Acc is used as the parameter selection criterion for the binary classification problem between enhancers and non-enhancers. The candidate values are {64, 128, 192} for pLSTMSize, {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7} for pDropoutRatio, and {16, 32, 64, 128, 256, 512} for pBatchSize. Each combination of one value from each of these three sets is used in turn as the training parameters until all possible combinations are covered, and their evaluation metrics are compared on the same test set (see Table 3). The data shows that the combination pLSTMSize = 64, pBatchSize = 64, and pDropoutRatio = 0.5 gives the best model (Acc = 0.7884, AUC = 0.8506). Therefore, the following sections use this set of hyperparameters.
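The exhaustive search over the three candidate sets can be sketched as follows. The candidate values are those listed above; the callback `score_fn` is hypothetical and stands for training the model with one parameter combination and returning its Acc.

```python
from itertools import product

LSTM_SIZES = [64, 128, 192]
DROPOUT_RATIOS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
BATCH_SIZES = [16, 32, 64, 128, 256, 512]


def grid_search(score_fn):
    """Try every (pLSTMSize, pDropoutRatio, pBatchSize) combination.

    score_fn is a hypothetical callback that trains the model with the given
    parameters and returns the accuracy used for selection.
    """
    best_params, best_score = None, float("-inf")
    for lstm_size, dropout, batch_size in product(
            LSTM_SIZES, DROPOUT_RATIOS, BATCH_SIZES):
        score = score_fn(lstm_size, dropout, batch_size)
        if score > best_score:
            best_params, best_score = (lstm_size, dropout, batch_size), score
    return best_params, best_score


# Number of model trainings needed per evaluation round: 3 * 7 * 6.
n_combinations = len(LSTM_SIZES) * len(DROPOUT_RATIOS) * len(BATCH_SIZES)
print(n_combinations)
```

With 126 combinations, each evaluated by eight-fold cross-validation, the cost of the search grows quickly, which is one reason for keeping each individual training run short.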

Comparing spEnhancer with the Existing Models
This section compares the experimental results of the proposed spEnhancer models with the existing models, as shown in Tables 4 and 5. The parameters of the spEnhancer models are set by the optimization procedure in the above sections: the SeqPose encoding features use 2-mers; the 45 redundant positions are removed; the parameters pLSTMSize, pBatchSize, and pDropoutRatio are set to 64, 64, and 0.5, respectively; the random seed is set to 75; the SeqPoseDict is calculated from the training dataset; and the threshold of the prediction probability uses the default value 0.5. The performance metrics Sn and Sp measure a binary classification model from different aspects, and one metric may be increased at the cost of the other [17,18]. The overall prediction performance metrics Acc and MCC may be used for a fair comparison of two classification models [14]. Leave-one-out validation tends to give optimistic results for a classification model, and the model generalization on future samples is usually evaluated by the prediction performances on independent datasets that are not involved in the model training process. The same publicly available independent dataset is used in this study [14].
Table 4 shows that the proposed method spEnhancer achieves prediction performances similar to those of the existing models for both binary classification problems under leave-one-out validation. The spEnhancer model achieves Acc = 0.7793 for the classification problem between enhancers and non-enhancers, which is 0.001 smaller than the best accuracy, achieved by the iEnhancer-EL model. Given the differences in experimental equipment and the randomness of the parameters, we consider this difference in accuracy to be negligible. For the classification between strong and weak enhancers, the spEnhancer model achieves the second-best accuracy Acc = 0.6413, which is slightly smaller than the Acc = 0.6503 of the best model iEnhancer-EL.
Table 5 gives the comparative results between the proposed spEnhancer and the four existing state-of-the-art models on the independent test dataset. The prediction performances of the four existing state-of-the-art studies are retrieved from the previous study. The prediction accuracy on the independent dataset is supposed to represent how well a prediction model may perform on future query nucleotide sequences.
The proposed spEnhancer models and the algorithm iEnhancer-ECNN achieve the best two prediction accuracies and MCC values on both enhancer detection problems, as shown in Table 5. The spEnhancer model achieves the best prediction Acc = 0.7725 and MCC = 0.5793 for the first layer of the enhancer detection problem, suggesting that spEnhancer may accurately separate enhancers from non-enhancers even on novel query sequences. For the second layer of the enhancer detection problem, spEnhancer achieves Acc = 0.6200, which is 0.0580 lower than the Acc = 0.6780 of the iEnhancer-ECNN model [16]. However, spEnhancer outperforms iEnhancer-ECNN in MCC by 0.0023. The previously best model iEnhancer-ECNN achieves a better strong/weak enhancer prediction accuracy than the proposed spEnhancer at the cost of longer training and prediction time: iEnhancer-ECNN integrates the prediction results of five convolutional neural networks, each of which is trained for 20 epochs [16], whereas this study completes the training of the proposed spEnhancer within six epochs. This computational cost might be the reason that the iEnhancer-ECNN model does not provide leave-one-out validation results.
Therefore, the overall prediction performances of spEnhancer are better than or comparable to the four state-of-the-art models. This suggests that spEnhancer may serve well as a complementary model to the existing enhancer detection studies.

Evaluating the Three-Class Classification Model
The BD-LSTM algorithm may also handle the three-class classification problem. The above sections formulate the enhancer detection problem as a two-layer setting for a consistent comparison with the network structure of Liu et al. [17]. This section collects the three classes of samples and directly trains a three-class BD-LSTM prediction model with the same parameter values for the enhancer detection problem. The model training is on the training dataset and the performance is calculated using the independent test dataset. The three-class BD-LSTM model achieves 0.6402 in the overall prediction accuracy, while the binary classification BD-LSTM model achieves Acc = 0.7725 for separating enhancers from non-enhancers, as shown in Table 5. However, the three-class BD-LSTM model outperforms the binary classification model between strong and weak enhancers.
Therefore, the setting of the two-layer model delivers similar prediction performances as the three-class classification model. In order to carry out a direct comparison with the existing studies, the two-layer model is recommended.

Evaluating Different Word Vector Dimensions
The dimension of the word vectors may affect the prediction performances, and this study evaluates different dimensions of the word vectors, as shown in Table 6. Eight values are evaluated for the dimension of the word vector, i.e., 12, 24, 48, 96, 192, 394, 768, and 1536. The values 48, 768, and 1536 achieve the best three Acc and MCC for the binary classification problem between enhancers and non-enhancers. The values 48, 96, 394, and 768 achieve the best four Acc for separating the strong enhancers from the weak ones, but the best MCC = 0.3703 is achieved by the word vector dimension 768. Although the dimension 1536 achieves the best Acc when classifying enhancers and non-enhancers, this Acc is only 0.0039 higher than that of the dimension 768. In contrast, when classifying strong and weak enhancers, the Acc of the dimension 768 is 0.04 higher than that of the dimension 1536, which is a substantial improvement. Therefore, this study sets 768 as the dimension of the word vector.

Table 6. Comparison of the spEnhancer model performances using different dimensions of word vectors on the independent test dataset. The first column gives the binary classification problem. The column "WV" is the dimension of the word vector. The other five columns give the prediction performances Acc, Sn, Sp, MCC, and AUC.

Materials and Methods
This section introduces a two-layer classifier for the detection and strength determination of enhancers. Firstly, the position-specific encoding algorithm (SeqPose) of nucleotide sequences is described. Then the Chi-squared test (Chi2) is used to select a subset of the SeqPose features. Lastly, the selected SeqPose features are loaded to the embedding layer and mapped to a three-dimensional tensor. The two-layer classifier spEnhancer is optimized for the two binary classification problems.

Datasets and Performance Metrics
This study retrieves the publicly available training and independent test datasets released by the previous study [14]. Three existing state-of-the-art models are evaluated using the same datasets. Therefore, the proposed model spEnhancer is also evaluated on these two datasets. A fair comparison is carried out for the proposed model spEnhancer and the three existing models using the leave-one-out validation on the training dataset and the validation on the independent test dataset. The following describes how the training dataset is generated. The independent test dataset is generated using the same procedure and has no overlapped samples with the training dataset.
This study uses the benchmark dataset $S$ constructed by Prof. Bin Liu [14] based on the chromatin state information of nine cell types (2968 sequences in total), where $S^{+}_{\mathrm{strong}}$ contains only strong enhancer sequences, with a total of 742; $S^{+}_{\mathrm{weak}}$ contains only weak enhancer sequences, with a total of 742; and $S^{-}$ contains only non-enhancer sequences, with a total of 1484.
All data consists of a positive sample set and a negative sample set. In the first layer, the enhancers are the positive samples and the non-enhancers are the negative samples; in the second layer, the strong enhancers are the positive samples and the weak enhancers are the negative samples.
This study uses the five widely-used classification performance metrics [13], i.e., accuracy (Acc), Matthews correlation coefficient (MCC), sensitivity (Sn), specificity (Sp), and the area under the ROC curve (AUC). Let the samples with the class labels 1 and 0 be positive and negative ones, respectively. The metrics Sn and Sp describe the ratios of correctly predicted positive and negative samples, respectively. Acc is the ratio of the correctly predicted samples. MCC describes the correlation between the real and predicted class labels. AUC is a threshold-independent metric for a classification model, and a good model tends to have a large AUC value [19]. These five metrics are defined in the following formula.
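The formulas themselves are not reproduced in this text. A reconstruction of the four count-based metrics, written in the notation of the paragraph that follows and assuming the definitions commonly used in this line of enhancer studies (equivalent to the usual TP/TN/FP/FN forms), would read as below. AUC has no closed form in these counts and is computed from the ROC curve.

```latex
\[
\begin{aligned}
\mathrm{Sn}  &= 1 - \frac{S^{+}_{-}}{S^{+}} \\
\mathrm{Sp}  &= 1 - \frac{S^{-}_{+}}{S^{-}} \\
\mathrm{Acc} &= 1 - \frac{S^{+}_{-} + S^{-}_{+}}{S^{+} + S^{-}} \\
\mathrm{MCC} &= \frac{1 - \left( \dfrac{S^{+}_{-}}{S^{+}} + \dfrac{S^{-}_{+}}{S^{-}} \right)}
                     {\sqrt{\left( 1 + \dfrac{S^{-}_{+} - S^{+}_{-}}{S^{+}} \right)
                            \left( 1 + \dfrac{S^{+}_{-} - S^{-}_{+}}{S^{-}} \right)}}
\end{aligned}
\]
```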
The notation $S^{+}$ is the number of positive samples, while $S^{+}_{-}$ is the number of positive samples incorrectly predicted as negative. $S^{-}$ is the number of negative samples, and $S^{-}_{+}$ is the number of negative samples incorrectly predicted as positive.

K-mer Indexing and SeqPose Feature Extraction
This study hypothesizes that the position-specific k-mer patterns may deliver important information to discriminate enhancers from the other nucleotide sequences. A nucleotide sequence is a vector of letters and may be formulated as one-hot integer vectors. Another feature extraction strategy is to summarize the statistical metrics of k-mers in a nucleotide sequence. All the k-mer instances are usually collected through a sliding window with step size 1, as shown in Figure 2. The unique k-mers collected from all the training sequences are gathered into a set, which is used as an enhancer dictionary (denoted as SeqPoseDict). Each k-mer has a unique ID (starting from 1) in this dictionary. A query nucleotide sequence is encoded by the same dictionary generated from the training sequences. Note that a SeqPoseDict ID carries no meaning of its own; it is only used to map a DNA sequence to a numerical vector. The experimental results in the section "Symbolic Interpretation of SeqPoseDict" show that different SeqPoseDict constructions do not cause large fluctuations in the model performance, which also illustrates that the proposed architecture is robust as a whole. Here we give only one method of constructing SeqPoseDict: the k-mers are assigned IDs starting from 1 in their order of appearance in the training set until all the k-mers in the training set are traversed. A k-mer in an unknown data set that is not in SeqPoseDict is assigned the ID 0.
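The dictionary construction and encoding rules described above can be sketched as follows. The function names and the toy sequences are illustrative assumptions, not the authors' code.

```python
def kmers(sequence, k=2):
    """Slide a window of width k over the sequence with step size 1."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_seqpose_dict(train_sequences, k=2):
    """Assign IDs starting from 1 in order of first appearance in the training set."""
    seqpose_dict = {}
    for sequence in train_sequences:
        for word in kmers(sequence, k):
            if word not in seqpose_dict:
                seqpose_dict[word] = len(seqpose_dict) + 1
    return seqpose_dict


def encode(sequence, seqpose_dict, k=2):
    """Replace each k-mer by its ID; k-mers missing from the dictionary get 0."""
    return [seqpose_dict.get(word, 0) for word in kmers(sequence, k)]


# Toy training sequences (hypothetical, much shorter than real samples).
toy_train = ["ACGTAC", "GGTACC"]
toy_dict = build_seqpose_dict(toy_train)
print(toy_dict)                    # {'AC': 1, 'CG': 2, 'GT': 3, 'TA': 4, 'GG': 5, 'CC': 6}
print(encode("ACGTTT", toy_dict))  # [1, 2, 3, 0, 0]
```

Because the IDs depend only on the order of first appearance, a different ordering of the training set yields a different but equally valid dictionary, in line with the observation that the IDs themselves carry no meaning.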

Selecting the Subset of Best Features
A feature selection step is carried out to remove the positions extracted in the above section if they have low phenotype associations. The Chi-squared test (Chi2) takes as its null hypothesis that a given feature is independent of the class label, and a p-value is calculated under the Chi-squared distribution to test this hypothesis. A small p-value rejects the null hypothesis and supports that the given position is correlated with the class label. Chi2 is therefore used on the training set (T) to identify the positions with low correlation with the class label as redundant features, which are then removed from all samples, including those in the test set.
Each sample is a fixed-length (n) nucleotide sequence and is sliced as multiple k-mers through the sliding window with step size 1. Each k-mer is regarded as a 'word'. Therefore, each sample can be viewed as a 'sentence' composed of multiple 'words' (See step 1 in Figure 3).
The k-mer 'word' in the extracted feature of a query nucleotide sequence is replaced by its ID in the dictionary SeqPoseDict (See step 2 in Figure 3).
The samples in the training dataset and their corresponding labels are assembled into a matrix M with C columns. The first C-1 columns of M hold the position-specific features, the last column holds the labels, and each row of M is a sample. The correlation between each feature and the class label is calculated by Chi2. The detailed calculations are described in the following sub-steps.
Take out the ith column and the label column Y in M. We treat the different IDs in the ith column as different k-mer categories and expand the column vector into the matrix X by one-hot coding, where each row is the one-hot encoding of a sample's ID and each column represents a k-mer category. The label variable Y is expanded in the same way into a two-dimensional matrix Y_label, where each row is the one-hot code of a sample's class label and each column represents a class (see Step 3.1 in Figure 3).
Calculate the observed value by formula (6), where $Y$ is taken in its one-hot form Y_label, and record it as vObserved (see Step 3.2 in Figure 3).
\[ \mathrm{vObserved} = Y^{\top} X \tag{6} \]
Summing each column of the data matrix X gives the variable vFeatureSum. Then formulas (7) and (8) are used to calculate the proportions of positive (vProbP) and negative (vProbN) samples in Y_label, respectively. The variables vProbP and vProbN are combined as vProbClass. The theoretical value is calculated according to formula (9) and denoted as vExpected. Note that the calculation uses floating-point values with eight digits after the decimal point. Figure 3 shows only two digits after the decimal point for an easy view (see Step 3.3 in Figure 3).
where $N_1$, $N_0$, and $N$ represent the numbers of the samples in category 1, the samples in category 0, and all the samples, respectively. Formula (10) calculates the Chi2 value of each k-mer category in a column of the data matrix M. Note that the calculated Chi2 values are kept to 16 digits after the decimal point. Figure 3 rounds these values for a better view (see Step 3.4 in Figure 3).
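Formulas (7) to (10) are not reproduced in this text. A reconstruction consistent with the variable names above, and with the chi-squared routine of the scikit-learn package used in this study, would read as follows. It assumes that vProbClass and vFeatureSum are row vectors, so that vExpected has the same shape as vObserved (one row per class $c$, one column per k-mer category $j$).

```latex
\[ \mathrm{vProbP} = \frac{N_1}{N} \tag{7} \]
\[ \mathrm{vProbN} = \frac{N_0}{N} \tag{8} \]
\[ \mathrm{vExpected} = \mathrm{vProbClass}^{\top} \, \mathrm{vFeatureSum} \tag{9} \]
\[ \chi^2_j = \sum_{c \in \{0, 1\}}
   \frac{\left( \mathrm{vObserved}_{c,j} - \mathrm{vExpected}_{c,j} \right)^2}
        {\mathrm{vExpected}_{c,j}} \tag{10} \]
```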
The corresponding p-value is calculated using the Python package scikit-learn version 0.20.3. This paper defines the p-value of each column in the matrix M as the sum of the p-values of the k-mer categories in this column, denoted as vFeaturePvalue (see Step 3.5 in Figure 3).
Repeat steps 3.1-3.5 until all the feature columns are evaluated (see Step 4 in Figure 3). The features are sorted by vFeaturePvalue in descending order, and the top K features with the largest vFeaturePvalue are removed because of their weak correlations with the class labels.
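Steps 3.1 to 4 can be sketched in plain Python as below. This is an illustration rather than the authors' implementation, which relies on scikit-learn. It assumes binary labels with both classes present, so the test has one degree of freedom and the chi-squared survival function reduces to erfc(sqrt(chi2 / 2)). All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def position_pvalue(column, labels):
    """Sum of the chi-squared p-values over the k-mer IDs seen at one position.

    column: k-mer IDs at this position, one per sample.
    labels: binary class labels (1 or 0), one per sample.
    """
    n = len(labels)
    n_pos = sum(labels)
    prob = {1: n_pos / n, 0: (n - n_pos) / n}   # vProbP and vProbN
    feature_sum = Counter(column)               # vFeatureSum (column sums of X)
    observed = Counter(zip(labels, column))     # vObserved (class-by-ID counts)

    total = 0.0
    for kmer_id, count in feature_sum.items():
        chi2 = 0.0
        for cls in (0, 1):
            expected = prob[cls] * count        # vExpected for this cell
            if expected > 0:
                chi2 += (observed[(cls, kmer_id)] - expected) ** 2 / expected
        # One degree of freedom: P(chi2_1 > x) = erfc(sqrt(x / 2)).
        total += math.erfc(math.sqrt(chi2 / 2.0))
    return total                                # vFeaturePvalue


def redundant_positions(encoded, labels, top_k):
    """Indices of the top_k positions with the largest summed p-value."""
    n_positions = len(encoded[0])
    pvalues = [position_pvalue([row[j] for row in encoded], labels)
               for j in range(n_positions)]
    order = sorted(range(n_positions), key=lambda j: pvalues[j], reverse=True)
    return sorted(order[:top_k]), pvalues


def drop_positions(encoded, removed):
    """Delete the given positions from every sample (training and test alike)."""
    removed = set(removed)
    return [[v for j, v in enumerate(row) if j not in removed] for row in encoded]


# Toy data: position 0 tracks the label exactly, position 1 is unrelated to it.
toy_encoded = [[1, 3], [1, 4], [2, 3], [2, 4]]
toy_labels = [1, 1, 0, 0]
removed, pvalues = redundant_positions(toy_encoded, toy_labels, top_k=1)
print(removed, drop_positions(toy_encoded, removed))
```

In the toy data, position 1 receives the larger summed p-value and is the one removed, which mirrors the rule that positions weakly associated with the class label are discarded.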

Vectorization
The features of a sample nucleotide sequence from the above section form a list of consecutive k-mer 'words', and this list is regarded as a 'sentence'. The traditional one-hot coding strategy makes the feature matrix very sparse, so the step of text vectorization is carried out through the word embedding technique. A numeric tensor is generated by the text vectorization of a sample (a nucleotide sequence).
The original one-hot coding feature vector of each k-mer has the dimension 1 × P, where P is the size of the dictionary SeqPoseDict, and the word embedding layer projects each k-mer into a Q-dimensional word vector. The recently proposed BERT model performs very well on text-based classification problems. After evaluating eight values {12, 24, 48, 96, 192, 394, 768, 1536} for the dimension of the word vectors, this study uses the same setting as the token embedding layer of the BERT-Base model [20], i.e., Q = 768. The detailed experimental data may be found in Table 6.
The word embedding layer is randomly initialized and updated during the training process of the deep learning classifier. After each word passes through the word embedding layer, it will change from an ID value to a 768-dimensional row vector, as shown in Figure 4. This step generates the SeqPose features from the samples.
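The lookup performed by the embedding layer can be sketched without a deep learning framework as follows. The table here is only randomly initialized; in the actual model its rows would be updated during training. The function names, the initialization range, and the toy vocabulary size are assumptions made for illustration.

```python
import random


def make_embedding(vocab_size, dim, seed=75):
    """Randomly initialized table with one row per ID.

    Row 0 is reserved for k-mers that are absent from SeqPoseDict.
    The initialization range is an arbitrary choice for this sketch.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size + 1)]


def embed(encoded_sequence, table):
    """Look up the row of every ID, giving a (sequence length x dim) matrix."""
    return [table[kmer_id] for kmer_id in encoded_sequence]


table = make_embedding(vocab_size=16, dim=768)  # 16 = number of distinct 2-mers
matrix = embed([1, 2, 3, 0, 0], table)
print(len(matrix), len(matrix[0]))              # 5 768
```

Stacking such matrices for a batch of samples gives the three-dimensional tensor (samples × positions × Q) that is passed to the classifier.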

Structural Description of the Classifier SpEnhancer
The two-layer enhancer prediction model spEnhancer is illustrated in Figure 5. All the nucleotide sequences are converted to the SeqPose features. Both layers of the proposed classifier spEnhancer use the bi-directional long-short term memory (BD-LSTM) model combined with the attention mechanism as the classifier, because of the reverse-complementary nature of a DNA molecule. The output of each neuron in the BD-LSTM is calculated from the input of the current neuron and its two neighboring neurons, and is then passed as the input to batch normalization, which is introduced to prevent the vanishing gradient problem. In addition, when extracting features, we want the model to pay more attention to the features with high importance. Therefore, we introduce the attention mechanism to assign weight coefficients that indicate the importance of the features, and the weight coefficients are linearly combined with the original input features and passed to the next layer. The dropout layer then masks a certain proportion of neurons from the calculations, which may effectively prevent over-fitting. The BD-LSTM layer delivers its output to the dropout layer, and the fully connected layer then generates the final prediction result using one neuron with the value 1 or 0. The two layers of the proposed model spEnhancer have the same neural network architecture, as shown in Figure 5. The first layer is trained to predict whether a nucleotide sequence is an enhancer (1) or not (0). The second layer is trained to predict whether an enhancer is a strong (1) or weak (0) one.
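The attention step, which assigns a weight coefficient to each feature and linearly combines the weights with the features, can be sketched as below. The text does not specify the scoring function or how the weighted features are aggregated, so this sketch assumes a dot product with a learned vector (here the hypothetical `score_weights`), a softmax over positions, and a weighted sum.

```python
import math


def attention_pool(hidden_states, score_weights):
    """Weight each position by a softmax score and sum the weighted states.

    hidden_states: list of T vectors (e.g. BD-LSTM outputs), each of length H.
    score_weights: hypothetical learned vector of length H used for scoring.
    """
    scores = [sum(h * w for h, w in zip(state, score_weights))
              for state in hidden_states]
    max_score = max(scores)                      # subtracted for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]          # attention coefficients, sum to 1

    dim = len(hidden_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, hidden_states))
               for d in range(dim)]
    return context, weights


toy_states = [[1.0, 0.0], [0.0, 1.0]]
context, weights = attention_pool(toy_states, score_weights=[2.0, 0.0])
print(weights, context)
```

Positions whose states score higher receive larger weights, so they dominate the vector that is passed on to the dropout and fully connected layers.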

Training Procedure of SpEnhancer
This study implements the training of the deep learning model through the packages keras version 2.2.4 and tensorflow-gpu version 1.14.0 in the Python programming language version 3.7.7. The working environment is equipped with the GPU card Nvidia GeForce RTX 2060. In a multi-CPU server environment, we use 8-fold cross-validation to determine three hyperparameters in turn: the batch size (pBatchSize), the number of neural units in the LSTM layer (pLSTMSize), and the dropout ratio (pDropoutRatio). Moreover, the results obtained with leave-one-out validation are compared with those of the existing studies.
The dataset is retrieved from the database [17]. The model uses the Adam [21] optimizer to guide the model training process, and the optimization goal is the prediction accuracy on the validation dataset.

Conclusions
This proof-of-principle study demonstrates that the SeqPose features generated by natural language processing (NLP) technologies achieve detection performances for enhancers and their enhancing strengths that are similar to those of the existing best models. The experimental data suggests that genomic sequences may be regarded as the language of life, and that the functional roles of genomic sequences may be investigated through NLP technologies.
Our experimental data also demonstrates the importance of removing the unassociated positions when training a DNA sequence-based prediction model. Retaining some positions in the DNA sequences may even reduce the overall model prediction performances. A previous study showed that many deep learning models may be improved by removing the features without contributions to the models [22]. The time-consuming training process of a deep learning model may also be sped up by removing the unassociated features.
It is anticipated that more applications of the NLP technologies will be conducted to investigate genomic functional elements.