PromGER: Promoter Prediction Based on Graph Embedding and Ensemble Learning for Eukaryotic Sequence

Promoters are non-coding DNA regions around the transcription start site that are responsible for regulating the gene transcription process. Due to their key role in gene function and transcriptional activity, accurately predicting promoter sequences and their core elements is a crucial research task in bioinformatics. At present, models based on machine learning and deep learning have been developed for promoter prediction. However, these models cannot mine the deeper biological information of promoter sequences, nor do they consider the complex relationships among promoter sequences. In this work, we propose a novel prediction model called PromGER to predict eukaryotic promoter sequences. For a promoter sequence, PromGER first utilizes four types of feature-encoding methods to extract local information within the sequence. Secondly, according to the potential relationships among promoter sequences, the whole set of promoter sequences is constructed as a graph. Furthermore, three different scales of graph-embedding methods are applied to obtain global feature information in the graph more comprehensively. Finally, combining the local and global features of sequences, PromGER analyzes and predicts promoter sequences through a tree-based ensemble-learning framework. Compared with seven existing methods, PromGER improved average specificity by 13%, accuracy by 10%, Matthew's correlation coefficient by 16%, precision by 4%, F1 score by 6%, and AUC by 9%. In addition, this study interpreted PromGER through the t-distributed stochastic neighbor embedding (t-SNE) method and SHapley Additive exPlanations (SHAP) value analysis, demonstrating the interpretability of the model.


Introduction
A promoter is a short, non-coding sequence region on genomic DNA that serves as the starting signal for gene transcription [1]. Typically located at the 5' end of the gene, the promoter can be bound by RNA polymerase II to initiate transcription under the regulation of transcription factors [2]. The biological function of the promoter is to activate gene transcription, leading to the expression of the corresponding protein or RNA. In other words, the promoter allows genes to be transcribed at specific times, locations, and levels, thus affecting protein synthesis. In eukaryotes, promoters generally consist of multiple nucleotide sequences, including the core promoter, enhancers, and silencers, which regulate the transcriptional activity of the gene by binding to transcription factors. The core promoter is the most basic part of the promoter, consisting of the transcription start site (TSS) and conserved sequence elements, such as the TATA box, CAAT box, and GC box [3]. The TATA box is the most typical and ancient core promoter element, present in organisms ranging from yeast to plants and metazoans, and is located about 25-35 base pairs upstream of the TSS.
Nevertheless, existing prediction models still face several limitations. Firstly, most models take sparse representations such as one-hot encoding as the model input. Due to their sparsity and high dimensionality, such representations are unable to sufficiently extract the biological semantic information of the nucleotides. Secondly, there are repeated structures in eukaryotic promoter sequences called promoter elements. Although promoter elements are widely observed in eukaryotes, they are not the only basis for predicting promoter sequences. Through the convolutional operation, deep-learning models tend to rely on these shallow features while ignoring the deeper non-element information of a promoter sequence [38]. Meanwhile, pooling is another major procedure in CNN methods; however, the distribution of nucleotide locations can be distorted by this compression [39].
Thirdly, since there are more non-promoter samples than promoter samples, the positive and negative classes are imbalanced. Moreover, the promoter sample sets are significantly smaller than the data sizes in other domains. Thus, when deep-learning methods are applied directly, the class imbalance and small sample size cause the model to overfit easily [40]. Finally, feed-forward networks require a fixed-length input, which limits the generalization ability of the model. The input promoter sequences are also assumed to be mutually independent, ignoring the fact that there are potential relationships between promoter sequences.
In the past few years, graph deep learning has emerged to represent biological network data within various deep-learning methods. It has gained popularity in computational biology for tasks such as graph generation, link prediction, and node classification [41]. Compared with other deep-learning models, the natural advantage of graph deep learning in capturing hidden information brings new opportunities for designing computational models in biology [42]. In promoter prediction, the global relationships among promoter sequences can be described by a graph data structure; once the graph is constructed, graph deep learning may be able to capture deeper feature representations. Meanwhile, ensemble learning is a traditional machine-learning technique that effectively addresses the challenges of small sample size, high dimensionality, and data noise, and it has also been employed in models for promoter prediction [43]. With its architectures and strategies, ensemble learning has led to remarkable and widespread breakthroughs in bioinformatics [44][45][46].
In this study, we developed a promoter sequence prediction model called PromGER to predict eukaryotic promoters for multiple species by combining graph-embedding methods and ensemble learning. The main contributions of the paper are summarized as follows: (1) We extract a sequence's local features to illustrate the biological attributes of nucleotides, including the nucleotide chemical property (NCP), nucleotide density (ND), electron-ion interaction pseudopotential (EIIP), and bi-profile Bayes (BPB) features. (2) The promoter sequences are modeled as a graph, where nodes are the promoter sample sequences and edges represent potential contacts among sequences, aiming at the ability to obtain global information. (3) Three graph-embedding methods at various scales, involving the single node, group community, and global structure, are applied to extract the relationship representation features within the graph. The features are further integrated into a popular tree-based ensemble-learning framework named CatBoost to train the classifier. (4) We evaluated and verified the prediction performance of our model through an ablation study, the t-SNE method, and SHAP values. Compared with different promoter prediction models on independent test datasets, the results indicated the effectiveness of PromGER for promoter prediction.

Overall Framework
The overall framework of PromGER mainly includes four steps. In the first step, we collect reliable datasets. In the second step, we first perform a train-test split, as required by the BPB features. Then, we introduce multiple sequence features to encode the promoters accordingly, including NCP, ND, EIIP, and BPB. It should be noted that, in this feature-calculation phase, we use the entire data for one species, including both the train and test sets. In the graph-construction stage, we likewise use the entire data, including train and test, so that the subsequent graph embedding can take place. In the third step, PromGER employs an ensemble-learning framework to train the prediction model based on the previously split data. The last step evaluates the prediction performance of PromGER on independent test datasets, designs the ablation study, and provides an interpretability analysis of PromGER from the t-SNE and SHAP viewpoints. The framework of PromGER is shown in Figure 1. In the graph embedding, the red line or square represents the various scales, involving the single node, group community, and global structure. In the ensemble learning, the green triangle represents a positive sample, and the blue triangle represents a negative sample.

Dataset Collecting
Constructing a proper and standard dataset is a fundamental step in designing a robust prediction model [47]. In this study, we used the eukaryotic data organized by Zhang et al. [48], which contain comprehensive and up-to-date datasets for multiple species and support assessment research on eukaryotic promoters. For PromGER, promoter prediction is a binary classification task, where the promoter samples are divided into positive samples and negative samples. The positive samples are promoter sequences labeled as 1, while the negative samples are non-promoter sequences labeled as 0. The positive samples are collected from the Eukaryotic Promoter Database (EPDnew) and the DataBase of Transcriptional Start Sites (DBTSS). EPDnew [49] is an annotated, non-redundant collection of eukaryotic POL II promoters for which the transcription start site has been determined experimentally. DBTSS [50] is another database recording the biological information of TSSs. Meanwhile, the negative samples are obtained from the Exon-Intron Database (EID) [51], which documents the exon and intron information of the corresponding species.
To objectively assess the relative performance of alternative approaches, this study chose promoter sequences from Homo sapiens (H. sapiens) and Rattus norvegicus (R. norvegicus), which are widely used by existing promoter predictors. In eukaryotic promoter prediction, most models categorize promoter sequences into TATA-containing types (i.e., promoter sequences with the TATA box element) and TATA-less types (i.e., promoter sequences without the TATA box element). The lengths of input sequences (251, 300, or 1001 bps) are usually not uniform among existing promoter predictors. Therefore, the input of each predictor was spliced to the corresponding sequence length by extracting from the upstream to downstream regions of the TSS. For each species, the CD-HIT-EST tool [52] with an identity threshold of 0.8 [53] was employed to exclude redundant sequences.
In the above manner, there are two TATA types for each species: TATA-containing and TATA-less. Moreover, the input sample length of every TATA type can be divided into 251 bps, 300 bps, and 1001 bps. For the balanced dataset, the non-promoter sequences have the same amount as the promoter sequences. The final statistics of the balanced datasets in this paper are shown in Table 1. Generally, eukaryotes have more non-promoters than promoters, which means that evaluation on imbalanced datasets is essential for PromGER. To compare model performance fully, we also selected corresponding imbalanced datasets for H. sapiens and R. norvegicus, which include more negative samples than positive samples. The specific numbers in the imbalanced datasets are shown in Table 2.
Moreover, we used the Drosophila melanogaster (D. melanogaster) and Zea mays (Z. mays) datasets with 300 bps for the ablation study and interpretability analysis, respectively, as shown in Tables 1 and 2. The reasons are that: (i) the 300 bps region is representative, as the core elements of transcription in eukaryotes are located between −250 bps and +50 bps of the TSS [54]; and (ii) both datasets were newly collated by Zhang et al. [48], making the model more reliable and convincing in terms of sample size and data recency. For the subsequent model training, all datasets were split into training sets (including training and validation datasets) and independent test sets at a ratio of 4:1.

Sequence Feature-Encoding Method
As the basic building blocks of DNA, there are generally four types of nucleotides in eukaryotic organisms. These are Adenine (A), Thymine (T), Guanine (G), and Cytosine (C). Instead of one-hot encoding, we utilized four encoding methods that cover the biochemical and frequency properties of nucleotides in this study.

Nucleotide Chemical Property (NCP) Feature Encoding
The variety of chemical structures in nucleotides is essential, with ring structures, hydrogen bonds, and functional groups being the most important [55]. Pyrimidines and purines share the six-membered ring as a chemical property. Purines A and G contain both a pentagon and a hexagon, whereas pyrimidines C and T have only one hexagon. In terms of hydrogen bonds, it is worth noting that the number of hydrogen-bond donors is unequal, resulting in the formation of three hydrogen bonds between C and G but only two between A and T, in turn leading to differences in hydrogen-bond strength. Furthermore, G and T carry keto groups, whereas amino groups are characteristic of the structures of A and C.
As in the above manner, one nucleotide n can be encoded by a three-dimensional vector (r, h, f):

r = \begin{cases} 1, & n \in \{A, G\} \\ 0, & n \in \{C, T\} \end{cases} \qquad h = \begin{cases} 1, & n \in \{A, T\} \\ 0, & n \in \{C, G\} \end{cases} \qquad f = \begin{cases} 1, & n \in \{A, C\} \\ 0, & n \in \{G, T\} \end{cases}

where r, h, and f represent the ring-structure, hydrogen-bond, and functional-group encoding of nucleotide n, respectively.

Nucleotide Density (ND) Feature Encoding
The nucleotide density (ND) can measure the correlation of position and frequency by extracting nucleotide weight information within a sequence [56], which is formally stated as:

d_i = \frac{1}{i} \sum_{j=1}^{i} \mathbb{1}(n_j = n_i), \qquad \mathbb{1}(n_j = n_i) = \begin{cases} 1, & n_j = n_i \\ 0, & \text{otherwise} \end{cases}

Here, n_i is the i-th nucleotide in a DNA sequence counted from the first position, and each DNA sequence is represented as a one-dimensional vector.
The NCP- and ND-encoding methods have been successfully applied in bioinformatics tasks such as motif discovery. Existing studies have combined them to form the NCP-ND feature encoding, which we also use in this paper. In the NCP-ND feature encoding, a nucleotide within the promoter sequence can be represented as a four-dimensional vector (r, h, f, nd), where r, h, and f indicate the NCP-encoding result of nucleotide n, and nd is the ND-encoding result of nucleotide n, taking into account both chemical properties and element frequency.
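The ND calculation can be sketched as a simple running-frequency implementation of the density formula above (function name is ours, for illustration only):

```python
def nd_encode(seq):
    """Nucleotide density: d_i = (occurrences of n_i in positions 1..i) / i."""
    counts = {}
    out = []
    for i, n in enumerate(seq.upper(), start=1):
        counts[n] = counts.get(n, 0) + 1  # running count of this nucleotide
        out.append(counts[n] / i)         # density up to position i
    return out

print(nd_encode("AATC"))  # [1.0, 1.0, 0.333..., 0.25]
```

Appending each nd value to the corresponding NCP triple yields the four-dimensional NCP-ND vector described above.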

Electron-Ion Interaction Pseudopotential (EIIP) Feature Encoding
By measuring the energy of delocalized electrons in nucleotides, the electron-ion interaction pseudopotential (EIIP) [57] is a single indicator sequence that shows how the free electron energies are distributed throughout the DNA sequence. EIIP represents the nucleotides T, A, G, and C as 0.1335, 0.1260, 0.0806, and 0.1340, respectively, which can reduce the computing cost.
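A minimal sketch of EIIP encoding, using the four values quoted above:

```python
# EIIP values for T, A, G, C as quoted in the text.
EIIP = {"T": 0.1335, "A": 0.1260, "G": 0.0806, "C": 0.1340}

def eiip_encode(seq):
    """Map a DNA sequence to its free-electron-energy indicator signal."""
    return [EIIP[n] for n in seq.upper()]

print(eiip_encode("GATC"))  # [0.0806, 0.126, 0.1335, 0.134]
```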

Bi-Profile Bayes (BPB) Feature Encoding
Originally employed for protein methylation sites, the bi-profile Bayes (BPB) [58] encoding system has been widely applied to granzyme cleavage site prediction and enhancer detection. The nucleotide n_i of a promoter sequence at the i-th position can be encoded by the BPB method as:

n_i \rightarrow (p_i^{+}, p_i^{-})

where p_i^{+} denotes the posterior probability of the nucleotide n_i in the positive samples of the training datasets, and p_i^{-} denotes the posterior probability of the nucleotide n_i in the negative samples of the training datasets.
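The position-specific probability tables behind BPB can be sketched as follows. This is a frequency-count illustration on toy data; the helper names are ours, not from the original implementation:

```python
from collections import Counter

def bpb_tables(pos_seqs, neg_seqs):
    """Position-specific nucleotide counts for the positive/negative sets."""
    def table(seqs):
        length = len(seqs[0])
        return [Counter(s[i] for s in seqs) for i in range(length)]
    return table(pos_seqs), table(neg_seqs), len(pos_seqs), len(neg_seqs)

def bpb_encode(seq, pos_t, neg_t, n_pos, n_neg):
    """Encode each nucleotide n_i as (p_i_plus, p_i_minus)."""
    return [(pos_t[i][n] / n_pos, neg_t[i][n] / n_neg)
            for i, n in enumerate(seq)]

pos = ["TATA", "TACA"]   # toy positive training sequences
neg = ["GGCC", "GACC"]   # toy negative training sequences
pt, nt, npos, nneg = bpb_tables(pos, neg)
print(bpb_encode("TATA", pt, nt, npos, nneg))
# [(1.0, 0.0), (1.0, 0.5), (0.5, 0.0), (1.0, 0.0)]
```

Because the probabilities are estimated from the training split only, the train-test split must happen before BPB feature calculation, as noted in the Overall Framework.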

Graph Construction-Fast Linear Neighborhood Similarity Approach (FLNSA)
The relationships among the entire set of promoter sequences can be described by a graph defined as:

G = (X, E)

where X = \{x_1, x_2, \ldots, x_m\} represents the set of nodes, and E = \{(i, j) \mid x_i \text{ is adjacent to } x_j\} is the set of edges. The relations among samples in G can be measured using the fast linear neighborhood similarity approach (FLNSA) [59]. By fusing the NCP-ND, EIIP, and BPB features, a node vector of length l can be created, and an edge is added whenever two sample nodes have a potential relation. The dataset's sequence samples are all converted into vectors x_1, x_2, \ldots, x_m, where x_i (1 ≤ i ≤ m) is the i-th sample vector and m is the sample count. The vectors are gathered into an m × l matrix X under the FLNSA assumption that each sample node can be reconstructed as a linear weighting of the other nodes. FLNSA optimizes the following function for the node reconstruction errors:

\min_{W} \; \| X - (C \circ W) X \|_F^2 + u \| W \|_F^2 \quad \text{s.t.} \quad W e = e, \; W \geq 0

where \| \cdot \|_F is the Frobenius norm, \circ represents the Hadamard product, u is the regularization parameter, and e = (1, 1, \ldots, 1)^T is an m-dimensional vector with all values equal to 1. C is an m × m indicator matrix with the element c(i, j) defined as:

c(i, j) = \begin{cases} 1, & x_j \in N(x_i) \\ 0, & \text{otherwise} \end{cases}

where N(x_i) is the set of nearest-neighborhood nodes of x_i. By setting a neighborhood ratio and calculating the Euclidean distance between x_i and the other sample nodes, N(x_i) can be formed. W is an m × m matrix whose i-th row measures the other nodes' reconstruction-weight contributions to x_i. With the Lagrange multiplier method and the Karush-Kuhn-Tucker conditions, a multiplicative update rule for W can be derived; the matrix W is randomly initialized and iteratively updated until convergence. The converged adjacency matrix W defines the graph. Since the graph-embedding methods require a connected graph as input, FLNSA links every sample node to its nearest c (0 < c < m) samples to make the graph G connected.
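The neighborhood-reconstruction idea can be illustrated with a simplified sketch that solves each node's local least-squares weights directly, rather than using FLNSA's iterative multiplicative update; the function name and regularization constant are ours:

```python
import numpy as np

def linear_neighborhood_weights(X, k=2, reg=1e-3):
    """Simplified linear-neighborhood similarity (not the exact FLNSA update):
    each node is reconstructed from its k nearest neighbors, and the
    nonnegative, row-normalized weights serve as graph edge weights."""
    m = X.shape[0]
    W = np.zeros((m, m))
    for i in range(m):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                        # exclude the node itself
        nbrs = np.argsort(d)[:k]             # N(x_i): k nearest neighbors
        Z = X[nbrs] - X[i]                   # centered neighbor matrix
        G = Z @ Z.T + reg * np.eye(k)        # regularized local Gram matrix
        w = np.linalg.solve(G, np.ones(k))   # least-squares reconstruction
        w = np.clip(w, 0, None)              # enforce W >= 0
        W[i, nbrs] = w / w.sum()             # rows sum to 1 (We = e)
    return W

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
W = linear_neighborhood_weights(X, k=2)
print(np.allclose(W.sum(axis=1), 1.0))  # True
```

Restricting each row to the k nearest neighbors also guarantees every node has edges, which is the connectivity requirement mentioned above.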

Graph-Embedding Feature-Encoding Method
The graph embedding maps each node to a dense, low-dimensional feature vector and tries to preserve the connection strengths and attribute correlations between vertices [60]. We apply three graph-embedding approaches to extract information from the single-node, group-community, and global-structure aspects.

Single Node-Node2vec
Node2vec [61] is a micro-level graph-embedding algorithm that prioritizes the individual node. With its focus on a single node, Node2vec applies an upgraded version of the random walk approach to gather information on how a node interacts with its neighbors.
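The biased second-order walk at the heart of Node2vec can be sketched in pure Python. Parameters p and q control the return/outward bias; this illustrates the walk generation only, not the subsequent skip-gram training that produces the embeddings:

```python
import random

def node2vec_walk(adj, start, length, p=1.0, q=0.5, seed=0):
    """One node2vec-style biased random walk on an adjacency dict.
    1/p weights returning to the previous node; 1/q weights moving outward."""
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break
        if len(walk) == 1:                # first step: uniform choice
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for n in nbrs:
            if n == prev:                 # return to previous node
                weights.append(1.0 / p)
            elif n in adj[prev]:          # stay within prev's neighborhood
                weights.append(1.0)
            else:                         # explore outward
                weights.append(1.0 / q)
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(node2vec_walk(adj, start=0, length=5))
```

With q < 1 the walk behaves more like depth-first exploration; q > 1 keeps it close to the start node.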

Group Community-SocDim
A meso-level structural-representation-learning method called SocDim [62] focuses on group interactions. It splits the graph into separate, non-overlapping communities using a modularity strategy. Because nodes within a community share attributes that differ from those of nodes outside it, this grouping greatly simplifies node analysis and computational learning in graphs.
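The modularity idea can be sketched with NumPy: SocDim extracts latent social dimensions from the top eigenvectors of the modularity matrix B = A - d dᵀ/(2m). This is a compact sketch of that construction, not the original implementation:

```python
import numpy as np

def socdim_embedding(A, dim=2):
    """Latent social dimensions (SocDim sketch): top eigenvectors of the
    modularity matrix B = A - d d^T / (2m) of an undirected graph."""
    d = A.sum(axis=1)
    two_m = d.sum()                      # 2m = sum of all degrees
    B = A - np.outer(d, d) / two_m       # modularity matrix
    vals, vecs = np.linalg.eigh(B)       # symmetric eigendecomposition
    order = np.argsort(vals)[::-1]       # largest eigenvalues first
    return vecs[:, order[:dim]]

# Two obvious communities: {0, 1} and {2, 3}, joined by one edge.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Z = socdim_embedding(A, dim=1)
print(Z.shape)  # (4, 1)
```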

Global Structure-GraRep
The node vectors produced by GraRep [63] have a global character because transition information over various step sizes is reflected in the mapped subspaces. GraRep can generate a low-dimensional vector representation at the scale of the entire graph by taking the similarity of long-distance nodes into account.
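A compact sketch of GraRep's multi-step construction: log-transformed k-step transition matrices are factorized by SVD and the factors are concatenated. The shift and clipping constants below are simplifications of the original formulation:

```python
import numpy as np

def grarep_embedding(A, k_max=3, dim=2):
    """GraRep sketch: stack SVD factors of log k-step transition matrices."""
    deg = A.sum(axis=1, keepdims=True)
    P = A / deg                          # 1-step transition probabilities
    Pk = np.eye(A.shape[0])
    blocks = []
    for _ in range(k_max):
        Pk = Pk @ P                      # k-step transition matrix
        Y = np.log(np.maximum(Pk, 1e-8)) - np.log(1.0 / A.shape[0])
        Y = np.maximum(Y, 0)             # keep positive log-ratios only
        U, S, _ = np.linalg.svd(Y)
        blocks.append(U[:, :dim] * np.sqrt(S[:dim]))
    return np.hstack(blocks)             # global, multi-step representation

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Z = grarep_embedding(A)
print(Z.shape)  # (4, 6): dim columns for each of the k_max step sizes
```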

Ensemble-Learning Strategy
Ensemble learning describes a set of strategies that mix numerous "base" models to carry out supervised and unsupervised tasks, rather than creating a single model. The CatBoost model [64] is a classical supervised ensemble approach. In the boosting process, the training set is used to generate a classifier with above-average accuracy, and new base classifiers are then added to produce an ensemble in which the joint decision rules reach a higher accuracy level. As a result, the classification performance is improved.
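The boosting principle can be illustrated with an AdaBoost-style sketch over one-feature decision stumps. This is a generic illustration of boosting with labels in {-1, +1}, not CatBoost's ordered boosting with categorical-feature handling:

```python
import math

def train_boosted_stumps(X, y, rounds=10):
    """AdaBoost-style boosting with threshold stumps (labels in {-1, +1})."""
    n = len(X)
    w = [1.0 / n] * n                  # sample weights
    ensemble = []                      # (alpha, feature, threshold, sign)
    for _ in range(rounds):
        best = None
        for f in range(len(X[0])):     # exhaustive stump search
            for t in sorted({x[f] for x in X}):
                for s in (1, -1):
                    pred = [s if x[f] > t else -s for x in X]
                    err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                    if best is None or err < best[0]:
                        best = (err, f, t, s, pred)
        err, f, t, s, pred = best
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this weak learner
        ensemble.append((alpha, f, t, s))
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, pred)]
        z = sum(w)
        w = [wi / z for wi in w]       # re-weight misclassified samples
    return ensemble

def predict(ensemble, x):
    score = sum(a * (s if x[f] > t else -s) for a, f, t, s in ensemble)
    return 1 if score > 0 else -1

X = [[0.0], [1.0], [2.0], [3.0]]
y = [-1, -1, 1, 1]
model = train_boosted_stumps(X, y, rounds=5)
print([predict(model, x) for x in X])  # [-1, -1, 1, 1]
```

Each round up-weights the samples the current ensemble gets wrong, so the next stump concentrates on the hard cases; that is the "joint decision rules" effect described above.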

Performance Evaluation
To fully verify the effectiveness of our proposed method, six evaluation metrics are used to assess the method's performance, including sensitivity (Sen), specificity (Spe), accuracy (Acc), Matthew's correlation coefficient (MCC), precision (Pre), and F1 score. The six metrics are defined as follows:

Sen = \frac{TP}{TP + FN}, \qquad Spe = \frac{TN}{TN + FP}, \qquad Acc = \frac{TP + TN}{TP + TN + FP + FN}

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \qquad Pre = \frac{TP}{TP + FP}, \qquad F1 = \frac{2 \times Pre \times Sen}{Pre + Sen}

where TP and TN represent the numbers of true-positive and true-negative samples, respectively, and FP and FN represent the numbers of false-positive and false-negative samples, respectively. Based on the receiver operating characteristic (ROC) curve, the area under the curve (AUC) has also been used to measure the performance of the model. In general, the closer the AUC value is to 1, the more realistic and reliable the overall performance is.
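The six metrics can be computed directly from the confusion-matrix counts, for example:

```python
import math

def metrics(tp, tn, fp, fn):
    """Compute the six evaluation metrics from confusion-matrix counts."""
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp)
    f1 = 2 * pre * sen / (pre + sen)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Sen": sen, "Spe": spe, "Acc": acc, "MCC": mcc, "Pre": pre, "F1": f1}

print(metrics(tp=45, tn=40, fp=10, fn=5))
```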

Performance on Balanced Datasets
In this section, we tested models on balanced datasets, as shown in Table 3. For H. sapiens (251 bps), PromGER was compared with CNNProm, NNPP2.2 [20], and FProm [45] on the independent test dataset. As a result, PromGER acquired the best metrics for both TATA-containing and TATA-less promoters, with the exception of Spe and Pre, which were only 0.0043 and 0.0028 lower than those of CNNProm. Compared with the suboptimal predictor CNNProm, PromGER improved the AUC by 0.0702 for predicting TATA-containing promoters and by 0.086 for TATA-less promoters.
For H. sapiens (300 bps), PromGER was compared with iProEP [27], Depicter [39], and DeePromoter [38] on the independent test dataset. We can observe that PromGER had the best performance in terms of almost all major measurements among these predictors. However, the Sen of PromGER was 0.0494 and 0.0244 lower than the suboptimal values, from Depicter for TATA-containing promoters and from DeePromoter for TATA-less promoters, respectively.
For R. norvegicus (251 bps), PromGER was compared with CNNProm and NNPP2.2 on the independent test dataset because FProm did not consider this situation. PromGER achieved the highest Acc, MCC, F1, and AUC on both TATA-containing and TATA-less datasets. Its AUC is 0.9923 and 0.9488, which outperformed other methods.
For R. norvegicus (300 bps), PromGER was compared with Depicter and DeePromoter on the independent test dataset, because iProEP did not consider this situation. PromGER is also superior to other baselines by achieving an Spe of 0.945, Acc of 0.9395, MCC of 0.8791, Pre of 0.9444, F1 of 0.9392, and AUC of 0.9841 for TATA-containing sequences. Moreover, PromGER has improved slightly in all six measures except Sen on the TATA-less dataset.

Performance on Imbalanced Datasets
In this section, we tested models on imbalanced datasets, as shown in Table 4. On the H. sapiens (251 bps) dataset, PromGER obtained the highest metrics except Sen for the TATA-containing promoters, which was 0.0854 lower than that of NNPP2.2. For the TATA-less promoters, PromGER performed the overall best, with an AUC of 0.9932. In terms of MCC and F1, the crucial metrics for imbalanced datasets, PromGER also improved its predictive performance.
On the H. sapiens (300 bps) dataset, PromGER also performed well. Compared with other predictors, PromGER predicted better, with an Acc of 0.9338 and 0.9560 on the TATA-containing and TATA-less datasets, respectively. Its MCC and F1 were slightly lower than those of Depicter.
On the R. norvegicus (251 bps) and R. norvegicus (300 bps) datasets, PromGER achieved a better performance than Depicter and DeePromoter in terms of Spe, Acc, MCC, Pre, F1, and AUC. The Sen of PromGER was 0.0228 lower and 0.0286 lower than CNNProm and Depicter on the TATA-containing datasets of 251 bps and 300 bps, respectively. For TATA-less datasets of 300 bps, PromGER was 0.0353 lower than Depicter.
We also took into account a model comparison of 1001 bps [46] in addition to the 251 bps and 300 bps input, though such tools were uncommon. As shown in Figure 2, PromGER still maintained greater stability for longer sequences.

Ablation Study of Different Feature Combinations
In order to verify the effectiveness of the sequence features on promoter prediction, we focused on seven different feature combinations as the input of PromGER. All feature ablations are applied before the graph embedding, covering NCP_ND, EIIP, and BPB. On the independent test datasets of D. melanogaster, we used more extremely imbalanced data besides the balanced datasets, where the ratio of positive to negative samples (i.e., promoters:non-promoters) is 1:5. We introduced the Kappa coefficient as a measure of consistency:

\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_e = \frac{\sum_{m} a_m b_m}{n^2}

where p_o is the observed agreement (the fraction of correctly predicted samples), n is the total number of samples, a_m denotes the actual number of samples in class m, and b_m denotes the predicted number of samples in class m, accordingly. Table 5 shows the comparison results, with the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve displayed in Figure 3. For the balanced datasets, the Kappa coefficient of NCP_ND + EIIP + BPB was increased by 16.65% and 0.23% compared with the worst (only NCP_ND) and the suboptimal combination (EIIP + BPB) on the TATA-containing type. Meanwhile, on the TATA-less type, although the trend was similar to the TATA-containing type, the corresponding worst and suboptimal combinations were only EIIP and NCP_ND + BPB, respectively.
For the imbalanced dataset, the Kappa coefficient of NCP_ND + EIIP + BPB was increased by 7.15% and 1.88% compared with the worst (NCP_ND + EIIP) and the suboptimal combination (only NCP_ND) on the TATA-containing type, though the ROC and PR curves showed minor fluctuations. On the TATA-less type, NCP_ND + EIIP + BPB was better in all aspects.
Regarding the combination of graph-embedding features, we also designed ablation experiments, shown in Table 6. For the balanced datasets, the Kappa coefficient of Node2vec + SocDim + GraRep on the TATA-containing type was increased by 8.48% compared with the worst (only SocDim) and 0.45% compared with the suboptimal combination (Node2vec + SocDim). Meanwhile, on the TATA-less type, the corresponding results are 4.81% (only Node2vec) and 1.56% (only GraRep). For the imbalanced datasets, the Kappa coefficient of Node2vec + SocDim + GraRep was increased by 2.99% compared with the worst (only SocDim) and 0.05% compared with the suboptimal combination (only GraRep). Likewise, on the TATA-less type, the corresponding results are 3.41% (only GraRep) and 0.49% (Node2vec + SocDim). Therefore, extracting graph-embedding information from the single-node, group-community, and global-structure aspects is helpful for the prediction.
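The Kappa coefficient used in these ablations can be computed as a direct implementation of the agreement formula:

```python
def cohen_kappa(actual, predicted):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_e is the chance
    agreement from the class marginals a_m (actual) and b_m (predicted)."""
    n = len(actual)
    classes = set(actual) | set(predicted)
    p_o = sum(a == p for a, p in zip(actual, predicted)) / n
    p_e = sum(actual.count(m) * predicted.count(m) for m in classes) / n ** 2
    return (p_o - p_e) / (1 - p_e)

actual    = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 1]
print(round(cohen_kappa(actual, predicted), 4))  # 0.3333
```

Unlike raw accuracy, kappa discounts agreement expected by chance, which is why it is a sensible consistency measure on the 1:5 imbalanced splits used here.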

Model Visualization Interpretation
Deep-learning and machine-learning methods are often "black boxes" whose hidden intermediate processes are not visible. Improving the interpretability of such models therefore benefits their credibility, and this has been a major target of research efforts. To provide an explanation for the predictive behavior of PromGER, we employed a visualization algorithm called t-SNE, which maps data from the high-dimensional feature space to a two-dimensional space by nonlinear dimensionality reduction. To further understand the prediction results of PromGER, we introduced SHAP value analysis to enhance the interpretation of the decision results.

Visualization in Graph-Embedding Period
The comparison results on the Z. mays datasets are shown in Figure 4. Before graph embedding, the two types of points are mixed together. However, the feature vectors can be well-separated after graph-embedding processing.


Visualization in Ensemble-Learning Periods
The SHAP value is a universal measure of feature importance from coalitional game theory, which provides an importance value for each feature in the ensemble learning. It makes the predictions comprehensible by assessing the effect of each feature on the model training. To identify which features are more crucial for PromGER, we used the SHAP value on the Z. mays datasets. Figure 5 shows the results based on SHAP values for both TATA-containing and TATA-less types of promoters, revealing the top features ranked by the sum of SHAP-value magnitudes over all samples for both balanced and imbalanced types, which emphasizes the distribution of each feature's influence on the PromGER output. The results show that, on the four datasets, PromGER relies on slightly different important features, including both the sequence's local features and graph-embedded features, indicating that these two types of features jointly influence the prediction results.
Meanwhile, GraphEmbeddings_31 shows a high positive correlation across different datasets, and BPB_400 shows a negative correlation on TATA-less datasets. Moreover, the top features for the TATA-less prediction are several GraphEmbedding features, implying the importance of graph-embedding methods for TATA-less promoter prediction.

Discussion
Promoters are crucial components of the non-coding regions of DNA, and their study contributes to research on human diseases. Thus, predicting eukaryotic promoter sequences is valuable for understanding the transcriptional control of genes. Various methods based on machine learning and deep learning have been applied to promoter prediction tasks in recent years. In this study, we propose a promoter prediction model, called PromGER, to predict eukaryotic promoter sequences. PromGER extracts four attribute features of the nucleotides within each promoter sequence, covering physical features, biochemical properties, molecular characteristics, and frequency distributions. In addition, we obtain a sequence-to-sequence relationship representation through graph-embedding methods. Compared with most other models, this multi-scale feature extraction approach is able to capture deeper semantic and biological information in promoter sequences.
We evaluated PromGER on different datasets, and the results demonstrate that PromGER achieves the best performance among the compared models. Unlike CNN-based models, PromGER imposes no constraints on the input sequence, which makes it more generalizable to promoter prediction tasks regardless of input length or species. Our model outperformed the existing models in Acc, MCC, Pre, F1, and AUC, suggesting that PromGER is a robust and reliable predictor. Meanwhile, the Sen of PromGER sometimes falls slightly below that of the best competing models. Those models tend to rely heavily on intuitive, shallow features such as core promoter elements. However, fundamental features at specific positions, such as the TATA box, are not decisive on their own. Depending on the presence or absence of these local elements to discriminate promoters reduces a model's generalization, as reflected in low Spe values. Accordingly, PromGER achieves a balance between Sen and Spe. More importantly, PromGER extracts the potential relationships among promoter sequences ignored by existing models. For a promoter sequence, three different scales of graph-embedding methods are applied to represent the global features in a graph. In addition to the TATA-containing datasets, PromGER also achieves better results on the TATA-less datasets, presumably because the potential connections between sequences captured by the graph-embedding methods contribute to correct predictions. Attributed to CatBoost ensemble learning, PromGER maintains superior performance on the imbalanced dataset. Through the ablation study on different features, we verified the validity of the encodings; the results indicate that no single feature is sufficient and that feature combination is a decisive factor for the prediction.
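To illustrate how class weighting lets a tree-based boosting ensemble cope with an imbalanced promoter dataset, the following sketch uses scikit-learn's GradientBoostingClassifier with inverse-frequency sample weights as a stand-in for CatBoost; the synthetic data and all parameters are assumptions for demonstration only, not PromGER's actual configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

# Imbalanced toy data standing in for promoter / non-promoter
# feature vectors (local encodings plus graph embeddings).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Inverse-frequency sample weights counteract class imbalance,
# analogous in spirit to CatBoost's class-weighting options.
w = np.where(y_tr == 1, (y_tr == 0).sum() / (y_tr == 1).sum(), 1.0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr, sample_weight=w)
mcc = matthews_corrcoef(y_te, clf.predict(X_te))
```

Up-weighting the minority class pushes the ensemble away from trivially predicting the majority label, which is why metrics such as MCC are more informative than raw accuracy on imbalanced promoter datasets.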
Finally, we provided a visual interpretation of the model with the t-SNE algorithm and SHAP values, which demonstrates that these features are effective for predicting different types of promoters and improve the prediction performance of PromGER. Moreover, the information within the local features can be complemented by the features from the graph embedding, implying the powerful potential of graph deep learning.
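A t-SNE projection of the kind used for such visual inspection can be sketched as follows; the two synthetic Gaussian clusters stand in for promoter and non-promoter feature vectors and are an assumption for illustration, with the feature dimension and perplexity chosen arbitrarily.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two synthetic clusters standing in for promoter vs. non-promoter
# feature vectors (local features concatenated with graph embeddings).
feats = np.vstack([rng.normal(0.0, 1.0, (50, 32)),
                   rng.normal(4.0, 1.0, (50, 32))])

# Project the high-dimensional features to 2-D; well-separated
# clusters in the embedding suggest the features are discriminative.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)
```

The resulting 2-D coordinates can then be scatter-plotted with class labels as colors to judge, by eye, how well the combined features separate the two classes.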

Conclusions
In this article, we proposed a novel computational model called PromGER to predict eukaryotic promoters by combining graph embedding and ensemble learning. The experimental results show that PromGER can accurately predict eukaryotic promoters of both TATA-containing and TATA-less types. This approach may play a crucial role in complementing existing methods for predicting promoters and other biological sites. Future research will consider the potential impact of global information more carefully, and further improvements can be expected by introducing more graph deep-learning methods.