A Novel Predictor for the Analysis and Prediction of Enhancers and Their Strength via Multi-View Features and Deep Forest

Abstract: Enhancers are short DNA segments (50–1500 bp) that effectively activate gene transcription when transcription factors (TFs) are present. Genetic differences in enhancers are correlated with numerous human disorders, including cancer and inflammatory bowel disease. In computational biology, the accurate categorization of enhancers can yield important information for drug discovery and development. High-throughput experimental approaches are considered vital tools for studying the key characteristics of enhancers; however, because these techniques require substantial labor and time, it can be difficult for researchers to identify enhancers and their strengths with them alone. Computational techniques are therefore considered an alternative strategy for handling this issue. Based on the types of algorithms used to construct predictors, current methodologies can be divided into three primary categories: traditional ML-based techniques, ensemble-based methods, and deep learning-based approaches. In this study, we developed NEPERS, a novel two-layer deep forest-based predictor for the accurate prediction of enhancers and their strength. NEPERS separates enhancers from non-enhancers at the first level and strong from weak enhancers at the second level. To evaluate the effectiveness of feature fusion, a block-wise deep forest and other algorithms were combined with multi-view features such as PSTNPss, PSTNPdss, CKSNAP, and NCP via 10-fold cross-validation and independent testing. Our proposed technique outperforms competing models across all metrics, with an ACC of 0.876, Sen of 0.864, Spe of 0.888, MCC of 0.753, and AUC of 0.940 for layer 1 and an ACC of 0.959, Sen of 0.960, Spe of 0.958, MCC of 0.918, and AUC of 0.990 for layer 2 on the benchmark dataset.
Similarly, for the independent test, the ACC, Sen, Spe, MCC, and AUC were 0.863, 0.865, 0.860, 0.725, and 0.948 for layer 1 and 0.890, 0.940, 0.840, 0.784, and 0.951 for layer 2, respectively. This study provides conclusive insights for the accurate and effective detection and characterization of enhancers and their strengths.


Introduction
Cells are the basic unit of life and are classified as prokaryotic (the primitive form of cells) or eukaryotic (more advanced and complex forms of cells). In eukaryotic cells, the nucleus is the organelle that contains the chromosomes. Chromosomes are composed of both DNA and proteins, and DNA is composed of nucleotide chains [1]. All cellular information is stored in DNA in the form of a specific arrangement of nucleotides, and DNA is responsible for passing this information from parent to daughter cells. During the flow of information in a cell, DNA is used to make an RNA molecule (transcription), and the RNA molecule is then used to make a protein (translation). Transcription involves the synthesis of RNA molecules from DNA [2]. Promoters, enhancers, silencers, and transcription factors are involved in the regulation of transcription; the first three are located within the DNA. DNA expression (transcription and translation) can be controlled by regulatory elements such as enhancers and silencers: promoters and enhancers increase transcription, while silencers decrease it. Transcription factors are not part of the DNA molecule itself but are proteins that bind to promoters and enhancers to regulate transcription [2].
Enhancers are distal regulatory elements that affect a number of cell processes, such as gene expression, tissue specificity, cell growth, carcinogenesis, and viral activity in the cell. However, not all enhancers in the DNA affect the promoter sites of genes, as some of them are active in the cell whereas others are inactive. Enhancers are of particular interest to researchers because they can alter cell expression, and mutations (permanent alterations in DNA) in enhancers can cause serious genetic diseases [3]. In addition to other regulatory elements, enhancers must be classified because they exist in several states, each with unique biological activities and effects on genes [4].
Various conventional methods have been investigated to predict enhancers and their strengths (details are provided in the Related Work section). Although existing methods have yielded significant results, there is still room for improvement, as the recent accumulation of high-throughput data on enhancers has raised the need for efficient computational methods capable of accurately predicting enhancer positions at the genome-wide level. This led us to introduce a new two-layer bioinformatics tool, the NEw Predictor for EnhanceRs and their Strength prediction (NEPERS), whose overall framework is shown in Figure 1. The significant contributions of this study are summarized below:

•
NEPERS formulates the prediction of enhancers and their strength as a binary classification problem and solves it using a cascade deep forest algorithm.

•
It takes advantage of multi-view features, such as position-specific trinucleotide propensity based on single-stranded (PSTNPss) characteristics, position-specific trinucleotide propensity based on double-stranded (PSTNPdss) characteristics, the composition of k-spaced nucleic acid pairs (CKSNAP), and nucleotide chemical properties (NCP), to incorporate biological sequences into nominal descriptors.

•
A block-wise deep forest algorithm was applied, and a quantitative score was derived using metrics including accuracy (ACC), specificity (Spe), sensitivity (Sen), and Matthews correlation coefficient (MCC); cross-validation and independent dataset tests were utilized to evaluate the performance of NEPERS.

•
Our method outperformed existing predictors by achieving high predictive rates.
Related Work

In 2016, Liu et al. created the first two-layer enhancer predictor, iEnhancer-2L [4]. In this model, DNA sequences were represented using pseudo k-tuple nucleotide composition (PseKNC). A jackknife test was conducted to evaluate the predictor's performance, and the five-fold cross-validation (5CV) technique was utilized to find the best parameters; SVM performed the best among all classifiers [21]. In 2016, Cangzhi et al. proposed EnhancerPred [5], which utilized three types of sequence-based features to convert biological sequences into attributes. The F1-score, in combination with SVM, was used to obtain the optimal features. The study utilized different classifiers, among which SVM provided outstanding overall performance. In 2017, Tahir et al. developed iEnhancer-TNC [6], which utilized dinucleotide composition (DNC) and trinucleotide composition (TNC) to extract numerical descriptors from biological sequences. In this model, the SVM classifier combined with the TNC approach performed best for classifying enhancers. In 2019, Le et al. developed a classifier named iEnhancer-5Steps [7] using the concept of pseudo-amino acid composition with 5CV; the scikit-learn package was used to run SVM on the dataset. In 2021, Yang et al. introduced a predictor called Enhancer-PCWM based on the physicochemical properties weight matrix (PCWM) [8]. In the first layer, SVM with a linear kernel is used and, in the second layer, the AdaBoost ensemble method is used to classify enhancers and their strengths, respectively. In 2021, Lim et al. developed iEnhancer-RF [9]. In this study, the RF method and a light gradient boosting machine (LGBM) model were applied, the 5CV strategy was used in both layers, and independent tests were also performed on the data obtained from iEnhancer-EL.
Similarly, in 2018, Liu et al. developed an ensemble-based method, iEnhancer-EL [11]. The BioSeq-Analysis tool was used to transform sequences into vectors, and the jackknife CV method was applied to the training dataset. The k-mer, subsequence profile, and PseKNC techniques were employed for feature extraction and SVM for enhancer prediction. In 2018, Cai et al. developed iEnhancer-XG [12]. In this model, the subsequence profile technique along with the XGBoost learning algorithm was used for prediction, with 10CV and independent datasets used as evaluation strategies. In 2021, Niu et al. developed iEnhancer-EBLSTM [13]. In this study, an ensemble of bidirectional long short-term memory networks (EBLSTM) was used, which works in two steps: extracting features using 3-mers and identifying enhancers using the LSTM method. In 2021, Liang et al. developed iEnhancer-MFGBDT [10]. In this study, multiple features, such as the k-mer and reverse-complement k-mer nucleotide composition (RCKmers) based on the DNA sequence, as well as the second-order moving average, normalized Moreau-Broto auto-cross-correlation, and Moran auto-cross-correlation based on the dinucleotide physical structural property matrix, were fused, and a gradient boosting decision tree (GBDT) was then used to extract features and perform classification. In 2019, Nguyen et al. developed a CNN-based model called Enhancer-ECNN [14]. For data transformation, one-hot encoding and k-mers were used, and the trained CNNs were used for model construction. In 2020, Li et al. developed Enhancer-RNN [15]. The most efficient feature selection method used in this study was Random Forest (RF), applied along with a 10-fold CV strategy; enhancer classification was performed using 3-mer, word-to-vector, and RNN models. In 2020, Ibrahim et al. introduced Enhancer-DSNet [16]. In this study, a two-layer classification model based on a linear classifier was developed, in which 5CV and independent tests, along with an SVM classifier, were used. In 2021, Mu et al. developed spEnhancer [17]. This study used SeqPose to convert DNA sequences into numerical sequences and a Chi-squared test to remove redundant data. A three-class classification model is also implemented in this study; it collects samples from three different classes and trains them via a bidirectional LSTM model. In 2021, Yang et al. developed the iEnhancer-RD [18] method, with optimal features selected using RFE. For parameter optimization, SVM with a linear kernel and a 5CV evaluation strategy was used. Enhancer-BERT-2D [19] was introduced by Nguyen et al. in 2021. BERT and CNN were used as the feature extraction techniques, and the model's performance was tested using independent data and a 5CV strategy. In 2021, Nagina et al. developed iEnhancer-DHF [20]. This method represents the DNA sequence as feature vectors using the PseKNC and FastText methods; the model follows Chou's five-step rule, is based on the DNN algorithm, and was evaluated with an independent dataset and a 10CV strategy.

Data Collection
For appropriate predictor training and assessment, a reliable benchmark dataset must be created or selected. The following two datasets (benchmark and independent) have mostly been utilized in the literature for the prediction of enhancers and their strengths from primary sequences [2,5,7,15,16].
The dataset S_B denotes the benchmark dataset for discriminating enhancers from non-enhancers; it contains 1484 enhancer samples (positive) and 1484 non-enhancer samples (negative), represented by S+ and S−, respectively, and is used for the layer-1 prediction of the current predictor. The dataset S+ serves as the layer-2 benchmark dataset, in which the 1484 enhancer samples are further divided into strong and weak enhancers, with 742 samples each, represented by S+_strong and S+_weak, respectively. Likewise, S_IND denotes the independent dataset, with 200 samples each of enhancers and non-enhancers; the enhancers are further categorized into strong and weak enhancers, with 100 samples each, represented by S+_IND, as shown in Table 1.

PSTNPss

Position-specific trinucleotide propensity based on single-stranded (PSTNPss) characteristics is a statistical technique based on single-stranded DNA/RNA properties [22,23]. There are 64 trinucleotides in total: AAA, AAC, ..., TTT. The trinucleotide position specificity of an 81 bp sample can therefore be described by the following 64 × 79 matrix:

$$Z = \begin{bmatrix} z_{1,1} & z_{1,2} & \cdots & z_{1,79} \\ z_{2,1} & z_{2,2} & \cdots & z_{2,79} \\ \vdots & \vdots & \ddots & \vdots \\ z_{64,1} & z_{64,2} & \cdots & z_{64,79} \end{bmatrix}$$

where

$$z_{i,j} = F^{+}(3\mathrm{mer}_i \mid j) - F^{-}(3\mathrm{mer}_i \mid j)$$

Here, i = 1, 2, ..., 64 and j = 1, 2, ..., 79. F+(3mer_i|j) and F−(3mer_i|j) denote the frequency of the ith trinucleotide (3mer_i) at the jth position appearing in the positive (S+) and negative (S−) datasets, respectively, with 3mer_1 = AAA, 3mer_2 = AAC, ..., 3mer_64 = TTT. A sample can therefore be expressed as

$$S = [\phi_1, \phi_2, \ldots, \phi_{79}]^{T}$$

where T is the transpose operator and, for 1 ≤ u ≤ 79 [24],

$$\phi_u = z_{i,u} \quad \text{when the trinucleotide at position } u \text{ of the sample is } 3\mathrm{mer}_i.$$
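A minimal sketch of this encoding is shown below (function names are our own; the position-wise frequencies are estimated from the positive and negative training sets only):

```python
from itertools import product

TRIMERS = ["".join(p) for p in product("ACGT", repeat=3)]  # AAA ... TTT (64)

def position_freqs(seqs):
    """F(3mer_i | j): frequency of each trinucleotide at each position."""
    n_pos = len(seqs[0]) - 2                       # 79 positions for an 81 bp sample
    freq = {t: [0.0] * n_pos for t in TRIMERS}
    for s in seqs:
        for j in range(n_pos):
            freq[s[j:j + 3]][j] += 1.0 / len(seqs)
    return freq

def pstnpss_encode(pos_seqs, neg_seqs, sample):
    """phi_u = F+(3mer|u) - F-(3mer|u) for the trinucleotide observed at
    position u of the sample (the z_{i,u} entries of the text's matrix)."""
    f_pos = position_freqs(pos_seqs)
    f_neg = position_freqs(neg_seqs)
    return [f_pos[sample[j:j + 3]][j] - f_neg[sample[j:j + 3]][j]
            for j in range(len(sample) - 2)]
```

For example, with toy 4 bp training sets, an enhancer-like sample yields large positive φ values at positions where its trinucleotides are frequent in S+ but rare in S−.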

PSTNPdss
PSTNPdss is a statistical feature employing an approach based on double-stranded DNA properties according to complementary base pairing; hence, it has more obvious statistical features [25]. Here, A is treated as equivalent to T, and C as equivalent to G, so each sample can be transformed into a sequence over the two-letter alphabet {A, C}, leaving only 8 possible trinucleotides. As a result, the trinucleotide position specificity of an 81 bp sample can be conveyed by the following 8 × 79 matrix:

$$Z = \begin{bmatrix} z_{1,1} & z_{1,2} & \cdots & z_{1,79} \\ \vdots & \vdots & \ddots & \vdots \\ z_{8,1} & z_{8,2} & \cdots & z_{8,79} \end{bmatrix}$$

where

$$z_{i,j} = F^{+}(3\mathrm{mer}_i \mid j) - F^{-}(3\mathrm{mer}_i \mid j)$$

Here, i = 1, 2, ..., 8 and j = 1, 2, ..., 79. F+(3mer_i|j) and F−(3mer_i|j) denote the frequency of the ith trinucleotide (3mer_i) at the jth position appearing in the positive (S+) and negative (S−) datasets, respectively, with 3mer_1 = AAA, 3mer_2 = AAC, ..., 3mer_8 = CCC. A sample can therefore be expressed as

$$S = [\phi_1, \phi_2, \ldots, \phi_{79}]^{T}$$

where T is the transpose operator and, for 1 ≤ u ≤ 79 [24],

$$\phi_u = z_{i,u} \quad \text{when the trinucleotide at position } u \text{ of the reduced sample is } 3\mathrm{mer}_i.$$
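The double-stranded reduction step can be sketched as follows (helper names are our own); the reduced sequences are then encoded position-wise in the same way as PSTNPss, but over only 8 trinucleotides:

```python
# A ~ T and C ~ G under complementary base pairing, so map T -> A and G -> C.
DS_REDUCE = str.maketrans("TG", "AC")

def reduce_double_stranded(seq):
    """Collapse a DNA sequence onto the two-letter {A, C} alphabet used by
    PSTNPdss, so only 2**3 = 8 trinucleotides (AAA ... CCC) remain."""
    return seq.translate(DS_REDUCE)

reduce_double_stranded("ATGC")  # -> "AACC"
```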

CKSNAP
The composition of k-spaced nucleic acid pairs (CKSNAP) encodes the frequencies of nucleic acid pairs that are separated by k arbitrary nucleotides [26,27]. The CKSNAP feature has 16 values corresponding to the nucleic acid pairs AA, AC, AG, ..., TG, TT. By default, K takes a maximum value of 5. Using k = 1 as an example, CKSNAP is defined as

$$\left( \frac{N_{A*A}}{N_{Total}}, \frac{N_{A*C}}{N_{Total}}, \frac{N_{A*G}}{N_{Total}}, \ldots, \frac{N_{T*T}}{N_{Total}} \right)_{16}$$

where * denotes any nucleotide (A, C, G, or T); N_{X*Y} denotes the number of X*Y nucleic acid pairs occurring in the sequence; and N_{Total} is the total number of one-spaced nucleic acid pairs in the sequence [28].
To further clarify the CKSNAP calculation, consider the DNA sequence 'ATGCATGC' with k = 1. The one-spaced pairs are A*G, T*C, G*A, C*T, A*G, and T*C, so N_Total = 6, with N_{A*G} = 2, N_{T*C} = 2, N_{G*A} = 1, N_{C*T} = 1, and all other pair counts equal to 0. The corresponding feature vector therefore contains 2/6 at the A*G and T*C positions, 1/6 at the G*A and C*T positions, and 0 elsewhere. This illustrates how CKSNAP captures the composition of one-spaced nucleic acid pairs in a sequence.
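The worked example above can be sketched for a single k as follows (the function name is our own):

```python
from itertools import product

def cksnap(seq, k=1):
    """Composition of k-spaced nucleic acid pairs for one value of k.

    Returns a 16-dim vector: the frequency of each pair X..Y in which X and
    Y are separated by exactly k arbitrary nucleotides."""
    pairs = ["".join(p) for p in product("ACGT", repeat=2)]  # AA, AC, ..., TT
    counts = {p: 0 for p in pairs}
    n_total = len(seq) - k - 1          # number of k-spaced pairs in the sequence
    for i in range(n_total):
        counts[seq[i] + seq[i + k + 1]] += 1
    return [counts[p] / n_total for p in pairs]

vec = cksnap("ATGCATGC", k=1)
# A*G occurs twice among the six one-spaced pairs, so vec[2] == 2/6
```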

Nucleotide Chemical Property (NCP)
NCP is a simple encoding technique based on nucleotide chemical properties. Its distinctiveness stems from the chemical nature of each nucleotide: under the three chemical qualities considered, no nucleotide shares more than one chemical attribute with any other [29]. As a result, NCP-encoded characteristics extracted from a sequence contain sufficient structural detail to solve a binary classification problem. Each sequence sample is converted into a 3×N-dimensional matrix using NCP, where N is the sequence length. There are four distinct nucleotides, adenine (A), guanine (G), cytosine (C), and thymine (T), with various chemical structures and bonds. Guanine and adenine are purines with two fused aromatic rings, whereas cytosine and thymine are pyrimidines with a single aromatic ring; aromatic cyclic structures are present in both groups and attach to sugar molecules [30]. Furthermore, the amino group is shared by adenine and cytosine, whereas the keto group is shared by guanine and thymine. The number of hydrogen bonds formed between adenine and thymine, on the other hand, is smaller than that formed between guanine and cytosine. Consequently, these four kinds of nucleotides with their three distinct sets of chemical characteristics can be given binary categorization criteria: based on their chemical properties, A, C, G, and T are represented by the coordinates [1, 1, 1], [0, 1, 0], [1, 0, 0], and [0, 0, 1], respectively [31,32].
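Using the coordinates above, the encoding can be sketched as follows (the function name is our own):

```python
# Rows correspond to the three chemical properties (ring structure,
# functional group, hydrogen-bond strength), per the coordinates in the text.
NCP_CODE = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (0, 0, 1)}

def ncp_encode(seq):
    """Encode a DNA sequence as a 3 x N matrix of nucleotide chemical properties."""
    cols = [NCP_CODE[nt] for nt in seq]
    return [[c[r] for c in cols] for r in range(3)]  # 3 rows, N columns
```

For example, `ncp_encode("ACGT")` yields a 3×4 matrix whose first column is the code for A, [1, 1, 1].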

Deep Forest
Deep forest (DF) is a deep model built from decision-tree ensembles rather than neural networks; it has great modeling ability, minimal hyperparameters, and is parallel-friendly [33]. Its layer-by-layer processing, in-model feature transformation, and sufficient model complexity are three aspects borrowed from the success of deep neural networks [33]. Compared to DNN approaches, DF models can produce significantly superior outcomes, although when dealing with small sample sizes, some basic deep forest models may encounter overfitting and ensemble diversity issues. As classifiers, these methods are capable of learning more relevant high-level features [34]. DF has become a dominant classifier in a variety of fields, including radar high-resolution range profile recognition, face anti-spoofing, hyperspectral imaging, self-interacting proteins, and cancer subtype classification. For feature learning in each layer, DF is an ensemble-based technique that uses decision trees rather than neural network structures. DF's cascade-type design makes it reliable and suitable for training on small amounts of data, and hyperparameter tuning is easier for DF than for DNNs. Unlike other classification algorithms, the deep forest classifier's greater discriminative capacity is due to its strong learning potential [35]. We created a new CDF technique, an extension of the gcForest model that includes the RF, extreme gradient boosting (XGBoost), and extremely randomized trees (ERT) classifiers. The XGBoost boosting parameter was set to k = 500, the number of trees in ERT was also set to 500, and the number of decision trees in RF was similarly set to 500, with node features chosen at random [36]. The details of the parameters utilized for training the final model are provided in Table S1.
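The cascade idea can be sketched as follows. This is a simplified, scikit-learn-only illustration, not the authors' implementation: the class and parameter names are ours, XGBoost is omitted to keep the sketch dependency-light, and each level simply appends its out-of-fold class probabilities to the original features before the next level is trained.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict

class CascadeForest:
    """Minimal cascade deep forest: each level's out-of-fold class
    probabilities are appended to the inputs of the next level."""

    def __init__(self, n_levels=2, n_trees=500, random_state=0):
        self.n_levels = n_levels
        self.n_trees = n_trees
        self.random_state = random_state
        self.levels = []

    def _make_estimators(self):
        return [RandomForestClassifier(self.n_trees, random_state=self.random_state),
                ExtraTreesClassifier(self.n_trees, random_state=self.random_state)]

    def fit(self, X, y):
        aug = X
        for _ in range(self.n_levels):
            ests, probas = [], []
            for est in self._make_estimators():
                # out-of-fold probabilities avoid leaking training labels
                probas.append(cross_val_predict(est, aug, y, cv=3,
                                                method="predict_proba"))
                ests.append(est.fit(aug, y))
            self.levels.append(ests)
            aug = np.hstack([X] + probas)
        return self

    def predict(self, X):
        aug = X
        for ests in self.levels:
            probas = [e.predict_proba(aug) for e in ests]
            aug = np.hstack([X] + probas)
        # average the final level's probabilities and take the argmax
        return np.mean(probas, axis=0).argmax(axis=1)
```

The paper's setting of 500 trees per forest corresponds to `n_trees=500` here; smaller values train faster for experimentation.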

Two-Layer Classification Framework
This framework classifies enhancers at two levels: it not only identifies enhancers but also classifies them as strong or weak based on their strength. Enhancers and non-enhancers are separated at the first level, while enhancers are divided into two subgroups, strong and weak, at the second level. In the two-layer classification framework, the benchmark dataset is used to train the predictor and an independent dataset is used to test it. Enhancers should be distinguished as strong or weak based on their varied levels of biological activity and the regulatory impact they have on target genes [4]; that is why we have also used a two-layer classification framework in our model.
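The two-layer decision logic can be sketched as a hypothetical wrapper (function and label names are ours; any pair of fitted binary classifiers exposing `predict` would do):

```python
def two_layer_predict(layer1, layer2, X):
    """layer1: enhancer (1) vs non-enhancer (0); layer2: strong (1) vs weak (0),
    applied only to samples that layer1 calls enhancers."""
    out = []
    for x in X:
        if layer1.predict([x])[0] == 0:
            out.append("non-enhancer")
        elif layer2.predict([x])[0] == 1:
            out.append("strong enhancer")
        else:
            out.append("weak enhancer")
    return out
```

Note that layer 2 is never consulted for samples rejected at layer 1, mirroring the cascade structure of the benchmark datasets.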

Evaluation Parameters
Performance evaluation is considered important in machine learning as a way to assess the effectiveness of a computational model after its development [37][38][39]. Depending on the data types and the classification problem, different performance measurement factors are employed for this purpose. Performance measures are derived from a confusion matrix, which records both the correct and incorrect predictions for each class and compares them with the actual outcomes [40]. The confusion matrix's columns indicate the actual values for each class, whereas its rows indicate the predicted class [4]. The following is an overview of the performance evaluation criteria adopted in this work:
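From the confusion-matrix counts (TP, TN, FP, FN), the criteria used in this work can be computed as follows (a straightforward illustration; the function name is ours):

```python
import math

def metrics(tp, tn, fp, fn):
    """ACC, Sen, Spe, and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)            # sensitivity / recall
    spe = tn / (tn + fp)            # specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sen, spe, mcc
```

For example, with TP = 40, TN = 45, FP = 5, FN = 10, this gives ACC = 0.85, Sen = 0.80, Spe = 0.90, and MCC ≈ 0.704.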

Feature Analysis (Individual vs. Fusion)
We used PSTNPss, PSTNPdss, CKSNAP, and NCP for the feature encoding combination in Table 2, employing the CDF technique to evaluate the performance of feature fusion. The ACCs of PSTNPss, PSTNPdss, CKSNAP, and NCP for layer 1 were 0.810, 0.801, 0.761, and 0.744, respectively. The ACC of the four features combined was 0.876, which is greater than the ACC of any individual sequence feature, as shown in Figure 2A and the ROC curve in Figure 2B. Additionally, the ACCs for the second layer were 0.916, 0.658, 0.635, and 0.606, respectively. Once more, the fusion of all four features performs better than any one feature alone, with an ACC of 0.959, as shown in Figure 3A and the ROC curve in Figure 3B. We also assessed how well the models performed in independent tests. According to Table 2, the layer-1 independent-test ACCs for PSTNPss, PSTNPdss, CKSNAP, and NCP were 0.853, 0.778, 0.748, and 0.503, respectively. The ACC of the four features combined, however, was 0.863 on the independent dataset, which is higher than the ACC of each individual sequence feature, as shown in Figure 2C and the ROC curve in Figure 2D. Additionally, the ACCs for the second layer in the independent tests were 0.855, 0.805, 0.655, and 0.650, respectively. Once more, the fusion of all four features exceeded each one separately, with an ACC of 0.890, as shown in Figure 3C and the ROC curve in Figure 3D. In conclusion, these findings demonstrate that the feature fusion approach is successful in enhancing our model's predictive capacity.
It is extensively documented in the machine learning literature that harnessing a combination of multiple features, as opposed to depending on a solitary feature set, can markedly improve model performance [41,42]. 'PSTNPss' may exert a dominant influence on the model's performance in comparison to the other features. In certain instances, a single feature might encapsulate the essential information required for precise predictions, diminishing the impact of additional features [43][44][45]. 'PSTNPss' captures pivotal patterns in both enhancer and non-enhancer sequences, playing a crucial role in the model's efficacy. This feature contains a wealth of information and distinctive patterns that differentiate enhancers from non-enhancers, especially when contrasted with 'NCP' and 'CKSNAP'. The model appears to rely heavily on the information encoded in 'PSTNPss' for making accurate predictions. Therefore, we also tested and evaluated different combinations (sets of two and three) of the individual features, and their results are illustrated in Table S2. The comparison shows that the results of these partial combinations are lower than those of the combination of all features. Thus, for the final training of the proposed model, all features were used for both layers 1 and 2 and on both the benchmark and independent datasets.

Analysis of Various Classifiers
In this section, we compare the performance of our suggested classifier, CDF, against other classifiers on the benchmark dataset, namely RF, SVM, XGB, GBDT, LightGBM, DNN, MLP, and CNN, with respect to both CV and independent-test performance. Table 3 displays the 10-fold CV outcomes, with the highest value of each metric shown in bold. It is evident that CDF performs better than the other classifiers on all measurement parameters, with an ACC of 0.876, Sen of 0.864, Spe of 0.888, MCC of 0.753, and AUC of 0.940 for layer 1 and an ACC of 0.959, Sen of 0.960, Spe of 0.958, MCC of 0.918, and AUC of 0.990 for layer 2, respectively. Table 5 displays the independent testing results, again with the highest values in bold. For layer 1, the CDF has the highest ACC of 0.863, Sen of 0.865, Spe of 0.860, MCC of 0.725, and AUC of 0.948. For the second layer, the CDF has the highest ACC of 0.890 and Sen of 0.940, the DNN has the highest Spe of 0.990, and the CDF has the highest MCC of 0.784 and AUC of 0.951. Based on the ACC of both layers, we selected the CDF as the best classifier among those tested, as shown in Figure 4, and we therefore analyze our features with CDF on both cross-validation and the independent dataset.

Model Interpretation
In this section, we use the SHapley Additive exPlanations (SHAP) algorithm [4] to interpret the proposed models for the prediction of enhancers and their strength. SHAP is a unified predictive interpretation framework proposed in 2017 as the only locally consistent and accurate expectation-based feature attribution method [28]. This technique can interpret the feature importance values of complex learning models and provide interpretable predictions for the test sample [28]. SHAP scores are proposed as a consistent measure of feature importance because they assign each feature a significance value (φ) that represents the impact of including that feature in the model's predictions [28,46].
The top 20 features for the enhancer and its strength are displayed in Figure 5A,B, where each row represents the distribution of a feature's SHAP values. The larger a SHAP value is in the positive direction, the more the feature contributes in that direction, and vice versa. The magnitude of the feature value is represented by color: higher feature values are shown in red and lower values in blue. For instance, in the case of PSTNPdss39 and CKSNAP4, a larger feature value corresponds to a stronger positive SHAP value. A SHAP value greater than 0 denotes that the prediction favors the positive class, and the points with varied colors show the distribution of each feature's impact on the output of the proposed model. Conversely, a SHAP value of less than zero indicates a tendency toward the negative class.
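For intuition, the Shapley value that SHAP approximates can be computed exactly for tiny models by enumerating all feature orderings. This is a toy illustration of the underlying definition (function name is ours), not the library's tree-based algorithm:

```python
from itertools import permutations

def exact_shap(f, x, baseline):
    """Exact Shapley values for features of x, relative to a baseline input.

    f takes a full feature vector; 'absent' features keep their baseline
    value. Cost grows as n!, so this is only feasible for tiny n."""
    n = len(x)
    phi = [0.0] * n
    perms = list(permutations(range(n)))
    for order in perms:
        z = list(baseline)
        prev = f(z)
        for i in order:            # add features one at a time
            z[i] = x[i]
            cur = f(z)
            phi[i] += (cur - prev) / len(perms)  # average marginal contribution
            prev = cur
    return phi
```

By construction, the values sum to f(x) − f(baseline), which is the local-accuracy property that makes SHAP attributions consistent.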


Conclusions
This study includes a comprehensive analysis of existing enhancer predictors based on ML techniques, ensemble approaches, and DL methods. We performed a comparative analysis and performance assessment of the available predictors in terms of dataset utilization, feature construction, ACC, and MCC. We discovered that enhancer prediction was performed using two datasets: a benchmark dataset for distinguishing enhancers from other regulatory elements and, for estimating enhancer strength, an independent dataset. In terms of efficiency and acceptability, our comparative analysis identified two approaches that produce the best results: iEnhancer-5Step for the S_B dataset and Enhancer-RNN for the S+ dataset. This study also provides information that will help design and develop new predictors for the precise and reliable identification and prediction of enhancers and their strengths. In our model, we incorporated a cascade-type structure consisting of two layers, where enhancers and non-enhancers are separated at the first level and strong and weak enhancers are differentiated at the second level. We used the PSTNPss, PSTNPdss, CKSNAP, and NCP encodings in conjunction with the CDF technique to perform a preliminary evaluation of the feature encoding combinations before evaluating feature fusion and selection. Using the benchmark dataset, we evaluated how well our suggested classifier, CDF, performed in comparison to other classifiers such as RF, SVM, XGB, GBDT, LightGBM, DNN, MLP, and CNN; the performances of these classifiers via 10-fold CV and independent tests were compared. CDF clearly outperforms the other classifiers on all measurement parameters, making it a better predictor than the other models. It is anticipated that, with the help of this comparative analysis and assessment, as well as the outcomes of the proposed method, future researchers will find NEPERS effective for predicting enhancers and their strengths.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/info14120636/s1. Table S1: Hyperparameter search details for cascade deep forest classifiers; Table S2: Performance analysis of various feature subsets via benchmark and independent datasets.

Figure 1.
Figure 1.Framework for the proposed method, NEPERS, for the prediction of enhancers and their strength.

Figure 2.
Figure 2. Performance analysis of various features via benchmark datasets on layer 1 and layer 2. (A) shows the performance metrics values for layer 1, (B) shows the metrics values for layer 2, (C) represents the ROC curve for layer 1, and (D) indicates the ROC curve for layer 2.


Figure 3.
Figure 3. Performance analysis of various features for independent datasets on layer 1 and layer 2. (A) shows the performance metrics values for layer 1, (B) shows the metrics values for layer 2, (C) represents the ROC curve for layer 1, and (D) indicates the ROC curve for layer 2.


Figure 4.
Figure 4. Performance analysis of various classifiers. (A) indicates layer 1 and (C) layer 2 for the benchmark dataset, whereas the independent dataset results are represented in (B) for layer 1 and (D) for layer 2.

Figure 5.
Figure 5. Representation of the essential features utilized by NEPERS in enhancer prediction, where SHAP values represent the directionality of the top 20 features; negative and positive SHAP values influence the predictions for layer 1 in (A) and layer 2 in (B), respectively.


Table 1.
A summary of training and independent test datasets used in enhancer predictors.

Table 2.
Performance analysis of various features via benchmark and independent datasets.


Table 3.
Performance analysis of various classifiers via benchmark and independent datasets via all features.

Table 5.
Performance comparison of various classifiers via independent dataset.