Deep-4mCGP: A Deep Learning Approach to Predict 4mC Sites in Geobacter pickeringii by Using Correlation-Based Feature Selection Technique

4mC is a type of DNA modification that can regulate multiple biological processes, for example, DNA replication, gene expression, and transcriptional regulation. Accurate prediction of 4mC sites can provide precise information about their genetic functions. The purpose of this study was to establish a robust deep learning model to recognize 4mC sites in Geobacter pickeringii. In the proposed model, two kinds of feature descriptors, namely, binary and k-mer composition, were used to encode the DNA sequences of Geobacter pickeringii. The features obtained from their fusion were optimized by using a correlation and gradient-boosting decision tree (GBDT)-based algorithm with the incremental feature selection (IFS) method. Then, these optimized features were fed into a 1D convolutional neural network (CNN) to classify 4mC sites from non-4mC sites in Geobacter pickeringii. The performance of the proposed model on independent data exhibited an accuracy of 0.868, which was 4.2% higher than that of the existing model.


Introduction
Modifications of DNA play a significant role in gene expression and regulation, DNA replication, and transcriptional regulation. Methylcytosine is a key epigenetic mark at 5′-cytosine-phosphate-guanine-3′ (CpG) sites and is closely associated with cell growth and chromosomal protection [1,2]. 5-Hydroxymethylcytosine (5hmC), 5-methylcytosine (5mC), and N4-methylcytosine (4mC) are the well-known cytosine methylations in multiple genomes of prokaryotes and eukaryotes [3,4]. 5mC is a common type of methylcytosine and is implicated in many neurodegenerative diseases and cancers [5]. 4mC is an important modification that protects genomic information from degradation by restriction enzymes [6].
Precise identification of 4mC sites can give important clues for understanding the mechanism of gene regulation. At present, there are several techniques to recognize 4mC sites, for example, single-molecule real-time sequencing [7], mass spectrometry [8], and bisulfite sequencing [9], but these techniques are time-consuming and expensive when applied to next-generation sequencing data. Hence, a computational model to identify 4mC sites is urgently needed. A few computational and mathematical methods have already been introduced to predict 4mC sites in multiple species. In 2017, Chen et al. [10] introduced the first computational model to predict 4mC sites in multiple species on the basis of a confirmed 4mC dataset. Subsequently, Wei et al. [11] designed a novel iterative feature representation algorithm for the prediction of 4mC sites. Tang et al. [12] introduced a new linear integration method that merges the existing models for the identification of 4mC sites. Afterwards, Manavalan et al. [13] established the new tool Meta-4mCpred to recognize 4mC sites in six different species. Khanal et al. [14] introduced the first deep learning model, 4mCCNN, by utilizing numerous feature combinations [15–17] for the prediction of 4mC sites in multiple genomes [18]. Although the prediction model 4mCCNN can yield good results, there is still room for improvement.
Int. J. Mol. Sci. 2022, 23, 1251
To address these limitations, we constructed a 1D CNN model to recognize 4mC sites in Geobacter pickeringii. Figure 1 illustrates the flowchart of the whole study. Binary and k-mer nucleotide composition descriptors were used to encode DNA sequences of Geobacter pickeringii into feature vectors, and these features were then optimized by using a correlation and gradient-boosting decision tree (GBDT)-based algorithm with the incremental feature selection (IFS) method. The optimized features were then fed into a 1D CNN-based classifier using 10-fold cross-validation, and we obtained the best model to classify 4mC from non-4mC sites.

Performance Evaluation
We constructed a 1D CNN-based model named Deep-4mCGP for the identification of 4mC sites in Geobacter pickeringii. In the first step, we converted the sequence data into feature vectors by using k-mer nucleotide composition and binary encodings. Subsequently, these feature vectors were refined in two stages to pick the best features: first by correlation, and then by GBDT with the IFS method. Figure 2A,B displays the IFS curves of the top features. Afterward, these best features were fed into the 1D CNN to classify 4mC sites from non-4mC sites in Geobacter pickeringii. In this work, 10-fold cross-validation was employed to examine the performance of the model. The data were randomly divided into 10 segments of equal size. Each segment was independently tested by the model, which was trained on the remaining nine segments. Thus, training and testing were executed 10 times, and the average of the outcomes was taken as the final result. The AUROC of the proposed model was 0.986, which was 6.5% higher than that of the existing model.
The accuracy, precision, recall, and F1 are shown in Table 1, and the ROC curve is shown in Figure 2C.
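The 10-fold cross-validation procedure described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code used in the study: the function names `k_fold_indices` and `cross_validate`, the fixed random seed, and the `train_and_score` callback (which would train the classifier on the training indices and return a score on the held-out indices) are all assumptions introduced here.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Randomly split range(n_samples) into n_folds folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list gives fold sizes that differ by at most one.
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(n_samples, train_and_score, n_folds=10, seed=0):
    """Hold each fold out once, train on the other folds, and average the scores."""
    folds = k_fold_indices(n_samples, n_folds, seed)
    scores = []
    for held_out in folds:
        train_idx = [i for fold in folds if fold is not held_out for i in fold]
        # train_and_score is a hypothetical callback supplied by the user.
        scores.append(train_and_score(train_idx, held_out))
    return sum(scores) / len(scores)
```

With the 1138 training sequences, eight folds would contain 114 sequences and two would contain 113, so "equal size" holds only up to one sequence.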
Table 1. Outcomes of single encodings and their fusion-based models on training and independent data by using different classification algorithms. Bold is used to highlight the best results.

Comparison on the Basis of Independent Data
The fused features were also fed into LSTM [21], GBDT [22], and RF [23,24] classifiers for comparison with the CNN-based model [25]. On the basis of AUROC, we selected the best model for each predictor, as shown in Table 1 and Figure 2F. A comparison of the proposed model with 4mCCNN by using 10-fold cross-validation is shown in Figure 2E. The performance of Deep-4mCGP was then checked on the independent data (200 positive and 200 negative sequences) and compared with that of the existing 4mCCNN. The accuracy, precision, recall, F1, and AUROC of 4mCCNN were 0.826, 0.818, 0.823, 0.825, and 0.920, respectively. The accuracy, precision, recall, F1, and AUROC of Deep-4mCGP were 0.868, 0.876, 0.773, 0.859, and 0.961, respectively. Thus, Deep-4mCGP achieved an accuracy of 0.868 on independent data, which was 4.2% higher than that of 4mCCNN. The performance comparison is shown in Table 2.

Materials and Methods
Reliable data are a significant requirement for the construction of a machine learning-based model [26,27]. Thus, we acquired 1138 sequences (569 positive and 569 negative) of Geobacter pickeringii from the work of Chen et al. [10] for training and testing the model. Moreover, we obtained 400 sequences (200 positive and 200 negative) from the work of Manavalan et al. [13] for independent testing.

k-mer
k-mer composition can capture interactions between neighboring nucleotides of DNA sequences [40]. The k-mers are obtained by setting the window size and the step size. A sample F with sequence length n can be written as
$$F = S_1 S_2 S_3 \ldots S_i \ldots S_n \tag{1}$$
where $S_i$ indicates the i-th nucleotide of the DNA sequence. With the help of k-mer composition, F can be converted into a $4^k$-dimensional feature vector
$$F_k = \left[d_1^{k\text{-tuple}}, d_2^{k\text{-tuple}}, \ldots, d_i^{k\text{-tuple}}, \ldots, d_{4^k}^{k\text{-tuple}}\right]^T \tag{2}$$
where $d_i^{k\text{-tuple}}$ denotes the occurrence frequency of the i-th k-mer and T represents transposition. If k is equal to 1, the DNA sequence is encoded into a 4D feature vector, and if k is equal to 2, it is encoded into a 16D feature vector. In this work, k was set as 1, 2, 3, 4, 5, and 6. Consequently, each DNA sequence was converted into a ($4^1 + 4^2 + 4^3 + 4^4 + 4^5 + 4^6 = 5460$)-dimensional feature vector formed by concatenating the six k-mer vectors:
$$F_{1\text{-}6} = \left[F_1^T, F_2^T, F_3^T, F_4^T, F_5^T, F_6^T\right]^T \tag{3}$$
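The k-mer encoding described above can be illustrated with a short sketch. It is not the authors' implementation: the function names are hypothetical, a window step of 1 is assumed, and each count is normalized by the number of windows (the text does not state whether raw counts or frequencies were used).

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(sequence, k):
    """Return the 4**k k-mer frequencies of an uppercase DNA sequence."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(sequence) - k + 1  # number of windows of size k with step 1 (assumed)
    for start in range(total):
        window = sequence[start:start + k]
        if window in counts:  # windows with ambiguous bases such as N are skipped
            counts[window] += 1
    # Normalizing by the number of windows is an assumption of this sketch.
    return [counts[kmer] / total for kmer in kmers]


def kmer_features(sequence, ks=(1, 2, 3, 4, 5, 6)):
    """Concatenate the k-mer vectors for every k in ks (5460 values for k = 1..6)."""
    features = []
    for k in ks:
        features.extend(kmer_frequencies(sequence, k))
    return features
```

For a 41 bp sequence, `kmer_features` returns 4 + 16 + 64 + 256 + 1024 + 4096 = 5460 values, matching the dimensionality stated in the text.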

Binary
Binary encoding represents information as 0s and 1s. A DNA sequence can therefore be transformed into a string of 0s and 1s by assigning a 4-bit code to each nucleotide. In this work, DNA sequences of Geobacter pickeringii with a length of 41 bp were encoded into (4 × 41 = 164D) feature vectors.
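As a minimal sketch of the binary encoding, the snippet below maps each nucleotide to a 4-bit one-hot code and concatenates the codes. The specific assignment (A = 1000, C = 0100, G = 0010, T = 0001) is a common convention assumed here; the text does not state which code was assigned to which nucleotide, and the name `binary_encode` is hypothetical.

```python
# Assumed one-hot codes; the source does not specify the nucleotide-to-code mapping.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def binary_encode(sequence):
    """Concatenate the 4-bit code of every nucleotide (4 * len(sequence) values)."""
    features = []
    for nucleotide in sequence:
        features.extend(ONE_HOT[nucleotide])
    return features
```

A 41 bp sequence yields 4 × 41 = 164 values, which matches the dimensionality given above.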

Correlation
Correlation measures the relationship between two features: if the features are uncorrelated, the correlation coefficient is zero, whereas for perfectly linearly related features it is ±1. Two approaches, classical linear correlation and correlation based on information theory, were implemented to compute the correlation between two variables. The linear correlation coefficient is the most familiar and widely used. The linear correlation coefficient r for a pair of variables (p, q) is given by
$$r = \frac{\sum_i (p_i - \bar{p})(q_i - \bar{q})}{\sqrt{\sum_i (p_i - \bar{p})^2}\,\sqrt{\sum_i (q_i - \bar{q})^2}} \tag{4}$$
where $\bar{p}$ and $\bar{q}$ are the means of p and q. The correlation coefficient gives good results on smaller datasets, but it is less reliable on very large amounts of data. Therefore, it is necessary to determine whether the relationship between the features is statistically significant. Thus, we utilized the t-test to investigate the statistical correlation between the features and picked the significant features. The value of t can be computed as
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \tag{5}$$
where r signifies the coefficient of correlation, n represents the number of samples, and n − 2 denotes the degrees of freedom. The significance level is 0.05. If t is greater than the critical value at this significance level, the feature is selected.
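The correlation filter can be sketched as follows, using the standard Pearson coefficient and the usual t statistic with n − 2 degrees of freedom. Several details are assumptions of this sketch rather than statements from the text: each feature column is compared with the class labels, the test is two-sided (the absolute value of t is used), and the critical value is supplied by the caller because the Python standard library has no t-distribution table.

```python
import math


def pearson_r(p, q):
    """Linear (Pearson) correlation coefficient between two equal-length lists."""
    n = len(p)
    mean_p = sum(p) / n
    mean_q = sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        return 0.0  # a constant column is treated as uncorrelated (assumption)
    return cov / math.sqrt(var_p * var_q)


def t_statistic(r, n):
    """t value for testing r against zero with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def select_features(columns, labels, critical_value):
    """Return indices of feature columns whose |t| exceeds the critical value."""
    selected = []
    for index, column in enumerate(columns):
        t = t_statistic(pearson_r(column, labels), len(labels))
        if abs(t) > critical_value:  # two-sided comparison is an assumption
            selected.append(index)
    return selected
```

For example, with eight samples (six degrees of freedom) the two-sided critical value at the 0.05 level is about 2.447, so `select_features(columns, labels, 2.447)` keeps only columns whose correlation with the labels is significant at that level.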

GBDT with IFS
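GBDT is a popular machine learning-based classifier that has been utilized in various mathematical, cheminformatics, and bioinformatics tools [41,42]. It can establish a scalable and reliable prediction model by using non-linear combinations of weak learners [43]. Given a training set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $x_i \in x \subseteq S^n$ and $y_i \in y \subseteq S$, the GBDT model is an additive ensemble of K decision trees:
$$q_K(x) = \sum_{k=1}^{K} D_k(x; \theta_k) \tag{6}$$
where $D_k(x; \theta_k)$ is the k-th decision tree and $\theta_k$ denotes its minimal-risk parameters.
GBDT computes the final estimate in a forward stagewise manner:
$$q_k(x) = q_{k-1}(x) + D_k(x; \theta_k) \tag{7}$$
The negative gradient of the loss function L, evaluated at $q_{k-1}$, is used to compute the residuals:
$$S_{ki} = -\left[\frac{\partial L(y_i, q(x_i))}{\partial q(x_i)}\right]_{q = q_{k-1}} \tag{8}$$
Hence, the k-th tree is trained on the residuals $S_{ki}$ to compute the minimal-risk parameters $\theta_k$:
$$\theta_k = \arg\min_{\theta} \sum_{i=1}^{n} \left(S_{ki} - D_k(x_i; \theta)\right)^2 \tag{9}$$
Such a tree represents the relations between variables by partitioning the input space X into J regions $S_1, \ldots, S_J$, with output $Z_j$ for region $S_j$:
$$D(x; \theta) = \sum_{j=1}^{J} Z_j\, I(x \in S_j) \tag{10}$$
where I is the indicator function.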
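To make the forward, residual-fitting procedure concrete, the sketch below implements a very small gradient-boosting loop in plain Python and uses the accumulated split gains to rank features. It is a simplified illustration under several assumptions: the loss is squared error (so the negative gradient is the ordinary residual), every tree is a single-split stump, a learning rate is applied to each update, and importance is measured by squared-error reduction. None of these choices, nor the function names, are taken from the original study.

```python
def fit_stump(X, residuals):
    """Find the single split that best fits the residuals.

    Returns (gain, feature, threshold, left_value, right_value) or None
    when no feature has more than one distinct value.
    """
    n = len(X)
    base_error = sum(r * r for r in residuals) - sum(residuals) ** 2 / n
    best = None
    for feature in range(len(X[0])):
        # Dropping the largest value guarantees a non-empty right branch.
        for threshold in sorted({row[feature] for row in X})[:-1]:
            left = [r for row, r in zip(X, residuals) if row[feature] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[feature] > threshold]
            error = (sum(r * r for r in left) - sum(left) ** 2 / len(left)
                     + sum(r * r for r in right) - sum(right) ** 2 / len(right))
            gain = base_error - error
            if best is None or gain > best[0]:
                best = (gain, feature, threshold,
                        sum(left) / len(left), sum(right) / len(right))
    return best


def gbdt_feature_ranking(X, y, n_trees=20, learning_rate=0.1):
    """Boost stumps on squared-error residuals and rank features by total gain."""
    n_features = len(X[0])
    prediction = [sum(y) / len(y)] * len(y)  # initial constant model
    importance = [0.0] * n_features
    for _ in range(n_trees):
        # For squared-error loss the negative gradient is the plain residual.
        residuals = [target - pred for target, pred in zip(y, prediction)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        gain, feature, threshold, left_value, right_value = stump
        importance[feature] += gain
        # Forward stagewise update: new model = previous model + shrunken tree.
        prediction = [
            pred + learning_rate * (left_value if row[feature] <= threshold else right_value)
            for row, pred in zip(X, prediction)
        ]
    ranking = sorted(range(n_features), key=lambda f: importance[f], reverse=True)
    return ranking, importance
```

The returned ranking is the kind of ordered feature list that an incremental selection step can then consume; in practice a library implementation of gradient boosting would be used instead of this toy loop.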
The IFS [44,45] method was implemented in this work to pick the best features. IFS evaluates the performance of the top q ranked features iteratively for q = 1, 2, 3, ..., n, where n is the total number of features. IFS usually stops at the first observed decrease in performance. In this work, features were added incrementally starting from a randomly chosen initial feature, and the best result from several randomly restarted IFS runs was reported. A description of the IFS technique can be found in [46].
Algorithm 1 (continued).
for i = 1 to k do
    t = significance of (r, ρ) for L_i (using the t-test value from Equation (5))
    if t > critical value
        Q_best = Q_list
end
return Q_best
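The basic IFS loop over the top q ranked features can be sketched as below. This is a simplified illustration, not the authors' procedure: it evaluates every prefix of an already ranked feature list and keeps the best one, and it omits the random restarts mentioned in the text. The `evaluate` callback, which would train a classifier on the given feature subset and return a score such as cross-validated AUROC, is hypothetical.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Score the top-q features for q = 1..n and keep the best-scoring subset.

    ranked_features: feature indices ordered from most to least important.
    evaluate: hypothetical callback mapping a feature subset to a score.
    """
    best_score = float("-inf")
    best_subset = []
    curve = []  # (q, score) pairs, i.e. the points of an IFS curve
    for q in range(1, len(ranked_features) + 1):
        subset = ranked_features[:q]
        score = evaluate(subset)
        curve.append((q, score))
        if score > best_score:
            best_score, best_subset = score, list(subset)
    return best_subset, best_score, curve
```

Scanning all n prefixes is the exhaustive variant; stopping at the first decrease in score would be cheaper but can miss a later, higher peak.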

Convolutional Neural Network
LeCun et al. [47] introduced the convolutional neural network, and it is now widely used in many biological and bioinformatics applications [48–50]. The fundamental principle of a CNN is to learn multiple filters that extract hidden topological features from data through layer-wise convolutions and pooling operations. CNNs perform exceptionally well on 2D data such as images and matrices [51]. 1D CNNs have subsequently been used for biomedical sequence classification and in natural language processing research [41,52]. In this work, we implemented a 1D CNN to identify 4mC sites in Geobacter pickeringii. We used Keras 2.3.1 [53], TensorFlow 2.1.0, and Python 3.5.4 to perform this experiment. The best tuning parameters are recorded in Table 3.
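The two building blocks named above, convolution and pooling, can be illustrated on a 1D feature vector with plain Python. This sketch only shows what a single filter and a pooling step compute; it is not the Keras model used in this work, and the kernel values, pooling size, and function names are arbitrary assumptions rather than the tuned parameters of Table 3.

```python
def conv1d(signal, kernel, bias=0.0):
    """'Valid' 1D convolution of one filter over a feature vector.

    As in most deep learning libraries, the kernel is not flipped
    (the operation is technically a cross-correlation).
    """
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width)) + bias
        for i in range(len(signal) - width + 1)
    ]


def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]
```

A 1D CNN layer applies many such filters in parallel and learns their kernel values during training; stacking convolution, activation, and pooling layers followed by dense layers gives the kind of classifier described in this section.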

Metrics Evaluation
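Precision, accuracy, recall, and F1 [54–56] were employed to examine the effectiveness of the proposed prediction model and are formulated as
$$\begin{aligned} \text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\ \text{Precision} &= \frac{TP}{TP + FP} \\ \text{Recall} &= \frac{TP}{TP + FN} \\ \text{F1} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \end{aligned} \tag{11}$$
where TP is the number of correctly predicted 4mC sequences, TN is the number of correctly predicted non-4mC sequences, FP is the number of non-4mC sequences predicted as 4mC sequences, and FN is the number of 4mC sequences predicted as non-4mC sequences.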
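The four metrics can be computed directly from the confusion-matrix counts, as in the short sketch below. The function name and the convention of returning 0.0 when a denominator is zero are assumptions of this sketch; the counts in the usage note are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Returning 0.0 for an empty denominator is a convention assumed here.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For instance, with hypothetical counts TP = 80, TN = 90, FP = 10, and FN = 20, the accuracy is 170/200 = 0.85 and the recall is 80/100 = 0.80.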

Conclusions
4mC is a type of DNA modification that can regulate multiple biological processes, for example, DNA replication, gene expression, and transcriptional regulation. Accurate prediction of 4mC sites can provide precise information about their genetic functions. Currently, several machine learning models have been used to predict 4mC sites in multiple genomes [10,12,13,57–60]. However, only one deep learning-based model, 4mCCNN [14], exists for Geobacter pickeringii. In this work, a deep learning model was constructed to recognize 4mC sites in Geobacter pickeringii. In the proposed model, two kinds of feature descriptors, namely, binary and k-mer composition, were used to encode the DNA sequences of Geobacter pickeringii. The features obtained from their fusion were optimized by using a correlation and GBDT-based algorithm with the IFS method. Then, these optimized features were fed into a 1D CNN-based classifier using 10-fold cross-validation, and we obtained the best model to classify 4mC from non-4mC sites. The proposed Deep-4mCGP achieved an accuracy of 0.868 on independent data, which was 4.2% higher than that of 4mCCNN. The source code and data are available at GitHub: https://github.com/linDing-groups/Deep-4mCGP (accessed on 19 January 2022). In future work, we plan to release a web-based application to make the proposed model more convenient for users without programming and statistical knowledge.