iRG-4mC: Neural Network Based Tool for Identiﬁcation of DNA 4mC Sites in Rosaceae Genome

: DNA N4-Methylcytosine is a genetic modiﬁcation process which has an essential role in changing different biological processes such as DNA conformation, DNA replication, DNA stability, cell development and structural alteration in DNA. Due to its negative effects, it is important to identify the modiﬁed 4mC sites. Further, methylcytosine may develop anywhere at cytosine residue, however, clonal gene expression patterns are most likely transmitted just for cytosine residues in strand-symmetrical sequences. For this reason many different experiments are introduced but they proved not to be viable choice due to time limitation and high expenses. Therefore, to date there is still need for an efﬁcient computational method to deal with 4mC sites identiﬁcation. Keeping it in mind, in this research we have proposed an efﬁcient model for Fragaria vesca ( F. vesca ) and Rosa chinensis ( R. chinensis ) genome. The proposed iRG-4mC tool is developed based on neural network architecture with two encoding schemes to identify the 4mC sites. The iRG-4mC predictor outperformed the existing state-of-the-art computational model by an accuracy difference of 9.95% on F. vesca (training dataset), 8.7% on R. chinesis (training dataset), 6.2% on F. vesca (independent dataset) and 10.6% on R. chinesis (independent dataset). We have also established a webserver which is freely accessible for the research community.


Introduction
DNA modification consisting of methylation and demethylation plays a crucial role in gene regulation. DNA methylation being a heritable epigenetic marker is a kind of chemical modification of DNA that changes the genetic functionality without disrupting the DNA sequence [1,2]. Research demonstrates that DNA modification exhibits the property to modify DNA protein interactions, chromatin structure, stability and conformation [3,4]. It also has a role in the regulation of a few activities such as cell development, chromosome stability and genomic imprinting [5][6][7].
In prokaryotes and eukaryotes most commonly observed methylations are N4-methylcytosine (4mC) [8], 5-methylcytosine (5mC) [9], and N6-methyladenine (6mA) [10]. Each methylation process has its specific altering site, i.e., 4mC, 5mC and 6mC occurs at 4th, 5th and 6th position of the cytosine respectively. Host DNA present in exogenous pathogenic DNA of prokaryotes can be detected by 6mA and 4mC [11], where 4mC perform error correction and regulation of DNA replication [12,13]. Additionally, 4mC belongs to a restriction-modification system that resists the degradation of host DNA caused by restriction enzymes [14]. 5mC plays an essential role in various activities of eukaryotes, such as gene imprinting, regulation, and transposon suppression. In eukaryotes, 6mA and 4mC can only be identified using high sensitive techniques [15]. The 5mC has been extensively studied and previous studies have proved that 5mC is responsible for various biological processes [16], such as diabetes, few neurological defects and cancer [17][18][19]. The 4mC exhibits the ability to resist enzyme-mediated degradation to safeguard its own DNA. Furthermore, it can manipulate different activities entailing gene expression levels, DNA replication, cell cycle, discriminating self and non-self-DNA and amendment in DNA replication abnormalities [12,20]. Even after extensive research, the accurate mechanism of 4mC epigenetic modification remains unrevealed. This makes the identification of 4mC sites to be an important task, as its identification can give a better understanding of pathological and physiological mechanisms.
Several experimental studies have been conducted to find 4mC sites throughout the genome, some of them are 4mC-Tet-assisted bisulphite sequencing, Single Molecule of Real Time (SMRT) sequencing, methylation-precise PCR and mass spectrometry [21,22]. These methodologies are time exhaustive, laborious, and financially expensive when utilized for genome-wide testing. Henceforth, it is crucial to follow a computational approach for finding 4mC sites.
Recently, various 4mC sites identifiers have been suggested for several species entailing A. thaliana, C. elegans, D. melanogaster, G. pickeringii, E. coli and G. subterraneus [23,24]. Further a first computational tool for 4mC identification in Rosaceae genomes is recently introduced which is, i4mC-ROSE [25]. This tool has been suggested to predict the 4mC sites within the genome of Rosa chinensis (R. chinensis) [26] and Fragaria vesca (F. vesca) [27]. It produced six probabilistic scores by utilizing six encoding schemes; k-space spectral nucleotide composition (KSNC), k-mer composition (Kmer), dinucleotide physicochemical properties (DPCP), electron-ion interaction pseudopotentials (EIIPs), trinucleotide physicochemical properties (TPCPs), and binary encoding (BE) that works on different characteristics of DNA sequence information. These encoded sequences are used to train random forest classifier to identify the 4mC modification. The scores acquired are concatenated with a linear regression model for improved prediction results. Studies have suggested that in plant methylation DNA is mostly found in cytosines belonging to symmetrical sequences, which makes the classification problem further difficult [28].
The proposed iRG-4mC tool encodes the input sequence using two techniques which are one-hot encoding and nucleotide chemical properties (NCP). The outputs of the two encoding scheme are combined and given to the Convolution Neural Network (CNN) model. The CNN model extracts the features using convolution layers and then gives optimal representation to these features using Long Short Term Memory (LSTM). The optimized features are used for the prediction of 4mC sites. The proposed architecture has obtained high performance. Performance analysis is carried out by using K-fold cross-validation on the training dataset and by using an independent dataset. While in comparison with the existing state-of-the-art tool which is i4mC-Rose, the proposed iRG-4mC tool have achieved higher performance for training as well as an independent dataset.

Benchmark Dataset
The benchmark dataset used in this study is acquired from MDR database [46]. The dataset contains two genomes which are F. vesca and R. chinensis. All of the sequences present in the dataset have a length of 41 base pairs (bp) and centered with cytosine nucleotide. For maintaining the high quality of the dataset, the positive sequences are verified with the help of modification score (ModQV). The cytosine nucleotide is considered to be modified if the modification score is greater than 19. Further, CD-HIT software [47] is utilized to encounter the homology bias problem, where sequences with more than 65% similarity are removed.
A large number of negative 4mC sites were gathered too and CD-HIT was applied to remove the the sequences that were having more than 65%. Out of this big dataset of negative samples, we have randomly selected the same numbers of sites as available for positive 4mC, so that there would be no class imbalance problem. The total sites collected were 6471 and 3116 for F. vesca and R. chinensis respectively for each class. Each dataset is further divided into two sets where 75% is kept for training and remaining 25% is kept as an independent dataset. Table 1 shows the summary of the database.

Methodology
The DNA sequences were in string form, therefore, an encoding scheme was required before feeding the data to the model. Previous methods in site identification have used different encoding schemes. One-hot encoding and NCP [48] encoding schemes are the most commonly used schemes. In this study, we used both of these schemes, by combining them both. Table 2 shows the summary of the encoding schemes. One-hot encoding assigned 4 bits for each nucleotide, while NCP allocated 3 bits for each nucleotide. The fusion of both encoding schemes resulted in 7 bits for each nucleotide. Each input DNA sequence was 41bp long, the encoding mechanism converted the sequence into a matrix of size 41 × 7.

Nucleotide
One-Hot NCP Fusion  Figure 1 shows the proposed iRG-4mC architecture. The final encoded sequence was given to a CNN model that contained multiple layers in a stacked way. The proposed CNN model had two main feature extraction blocks. Each feature extraction block contained a convolution layer followed by a batch normalization layer, max-pooling layer and a dropout layer. The convolution layer fetched the features from the input encoded sequence by a self-regulating mechanism. To get the optimized model, different layers had different parameter settings. In the first feature extraction stage, the convolution layer used 32 filters of size 9 with strides equal to 1 with the dropout layer being set with a ratio of 0.1, while the convolution layer of the second feature extractor used eight filters of size 5 with a single stride and the dropout ratio in this block was set to 0.25. The max-pooling layer of both stages used a pool-size of 4 and the stride value was kept at 1. Batch normalization was used in both blocks to achieve model stability by normalizing the extracted features.  Feature extractor blocks were followed by an LSTM layer which gives an optimal interpretation to the extracted features. It helped the architecture to learn the internal representation of the input sequence. Further LSTM also solved the vanishing gradient problem by adding extra interactions. The output of the LSTM layer was unstacked using the flatten layer. The final feature vector was then given to the three fully connected layers to get the classification of 4mC and non-4mC sites. Table 3 shows the neural network architecture details for iRG-4mC. To avoid the over-fitting problem, we used L2 regularization mechanism for weights and bias of the filters. The regularization rates were set to 0.0001. The ReLU activation function was used in convolution, LSTM and first two dense layers. The numerical representation of ReLU is,

MDR
The third dense layer is have a single neuron, so it uses a sigmoid activation function which is represented as, The loss function used in this study was binary cross entropy and stochastic gradient descent (SGD) was used as an optimizer with momentum set at 0.7 and learning rate set at 0.005. The SGD was used as it achieved faster iterations to reduce the system complexity. All filter sizes, number of filters, number of convolution layers, pool sizes, stride length and dropout values were optimized using hyper-parameter tuning.

Figure of Merits
The iRG-4mC tool for training datasets was evaluated using k-fold cross-validation, with the value of k set to 10 for a fair comparison with the previous methodology. The whole dataset under consideration was divided into k subsets, where a single subset was kept for testing while the other subsets were used for training purpose. This method was iteratively repeated till the time all subsets were considered for testing. For getting the final performance evaluation of the model, an average was taken of k-trials. For selecting the figure of merits, we followed the previous related work so that we could compare the similar figure of merits including sensitivity (Sn), specificity (Sp), accuracy (ACC) and Matthews correlation coefficient (MCC). The metrics can be defined mathematically using the equations: where, True Positive (TP) = 4mC correctly classified as 4mC False Positive (FP) = Non 4mC incorrectly classified as 4mC True Negative (TN) = Non 4mC correctly classified as Non 4mC False Negative (FN) = 4mC incorrectly classified as Non 4mC Besides these matrices, we used the Receiver Operating Characteristic (ROC) curve to evaluate the performance of the proposed model.

Results and Discussion
We evaluated the iRG-4mC tool on four datasets, where two datasets were for F. vesca genome and the other two are for R. chinensis. For each genome one training dataset and one independent dataset was taken into consideration. The performance evaluation of the training dataset was done using k-fold cross-validation where the value of k was kept at 10. The 10-fold cross-validation was adopted because i4mC-ROSE have also used the same technique and keeping a similar experimental setup allowed having better comparative analysis between both techniques. For the performance evaluation of independent datasets, the training was performed on its respective training dataset and tested on the independent dataset. Table 4 shows the performance comparison between the proposed model and the i4mC-ROSE model for all datasets taken into account. As can be seen, the proposed iRG-4mC tool showed a noticeable improvement in performance than i4mC-ROSE. In the case of F. vesca (training/independent) dataset, the proposed model showed better results on all evaluation matrices, while in the case of R. chinensis (training/independent) an improvement was seen in all of the matrices except specificity. The specificity of i4mC-ROSE was slightly higher than the proposed model. As can be observed however, that i4mC-ROSE had a high difference between sensitivity and specificity for all the datasets. This difference depicted that the i4mC-ROSE tool was biased towards one class and that led to a biased decision from the tool. However, in the case of iRG-4mC, the difference between sensitivity and specificity was minimum, which also led to an improvement in MCC. The iRG-4mC tool achieved high accuracy in all the datasets. The achieved accuracies for F. vesca (training), F. vesca (independent), R. chinensis (training) and R. chinensis (independent) were 0.867, 0.859, 0.871 and 0.865 respectively.  Figure 2 shows the graphical performance comparison between the two techniques. Figure 3 shows the ROC curve of 10 folds for F. vesca dataset. An ROC curve is the plot between sensitivity and specificity at different classification thresholds, while the area under the ROC curve (AUC) is the accumulated performance measure over all possible thresholds. The mean AUC achieved was 0.903 with a standard deviation of 0.03.

Conclusions
Getting inspired from the importance of N4-methylcytosine sites identification, we have proposed an identification computational tool named as iRG-4mC. The proposed identification tool is for the Rosaceae genome and that is why two different species F. vesca and R. chinensis are considered in this study. In proposed tool the input sequences are encoded using combination of one-hot encoding and nucleotide chemical properties (NCP). The final attained sequence is given to the neural network architecture for classification between 4mC and non-4mC sites. The neural network architecture contains two blocks for feature extraction followed by LSTM for feature optimization and three fully connected layers for final prediction. The architecture is optimized by hyper parameter tuning. Different figure of merits are taken into account to have comparison with existing method. The achieved results have illustrated great improvement in performance by the iRG-4mC tool. This computational tool can be of great importance for the researchers from the field of biology and bio-informatics. A user friendly web-server is made for the researcher's convenience. The webserver is freely available at: http://nsclbio.jbnu.ac.kr/tools/iRG-4mC/.