A Novel Sequence-Based Feature for the Identiﬁcation of DNA-Binding Sites in Proteins Using Jensen–Shannon Divergence

: The knowledge of protein-DNA interactions is essential to fully understand the molecular activities of life. Many research groups have developed various tools which are either structure-or sequence-based approaches to predict the DNA-binding residues in proteins. The structure-based methods usually achieve good results, but require the knowledge of the 3D structure of protein; while sequence-based methods can be applied to high-throughput of proteins, but require good features. In this study, we present a new information theoretic feature derived from Jensen–Shannon Divergence (JSD) between amino acid distribution of a site and the background distribution of non-binding sites. Our new feature indicates the difference of a certain site from a non-binding site, thus it is informative for detecting binding sites in proteins. We conduct the study with a ﬁve-fold cross validation of 263 proteins utilizing the Random Forest classiﬁer. We evaluate the functionality of our new features by combining them with other popular existing features such as position-speciﬁc scoring matrix (PSSM), orthogonal binary vector (OBV), and secondary structure (SS). We notice that by adding our features, we can signiﬁcantly boost the performance of Random Forest classiﬁer, with a clear increment of sensitivity and Matthews correlation coefﬁcient (MCC).


Introduction
Interactions between proteins and DNA play essential roles for controlling of several biological processes such as transcription, translation, DNA replication, and gene regulation [1][2][3].An important step to understand the underlying molecular mechanisms of these interactions is the identification of DNA-binding residues in proteins.These residues can provide a great insight into the protein function which leads to gene expression and could also facilitate the generation of new drugs [4,5].
Until now, several groups have published different studies based on either experimental or computational identification of DNA-binding proteins [1,[6][7][8][9][10][11] as well as residues in these proteins [12][13][14][15][16][17][18][19][20][21][22][23].However, the usage of experimental approaches for the determination of binding sites is still challenging since they are often demanding, relatively expensive, and time-consuming.To overcome the difficulty of experimental approaches, it is highly desired to develop fast and reliable computational methods for the prediction of DNA-binding residues.For this purpose, several state-of-the-art prediction methods have been developed for the automated identification of those residues.Such methods can be assigned into two main categories: (i) based on the information observed from structure and sequence in a collective manner; (ii) based on the features derived directly from the amino acid sequence alone (for more detail see reviews [24] and [25]).Although the first type of approaches provides promising information about DNA-binding residues in proteins, their application is difficult due to the limited number of experimentally determined protein structures.In contrast to structure-based approaches, sequence-based methods have been developed by extracting different sequence information features, like amino acid frequency, position-specific scoring matrix (PSSM), BLOSUM62 matrix, sequence conservation, etc. [3,4,18,19,26,27].Using these features, several machine learning techniques have been applied to construct the classifiers for the prediction of binding residues in proteins.To this end, a variety of support vector machine (SVM) classifiers have been developed in recent studies [2,[17][18][19]23,26,28].For example, Westhof et al. have recently used an SVM classifier approach in their study, named RBscore (http://ahsoka.u-strasbg.fr/rbscore/), by using the physicochemical and evolutionary features that are linearly combined with a residue neighboring network [2].Further, SVM algorithms were also applied for the models proposed in BindN [18], DISIS [19], BindN+ [23], DP-Bind [27] using different sequence information features including the biochemical property of amino acids, sequence conservation, evolutionary information in terms of PSSM, the side chain pKa value, hydrophobicity index, molecular mass and BLOSUM62 matrix.In addition, other machine learning classifiers such as neural network models [13,15], naive Bayes classifier [26], Random Forest classifiers (RF) [4,29,30] have been developed based on the features derived from protein sequences.For example, Wong et al. [29] have recently developed a successful method using RF classifier with both DNA and protein derived features to predict the specific residue-nucleotide interactions for different DNA-binding domain families.
Despite the rich literature on the sequence-based methods as mentioned above, to date there is still a need to find suitable feature extraction approaches that can enhance the characteristics of DNA-binding residues and thus help to improve the performance of existing methods for identification of DNA-binding residues in proteins.For this aim, we introduce and evaluate a new information theory-based method for the prediction of these residues using Jensen-Shannon divergence (JSD).As a divergence measure based on the Shannon entropy, JSD is a symmetrized and smoothed version of the Kullback-Leibler divergence and is often used for different problems in the field of bioinformatics [31][32][33][34][35].In this study, following the line of Capra et al. [34] we first quantify the divergence between the observed amino acid distribution of a site in a protein and the background distribution of non-binding sites by using JSD.After that, in analogy to our previous studies QCMF [32] and CMF [36], we incorporate biochemical signals of binding residues in the calculation of JSD that results in the intensification of the DNA-binding residue signals from the non-binding signals.
To demonstrate the performance and functionality of our proposed approach, we apply Random Forest (RF) classifier using our new JSD based features together with three widely used machine learning features, namely position-specific scoring matrix (PSSM), secondary structure (SS) information, and orthogonal binary vector (OBV) information (see review [24]).Our results show that using JSD based features, RF classifier reaches an improved performance in identifying DNA-binding residues with a significantly higher Matthews correlation coefficient (MCC) value in comparison to using previous features alone.Although we only applied RF classifier in this study, both of our sequence-based features could be used in other classifiers such as SVM, neural networks, or decision trees.

Results
In this study, we introduce new sequence-based features using JSD to improve the performance of previous machine learning approaches in identification of DNA-binding residues in proteins.For this purpose, we propose new sequence-based features (f JSD and f JSD-t ) using JSD in two different ways.First, using JSD, we calculate the divergences between observed amino acid distributions in multiple sequence alignments (MSAs) of proteins under study and the background distribution which is calculated according to amino acid counts at non-binding residue positions in MSAs.In the second step, we transform the observed amino acid distributions with a doubly stochastic matrix (DSM) to enhance the weak signal of binding sites in proteins which could not be predicted in the first step.Finally, we calculate for each residue in proteins JSD-based scores and use them for the improvement of the performance of machine learning approaches.
To evaluate our new features, we use two frequently considered cut-off distances of 3.5 Å and 5 Å and thus define a residue in a protein as DNA-binding if the distance between at least one atom on its backbone or side chain and the DNA molecule is smaller than the considered cut-off.
The Results section of this study comprises of two parts.First, we investigate the functionality of our new features combining them in Random Forest (RF) classifier with three previous features.The RF classifier is constructed from 4298 positive and 44,805 negative instances extracted from 263 proteins.The performance of the classifier is evaluated using a five-fold cross validation procedure in which we randomly divided the samples into five parts.The assessment is performed by choosing each of these parts as a test set and the remaining four parts as a training set for model selection.Second, to illustrate the usefulness of our new approach for the prediction of DNA-binding residues, we analyzed the proto-oncogenic transcription factor MYC-MAX (PDB-ID: 1NKP) which is a heterodimer protein complex of two proteins.It is important to note that this protein complex is not included in the training dataset.

Random Forest Classifier
To apply the Random Forest (RF) classifier, we combine our new features (f JSD and f JSD-t ) with the features f PSSM , f OBV , and f SS which are widely used for the prediction of DNA-binding residues.Our results show that using our features RF classifier reaches an improved performance in identifying DNA-binding sites with clearly higher statistical values (see Tables 1 and 2).Moreover, we individually evaluated the combination of our features with existing features.The results suggest that the classifier with f JSD-t feature has provided better sensitivity and comparable Matthews correlation coefficient (MCC) values in comparison to f JSD feature.However, its specificity is moderately decreased.A further comparison reveals that the usage of our both features together with other features does not affect the performance of the classifier.The details are presented for 3.5 Å in Table 1 and for 5 Å in Table 2 and in Appendix A with the standard error of each of the performance measures over the values obtained in the five iterations (see Tables A1 and A2).To further investigate the performance of JSD-based features proposed in this study, we analyzed two additional datasets, namely RBscore [2] and PreDNA datasets [37].Although the RBscore and PreDNA datasets initially contain 381 and 224 DNA-binding proteins, respectively, we have eliminated a few proteins since they are either included in our training dataset or ineligible due to their MSAs.Consequently, we constructed RF classifier using 263 proteins (which were also used for cross-validation) and randomly selecting 60 proteins from each dataset for testing, respectively.The results of these analyses consistently suggest that our new features show great complementary effect to the previous features which often leads to clear improvement of the classification performance (see Tables 3 and 4).The detailed performance of classifier on different features using different cut-offs for each dataset can be found in Appendix A (see Tables A3-A6).
Considering the AUC-ROC and AUC-PR as the only evaluation factor, results indicate that the RF classifier often achieved its best performance based on both cut-off distances if we combine our new f JSD-t feature together with the existing three features (see Tables 1-3).Interestingly, by analyzing the PreDNA dataset we observed that RF classifier with f JSD or f JSD-t features for the cut-off of 3.5 Å showed similar performance.However, regarding to the distance cut-off of 5 Å, the classifier with f JSD feature reached slightly better performance than those with f JSD-t feature (see Table 4).After looking at the overall performances, it is inferred that adding our new features can boost the performance of the RF classifier in terms of AUC-ROC and AUC-PR.

Position Analysis of the MYC-MAX Protein
The proto-oncobenic transcription factor MYC-MAX (PDB-Entry 1NKP) is a heterodimer protein complex that is active in cell proliferation and is over-expressed in many different cancer types [38].MYC-MAX transcription factors bind to Enhancer boxes (a core element of the promoter that consists of six nucleotides) and activate transcription of the underlying genes [39].
The amino acid chain of MYC protein consists of 88 residues, ten of which are known DNA-binding sites indicating that their distances to DNA are less than 3.5 Å. Applying RF classifier, which takes a majority vote among the random tree classifiers, with our first feature (f JSD ) combined with existing features, we predicted in total 17 residue positions to be DNA-binding in MYC protein.
Seven out of these positions (H906, N907, E910, R913, R914, P938, K939) correspond to the true DNA-binding sites of this protein.While the sites R913, R914, P938, and K939 could also be identified by RF classifier without using our new JSD-based features, the remaining three binding sites could only be detected using our features (for details see Table 5 and Figure 1).Interestingly, using f JSD-t together with f PSSM , f OBV , and f SS , the RF classifier correctly predicted these seven positions again as binding sites.
The second protein in the proto-oncobenic transcription factor complex is the MAX protein which consists of 83 residues including nine DNA-binding sites.Using f JSD or f JSD-t together with existing features individually, we observed 14 and 13 residue positions to be DNA-binding in MAX protein, respectively.Eight of the predicted positions (H207, N208, E211, R212, R214, R215, S238, R239) found by using either of our both features are true DNA-binding sites in MAX protein.However, without using our new features the RF classifier could only identify two (S238, R239) out of nine true DNA-binding sites in MAX protein (for details see Table 5 and Figure 1).Further, we observed that, the usage of f JSD-t leads to the reduction of false positive predictions in identifying DNA-binding sites in MAX protein.Moreover, when statistically evaluating both of our features, we observed that using our sequence-based features RF classifier reaches a significantly improved performance in identifying DNA-binding sites of both proteins with significantly higher sensitivity and MCC values whereas the specificity is moderately decreased.The simultaneous usage of both of our features together with f PSSM , f OBV , and f SS could result in the decrement of specificity or MCC values.The details are presented in Table 5.

Materials and Methods
In this section, we describe in particular the data we have used and our new residue-wise features designed to predict DNA-binding sites in proteins.

Materials
To compile our data needed for training and test, we started with the DBP-374 data set of representative protein-DNA complexes from the Protein Data Bank (PDB) [40] published by Wu et al. [5].Having performed a comparison with the new PDB version, we calculate for every remaining protein a multiple sequence alignment (MSA) using HHblits and the UniProt20 database (version from June 2015) [41].We eliminated all proteins, the MSA of which has less than 125 rows, so that we finally ended up with a dataset of 263 protein-DNA complexes and associated MSAs.To obtain our results we perform a five-fold cross validation.
As in [5], an amino acid residue is regarded as a binding site, if it contains at least one atom at distance of less than or equal to 3.5 Å or 5 Å from any atom of DNA molecule in the DNA-protein complex.Otherwise it is treated as non-binding site.For the distance cut-off of 3.5 Å, our set contains 4298 binding sites and 44, 805 non-binding sites.For the distance cut-off of 5 Å, however, our data set contains 7211 binding sites and 41, 892 non-binding sites.

Methods
Let M be a multiple sequence alignment, where its first row represents the protein under study.Every residue of that protein is then uniquely determined by its column.In what follows, we identify the residues of the protein with their columns of the MSA.
Grosse et al. [35] pointed out that the Jensen-Shannon divergence (JSD) is extremely useful when it comes to discriminate between two (or more) sources.Capra and Singh [34] carefully discussed several information theoretic measures like Shannon entropy, von Neumann entropy, relative entropy, and sum-of-pair measures to assess sequence conservation.They were the first using JSD in this context and stated its superiority.Gültas et al. [32] showed that the Jensen-Shannon divergence in the context of quantum information theory is of remarkable power.These three articles encouraged us to use JSD in this study.Our first idea is to design a new feature for the prediction of DNA-binding sites in proteins which leverages the Jensen-Shannon divergence (1) Therein, p k is the empirical amino acid distribution of the k-th column of the query MSA M, and p nd is the null distribution taken over all non-binding sites of our training data.
More precisely, we represent every column k of every MSA M considered by a 20 × 20 counting matrix C M k .The matrix C is symmetric and its rows as well as columns are indexed by the 20 amino acids.For every ordered pair of amino acids a, a , the matrix coefficient C M k aa is equal to the number of ordered pairs (i, j) (i = j) of row indices of M such that M ik = a and M jk = a .
To compute the null distribution p nd , we first set up the 20 × 20 counting matrix C nd using our training data.C nd is the sum over all matrices C M k , where M ranges over all training MSAs and k ranges over all non-binding site columns of M. Next, the rows of C nd are added up.Finally, the resulting row vector is normalized to obtain p nd .
There is nothing wrong with the idea that a large value JSD (p k p nd ) indicates that k is a DNA-binding residue.However, no information on binding sites is integrated.Only the non-binding sites of our training data are used to compute p nd .As we have seen in [32] and [36], transforming empirical amino acid distributions of MSA columns by a carefully designed doubly stochastic matrix is an effective way to integrate the binding site signals.To this end, we first set up a counting matrix C bind in a way similar to that of calculating the matrix C nd .The difference is that the variable column index k now ranges over all binding site columns of the training MSAs.Taking the counting matrix C bind as input, the doubly stochastic matrix D is computed by means of the canonical row-column normalization procedure [42].
Let M be the query MSA having columns.Compared with [32] and [36], we enhance the effect of transforming M's empirical column distributions by means of the doubly stochastic matrix D just defined.Let k be a column index of M. First, we compute the matrix product Second, we add up all of C (t) M k 's rows.Finally, we normalize the resulting row to obtain the transformed empirical row distribution p (t) k .We define two window scores score JSD,M (k) and score JSD-t,M (k) of residue k w.r.t.query MSA M, where the window w(k) surrounding k formally equals {k − Recapitulate that for any real x the binomial coefficient ( x 2 ) equals x(x − 1)/2.We define the scores as follows.
score JSD,M (k The preceding two score definitions are motivated as follows.Bartlett et al. [43] and Panchenko et al. [44] pointed out that exploiting conservation properties of spatial neighbors is useful to predict a residue as functionally important.Since the 3D structures are often unavailable, Capra and Singh [34] developed a window score for such predictions.The concrete shape of our scores takes pattern form Janda et al. [45], who in turn refer to Fischer et al. [33].Our scores are convex combinations of the Jensen-Shannon terms associated with the residues belonging to the surrounding window w(k).The weights fall linearly in the distance from k.
In a last step, we transform two window scores according to Equations ( 2) and ( 3) with respect to the query MSA M into final scores using the Equations ( 4) and ( 5), respectively.To this end, for every column index k ∈ {1, 2, . . ., } of M we define: The Equations ( 4) and ( 5) are basically the determination of the percentage of scores below the current one at index k.This transformation procedure is essential because it converts MSA-dependent window scores to MSA-independent scores.
To demonstrate the benefit of our new features, we adopt the features f PSSM , f OBV and f SS devised in [5].Together with our two new features f JSD and f JSD-t , we plugged them into the Random Forest (RF) classifier [46] (see Tables 1 and 2 for the combinations we used).For the RF implementation we used the WEKA data mining software [47].
To deal with the imbalanced data problem, we applied bagging techniques suggested in [48].Since we make use of five-fold cross validation, we randomly split the dataset into 5 roughly equal-sized parts.Every training phase performed on 4 parts consists of 11 sub-phases.In each such sub-phase we randomly draw twice as many non-binding sites as there are binding sites.We then construct a Random Forest (RF) taking those non-binding sites and all binding sites of the 4 parts as input.Finally, for each instance of the validation part the majority vote of above 11 RF classifiers was taken.

Discussion
Our results show that combining either feature f JSD-t or feature f JSD with the three features f PSSM , f OBV and f SS we have adopted from [5] clearly boosts the performance of the RF-based classifier in identifying the DNA-binding sites in proteins, where feature f JSD-t generally reaches a slightly better performance than feature f JSD .
Although our two new features and PSSMs are derived from MSAs, Tables 1 and 2 clearly demonstrate that these approaches carry distinct information.Thus they capture different kinds of evolutionary information.The reason for this essential difference can be explained based on the underlying algorithms.While the PSSM approach consists of statistic which indicates how likely a certain amino acids occurs at a certain position, our JSD-based approach measures the divergence of a certain distribution to a known non-binding site distribution.
The superiority of feature f JSD-t to feature f JSD deserves an explanation attempt.Feature f JSD does not integrate any information on DNA-binding sites.Only training non-binding sites are used.In contrast, feature f JSD-t additionally uses a doubly stochastic matrix gained from the training binding sites.The effect on empirical amino acid column distributions of the transformation we have devised using that matrix is the following.The empirical column probabilities of amino acids are merged, if it is very likely to co-observe them in a binding site column.Since the amino acid content of binding site columns and non-binding site columns differ, the distance between f JSD-t,M (k) and f JSD-t,M (k ) is larger and more significant than the distance between f JSD,M (k) and f JSD,M (k ), where k is a binding site column of MSA M, and k is a non-binding site column.
At first glance it is surprising that adding both feature f JSD-t and feature f JSD to the feature triplet f PSSM , f OBV , f SS is worse than adding feature f JSD-t alone.Taking into account what we have mentioned in the preceding paragraph, it turns out that if feature f JSD-t is already there, feature f JSD may increase the noise.

Conclusions
In this work, we report a new sequence-based feature extraction method for the identification of DNA binding sites in proteins.For this purpose, we adopt the ideas from Capra et al. [34] and our previous studies CMF [36] and QCMF [32].Our approach is an information theoretic method that applies the Jensen-Shannon divergence (JSD) for amino acid distributions of each site in a protein in two different ways.First, the JSD is applied to quantify the differences between observed amino acid distributions of sites and the background distribution of non-binding sites.Second, we transform the observed distributions of sites through a doubly stochastic matrix to incorporate biochemical signals of binding residues in the calculation of JSD that results in the intensification of the DNA-binding residue signals from the non-binding signals.The results of our study show that the additional usage of our new features (f JSD-t or feature f JSD ) in combination with existing features is significantly boosts the performance of RF classifier in identifying DNA binding sites in proteins.Our results further indicate the importance of our second feature (f JSD-t ) since taking into account the binding site signals in the calculation of JSD metric, the characteristics of DNA binding residues are enhanced.As a consequence, an intensification of the signal caused by DNA binding sites from non-binding sites occurs and thus the classifier achieves its improved performance.

Figure 1 .
Figure1.DNA-binding sites in proto-oncobenic transcription factor MYC-MAX protein complex (PDB-Entry 1NKP).Green spheres denote positions of the DNA-binding sites in both proteins which are detected by RF classifier either using the existing features (f PSSM , f OBV , and f SS ) alone or combining our new features with these existing features together.Purple spheres show the localization of additional binding sites which were only found by RF classifier using our new features with existing features.Moreover, there are further three binding sites in MYC protein and one binding site in MAX protein, shown with yellow spheres, that could not be identified by the classifier.

Appendix A. 1 .
Performance Measures with Standard Error

Table 1 . Prediction performance of Random Forest (RF) classifier on different features using a cut-off of 3.5 Å. The prediction system was evaluated by five-fold cross validation. Feature Sensitivity Specificity MCC AUC-ROC AUC-PR
MCC: Matthews correlation coefficient; AUC-ROC: area under the receiver operating characteristics (ROC) curve; AUC-PR: area under the precision-recall curve.

Table 2 .
Prediction performance of Random Forest (RF) classifier on different features using a cut-off of 5.0 Å.The prediction system was evaluated by five-fold cross validation.
MCC: Matthews correlation coefficient; AUC-ROC: area under the receiver operating characteristics (ROC) curve; AUC-PR: area under the precision-recall curve.

Table 3 .
Prediction performance of Random Forest (RF) classifier on RBscore dataset using different distance cut-offs.
MCC: Matthews correlation coefficient; AUC-ROC: area under the receiver operating characteristics (ROC) curve; AUC-PR: area under the precision-recall curve.

Table 4 .
Prediction performance of RF classifier on PreDNA dataset using different distance cut-offs.

Table 5 .
Prediction performance of RF classifier on different features using a cut-off of 3.5 Å for MYC-MAX protein complex (Protein Data Bank (PDB)-Entry 1NKP).

Table A1 .
Prediction performance of Random Forest (RF) classifier on different features using a cut-off of 3.5 Å.The prediction system was evaluated by five-fold cross validation.

Table A2 .
Prediction performance of Random Forest (RF) classifier on different features using a cut-off of 5.0 Å.The prediction system was evaluated by five-folds cross validation.

Table A3 .
The detailed prediction performance of Random Forest (RF) classifier on different features using a cut-off of 3.5 Å.
MCC: Matthews correlation coefficient; AUC-ROC: area under the receiver operating characteristics (ROC) curve; AUC-PR: area under the precision-recall curve.

Table A4 .
The detailed prediction performance of Random Forest (RF) classifier on different features using a cut-off of 5.0 Å. : Matthews correlation coefficient; AUC-ROC: area under the receiver operating characteristics (ROC) curve; AUC-PR: area under the precision-recall curve. MCC

Table A5 .
The detailed prediction performance of Random Forest (RF) classifier on different features using a cut-off of 3.5 Å.
MCC: Matthews correlation coefficient; AUC-ROC: area under the receiver operating characteristics (ROC) curve; AUC-PR: area under the precision-recall curve.

Table A6 .
The detailed prediction performance of Random Forest (RF) classifier on different features using a cut-off of 5.0 Å.
MCC: Matthews correlation coefficient; AUC-ROC: area under the receiver operating characteristics (ROC) curve; AUC-PR: area under the precision-recall curve.