Identification of DNA–protein Binding Sites through Multi-Scale Local Average Blocks on Sequence Information

DNA–protein interactions appear as pivotal roles in diverse biological procedures and are paramount for cell metabolism, while identifying them with computational means is a kind of prudent scenario in depleting in vitro and in vivo experimental charging. A variety of state-of-the-art investigations have been elucidated to improve the accuracy of the DNA–protein binding sites prediction. Nevertheless, structure-based approaches are limited under the condition without 3D information, and the predictive validity is still refinable. In this essay, we address a kind of competitive method called Multi-scale Local Average Blocks (MLAB) algorithm to solve this issue. Different from structure-based routes, MLAB exploits a strategy that not only extracts local evolutionary information from primary sequences, but also using predicts solvent accessibility. Moreover, the construction about predictors of DNA–protein binding sites wields an ensemble weighted sparse representation model with random under-sampling. To evaluate the performance of MLAB, we conduct comprehensive experiments of DNA–protein binding sites prediction. MLAB gives MCC of 0.392, 0.315, 0.439 and 0.245 on PDNA-543, PDNA-41, PDNA-316 and PDNA-52 datasets, respectively. It shows that MLAB gains advantages by comparing with other outstanding methods. MCC for our method is increased by at least 0.053, 0.015 and 0.064 on PDNA-543, PDNA-41 and PDNA-316 datasets, respectively.


Introduction
DNA-protein interactions exert a crucial influence on diverse biological processes and is primal for cell metabolism. Contemporary researchers have scrutinized a considerable number of DNA and protein sequences including DNA-binding proteins. In addition, there is no lack of time-consumption in silico methods. Furthermore, the experimental determination of binding sites is always difficult and is not readily feasible all the time. Therefore, forecasting by statistical learning, which had been riveted by a lot of academics conducting surveys on DNA-protein binding sites, established in the field of computational and molecular biology, should be taken for granted. Several computational methods, which had been developed to identify DNA-binding sites in proteins, were generally based on protein sequence, protein structure or through integrating the aforementioned information. Most of these investigations are the methods that depended on machine learning techniques. functional binding sites are occurring in the former, whereas they are absent at the corresponding local regions of protein space in the latter. Moreover, protein functions incline to be evolutionarily conserved in these local regions. As a result, it requires that classifiers need to capture the hints of local functional conservation as fully as possible. Based on this leitmotiv, we devise an algorithm that is called Multi-scale Local Average Blocks (MLAB) to further extract local information from PSSM. The PSA information is also utilized to ameliorate the accuracy of prediction. Due to the number of DNA-binding residues (minority class) being significantly lower than that of non-binding residues (majority class), sample rescaling as straightforward strategy is adopted to deal with the issue of imbalanced data classification. To further handle the imbalanced problem, we employ an Ensemble Classifier with

Materials and Methods
For the sake of delving DNA-protein binding residues with computational methods, one of the major challenges is to fully describe the salient points of knowledge about DNA-protein binding sites in an adequate and concise way. Prediction of DNA-protein binding residues could be regarded as a traditional binary classification problem from the view of machine learning. Therefore, how to effectively extract feature from protein sequences turns out to be the preoccupation. Since the binding residue is not isolated from each other, we have the convention that 11 vicinal amino acid residues as a window (w = 11), where the window specifically indicates the target residue and 5 neighbors on either side of the target residue itself. In light of this definition, an idiosyncratic multi-dimensional coding vector can be listed seriatim, which derives from the two aforementioned attributes of evolutionary conservation and predicted relative solvent accessibility. With the above information, the ML scenario is applied to build a prediction model about identifying DNA-protein binding sites.

Feature Extraction via Position Specific Scoring Matrix
By referring to the the form of Position Specific Scoring Matrix (PSSM), evolutionary conservation of protein sequence could be abstracted and generated by the de facto tool PSI-BLAST [41] (BLAST+ [42] options: -num_iterations 3 -db nr -inclusion_ethresh 0.001). The evolutionary information from PSSM is stored in a matrix of dimensions L × 20 (L rows and 20 columns), formulated as while each element in PSSM is calculated as where γ(i, k) is the frequency of k-th amino acid type at the position i; d(k, j) is the value of about the element in Dayhoff's mutation matrix (substitution matrix), which corresponds to the amino acids between k-th and j-th type. The substitution matrix, which usually is wielded in DNA or protein sequence alignment, can describe the rate that certain kinds of characters in a protein sequence change to some other kind of character with time elapsing. As a supplementary, small values indicate that there is less conservatism in the corresponding areas, whereas large values indicate quite conservative zones. These values are normalized to 0-1 range with min-max normalization. The original PSSM (see Equation (1)) is normalized as where p i,j represents the original score of PSSM. While the normalized PSSM (PSSM ) is represented as To distinguish DNA-binding and non-DNA-binding proteins, we need to justify the functional binding sites whether they occur at the corresponding local regions of protein space. As a further step of parenthetical explanation, protein functions in these local regions are inclined to be evolutionarily conserved. Considering this, we conceive an algorithm called Multi-scale Local Average Blocks (MLAB), which is enlightened by the Average Blocks (AB) approach that was proposed by Jeong et al. [43]. In virtue of extracting local information from normalized PSSM, Jeong et al. forecast protein function through ways that divide a protein sequence into b blocks and need not care much about the length of sequence. This idea has also been adopted in other bioinformatical issues [44]. On the occasion of this paper, each block consists of 20 features which are derived from 20 columns in PSSM. Similar to that, we formalize the value of attributes with a target by means of a window with 11 residues, and then obtain a vector of normalized PSSM scores whose gross amount is 11 × 20 = 220. However, it is quite imperative that, different from the AB algorithm, fixed size is changed into multi-scale size in our scheme; thus, the matrix is split in a horizontal manner. The PSSM-based Multi-scale Local Average Blocks (PSSM-MLAB) features can describe the relationship between target residue and neighboring residues in different resolutions.
More specifically, we partition the normalized PSSM of the target residues into six segmentations with varying composition, including global zone (A), bisection (B and C) and trichotomy (D, E and F). These segments can adequately portray multiple overlapping continuous and discontinuous interaction patterns that are schematically shown in Figure 1. The mean value of each local block is calculated with the formula as Mat k (i, j) (i = 1, . . . , B k ; j = 1, . . . , 20; k = 1, . . . , 6), where LAB(k, j) regards the mean value of k-th block in the column j; B k stands for the amount of rows in block k; and Mat k (i, j) represents the value of cell in i-th row and j-th column of block k.
Recapitulating the MLAB algorithm, through combining with normalized PSSM and partitioning manipulation on an entire sequence, we can gain a 6 × 20 = 120 dimensional feature vector.

Predicted Solvent Accessibility
Solvent accessibility has profound significance because it is closely affiliated with not only the spatial assignment of configuration, but also the swathing attitude about residues during the process of protein folding. It also coincides with the fact that there is a non-negligible association between solvent accessibility and DNA-protein interactions. The post hoc actuality has been instantiated, such as research by Ahmad et al. [45], who has demonstrated the importance of solvent accessibility to amino acid residues in predicting DNA-protein binding. By uniting the Solvent Accessibility prediction, which has been implemented with the de facto tool Nearest Neighbor method (SANN) [46], we can obtain the Predicted Solvent Accessibility (PSA) characteristics of each residue for the corresponding sequence. With min-max normalization, the PSA feature can also be normalized among a range from zero to one.

Weighted Sparse Representation Based Classifier
Sparse representation [47] as a sharp weapon of compressed sensing has aroused lots of scholarly pursuit for several years. Sparse representation-based classifier(SRC) [48,49] was firstly proposed by Wright et al. for the purpose about image recognition. In contrast to conventional taxonomization approaches such as SVM, KNN, RF, etc., SRC is robust for both outliers and noisy situations. To discriminate the sample corpus, which needs to be verified, SRC demands to create a Sparse Representation Matrix (SRM) and makes a linear combination on the training set. The reconstructed residuals of test sample for each kind of classification are measured and calculated through SRM and linear combination. Ultimately, the corpus of samples will be assigned to the corresponding category, arbitrated by minimum reconstruction residual. A group of researchers [50,51] have deployed SRC in solving issues in the area of computational biology.
Suppose there are totally C kind of classifications involved in a sufficient dataset. The assignment is how to correctly determine the attribution of a newly added sample y when we put it into the original corpus. SRC picks n c training samples from the c-th classification, which correspond to the volume of each raw in x c . x c can be expressed as where m is the aggregate feature volume with regard to one sample. Thus, the training sample matrix can be written as where n = ∑ C c=1 n c represents the amount of training samples. Then, the known test sample y will approximately fall in the linear spanning region of the training samples associated with the c-th classification as While under unknown c condition, test sample y will be in line with the representation of whole training set using linear regression as where coefficient vector is α 0 α 0 α 0 α 0 could take on a kind of sparse state, whereas the size of sample about corresponding classification is huge. The critical step of SRC algorithm is selecting the α α α vector that can both satisfy Equation (9) and minimize the l 0 -norm per se with the equations aŝ Unfortunately, searching the sparsest solution for equations (10) is NP (Non-deterministic Polynomial)-hard. Still, as a remedy, by means of solving l 1 -minimization problem which belongs to convex optimization, we can eschew the l 0 -minimization problem since l 1 -minimization problem can be viewed as problem that is approximately equivalent to l 0 -minimization. For the sake of resolving this occasion in l 1 , Equation (10) can be transformed into an expression aŝ where reflects the tolerance of reconstruction deviation.
The SRC approach allocates the label of test pattern y w.r.t. category c according to equations where v c y denotes the residuals between y andα c 1 α c 1 α c 1 X (category c). Thus, v y = min v c y means sample y will be assigned to the category that owns minimal residuals.
To fix the problem about instability of SRC which may be aroused by noise pollution, Lu et al. [52] have proposed the Weighted Sparse Representation based Classification (WSRC) method, which deals with the whole training set as a vocabulary, and imposes the locality on the weighted l 1 regularization.
where Λ is a diagonal matrix about locality adaptor as Moreover, λ denotes to the Euclidean distance from y to x c i , which is expressed as where i indicates the sample index of training set w.r.t. category c. σ corresponds to the Gaussian kernel width. y, x c i represents test sample and training sample, respectively. In addition, the values of Gaussian distance can be viewed as the weight of each sample in training sets.
Nevertheless, the output of WSRC just meets a straightforward preliminary mapping that each residual corresponds to a certain kind of classification (without prediction score w.r.t. each category), since there is a greater possibility for the minimum residue of the corresponding category. For the convenience of projecting the output about WSRC between [0, 1], there are three types of scores for binding sites prediction, which are represented as where v binding (y) and v non−binding (y) are the deviation of reconstruction about WSRC when assigning test sample y w.r.t. binding and non-binding site, respectively. The assessment about the performance with the aforementioned three types of score is shown in the experimental evaluation section.
In practice from our research, v 1 (y) = v binding (y) and v 2 (y) = v non−binding (y). The accomplishment of feature extraction implies that no matter the target residue binding sites or non-binding sites, all of them have been converted to numerical feature vectors that have their own identical dimension. The feature space for every target residue binding site is comprised of two parts that are PSSM-MLAB features ( f PSSM−MLAB ) and PSA features ( f PSA ), respectively. It needs to be reaffirmed that all of the feature vectors have been normalized by means of min-max normalization.

Ensemble Classifier and Random Under-Sampling
Imbalanced datasets, which are characterized as a larger ratio size between non-binding examples (majority category) and binding examples (minority category), always exist in the issue of classification. Exploiting a schema with ensemble classifier [53,54] is a fashionable way. Consequently, we exploit an ensemble of m classifiers with bootstrap resampling strategy [53,54]. By performing random sampling on m subsets, which also be considered with replacement, from the majority category of non-binding examples, we can make all negative subsets own the same or similar size as the minority category of binding examples. After this step, every negative subset will group with the set of binding cases and generate m new training sets. Thus, m classifiers, which are represented as , can be built according to the m training sets. Finally, the outcome is voted by arithmetic mean value of the results that come from m sub-classifiers. After calculating about every score as score(y) i , we can get the final rate of voting P(y) by where P(y) also denotes the probabilistic factor of test sample y, and score(y) i reflects the probability value of i-th base classifier. The overview of the proposed ensemble model is shown in Figure 2, Ensemble Classifier with Random Under-Sampling (EC-RUS) as the scenario to deal with the imbalanced issue.

Results
We test our method on several DNA-protein binding sites datasets to evaluate the performance of our proposed approach, including PDNA-543, PDNA-41 (independent test set of PDNA-543), PDNA-335, PDNA-52 (independent test set of PDNA-335) and PDNA-316. First, we independently analyze the performance of binding site representations, such as PSSM, PSSM-MLAB and PSA. Second, we compare our method with some outstanding methods on the above datasets.

Datasets of DNA-Protein Binding Sites
PDNA-543 and PDNA-41 are independent test datasets that have been constructed by Hu et al. [13]. They collect a dataset that contains 7, 186 DNA-binding protein chains and has clear target annotations in PDB (Protein Data Bank) [55]. After removing redundant sequences by wielding CD-hit software [56], there are totally 584 non-redundant protein sequences that can be obtained and no two sequences had more than 30% identity. Then, they divide the non-redundant sequences into two sections, which are the training dataset (PDNA-543) and the independent test dataset (PDNA-41).
PDNA-335 and PDNA-52 are independent test datasets that have been employed by Yu et al. [57]. In their research, all of the protein sequences are extracted, which are based on BioLip [58] rather than on PDB [55]. Next, the maximal pairwise sequential identity of the extracted protein sequences are culled to a 40 percent level by using PISCES software (1.0, Wang, G. and Roland, L. Dunbrack Jr, Philadelphia, PA, USA) [59]. The remaining sequences constitute the training dataset. Besides that, the test set is extracted in a similar process. Moreover, if a given sequence in the validation dataset shares more than 40 percent similarity to a sequence in the training dataset, then remove the sequence from the validation dataset. Training set and independent validation test sets contain 335 and 52 protein sequences, respectively.

Evaluation Measurements
To test the robustness, the process of random selection about training and test sets, model-building and model-evaluating are performed repeatedly, which deals with the manner of ten-fold cross validation. Seven parameters, including overall prediction accuracy (ACC), sensitivity (SN), specificity (Spec), positive predictive value (Pre), and Matthew's correlation coefficient (MCC) are used in the assessing procedure. These parameters are represented as where true positive (TP) is the number of true DNA-protein binding sites that are predicted correctly; false negative (FN) is the number of true DNA-protein binding sites that are predicted to be non-binding; false positive (FP) is the number of true non-binding sites that are predicted to be binding sites, and true negative (TN) is the number of true non-binding sites that are predicted correctly. The Area Under the Receiver Operating Characteristic (AUC) is a common summary statistic that can measure the goodness of a predictor in a binary classification task. It is equal to the probability that a predictor will rank a randomly chosen positive instance higher than a randomly chosen negative one. We would like to emphasize that two WSRC parameters are set to σ = 1.5 and = 0.5, respectively.

Selecting Optimal Size of Sliding Window and Number of Base Classifiers
Different sizes of sliding window may lead to different performance. In addition, the number of base classifiers will also affect the appearance of prediction. In Hu's work [13], they use two strategies for selecting Thresholds (T): (1) they selected the threshold that makes Sen ≈ Spec, and (2) they select the threshold that makes FPR ≈ 5% ( FPR = 1 − Spec ). In virtue of that, we adjust window size from 7 to 17 residues and number of base classifiers (m) from 1 to 29, with a step size of 2, on PDNA-543 dataset over a ten-fold cross-validation with above two strategies. Hitherto, we select Equation (16) as default score function of base classifiers (WSRC, Equation (16)).
We select the optimal size by highest MCC value, and find that 11 and 19 are the best parameters of window size, whereas number of base classifiers under FPR ≈ 5%. Under the condition that Sen ≈ Spec, the value of MCC is also high (w = 11, m = 19). The result w.r.t. PDNA-543 is shown in Figure 3. As seen from the dotted curves, the MCC increases when size increases from 7 to 11. However, it slightly declines when size increases from 11 up to 17. The first maximum MCC value is achieved when m = 19 (under FPR ≈ 5%), and no improvement can be observed with larger values of m. Recapitulating these results, we set the optimal m as 19 in the investigation. The best MCC is 0.392, when window size and number of base classifiers are 11 residues and 19 with under FPR ≈ 5%, respectively.

Performance of Different Features
To analyze the performance of PSSM, PSSM-MLAB and PSA features, we evaluate these features by EC-RUS on PDNA-543 dateset. Results for PSSM, PSSM + PSA, PSSM-MLAB and PSSM-MLAB + PSA are shown in Table 2 and Figure 4. In addition, Equation (16) is also selected as default score function of base classifiers (WSRC, Equation (16) Figure 4, we can see that the fusion feature of PSSM-MLAB and PSA has better performance than the other features in the PDNA-543 dataset.  (16)) on PDNA-543 dataset.

Selecting Optimal Score Function of Base WSRC
In order to make a decision w.r.t. score function (WSRC) from Equations (16)-(18), we evaluate the above functions on PDNA-543 across a ten-fold cross-validation test with PSSM-MLAB + PSA as feature. The results of different functions on PDNA-543 are shown in Figure 5. Obviously, the performance of three different score functions are almost identical. However, the type 1 (0.392, under FPR ≈ 5%) function achieves better performance of MCC than type 2 (0.380, under FPR ≈ 5%) and 3 (0.388, under FPR ≈ 5%). Thus, we select the Equation (16) as the score function of WSRC.

Comparison with Existing Predictors on PDNA-543
We also compare the prediction performance of our proposed method with Hu's work [13] on this dataset, as shown in Table 3. Our method achieves 0.307 MCC under Sen ≈ Spec. However, Hu's work achieves 0.304 MCC under Sen ≈ Spec. Moreover, our method achieves the best MCC of 0.392 under FPR ≈ 5%. Our method obtains better prediction results than Hu's work on the PDNA-543 dataset. Table 3. Comparison with the TargetDNA on PDNA-543 dataset by ten-fold cross-validation.

Predicted Results on the Independent Test Set of PDNA-41
In this section, we use the PDNA-543 dataset as the training set and PDNA-41 as the independent test set. It has been compared with other previous works including BindN [3], ProteDNA [9], BindN+ [5], MetaDBSite [12], DP-Bind [60], DNABind [20] and TargetDNA [13] with summarizing results in Table 4. Under FPR ≈ 5%, our method (EC-RUS built with WSRC) achieves 0.9458 accuracy, 0.2725 sensitivity, 0.4292 Pre and 0.315 MCC. Comparing with Sen ≈ Spec, sensitivity declines (0.3379), specificity, accuracy, Pre and MCC rise together (0.2006, 0.1814, 0.3061 and 0.122, respectively). Furthermore, our method achieves the best MCC (0.315) under FPR ≈ 5%. Figure 6 shows the trend (including sensitivity, specificity, accuracy, Pre and MCC) on different threshold T of probability. While the threshold of probability rises, values of specificity, accuracy, Pre and MCC are synchronously rising. The trend of sensitivity and rate p/n are opposite.
In addition, we test different types of base classifiers to build an EC-RUS model. The base classifiers together contain SVM [25,61], RF [26], L1-regularized Logistic Regression (L1-LR) [62] and Sparse Bayesian Learning (SBL) [63]. Under FPR ≈ 5%, EC-RUS (SVM), EC-RUS (RF), EC-RUS (L1-LR) and EC-RUS (SBL) achieve MCC of 0.302, 0.261, 0.246 and 0.247, respectively. The MCC (0.315) EC-RUS (WSRC) is better than the above models. We can see that a sparse representation based classifier is suitable for the classification with PSSM-MLAB features. The best performance lies in the fact that weighted SRC further improves the performance of basic SRC and the easily adjusted parameter of WSRC can exert its effect fully in our experiments.  Table 4. Comparison with some state-of-the-art works on the Independent PDNA-41 dataset.

Predicted Results on the PDNA-316 Dataset
In order to highlight the advantage of our method, we also test on the PDNA-316 dataset, which is described by Si et al. [12]. We compare the prediction performance of our proposed method with other previous works including DBS-PRED [7], BindN [3], DNABindR [6], DISIS [11], DP-Bind [60], BindN-RF [4], MetaDBSite [12] and TargetDNA [13]. In Table 5, we can see that the average prediction performance of our method, such as sensitivity, specificity, accuracy and MCC are 0.8067, 0.7818, 0.7837 and 0.356 under Sen ≈ Spec, respectively. Although DISIS [11] achieves better values of specificity and ACC, and ACC, which are 0.9800, 0.9200, the sensitivity (0.1900) and MCC (0.250) are not high. Furthermore, our method (EC-RUS built with WSRC) achieves the best MCC of 0.439 under FPR ≈ 5%. It is shown that the MLAB algorithm deeply extracts the evolutional information from PSSM.
Different types of base classifiers are used to construct the EC-RUS model, and EC-RUS (SVM), EC-RUS (RF), EC-RUS (L1-LR) and EC-RUS (SBL) achieve MCC (under FPR ≈ 5%) of 0.426, 0.394, 0.319 and 0.317, respectively. Extensive experiments in our study have illustrated that WSRC is better than other classifiers. Table 5. Comparison of the prediction performance between the proposed method and some state-of-the-art works on PDNA-316 dataset.
To further evaluate our model, we employ the PDNA-335 dataset as the training set and PDNA-52 as the independent test set. Performance comparison about our method with TargetS [57], MetaDBSite [12], DNABR [64], and alignment-based predictor on the independent validation dataset of PDNA-52 is listed in Table 6. At the occasion of imbalanced learning, the MCC provides the overall measurement about the quality of binary prediction. In Yu's work [57], they implement the evaluation by choosing the Threshold (T) of probability value, in order to maximize the MCC value of prediction. Thus, we apply the same evaluation on PDNA-52. Obviously, TargetS achieves the best overall prediction performance among the nine listed predictors with the highest MCC value of 0.377, which is about 0.13 higher than that of the second-best (0.245) performer model (EC-RUS built with WSRC). The proposed method along with TargetS does not perform well on an independent PDNA-52 dataset because TargetS has equipped the residues' 3D coordinates contained in the PDB file to spatial clustering before it probes binding sites. Results of experiments show that our model is also compatible by comparing with the rest of the methods on an independent PDNA-52 dataset.
The trends (including sensitivity, specificity, accuracy, Pre and MCC) on different threshold T of probability are shown in Figure 7. EC-RUS (WSRC) achieves the highest MCC of 0.245 when T = 0.718.

Significance Analysis
We employ the Wilcoxon rank-sum test to analyze the statistical significance of MCC between other methods (including MetaDBSite and TargetDNA) and our method on PDNA-543, PDNA-41, PDNA-316 and PDNA-52 datasets. The significance level is 0.05, and results of tests are shown in Table 7. The differences between other methods and our method are not significant (MetaDBSite p-value : 0.2667, TargetDNA p-value : 0.4610). The main reason of this is that most of the above methods are based on sequence information. Hence, the increment of MCC is small. We would consider structure information in our further work.

Case of Prediction
Examples of 4X0PD (PDB ID: 4X0P, Chain: D) and 5BMZCD (PDB ID: 5BMZ, Chain: C and D) belong to the PDNA-41 dataset. We use the PDNA-543 dataset as the training set to predict two examples, which are shown in Figure 8. The orange object of the helix is a DNA chain, while the green object is the protein sequence (containing helix, fold and loop structure). Blue regions and red regions are the true prediction and false prediction, respectively. In addition, the results of two methods (our method and DP-Bind [60]) are shown in Table 8. On the 4X0P-D, the FP of our method (34) is less than DP-Bind (154). Furthermore, the FP and FN of our method (9, 3) are both less than DP-Bind (16, 7) on 5BMZ-D.    Although the WSRC is time-consuming, the performance is better than other classifiers on PDNA-41 and PDNA-52. The running times are listed in Table 9.

Discussion
Albeit many computational approaches have been proposed to prospect DNA-protein binding sites, there still have potential enhancing space for refining the state-of-the-art prediction models. Existing methods always disregard local environments that appear neither reliable nor robust. Hence, we put forward a kind of multi-scale local average blocks idea to further leach local evolutionary information from PSSM. Compared with original PSSM, the MCC of PSSM-MLAB rises by 0.014 in a PDNA-543 dataset.
In the future, we will ameliorate the forecasting performance of MLAB by refining the feature representation and classification tactics. For feature representation, we will consider the amino acid compositions and predicted secondary structures, which have been obtained as local PSSM-based features. Furthermore, well-established classifier also can be an alternative condition. Powerful classifiers such as Modified AdaBoost(MAdaBoost) [57] and LibD3C [65,66] can all be integrated into the clustering and dynamic selecting schema.