Support Vector Machine Classiﬁer for Accurate Identiﬁcation of piRNA

Featured Application: A support vector machine was used to achieve the best jackknife and the 5-fold cross-validation outcomes for identifying piRNAs (Piwi-interacting RNA) by combining these multiple features. Abstract: Piwi-interacting RNA (piRNA) is a newly identiﬁed class of small non-coding RNAs. It can combine with PIWI proteins to regulate the transcriptional gene silencing process, heterochromatin modiﬁcations, and to maintain germline and stem cell function in animals. To better understand the function of piRNA, it is imperative to improve the accuracy of identifying piRNAs. In this study, the sequence information included the single nucleotide composition, and 16 dinucleotides compositions, six physicochemical properties in RNA, the position speciﬁcities of nucleotides both in N-terminal and C-terminal, and the proportions of the similar peptide sequence of both N-terminal and C-terminal in positive and negative samples, which were used to construct the feature vector. Then, the F-Score was applied to choose an optimal single type of features. By combining these selected features, we achieved the best results on the jackknife and the 5-fold cross-validation running 10 times based on the support vector machine algorithm. Moreover, we further evaluated the stability and robustness of our new method.


Introduction
Piwi-interacting RNA (PiRNA), a newly identified class of small non-coding RNA of which the length is 26-33 nt, can combine with PIWI proteins to regulate a transcription gene silencing process, heterochromatin modifications and to maintain germline and stem cell function in animals [1][2][3][4]. However, high-throughput sequencing indicates that tens of thousands of different piRNAs produced in various animals cannot recognize transposons [5]. Therefore, the function of piRNA needs to be further investigated. Experimental verification of piRNA targets and the piRNA-targeting rules are quite difficult to prove [6]. Crosslinking immunoprecipitation (CLIP) analyses of PIWIs suggest that they associate with diverse mRNAs. However, because diverse piRNAs engage with many mRNAs, it is hard to infer the target of a given piRNA from these CLIP analyses [6][7][8]. Therefore, additional approaches are required to distinguish piRNA sites in vivo.
In recent years, several computing biology tools have been proposed to identify piRNAs. The first model to identify piRNAs was piRNApredictor, firstly developed by Zhang et al. [9]. After three years, Wang et al. [10] proposed the second model for predicting piRNAs based on the transposon interaction and SVM (Support Vector Machine). Recently, Luo et al. [11] applied the sequence information and physicochemical features of piRNAs and non-piRNAs to construct the model to predict piRNAs. In addition, Li et al. [12] used a powerful ensemble approach, which achieved a substantial improvement. According to the most attractive work by Lin et al. [13], 2L-piRNA (two-layer ensemble classifier for identifying piRNAs) can be used to identify piRNAs and their function types. 2L-piRNA yielded an accuracy of 86.1% for identifying piRNAs and non-piRNAs, and achieved an accuracy of 77.6% for identifying piRNAs with the function of instructing target mRNA deadenylation and piRNAs without the function of instructing target mRNA deadenylation.
Aiming to improve the prediction accuracy, we developed a novel predictor, 2L-piRNAPred (2-layer integrated program for identifying piRNAs in the first layer and determining if piRNAs have the function of instructing target mRNA deadenylation in the second layer.), by considering single nucleotide composition (SNC), the 16 dinucleotides compositions (DNC), six physicochemical properties in RNA, the position specificities of nucleotides both in N-terminal and C-terminal, and the proportions of similar peptide sequences of both N-terminal and C-terminal in the positive and negative samples. Consequently, F-Score was utilized to select the most efficient unique type of features. Furthermore, all the optimized feature vectors were combined to build our prediction model based on a support vector machine classifier. Both the jackknife test and 5-fold cross-validation running 10 times were implemented to test the stability and robustness of the model. In addition, the major comparison with the previous work based on 2L-piRNA showed that our model, 2L-piRNAPred, is superior both in sensitivity and specificity for the first layer and the second layer.

Datasets
We applied the same dataset as in [13]. Firstly, the piRNA sequences were originally taken from piRBASE [14] and non-piRNA sequences were obtained from [15]. Then, the CD-HIT with a cutoff threshold of 0.8 was employed to remove high-similarity sequences. Thirdly, we randomly selected the same number of negative samples as that of the positive samples to avoid the high false negative rate caused by the imbalanced dataset. Finally, there were 709 piRNA samples having the function of instructing target mRNA deadenylation (denoted as S + inst ), 709 piRNA samples without this function (denoted as S + non−inst ), and 1418 non-piRNA samples (denoted as S − ). The benchmark positive dataset was the union of S + inst and S + non−inst . Therefore, the training datasets in this study can be formulated as:

Pse-Nucleotide Composition
The concept of the pseudo amino acid composition or Chou's PseAAC (Pseudo Amino Acid Composition) was proposed in 2001 and has been rapidly applied in all fields of computational biology [16]. To learn the detailing introduction of Chou's PseAAC and its recent development and applications, we can see the comprehensive review in [17]. In this work, SNC, DNC and tri-nucleotides composition (TNC) were employed to extract sequence information, which were formulated as follows, respectively: The length o f sequence × 100 (2) where i ∈ {A, C, G, U} and the length of the sequence is the number of nucleotides in this The length o f sequence − 1 × 100, j = 1, 2, . . . L, 16 where j ∈ {AA, AC, . . . L, GT, TT} sequences; The length o f sequence − 2 × 100, k = 1, 2, . . . , 64 where k s ∈ {AAA, AAC, . . . , UUG, UUU} equences.

Split-Position-Specific Matrix
It has been indicated in [13] that N-and C-terminal segments are critical to piRNAs. In this study, we considered the amino acids distribution both in N-and C-terminals. Moreover, the number of selected nucleotides of N-and C-terminal segments decreased successively, which were set to 15, 13, 11, 9, 7 and 5 nucleotides, respectively. Then, the bi-profile Bayes (BPB) features were used to characterize the probability distributions of each nucleotide at each position. As reported in many studies [18][19][20][21], the BPB feature vector was formulated as follows: where P is the posterior probability vector; x 1 , x 2 , . . . , x n represent the posterior probability of each nucleic acid at each position in positive peptide sequence datasets, respectively; and x n+1 , . . . , x 2n represent the posterior probability of each nucleic acid at each position in negative datasets, respectively.

Six RNA Dimer's Physicochemical Properties
Six RNA dimer's physicochemical properties, including rise, roll, shift, slide, tilt, and twist, have been used in [13], and have shown decently the prediction performance. The normalized values of six physicochemical properties for 16 dimers were derived in [13], and we listed them in Table 1 for convenience. We mapped the RNA sequence to the following vector according to the physicochemical properties: Then, we obtained a physicochemical property vector with 96 dimensions (96-D).

Feature Optimization
When four types of features were incorporated to train the model, the dimension of the hybrid features vector was 240. Additionally, the initial combined features may contain redundant and noisy information. This might exert the negative effect on model training. In this work, the importance of each feature was ranked by a feature selection tool known as F-score. The F-score of the j-th feature was defined as: where x j , x k,j denotes the j-th feature of the k-th negative instance. The greater F-score indicates that the feature is more different between positive and negative samples, and is useful to classification [22].

SVM Implementation and Parameter Selection
SVM is a set of related supervised learning methods used for classification and regression based on the statistical learning theory, and has been illustrated to be powerful in many areas of bioinformatics [23][24][25][26][27]. As in other works [28][29][30][31], SVM was trained and tested by using the LIBSVM package [30] to build the model and implement the prediction. The radial basis function kernel was used in our SVM model. For different input features, the penalty parameter C and kernel parameter γ were optimized using SVMcg in the LIBSVM package based on a 5-fold cross-validation test. The optimal parameters C = 8 and γ = 0.044194 were set for the detection of piRNAs and non-piRNAs, while C = 0.35355 and γ = 1.4142 were assigned for the distinguishing samples with the function of instructing target mRNA deadenylation.

Model Construction and Evaluation
The performance of 2L-piRNAPred was evaluated using four measurements derived based on the symbols introduced by Chou in predicting signal peptides. Particularly, its advantages have been analyzed and endorsed by a series of studies published very recently [28][29][30]. The four measurements were given as follows: where TP represents the number of true positives, TN represents the number of true negatives, FP represents the number of false positives and FN represents the number of false negatives.

Results and Discussion
The test process of our model piRNAPred is summarized in Algorithm 1.

Algorithm 1:
The Predictive Algorithm for piRNAPred.
is a set of samples and the number of categories is c. Output: The prediction label of each sample. For each X i , i = 1, 2, . . . , N do Take X i as the test sample, and the others as the training dataset. Extract features. Predict the category. end

Window Size Optimization for Bi-Profile Bayes
For the number of selected nucleotides, both N-and C-terminals affect the prediction performance, so we tried different window sizes to find the optimal prediction performance. The detailed results for the different window sizes are listed in Table S1 (Supplementary Materials). The highest Acc 82.81% was achieved at a window size of 15. Considering the minimum length in the whole training dataset was 24, we selected the maximum window size of 15. If we chose the window size that was much greater, there would be more repeated nucleic acids in the selected N-terminal and C-terminal segments.

Feature Selection for the First Layer
We first picked out the most contributing characteristics with a step of 2 for the other three types of characteristics, except for SNC with a step of 1. We decided which characteristics were retained based on the average results of 5-fold cross-validation test running 5 times for both the first and the second layers. To avoid wordiness, we only described the process of feature selection for the first layer. For the characteristic of SNC, the best prediction performance was achieved at all the four features (Table S2, Supplementary Materials). For the DNC, we sorted the 16 values according to the F-scores, and then selected 2, 4, . . . , 14, 16 features step by step. As listed in Table S3, the best performance achieved at 8 di-nucleotides selected (i.e., CG, UG, GA, AG, UA, CC, UU, and AU), with an Sn of 86.55%, an Sp of 82.05%, an Acc of 84.29%, and an MCC of 0.6870. The same processes were made for TNC and physicochemical properties (See Tables S4 and S5 (Supplementary Materials) for detail description) and the best performance parameters for these features are listed in Table 2. As we can see from Table 2, the DNC features achieved the best performance with an Acc of 84.29% and an MCC of 0.6870; on the contrary, the SNC feature achieved the worst prediction results with an Acc of 62.96% and an MCC of 0.2791. At last, all the selected characteristic vectors were combined together to get the higher results with an Acc of 88.95% and an MCC of 0.7791. Both the Sn and Sp were satisfactory.

Feature Selection for the Second Layer
The same features selection processes for the second layer were performed as those for the first layer (see Tables S6-S10, Supplementary Materials for the detail). As can be seen from Table 3, using the unique type feature of TNC, we achieved the best Acc of 77.7% and MCC of 0.554, and the worst performance again was achieved by SNC with an Acc of 71.4% and an MCC of 0.427. When we combined all the features, the Acc slightly increased to 78.7% and MCC slightly increased to 0.573. However, the Sp was 77.3%, less than 77.60% achieved by using TNC only. From the analysis, the prediction results were not as satisfactory as those for the first layer. Therefore, we had to add another type of features to further improve the prediction performance. Inspired by the k nearest neighbor (KNN) classification algorithm, we used KNN to embody the distribution of neighbor sequences [32]. Unlike in the samples investigated above, the lengths of samples in S + inst and S + non−inst were not the same. To deal with this problem, the similar strategy was adopted to consider 10 nucleotides in N-terminal and C-terminal separately. The algorithm of using KNN to extract features is described in detail in [33]. The KNN features (k = 10) of N-terminal were firstly added to the original combination of BPB(15) + SNC(4) + DNC(10) + TNC(48) + PP (24), which increased the Sp to 79.3%, and Sn to 80.1%. Then, the KNN features (k = 10) of C-terminal were further added, which increased the Sp to 83.6%, and Sn to 84.3%.

Performance of 2L-piRNAPred
One way to prove the superiority of the new model is to compare its prediction performance with that obtained by other existing methods. The compared results are listed in Table 4. For the first layer, our model achieved the best values of Sn and Sp among the four methods. While for the second layer, our model also achieved the best Sn and Sp values if our model piRNAPred was compared with the first two-layer model named 2L-piRNA. We noted that the increase in Sp value was more significant than that in Sn value both for the first and the second layers. Next, we further implemented the jackknife validation test. From the last line in Tables 2 and 3, we showed that our model piRNAPred obtained the MCC value of 0.784 for the first layer and the MCC value of 0.683 for the second layer. To further rank the classification methods for the first layer, a Friedman or a Friedman Aligned Ranks test (number of datasets: <20) with the Holm post-hoc test [34] was performed using [R]. Among all predictors, piRNAPred was significantly better than other approaches according to the MCC measurement. These comparisons illustrate the stability and robustness of our model piRNAPred. The reason why the methodology works well is the comprehensive features we extracted. The features reflected the global and local information of the samples in the dataset. Moreover, the running time for predicting one sample was about one second.

Comparison for Different Classifiers
In order to test whether the prediction performance of piRNAPred can be further improved, we tried different classifiers on the jackknife test. The prediction performances of Random Forest (RF), KNN and Ensemble for Boosting are listed in Table 5. It is illustrated that both for the first layer and the second layer, SVM achieved the best performance. Therefore, our predictor adopted SVM as the final classifier.

Conclusions
In this work, we proposed a computational method for identifying piRNAs with the function of instructing target mRNA deadenylation and piRNAs without the function of instructing target mRNA deadenylation. According to the average outcomes of 5-fold cross-validation test for running 10 times, the combination of BPB(15) + SNC(4) + DNC(8) + TNC(56) + PP(84) achieved the best Sn, Sp, Acc, and MCC values for the first layer. While for the second layer, it was a bit complex. The original combination BPB(15) + SNC(4) + DNC(10) + TNC(48) + PP(24) must contain KNN (k = 10) features in N-terminal and C-terminal to get satisfactory Sn, Sp, Acc, and MCC average results of 5-fold cross-validation test for 10 times. Moreover, the comparison between the jackknife and 5-fold cross-validation outcomes shows the robustness of 2L-piRNAPred. It should be pointed that the predictor was not tested on an independent dataset, and the prediction performance might cause a certain overfitting.
Supplementary Materials: The following are available online at http://www.mdpi.com/2076-3417/8/11/2204/ s1, Table S1: The average 5-fold cross-validation results using split-BPB on different window sizes both on N-and C-terminal for the first layer, Table S2: Features selection process for SNC by F-scores for the first layer, Table S3: Features selection process for DNC by F-scores for the first layer, Table S4: Features selection process for TNC by F-scores for the first layer, Table S5: Features selection process for physicochemical properties by F-scores for the first layer, Table S6: The average 5-fold cross-validation results using split-BPB on different window sizes both on N-and C-terminal for the second layer, Table S7: Features selection process for SNC by F-scores for the second layer, Table S8: Features selection process for DNC by F-scores for the second layer, Table S9: Features selection process for TNC by F-scores for the second layer, Table S10: Features selection process for physicochemical properties by F-scores for the second layer.
Author Contributions: T.L. conceived and designed the experiments; M.G. and Y.C. implemented SVM and created the webserver; T.L. performed the analysis and wrote the paper; R.S. and Q.Y. revised the paper. All authors read and approved the final manuscript.