Assessing the Performances of Protein Function Prediction Algorithms from the Perspectives of Identification Accuracy and False Discovery Rate

The function of a protein is of great interest in the cutting-edge research of biological mechanisms, disease development and drug/target discovery. Besides experimental explorations, a variety of computational methods have been designed to predict protein function. Among these in silico methods, the prediction of BLAST is based on protein sequence similarity, while that of machine learning is also based on the sequence, but without the consideration of their similarity. This unique characteristic of machine learning makes it a good complement to BLAST and many other approaches in predicting the function of remotely relevant proteins and the homologous proteins of distinct function. However, the identification accuracies of these in silico methods and their false discovery rate have not yet been assessed so far, which greatly limits the usage of these algorithms. Herein, a comprehensive comparison of the performances among four popular prediction algorithms (BLAST, SVM, PNN and KNN) was conducted. In particular, the performance of these methods was systematically assessed by four standard statistical indexes based on the independent test datasets of 93 functional protein families defined by UniProtKB keywords. Moreover, the false discovery rates of these algorithms were evaluated by scanning the genomes of four representative model organisms (Homo sapiens, Arabidopsis thaliana, Saccharomyces cerevisiae and Mycobacterium tuberculosis). As a result, the substantially higher sensitivity of SVM and BLAST was observed compared with that of PNN and KNN. However, the machine learning algorithms (PNN, KNN and SVM) were found capable of substantially reducing the false discovery rate (SVM < PNN < KNN). In sum, this study comprehensively assessed the performance of four popular algorithms applied to protein function prediction, which could facilitate the selection of the most appropriate method in the related biomedical research.


Introduction
The function of a protein is of great interest in the current research of biological mechanisms [1], disease development [2] and drug/target discovery [3][4][5][6][7], and a variety of databases is available SE median values of BLAST, SVM, PNN and KNN equaled 90.59%, 90.52%, 84.38% and 76.54%, respectively. As shown in Figure 1B, the majority of the SPs of all algorithms surpassed 98 99.67% and 99.44%, respectively. These results revealed a relatively low level of false discovery rates for all popular functional prediction algorithms. Due to the dominant number of negative samples in the independent test datasets, the statistical difference in ACC was very similar to that of SP ( Figure 1C). The majority of the ACCs of all algorithms surpassed 97%. The ACCs of 93 functional families were between 95 99.61% and 99.16%, respectively. MCC was frequently applied to reflect the stability of the protein function predictor and was considered as one of the most comprehensive Due to the dominant number of negative samples in the independent test datasets, the statistical difference in ACC was very similar to that of SP ( Figure 1C).  99.61% and 99.16%, respectively. MCC was frequently applied to reflect the stability of the protein function predictor and was considered as one of the most comprehensive parameters because of its full consideration of TP, TN, FP and FN. As shown in Figure 1D, the MCC of both SVM and PNN was better than that of BLAST and KNN. The majority of MCCs were over 0.6 and 0.4 for SVM-PNN and BLAST-KNN, respectively. In particular, MCCs of 93 functional families were between 0.15 and 0.99 for SVM, between 0.22 and 0.94 for BLAST, between 0.11 and 0.97 for PNN and between 0.13 and 0.76 for KNN. The median values of MCCs for BLAST, SVM, PNN and KNN equaled 0.62, 0.74, 0.72 and 0.50, respectively. In sum, there were consistently low levels of the false discovery rate among all algorithms as assessed by the metric SP. However, when the positive discovery rates (SEs) and the stability of prediction (MCC) were considered, SVM, PNN and BLAST stood out as more powerful algorithms for protein function prediction.

Evaluating the Statistical Differences in SE and MCC among Four Metrics
For the machine learning algorithms (SVM, PNN and KNN), there was a significant statistical difference in their SEs and MCCs. As shown in Figure 1A, the statistical difference in SEs between SVM and PNN equaled 3.5 × 10 −6 , while that between SVM and KNN was 1.0 × 10 −11 . Moreover, there was a significant statistical difference between PNN and KNN (p-value = 0.01). In particular, the number of families with SEs of >90%, ≤90% and >80% and ≤80% for SVM equaled 49 Figure 1A,D (from KNN to PNN to SVM).
Similar to SVM, BLAST also demonstrated great performances in both SE and MCC. The statistical differences (measured by p-value) in the SE and MCC between BLAST and SVM were 0.88 and 2.0 × 10 −7 , respectively. As demonstrated in Table 1 and Table S1, the SE of BLAST surpassed that of SVM in 51 families, but was worse than that of SVM in 40 families. Moreover, the SEs' median values (90.52% for BLAST and 90.59% for SVM) and mean values (88.92% for BLAST and 89.08% for SVM) indicated that the SE of SVM was slightly better than that of BLAST and significantly better than that of PNN and KNN. Meanwhile, MCC of SVM was higher than that of BLAST in 68 families, but was lower than that of BLAST in 20 families. The MCCs' median values (0.62 for BLAST, 0.74 for SVM) and mean values (0.61 for BLAST, 0.73 for SVM) indicated a slight improvement in prediction stabilities by SVM.
The amphibian defense peptide family (KW-0878; KW, keyword) was the family with the highest SE (99.99%) for SVM, BLAST and KNN, which was known to be a rich source of antimicrobial peptides with a broad spectrum of antimicrobial activities against pathogenic microorganisms [74][75][76]. The superior SE of this family may come from its nature as a conserved element of the defense system of various species [77].

In-Depth Assessment of the False Discovery Rate by Genome Scanning
Genome scanning has been frequently used to evaluate the false discovery rate of function prediction tools [78,79]. To have a comprehensive understanding of methods' false discovery rate, the genomes of four model organisms representing four kingdoms (Homo sapiens from Animalia, Arabidopsis thaliana from Plantae, saccharomyces cerevisiae from Fungi and Mycobacterium tuberculosis from Bacteria) were collected. As demonstrated in Table 2 and Table S2, the genome scanning revealed that the number of proteins in any of those 93 studied families predicted by SVM, PNN and KNN did not exceed 10% of the total number of proteins in the whole genome, and this was the same situation for the majority (82%) of the 93 studied families by BLAST. The higher number of proteins predicted for a certain functional family may indicate a higher false discovery rate [78,79]. For the human genome, the number of proteins identified by SVM was equivalent to or was slightly higher than that of both PNN and KNN, but was significantly lower than that of BLAST (Figure 2a). In addition, the proteins identified by PNN were lower than that of KNN in 11 families and higher in 20 families.

In-Depth Assessment of the False Discovery Rate by Genome Scanning
Genome scanning has been frequently used to evaluate the false discovery rate of function prediction tools [78,79]. To have a comprehensive understanding of methods' false discovery rate, the genomes of four model organisms representing four kingdoms (Homo sapiens from Animalia, Arabidopsis thaliana from Plantae, saccharomyces cerevisiae from Fungi and Mycobacterium tuberculosis from Bacteria) were collected. As demonstrated in Tables 2 and S2, the genome scanning revealed that the number of proteins in any of those 93 studied families predicted by SVM, PNN and KNN did not exceed 10% of the total number of proteins in the whole genome, and this was the same situation for the majority (82%) of the 93 studied families by BLAST. The higher number of proteins predicted for a certain functional family may indicate a higher false discovery rate [78,79]. For the human genome, the number of proteins identified by SVM was equivalent to or was slightly higher than that of both PNN and KNN, but was significantly lower than that of BLAST (Figure 2a). In addition, the proteins identified by PNN were lower than that of KNN in 11 families and higher in 20 families.    (Table S3, not existing in the human genome) were collected for assessing the false discovery rate of each algorithm. For example, the covalent protein-RNA linkage family (KW-0191) contained proteins attaching covalently to the RNA molecules in virus [80], and the storage protein (KW-0758) included the proteins as a source of nutrients for the development or growth of the organism in plants. For these families (Table S3), SVM did not identify any proteins from the human genome, while 0.06% and 0.25% of the proteins in the human genome were falsely assigned by BLAST to the family of covalent protein-RNA linkage protein and storage protein, respectively. As illustrated in Figure 3, several other families (such as plant defense, virulence) also demonstrated a significantly higher false discovery rate by BLAST than that of SVM.  (Table S3, not existing in the human genome) were collected for assessing the false discovery rate of each algorithm. For example, the covalent protein-RNA linkage family (KW-0191) contained proteins attaching covalently to the RNA molecules in virus [80], and the storage protein (KW-0758) included the proteins as a source of nutrients for the development or growth of the organism in plants. For these families (Table S3), SVM did not identify any proteins from the human genome, while 0.06% and 0.25% of the proteins in the human genome were falsely assigned by BLAST to the family of covalent protein-RNA linkage protein and storage protein, respectively. As illustrated in Figure 3, several other families (such as plant defense, virulence) also demonstrated a significantly higher false discovery rate by BLAST than that of SVM. For the other three genomes, their situation was similar to the human genome. Take the Arabidopsis thaliana genome as an example: proteins identified by SVM were equivalent to or slightly higher than those by PNN and KNN in all protein families, but lower than that of BLAST in 77 families, and the number of protein discovered by PNN was lower than that of KNN in 26 families. In summary, the level of false discovery rate (Figure 2b-d) could be ordered as BLAST > SVM > PNN and KNN. These results revealed that BLAST was more prone to generate a false discovery rate than the other three machine learning methods (SVM > PNN ≈ KNN).
As reported [81][82][83][84][85], an open web-server is recognized as useful for constructing effective methods and tools. A variety of web-servers have increasing impacts on medical sciences [86], driving medicinal chemistry to an unprecedented revolution [87], and efforts will be further made to develop web-based services for the performance assessment discussed in this study. For the other three genomes, their situation was similar to the human genome. Take the Arabidopsis thaliana genome as an example: proteins identified by SVM were equivalent to or slightly higher than those by PNN and KNN in all protein families, but lower than that of BLAST in 77 families, and the number of protein discovered by PNN was lower than that of KNN in 26 families. In summary, the level of false discovery rate (Figure 2b-d) could be ordered as BLAST > SVM > PNN and KNN. These results revealed that BLAST was more prone to generate a false discovery rate than the other three machine learning methods (SVM > PNN ≈ KNN).
As reported [81][82][83][84][85], an open web-server is recognized as useful for constructing effective methods and tools. A variety of web-servers have increasing impacts on medical sciences [86], driving medicinal chemistry to an unprecedented revolution [87], and efforts will be further made to develop web-based services for the performance assessment discussed in this study.

Materials and Methods
To construct a valid statistical model for a biology problem based on protein sequences [88][89][90][91][92][93][94][95][96][97], a rule of five steps is needed [98]. Firstly, a valid construction of datasets for both training and testing the model is required. Secondly, an effective conversion of the sequence to the digital feature vector is asked to represent their targeted properties. Thirdly, a powerful statistical method should be designed for the functional prediction. Fourthly, the accuracies of the constructed statistic model should be validated correctly. Fifthly, a web-server based on the constructed model may be further developed for public access. The corresponding methods and steps adopted in this study are provided and described below. Table 1 provides a full list of 93 protein families collected from UniProt [43], and the performances of the popular protein function prediction methods (BLAST, KNN, PNN, SVM) were measured via independent test datasets (the way to generate an independent dataset is shown in the following Section 2.2). These 93 included 12 families of binding molecules (e.g., sodium-, potassium-, SH3-and RNA-binding), 15 ligand families (e.g., plastoquinone ligand, vitamin C ligand and ubiquinone ligand), 58 families defined by Gene Ontology (40 molecular functions and 18 biological processes) and 8 broad families defined by UniProt [43]. All families were contained in the keyword categories of UniProt, and the majority (82.7%) of these 93 families were able to be mapped to GO terms (Table 1). Protein entries that have not been manually annotated and reviewed by UniProtKB curators in a keyword category were not considered for analysis in this study. As a result, 107~49,517 protein-entries from 93 families were collected.

Construction of the Training and Testing Datasets
The independent test dataset was frequently constructed to evaluate the performances of protein function predictors in recent years [99][100][101][102][103][104]. To construct a valid set of data for building the predictor of each family, the datasets of the training, testing and independent test were generated by a strictly defined process after the data collection described in Section 2.1. Firstly, all proteins of different sequences in a specific family are assigned randomly with a number, which is within the range of the total number of proteins in that family. Secondly, these sequences in each protein function family were sequentially selected based on the number assigned and then iteratively added to the training, testing and independent test datasets. Samples in these datasets are all known as the positive samples. Thirdly, the Pfam families [16] of the proteins of a certain functional family were retrieved from the Pfam database [16] for generating negative samples. The Pfam family with protein(s) of this functional family was defined as the "positive" one, and the remaining families were grouped into the "negative" ones. Finally, 3 representatives were randomly picked out of the negative families and sequentially added to the training, testing and independent test datasets, and samples in these datasets are thus known as the negative samples. It is necessary to emphasize that there was no overlap among the datasets of the training, testing and independent test [60,61].
To assess the false discovery rate among algorithms, the genomes of four model organisms representing four kingdoms (Homo sapiens from Animalia, Arabidopsis thaliana from Plantae, Saccharomyces cerevisiae from Fungi and Mycobacterium tuberculosis from Bacteria) were collected from UniProt. The protein entries without any manual annotation and review by the UniProtKB curators were not taken into consideration. In total, 20,183, 15,169, 6721 and 2166 protein sequences in FASTA format were collected for human, Arabidopsis thaliana, Saccharomyces cerevisiae and Mycobacterium tuberculosis, respectively.

Feature Vectors Used for Representing the Protein Sequence
The conversion of the protein sequence into the digital feature vector was conducted based on properties of each residue within that protein. These properties include: (1) charge; (2) polarizability; (3) polarity; (4) surface tension; (5) amino acid (AA) composition; (6) van der Waals volume via normalizing; (7) hydrophobicity; (8) solvent accessibility; and (9) protein secondary structure [36,[105][106][107]. Then, 3 features were applied to describe each property [36]. These features contained: (a) composition (No. of AAs of a particular property over the total No. of AAs; (b) transition (the percentage of AAs with a certain property was followed by AAs with a different property); and (c) distribution (the sequence lengths within which the first, one fourth, half, three-quarters and all of the AAs of specific property were localized). The detailed procedure for generating the feature vector from the sequence was described in previous publications [36,65]. These features have already been successfully applied to facilitate the prediction of enzyme functional [108] and structural classes [107].

Functional Prediction of Protein Constructed by Machine Learning
To construct the prediction model, the parameters of machine learning methods were optimized using the testing dataset for each training process. Once suitable parameters were discovered, a new training set was constructed by combining the original training and testing datasets, and the corresponding parameters were directly accepted for training a new model. To assess the performance of the constructed models and detect possible over-fitting, the independent test set was further applied. It is necessary to emphasize that all duplicates in the protein sequence were removed during datasets' construction.

Construction of Protein Functional Prediction Model Based on Sequence Similarity
Sequence similarity was assessed by the NCBI Protein-Protein BLAST (Version 2.6.0+) [53,54]. Firstly, the combined training and testing dataset was adopted to form the BLAST database, and the sequences in the independent test dataset were used as queries. The BLAST E-value and percentage sequence identity were usually applied to represent the level of similarity between sequences [109]. The functional variation between proteins was reported to be rare when their sequence identity was more than 40% [110,111]. Thus, an E-value of 0.001 and a sequence identity of 40% were adopted as the cutoffs in this study to assess the functional conservation of BLAST hits.

Assessing the Identification Accuracies of the Studied Methods
The performance of protein function prediction algorithms was systematically assessed by four popular metrics, sensitivity (SE), specificity (SP), accuracy (ACC) and Matthews correlation coefficient (MCC), based on the independent test datasets generated from the 93 studied families (Supplementary  Materials Table S1). All 4 metrics were widely used in assessing the performance of protein function predictors [112][113][114][115][116][117]. In particular, SE is defined by the percentage of true positive samples correctly identified as "positive" [118,119] (shown in Equation (1)): SP indicates the proportion of true negative samples that were correctly predicted as "negative" [118,119] (in Equation (2)): ACC refers to the number of true samples (positive plus negative) divided by the number of all samples studied (shown in Equation (3)): The MCC was an important metric reflecting the stability of a protein function predictor, which described the correlation between a predictive value and an actual value [118,119]. It has been considered as one of the most comprehensive parameters in any category of predictors due to its full consideration of all four results. In particular, the MCC could be calculated by Equation 4: In particular, those four results were TP (No. of true positive samples), TN (No. of true negative samples), FP (No. of false positive samples) and FN (No. of false negative samples) [118,119]. It is very important to emphasize that these four metrics are applicable to the single-class situations (each protein is grouped into just one family). For the multi-class situations frequently observed in complicated biological networks [81][82][83][84] and biomedical researches [84,89,117], different metrics should be defined [120].

The Rates of False Discovery of the In Silico Methods Studied Here
As reported, genome scanning was a comprehensive method to evaluate the capacity of protein functional prediction tools in identifying and classifying protein families [78,79]. In this paper, an evaluation of the false discovery rate of the studied protein function predictors was performed by scanning the genomes of 4 model organisms representing 4 kingdoms (Homo sapiens from Animalia, Arabidopsis thaliana from Plantae, Saccharomyces cerevisiae from Fungi and Mycobacterium tuberculosis from Bacteria). The false discovery rates were assessed by reconstructing the prediction models of those in silico algorithms. In particular, the sequences of proteins in a certain functional family were all put into the reference database for BLAST scanning and were also used to reconstruct the machine learning models using the optimized parameters obtained in Section 3.4. In reality, the total amount of proteins not belonging to a certain family should be much larger than that of proteins in that family. Therefore, a tiny reduction in the value of SP may lead to a significant discovery of false positive hits, which reminded us to use SP as an effective indicator when evaluating the model's false discovery rates.

Conclusions
This study discovered substantially higher sensitivity (SP) and stability (MCC) of BLAST and SVM than that of PNN and KNN. However, the machine learning algorithms (PNN, KNN and SVM) were found capable of significantly reducing the false discovery rate (with PNN and KNN performed the best). In conclusion, this study comprehensively assessed the performances of popular algorithms applied to protein function prediction, which could facilitate the selection of the appropriate method in the related biomedical research.

Conflicts of Interest:
The authors declare no conflict of interest.