Protein–protein interaction (PPI) is one of the central research topics in experimental and computational biology. Rapid reconstruction of genome-scale PPI networks is instrumental not only in understanding cellular processes and disease pathogenesis but also in developing therapeutic drugs. Recent years have witnessed the rapid accumulation of PPI data in various databases, e.g., HPRD [1], BioGrid [2], Reactome [3], KEGG [4], IntAct [5], HitPredict [6], STRING [7], DIP [8], and BIND [9]. These databases provide abundant information for further experimental or theoretical analysis of the underlying PPI, signaling, or other molecular mechanisms [10]. The PPI experimental techniques, including X-ray crystallography, yeast two-hybrid, mass spectrometry, and affinity purification, are in general credible. However, these techniques also exhibit a high false positive rate and low agreement with each other [12]. To date, various computational methods have been proposed to eliminate these discrepancies and enhance the known PPI databases. Although much effort has been devoted to the computational reconstruction of intra-species [13] and inter-species [19] PPI networks, several major issues still need to be properly addressed.
Data quality is the first critical factor that affects the performance of computational methods. Because of the limitations of experimental techniques [12], the major PPI databases [1] more or less contain a certain level of noise. Furthermore, only a small portion of the data are experimentally verified physical PPIs [1]. For instance, the PPI database STRING [7] has collected massive PPI networks of 2031 species, but the data have been reported to be of low quality, as the majority have not been experimentally or computationally validated [23]. In general, PPIs verified by multiple experimental techniques and computational methods are assumed to be of high quality. Meanwhile, how to generate high-quality negative training data in order to train a less biased model should also attract more attention, because the experimentally verified negative dataset is very small. Unfortunately, this problem has not been satisfactorily addressed yet.
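As an illustration of the quality heuristic above, the sketch below keeps only the interactions supported by at least two independent evidence sources; the input format, evidence labels, and threshold are illustrative assumptions, not the schema of any particular database.

```python
# Sketch: keep only PPIs supported by multiple independent evidence
# sources, under the assumption that multiply-verified pairs are of
# higher quality. The input format is hypothetical.

def filter_high_quality(ppis, min_evidence=2):
    """ppis maps frozenset({protein_a, protein_b}) to a set of
    evidence labels (e.g., {'Y2H', 'AP-MS'}); return the pairs
    backed by at least `min_evidence` distinct sources."""
    return {pair for pair, sources in ppis.items()
            if len(sources) >= min_evidence}

ppis = {
    frozenset({"TP53", "MDM2"}): {"Y2H", "AP-MS", "X-ray"},
    frozenset({"YBX2", "ACTB"}): {"Y2H"},
}
high_quality = filter_high_quality(ppis)  # only the multiply-verified pair survives
```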
As regards inter-species or pathogen–host protein interactions, the issue of data scarcity is even more serious. In such cases, advanced machine learning approaches such as transfer learning become the first choice to augment training data or borrow useful information from auxiliary data [19]. However, a large genomic gap between the source species and the target species potentially results in negative knowledge transfer that adversely affects model performance [19].
Feature construction is the second critical factor, as it determines whether a predictive model can generalize well to unseen examples or patterns. Many features have been used to predict PPIs, e.g., sequence k-mers, sequence similarity, binding motifs, domain co-occurrence, gene expression profiles, gene co-expression, structural similarity, post-translational modifications, and PPI network topological properties. Among these features, gene ontology (GO) has been reported to be the most discriminative [26]. As a cheap and easily available feature, sequence k-mers are frequently used to predict PPIs [13]. However, some studies claim that simple sequence k-mers cannot predict PPIs [14]. In response to this controversy, Park et al. [15] argue that the failure of computational modeling results from the sampling ratio of negative data rather than from the sequence k-mer feature itself.
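To make the k-mer feature concrete, a minimal sketch: count overlapping amino-acid 3-mers, normalize the counts, and represent a protein pair by concatenating the two vectors. The plain 20-letter alphabet and simple concatenation are simplifying assumptions; published pipelines often first reduce the alphabet into physicochemical groups to shrink the feature space.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_vector(seq, k=3, alphabet=AMINO_ACIDS):
    """Normalized counts of overlapping k-mers in a fixed feature order
    (20^k dimensions for the full amino-acid alphabet)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts.values()), 1)
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    return [counts[m] / total for m in vocab]

def pair_features(seq_a, seq_b, k=3):
    """Represent a protein pair by concatenating both k-mer vectors."""
    return kmer_vector(seq_a, k) + kmer_vector(seq_b, k)
```

With k = 3 this yields 8000 dimensions per protein and 16,000 per pair, which is one reason alphabet reduction is popular in practice.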
Negative data sampling is the third critical factor in computational modeling. To computational biologists, negative data are as important as positive data, but negative observations are often neglected or discarded by experimental biologists without being collected in public repositories. As a workaround, computational biologists generally resort to random sampling to generate negative data. However, without experimental validation, randomly sampled data lack supporting biological evidence [13]. The evidence supporting random sampling merely comes from the theory of complex networks, which assumes PPI networks to be scale-free rather than random [29]. Under this assumption, proteins do not interact randomly, and thus two randomly sampled proteins do not interact with high probability. Though less biased, random sampling runs a high risk of sampling false negatives and is hard to interpret biologically. To reduce the false negative rate, Ben-Hur et al. [30] selected protein pairs that are not subcellularly co-localized as negative examples. The obtained negative data are more reliable but less representative, because this method does not cover the subcellularly co-localized protein pairs that do not interact, which are more useful for revealing PPI mechanisms. Blohm et al. [31] built the Negatome database to collect experimentally verified or structurally curated non-interacting protein pairs. Unfortunately, Negatome is very small. Trabuco et al. [32] inferred the bait–prey pairs not experimentally observed via the yeast two-hybrid technique as negative, i.e., non-interacting. This strategy is clever but only applicable to two-hybrid experiments. Furthermore, a given pair of bait and prey may not have been simultaneously assayed on the same experimental platform, so the experimentally unobserved bait–prey pairs are not necessarily negative.
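The random sampling strategy discussed above can be sketched as follows: draw protein pairs uniformly from the proteome and reject any pair already in the known positive set. Under the scale-free assumption such pairs are non-interacting with high probability, but, as noted above, a fraction of false negatives is unavoidable without biological validation.

```python
import random

def sample_random_negatives(proteins, positives, n, seed=0):
    """Uniformly sample `n` distinct protein pairs absent from the
    known positive set. `proteins` is a list of identifiers and
    `positives` a set of frozensets. Absence of evidence is not
    evidence of non-interaction, so false negatives may slip in."""
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n:
        a, b = rng.sample(proteins, 2)
        pair = frozenset((a, b))
        if pair not in positives:
            negatives.add(pair)
    return negatives
```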
Model evaluation is the fourth critical concern in computational or machine learning modeling. K-fold cross validation is generally a necessary step to evaluate model performance. Nevertheless, cross validation is not sufficient to check model bias, because exhaustive parameter tuning could in practice produce a seemingly perfect and unbiased model that merely fits the training data. What we are really concerned about is how well the model generalizes to unseen examples. For this reason, independent testing is also an indispensable step in model evaluation. However, existing methods seldom conduct independent tests, or are merely evaluated on positive independent test data [13]. The emphasis on evaluation with negative independent test data is not because the negative results matter biologically but because we need to know how much the model is potentially biased. If a model performs very well on the positive independent test data but rather poorly on the negative independent test data, it runs a high risk of bias toward the positive class, so that its results probably contain a high level of false positive predictions.
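The evaluation protocol argued for here can be sketched as reporting accuracy separately on positive-only and negative-only independent test sets, so that bias toward the positive class becomes visible; the predictor and test data below are placeholders.

```python
def class_wise_accuracy(predict, pos_pairs, neg_pairs):
    """Evaluate a trained predictor separately on positive and
    negative independent test sets. `predict` maps a pair to 0/1.
    A large gap between the two scores signals class bias."""
    pos_acc = sum(predict(p) == 1 for p in pos_pairs) / len(pos_pairs)
    neg_acc = sum(predict(p) == 0 for p in neg_pairs) / len(neg_pairs)
    return pos_acc, neg_acc

# A degenerate model that always predicts "interact" looks perfect on
# positives yet fails completely on negatives -- exactly the bias
# that positive-only evaluation cannot detect.
always_yes = lambda pair: 1
pos_acc, neg_acc = class_wise_accuracy(always_yes, ["p1", "p2"], ["n1", "n2"])
```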
In this study, we use the very limited manually curated negative data from Negatome as seeds to infer more negative data for computational modeling. The concept of interolog [33] assumes that the paralogs or orthologs of two interacting proteins also interact. Based on the notion of orthologous or paralogous structure conservation, we assume that the paralogs or orthologs of two non-interacting proteins also do not interact with high probability, and we term this assumption Neglog. To reduce the risk of bias toward the non-interactions from Negatome [31], we use the less biased random sampling method to enlarge the coverage of negative data. The positive training data are taken from the physical PPIs in HPRD [1] and BioGrid [2], and the positive independent test data are taken from Reactome [3], IntAct [5], and HitPredict [6]. The negative independent test data are taken from Negatome [31], random sampling, and Neglogs. This comprehensive evaluation is designed to estimate and reduce the risk of model bias. We adopt gene ontology (GO) terms as features and conduct homolog knowledge transfer to tackle GO sparsity. L2-regularized logistic regression is used as the predictive model to counteract noise from homologs and to train fast on large datasets. Lastly, we validate the PPIs in STRING [7] using the proposed model and merge the validated PPIs into a comprehensive human physical PPI network for further research.
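The Neglog inference at the core of this design can be sketched as mapping each curated non-interacting pair through an ortholog/paralog lookup; the `homologs` table below is a hypothetical stand-in for a real homology resource.

```python
def infer_neglogs(negatome_pairs, homologs):
    """Neglog assumption: if (a, b) do not interact, then homologs of
    a paired with homologs of b also do not interact with high
    probability. `homologs` maps a protein to a list containing its
    orthologs/paralogs (a hypothetical lookup; a real pipeline would
    query a homology database)."""
    neglogs = set()
    for a, b in negatome_pairs:
        for ha in homologs.get(a, [a]):
            for hb in homologs.get(b, [b]):
                pair = frozenset((ha, hb))
                if len(pair) == 2:  # skip self-pairs
                    neglogs.add(pair)
    return neglogs
```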
Protein–protein interaction (PPI) networks play important roles in inferring signaling pathways, discovering network hallmarks of disease, screening drug targets, and estimating pharmacological risks. Recent years have witnessed much progress in the computational reconstruction of PPI networks. Nevertheless, the existing computational methods still leave several major issues to be properly addressed. In this study, we comprehensively review the existing methods from the aspects of (1) data quality, (2) feature construction, (3) negative data sampling, and (4) model evaluation. Based on these issues, we focus on how to sample credible and biologically interpretable negative data for genome-scale reconstruction of human protein–protein interaction networks.
Existing methods as a whole pay more attention to feature extraction or to developing/applying novel algorithms (e.g., deep learning). Although predictive performance has been greatly improved, the improvement brought about by novel methods is very limited. The issue of negative data sampling potentially affects model performance heavily, but it has received little attention in recent years. Random sampling is less biased and easily hits protein pairs that do not interact in scale-free PPI networks; as such, it is still the major solution adopted by existing methods. However, random sampling is by nature hard to interpret biologically and potentially yields a certain level of false negatives. To reduce the risk of false negative sampling, subcellularly restricted random sampling draws negative data from the space of protein pairs that are not subcellularly co-localized. Though more credible, this method is highly biased because it does not cover the protein pairs that are subcellularly co-localized but do not interact.
In this study, we propose the Neglog assumption to exploit and augment the available experimental negative data from Negatome [31] for genome-scale reconstruction of human protein–protein interaction networks. The assumption is based on structural conservation between a protein and its orthologous/paralogous proteins. Two proteins that do not interact are potentially mismatched spatiotemporally, and this mismatch probably also holds between their orthologs/paralogs. Because of the limited coverage of genes in Negatome, it is hard to validate the assumption via Negatome itself, so we resort to protein structure alignment for analysis. Protein structure alignments show that the orthologs/paralogs of two interacting proteins are more structurally mismatched than a randomly sampled protein pair. Compared with random sampling, the Neglog method is easily interpretable and more credible. Compared with orthologs, paralogs develop new functions and have more varied structures, so the paralogs of two non-interacting proteins could still develop a chance to interact. This noise can be counteracted to some extent by the L2-regularized logistic regression model.
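As a minimal sketch of the role the L2 penalty plays, the following pure-Python logistic regression shrinks weights toward zero so that individual noisy features (such as GO terms transferred from homologs) cannot dominate the decision; a production model would use an optimized library rather than this batch gradient descent.

```python
import math

def train_l2_logreg(X, y, lam=0.01, lr=0.5, epochs=500):
    """Minimal L2-regularized logistic regression trained by batch
    gradient descent: minimize mean log-loss + (lam/2)*||w||^2.
    The penalty gradient lam*w shrinks weights on noisy features."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wj for wj in w], 0.0  # start from penalty gradient
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # p - y
            for j in range(d):
                gw[j] += err * xi[j] / n
            gb += err / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict(w, b, x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0
```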
As the Negatome database covers only a very small number of genes/proteins, the Neglog set derived from this seed still covers very limited genes. As a result, negative data drawn solely from Neglogs run a high risk of training a biased model. To our knowledge, there are no other sources that manually curate a large number of non-interacting human genes. To enlarge the negative class space and reduce bias, the Neglog method also uses random sampling to obtain a certain ratio of the negative data. As the ratio of randomly sampled negative data is very low (equal to 0.3), the risk of introducing false negative data is much reduced. Reducing the positive training data to the size of the negative data in Negatome and training an ensemble of individual classifiers may be an alternative solution, which could reduce the bias of each individual classifier. However, as all the individual classifiers would be trained on the same negative data, their diversity is very limited. As a result, the final ensemble classifier trained in the small negative-data space is still potentially highly biased.
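The mixing scheme can be sketched as follows, interpreting the 0.3 ratio as the fraction of randomly sampled pairs in the final negative set (an assumption about the exact composition rule):

```python
import random

def build_negative_set(neglog_pairs, proteins, positives,
                       random_ratio=0.3, seed=0):
    """Top up Neglog-derived negatives with randomly sampled pairs so
    that random pairs make up `random_ratio` of the final set, which
    enlarges coverage while limiting false negative risk."""
    rng = random.Random(seed)
    neglog_pairs = set(neglog_pairs)
    # number of random pairs so that random / total == random_ratio
    n_random = round(len(neglog_pairs) * random_ratio / (1 - random_ratio))
    negatives = set(neglog_pairs)
    while len(negatives) < len(neglog_pairs) + n_random:
        a, b = rng.sample(proteins, 2)
        pair = frozenset((a, b))
        if pair not in positives and pair not in neglog_pairs:
            negatives.add(pair)
    return negatives
```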
Cross validation and independent tests show that the Neglog method achieves much better and less biased performance than random sampling. In addition, the Neglog method outperforms existing methods in terms of ROC-AUC scores. Computational results show that the ROC-AUC score alone is not sufficient to estimate potential model bias, and independent testing, especially on the negative class, is indispensable for model evaluation. Lastly, we use the trained model to validate the PPIs from STRING [7], in which most PPIs have not been experimentally or computationally validated. GO enrichment analyses of the predicted partners of protein YBX2 to some extent validate the rationality of the predictions.