Benchmark Evaluation of Protein–Protein Interaction Prediction Algorithms

Protein–protein interactions (PPIs) perform various functions and regulate processes throughout cells. Knowledge of the full network of PPIs is vital to biomedical research, but most of the PPIs are still unknown. As it is infeasible to discover all of them experimentally due to technical and resource limitations, computational prediction of PPIs is essential and accurately assessing the performance of algorithms is required before further application or translation. However, many published methods compose their evaluation datasets incorrectly, using a higher proportion of positive class data than occuring naturally, leading to exaggerated performance. We re-implemented various published algorithms and evaluated them on datasets with realistic data compositions and found that their performance is overstated in original publications; with several methods outperformed by our control models built on ‘illogical’ and random number features. We conclude that these methods are influenced by an over-characterization of some proteins in the literature and due to scale-free nature of PPI network and that they fail when tested on all possible protein pairs. Additionally, we found that sequence-only-based algorithms performed worse than those that employ functional and expression features. We present a benchmark evaluation of many published algorithms for PPI prediction. The source code of our implementations and the benchmark datasets created here are made available in open source.


Introduction
Proteins are large macromolecules that perform a variety of functions necessary for an organism. Biological processes are driven by multiple proteins creating a protein complex or a signaling cascade through biophysical interactions. Understanding which proteins interact with each other is an important first step towards understanding molecular mechanisms of biological functions and of diseases, and in the design of therapeutics. Knowledge of binary protein-protein interactions (PPIs) is also useful in conducting further research on blocking or enhancing different interactions to control biological processes and functions, and to design therapeutics. However, given that human genome contains approximately 20,000 protein-coding genes and therefore nearly 200 million unique protein pairs even without accounting for multiple isoforms of genes, resolving all pairs of interacting protein is not a trivial task.
Many experimental biology methods have been devised to find pairs of interacting proteins. Benchwork experiments like co-immunoprecipitation (co-IP) are resource-intensive and low-throughput, often leading to false negatives depending on experimental conditions, for example due to ineffective antibodies or due to the transient nature of PPIs [1]. In contrast, high throughput screens such as yeast two-hybrid (Y2H) and tandem affinity purification mass-spectrometry (TAP-MS), and more recently a sequencing approach called PROPER-seq [2] attempt to capture tens-to-hundreds of thousands of PPIs in a single experiment. In the past, statistical estimates suggested that Y2H had 25%-45% false positives, and difficulties in detecting PPIs for certain subsets of proteins such as membrane proteins [3]. In later studies, it was shown that extensive filtering techniques can be used to decrease sampled protein pairs (excluding the pairs that are known to interact) in place of negativelabel instances. Randomly chosen pairs are utilized due to a lack of known, non-interacting pairs from experimental methods, as a failure to identify a PPI in a biological experiment only suggests that a PPI was not observed under those experimental conditions and does not prove that the two proteins never interact. Therefore, failed experiments do not create negative-labeled data. This lack of ability to experimentally determine non-interacting protein pairs prevents the creation of a traditional gold standard negative dataset. However, taking into account that only 0.325-1.5% of protein pairs have PPIs, computational methods utilize randomly sampled protein pairs as negatively labeled data, with the majority of randomly selected pairs likely being truly non-interacting pairs.
Although it is estimated that there exist about 500,000 to 3 million PPIs out of a total of 200 million protein pairs in humans (0.325-1.5%), most algorithms are trained and tested on datasets containing 50% positive label data. This simple approach for assessing an individual algorithm's real-world application is questionable. Additionally, the protein interactome is believed to be a small-world network [31]. Such networks have hubs, i.e., nodes that connect directly to many other nodes, or in this case, proteins with a large number of PPIs. For example, The Biological General Repository for Interaction Datasets (BioGRID) contains~125,000 unique, non-self-interactions among~14,500 proteins (represented by their gene names/symbols and not distinguishing isoforms; this aspect that proteins are referred to by their genes is applicable throughout this paper except where explicitly mentioned otherwise) [32]. This averages to around 17 interactions per protein; however, currently 360 proteins have more than a 100 PPIs each, with one protein, APP, involved in over 2000 PPIs. On the other hand, over 9000 proteins have 10 or fewer PPIs each. While positive class PPI data are from a scale-free network, randomly paired data (negative class) are sampled uniformly, and thus have a significantly different distribution of proteins. This can lead to biasing problems in machine learning, where a single protein appears far more times in the positive dataset, allowing machine learning algorithms to simply predict pairs containing such proteins to be of positive labels, generating a high accuracy on the test datasets from corresponding distributions. Past experiments in this field have suggested similar findings, with works by Yu as well as Park and Marcotte suggesting that many prediction algorithms primarily predict bias in underlying datasets [33,34].
In addition to dataset creation, evaluation metrics also play a role in correctly assessing whether an algorithm can make good predictions on real data. In evaluating classification algorithms for rare category data, accuracy and AUC are not suitable methods, and precision-recall (P-R) curves are recommended [35]. In biological and clinical domains, where the natural distribution of class labels may be highly imbalanced, a P-R curve provides more reliable information and distinguishes models that are practical for real world applications, whereas AUC may misleadingly convey more impressive accuracy than are realistically achievable on the rare category that is of interest (e.g., an interacting protein-pair or the presence of a disease) [36]. However, most published works have used AUC/accuracy metrics. Both the aspects of data composition and evaluation metrics were taken into account by Qi et al. in the seminal paper on proteome-wide human PPI prediction (i.e., when a protein is compared against all the proteins) for membrane receptor proteins [29]. They set their test data at 1:1000 positive to negative instances and evaluated with a P-R curve. Our own method, called High-Precision Protein-Protein Interaction Prediction (HiPPIP) for human interactome prediction also was evaluated on the lines of Qi et al.'s, both in terms of rareness of positive class and using a P-R curve [37]. Additionally, the HiPPIP method was also evaluated for its ability to predict proteome-wide PPIs, by computing the cumulative number of true-positives versus the increasing number of false-positives (note that in this scenario of proteome-wide evaluation (i.e., where a hub protein is paired with every other protein in the human proteome), the unlabeled data may not all be false-positives, and may indeed me true positives that are currently unknown) on hub proteins with the reasonable assumption that many, if not all, of their PPIs are known; this metric was used as a reasonable approximation of the number of accurate predictions produced by HiPPIP.
In this study, we evaluate different PPI prediction algorithms to determine how well they perform on realistically proportioned datasets. We first implement various algorithms, using similar feature sets and classification models to those described in the original publications. Where possible, we download the datasets used by these algorithms to test whether our implementations produce similar results. Next, we created six new evaluation datasets containing three proportions of positive label instances (50%, 10% and 0.3%) and two sampling methods (sampled randomly from the full list of proteins as is commonly done in literature, henceforth known as Full), and using a held out set of proteins for evaluation (referred to as Held-Out, known as C3 data in Park and Marcotte's prior work [33]). Finally, we also created control models to compare these predictors against, by using illogical features (e.g., frequency of the proteins in the dataset or random vectors to represent the proteins) with simple, naïve classifiers. These types of predictions do not consider the pairwise compatibility of two proteins, and simply predict PPIs based on the distinct distributions of proteins in positive and negative classes in the datasets. This allows us to determine how well the algorithms perform relative to these illogical features.

Data
Datasets used to validate the implementation of various sequence-based interaction predictors were downloaded from previous works. The datasets were adjusted from their original formats where necessary to include only proteins with a sequence length greater than 30 amino acids, as several algorithms have this requirement. Datasets which provided only protein names instead of the sequences had protein sequence data downloaded from UniProt (21 June 2021, 25 June 2021, 6 August 2021, 17 August 2021), with protein pairs removed if the names no longer mapped to valid proteins.
For prediction methods that employ protein annotations, the data were not downloaded from original sources for validation because most methods do not provide their datasets, and the availability of protein annotations such as domains, functions, expression and localizations increases over time. Some annotation databases are no longer available (e.g., InterDom), while others have increased their number of annotations since original publications; thus, we cannot try to match our results from these methods against the original publications. We treat our implementations of these as methods to be inspired by the original publications rather than exact reproductions. To create the necessary features, the Gene Ontology was download from the Gene Ontology Consortium (28 March 2021), and GO annotations of proteins were downloaded from the Gene Ontology Annotation Database (16 February 2021) [38,39]. Domain features were downloaded from InterPro (27 March 2021), Prosite (17 September 2021), and Pfam (17 September 2021) to compute domain and family-related features [40][41][42]. When annotations are provided using UniProt Identifiers, a union of all features assigned to a given Entrez Gene ID was used as features for a given protein [43].
Known PPI data were downloaded from the BioGRID (v4.4.198, compiled 25 May 2021). Only direct biophysical protein interactions, encoded by Proteomics Standards Initiative-Molecular Interaction (PSI-MI) ontology identifier MI:0407 and its descendants, were included. After filtering out pairs containing non-human proteins, self-interactions, and protein-RNA bindings, the Entrez Gene IDs provided by BioGRID were then mapped to UniProt IDs [43,44]. To account for Entrez Gene IDs mapping to multiple UniProt identifiers, the longest amino acid sequence was assigned to the gene. To ensure compliance with various sequence-based algorithms, protein sequences less than or equal to 30 amino acids were removed. A total of 123,626 unique interactions among 14,678 proteins remained as positively labeled protein pairs (i.e., PPIs). A total of 19,115 proteins with UniProt ID to Entrez Gene ID mapping that met the minimum sequence length were used to represent the full set of human proteins, with random pairs drawn from this dataset to use as negative labeled protein pairs (i.e., non-interacting protein pairs) (see Supplementary File S1 for details).
Training and testing datasets were created using two methodologies: In the first, a random set of non-overlapping interacting pairs and non-interacting pairs are sampled for training and testing from the full set of all possible proteins (Full). This method is widely used in the literature. We do not take any other precautions when randomly sampling non-interacting pairs, such as selecting proteins in different subcellular locations, as this could induce a bias towards the type of protein pairs chosen and creates a separate bias by limiting the number of proteins available to use in the non-interacting dataset. The second methodology is based on the work of Park and Marcotte, where some proteins are held out and used exclusively in the test dataset (Held Out) [33]. For this dataset creation method, we separated the proteins into 6 equal sized bins and hold out either 1 or 2 bins to create the test data. No pair using a protein from these bins is used in training data. Training data is created from bins excluding those held-out to create test data.
Multiple datasets were created for testing with different percentage compositions of positively labeled protein pairs (known interactions). Specifically, training sets were created with 50% and 20% positive data, while test sets were created with 50%, 10%, and 0.3% positive data. Models generated to test both the 10% and 0.3% positive test datasets used the 20% positive data for training, while 50% positive test datasets utilized 50% positive train datasets for training. For data based on random pairs, the five 50% positive train and test sets were created using stratified cross validation on a single set of 125,000 protein pairs. Data for the 20% training set, and 10% and 0.3% positive test sets, were chosen randomly from all pairs while ensuring that no test data overlapped with the training data. For the second method, namely by holding out proteins used in test data, training and test sets of the same ratios as mentioned earlier were created for each of the 21 combinations of holding out 1 or 2 of the 6 bins. Again, the 10% positive and 0.3% positive test sets shared the same training sets with 20% positive data.

Feature Computation
We replicated various sequence-based features as described in prior publications. Some, such as auto covariance and conjoint triads are used in multiple papers we reimplement, whereas others such as the Weighted-Skip Sequential Conjoint Triad method are used in only a single paper utilized in our recreations. While most methods listed can be computed from the sequences of the proteins in the pair, two of the listed methods required usage of other proteins' sequence data. Skip-gram models compute the similarity of words, or in this case, amino acids, by training a neural network with neighboring words to create word embeddings [45]. For protein sequence features, we trained the skip-gram models using all protein sequences related to the organism under study. The second method, PSSMs (i.e., protein sequence similarity matrices), were computed using PSI-BLAST over UniProt's SwissProt protein sequence database, using 3 iterations and a significance value of 0.001 [46]. If no hits were found, sequences were encoded using BLOSUM62 values. For features relying on physicochemical properties (e.g., hydrophobicity, hydrophilicity, polarity, charge, solvent-accessible surface area, etc.), we employed commonly used amino acid property values unless otherwise specified (see Supplementary File S2 for details). Finally, where possible, we normalized values produced by different feature computations to map to a low range of values, preferably between 0 and 1, or −1 and 1 (see Supplementary File S2 for details).
For annotation-based features, we primarily focused on pairs of domains and GO annotations appearing in known interactions, as well as semantic similarity calculations on GO annotations. As proteins commonly have multiple GO annotations and domains, many algorithms compute these features as a grid of scores between all possible pairs of annotations between a pair of proteins and convert the grid to a single score through an aggregation, such as average, max, sum, product, or the best matching average (see Supplementary File S3 for details). All protein pairs where the necessary features did not exist were scored as zero to ensure that we could train and test using the same datasets used for sequence-based prediction testing.

Model Construction
Classification models were constructed using Python (version 3.8.5) libraries Scikit-Learn, PyTorch, LightGBM, and ThunderSVM, or an independently created Rotation Forest library [47][48][49][50][51]. While methods in the selected publications have used other programming languages and/or libraries, these should be capable of reproducing similar results.
Models for sequence-based features were constructed to match published methods as closely as possible. For some models, such as support vector machines (SVMs), which are highly sensitive to different scales between features, features were scaled using min-max or standard scaling. Learning rates for neural networks implemented in PyTorch were adjusted to obtain similar results as reported in prior works, and all models were fitted with a decaying learning rate to ensure that the model would conclude quickly when learning had plateaued, ensuring quick running times on larger datasets. For publications that do not clearly state hyperparameters or other exact details used to construct models, we utilized settings that seemed similar to concepts in the referenced publication or other related publications, executed quickly, and performed comparably to reported results in the publications on their datasets. A total of 36 algorithms utilizing random forests, rotation forests, boosting, neural networks, and support vector machines were implemented based on previously published literature (see Supplementary Files S4 and S5 for details related to hyperparameters and any changes between previous work and our implementations).
For annotation-based features, a total of 6 models were created using domain and GO data. Two of these models, based on the works of Chen et al. and Maetschke et al., use large binary/ternary features to represent protein pairs. Chen et al. utilizes a pairwise sum of binary features containing which domains existed in each protein [26]. Maetschke et al. generated binary feature vectors from GO annotations, with non-zero values representing all annotations up to the lowest common ancestor (LCA) common to both proteins [28]. The final 4 annotation-based models used significantly smaller feature representations for each protein pair. Guo et al. and Zhang et al. each developed models relying on semantic similarity using maximum aggregation. Guo et al. relied on a single metric per ontology and a logistic regression model, while Zhang used 10 different semantic similarity measurements to train an SVM [24,25]. Zhang et al. used a score based on different domain databases to predict protein interactions, taking the maximum score across the databases per domain pair and using product aggregation to create a final score [27]. As we are not using pre-computed scores from domain databases, and the list of domains we obtained do not natively map between databases, we created a variation of their algorithm by computing the product aggregation score for each set of domain annotations, and used logistic regression to compute a final score. We refer to this algorithm as the Domain Variant model. Finally, inspired by approaches that utilize a large set of features, we created a simple ensemble model using a random forest with features computed from best matching average aggregations for level 2 GO annotations in protein interactions, our 3 domains annotations in protein interactions, and Resnik semantic similarity as features [29,30]. See Supplementary Files S3 and S5 for full details of each algorithm.

Illogical/Random Feature Models to Serve as Control
Five different models were created that use either illogical or random features to demonstrate that the models may capture bias, rather than interaction-related properties, in datasets. By illogical feature, we mean that there is no logical reason to expect the feature to distinguish between interacting and non-interacting protein pairs. The expectation is that accurate PPI prediction methods should outperform these illogical/random models. See Supplementary Files S5 and S6 for full implementation details.
Count Bias simply counts how many more times the protein appears in positive training instances than negative training instances. A prediction score is simply the sum of these numbers for the two proteins in a data instance. Sequence Similarity Bias (Seq Sim Bias) is analogous to Count Bias, except that instead of counting each individual proteins' appearance in the training data, it counts the positive and negative instances of up to the 5 most sequence similar proteins used in the training set, and performs a weighted average based on their similarity. A third feature, Sequence Similarity + Protein Bias (Seq Sim + Prot Bias), calculates the 5 most similar proteins, and their weights, from a combination of sequence similarity and the number of overlapping proteins shared by a given pair of Entrez IDs. This protein bias check is designed to ensure that non-sequence-based features (i.e., gene annotation features) are not getting an advantage when aggregating features from proteins that map to multiple genes. Sequence-based methods are not tested for this spurious feature. Finally, random features for each protein are represented by vectors of 500 random numbers, with pairs of proteins represented by a concatenation of these values. These features are then utilized by either a random forest or a simple neural network. The neural network architecture contains multiple shared layers before the protein pair's values are concatenated and run through a final prediction layer.

Evaluation Metrics
For our evaluations, test sets with 50% positive data are compared on accuracy (Acc) and area under the receiver operating curve (AUC) measurements. These metrics allow us to compare our results with previous literature, which most commonly utilize accuracy to measure their model's predictive capabilities. When data is imbalanced, as is common in the biological domain, typically having many more negatively labeled instances than positively labeled instances, precision recall curves (P-R) are recommended, as simple accuracy and AUC calculations may be heavily influenced by predicting the negatively labelled, non-rare class frequently [35,36]. Therefore, test sets with 10% or 0.3% positive data are compared on the precision at 3% recall (Prec) and average precision (Avg P), rather than relying on accuracy-based measures. 3% recall was chosen to determine whether the algorithms are capable of making good predictions on their top scoring pairs, a necessity to provide laboratories with good sets of protein pairs to test using experimental methods. See Supplementary File S7 for additional notes on implementing these metrics.

Results
To illustrate the reasoning presented in the Introduction, we started by testing representative methods by evaluating their predictions on hub proteins against the known PPIs. First, HiPPIP is compared to Qi et al. as originally reported [37], but with recently updated data of known PPIs from HPRD and BioGRID. See Figure 1, where 1A is recomputed on the same hub proteins as in [37], and 1B is computed for hub proteins in current data with degree >100. In Figure 1B, a sequence-based predictor SPRINT [52], which in its original publication reported 79-89% AUC on 1:1 data composition, was generated from BioGRID; however, this high performance is not reflected at the proteome scale, and it is outperformed by the seminal work by Qi et al., which was published in 2009, and by HiPPIP.
We implemented various PPI prediction algorithms and tested most of them on their original datasets and evaluation metrics for verifying that the results are reproduced similarly. We created reusable datasets and evaluated the algorithms on these datasets, with commonly reported evaluation metrics as well as metrics more suitable for this domain, to benchmark their performance when predicting proteome-wide PPIs where interacting protein pairs are very rare. We suggested suitable evaluation metrics and are releasing the datasets and source code of re-implementations so that these together may lead to a rapid advancement of computational approaches for PPI prediction and their benchmarking.

Realistic Datasets for Benchmark Evaluations
We created benchmark training and test datasets at various proportions of positive class instances (1:1, 1:4, 1:9, and 1:332). In creating them, we employed two different approaches: the first approach selected pair-wise instances while ensuring that the pairs used in training and testing did not overlap (referred henceforth as Full data); the second approach held out proteins to be used exclusively in test data (referred henceforth as Held Out data). The size of each dataset is shown in Table 1.

Realistic Datasets for Benchmark Evaluations
We created benchmark training and test datasets at various proportions of positive class instances (1:1, 1:4, 1:9, and 1:332). In creating them, we employed two different approaches: the first approach selected pair-wise instances while ensuring that the pairs used in training and testing did not overlap (referred henceforth as Full data); the second approach held out proteins to be used exclusively in test data (referred henceforth as Held Out data). The size of each dataset is shown in Table 1. e 1. Sizes of different datasets created for testing. Pair sizes vary for the held-out data due to the number of pairs n from a single bin (such as 4, 4) being half the size of drawing pairs from 2 bins (such as 3, 4). Held out test sets for 0%) and 1:9 (10%) positive pairs use all positive data per held out bin.

Sequence-Based Predictors
We implemented 36 sequence-based PPI prediction algorithms with various different feature representations (Table 2) and evaluated them on the same datasets that were used in original publications ( Table 3) to confirm that our implementations were comparable (Table 4). Excluding Tian et al.'s SVM algorithm, our implementations had <1% difference in accuracy on an average relative to those reported in those publications, with a correla-

Sequence-Based Predictors
We implemented 36 sequence-based PPI prediction algorithms with various different feature representations (Table 2) and evaluated them on the same datasets that were used in original publications ( Table 3) to confirm that our implementations were comparable (Table 4). Excluding Tian et al.'s SVM algorithm, our implementations had <1% difference in accuracy on an average relative to those reported in those publications, with a correlation around 0.89. Large differences (>5%) were primarily observed when testing the smallest datasets (Martin Human and Martin H.Pylori). Overall, we obtained comparable results without extensive tuning or performing other significant hyperparameter optimization, showing that our implementations provide a good representation of previously produced works. Table 2. Different preprocessing algorithms used on amino acid sequences prior to model training.

Amino Acid Composition (AAC) Auto Covariance (AC) [9] Chaos Game Representation [53] Conjoint Triad (CT) [54] Composition-Transition-Distribution (CTD) [55]
Dipeptide Composition [56] Discrete Wavelet Transform Physicochemical [57] Encoding Based on Grouped Weight (EBGW) [58] Geary Autocorrelation [59] Local Descriptor (LD) [60] Multi-scale Local Descriptor [61] Multi-scale Continuous and Discontinuous [62] Multivariate Mutual Information [63] Moran Autocorrelation [59] Normalized Moreau-Broto Autocorrelation [59] Numeric/One Hot encoding Pseudo Amino Acid Composition [64] PSSM ( [46]) PSSM(DPC)/PSSM(Bi-gram) [13] PSSM Discrete Cosine Transform [19] Quasi Sequence Order Descriptor [65] Sequence Order [11] Skip Gram [45] Weighted Skip-Sequential Conjoint Triad [13] For each dataset used in previous literature, we compared our implementations of sequence-based predictors against the illogical/random feature control models. These comparisons are shown in Table 5. We can see that illogical/random feature models perform about as well as most previous implementations. In fact, the best result of each control model exceeds or falls within 4% accuracy of sequence-based predictors on all but two datasets. Our explanation is that sequence-based predictors likely capture information that some proteins are overrepresented in PPIs compared to random pairs (which is not an aspect that can be exploited in real-world interactome prediction). Thus, the datasets with much fewer proteins used in random pairs, such as Guo Multi Chen and both Pan Human datasets, are the easiest to obtain high accuracy on, as having several proteins exist in positive pairs without existing in negative pairs creates an easily exploitable bias when making predictions. These results strongly suggest that sequence-based prediction models inherently predict biases of individual proteins, or at least have been primarily evaluated on datasets where the biases are easy to exploit.

Benchmark Evaluation of Sequence-Based Methods
Each sequence-based predictor was trained on each of our 52 training datasets (26 based on Full data, and 26 based on Held Out data) and evaluated on the corresponding testing datasets (see Table 1). The results of these experiments are shown in Table 6 and Figure 2. Overall, most algorithms seem to be matched or outperformed by one or more of the control methods. Specifically, when scoring on datasets generated from the Full set of protein pairs with 1:1 positive class labels, using random numbers as features with a neural network model tied for 3rd in accuracy, and scored 5th in AUC in comparison to the dozens of sequence-based methods. In datasets where class distribution is skewed at 1:9, or more realistically at 1:332, this model placed 6th and 13th in average precision, outperforming over half of the sequence-based methods. Even when measuring precision at 3% recall on the skewed-class data, it outperforms half of the sequence-based predictors.
When testing on protein pairs containing Held Out proteins for test data, the accuracies and precisions of all published algorithms are much lower, with the best prediction accuracy on 1:1 data falling below 70%, and the precision at 3% recall and average precision on the 1:332 data falling below 10% for all models. Biases that exploit the number of positive and negative instances in the training data that each protein from the test dataset appear in, such as simple counts and random numbers, are eliminated when using our Held Out protein data generation method. This is shown by their prediction accuracies and precisions reducing down to being nearly equal to the percentage of positive data in the test set, implying random ordering. However, predicting on individual proteins based on sequence similarity still places in the top half of all comparisons, and in the top 10 in accuracy and AUC on 50% positive test data. Additionally, many of the published algorithms based on protein domains, and to a lesser extent on GO annotations, utilized information related to interactions either directly, by computing the probability that a pair of annotations belongs to known interactions, or indirectly by utilizing a domain database that bases its scores on the probability of a pair of domains belonging to protein interactions. When utilizing these data without filtering out interacting pairs in the test data, the results could be biased such that rare pairs of domains or GO annotations that occur in an interaction in the test dataset are scored highly solely because of that interaction's usage when generating the protein pair's features. The performance of our two algorithms using features that rely on interactions is shown in Table 8 when allowing the features to be created using all, non-testing, and non-held out proteins exclusively. Overall, there is a significant drop in accuracy when removing test pairs from the feature creation process, with a more moderate drop when holding out entire proteins instead of just the pairs in the test dataset. As test data should not influence feature creation and training, we believe this shows a significant bias which must be accounted for when calculating features. However, compared to sequence-based predictors, the drop found when changing from holding out test data to holding out entire Computations when performed on held out proteins instead of selecting from the full set of pairs. When holding out protein pairs, the algorithms exhibit a significant performance drop. Additionally, the algorithms and bias measurements score similarly across all 6 tests. Table 3. Datasets reported in prior publications. The original publications that created these datasets are referenced in the first column. The next two columns report the species of the data and the publications that report PPI prediction accuracies on these data. The data sizes (number of PPIs are number of unique proteins) are reported after mapping protein names to UniProt and filtering to remove proteins shorter than 31 residues.

Proteins in Random Data
Du [56] Yeast a Du [ Table 5. Comparisons of the range of values per dataset in previous literature and our implementations, versus 4 algorithms that rely on simple counting or random numbers. Only 1 dataset was successfully predicted by an algorithm at a rate of over 10% above using random numbers, while most algorithms struggle to predict any better than counting or using random numbers per protein.
Even the best results from each of the three datasets used in 10 or more previous implementations only exceed predictions using random numbers by less than 3%, with several algorithms falling below the bias-based predictions.

De-Biasing Annotation-Based Predictors
Computations based on gene annotation features (non-sequence-based) can induce biases when dealing with missing data. If the missing data is much more prominent in the positive or negative class (which it can be for PPI prediction datasets [72]), algorithms can learn to make predictions based on number of missing values as a spurious feature. In Table 7, we show the amount of missing data for each dataset. While negative pairs have more missing data, most pairs contain at least one of the three GO features, and some Pfam and InterPro features. Given that more than half of the negative data contain information from all 3 GO ontologies, more than 40% of them contain domain data from Prosite, the feature that is most often missing, combined with the heavy skew of negative data in the imbalanced datasets, we do not believe that, from these features, missing data alone would provide a significant advantage to predicting PPIs. Additionally, many of the published algorithms based on protein domains, and to a lesser extent on GO annotations, utilized information related to interactions either directly, by computing the probability that a pair of annotations belongs to known interactions, or indirectly by utilizing a domain database that bases its scores on the probability of a pair of domains belonging to protein interactions. When utilizing these data without filtering out interacting pairs in the test data, the results could be biased such that rare pairs of domains or GO annotations that occur in an interaction in the test dataset are scored highly solely because of that interaction's usage when generating the protein pair's features. The performance of our two algorithms using features that rely on interactions is shown in Table 8 when allowing the features to be created using all, non-testing, and non-held out proteins exclusively. Overall, there is a significant drop in accuracy when removing test pairs from the feature creation process, with a more moderate drop when holding out entire proteins instead of just the pairs in the test dataset. As test data should not influence feature creation and training, we believe this shows a significant bias which must be accounted for when calculating features. However, compared to sequence-based predictors, the drop found when changing from holding out test data to holding out entire proteins is much more moderate, suggesting less of a biasing issue when not holding out entire proteins. While the difference is smaller, for the purpose of fair comparisons, we utilized features calculated with proteins being held-out when comparing to sequence-based methods on the Held Out datasets.

Benchmark Evaluation of Annotation-Based Methods
We tested our implementations of annotation-based methods on the benchmark datasets. These results, along with predictions from illogical/random feature models and some of the best sequence-based predictors, are shown in Table 9. Overall, the results from these methods are worse than control models in several categories when not holding out any data. However, unlike control and sequence-based methods, the four methods that use small feature vectors not relying heavily on individual protein data (i.e., excluding 'Chen 2005 Dom RF' and 'Maetschke 2012 ULCA') drop much less when running on held out data, with some even improving their precision at 3% recall. Unlike the sequence-based methods, these 4 methods manage to obtain over 85% precision at 3% recall on 1:9 positive data, and 15-25% precision on 1:332 data. The latter result represents a 50x-80x performance improvement over random data, over a 7x improvement over predicting based on similar sequences, and a 2x-4x improvement over the best sequence-based methods at low recall levels. While methods using aggregations showed the ability to make good predictions at low recall levels, Chen et al.'s method using the pairwise sum of binary data per protein and Maetschke's method using a union of GO annotations between two proteins up to and including their lowest common ancestor both struggled like sequence-based methods. This may be a reflection of the large number of features used by these models, as well as how the feature vectors utilized mostly reflect each individual protein's features or a simple calculation between individual protein's features. A plot of the best annotation-based predictors versus the best sequence-based predictors can be found in Figure 3. In both precision recall curves, there is a clear gap in precision between the 4 best annotation-based predictors and sequence-based predictors at low recall levels. Table 8. Accuracy of a domain feature predictor (Zhang 2016 Domain Variant) and ensemble (Simple Ensemble) predictor using all interactions, non-test interactions, and non-held out protein pair interactions when computing features. Held out protein pairs are not valid for datasets created by selecting random pairs. The accuracy of these predictors drops significantly when removing the test interactions from the feature creation process.  Table 9. Results of non-sequence-based predictors versus control and annotation-based predictors.   Non-sequence-based methods (solid lines) perform better at lower recall levels when positive data is more rare than random pairs.

Discussion
We implemented different PPI prediction algorithms and evaluated them on benchmark datasets containing different class distributions. Our results show that most published methods perform much lower than originally reported when evaluation data are created with realistic proportions of positive and negative classes. We also showed that many of these methods may be capturing spurious, illogical features that represent the frequency of specific proteins in the data rather than meaningful information about PPIs themselves; such methods will not translate to a real-world application of proteome-wide discovery of PPIs where every protein will be tested against every other protein in the proteome.
In prior publications, most sequence-based predictors were evaluated on datasets with 50% positively labeled instances, with randomly selected protein-pairs serving as negative class data. In those datasets, PPIs (i.e., the positive class data) are drawn from a scale-free network where some nodes are hubs, and therefore appear in many protein pairs, whereas randomly paired negative class data are drawn from nearly complete-graph data. Thus, the usage rates of different proteins, particularly hub proteins, in the positively and negatively labeled instances are dramatically different, creating an easily exploitable bias. Some of this can be inferred from Table 3, which shows far more unique proteins in positive class data than in negative class data in many datasets. (If say 100 pairs are drawn from scale-free distribution and there is one hub with 25 PPIs, it contains 26 unique proteins; whereas, if that hub did not exist and it was drawn from a complete graph, the number of unique proteins from those 25 PPIs would be between 26 and 50) When evaluating these datasets, algorithms may assign class labels based on protein frequency rather than true characteristics of an interacting protein pair, and falsely shows higher performance in evaluations; the class label is assigned independently of the second protein in the pair but learns a likely invalid premise for real world interactome prediction. Our experiments showed that both on datasets from original publications and on our newly created Full datasets, control models with illogical features that simply capture protein membership can perform on par with most of the models from the literature. When analyzing the results of test datasets with proteins not utilized in the training set (Held Out), we find that most algorithms' accuracies drop significantly. Accuracies on our 1:1 test data dropped from their original results of 75-85% accuracy down to 55-67% accuracy. On more relevant metrics, namely precision at 3% recall on 1:332 positive class data and holding out test proteins gave a mere 6.1% for the best sequence-based algorithm. Thus, when both the data ratio and evaluation metric are suitably chosen, the true ability is revealed to be impractically low.
Some of the previous methods filtered out protein sequences that have a high sequence similarity, which we have not implemented; however, we have mapped our data to genelevel information, and so multiple isoforms are not included in our test data, minimizing the number of highly sequence similar proteins in our data. It is likely that if we removed proteins with similar sequences from our datasets, the results when predicting on held out proteins would be lower, as exploiting sequence similarity provides similar results to sequence-based methods. This may be analyzed in future work.
When training the methods from literature that were originally designed for 50% positive training and test datasets, we did not make adjustments to the designs or implementations to adjust for the different ratios of data. Some methods, such as class weighting, are commonly recommended when training models on imbalanced data. However, to keep the models as similar to those found in previous literature as possible, we decided not to implement adjustments per positive data ratio. We note that we did test all models on 50% positive test data and found the results to be similar to what could be obtained by prioritizing proteins by their number of known interactions. Therefore, we believe it is unlikely that using concepts such as class-weighting would drastically increase the precision of these methods.
When using annotation features, we found that feature representation has a significant impact on the results of the model. Most methods we tested were unable to outperform control models made from illogical features when generating data from all pairs of proteins. However, using Held Out protein data showed moderate precision at 3% recall even on heavily imbalanced class data, showing their predictive capabilities do not depend on exploitable, individual protein-based biases in the underlying data.
Models that use features computed from pairs of protein domains and GO annotations that appear in other interactions performed well at predicting interactions; these are well recognized to be meaningful features in predicting interactions (e.g., that two protein domains that are known to interact are highly likely to conserve that interaction/function when those protein domains appear in other proteins). Thus, using the knowledge of interacting protein domains or compatible GO annotations (specific ligand and receptor annotation in the two proteins), and along with other protein features to learn an effective classification machine learning model which helps shortlist protein-pairs for experimental validation. We also note that the two methods using only 3 features, i.e., 'Domain Variant' and ' Guo 2006 Sem', performed as well or better than other non-sequence methods based on 10-20 features. Surprisingly, our ensemble method, which contained Resnik semantic similarity, GO annotation frequency, and all 3 domain features performed worse than the methods using only 3 domain or Resnik semantic similarity features. This was most likely due to the different aggregation methods, with Domain Variant and Guo 2006 Sem using product and max aggregations respectively, while our ensemble used best matching average aggregation. This could imply that using max or product aggregation is better for predicting protein interactions, or this could suggest a bias where protein interactions are primarily known for genes with a high number of annotations. If the latter is true, product, sum, and maximum aggregations could exploit this bias, as all three functions monotonically increase as more data are provided. We leave analyzing this to future work.
As for methods where the features used by the models were highly similar to features produced for each individual protein, such as Chen 2005 Dom RF and Maetschke 2012 ULCA, we found that their performance mirrored the performances of sequence-based methods. This implies that using sequences alone is not the problematic part for sequencebased methods, but rather, any methodology that relies on producing unique feature sets per protein and using simple combinations of these features to create data for machine learning methods seem to mostly make predictions based on underlying biases in generated PPI datasets. Only when creating a small number of more complex features using pairs of proteins, instead of individual proteins, do models see a significant improvement beyond bias when positive interaction data are used as the rare class.
In conclusion, we compare P-R performance of HiPPIP with selected sequence-based and annotation-based methods ( Figure 4A) that performed the best and find that HiPPIP outperforms all of them significantly. In this evaluation, for all predictors, known PPIs from HPRD dataset are also included in assessing whether a prediction is a true positive. HiPPIP was evaluated on two sets of test data, each set containing 10 datasets: in one set, the data are created in the same fashion as Full data ('HiPPIP-a' in Figure 4A) and in the other, the positive instances for test data are taken from BioGRID published January 2017 ('HiPPIP-b' in Figure 4A); Figure 4 shows aggregated results across the datasets in 4A and individual datasets in Figure 4B,C for the two sets. The aforementioned caveat is: the specific pairs used in training HiPPIP are unavailable; therefore, there may be some overlap with training data; based on the number of known PPIs in BioGRID and considering that 20,000 PPIs were used in training HiPPIP, it is estimated that HiPPIP-a and HiPPIP-b test datasets may have 24% and 14% overlap for positive instances respectively, whereas overlap for negative data are negligible.
teracting protein domains or compatible GO annotations (specific ligand and receptor annotation in the two proteins), and along with other protein features to learn an effective classification machine learning model which helps shortlist protein-pairs for experimental validation. We also note that the two methods using only 3 features, i.e., 'Domain Variant' and 'Guo 2006 Sem', performed as well or better than other non-sequence methods based on 10-20 features. Surprisingly, our ensemble method, which contained Resnik semantic similarity, GO annotation frequency, and all 3 domain features performed worse than the methods using only 3 domain or Resnik semantic similarity features. This was most likely due to the different aggregation methods, with Domain Variant and Guo 2006 Sem using product and max aggregations respectively, while our ensemble used best matching average aggregation. This could imply that using max or product aggregation is better for predicting protein interactions, or this could suggest a bias where protein interactions are primarily known for genes with a high number of annotations. If the latter is true, product, sum, and maximum aggregations could exploit this bias, as all three functions monotonically increase as more data are provided. We leave analyzing this to future work.
As for methods where the features used by the models were highly similar to features produced for each individual protein, such as Chen 2005 Dom RF and Maetschke 2012 ULCA, we found that their performance mirrored the performances of sequence-based methods. This implies that using sequences alone is not the problematic part for sequencebased methods, but rather, any methodology that relies on producing unique feature sets per protein and using simple combinations of these features to create data for machine learning methods seem to mostly make predictions based on underlying biases in generated PPI datasets. Only when creating a small number of more complex features using pairs of proteins, instead of individual proteins, do models see a significant improvement beyond bias when positive interaction data are used as the rare class. Comparison of HiPPIP with selected sequence-based and annotation-based methods that performed the best. The two lines HiPPIP-a and HiPPIP-b correspond to whether the known PPIs in the test data were drawn from recent version of BioGRID (May 2021, same as used in this paper) or from an earlier version of BioGRID (January 2017). Both lines in (A) show an average performance across 10 datasets, whereas performance on individual datasets in each set are shown in (B) and (C), respectively. For all methods, including sequence-based and annotation methods, pairs that appears PPIs in HPRD database are also treated as true-positives; there is no discernable difference for previously published between Figure 3F and here, despite inclusion of HPRD PPIs as positive labels. It is estimated annotation-based methods that performed the best. The two lines HiPPIP-a and HiPPIP-b correspond to whether the known PPIs in the test data were drawn from recent version of BioGRID (May 2021, same as used in this paper) or from an earlier version of BioGRID (January 2017). Both lines in (A) show an average performance across 10 datasets, whereas performance on individual datasets in each set are shown in (B) and (C), respectively. For all methods, including sequence-based and annotation-based methods, pairs that appear in the HPRD database are also treated as true-positives in this figure; there is no discernable difference between the results of the sequence-based and annotation-based algorithms in Figure 3C and here, despite the inclusion of HPRD PPIs as positive labels. It is estimated that HiPPIP-a and HiPPIP-b test datasets may have about 24% and 14% overlap for positive instances respectively, because the specific pairs used in training HiPPIP are unavailable.

Conclusions
In this analysis, we compared protein interaction predictions from a variety of different algorithms in the literature, including methods based on sequences, Gene Ontology, and domain features. Overall, we found significant biases in several previous studies. When removing these biases, our results showed that pairwise features, such as semantic similarity and pairs of domains existing in known protein interactions, predicted interactions in highly imbalanced data at low recall levels that outperformed other competing methods, such as those using sequence-based features. When using methods involving feature sets per protein, as is done with most sequence-based predictions, we found that most algorithms fail to predict interactions significantly better than simple illogical features based on individual proteins.
In these implementations, we did not make any significant adjustments or tuning of the algorithm hyperparameters in order to compare these results with original reported results. We believe that by releasing the source code and the datasets, the algorithms will continue to be improved by the scientific community by devising better features and algorithms, or even by tuning the algorithms to handle the underlying skewed distribution.
Supplementary Materials: The following are available online. S1: Details of downloading and processing positive protein interactions. S2: Implementation details of sequence-based features. S3: Implementation details of annotation-based features. S4: Details of classification models and hyperparameter processing. S5: List and brief descriptions of published algorithms that are reimplemented here. S6: Implementation details of control models with illogical and random features. S7: Evaluation Metric Notes.