Review

Sequence-Based Protein–Protein Interaction Prediction and Its Applications in Drug Discovery

by François Charih 1,2,3, James R. Green 1,3,*,† and Kyle K. Biggar 2,3,*,†
1 Department of Systems and Computer Engineering, Carleton University, Ottawa, ON K1S 5B6, Canada
2 Institute of Biochemistry, Department of Biology, Carleton University, Ottawa, ON K1S 5B6, Canada
3 NuvoBio Corporation, Ottawa, ON K1M 2J2, Canada
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Cells 2025, 14(18), 1449; https://doi.org/10.3390/cells14181449
Submission received: 25 July 2025 / Revised: 4 September 2025 / Accepted: 11 September 2025 / Published: 16 September 2025

Abstract

Aberrant protein–protein interactions (PPIs) underpin a plethora of human diseases, and disruption of these harmful interactions constitutes a compelling treatment avenue. Advances in computational approaches to PPI prediction have closely followed progress in deep learning and natural language processing. In this review, we outline the state-of-the-art methods for sequence-based PPI prediction and explore their impact on target identification and drug discovery. We begin with an overview of commonly used training data sources and the techniques used to curate these data to enhance the quality of the training set. Subsequently, we survey various PPI predictor types, including traditional similarity-based approaches and deep learning-based approaches, with a particular emphasis on the transformer architecture. Finally, we provide examples of PPI prediction in system-level proteomics analyses, target identification, and the design of therapeutic peptides and antibodies. This review sheds light on sequence-based PPI prediction, a broadly applicable alternative to structure-based methods, from a unique perspective that emphasizes its role in the drug discovery process and the importance of rigorous model assessment.

1. Introduction

Understanding how proteins interact is central to deciphering the molecular machinery of life. Protein–protein interactions (PPIs) underlie virtually every cellular process, from signal transduction to immune surveillance, and are essential for maintaining homeostasis across organisms. As our ability to probe these interactions has grown, through both experimental and computational tools, so too has our appreciation for their complexity and functional importance. In particular, disruptions or aberrant formations of PPIs have emerged as key contributors to disease, transforming our view of PPIs from abstract molecular partnerships into tangible drug targets. With the explosion of proteomic data and advances in artificial intelligence, the field has entered a new phase where it is now possible to accurately predict PPIs at the proteome scale. This review explores how sequence-based PPI prediction has evolved into a critical tool for both basic biology and drug discovery.
Indeed, PPIs are the most prevalent type of interaction involving biomacromolecules [1]. They can be quasi-permanent, transient, functionally obligate, or non-obligate [2]. Virtually all biological processes involve PPIs whereby proteins physically interact to exert their function in an elegant and concerted fashion: DNA repair [3] and transcription [4], protein translation [5], cell signaling [6], and protein quality control [7], to name a few. As previously highlighted, it should therefore come as no surprise that abnormal PPIs are the main culprit in a wide range of human diseases [8,9,10]. Certain diseases are caused or influenced by unwanted interactions. This is notably the case in numerous neurodegenerative disorders, where protein aggregation in neural tissue is a key feature, e.g., Alzheimer’s disease (β-amyloid and tau protein) [11], amyotrophic lateral sclerosis (TDP-43) [12], Parkinson’s disease (α-synuclein) [13], Huntington’s disease (huntingtin) [14], and Creutzfeldt–Jakob disease (prion protein) [15]. Another common way through which PPIs can cause disease is through changes in interaction affinity following a mutation, as is the case for mutations in KRAS, which impact interactions with its effectors [14]. A well-known example is the KRAS G12D mutation, which alters the affinity of KRAS for GTPase-activating proteins (GAPs) and downstream effectors such as the RAF kinases. Altered protein stoichiometry caused by the disruption of normal gene expression patterns leads to alterations in the PPI network and is another key component of cancer [15,16,17].
The human proteome is currently believed to contain about 20,000 proteins—excluding isoforms resulting from alternative splicing or proteins modified post-translationally. As such, the total number of possible pairwise protein interactions is on the order of 200 million. Proteins selectively interact with only a fraction of all possible partners, though how many protein pairs physically interact in a biologically relevant context remains open to speculation. To validate PPIs, a variety of biochemical and biophysical detection techniques are commonly used, including yeast two-hybrid, affinity purification coupled with mass spectrometry, phage display, and pull-down assays. Interested readers may refer to the following review articles for details regarding these experimental methods [18,19,20,21,22].
Because experimental techniques are resource-intensive, expensive, and limited in their throughput, scientists increasingly rely on sequence-based, structure-based and hybrid in silico predictors to identify which potential interactions they should prioritize for in vitro investigations and validation experiments. The earliest families of PPI predictors, surveyed in [23], largely relied, in explicit ways, on genomic (co-localization of genes), evolutionary (sequence co-evolution), and structural information (presence of binding motifs and domains). These approaches have largely been superseded by machine learning (ML)-based models, though some of these older approaches remain in use [24,25,26]. New developments in computational PPI prediction have closely mirrored advances in ML and deep learning (DL), especially those of natural language processing. Recently, structure-based models have received significant attention at the expense of sequence-based methods, even if the latter remain highly relevant and broadly applicable because of the relative scarcity of high-quality protein structures and because they make fewer assumptions, as we argue later in this review (Section 2).
Beyond furthering our understanding of biological processes at the proteome scale, the ability to accurately predict whether two specific proteins are likely to engage in a physical interaction has significant implications in drug discovery. Indeed, in addition to streamlining the target identification process, this ability promises to significantly accelerate the drug design process itself, notably through the engineering of artificial peptide–protein interactions (PepPIs) [27,28,29].
In this review, we turn our attention to sequence-based PPI prediction and emphasize the importance of rigorous model assessment practices, which have been inconsistently applied. More specifically, we discuss issues including dataset bias, data leakage, and class imbalance. We also discuss PPI prediction from a unique angle that puts sequence-based PPI prediction at the center of the drug discovery process, both for target identification and therapeutic design. In contrast with other recent reviews that focus heavily on structure-based methods, we provide a broad survey of sequence-based PPI predictors, which have been losing attention to structure-based approaches despite being more broadly applicable.
First, we make the case for sequence-based approaches and explain why they represent a competitive alternative to structure-based approaches. Second, we review the machine learning methodology used to train and evaluate PPI predictors. Next, we discuss how the lack of training data for non-model organisms is managed and the issue of class imbalance. We then provide the reader with a survey of recent machine learning-based and similarity-based PPI predictors. Subsequently, we describe how sequence-based PPI prediction is reshaping the drug discovery landscape, especially peptide binder and antibody development. Finally, we briefly discuss challenges and future trends within this blooming field of research.

2. The Case for Sequence-Based PPI Predictors; Advantages over Structure-Based Prediction

Modern PPI predictors largely fall into one of three paradigms, depending on the nature of the information they use as inputs: sequence-based, structure-based, and hybrid prediction. Sequence-based predictors utilize the amino acid sequences of the proteins in a pair to make predictions. In contrast, structure-based methods make use of the coordinates of atoms in three-dimensional space. Hybrid predictors are informed by both sequence and structural information. There is an unresolved dispute among experts surrounding which paradigm shows the most promise.
While structure-based and hybrid methods have performed well and may appear more “powerful” at first, since they make use of rich, highly granular information, they are not without their limitations. First, in order to make accurate predictions, these methods require high-quality structures. At the time of writing, the worldwide Protein Data Bank (wwPDB) [30] contains slightly over 28,200 high-resolution (≤2 Å) structures involving 3772 distinct human proteins, of which only about 40% are not significantly truncated (i.e., cover >80% of the full-length protein). The growth in available high-quality structures cannot keep up with the growth of experimentally validated PPI databases (Figure 1).
While models like AlphaFold2/3 [33,34], ESMFold [35], Chai [36], and Boltz-1/2 [37,38] have produced impressive structure predictions, the quality of those predictions varies at the proteome scale. As a result, they are not expected to replace experimental methods such as X-ray crystallography and are said to be most valuable for guiding hypotheses and accelerating early discovery [39]. Furthermore, these tools have had limited success in modeling intrinsically disordered regions, which lack a clearly defined structure [40,41,42,43,44]. This is significant, given that intrinsically disordered regions are estimated to represent 30–40% [45] of the human proteome. Finally, even the most accurate protein structure prediction models are limited in their ability to model proteins whose conformation is dynamic, e.g., in response to a switch between the cofactor-free apo- and the cofactor-bound holo-states. Structure predictors like AlphaFold2 tend to model the most stable domain orientation in proteins that undergo major conformational changes [46].
The successful design of peptide binders with affinities in the nanomolar range against the neural cell adhesion molecule 1 (NCAM1) and the anti-Müllerian hormone type 2 receptor (AMHR2) by PepMLM [47] illustrates the value of sequence-based methods. In fact, PepMLM succeeded where its structure-based state-of-the-art counterpart, RFDiffusion [48], failed.
Taken together, these shortcomings support the argument that while accurate under certain conditions, structure-based approaches are far from being a panacea when it comes to predicting PPIs.

3. How PPI Predictors Work: Machine Learning Methods and Evaluation Metrics

3.1. Paradigms

One of the most widely accepted definitions for “machine learning” is that of Tom Mitchell [49]:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
This definition applies to PPI prediction: as ML-based predictors are “shown” more PPIs (E), they become more accurate (P) at distinguishing between interacting and non-interacting protein pairs (T).
The PPI prediction challenge is a binary classification problem where protein pairs must be assigned to one of two classes: interacting (“positive”) or non-interacting (“negative”). Related challenges, such as predicting binding affinities [50,51,52,53] or identifying interaction interfaces [54,55,56,57], have also been explored for two decades.
Supervised learning remains the dominant paradigm in PPI prediction, though breakthroughs in deep learning have led to the emergence of a new paradigm with which it is often combined: self-supervised learning. Supervised learning is a ML paradigm wherein models are trained using labeled data, i.e., data for which the target variable to predict is known. Its objective is to discover patterns in data that correlate with the target variable. In the PPI prediction context, because the classes of all protein pairs in the training set are known, the parameters of the model can be tuned so that it identifies and processes the correct patterns to make the best possible predictions.
Self-supervised learning is a relatively new paradigm that arose in the field of natural language processing in response to the availability of colossal quantities of unlabeled data (e.g., Wikipedia, internet forums, scientific articles, corpora of digitized books, etc.). This paradigm is typically not used to make predictions directly, but rather to uncover effective ways to distill complex data into compact, information-rich representations, typically long vectors of real numbers referred to as embeddings. These embeddings are then used as features for various related prediction tasks. Self-supervised learning is now ubiquitous in protein-related ML applications, and the embeddings generated in self-supervised settings are commonly used in conjunction with traditional supervised ML models.

3.2. Methodology

The development of most ML-based predictors, regardless of the application, follows a standard methodology (Figure 2), which we expand on in this section.

3.2.1. Data Curation

Consistent with the “garbage in, garbage out” adage, the creation of high-quality training and test sets is a necessary step towards the creation of a reliable PPI predictor. PPIs are typically retrieved from carefully curated databases, which catalog experimentally validated and probable PPIs and are made publicly available for use by biologists, biochemists, and bioinformatics practitioners. These databases list physical (direct) and genetic (indirect) interactions to facilitate, among other applications, the functional characterization of proteins and drug target identification. The most widely used and recently updated databases are listed in Table 1.
These databases provide the model with a source of interacting pairs (positives) but do not tabulate non-interacting pairs, which are required to train binary classifiers. For this reason, a set of non-interacting pairs must be carefully assembled. While this may seem simple at first, it is difficult to prove that two proteins never interact. It is possible that a protein pair not currently known to interact may eventually be shown to interact. As a result, it is typical to create negative pairs using one of the following strategies:
Assume random pairs of proteins to not interact [24];
Shuffle the amino acids in pairs (e.g., in triplets) of interacting proteins, to retain the original amino acid composition [62];
Assume pairs of proteins located in different cellular components (e.g., cytosol and the nucleus) to not interact [62] (argued to lead to overoptimistic performance estimates due to functional bias [63]).
Regardless of the approach taken, there is a risk of mislabeling a protein pair as negative when they actually would interact, though this risk is assumed to be negligible in practice.
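To make the random-pairing strategy concrete, below is a minimal Python sketch (function and variable names are our own, for illustration only) that samples random protein pairs while excluding known positives:

```python
import random

def sample_random_negatives(proteins, positive_pairs, n_negatives, seed=42):
    """Sample protein pairs assumed not to interact (random-pairing strategy).

    proteins: list of protein identifiers
    positive_pairs: set of frozensets holding known interacting pairs
    """
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_negatives:
        a, b = rng.sample(proteins, 2)
        pair = frozenset((a, b))
        # Skip pairs already known to interact; the residual risk of sampling
        # a true (but undiscovered) interaction is assumed to be negligible.
        if pair not in positive_pairs:
            negatives.add(pair)
    return [tuple(p) for p in negatives]
```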
Another resource that has been used [64] to gather negative pairs is Negatome [65,66], a database that lists protein pairs deemed to be unlikely to interact physically. The database was populated using text mining against PubMed-indexed articles and structural information.
In contrast with protein structures, sequences for all known proteins are readily available in databases such as Swiss-Prot/UniProt [67], so the sequences for all positive and negative protein pairs in a dataset can easily be retrieved.
PPI datasets are subject to sources of bias, which lower the ability of predictors to generalize to proteins sharing low sequence identity with those in the training data. First, human PPIs represent the majority of validated interactions by a very wide margin. For instance, the most recent BioGRID database release (4.4.248) [31] lists >1 M non-redundant, physical interactions for Homo sapiens. This is an order of magnitude greater than the 180,511 PPIs in Saccharomyces cerevisiae, the organism with the next-largest number of known PPIs. In fact, it is not unusual for non-model organisms to have fewer than a hundred known interactions. We discuss the use of cross-species prediction as a strategy to mitigate this in Section 4. Second, the number of known interactors can vary drastically between proteins for reasons unrelated to their capacity to engage in PPIs. Certain proteins considered to be of high biological or clinical relevance, such as the p53 tumor suppressor protein, have more known interaction partners than little-studied proteins [68]. Interactions involving certain proteins may also be more difficult to identify because of the limitations of experimental detection methods [69]. Hub proteins are also overrepresented in datasets, causing issues not only with prediction, but also with downstream PPI network analyses [68]. Finally, many of the interactions in PPI training sets involve homologous proteins that share high levels of sequence identity. This can lead to overly optimistic estimates of predictive accuracy if it is not accounted for while assembling a test set. Correctly predicting protein pairs in a test set that share high levels of identity with pairs in the training set does not inform us about the ability of a model to generalize. We introduce redundancy reduction below as a widely applied technique to mitigate that issue.

3.2.2. Feature Engineering and Data Splitting

Until the mainstream adoption of deep learning, sequence-based PPI predictors relied on human-engineered (or interpretable) vectors of real numbers as inputs to train supervised models (Figure 3A). These vectors, whose components are referred to as features or descriptors, vary in length and are numerical representations of the properties of proteins in the pair or the pair as a whole. Given that supervised models extract patterns from these descriptors and combinations thereof, the careful design of features that correlate with the target variable is primordial.
At the time of writing, however, representations of proteins and protein pairs are largely learned. Large models trained with self-supervised learning (Figure 3B) on large datasets comprising millions of protein sequences now generate effective representations of proteins in absence of any external sources of information about the proteins (e.g., physicochemical properties, evolutionary information, etc.). We discuss specific feature extraction strategies later, in our survey of ML-based PPI predictors.
Regardless of what and how features are extracted from protein pairs to enable classification, the dataset consisting of positive and negative protein pairs is invariably split into a training set and a test set. This can be performed in a stratified fashion or not, i.e., the positive-to-negative ratio may or may not be the same in the training and the test sets.
The training set is used to tune the model, i.e., to find the parameters that allow it to make the best possible predictions on those training pairs. The test set, on the other hand, is set aside early and only used to evaluate the model’s predictive accuracy on new, previously unseen data. The accuracy of the model’s predictions on a carefully prepared test set provides a measure of how well the model is expected to perform in a “real life” setting.
It is standard practice to correct for high sequence redundancy, as the presence of highly similar sequences introduces bias [70,71]. A typical way to address the issue of redundancy is to cluster the pairs based on the identity of the interacting proteins [64,72,73,74] with tools such as CD-HIT [71] or MMseqs2 [75]. A threshold of 40% identity (which allows sequences in the dataset to share ≤40% identity) appears to be commonplace in practice [24,64,73,76,77,78]. While reducing redundancy in the training data may lower bias and lead to better generalizability, redundancy reduction invariably leaves fewer training data to learn from. To our knowledge, the relationship between the identity threshold used for redundancy reduction and prediction generalizability has not been studied rigorously. As such, the optimal identity threshold and the extent to which it influences generalizability for different models remain unclear.
Correction for redundancy is not only important for reducing bias during the training of ML-based predictors, but also for obtaining accurate estimates of performance. The presence of protein sequences in the test set that share high identities with sequences in the training set is likely to lead to overly optimistic performance estimates, which do not generalize at the proteome scale. This was recognized by Park and Marcotte as early as 2012 [79]. They argue for the need to distinguish between three classes of protein pairs in the test set: pairs where both proteins are found in at least one interaction in the training set (C1; easiest), pairs where only one of the two proteins belongs to an interacting pair in the training set (C2; moderate), and pairs where neither protein appears in the training data (C3; challenging). They argue that success in classifying pairs of Class 1 is unlikely to generalize at the proteome scale. So-called “hub proteins”—promiscuous proteins—are especially susceptible to biased performance estimates and require attention when preparing a test set [80].
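The Park–Marcotte partition can be reproduced with simple set membership tests; the sketch below (our own illustrative naming, with hypothetical UniProt accessions) assigns each test pair to one of the three classes:

```python
def park_marcotte_class(test_pair, train_proteins):
    """Assign a test pair to the C1/C2/C3 difficulty classes of Park & Marcotte.

    test_pair: tuple of two protein identifiers
    train_proteins: set of proteins appearing in at least one training pair
    """
    seen = sum(1 for protein in test_pair if protein in train_proteins)
    return {2: "C1", 1: "C2", 0: "C3"}[seen]

# Example: stratify a test set so performance can be reported per class.
train_proteins = {"P04637", "P38398", "Q9Y6K9"}  # hypothetical accessions
test_pairs = [("P04637", "P38398"), ("P04637", "O15111"), ("A0AVT1", "O15111")]
for pair in test_pairs:
    print(pair, park_marcotte_class(pair, train_proteins))
# -> C1, C2, and C3, respectively
```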

3.2.3. Model Training

A variety of standard machine learning models are fit, in isolation or as ensembles, to the training data. Commonly encountered models include multilayer perceptrons (MLPs), support vector machines (SVMs), random forests (RFs), extra trees (ETs), and convolutional neural networks (CNNs).
MLPs are simple neural networks that apply compositions of non-linear operations (e.g., the sigmoid function, the rectified linear unit, etc.) to linear combinations of the input features to generate a single real number corresponding to the probability of an interaction. The parameters learned during training are the coefficients of these linear combinations. SVMs, on the other hand, project the training pairs into a high-dimensional space and attempt to fit the hyperplane that best separates, i.e., with the largest margin, interacting pairs from non-interacting pairs. RFs are ensembles of decision trees in which individual trees operate on separate random subsets of the feature space and are trained to find the split points that best separate interacting from non-interacting pairs. CNNs treat protein pairs as an image (a 2- or 3-dimensional tensor) and apply a series of convolution and pooling operations in a manner inspired by how the processing of visual information was believed to occur in the primary visual cortex. These algorithms are described in great depth in most introductory machine learning textbooks [81,82,83].
In general, the quality of the training set and the fashion in which proteins and protein pairs are numerically represented has a greater influence on classification accuracy than the specific learning algorithm used (e.g., SVMs, RFs, ETs). It is for that reason that the use of traditional machine learning models has decreased in favor of deep learning models (e.g., CNNs, RNNs, transformers), which learn rich, effective numerical representations automatically from large datasets of protein sequences.
The specific algorithmic details pertaining to how each of these models are trained are beyond the scope of this review. Suffice to say, the training procedure is, invariably, an optimization routine that minimizes an objective function (also referred to as a loss function), typically some form of misclassification error aggregate over the training set.
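As a simplified end-to-end illustration of this training procedure, the sketch below fits a random forest to placeholder feature vectors with scikit-learn; the random data stand in for real protein pair descriptors and interaction labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data: each row is a feature vector describing a protein pair
# (e.g., concatenated per-protein descriptors); y holds 1/0 interaction labels.
X = np.random.rand(1000, 343)           # e.g., conjoint triad-sized vectors
y = np.random.randint(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Fitting minimizes a misclassification-style objective internally
# (impurity-based splits for trees; a loss function for neural networks).
clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # interaction scores in [0, 1]
```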

3.2.4. Model Evaluation

Several standard metrics are used to assess the quality of a PPI predictor (Figure 4). Most of these metrics vary between 0 and 1 and are formulated as ratios of true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs) computed from predictions made on a test set of protein pairs that were not used to train the model.
Recall (also known as sensitivity) quantifies a predictor’s ability to detect interactions:
$$\mathrm{Re} = \frac{TP}{TP + FN}$$
In contrast, specificity provides a measure of how well the predictor can detect non-interacting protein pairs:
$$\mathrm{Sp} = \frac{TN}{TN + FP}$$
Precision, arguably the performance metric that matters most when the predictor is deployed to validate novel PPIs in vitro, provides the expected fraction of predicted interacting pairs that would be confirmed to interact upon testing:
$$\mathrm{Pr} = \frac{TP}{TP + FP}$$
The F1-score is sometimes found to be convenient as it captures both the recall and precision of a predictor in a single number by means of a harmonic mean:
$$F_1\text{-score} = \frac{2\,\mathrm{Pr} \times \mathrm{Re}}{\mathrm{Pr} + \mathrm{Re}}$$
Accuracy is rarely used, for reasons discussed later; it quantifies the expected fraction of correct predictions:
$$\mathrm{Acc} = \frac{TP + TN}{TP + FP + FN + TN}$$
ML-based predictors output a score—almost always a probability between 0 and 1. The closer to 0 the score is, the more confident the predictor is that the protein pair does not interact, while the opposite is true as the score approaches 1.
The metrics above require a user to define a decision threshold on the score above which protein pairs are predicted to interact. This threshold is arbitrary but is selected to achieve the desired balance between recall and precision or, less frequently, recall and specificity. Alas, improved precision incurs a cost in recall and vice versa. Two metrics that are frequently used to report a model’s performance over the range of possible threshold values are (1) the area under the receiver operating characteristic curve (AUROC) and (2) the area under the precision–recall curve (AUPRC), also sometimes called “average precision”. The AUROC and AUPRC are useful to compare the performance of different predictors, as are metrics such as the precision at a fixed recall value (e.g., the precision at 50% recall; Pr@50Re).
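All of these metrics are readily computed with scikit-learn; the sketch below is a minimal example with made-up labels and scores (the 0.5 threshold is arbitrary, as discussed above):

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score)

y_true = [1, 0, 0, 1, 0, 0, 1, 0]                    # ground-truth labels
y_score = [0.9, 0.4, 0.1, 0.7, 0.8, 0.2, 0.6, 0.3]   # predictor outputs
y_pred = [1 if s >= 0.5 else 0 for s in y_score]     # arbitrary 0.5 threshold

print("Recall:     ", recall_score(y_true, y_pred))
print("Specificity:", recall_score(y_true, y_pred, pos_label=0))
print("Precision:  ", precision_score(y_true, y_pred))
print("F1-score:   ", f1_score(y_true, y_pred))
# Threshold-free metrics computed over all possible decision thresholds:
print("AUROC:", roc_auc_score(y_true, y_score))
print("AUPRC:", average_precision_score(y_true, y_score))
```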

4. Generalizing Beyond Model Systems: Challenges and Solutions in Cross-Species PPI Prediction

The literature distinguishes between three main prediction schemes: intra-, inter-, and cross-species PPI prediction (Figure 5). The distinction is important as species are thought to differ from one another with respect to their interaction patterns [25,84].
Intra-species prediction is the most common prediction scheme wherein one seeks to predict a full or partial interaction network within a single organism, using known interactions within that organism.
By contrast, in the inter-species prediction scheme, predictions involving proteins from different organisms are made. This scheme received significant attention during the COVID-19 pandemic, when interactions between SARS-CoV-2 and human proteins, notably Spike and ACE2, were determined to be key to the infection and proliferation of the virus. Human–virus protein interaction prediction with sequence-based PPI predictors has been the topic of multiple studies since [85,86,87]. Inter-species PPI prediction has also been used to predict interactions between soybean (Glycine max) and the soybean cyst nematode (Heterodera glycines) [88], a parasite that contaminates crops and leads to millions of dollars in yield losses, as well as human–HIV interactions [89], to name a few.
Unfortunately, the scarcity of training data (i.e., known, validated PPIs) for the organism of interest is a frequently encountered problem. Cross-species PPI prediction, which differs from inter-species prediction, is the most frequently used strategy to mitigate this issue [25,84,90,91]. In cross-species prediction, a well-studied organism closely related to the target organism is taken as a “proxy” [84] for the target organism and PPIs from the proxy organism are used to inform the predictor. For instance, at the beginning of the COVID-19 pandemic, few interactions between SARS-CoV-2 proteins and human proteins had been experimentally demonstrated. Dick et al. used 689 PPIs involving proteins from closely related viral families (Flaviviridae, Togaviridae, Arteriviridae, Coronaviridae, and Hepeviridae) to inform their predictions [84]. Numerous references to studies predicting interactomes for understudied eukaryotic organisms with cross-species schemes can be found elsewhere [92].
In a recent publication, Volzhenin et al. examined, in impressive detail, how their predictor SENSE-PPI performs under the three different prediction schemes [93]. Unsurprisingly, they confirmed that the quality of the predictions in the cross-species scheme is inversely correlated with the phylogenetic distance between the proxy organism and the target organism for which PPIs are to be predicted. To demonstrate this, the authors used the so-called “mean pair sequence identity”. This metric, computed for each test pair, corresponds to the mean of the maximum identity of each protein in the pair to proteins in the training set. They observed that the performance of their model trained on human pairs was directly correlated with the mean pair sequence identity of the test set for both model (e.g., mouse, fly, yeast) and non-model organisms (e.g., cow, horse, snake). Dick et al. made similar observations by assessing the impact of the evolutionary distance (in millions of years since divergence) between the organisms whose proteins are in the training and test sets on prediction accuracy [25].
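In code, this metric reduces to a simple average once each protein’s maximum identity to the training set has been precomputed (e.g., with an alignment or search tool such as MMseqs2); a minimal sketch under that assumption:

```python
def mean_pair_sequence_identity(pair, max_identity_to_train):
    """Mean of each protein's maximum % identity to any training-set protein.

    max_identity_to_train: dict mapping protein id -> max identity (0-100),
    precomputed with an alignment/search tool (e.g., MMseqs2).
    """
    a, b = pair
    return (max_identity_to_train[a] + max_identity_to_train[b]) / 2

# Averaging over all test pairs yields the test set's mean pair identity.
def test_set_mean_identity(test_pairs, max_identity_to_train):
    values = [mean_pair_sequence_identity(p, max_identity_to_train)
              for p in test_pairs]
    return sum(values) / len(values)
```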

5. The Class Imbalance Problem in PPI Prediction

As is the case for many binary classification problems in bioinformatics such as miRNA prediction [94], posttranslational modification (PTM) prediction [95], and antimicrobial peptide prediction [96], PPI prediction is plagued with the class imbalance problem. This occurs when instances of one class, typically the “negative” class, vastly outnumber instances of the rare class, the “positives”.
It is reasonable for a researcher to wonder why this matters, especially since many published predictors fail to account for this imbalance. The answer is that testing a model on a balanced test set is a recipe for unrealistic and overly optimistic performance estimates, because a balanced test set is not a representative sample of the actual, imbalanced distribution that is to be expected upon deployment. Mitigating the effects of class imbalance involves two things: the use of an imbalanced test set to evaluate the predictor and the use of appropriate performance metrics.
In general, though not always, the higher the expected imbalance is, the more challenging the classification task is [97], so it is essential to test the predictor on a test set that is as challenging as the set of all protein pairs. Testing a predictor on a balanced dataset would be akin to administering a high school-level exam to a graduate student: the results would not provide useful information about the graduate student’s expected performance upon graduation from post-secondary school. The use of the appropriate performance metrics is another way to mitigate the effects of class imbalance. Several metrics widely adopted for benchmarking are unsuitable or largely uninformative for the evaluation of PPI predictors: accuracy and the AUROC.
Accuracy is not an adequate metric as it disproportionately favors correct classification of the majority class. For example, if 1% of all protein pairs physically interact, a predictor that consistently predicts pairs as non-interacting would achieve an accuracy of 99%. Clearly, this predictor is of no practical use, despite its very high accuracy.
The ROC curve and the area under it, in contrast with the PR curve, are insensitive to class imbalance [98,99]. As such, the AUROC does not correlate with the difficulty of the classification problem at different imbalance ratios. We illustrate these phenomena with simulated datasets in Figure 6.
In general, the metrics that are the most relevant in the presence of a high class imbalance are (1) recall, (2) precision, and (3) their derivatives, AUPRC and F1-score; we are mainly interested in detecting truly positive instances with high sensitivity and making correct positive predictions [100]. Rarely are we ever interested in correct classification of negatives in the context of PPI prediction. Some predictors developed in imbalanced settings report the prevalence-corrected precision, which can be used to estimate precision under different imbalance ratios [94,101]:
$$\mathrm{PCPr} = \frac{\mathrm{Re}}{\mathrm{Re} + r\,(1 - \mathrm{Sp})}$$
where $\mathrm{Re}$ is the recall, $\mathrm{Sp}$ is the specificity, and $r$ is the prevalent-to-rare instance ratio; for example, $r = 100$ in a scenario where a 1:100 positive-to-negative ratio is expected.
The PCPr has been applied to evaluate models in the context of PPI prediction [101,102,103], but also in other bioinformatics analyses where large imbalances are expected (e.g., miRNA discovery [94,104]). Correcting for class imbalance lowers the expected precision of the predictor upon deployment, yielding performance estimates that reflect a realistic data distribution.
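The PCPr is a one-line computation; a minimal sketch that transcribes the formula above directly:

```python
def prevalence_corrected_precision(recall, specificity, r):
    """Prevalence-corrected precision (PCPr).

    r is the prevalent-to-rare instance ratio, e.g., r=100 when a
    1:100 positive-to-negative class ratio is expected at deployment.
    """
    return recall / (recall + r * (1.0 - specificity))

# A predictor with 80% recall and 99% specificity, evaluated at 1:100:
print(prevalence_corrected_precision(0.80, 0.99, r=100))  # ~0.44
```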
The key evaluation metrics in the context of class imbalance are summarized in Table 2.
Though the ratio between interacting and non-interacting protein pairs is difficult to approximate, it is widely accepted that the majority of protein pairs do not interact. In a 2009 study [105], Venkatesan et al. predicted, based on the results of a yeast two-hybrid assay variant they developed, that the human interactome contains somewhere between 74,000 and 200,000 interactions (i.e., a ~1:1200–3400 imbalance ratio). That estimate is still cited [106]. Stumpf et al. [107] estimated the size of the human interactome at 650,000 interactions (i.e., a ~1:400 imbalance ratio) with a graph-based approach. It has been argued that these estimates may be too high, both because some reported interactions rest on low-quality evidence supported by a single, potentially unreproducible, observation and because some detected interactions are indirect and occur in a co-complex [108]. Conversely, other authors argue for the existence of a “dark interactome” comprising interactions that cannot be identified with traditional techniques and/or experimental conditions [15]. Rolland et al. revised the human interactome size estimate to ~140,000 in 2014 [109]. PPI predictors are typically tested under much lower imbalance regimes, such as 1:10 [25,73,77] or 1:100 [24]. This suggests that the performance of predictors may be consistently overestimated. While the exact imbalance ratio remains unclear, authors should refrain from using low imbalance estimates (e.g., 1:1, 1:10) unsupported by the current evidence and opt for more plausible ratios (e.g., 1:100 or 1:1000) [24,110]. The use of a standardized imbalanced dataset for benchmarking would significantly benefit the field.

6. Old but Gold: Sequence-Based Protein–Protein and Peptide–Protein Predictors

In this section, we survey the two main families of PPI predictors: machine learning-based and similarity-based predictors. A table outlining key aspects of these predictors can be found in the Supplementary Materials (Table S1).

6.1. Machine Learning-Based Approaches

As stated previously, many PPI predictors employ standard, out-of-the-box machine learning models or combinations thereof. The most noteworthy difference between most predictors is the approach used to generate protein representations, i.e., feature vectors, which is the focal point of interest in this section.
The first widely acknowledged sequence-based predictor was published by Guo et al. in 2008 [62]. Their model used seven physicochemical properties of the side chains as features: hydrophobicity, hydrophilicity, side chain volume, polarity, polarizability, solvent-accessible surface area, and net charge index. To account for the fact that proteins vary in length, each of these features is aggregated along the length of the protein into a single number using the auto-covariance (AC) aggregator to generate a seven-dimensional protein representation. The authors describe AC as “the average interactions between residues, a certain lag apart throughout the whole sequence”. The two vectors for a protein pair are then concatenated to generate a single 14-dimensional (14-D) vector. The classification model is an SVM trained on protein pairs represented as 14-D vectors. This work was highly influential, especially for its use of AC to generate a fixed-length feature vector from protein sequences that vary in length, as well as for the strategy employed to generate non-interacting pairs.
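A minimal sketch of the AC computation for a single property and lag (our own simplified implementation; the original work computes this for each of the seven properties):

```python
import numpy as np

def auto_covariance(values, lag):
    """Auto-covariance of a per-residue property profile at a given lag.

    values: 1-D array of a physicochemical property (e.g., hydrophobicity)
            for each residue in the sequence.
    """
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    n = len(values) - lag
    return np.sum((values[:n] - mean) * (values[lag:] - mean)) / n

# Example: a hypothetical hydrophobicity profile for a short peptide.
profile = [0.62, -0.9, 1.19, 0.29, -0.74, 0.48]
print(auto_covariance(profile, lag=1))
```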
Since then, a wide variety of human-engineered features to describe protein sequences as vectors of real numbers have been proposed. These feature sets are summarized in Table 3. The amino acid composition (AAC) is simply a 20-D vector where each entry corresponds to the percentage of one of the 20 amino acids within the sequence. Chou proposed the pseudoamino acid composition (PseAAC) [111], a variation on AAC that aims to preserve some of the sequential information in the protein sequence by adding to the AAC vector factors that account for the correlation of physicochemical properties of sets of amino acids spaced 1, 2, 3, …, λ amino acids apart. PseAAC generates a (20 + λ)-D vector.
Another feature vector frequently encountered in PPI prediction is the output of the conjoint triad (CT) method [112]. In the CT method, each of the 20 possible amino acids is assigned to one of seven groups whose members share similar physicochemical properties (side chain volume and dipole). The counts of each possible group triplet in the protein sequence make up the entries of the 7 × 7 × 7 = 343-D feature vector, which is subsequently min–max normalized.
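The CT encoding is compact enough to sketch in full; the group assignments below follow the seven-group scheme as commonly reproduced in implementations of the method (verify against [112]):

```python
from itertools import product

# Seven amino acid groups defined by side-chain volume and dipole.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: i for i, g in enumerate(GROUPS) for aa in g}

def conjoint_triad(sequence):
    """343-D conjoint triad vector: min-max-normalized group-triplet counts."""
    counts = {t: 0 for t in product(range(7), repeat=3)}
    encoded = [AA_TO_GROUP[aa] for aa in sequence if aa in AA_TO_GROUP]
    for i in range(len(encoded) - 2):
        counts[tuple(encoded[i:i + 3])] += 1
    values = list(counts.values())
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

print(len(conjoint_triad("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")))  # 343
```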
The composition, transition, and distribution (CTD) method [113] accounts for several properties (seven in the original paper), each of which is split into three groups of amino acids (e.g., the “neutral”, “negative”, and “positive” groups for the charge property). The “composition” feature corresponds to the frequency of each of those groups in the sequence. The “transition” feature measures the frequency at which an amino acid of one group is followed by an amino acid of another group (a so-called “transition”). Finally, the “distribution” features represent the sequence lengths required to contain the first, 25%, 50%, 75%, and 100% of the amino acids of a particular group. The CTD method with seven properties and three groups per property yields a 441-D vector.
Another common numerical description of a protein sequence is the position-specific scoring matrix (PSSM). This 20 × L matrix is generated by running the PSI-BLAST program [114] against a large database of protein sequences to compute the probability of observing all 20 amino acids at a given position in an alignment of high-scoring BLAST hits obtained from the input sequence for each of its L residues. The PSSM captures important conservation patterns between the protein of interest and other proteins.
Romero-Molina et al. used ProtDCal [115] to generate tens of thousands of descriptors, of which a small number were selected to train an SVM model [64]. ProtDCal applies a variety of grouping schemes, weights, and aggregation operations to the physicochemical properties of the residues in the input sequence to generate features that represent the protein globally and locally.
The Word2Vec model [116] was also used by some groups [86,117] to generate protein representations that capture context. The Word2Vec model, initially developed for natural language processing, can be used to generate protein embeddings if it is trained on large numbers of protein sequences. Two strategies to train Word2Vec are used: skip-gram, where the model learns representation by being trained to predict context from a word, and continuous bag of words, where the model learns representations by being trained to predict a missing word from context.
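A minimal gensim sketch of this idea, assuming sequences are tokenized into overlapping 3-mer “words” (a common choice; the cited studies [86,117] may differ in tokenization and pooling details):

```python
import numpy as np
from gensim.models import Word2Vec

def tokenize(sequence, k=3):
    """Split a protein sequence into overlapping k-mer 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Toy corpus; in practice one would train on millions of sequences.
corpus = [tokenize(s) for s in ["MKTAYIAKQRQISFVK", "MSHHWGYGKHNGPEHW"]]

# sg=1 selects the skip-gram objective; sg=0 selects continuous bag of words.
model = Word2Vec(sentences=corpus, vector_size=64, window=5, min_count=1, sg=1)

# A simple protein-level embedding: the average of its k-mer vectors.
protein_vec = np.mean([model.wv[w] for w in tokenize("MKTAYIAKQRQISFVK")],
                      axis=0)
```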
The models trained on combinations of these representations to predict PPIs include CNNs [72,118,119,120,121,122], SVMs [62,64,123], RFs [76,85], MLPs [122,124], and ensembles of such models [76,85,125].
Owing to their capacity to learn effective representations combining global and local features, CNNs have come to dominate the PPI prediction landscape at the expense of traditional, human-engineered representations. The field is currently undergoing a rapid paradigm shift accelerated by the advent of the attention mechanism and the transformer neural network architecture, which we introduce in the next section. However, it has been argued that this shift might be a mirage: CNN-based pre-trained protein models have been shown to be significantly more efficient than transformers and competitive with or superior to them on downstream tasks such as protein structure prediction, mutation effect prediction, and protein fitness prediction [126].

6.2. Protein Language Model-Based Approaches

The publication of the transformer neural network architecture by Google in the 2017 paper, Attention Is All You Need [127], was nothing short of transformative for the field of natural language processing (NLP). The multibillion-parameter large language models (LLMs) familiar to readers at the time of writing, such as OpenAI’s ChatGPT [128] and Google’s Gemini [129], invariably build on top of this architecture.
What makes transformer-based models so powerful for modeling language is the use of the self-attention mechanism, which allows models to attend more to relevant and/or related elements when processing a sequence. For instance, let us consider the natural language sentence: “Peptide inhibitors are a promising therapeutic modality”. When embedding the word “Peptide” into the representation of the sentence, the model would consider the strength of its relationship with the words “inhibitors” and “promising” as stronger than with the words “are” or “a”. This mechanism allows transformers to better capture the relationship between words in a sequence than other deep learning models such as long short-term memory networks (LSTMs) could previously.
The idea of treating protein sequences as a language with its own vocabulary, semantics, and grammar is relatively recent, even if statistical models such as Hidden Markov Models (HMMs) have been used for homology modeling and sequence retrieval in databases for decades [130,131,132]. Before the emergence of the transformer, LSTMs were used to model the protein language [133], but they have since fallen out of favor and been replaced with transformer-based models, except for applications where little training data is available. LLM architectures trained to model the language of protein sequences are referred to as protein language models (pLMs) and have become a mainstay in PPI prediction and protein bioinformatics more broadly.
The main purpose of pLMs is to generate rich, high-dimensional embeddings of protein sequences that indirectly capture a sequence’s physicochemical, evolutionary, functional, and structural information. These models typically learn effective protein representations through the masked language modeling task where one or multiple residues within tens of millions of protein sequences are masked and the model is tasked to correctly predict the missing residues based on the context (i.e., the other known residues in the sequence).
A number of “foundational” pLMs trained at great expense on very large sets of pre-clustered, non-redundant protein sequences, such as UniRef [134] or the “Big Fantastic Database” [135], are publicly available (Table 4). These models can be used to generate representations that can be used as-is for downstream classification challenges with traditional models (e.g., SVMs, MLPs, CNNs, RFs, etc.) or fine-tuned on additional data to generate protein sequences having desired properties, such as antimicrobial peptides [136,137].
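As an illustration, fixed-length protein embeddings can be extracted from ESM-2 with the fair-esm package in a few lines; this is a minimal sketch, and the model variant and pooling strategy vary across studies:

```python
import torch
import esm  # the fair-esm package

# Load a pre-trained ESM-2 model (the 650M-parameter variant shown here).
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("query_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])

# Per-residue embeddings from the final layer; mean-pool over residues
# (excluding the BOS/EOS tokens) to obtain a fixed-length protein embedding.
residue_reprs = out["representations"][33]
protein_embedding = residue_reprs[0, 1:-1].mean(dim=0)  # 1280-D vector
```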
Over the last couple of years, pLM-generated representations have been increasingly used to predict PPIs. For instance, Hu et al. trained their CNN-based model, KSGPP [121], on ESM-2 protein embeddings combined with a graph-based representation of the STRING PPI network. Another model, TuNA [140], also uses ESM-2 embeddings, which are further processed in another transformer network and classified with a Gaussian process classifier. The xCAPT5 model by Dang et al. [122] uses the embeddings produced by the ProtT5-XL model trained on UniRef50 as inputs to its Siamese CNN model.

6.3. Similarity-Based Approaches

The prediction of the comprehensive human interactome had long been intractable before the development of a “massively parallel” implementation of the PIPE algorithm (MP-PIPE) [141]. Thanks to its ability to predict in a fraction of a second whether a protein pair will interact, MP-PIPE was deployed on the 250 M pairs of human proteins and provided the first map of the human interactome in 2011. This feat, which involved a significant amount of computing time (~3 months on a cluster with 800 compute cores, or ~170,000 CPU-hours), revealed 130,470 high-confidence novel interactions (when setting the threshold at a 0.05% false positive rate). The Scoring PRotein INTeractions (SPRINT) algorithm [24] was demonstrated in 2017 to be capable of predicting the human interactome in a matter of hours on a consumer-grade workstation.
More recently, deep learning models were used to make predictions for large numbers (i.e., tens of millions) of pairs, but these methods first applied filters to reduce the number of predictions to make (e.g., only scoring pairs with a common subcellular localization [106]). Because of MP-PIPE and SPRINT’s unique ability to predict entire interactomes, it is worth discussing these two PPI prediction algorithms, which belong to a family of methods referred to as “similarity-based methods”.
Similarity-based methods rely on gap-free alignment scores obtained with substitution matrices like PAM120 [142] as a measure of similarity between the windows of contiguous amino acids within proteins.
The fundamental idea underpinning these methods is that protein interactions are mediated by short windows of contiguous amino acids. Clearly, this working assumption is incorrect as residues located at the interface, which mediate the interaction, may be far apart in the protein’s sequence but proximal in 3D space as a result of protein folding. Nonetheless, similarity-based methods have been surprisingly useful. To quantify the evidence in favor of a putative interaction, these methods posit that a pair of known interacting partners, (I1, I2), provide evidence for an interaction between a pair of query proteins, (Q1, Q2), if Q1 contains a short window similar to a window in I1 and Q2 contains a short window that is similar to a window in I2.
The various implementations of PIPE [25,141,143] and SPRINT quantify the evidence by counting the number of similarity occurrences with proteins in an interaction list. While PIPE outputs a score manipulated to lie in the range [0, 1], SPRINT outputs a score in the interval [0, ∞). Because they essentially produce counts, the scores output by PIPE and SPRINT should not be interpreted as interaction probabilities; instead, they should be thought of as the strength of the evidence in support of an interaction. Since similarity-based methods produce counts of similarity with interacting proteins, the concept of a non-interacting pair is foreign to these algorithms, and they therefore do not require a set of non-interacting protein pairs to make predictions.
PIPE and SPRINT are very similar, but differ in a few key aspects, some of which pertain to computational efficiency while others pertain to the scoring itself. In contrast with PIPE, SPRINT does not perform an exhaustive search for regions of similarity between proteins. Instead, it uses spaced seeds to coarsely look for potential regions of similarity to be assessed for similarity with more scrutiny. This heuristic allows SPRINT to achieve impressive speed gains over PIPE, as identifying the regions of similarity between proteins is the most computationally intensive step of both algorithms. Second, PIPE uses a simple threshold and does not account for the level of similarity between windows in query proteins and interacting proteins, while SPRINT weighs the counts with the alignment scores.
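The counting logic common to both algorithms can be conveyed with a deliberately naive sketch (our own didactic simplification: mismatch counting stands in for PAM120 scoring, and none of the engineering that makes PIPE and SPRINT fast is shown):

```python
def similar(w1, w2, max_mismatches=1):
    """Toy window similarity; PIPE/SPRINT instead score windows with PAM120."""
    return sum(a != b for a, b in zip(w1, w2)) <= max_mismatches

def windows(seq, w=20):
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def has_similar_window(query, interactor, w=20):
    """True if any window of `query` resembles any window of `interactor`."""
    return any(similar(a, b)
               for a in windows(query, w) for b in windows(interactor, w))

def evidence_score(q1, q2, known_pairs, w=20):
    """Count known interacting pairs (i1, i2) lending evidence to (q1, q2)."""
    score = 0
    for i1, i2 in known_pairs:
        if (has_similar_window(q1, i1, w) and has_similar_window(q2, i2, w)) or \
           (has_similar_window(q1, i2, w) and has_similar_window(q2, i1, w)):
            score += 1
    return score  # a count (strength of evidence), not a probability
```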
Because these methods have no awareness of the concept of a non-interacting protein pair and are not intrinsically classifiers, one must set a threshold on the score below which two proteins are assumed not to interact. One way to achieve this is to determine the score threshold that achieves a certain sensitivity and/or specificity in a cross-validation experiment [141,143]. That said, Dick and Green made the argument that a global threshold applied to all protein pairs may not be appropriate and suggested a meta-classifier called Reciprocal Perspective (RP) [101]. RP considers the interaction scores of the query proteins with one another among all scores from the perspective of both query proteins and outputs revised interaction scores on which a global threshold can be set. The use of RP leads to significantly more sensitive and precise predictions than is possible with a global threshold on the original interaction scores.
In spite of their relative simplicity and the incorrect assumption that the interactions are mediated by contiguous amino acids, PIPE and SPRINT remain competitive to this day [26], and newer models are still frequently compared against them [26,78,80,91,144].

7. Protein–Protein Interaction Prediction for Drug Development

7.1. Identification of Drug Targets with PPI Network Analysis

The prediction and validation of novel PPIs with in silico and in vitro approaches and the mapping of PPI networks (local and full interactomes) grant us the ability to identify proteins and PPIs to target and, consequently, to modulate biological pathways and treat diseases.
A number of network-based approaches operating on PPI networks have been developed for the specific purpose of identifying drug targets. These approaches represent PPI networks as graphs and apply graph theory metrics (e.g., degree, betweenness, distance, eccentricity, modularity, coreness, etc.) and algorithms (e.g., shortest distance) to identify proteins that can be targeted [145,146]. The analysis may include constraints on connectivity aiming to minimize potential side effects resulting from the changes to the networks associated with the disruption of a PPI. Such approaches have been leveraged to identify targets in glioblastoma [147], depression [148], cancer [149], and mucopolysaccharidosis [150]. Notably, Gordon et al. used a similar approach to identify targets for drug repurposing to treat SARS-CoV-2 infections [151]. These analyses were conducted using incomplete PPI networks extracted from large databases of experimentally validated PPIs, but not from predicted interactomes. However, predicted interactomes could and are likely to be used in the future to conduct such analyses.
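As a toy illustration of such graph-theoretic analyses (real pipelines combine many metrics and side-effect constraints), candidate targets can be ranked by centrality with networkx; the interaction edges below are a small hand-picked example:

```python
import networkx as nx

# Build a toy PPI network; nodes are proteins, edges are interactions.
G = nx.Graph()
G.add_edges_from([("TP53", "MDM2"), ("TP53", "EP300"), ("MDM2", "UBE2D1"),
                  ("EP300", "CREBBP"), ("TP53", "BRCA1"), ("BRCA1", "BARD1")])

# Rank candidate targets by simple graph-theoretic centralities.
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
ranked = sorted(G.nodes, key=lambda n: degree[n] + betweenness[n], reverse=True)
print(ranked)  # hub/bottleneck proteins surface first (here, TP53)
```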
The details of these approaches are out of scope. More thorough treatment of PPI network-driven target identification can be found in other review articles [145,152,153].

7.2. Targeting PPIs with Peptide Binders

One reason small molecules have prevailed over peptides up to this point is that peptides tend to have much lower bioavailability, meaning that they reach the target site of action in lower quantities due to stability issues and, as a result, tend to be much more difficult to deliver successfully, especially through the oral route [154]. Significant progress in peptide delivery has nonetheless been achieved. For example, chemical modifications can enhance the stability of peptides, while advanced delivery vehicles such as implants, nanoparticles, gels, or emulsions can control their release into the blood or tissues [155]. Recent advances in targeted peptide delivery are surveyed in depth in several review articles [156,157,158]. Thanks to these advances, peptide therapeutics are now considered an “emergent therapeutic approach” [159].
Peptides have several advantages over small molecules for the disruption of PPIs:
Specificity: Peptides can be designed such that they bind to a target protein and few off-target proteins. This is a highly desirable property, as off-target interactions can lead to undesired side effects and limit the usefulness of a drug. In general, specificity is considered to be more difficult to achieve with small molecules. For example, side effects such as cytotoxicity are observed with tyrosine kinase inhibitors used to treat cancers as a result of their low specificity [154].
Ease of synthesis: While certain small molecules predicted to have desirable properties may be difficult or impossible to synthesize, the entire (linear) peptide space is chemically accessible because peptides are simply chains of amino acids that can be linked with well-understood chemistry. Modern peptide synthesis methods such as solid-phase peptide synthesis [160] make the screening of large peptide libraries possible. Peptides can also be modified in a number of ways: cyclization, stapling, lipidation, etc., to enhance their pharmacological properties (e.g., stability, solubility, specificity, etc.).
Wider target range: Thanks to their larger sizes, peptides can be used to target large surfaces that small molecules cannot and inhibit PPIs, which typically involve large and shallow surfaces [154].
Because peptides are short proteins, sequence-based PPI predictors can be leveraged directly or indirectly to design binders of proteins involved in a PPI one wishes to disrupt [161], and several groups have capitalized on that fact.
InSiPS [162,163] is a genetic algorithm developed at Carleton University that evolves peptide binders against a specific target. InSiPS uses the interaction scores output by MP-PIPE, which is informed by validated PPIs, as a measure of peptide quality (“fitness”). More specifically, it attempts to design a peptide that has a high interaction score with the target protein and low interaction scores with off-target proteins. InSiPS defines peptide fitness as
$$f = s_{\text{target}} \left[ 1 - \max\left( S_{\text{off-target}} \right) \right]$$
where $s_{\text{target}}$ is the interaction score output by MP-PIPE and $S_{\text{off-target}}$ is the set of scores between the peptide and all off-target proteins. InSiPS was first validated in the laboratory through the design of polypeptides targeting several yeast proteins. In fact, the authors successfully designed and characterized a binder targeting the yeast protein Psk1 with an affinity of 2 nM [163]. In 2022, InSiPS was used to design peptide binders that interact with SARS-CoV-2’s Spike protein with a dissociation constant in the nanomolar range [164]. To our knowledge, InSiPS is the only published peptide binder design algorithm that accounts for and attempts to simultaneously minimize off-target interactions.
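The fitness formula translates directly into code; a minimal sketch:

```python
def insips_fitness(target_score, off_target_scores):
    """Fitness of a candidate peptide as defined by InSiPS: high affinity
    for the target, penalized by the worst off-target interaction score.

    target_score: MP-PIPE interaction score with the target protein
    off_target_scores: iterable of scores against all off-target proteins
    """
    return target_score * (1.0 - max(off_target_scores))

# A peptide scoring 0.9 on-target with a worst off-target score of 0.2:
print(insips_fitness(0.9, [0.05, 0.2, 0.1]))  # 0.72
```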
The CAMP model [165] is a sequence-based PepPI predictor that combines CNNs and self-attention modules for both the peptide and the protein. While it has not been used specifically for the purpose of designing peptide binders, one could apply it in a way similar to InSiPS.
The Chatterjee group at Duke University has proposed several sequence-based generative tools that rely on PPIs and/or PepPIs to design peptide binders of proteins. For instance, Cut&CLIP [166] applies a contrastive learning strategy. First, it embeds peptide fragments extracted from known interacting proteins, along with the target protein’s sequence, with ESM-2. Then, peptide and protein encoders are trained to re-embed these representations into a common latent space where the embeddings of interacting peptides and proteins lie in close proximity, while the embeddings of non-interacting peptides and proteins lie far apart.
The same group published PepMLM [47], an ESM-2-based transformer model fine-tuned to design peptide binders by framing the problem as masked language modeling: PepMLM is tasked with predicting masked residues (the peptide binder's sequence) from the available context (the target's sequence). PepMLM was trained on 10,000 PepPIs curated from the Propedia database [61]. With their method, the authors generated "ubiquibodies", peptides fused to an E3 ubiquitin ligase, that interacted with Huntington's disease-related proteins (MSH3 and mHTT) to cause their degradation through proteolysis in vitro. In addition to those targets, PepMLM-generated ubiquibodies were demonstrated to cause the degradation of NCAM1, a key marker of acute myeloid leukemia, and AMHR2, a protein involved in polycystic ovarian syndrome.
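The masked-language-modeling formulation itself is easy to illustrate. The sketch below uses a small, generic ESM-2 checkpoint from Hugging Face to greedily unmask a peptide appended to the target sequence; PepMLM is a fine-tuned model with its own decoding strategy, so this shows only the inference pattern, not the published method.

```python
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

# Generic (not fine-tuned) checkpoint, chosen here only to keep the sketch small.
name = "facebook/esm2_t12_35M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(name)
model = EsmForMaskedLM.from_pretrained(name).eval()

def design_peptide(target_seq: str, peptide_len: int = 12) -> str:
    # Frame the binder as a run of masked residues conditioned on the target.
    text = target_seq + tokenizer.mask_token * peptide_len
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().squeeze(-1)
    ids = logits[0, mask_pos].argmax(dim=-1)   # greedy one-shot unmasking
    # A real pipeline would restrict logits to the 20 amino acids and sample.
    return tokenizer.decode(ids).replace(" ", "")
```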
PepPrCLIP [167] was trained on 11,597 PepPIs extracted from the PDB using a contrastive learning approach similar to that of Cut&CLIP. The main difference lies in how the peptides are generated: in PepPrCLIP, candidate peptides are generated by adding Gaussian noise to the embeddings of existing peptide binders in the ESM-2 latent space, and the perturbed embeddings are subsequently decoded back into sequence space. Using PepPrCLIP, the authors generated ubiquibodies that degraded β-catenin via ubiquitin-mediated proteolysis in vitro.
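The generation step reduces to perturb-and-decode. The sketch below adds Gaussian noise to the embedding of a known binder; because the learned decoder is not reproduced here, a nearest-neighbour lookup over a peptide library stands in for decoding back to sequence space, an assumption of this illustration rather than the published approach.

```python
import torch
import torch.nn.functional as F

def sample_candidates(seed_emb, library_embs, library_seqs, sigma=0.1, n=8):
    # seed_emb: (dim,) embedding of a known peptide binder.
    # library_embs: (N, dim) embeddings of candidate peptides; library_seqs: their sequences.
    noisy = seed_emb + sigma * torch.randn(n, seed_emb.shape[-1])  # Gaussian perturbation
    sims = F.normalize(noisy, dim=-1) @ F.normalize(library_embs, dim=-1).T
    return [library_seqs[i] for i in sims.argmax(dim=-1).tolist()]  # stand-in "decoder"
```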
Taken together, these examples demonstrate that sequence-based PPI predictors and generative models trained on PPIs/PepPIs hold great promise for designing peptide-based therapeutics, which are currently in high demand.

7.3. Antibody Design

Attempts have also been made to leverage sequence-based models to design antibody-based therapeutics that can disrupt PPIs. These models, trained on millions to hundreds of millions of antibody sequences, are almost without exception variants of well-known pLM architectures. They are especially useful for optimizing the sequence of an antibody against a known target protein, i.e., for increasing the affinity of an existing PPI.
AntiBERTy [168] is a frequently cited variation on the classic BERT transformer [169]; it was trained on 558 M natural antibody sequences to learn the "antibody language" and produce rich embeddings of antibodies. AntiBERTa [170] is another pLM, based on the RoBERTa architecture [171], trained on 42 M heavy chains and 15 M light chains to produce rich representations of B cell receptors; such models can be further fine-tuned for downstream tasks such as paratope prediction. Other antibody language models include IgBert and IgT5 [172], variants of the BERT and T5 [173] architectures, and AbLang [174], another RoBERTa variant.
Several groups have proposed generative sequence-based models for antibody design. Among these is the Immunoglobulin Language Model (IgLM) [175], based on the GPT-2 architecture [176]. IgLM was trained on 558 M natural antibody sequences from the Observed Antibody Space database [177] to infill masked spans of residues in antibody sequences; in doing so, it learned to generate human-like antibodies. One application proposed by the authors is the generation of CDR loop variants, which can be assembled into a library and screened as part of an antibody optimization pipeline.
Recently, Hie et al. [178] used the ESM-1b pLM [179] and the ESM-1v pLM ensemble [180] to optimize or "mature" seven antibodies against coronavirus, ebolavirus, and influenza A virus. They produced antibodies with improved neutralization activity after only two rounds of laboratory evaluation and evolution. Interestingly, they suggest that models trained on all protein sequences, as opposed to antibody sequences only (e.g., AntiBERTy/AntiBERTa), are advantageous because they learn more general rules of evolution. To optimize an antibody, they selected the mutations that the pLMs predicted to be most likely to occur as part of natural evolution.
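The scoring step behind such guided evolution can be sketched with any masked protein language model: compute the model's per-position amino acid log-probabilities and flag substitutions it considers more likely than the wild-type residue. The checkpoint below is a small ESM-2 model substituted purely to keep the example light; Hie et al. used ESM-1b and the ESM-1v ensemble.

```python
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

name = "facebook/esm2_t12_35M_UR50D"  # illustrative small checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = EsmForMaskedLM.from_pretrained(name).eval()

def plausible_mutations(seq: str, top_k: int = 10):
    # Rank single substitutions that the pLM deems more likely than wild type.
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        logp = model(**inputs).logits[0].log_softmax(dim=-1)  # (tokens, vocab)
    suggestions = []
    for pos, wt in enumerate(seq, start=1):  # offset of 1 skips the BOS token
        wt_id = tokenizer.convert_tokens_to_ids(wt)
        for aa in "ACDEFGHIKLMNPQRSTVWY":
            if aa == wt:
                continue
            gain = (logp[pos, tokenizer.convert_tokens_to_ids(aa)] - logp[pos, wt_id]).item()
            if gain > 0:  # substitution scored as more probable than wild type
                suggestions.append((f"{wt}{pos}{aa}", gain))
    return sorted(suggestions, key=lambda x: -x[1])[:top_k]
```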

8. Summary and Future Trends

Until recently, machine learning models such as SVMs and CNNs trained on human-engineered features constituted the majority of sequence-based PPI predictors. There is now, however, a clear trend towards pLM-generated protein representations. This paradigm shift towards language modeling of biopolymers extends well beyond protein classification and design, into genomics [181,182,183] and transcriptomics [184,185].
While pLMs have enabled significant advances in PPI prediction and other applications in computational protein science, the computing resources necessary to train these pLMs constitute a serious issue. In fact, large architectures may require millions of dollars to train “from scratch”. For instance, the cost of training the largest 15B parameter ESM-2 architecture using Amazon Web Services was estimated at 1.5M USD in 2024 [186]. Fortunately, training architectures from the ground up is rarely necessary. Smaller research groups largely rely on foundational pLMs (e.g., ESM-2, ProtT5, Ankh) pre-trained at great expense by large companies or laboratories and either fine-tune them or train downstream models on the embeddings they generate, both of which can be achieved at a much lower cost. Research has also demonstrated that scaling model size may not be as impactful as the quality of the training data, which suggests that smaller models could be trained in the future with negligible performance costs, allowing for the democratization of the field. For example, the AMPLIFY pLM architecture with 350 M parameters introduced by Fournier et al. [186], combined with a thoughtful training set curation strategy, was shown to outperform ESM-2 (15B parameters) in several tasks with 43× fewer parameters and 17× fewer floating-point operations.
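To illustrate the lower-cost route, the sketch below trains a downstream PPI classifier on frozen embeddings from a small, publicly available ESM-2 checkpoint. The checkpoint, mean pooling, and feature concatenation are illustrative choices rather than a prescription; most recent predictors use more elaborate heads over the same kind of embeddings.

```python
import torch
from transformers import AutoTokenizer, EsmModel
from sklearn.linear_model import LogisticRegression

name = "facebook/esm2_t12_35M_UR50D"  # pre-trained at someone else's expense
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = EsmModel.from_pretrained(name).eval()

def embed(seq: str) -> torch.Tensor:
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state[0]  # (tokens, dim)
    return hidden.mean(dim=0)                            # mean-pooled embedding

def pair_features(a: str, b: str) -> torch.Tensor:
    return torch.cat([embed(a), embed(b)])

# pairs: list of (seq_a, seq_b); labels: 1 = interacting, 0 = non-interacting
# X = torch.stack([pair_features(a, b) for a, b in pairs]).numpy()
# clf = LogisticRegression(max_iter=1000).fit(X, labels)
```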
Progress in PPI prediction nonetheless remains slow. A panoply of models exploiting advances in deep learning have been proposed within the last decade, yet similarity-based methods such as PIPE and SPRINT, which predate them, remain competitive in rigorous benchmarks [26]. Progress is hindered in part by flawed methodology. For example, a surprisingly large number of authors fail to account for class imbalance and evaluate their models on artificially balanced test sets; Dunham and Ganapathiraju found that most published methods suffer a dramatic drop in performance when applied to realistic, imbalanced test sets [80]. In addition, the notion of pair difficulty (C1/C2/C3) introduced by Park and Marcotte [79], discussed earlier, is ignored by most groups. Overlooking it causes "data leakage", i.e., the sharing of information between the training set and the test set, which should be completely independent; data leakage in turn leads to performance overestimates, a widespread issue [26].
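The effect of evaluating on balanced versus realistic test sets is easy to demonstrate. In the simulation below, the score distributions are held fixed while the class ratio changes: the ROC AUC barely moves, but average precision collapses, which is why balanced test sets flatter PPI predictors. The score distributions and the 1:100 ratio are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

def simulate(n_pos: int, n_neg: int):
    # Same score distributions in both settings; only the class ratio changes.
    scores = np.concatenate([rng.normal(1.0, 1.0, n_pos),    # positives
                             rng.normal(0.0, 1.0, n_neg)])   # negatives
    labels = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)

print(simulate(1_000, 1_000))    # balanced: AUC ~0.76, AP ~0.76
print(simulate(1_000, 100_000))  # 1:100 imbalance: AUC ~0.76, AP collapses
```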
Issues surrounding model evaluation could be mitigated through the creation and maintenance of standardized benchmark test sets of carefully curated PPIs withheld from public databases, as was done in the protein structure prediction community with the Critical Assessment of Structure Prediction (CASP). Standardized benchmark datasets are also customary in many machine-learning subdisciplines, such as natural language processing and computer vision. One or more benchmark test sets should be curated and updated at regular intervals so that they (i) account for Park and Marcotte's notion of pair difficulty, (ii) control for sequence identity or alignment scores between protein pairs in the test set and publicly available interacting pairs, and (iii) ensure that models are tested at a uniform, realistic imbalance ratio updated as new evidence accrues.
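As a concrete starting point for such curation, the snippet below partitions candidate test pairs into Park and Marcotte's difficulty classes. It assumes simple protein identifiers and deliberately omits the sequence-identity filtering and imbalance control a real benchmark would also need.

```python
def partition_pairs(test_pairs, train_pairs):
    # C1: both test proteins occur in training pairs; C2: exactly one does;
    # C3: neither does (the hardest, most realistic setting).
    train_proteins = {p for pair in train_pairs for p in pair}
    classes = {"C1": [], "C2": [], "C3": []}
    for a, b in test_pairs:
        seen = (a in train_proteins) + (b in train_proteins)
        classes[{2: "C1", 1: "C2", 0: "C3"}[seen]].append((a, b))
    return classes
```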
In contrast with structure-based methods, sequence-based PPI predictors do not treat proteins as static structures and do not rely on potentially inaccurate predicted structures or on scarce, high-quality experimentally determined structures. Not only do they facilitate the identification of actionable drug targets within protein interaction networks, but they can also be used to design therapeutic biologics such as peptide binders and antibodies. Interest in biologics as a treatment modality is growing, as they have been found more likely to succeed in clinical trials than small molecules [187,188,189]. In a 2021 study [189], Yamaguchi et al. examined 3999 compounds assessed in clinical trials between 2000 and 2010 and found that biologics other than monoclonal antibodies had the highest approval success rate of all drug modalities (15.2%; 495 candidates), compared with 13.0% for small molecules (3086 candidates).
The development and dissemination of these tools carry significant ethical implications related to their misuse. In a recent correspondence, Wang et al. identified structure prediction methods and genome foundation models as posing a significant threat to global biosecurity [190]. Given that sequence-based PPI predictors require few computing resources, they could in theory be exploited by individuals or groups with sufficient know-how to design toxins or enhance viral virulence, turning these accessible and inexpensive predictors into bioweapon production tools. Calls for safeguards have multiplied with the proliferation of artificial intelligence (AI) technologies within biomedicine [190,191,192,193]. Regulatory frameworks, preparedness plans, and significant investments by governments and industry will be required to minimize and respond to these threats on a global scale. OpenAI, the leading AI company behind ChatGPT, recognized the significance of this risk and has presented the safeguards it plans to implement against the misuse of AI for bioweapon development and bioterrorism [194]. Academic researchers developing tools such as those described in this review will likewise be required to contribute to the collective effort to maintain global biosecurity.
Sequence-based PPI predictors hold great promise, and it will be interesting to see whether academia and the biotechnology industry choose to invest in and capitalize on them.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/cells14181449/s1, Table S1: Summary of sequence-based PPI predictors published in the last decade.

Author Contributions

F.C. conducted the literature review and drafted the manuscript. K.K.B. and J.R.G. acquired resources, reviewed, and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants awarded to K.K.B. (RGPIN-2023-04651) and J.R.G. (RGPIN-2021-04184); NSERC also supported François Charih's early doctoral studies through a PGS-D scholarship (PGSD3-534359-2019).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created during the preparation of this review.

Conflicts of Interest

Authors François Charih, James R. Green and Kyle K. Biggar were employed by the company NuvoBio Corporation.

References

  1. Elhabashy, H.; Merino, F.; Alva, V.; Kohlbacher, O.; Lupas, A.N. Exploring Protein-Protein Interactions at the Proteome Level. Structure 2022, 30, 462–475. [Google Scholar] [CrossRef]
  2. Nooren, I.M.A.; Thornton, J.M. Diversity of Protein–Protein Interactions. EMBO J. 2003, 22, 3486–3492. [Google Scholar] [CrossRef]
  3. Friedhoff, P.; Li, P.; Gotthardt, J. Protein-Protein Interactions in DNA Mismatch Repair. DNA Repair 2016, 38, 50–57. [Google Scholar] [CrossRef]
  4. Guarracino, D.A.; Bullock, B.N.; Arora, P.S. Protein-Protein Interactions in Transcription: A Fertile Ground for Helix Mimetics. Biopolymers 2011, 95, 1–7. [Google Scholar] [CrossRef]
  5. Jia, X.; He, X.; Huang, C.; Li, J.; Dong, Z.; Liu, K. Protein Translation: Biological Processes and Therapeutic Strategies for Human Diseases. Signal Transduct. Target. Ther. 2024, 9, 44. [Google Scholar] [CrossRef]
  6. Westermarck, J.; Ivaska, J.; Corthals, G.L. Identification of Protein Interactions Involved in Cellular Signaling. Mol. Cell. Proteomics MCP 2013, 12, 1752–1763. [Google Scholar] [CrossRef]
  7. Buchner, J. Molecular Chaperones and Protein Quality Control: An Introduction to the JBC Reviews Thematic Series. J. Biol. Chem. 2019, 294, 2074–2075. [Google Scholar] [CrossRef] [PubMed]
  8. Gonzalez, M.W.; Kann, M.G. Chapter 4: Protein Interactions and Disease. PLoS Comput. Biol. 2012, 8, e1002819. [Google Scholar] [CrossRef]
  9. Cheng, F.; Zhao, J.; Wang, Y.; Lu, W.; Liu, Z.; Zhou, Y.; Martin, W.R.; Wang, R.; Huang, J.; Hao, T.; et al. Comprehensive Characterization of Protein–Protein Interactions Perturbed by Disease Mutations. Nat. Genet. 2021, 53, 342–353. [Google Scholar] [CrossRef] [PubMed]
  10. Greenblatt, J.F.; Alberts, B.M.; Krogan, N.J. Discovery and Significance of Protein-Protein Interactions in Health and Disease. Cell 2024, 187, 6501–6517. [Google Scholar] [CrossRef] [PubMed]
  11. Knopman, D.S.; Amieva, H.; Petersen, R.C.; Chételat, G.; Holtzman, D.M.; Hyman, B.T.; Nixon, R.A.; Jones, D.T. Alzheimer Disease. Nat. Rev. Dis. Primers 2021, 7, 33. [Google Scholar] [CrossRef]
  12. Janssens, J.; Van Broeckhoven, C. Pathological Mechanisms Underlying TDP-43 Driven Neurodegeneration in FTLD–ALS Spectrum Disorders. Hum. Mol. Genet. 2013, 22, R77–R87. [Google Scholar] [CrossRef]
  13. Bloem, B.R.; Okun, M.S.; Klein, C. Parkinson’s Disease. Lancet 2021, 397, 2284–2303. [Google Scholar] [CrossRef]
  14. Huang, L.; Guo, Z.; Wang, F.; Fu, L. KRAS Mutation: From Undruggable to Druggable in Cancer. Signal Transduct. Target. Ther. 2021, 6, 386. [Google Scholar] [CrossRef] [PubMed]
  15. Tabar, M.S.; Parsania, C.; Chen, H.; Su, X.-D.; Bailey, C.G.; Rasko, J.E.J. Illuminating the Dark Protein-Protein Interactome. Cell Rep. Methods 2022, 2, 100275. [Google Scholar] [CrossRef] [PubMed]
  16. Kim, M.; Park, J.; Bouhaddou, M.; Kim, K.; Rojc, A.; Modak, M.; Soucheray, M.; McGregor, M.J.; O’Leary, P.; Wolf, D.; et al. A Protein Interaction Landscape of Breast Cancer. Science 2021, 374, eabf3066. [Google Scholar] [CrossRef] [PubMed]
  17. Fu, H.; Mo, X.; Ivanov, A.A. Decoding the Functional Impact of the Cancer Genome through Protein–Protein Interactions. Nat. Rev. Cancer 2025, 25, 189–208. [Google Scholar] [CrossRef]
  18. Dunham, W.H.; Mullin, M.; Gingras, A. Affinity-purification Coupled to Mass Spectrometry: Basic Principles and Strategies. Proteomics 2012, 12, 1576–1590. [Google Scholar] [CrossRef]
  19. Sidhu, S.S.; Fairbrother, W.J.; Deshayes, K. Exploring Protein–Protein Interactions with Phage Display. ChemBioChem 2003, 4, 14–25. [Google Scholar] [CrossRef]
  20. Zhou, M.; Li, Q.; Wang, R. Current Experimental Methods for Characterizing Protein–Protein Interactions. ChemMedChem 2016, 11, 738–756. [Google Scholar] [CrossRef]
  21. Brückner, A.; Polge, C.; Lentze, N.; Auerbach, D.; Schlattner, U. Yeast Two-Hybrid, a Powerful Tool for Systems Biology. Int. J. Mol. Sci. 2009, 10, 2763–2788. [Google Scholar] [CrossRef]
  22. Akbarzadeh, S.; Coşkun, Ö.; Günçer, B. Studying Protein–Protein Interactions: Latest and Most Popular Approaches. J. Struct. Biol. 2024, 216, 108118. [Google Scholar] [CrossRef]
  23. Pitre, S.; Alamgir, M.; Green, J.R.; Dumontier, M.; Dehne, F.; Golshani, A. Computational Methods For Predicting Protein–Protein Interactions. In Protein–Protein Interaction; Werther, M., Seitz, H., Eds.; Advances in Biochemical Engineering/Biotechnology; Springer: Berlin/Heidelberg, Germany, 2008; Volume 110, pp. 247–267. ISBN 978-3-540-68817-4. [Google Scholar]
  24. Li, Y.; Ilie, L. SPRINT: Ultrafast Protein-Protein Interaction Prediction of the Entire Human Interactome. BMC Bioinform. 2017, 18, 485. [Google Scholar] [CrossRef]
  25. Dick, K.; Samanfar, B.; Barnes, B.; Cober, E.R.; Mimee, B.; Tan, L.H.; Molnar, S.J.; Biggar, K.K.; Golshani, A.; Dehne, F.; et al. PIPE4: Fast PPI Predictor for Comprehensive Inter- and Cross-Species Interactomes. Sci. Rep. 2020, 10, 1390. [Google Scholar] [CrossRef]
  26. Bernett, J.; Blumenthal, D.B.; List, M. Cracking the Black Box of Deep Sequence-Based Protein–Protein Interaction Prediction. Brief. Bioinform. 2024, 25, bbae076. [Google Scholar] [CrossRef] [PubMed]
  27. Andrei, S.A.; Sijbesma, E.; Hann, M.; Davis, J.; O’Mahony, G.; Perry, M.W.D.; Karawajczyk, A.; Eickhoff, J.; Brunsveld, L.; Doveston, R.G.; et al. Stabilization of Protein-Protein Interactions in Drug Discovery. Expert. Opin. Drug Discov. 2017, 12, 925–940. [Google Scholar] [CrossRef] [PubMed]
  28. Macalino, S.J.Y.; Basith, S.; Clavio, N.A.B.; Chang, H.; Kang, S.; Choi, S. Evolution of In Silico Strategies for Protein-Protein Interaction Drug Discovery. Molecules 2018, 23, 1963. [Google Scholar] [CrossRef]
  29. Wang, X.; Ni, D.; Liu, Y.; Lu, S. Rational Design of Peptide-Based Inhibitors Disrupting Protein-Protein Interactions. Front. Chem. 2021, 9, 682675. [Google Scholar] [CrossRef] [PubMed]
  30. wwPDB consortium. Protein Data Bank: The Single Global Archive for 3D Macromolecular Structure Data. Nucleic Acids Res. 2019, 47, D520–D528. [Google Scholar] [CrossRef]
  31. Oughtred, R.; Rust, J.; Chang, C.; Breitkreutz, B.-J.; Stark, C.; Willems, A.; Boucher, L.; Leung, G.; Kolas, N.; Zhang, F.; et al. The BioGRID Database: A Comprehensive Biomedical Resource of Curated Protein, Genetic, and Chemical Interactions. Protein Sci. 2021, 30, 187–200. [Google Scholar] [CrossRef]
  32. Rose, P.W.; Prlić, A.; Altunkaya, A.; Bi, C.; Bradley, A.R.; Christie, C.H.; Costanzo, L.D.; Duarte, J.M.; Dutta, S.; Feng, Z.; et al. The RCSB Protein Data Bank: Integrative View of Protein, Gene and 3D Structural Information. Nucleic Acids Res. 2017, 45, D271–D281. [Google Scholar] [CrossRef]
  33. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly Accurate Protein Structure Prediction with AlphaFold. Nature 2021, 596, 583–589. [Google Scholar] [CrossRef]
  34. Abramson, J.; Adler, J.; Dunger, J.; Evans, R.; Green, T.; Pritzel, A.; Ronneberger, O.; Willmore, L.; Ballard, A.J.; Bambrick, J.; et al. Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3. Nature 2024, 630, 493–500. [Google Scholar] [CrossRef]
  35. Lin, Z.; Akin, H.; Rao, R.; Hie, B.; Zhu, Z.; Lu, W.; Smetanin, N.; Verkuil, R.; Kabeli, O.; Shmueli, Y.; et al. Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model. Science 2023, 379, 1123–1130. [Google Scholar] [CrossRef]
  36. Chai Discovery; Boitreaud, J.; Dent, J.; McPartlon, M.; Meier, J.; Reis, V.; Rogozhnikov, A.; Wu, K. Chai-1: Decoding the Molecular Interactions of Life. bioRxiv 2024. [Google Scholar] [CrossRef]
  37. Wohlwend, J.; Corso, G.; Passaro, S.; Reveiz, M.; Leidal, K.; Swiderski, W.; Portnoi, T.; Chinn, I.; Silterra, J.; Jaakkola, T.; et al. Boltz-1 Democratizing Biomolecular Interaction Modeling. bioRxiv 2024. [Google Scholar] [CrossRef]
  38. Passaro, S.; Corso, G.; Wohlwend, J.; Reveiz, M.; Thaler, S.; Somnath, V.R.; Getz, N.; Portnoi, T.; Roy, J.; Stark, H.; et al. Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction. bioRxiv 2025. [Google Scholar] [CrossRef] [PubMed]
  39. Terwilliger, T.C.; Liebschner, D.; Croll, T.I.; Williams, C.J.; McCoy, A.J.; Poon, B.K.; Afonine, P.V.; Oeffner, R.D.; Richardson, J.S.; Read, R.J.; et al. AlphaFold Predictions Are Valuable Hypotheses and Accelerate but Do Not Replace Experimental Structure Determination. Nat. Methods 2023, 21, 110–116. [Google Scholar] [CrossRef]
  40. Verburgt, J.; Zhang, Z.; Kihara, D. Multi-Level Analysis of Intrinsically Disordered Protein Docking Methods. Methods 2022, 204, 55–63. [Google Scholar] [CrossRef]
  41. Kibar, G.; Vingron, M. Prediction of Protein–Protein Interactions Using Sequences of Intrinsically Disordered Regions. Proteins Struct. Funct. Bioinforma. 2023, 91, 980–990. [Google Scholar] [CrossRef] [PubMed]
  42. Lee, C.Y.; Hubrich, D.; Varga, J.K.; Schäfer, C.; Welzel, M.; Schumbera, E.; Djokic, M.; Strom, J.M.; Schönfeld, J.; Geist, J.L.; et al. Systematic Discovery of Protein Interaction Interfaces Using AlphaFold and Experimental Validation. Mol. Syst. Biol. 2024, 20, 75–97. [Google Scholar] [CrossRef] [PubMed]
  43. Orand, T.; Jensen, M.R. Binding Mechanisms of Intrinsically Disordered Proteins: Insights from Experimental Studies and Structural Predictions. Curr. Opin. Struct. Biol. 2025, 90, 102958. [Google Scholar] [CrossRef] [PubMed]
  44. Luppino, F.; Lenz, S.; Chow, C.F.W.; Toth-Petroczy, A. Deep Learning Tools Predict Variants in Disordered Regions with Lower Sensitivity. BMC Genom. 2025, 26, 367. [Google Scholar] [CrossRef] [PubMed]
  45. Yuan, R.; Zhang, J.; Zhou, J.; Cong, Q. Recent Progress and Future Challenges in Structure-Based Protein-Protein Interaction Prediction. Mol. Ther. 2025, 33, 2252–2268. [Google Scholar] [CrossRef]
  46. Raisinghani, N.; Parikh, V.; Foley, B.; Verkhivker, G. Assessing Structures and Conformational Ensembles of Apo and Holo Protein States Using Randomized Alanine Sequence Scanning Combined with Shallow Subsampling in AlphaFold2: Insights and Lessons from Predictions of Functional Allosteric Conformations. bioRxiv 2024. [Google Scholar] [CrossRef]
  47. Chen, L.T.; Quinn, Z.; Dumas, M.; Peng, C.; Hong, L.; Lopez-Gonzalez, M.; Mestre, A.; Watson, R.; Vincoff, S.; Zhao, L.; et al. Target Sequence-Conditioned Design of Peptide Binders Using Masked Language Modeling. Nat. Biotechnol. 2025, 1–13. [Google Scholar] [CrossRef]
  48. Watson, J.L.; Juergens, D.; Bennett, N.R.; Trippe, B.L.; Yim, J.; Eisenach, H.E.; Ahern, W.; Borst, A.J.; Ragotte, R.J.; Milles, L.F.; et al. De Novo Design of Protein Structure and Function with RFdiffusion. Nature 2023, 620, 1089–1100. [Google Scholar] [CrossRef]
  49. Mitchell, T. Machine Learning; McGraw-Hill Series in Computer Science; McGraw-Hill Professional: New York, NY, USA, 1997. [Google Scholar]
  50. Yugandhar, K.; Gromiha, M.M. Protein–Protein Binding Affinity Prediction from Amino Acid Sequence. Bioinformatics 2014, 30, 3583–3589. [Google Scholar] [CrossRef]
  51. Abbasi, W.A.; Yaseen, A.; Hassan, F.U.; Andleeb, S.; Minhas, F.U.A.A. ISLAND: In-Silico Proteins Binding Affinity Prediction Using Sequence Information. BioData Min. 2020, 13, 20. [Google Scholar] [CrossRef]
  52. Guo, Z.; Yamaguchi, R. Machine Learning Methods for Protein-Protein Binding Affinity Prediction in Protein Design. Front. Bioinform. 2022, 2, 1065703. [Google Scholar] [CrossRef]
  53. Romero-Molina, S.; Ruiz-Blanco, Y.B.; Mieres-Perez, J.; Harms, M.; Münch, J.; Ehrmann, M.; Sanchez-Garcia, E. PPI-Affinity: A Web Tool for the Prediction and Optimization of Protein–Peptide and Protein–Protein Binding Affinity. J. Proteome Res. 2022, 21, 1829–1841. [Google Scholar] [CrossRef]
  54. Ofran, Y.; Rost, B. Predicted Protein–Protein Interaction Sites from Local Sequence Information. FEBS Lett. 2003, 544, 236–239. [Google Scholar] [CrossRef]
  55. Ezkurdia, I.; Bartoli, L.; Fariselli, P.; Casadio, R.; Valencia, A.; Tress, M.L. Progress and Challenges in Predicting Protein–Protein Interaction Sites. Brief. Bioinform. 2009, 10, 233–246. [Google Scholar] [CrossRef]
  56. Zhang, J.; Kurgan, L. Review and Comparative Assessment of Sequence-Based Predictors of Protein-Binding Residues. Brief. Bioinform. 2018, 19, 821–837. [Google Scholar] [CrossRef] [PubMed]
  57. Li, Y.; Golding, G.B.; Ilie, L. DELPHI: Accurate Deep Ensemble Model for Protein Interaction Sites Prediction. Bioinformatics 2021, 37, 896–904. [Google Scholar] [CrossRef]
  58. Szklarczyk, D.; Kirsch, R.; Koutrouli, M.; Nastou, K.; Mehryary, F.; Hachilif, R.; Gable, A.L.; Fang, T.; Doncheva, N.T.; Pyysalo, S.; et al. The STRING Database in 2023: Protein-Protein Association Networks and Functional Enrichment Analyses for Any Sequenced Genome of Interest. Nucleic Acids Res. 2023, 51, D638–D646. [Google Scholar] [CrossRef]
  59. del Toro, N.; Shrivastava, A.; Ragueneau, E.; Meldal, B.; Combe, C.; Barrera, E.; Perfetto, L.; How, K.; Ratan, P.; Shirodkar, G.; et al. The IntAct Database: Efficient Access to Fine-Grained Molecular Interaction Data. Nucleic Acids Res. 2022, 50, D648–D653. [Google Scholar] [CrossRef] [PubMed]
  60. Licata, L.; Briganti, L.; Peluso, D.; Perfetto, L.; Iannuccelli, M.; Galeota, E.; Sacco, F.; Palma, A.; Nardozza, A.P.; Santonico, E.; et al. MINT, the Molecular Interaction Database: 2012 Update. Nucleic Acids Res. 2012, 40, D857–D861. [Google Scholar] [CrossRef]
  61. Martins, P.; Mariano, D.; Carvalho, F.C.; Bastos, L.L.; Moraes, L.; Paixão, V.; Cardoso de Melo-Minardi, R. Propedia v2.3: A Novel Representation Approach for the Peptide-Protein Interaction Database Using Graph-Based Structural Signatures. Front. Bioinform. 2023, 3, 1103103. [Google Scholar] [CrossRef] [PubMed]
  62. Guo, Y.; Yu, L.; Wen, Z.; Li, M. Using Support Vector Machine Combined with Auto Covariance to Predict Protein–Protein Interactions from Protein Sequences. Nucleic Acids Res. 2008, 36, 3025–3030. [Google Scholar] [CrossRef]
  63. Ben-Hur, A.; Noble, W.S. Choosing Negative Examples for the Prediction of Protein-Protein Interactions. BMC Bioinform. 2006, 7, S2. [Google Scholar] [CrossRef]
  64. Romero-Molina, S.; Ruiz-Blanco, Y.B.; Harms, M.; Münch, J.; Sanchez-Garcia, E. PPI-Detect: A Support Vector Machine Model for Sequence-Based Prediction of Protein-Protein Interactions: PPI-Detect: A Support Vector Machine Model for Sequence-Based Prediction of Protein-Protein Interactions. J. Comput. Chem. 2019, 40, 1233–1242. [Google Scholar] [CrossRef]
  65. Smialowski, P.; Pagel, P.; Wong, P.; Brauner, B.; Dunger, I.; Fobo, G.; Frishman, G.; Montrone, C.; Rattei, T.; Frishman, D.; et al. The Negatome Database: A Reference Set of Non-Interacting Protein Pairs. Nucleic Acids Res. 2010, 38, D540–D544. [Google Scholar] [CrossRef] [PubMed]
  66. Blohm, P.; Frishman, G.; Smialowski, P.; Goebels, F.; Wachinger, B.; Ruepp, A.; Frishman, D. Negatome 2.0: A Database of Non-Interacting Proteins Derived by Literature Mining, Manual Annotation and Protein Structure Analysis. Nucleic Acids Res. 2014, 42, D396–D400. [Google Scholar] [CrossRef]
  67. The UniProt Consortium. UniProt: The Universal Protein Knowledgebase in 2025. Nucleic Acids Res. 2025, 53, D609–D617. [Google Scholar] [CrossRef] [PubMed]
  68. Schaefer, M.H.; Serrano, L.; Andrade-Navarro, M.A. Correcting for the Study Bias Associated with Protein–Protein Interaction Measurements Reveals Differences between Protein Degree Distributions from Different Cancer Types. Front. Genet. 2015, 6, 260. [Google Scholar] [CrossRef]
  69. Luck, K.; Kim, D.-K.; Lambourne, L.; Spirohn, K.; Begg, B.E.; Bian, W.; Brignall, R.; Cafarelli, T.; Campos-Laborie, F.J.; Charloteaux, B.; et al. A Reference Map of the Human Binary Protein Interactome. Nature 2020, 580, 402–408. [Google Scholar] [CrossRef] [PubMed]
  70. Sikic, K.; Carugo, O. Protein Sequence Redundancy Reduction: Comparison of Various Method. Bioinformation 2010, 5, 234–239. [Google Scholar] [CrossRef]
  71. Fu, L.; Niu, B.; Zhu, Z.; Wu, S.; Li, W. CD-HIT: Accelerated for Clustering the next-Generation Sequencing Data. Bioinformatics 2012, 28, 3150–3152. [Google Scholar] [CrossRef]
  72. Chen, M.; Ju, C.J.-T.; Zhou, G.; Chen, X.; Zhang, T.; Chang, K.-W.; Zaniolo, C.; Wang, W. Multifaceted Protein–Protein Interaction Prediction Based on Siamese Residual RCNN. Bioinformatics 2019, 35, i305–i314. [Google Scholar] [CrossRef]
  73. Sledzieski, S.; Singh, R.; Cowen, L.; Berger, B. D-SCRIPT Translates Genome to Phenome with Sequence-Based, Structure-Aware, Genome-Scale Predictions of Protein-Protein Interactions. Cell Syst. 2021, 12, 969–982.e6. [Google Scholar] [CrossRef]
  74. Yu, B.; Chen, C.; Zhou, H.; Liu, B.; Ma, Q. GTB-PPI: Predict Protein–Protein Interactions Based on L1-Regularized Logistic Regression and Gradient Tree Boosting. Genom. Proteom. Bioinform. 2020, 18, 582–592. [Google Scholar] [CrossRef]
  75. Steinegger, M.; Söding, J. MMseqs2 Enables Sensitive Protein Sequence Searching for the Analysis of Massive Data Sets. Nat. Biotechnol. 2017, 35, 1026–1028. [Google Scholar] [CrossRef]
  76. Chen, C.; Zhang, Q.; Yu, B.; Yu, Z.; Lawrence, P.J.; Ma, Q.; Zhang, Y. Improving Protein-Protein Interactions Prediction Accuracy Using XGBoost Feature Selection and Stacked Ensemble Classifier. Comput. Biol. Med. 2020, 123, 103899. [Google Scholar] [CrossRef]
  77. Liu, D.; Young, F.; Lamb, K.D.; Quiros, A.C.; Pancheva, A.; Miller, C.; Macdonald, C.; Robertson, D.L.; Yuan, K. PLM-Interact: Extending Protein Language Models to Predict Protein-Protein Interactions. bioRxiv 2024. [Google Scholar] [CrossRef]
  78. Zheng, X.; Du, H.; Xu, F.; Li, J.; Liu, Z.; Wang, W.; Chen, T.; Ouyang, W.; Li, S.Z.; Lu, Y.; et al. PRING: Rethinking Protein-Protein Interaction Prediction from Pairs to Graphs. arXiv 2025, arXiv:2507.05101. [Google Scholar] [CrossRef]
  79. Park, Y.; Marcotte, E.M. Flaws in Evaluation Schemes for Pair-Input Computational Predictions. Nat. Methods 2012, 9, 1134–1136. [Google Scholar] [CrossRef]
  80. Dunham, B.; Ganapathiraju, M.K. Benchmark Evaluation of Protein–Protein Interaction Prediction Algorithms. Molecules 2021, 27, 41. [Google Scholar] [CrossRef] [PubMed]
  81. Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification, 2nd ed.; Wiley: New York, NY, USA, 2001; ISBN 978-0-471-05669-0. [Google Scholar]
  82. Bishop, C.M. Pattern Recognition and Machine Learning; Information Science and Statistics; Springer: New York, NY, USA, 2006; ISBN 978-0-387-31073-2. [Google Scholar]
  83. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; Adaptive Computation and Machine Learning; The MIT Press: Cambridge, MA, USA, 2016; ISBN 978-0-262-03561-3. [Google Scholar]
  84. Dick, K.; Chopra, A.; Biggar, K.K.; Green, J.R. Multi-Schema Computational Prediction of the Comprehensive SARS-CoV-2 vs. Human Interactome. PeerJ 2021, 9, e11117. [Google Scholar] [CrossRef]
  85. Yang, X.; Yang, S.; Li, Q.; Wuchty, S.; Zhang, Z. Prediction of Human-Virus Protein-Protein Interactions through a Sequence Embedding-Based Machine Learning Method. Comput. Struct. Biotechnol. J. 2020, 18, 153–161. [Google Scholar] [CrossRef] [PubMed]
  86. Tsukiyama, S.; Hasan, M.M.; Fujii, S.; Kurata, H. LSTM-PHV: Prediction of Human-Virus Protein–Protein Interactions by LSTM with Word2vec. Brief. Bioinform. 2021, 22, bbab228. [Google Scholar] [CrossRef] [PubMed]
  87. Dong, T.N.; Brogden, G.; Gerold, G.; Khosla, M. A Multitask Transfer Learning Framework for the Prediction of Virus-Human Protein–Protein Interactions. BMC Bioinform. 2021, 22, 572. [Google Scholar] [CrossRef]
  88. Nissan, N.; Hooker, J.; Arezza, E.; Dick, K.; Golshani, A.; Mimee, B.; Cober, E.; Green, J.; Samanfar, B. Large-Scale Data Mining Pipeline for Identifying Novel Soybean Genes Involved in Resistance against the Soybean Cyst Nematode. Front. Bioinform. 2023, 3, 1199675. [Google Scholar] [CrossRef] [PubMed]
  89. Barnes, B.; Karimloo, M.; Schoenrock, A.; Burnside, D.; Cassol, E.; Wong, A.; Dehne, F.; Golshani, A.; Green, J.R. Predicting Novel Protein-Protein Interactions between the HIV-1 Virus and Homo Sapiens. In Proceedings of the 2016 IEEE EMBS International Student Conference (ISC), Ottawa, ON, Canada, 29–31 May 2016; pp. 1–4. [Google Scholar]
  90. Singh, R.; Devkota, K.; Sledzieski, S.; Berger, B.; Cowen, L. Topsy-Turvy: Integrating a Global View into Sequence-Based PPI Prediction. Bioinformatics 2022, 38, i264–i272. [Google Scholar] [CrossRef]
  91. Szymborski, J.; Emad, A. INTREPPPID—An Orthologue-Informed Quintuplet Network for Cross-Species Prediction of Protein–Protein Interaction. Brief. Bioinform. 2024, 25, bbae405. [Google Scholar] [CrossRef]
  92. James, K.; Wipat, A.; Cockell, S.J. Expanding Interactome Analyses beyond Model Eukaryotes. Brief. Funct. Genom. 2022, 21, 243–269. [Google Scholar] [CrossRef] [PubMed]
  93. Volzhenin, K.; Bittner, L.; Carbone, A. SENSE-PPI Reconstructs Interactomes within, across, and between Species at the Genome Scale. iScience 2024, 27, 110371. [Google Scholar] [CrossRef]
  94. Ajila, V.; Colley, L.; Ste-Croix, D.T.; Nissan, N.; Cober, E.R.; Mimee, B.; Samanfar, B.; Green, J.R. Species-Specific microRNA Discovery and Target Prediction in the Soybean Cyst Nematode. Sci. Rep. 2023, 13, 17657. [Google Scholar] [CrossRef]
  95. Biggar, K.K.; Charih, F.; Liu, H.; Ruiz-Blanco, Y.B.; Stalker, L.; Chopra, A.; Connolly, J.; Adhikary, H.; Frensemier, K.; Hoekstra, M.; et al. Proteome-Wide Prediction of Lysine Methylation Leads to Identification of H2BK43 Methylation and Outlines the Potential Methyllysine Proteome. Cell Rep. 2020, 32, 107896. [Google Scholar] [CrossRef]
  96. Wang, G.; Vaisman, I.I.; van Hoek, M.L. Machine Learning Prediction of Antimicrobial Peptides. Methods Mol. Biol. Clifton NJ 2022, 2405, 1–37. [Google Scholar] [CrossRef]
  97. Brzezinski, D.; Minku, L.L.; Pewinski, T.; Stefanowski, J.; Szumaczuk, A. The Impact of Data Difficulty Factors on Classification of Imbalanced and Concept Drifting Data Streams. Knowl. Inf. Syst. 2021, 63, 1429–1469. [Google Scholar] [CrossRef]
  98. Liu, Y.; Li, Y.; Xie, D. Implications of Imbalanced Datasets for Empirical ROC-AUC Estimation in Binary Classification Tasks. J. Stat. Comput. Simul. 2024, 94, 183–203. [Google Scholar] [CrossRef]
  99. Richardson, E.; Trevizani, R.; Greenbaum, J.A.; Carter, H.; Nielsen, M.; Peters, B. The Receiver Operating Characteristic Curve Accurately Assesses Imbalanced Datasets. Patterns 2024, 5, 100994. [Google Scholar] [CrossRef]
  100. Langote, M.; Zade, N.; Gundewar, S. Addressing Data Imbalance in Machine Learning: Challenges and Approaches. In Proceedings of the 2025 6th International Conference on Mobile Computing and Sustainable Informatics (ICMCSI), Goathgaun, Nepal, 7–8 January 2025; pp. 1745–1749. [Google Scholar]
  101. Dick, K.; Green, J.R. Reciprocal Perspective for Improved Protein-Protein Interaction Prediction. Sci. Rep. 2018, 8, 11694. [Google Scholar] [CrossRef]
  102. Zheng, W.; Wuyun, Q.; Cheng, M.; Hu, G.; Zhang, Y. Two-Level Protein Methylation Prediction Using Structure Model-Based Features. Sci. Rep. 2020, 10, 6008. [Google Scholar] [CrossRef] [PubMed]
  103. Baranwal, M.; Magner, A.; Saldinger, J.; Turali-Emre, E.S.; Elvati, P.; Kozarekar, S.; VanEpps, J.S.; Kotov, N.A.; Violi, A.; Hero, A.O. Struct2Graph: A Graph Attention Network for Structure Based Predictions of Protein–Protein Interactions. BMC Bioinform. 2022, 23, 370. [Google Scholar] [CrossRef]
  104. Peace, R.J.; Biggar, K.K.; Storey, K.B.; Green, J.R. A Framework for Improving microRNA Prediction in Non-Human Genomes. Nucleic Acids Res. 2015, 43, e138. [Google Scholar] [CrossRef]
  105. Venkatesan, K.; Rual, J.-F.; Vazquez, A.; Stelzl, U.; Lemmens, I.; Hirozane-Kishikawa, T.; Hao, T.; Zenkner, M.; Xin, X.; Goh, K.-I.; et al. An Empirical Framework for Binary Interactome Mapping. Nat. Methods 2009, 6, 83–90. [Google Scholar] [CrossRef]
  106. Zhang, J.; Humphreys, I.R.; Pei, J.; Kim, J.; Choi, C.; Yuan, R.; Durham, J.; Liu, S.; Choi, H.-J.; Baek, M.; et al. Computing the Human Interactome. bioRxiv 2024. [Google Scholar] [CrossRef]
  107. Stumpf, M.P.H.; Thorne, T.; de Silva, E.; Stewart, R.; An, H.J.; Lappe, M.; Wiuf, C. Estimating the Size of the Human Interactome. Proc. Natl. Acad. Sci. USA 2008, 105, 6959–6964. [Google Scholar] [CrossRef] [PubMed]
  108. Vidal, M. How Much of the Human Protein Interactome Remains to Be Mapped? Sci. Signal. 2016, 9, eg7. [Google Scholar] [CrossRef]
  109. Rolland, T.; Taşan, M.; Charloteaux, B.; Pevzner, S.J.; Zhong, Q.; Sahni, N.; Yi, S.; Lemmens, I.; Fontanillo, C.; Mosca, R.; et al. A Proteome-Scale Map of the Human Interactome Network. Cell 2014, 159, 1212–1226. [Google Scholar] [CrossRef]
  110. Schoenrock, A.; Samanfar, B.; Pitre, S.; Hooshyar, M.; Jin, K.; Phillips, C.A.; Wang, H.; Phanse, S.; Omidi, K.; Gui, Y.; et al. Efficient Prediction of Human Protein-Protein Interactions at a Global Scale. BMC Bioinform. 2014, 15, 383. [Google Scholar] [CrossRef]
  111. Chou, K.-C. Prediction of Protein Cellular Attributes Using Pseudo-Amino Acid Composition. Proteins Struct. Funct. Bioinform. 2001, 43, 246–255. [Google Scholar] [CrossRef] [PubMed]
  112. Shen, J.; Zhang, J.; Luo, X.; Zhu, W.; Yu, K.; Chen, K.; Li, Y.; Jiang, H. Predicting Protein–Protein Interactions Based Only on Sequences Information. Proc. Natl. Acad. Sci. USA 2007, 104, 4337–4341. [Google Scholar] [CrossRef] [PubMed]
  113. Govindan, G.; Nair, A.S. Composition, Transition and Distribution (CTD)—A Dynamic Feature for Predictions Based on Hierarchical Structure of Cellular Sorting. In Proceedings of the 2011 Annual IEEE India Conference, Hyderabad, India, 16–18 December 2011; pp. 1–6. [Google Scholar]
  114. Altschul, S.F.; Madden, T.L.; Schäffer, A.A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D.J. Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs. Nucleic Acids Res. 1997, 25, 3389–3402. [Google Scholar] [CrossRef]
  115. Ruiz-Blanco, Y.B.; Paz, W.; Green, J.; Marrero-Ponce, Y. ProtDCal: A Program to Compute General-Purpose-Numerical Descriptors for Sequences and 3D-Structures of Proteins. BMC Bioinform. 2015, 16, 162. [Google Scholar] [CrossRef]
  116. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781. [Google Scholar] [CrossRef]
  117. Yao, Y.; Du, X.; Diao, Y.; Zhu, H. An Integration of Deep Learning with Feature Embedding for Protein–Protein Interaction Prediction. PeerJ 2019, 7, e7126. [Google Scholar] [CrossRef] [PubMed]
  118. Hashemifar, S.; Neyshabur, B.; Khan, A.A.; Xu, J. Predicting Protein–Protein Interactions through Sequence-Based Deep Learning. Bioinformatics 2018, 34, i802–i810. [Google Scholar] [CrossRef]
  119. Hu, X.; Feng, C.; Zhou, Y.; Harrison, A.; Chen, M. DeepTrio: A Ternary Prediction System for Protein–Protein Interaction Using Mask Multiple Parallel Convolutional Neural Networks. Bioinformatics 2022, 38, 694–702. [Google Scholar] [CrossRef]
  120. Soleymani, F.; Paquet, E.; Viktor, H.L.; Michalowski, W.; Spinello, D. ProtInteract: A Deep Learning Framework for Predicting Protein–Protein Interactions. Comput. Struct. Biotechnol. J. 2023, 21, 1324–1348. [Google Scholar] [CrossRef]
  121. Hu, J.; Li, Z.; Rao, B.; Thafar, M.A.; Arif, M. Improving Protein-Protein Interaction Prediction Using Protein Language Model and Protein Network Features. Anal. Biochem. 2024, 693, 115550. [Google Scholar] [CrossRef]
  122. Dang, T.H.; Vu, T.A. xCAPT5: Protein–Protein Interaction Prediction Using Deep and Wide Multi-Kernel Pooling Convolutional Neural Networks with Protein Language Model. BMC Bioinform. 2024, 25, 106. [Google Scholar] [CrossRef] [PubMed]
  123. Eid, F.-E.; ElHefnawi, M.; Heath, L.S. DeNovo: Virus-Host Sequence-Based Protein–Protein Interaction Prediction. Bioinformatics 2016, 32, 1144–1150. [Google Scholar] [CrossRef] [PubMed]
  124. Li, X.; Han, P.; Wang, G.; Chen, W.; Wang, S.; Song, T. SDNN-PPI: Self-Attention with Deep Neural Network Effect on Protein-Protein Interaction Prediction. BMC Genom. 2022, 23, 474. [Google Scholar] [CrossRef] [PubMed]
  125. Gao, H.; Chen, C.; Li, S.; Wang, C.; Zhou, W.; Yu, B. Prediction of Protein-Protein Interactions Based on Ensemble Residual Convolutional Neural Network. Comput. Biol. Med. 2023, 152, 106471. [Google Scholar] [CrossRef]
  126. Yang, K.K.; Fusi, N.; Lu, A.X. Convolutions Are Competitive with Transformers for Protein Sequence Pretraining. Cell Syst. 2024, 15, 286–294.e2. [Google Scholar] [CrossRef]
  127. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
  128. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774. [Google Scholar] [CrossRef]
  129. Team, G.; Georgiev, P.; Lei, V.I.; Burnell, R.; Bai, L.; Gulati, A.; Tanzer, G.; Vincent, D.; Pan, Z.; Wang, S.; et al. Gemini 1.5: Unlocking Multimodal Understanding across Millions of Tokens of Context. arXiv 2024, arXiv:2403.05530. [Google Scholar] [CrossRef]
  130. Krogh, A.; Brown, M.; Mian, I.S.; Sjölander, K.; Haussler, D. Hidden Markov Models in Computational Biology: Applications to Protein Modeling. J. Mol. Biol. 1994, 235, 1501–1531. [Google Scholar] [CrossRef]
  131. Martelli, P.L.; Fariselli, P.; Krogh, A.; Casadio, R. A Sequence-Profile-Based HMM for Predicting and Discriminating β Barrel Membrane Proteins. Bioinformatics 2002, 18, S46–S53. [Google Scholar] [CrossRef]
  132. Söding, J. Protein Homology Detection by HMM–HMM Comparison. Bioinformatics 2005, 21, 951–960. [Google Scholar] [CrossRef]
  133. Bepler, T.; Berger, B. Learning the Protein Language: Evolution, Structure, and Function. Cell Syst. 2021, 12, 654–669.e3. [Google Scholar] [CrossRef]
  134. Suzek, B.E.; Wang, Y.; Huang, H.; McGarvey, P.B.; Wu, C.H.; The UniProt Consortium. UniRef Clusters: A Comprehensive and Scalable Alternative for Improving Sequence Similarity Searches. Bioinformatics 2015, 31, 926–932. [Google Scholar] [CrossRef]
  135. Steinegger, M.; Söding, J. Clustering Huge Protein Sequence Sets in Linear Time. Nat. Commun. 2018, 9, 2542. [Google Scholar] [CrossRef]
  136. Medina-Ortiz, D.; Contreras, S.; Fernández, D.; Soto-García, N.; Moya, I.; Cabas-Mora, G.; Olivera-Nappa, Á. Protein Language Models and Machine Learning Facilitate the Identification of Antimicrobial Peptides. Int. J. Mol. Sci. 2024, 25, 8851. [Google Scholar] [CrossRef] [PubMed]
  137. Zhang, L.; Xiong, S.; Xu, L.; Liang, J.; Zhao, X.; Zhang, H.; Tan, X. Leveraging Protein Language Models for Robust Antimicrobial Peptide Detection. Methods 2025, 238, 19–26. [Google Scholar] [CrossRef] [PubMed]
  138. Elnaggar, A.; Heinzinger, M.; Dallago, C.; Rehawi, G.; Wang, Y.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Steinegger, M.; et al. ProtTrans: Towards Cracking the Language of Lifes Code Through Self-Supervised Deep Learning and High Performance Computing. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7112–7127. [Google Scholar] [CrossRef] [PubMed]
  139. Elnaggar, A.; Essam, H.; Salah-Eldin, W.; Moustafa, W.; Elkerdawy, M.; Rochereau, C.; Rost, B. Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling. bioRxiv 2023. [Google Scholar] [CrossRef]
  140. Ko, Y.S.; Parkinson, J.; Liu, C.; Wang, W. TUnA: An Uncertainty-Aware Transformer Model for Sequence-Based Protein–Protein Interaction Prediction. Brief. Bioinform. 2024, 25, bbae359. [Google Scholar] [CrossRef]
  141. Schoenrock, A.; Dehne, F.; Green, J.R.; Golshani, A.; Pitre, S. MP-PIPE: A Massively Parallel Protein-Protein Interaction Prediction Engine. In Proceedings of the International Conference on Supercomputing—ICS ’11, Tucson, AZ, USA, 31 May–4 June 2011; ACM Press: Tucson, AZ, USA, 2011; p. 327. [Google Scholar]
  142. Dayhoff, M.; Schwartz, R.; Orcutt, B. A Model of Evolutionary Change in Proteins. Atlas Protein Seq. Struct. 1978, 5, 345–352. [Google Scholar]
  143. Pitre, S.; North, C.; Alamgir, M.; Jessulat, M.; Chan, A.; Luo, X.; Green, J.R.; Dumontier, M.; Dehne, F.; Golshani, A. Global Investigation of Protein–Protein Interactions in Yeast Saccharomyces Cerevisiae Using Re-Occurring Short Polypeptide Sequences. Nucleic Acids Res. 2008, 36, 4286–4294. [Google Scholar] [CrossRef] [PubMed]
  144. Bell, E.W.; Schwartz, J.H.; Freddolino, P.L.; Zhang, Y. PEPPI: Whole-Proteome Protein-Protein Interaction Prediction through Structure and Sequence Similarity, Functional Association, and Machine Learning. J. Mol. Biol. 2022, 434, 167530. [Google Scholar] [CrossRef] [PubMed]
  145. Feng, Y.; Wang, Q.; Wang, T. Drug Target Protein-Protein Interaction Networks: A Systematic Perspective. BioMed Res. Int. 2017, 2017, 1289259. [Google Scholar] [CrossRef]
  146. Harrold, J.M.; Ramanathan, M.; Mager, D.E. Network-Based Approaches in Drug Discovery and Early Development. Clin. Pharmacol. Ther. 2013, 94, 651–658. [Google Scholar] [CrossRef]
  147. Kim, Y.-A.; Wuchty, S.; Przytycka, T.M. Identifying Causal Genes and Dysregulated Pathways in Complex Diseases. PLoS Comput. Biol. 2011, 7, e1001095. [Google Scholar] [CrossRef]
  148. Basar, M.A.; Hosen, M.F.; Kumar Paul, B.; Hasan, M.R.; Shamim, S.M.; Bhuyian, T. Identification of Drug and Protein-Protein Interaction Network among Stress and Depression: A Bioinformatics Approach. Inform. Med. Unlocked 2023, 37, 101174. [Google Scholar] [CrossRef]
  149. Peng, Q.; Schork, N.J. Utility of Network Integrity Methods in Therapeutic Target Identification. Front. Genet. 2014, 5, 12. [Google Scholar] [CrossRef]
  150. Li, Z.-C.; Zhong, W.-Q.; Liu, Z.-Q.; Huang, M.-H.; Xie, Y.; Dai, Z.; Zou, X.-Y. Large-Scale Identification of Potential Drug Targets Based on the Topological Features of Human Protein-Protein Interaction Network. Anal. Chim. Acta 2015, 871, 18–27. [Google Scholar] [CrossRef]
  151. Gordon, D.E.; Jang, G.M.; Bouhaddou, M.; Xu, J.; Obernier, K.; White, K.M.; O’Meara, M.J.; Rezelj, V.V.; Guo, J.Z.; Swaney, D.L.; et al. A SARS-CoV-2 Protein Interaction Map Reveals Targets for Drug Repurposing. Nature 2020, 583, 459–468. [Google Scholar] [CrossRef]
  152. Agamah, F.E.; Mazandu, G.K.; Hassan, R.; Bope, C.D.; Thomford, N.E.; Ghansah, A.; Chimusa, E.R. Computational/in Silico Methods in Drug Target and Lead Prediction. Brief. Bioinform. 2019, 21, 1663–1675. [Google Scholar] [CrossRef]
  153. Zhang, X.; Wu, F.; Yang, N.; Zhan, X.; Liao, J.; Mai, S.; Huang, Z. In Silico Methods for Identification of Potential Therapeutic Targets. Interdiscip. Sci. Comput. Life Sci. 2022, 14, 285–310. [Google Scholar] [CrossRef] [PubMed]
  154. Wang, L.; Wang, N.; Zhang, W.; Cheng, X.; Yan, Z.; Shao, G.; Wang, X.; Wang, R.; Fu, C. Therapeutic Peptides: Current Applications and Future Directions. Signal Transduct. Target. Ther. 2022, 7, 48. [Google Scholar] [CrossRef] [PubMed]
  155. Rosson, E.; Lux, F.; David, L.; Godfrin, Y.; Tillement, O.; Thomas, E. Focus on Therapeutic Peptides and Their Delivery. Int. J. Pharm. 2025, 675, 125555. [Google Scholar] [CrossRef]
  156. Sivasankaran, R.P.; Snell, K.; Kunkel, G.; Georgiou, P.G.; Puente, E.G.; Maynard, H.D. Polymer-Mediated Protein/Peptide Therapeutic Stabilization: Current Progress and Future Directions. Prog. Polym. Sci. 2024, 156, 101867. [Google Scholar] [CrossRef]
  157. Nicze, M.; Borówka, M.; Dec, A.; Niemiec, A.; Bułdak, Ł.; Okopień, B. The Current and Promising Oral Delivery Methods for Protein- and Peptide-Based Drugs. Int. J. Mol. Sci. 2024, 25, 815. [Google Scholar] [CrossRef]
  158. Xiao, W.; Jiang, W.; Chen, Z.; Huang, Y.; Mao, J.; Zheng, W.; Hu, Y.; Shi, J. Advance in Peptide-Based Drug Development: Delivery Platforms, Therapeutics and Vaccines. Signal Transduct. Target. Ther. 2025, 10, 74. [Google Scholar] [CrossRef] [PubMed]
  159. Cabri, W.; Cantelmi, P.; Corbisiero, D.; Fantoni, T.; Ferrazzano, L.; Martelli, G.; Mattellone, A.; Tolomelli, A. Therapeutic Peptides Targeting PPI in Clinical Development: Overview, Mechanism of Action and Perspectives. Front. Mol. Biosci. 2021, 8, 697586. [Google Scholar] [CrossRef]
  160. Coin, I.; Beyermann, M.; Bienert, M. Solid-Phase Peptide Synthesis: From Standard Procedures to the Synthesis of Difficult Sequences. Nat. Protoc. 2007, 2, 3247–3256. [Google Scholar] [CrossRef] [PubMed]
  161. Charih, F.; Biggar, K.K.; Green, J.R. Assessing Sequence-Based Protein–Protein Interaction Predictors for Use in Therapeutic Peptide Engineering. Sci. Rep. 2022, 12, 9610. [Google Scholar] [CrossRef] [PubMed]
  162. Schoenrock, A.; Burnside, D.; Moteshareie, H.; Wong, A.; Golshani, A.; Dehne, F.; Green, J.R. Engineering Inhibitory Proteins with InSiPS: The in-Silico Protein Synthesizer. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on—SC ’15, Austin, TX, USA, 15–20 November 2015; ACM Press: Austin, TX, USA, 2015; pp. 1–11. [Google Scholar]
  163. Burnside, D.; Schoenrock, A.; Moteshareie, H.; Hooshyar, M.; Basra, P.; Hajikarimlou, M.; Dick, K.; Barnes, B.; Kazmirchuk, T.; Jessulat, M.; et al. In Silico Engineering of Synthetic Binding Proteins from Random Amino Acid Sequences. iScience 2019, 11, 375–387. [Google Scholar] [CrossRef] [PubMed]
  164. Hajikarimlou, M.; Hooshyar, M.; Moutaoufik, M.T.; Aly, K.A.; Azad, T.; Takallou, S.; Jagadeesan, S.; Phanse, S.; Said, K.B.; Samanfar, B.; et al. A Computational Approach to Rapidly Design Peptides That Detect SARS-CoV-2 Surface Protein S. NAR Genom. Bioinform. 2022, 4, lqac058. [Google Scholar] [CrossRef]
  165. Lei, Y.; Li, S.; Liu, Z.; Wan, F.; Tian, T.; Li, S.; Zhao, D.; Zeng, J. A Deep-Learning Framework for Multi-Level Peptide–Protein Interaction Prediction. Nat. Commun. 2021, 12, 5465. [Google Scholar] [CrossRef]
  166. Palepu, K.; Ponnapati, M.; Bhat, S.; Tysinger, E.; Stan, T.; Brixi, G.; Koseki, S.R.T.; Chatterjee, P. Design of Peptide-Based Protein Degraders via Contrastive Deep Learning. bioRxiv 2022. [Google Scholar] [CrossRef]
  167. Bhat, S.; Palepu, K.; Hong, L.; Mao, J.; Ye, T.; Iyer, R.; Zhao, L.; Chen, T.; Vincoff, S.; Watson, R.; et al. De Novo Design of Peptide Binders to Conformationally Diverse Targets with Contrastive Language Modeling. Sci. Adv. 2025, 11, eadr8638. [Google Scholar] [CrossRef]
  168. Ruffolo, J.A.; Gray, J.J.; Sulam, J. Deciphering Antibody Affinity Maturation with Language Models and Weakly Supervised Learning. arXiv 2021, arXiv:2112.07782. [Google Scholar] [CrossRef]
  169. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar] [CrossRef]
  170. Leem, J.; Mitchell, L.S.; Farmery, J.H.R.; Barton, J.; Galson, J.D. Deciphering the Language of Antibodies Using Self-Supervised Learning. Patterns 2022, 3, 100513. [Google Scholar] [CrossRef]
  171. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  172. Kenlay, H.; Dreyer, F.A.; Kovaltsuk, A.; Miketa, D.; Pires, D.; Deane, C.M. Large Scale Paired Antibody Language Models. PLoS Comput. Biol. 2024, 20, e1012646. [Google Scholar] [CrossRef]
  173. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  174. Olsen, T.H.; Moal, I.H.; Deane, C.M. AbLang: An Antibody Language Model for Completing Antibody Sequences. Bioinform. Adv. 2022, 2, vbac046. [Google Scholar] [CrossRef]
  175. Shuai, R.W.; Ruffolo, J.A.; Gray, J.J. IgLM: Infilling Language Modeling for Antibody Sequence Design. Cell Syst. 2023, 14, 979–989.e4. [Google Scholar] [CrossRef]
  176. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models Are Unsupervised Multitask Learners. OpenAI Blog [Online]. 2019. Available online: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (accessed on 5 July 2025).
  177. Kovaltsuk, A.; Leem, J.; Kelm, S.; Snowden, J.; Deane, C.M.; Krawczyk, K. Observed Antibody Space: A Resource for Data Mining Next-Generation Sequencing of Antibody Repertoires. J. Immunol. 2018, 201, 2502–2509. [Google Scholar] [CrossRef]
  178. Hie, B.L.; Shanker, V.R.; Xu, D.; Bruun, T.U.J.; Weidenbacher, P.A.; Tang, S.; Wu, W.; Pak, J.E.; Kim, P.S. Efficient Evolution of Human Antibodies from General Protein Language Models. Nat. Biotechnol. 2024, 42, 275–283. [Google Scholar] [CrossRef] [PubMed]
  179. Rives, A.; Meier, J.; Sercu, T.; Goyal, S.; Lin, Z.; Liu, J.; Guo, D.; Ott, M.; Zitnick, C.L.; Ma, J.; et al. Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences. Proc. Natl. Acad. Sci. USA 2021, 118, e2016239118. [Google Scholar] [CrossRef] [PubMed]
  180. Meier, J.; Rao, R.; Verkuil, R.; Liu, J.; Sercu, T.; Rives, A. Language Models Enable Zero-Shot Prediction of the Effects of Mutations on Protein Function. bioRxiv 2021. [Google Scholar] [CrossRef]
  181. Boshar, S.; Trop, E.; de Almeida, B.P.; Copoiu, L.; Pierrot, T. Are Genomic Language Models All You Need? Exploring Genomic Language Models on Protein Downstream Tasks. Bioinformatics 2024, 40, btae529. [Google Scholar] [CrossRef]
  182. Consens, M.E.; Li, B.; Poetsch, A.R.; Gilbert, S. Genomic Language Models Could Transform Medicine but Not Yet. npj Digit. Med. 2025, 8, 212. [Google Scholar] [CrossRef]
  183. Ali, S.; Qadri, Y.A.; Ahmad, K.; Lin, Z.; Leung, M.-F.; Kim, S.W.; Vasilakos, A.V.; Zhou, T. Large Language Models in Genomics—A Perspective on Personalized Medicine. Bioengineering 2025, 12, 440. [Google Scholar] [CrossRef] [PubMed]
  184. Li, Y.; Qiao, G.; Wang, G. scKEPLM: Knowledge Enhanced Large-Scale Pre-Trained Language Model for Single-Cell Transcriptomics. bioRxiv 2024. [Google Scholar] [CrossRef]
  185. Zeng, Y.; Xie, J.; Shangguan, N.; Wei, Z.; Li, W.; Su, Y.; Yang, S.; Zhang, C.; Zhang, J.; Fang, N.; et al. CellFM: A Large-Scale Foundation Model Pre-Trained on Transcriptomics of 100 Million Human Cells. Nat. Commun. 2025, 16, 4679. [Google Scholar] [CrossRef] [PubMed]
  186. Fournier, Q.; Vernon, R.M.; Van Der Sloot, A.; Schulz, B.; Chandar, S.; Langmead, C.J. Protein Language Models: Is Scaling Necessary? bioRxiv 2024. [Google Scholar] [CrossRef]
  187. Smietana, K.; Siatkowski, M.; Møller, M. Trends in Clinical Success Rates. Nat. Rev. Drug Discov. 2016, 15, 379–380. [Google Scholar] [CrossRef]
  188. Mullard, A. Parsing Clinical Success Rates. Nat. Rev. Drug Discov. 2016, 15, 447. [Google Scholar] [CrossRef]
  189. Yamaguchi, S.; Kaneko, M.; Narukawa, M. Approval Success Rates of Drug Candidates Based on Target, Action, Modality, Application, and Their Combinations. Clin. Transl. Sci. 2021, 14, 1113–1122. [Google Scholar] [CrossRef]
  190. Wang, M.; Zhang, Z.; Bedi, A.S.; Velasquez, A.; Guerra, S.; Lin-Gibson, S.; Cong, L.; Qu, Y.; Chakraborty, S.; Blewett, M.; et al. A Call for Built-in Biosecurity Safeguards for Generative AI Tools. Nat. Biotechnol. 2025, 43, 845–847. [Google Scholar] [CrossRef]
  191. de Lima, R.C.; Sinclair, L.; Megger, R.; Maciel, M.A.G.; Vasconcelos, P.F.d.C.; Quaresma, J.A.S. Artificial Intelligence Challenges in the Face of Biological Threats: Emerging Catastrophic Risks for Public Health. Front. Artif. Intell. 2024, 7, 1382356. [Google Scholar] [CrossRef]
  192. Wheeler, N.E. Responsible AI in Biotechnology: Balancing Discovery, Innovation and Biosecurity Risks. Front. Bioeng. Biotechnol. 2025, 13, 1537471. [Google Scholar] [CrossRef] [PubMed]
  193. Brent, R.; McKelvey, T.G., Jr. Contemporary AI Foundation Models Increase Biological Weapons Risk. arXiv 2025, arXiv:2506.13798. [Google Scholar] [CrossRef]
  194. OpenAI. Preparing for Future AI Capabilities in Biology. Available online: https://openai.com/index/preparing-for-future-ai-capabilities-in-biology/ (accessed on 29 August 2025).
Figure 1. Gap between the total number of PPIs and PPIs for which high-quality structures are available. Only a fraction of the total number of experimentally validated physical PPIs in the BioGRID PPI database [31], involving at least one human protein (blue), have high-quality structures (<2Å resolution) for both interactors deposited in the RCSB Protein Data Bank [32] (orange). This fraction of PPIs with high-quality structures has been decreasing over time. Furthermore, the fraction of useful structures may be lower, since many of these structures were resolved for heavily truncated protein constructs (Data retrieved and compiled using the RCSB PDB API).
Figure 1. Gap between the total number of PPIs and PPIs for which high-quality structures are available. Only a fraction of the total number of experimentally validated physical PPIs in the BioGRID PPI database [31], involving at least one human protein (blue), have high-quality structures (<2Å resolution) for both interactors deposited in the RCSB Protein Data Bank [32] (orange). This fraction of PPIs with high-quality structures has been decreasing over time. Furthermore, the fraction of useful structures may be lower, since many of these structures were resolved for heavily truncated protein constructs (Data retrieved and compiled using the RCSB PDB API).
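For readers who wish to assemble similar counts, the sketch below queries the public RCSB search API for the number of PDB entries solved at better than 2 Å resolution. The endpoint, query schema, and attribute name reflect the public API at the time of writing; this is an illustrative sketch, not the compilation script used for the figure.

```python
# Minimal sketch: count PDB entries solved at better than 2 Å resolution
# via the public RCSB search API (endpoint/attribute names assumed current).
import requests

SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"

query = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_entry_info.resolution_combined",
            "operator": "less",
            "value": 2.0,
        },
    },
    "return_type": "entry",
    "request_options": {"results_content_type": ["experimental"]},
}

response = requests.post(SEARCH_URL, json=query, timeout=30)
response.raise_for_status()
print("High-resolution entries:", response.json()["total_count"])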
Figure 2. Machine learning workflow for PPI prediction. A standard methodology is typically employed to train ML-based PPI predictors. First, known interactions and their protein sequences are retrieved from databases and curated, and pairs of proteins assumed to be non-interacting (“negative” pairs) are added to the dataset. Second, numerical descriptors of protein sequences and/or structures are extracted, and the dataset is split into training and validation sets. Third, a model is trained by minimizing misclassifications over the training protein pairs. Fourth, the model predicts PPIs in a test set composed of new, unseen protein pairs, allowing for an accurate estimate of performance. Finally, the model is deployed to identify new interactions for in vitro validation and to predict the interactome.
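The training and evaluation steps of this workflow can be sketched in a few lines with scikit-learn. The feature matrix and labels below are random placeholders standing in for real pair descriptors and curated labels; in practice, the split should additionally be protein-aware so that proteins seen during training do not reappear in the test set.

```python
# Minimal sketch of the train/evaluate steps in Figure 2, assuming X holds
# per-pair feature vectors and y holds labels (1 = interacting, 0 = assumed
# non-interacting). All data here are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 40))    # placeholder pair descriptors
y = rng.integers(0, 2, size=1000)  # placeholder interaction labels

# Hold out unseen pairs so the performance estimate is honest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

scores = model.predict_proba(X_test)[:, 1]
print("AUPRC on held-out pairs:", average_precision_score(y_test, scores))
```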
Figure 3. Supervised learning and self-supervised learning in PPI prediction. (A) In the supervised learning setting, a model is trained to make predictions on numerical representations (vectors of descriptors/features) representing a pair of proteins (green and red). These descriptors can be human-engineered (traditional ML) or learned from data by deep learning models. The model’s parameters are adjusted to maximize the number of correct predictions over a training set. (B) In self-supervised learning, a protein language model (pLM) learns to produce representations (embeddings) for proteins by learning to predict a masked residue from the surrounding amino acids (masked language modeling).
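A toy sketch of the masked-language-modeling objective in panel (B): a single residue is hidden and a small Transformer encoder is trained to recover it from its context. The architecture and dimensions here are purely illustrative; production pLMs (see Table 4) apply the same objective at a vastly larger scale.

```python
# Toy masked-language-modeling objective: hide one residue and score the
# model on recovering it from the surrounding sequence context.
import torch
import torch.nn as nn

VOCAB = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK_ID = len(VOCAB)            # extra token id used as [MASK]
tok = {aa: i for i, aa in enumerate(VOCAB)}

seq = "MKTAYIAKQR"
ids = torch.tensor([[tok[aa] for aa in seq]])
target = ids.clone()

masked = ids.clone()
masked[0, 4] = MASK_ID  # mask the 5th residue ("Y")

embed = nn.Embedding(len(VOCAB) + 1, 64)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(64, len(VOCAB))

logits = head(encoder(embed(masked)))
# The loss is computed only at the masked position.
loss = nn.functional.cross_entropy(logits[0, 4:5], target[0, 4:5])
print("Masked-residue loss:", float(loss))
```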
Figure 4. Assessment of PPI predictor performance. (A) Schematic representation of the score distributions for non-interacting (red) and interacting (green) pairs. Protein pairs with scores/probabilities above a chosen decision threshold are classified as positive. (B) Schematic representation of how the score distributions and predictions are used to compute important performance metrics. (C) The receiver operating characteristic (ROC) curve (left) illustrates the tradeoff between the true positive rate (TPR) and specificity (Sp) over the range of decision thresholds. The ROC curve of a perfect PPI predictor passes through the upper-left corner of the plot, while that of random guessing is shown as a dashed line. The precision–recall (PR) curve (right) illustrates the tradeoff between precision and recall over the range of decision thresholds. The PR curve of a perfect predictor passes through the upper-right corner of the plot.
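The metrics in panel (B) follow directly from the confusion matrix obtained once a threshold is fixed, as the sketch below illustrates on simulated scores (all values are synthetic).

```python
# Fix a decision threshold, form the confusion matrix, and derive the
# threshold-dependent metrics summarized in Table 2.
import numpy as np

rng = np.random.default_rng(1)
y_true = np.r_[np.ones(100), np.zeros(900)]  # 1:9 class imbalance
scores = np.r_[rng.normal(1.0, 1.0, 100),    # interacting pairs
               rng.normal(-1.0, 1.0, 900)]   # non-interacting pairs

threshold = 0.0
y_pred = (scores >= threshold).astype(int)

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
tn = np.sum((y_pred == 0) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

recall = tp / (tp + fn)       # true positive rate
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
print(f"Re={recall:.2f}  Sp={specificity:.2f}  Pr={precision:.2f}")
```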
Figure 5. Frequently encountered PPI prediction schemes. In intra-species prediction, PPIs from an organism are used as training data to discover new interactions within the same organism. In the inter-species scheme, interactions from two or more organisms are used to predict interactions between the organisms. Cross-species PPI prediction involves predicting interactions within understudied organisms from the PPIs of better-studied, evolutionarily related organisms that act as “proxies”.
Figure 6. Sensitivity of the ROC and PR curves to class imbalance. In this simulated scenario, where the predicted interaction probabilities (left) of interacting (green) and non-interacting (orange, brown, and red) protein pairs are assumed to be normally distributed, we compare the ROC (middle) and PR curves (right) obtained for different class imbalance ratios (1:1, 1:100, and 1:500).
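The qualitative behavior in this figure is easy to reproduce: holding the score distributions fixed while adding negatives leaves the AUROC essentially unchanged, whereas the AUPRC collapses. The distribution parameters below are illustrative and not those used to generate the figure.

```python
# Simulate the Figure 6 scenario: identical score distributions, varying
# class imbalance. AUROC is insensitive; AUPRC degrades sharply.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(42)
n_pos = 500
pos_scores = rng.normal(1.5, 1.0, n_pos)  # interacting pairs

for ratio in (1, 100, 500):
    neg_scores = rng.normal(-1.5, 1.0, n_pos * ratio)
    y = np.r_[np.ones(n_pos), np.zeros(n_pos * ratio)]
    s = np.r_[pos_scores, neg_scores]
    print(f"1:{ratio:<4} AUROC={roc_auc_score(y, s):.3f}  "
          f"AUPRC={average_precision_score(y, s):.3f}")
```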
Table 1. Recently updated PPI and PepPI databases incorporating experimental data involving human proteins.
| Database | URL | Human Interactions |
| --- | --- | --- |
| BioGRID [31] | https://thebiogrid.org (accessed on 1 August 2025) | 1,890,522 |
| STRING [58] | https://string-db.org (accessed on 1 August 2025) | 2,219,787 |
| IntAct [59] | https://www.ebi.ac.uk/intact (accessed on 1 August 2025) | 1,702,367 |
| MINT [60] | https://mint.bio.uniroma2.it (accessed on 1 August 2025) | 139,901 |
| Propedia [61] | http://bioinfo.dcc.ufmg.br/propedia (accessed on 1 August 2025) | 19,813 |
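As an example of the curation step that typically follows a database download, the sketch below filters a bulk BioGRID TAB3 file down to physical interactions involving at least one human protein (NCBI taxon 9606). The file name is hypothetical, and the column names follow the TAB3 format at the time of writing.

```python
# Hedged sketch: reduce a BioGRID TAB3 bulk download to unique physical
# interaction pairs involving a human protein (taxon 9606).
import pandas as pd

df = pd.read_csv("BIOGRID-ALL.tab3.txt", sep="\t", low_memory=False)

human_physical = df[
    (df["Experimental System Type"] == "physical")
    & ((df["Organism ID Interactor A"] == 9606)
       | (df["Organism ID Interactor B"] == 9606))
]

pairs = human_physical[
    ["Official Symbol Interactor A", "Official Symbol Interactor B"]
].drop_duplicates()
print(f"{len(pairs):,} unique physical pairs involving a human protein")
```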
Table 2. Summary of commonly used metrics for PPI predictor evaluation.
| Metric | Summary | Comment |
| --- | --- | --- |
| Accuracy (Ac) | Fraction of correctly classified pairs | Not useful in the context of PPI prediction, as it emphasizes the correct classification of non-interacting pairs, which vastly outnumber interacting pairs (high class imbalance) |
| Recall (Re) | Fraction of interacting pairs also predicted to interact | — |
| Precision (Pr) | Fraction of pairs predicted to interact that truly interact | Useful in high class imbalance situations; allows for the estimation of the number of experiments required to identify a fixed number of new interacting pairs |
| Specificity (Sp) | Fraction of non-interacting pairs also predicted to not interact | Usually not particularly relevant in the context of PPI prediction |
| F1-score | Harmonic mean of precision and recall | — |
| Prevalence-corrected precision (PCPr) | Formulation of precision as a function of recall, specificity, and the class imbalance ratio | Particularly useful, as it allows the anticipated precision to be estimated for different hypothetical imbalance ratios (the true ratio is often unknown) |
| Area under the receiver operating characteristic curve (AUROC) | Average recall–specificity tradeoff over the range of operating thresholds | Insensitive to class imbalance; likely to lead to overoptimistic performance estimates |
| Area under the precision–recall curve (AUPRC) | Average recall–precision tradeoff over the range of operating thresholds | Sensitive to class imbalance; captures precision, a highly relevant metric in the context of PPI prediction |
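The prevalence-corrected precision admits a one-line derivation: with r negatives per positive, the expected counts of true and false positives scale as Re and r·(1 − Sp), respectively, so the expected precision reduces to Re / (Re + r·(1 − Sp)). The sketch below evaluates this expression at illustrative operating points.

```python
# Prevalence-corrected precision: PCPr = Re / (Re + r * (1 - Sp)),
# where r is the number of negatives per positive. Values are illustrative.
def prevalence_corrected_precision(recall: float, specificity: float, r: float) -> float:
    """Expected precision when negatives outnumber positives r-to-1."""
    return recall / (recall + r * (1.0 - specificity))

# A predictor with Re = 0.80 and Sp = 0.99 looks precise at 1:1 but far
# less so at the extreme imbalance typical of proteome-wide screens.
for r in (1, 100, 500):
    pcpr = prevalence_corrected_precision(0.8, 0.99, r)
    print(f"1:{r:<4} -> PCPr = {pcpr:.3f}")
```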
Table 3. Summary of frequently used human-engineered feature sets.
| Feature Set | Input | Dimension | Description |
| --- | --- | --- | --- |
| Amino acid composition (AAC) | Amino acid sequence | 20 | Frequency of the amino acids within the sequence of interest; limited information content (no evolutionary information, structural information, etc.) |
| Conjoint triad method (CT) | Amino acid sequence and a letter code built from shared physicochemical features | 343 (for a 7-letter code) | The sequence is rewritten as a code (usually consisting of 7 letters) where each amino acid is assigned to one of those letters based on physicochemical properties, and the counts for each possible triplet form the feature vector |
| Composition, transition, and distribution (CTD) | Amino acid sequence and 3 amino acid groups defined for 7 physicochemical features | 441 | Three groups are defined for each of 7 physicochemical properties. Composition (C): proportion of residues in the sequence belonging to each of the 3 groups, computed for each of the 7 properties. Transition (T): number of transitions from one group to another for all 7 properties. Distribution (D): chain length at which the 1st, first 25%, first 50%, first 75%, and first 100% of the amino acids in a group are encompassed. These features are computed for each of the three equal thirds of the protein and concatenated |
| Pseudo amino acid composition (PseAAC) | Amino acid sequence and a set of physicochemical properties | 20 + λ | Amino acid composition descriptors to which correlation factors are added to account for the autocorrelation between hydrophobicity, hydrophilicity, and side chain mass values of residues up to λ positions apart |
| Position-specific scoring matrix (PSSM) | Amino acid sequence and a reference protein database | 20 × L (matrix) | Matrix tabulating the likelihood of a mutation to each of the 20 amino acids through evolution for all L amino acids in the sequence; rich in evolutionary/phylogenetic information |
| Autocorrelation of physicochemical properties (AC) | Amino acid sequence, physicochemical properties, and a “lag” parameter defining the window size within which properties are aggregated | 14 (for 7 properties) | Autocorrelation of physicochemical property values between residues in neighborhoods whose sizes are defined by the “lag” |
| ProtDCal-extracted features | Amino acid sequence and a list of physicochemical properties and aggregators | >10,000 (variable) | Ensemble of grouping schemes, weights, and aggregation operations applied to the physicochemical properties of the amino acids in a sequence |
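To make the simplest entry in this table concrete, the sketch below computes the 20-dimensional amino acid composition of a sequence; the richer descriptors (CT, CTD, PseAAC) extend the same counting idea with physicochemical groupings and positional statistics. The example sequence is arbitrary.

```python
# Compute the amino acid composition (AAC): the relative frequency of
# each of the 20 standard amino acids in a sequence.
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(seq: str) -> list[float]:
    """Return the 20-dimensional AAC feature vector for a sequence."""
    counts = Counter(seq)
    return [counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS]

features = amino_acid_composition("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(len(features), sum(features))  # 20 dimensions; frequencies sum to 1
```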
Table 4. Foundational protein language models used to predict PPIs.
| Model | Embedding Dimension | Parameters (Approx.) | Training Strategy | Training Data |
| --- | --- | --- | --- | --- |
| ProtT5 [138] | 1024 | 3B | 1-gram random masking with demasking | BFD (pre-training; ~2.1B sequences) and UniRef50 (fine-tuning; ~45M sequences) |
| Ankh [139] | 1536 | 1B | 1-gram random masking with full sequence reconstruction | UniRef50 (~45M sequences) |
| ESM-2 [35] | 1280 | 650M | 1-gram random masking with demasking | UniRef50+90 (~65M sequences) |
Note: We use short forms “B” and “M” to refer to billion and million, respectively.
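As an illustration of how these models are used in practice, the sketch below extracts mean-pooled per-protein embeddings with ESM-2 (650M) via the fair-esm package. The pooling strategy and example sequences are our own choices rather than a prescribed protocol; a downstream PPI classifier would then consume pairs of such embeddings (e.g., concatenated).

```python
# Hedged sketch: mean-pooled ESM-2 embeddings for two proteins, using the
# fair-esm package (pip install fair-esm). Sequences are arbitrary examples.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

data = [("protein_A", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
        ("protein_B", "GSHMSLFDFFKNKGSAATATDRLKLILAKER")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
residue_reps = out["representations"][33]

# Mean-pool over residues, skipping the BOS/EOS special tokens.
embeddings = [residue_reps[i, 1:len(seq) + 1].mean(dim=0)
              for i, (_, seq) in enumerate(data)]
print(embeddings[0].shape)  # torch.Size([1280]), matching Table 4
```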
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
