1. Introduction
Protein–protein interactions (PPIs) play fundamental roles in numerous biological processes, including signal transduction, metabolic pathways, and gene regulation [1]. In plants, a deeper understanding of PPIs is especially valuable for elucidating mechanisms underlying growth, development, stress responses, and disease resistance [2]. However, experimental techniques for identifying PPIs, such as yeast two-hybrid assays and co-immunoprecipitation, are often time-consuming, costly, and limited in scale [3]. Consequently, computational approaches for PPI prediction have emerged as essential tools, enabling rapid screening of large sets of protein pairs.
Early computational methods relied primarily on sequence similarity, domain information, and genomic context to infer interactions [4]. Subsequent approaches incorporated handcrafted features derived from protein sequences and their biological properties [5,6]. Although these feature-based methods performed well for well-characterized organisms, they often struggled with limited data availability and were unable to fully capture the complex sequence patterns governing protein interactions. The rise of deep learning marked a significant shift, as neural networks could automatically learn relevant patterns directly from raw sequences, eliminating the need for manual feature engineering [7]. Representative models include DPPI [8], which uses convolutional neural networks to detect local motifs; PIPR [9], which employs a Siamese residual recurrent convolutional network; DeepTrio [10], which integrates multi-scale convolution and masking strategies; D-SCRIPT [11], which combines pre-trained language model embeddings with a structure-aware design to predict inter-protein contact maps and achieves interpretable, cross-species PPI prediction; and TAGPPI [12], which learns spatial structures of proteins by integrating 1D sequence convolutions with graph learning on AlphaFold-predicted contact maps to improve prediction accuracy. These methods demonstrated strong performance on benchmark datasets from yeast and humans, establishing deep learning as a powerful framework for PPI prediction.
Inspired by parallels between biological sequences and natural language, researchers began to adapt natural language processing (NLP) techniques to computational biology [13,14]. By treating amino acid sequences as textual documents (where residues or short peptides correspond to words), early methods utilized shallow embeddings such as word2vec and doc2vec to represent proteins in continuous vector spaces. For instance, DeepFE-PPI [15] adopted Res2vec for residue embedding. Despite their simplicity and reliance on relatively small training corpora, these approaches showed that distributional representations could effectively capture sequence-level semantic information.
The introduction of the Transformer architecture in 2017 revolutionized both NLP and protein sequence modeling [16]. Its attention mechanism allows dynamic weighting of information across sequence positions, capturing both local motifs and global dependencies [17]. This advancement addressed earlier limitations in computational efficiency and memory, paving the way for large-scale protein language models (PLMs). Rives et al. developed one of the first such models, laying the foundation for modern protein representation learning [18]. Among PLMs, ESM has become particularly influential. Trained extensively on the UniRef50 database, ESM learns hierarchical protein representations that encode biochemical properties, evolutionary relationships, and structural constraints [19]. The success of ESM and related models has been demonstrated across diverse downstream tasks, including function annotation, structure prediction, and interaction analysis [20,21]. Pre-trained embeddings from these models provide powerful, transferable numerical representations that often surpass task-specific feature engineering or training from scratch on limited datasets. For PPI prediction specifically, PLM-based approaches typically report substantial gains over earlier feature-based or non-PLM deep learning methods. For example, ESMAraPPI [22] feeds ESM-1b embeddings into a multi-layer perceptron (MLP) for Arabidopsis PPI prediction and reported a PR AUC of 0.81 on an independent test set with unseen proteins, outperforming generic and plant-specific baselines.
Despite the promise of protein language models for PPI prediction, their application to plant species, especially economically important crops, remains challenging. Grape (Vitis vinifera), a globally significant fruit crop with annual production exceeding 77 million tonnes, would greatly benefit from improved insights into its protein interaction networks to support targeted crop improvement. A major obstacle is the scarcity of high-quality experimental PPI data for grape compared to model organisms such as yeast and human. Although databases like STRING continue to expand plant data coverage, the number of known grape interactions remains substantially lower than for mammals or microbes [23]. This data scarcity complicates both model training and evaluation, underscoring the need for approaches that generalize effectively with limited grape-specific PPI data.
The challenges extend beyond data availability. Models developed for yeast or human PPI prediction may not perform well on plant data due to species-specific differences in protein interactions and domains. Even within plants, cross-species generalization can be limited. For example, ESMAraPPI, an Arabidopsis-specific model using ESM embeddings, showed excellent performance on Arabidopsis data [22]. However, when we applied the same architecture to grape PPI prediction, its effectiveness was limited. This observation highlights the species-specific nature of protein interactions within the plant kingdom and emphasizes the need for specialized models tailored to particular crops. Given the potential of accurate PPI prediction to identify networks related to yield, nutrient use, and stress tolerance, developing robust computational models for grape PPIs is crucial to support experimental efforts and accelerate grape functional genomics and molecular breeding.
In this study, we present GrapePPI, the first deep learning framework specifically designed for grape PPI prediction that leverages ESM embeddings together with a species-adapted architecture (sequence encoder and multi-layer interaction predictor). We evaluated GrapePPI on standard benchmarks and plant-specific datasets under both balanced and imbalanced class distributions, comparing its performance with state-of-the-art methods including ESMAraPPI, DeepFE-PPI, and PIPR. Our specific contributions are: (1) a grape-specific PPI prediction framework combining pre-trained ESM embeddings with a tailored neural architecture; (2) comprehensive evaluation on grape datasets with 1:1 and 1:10 positive-to-negative ratios and on yeast and Arabidopsis benchmarks; (3) ablation studies demonstrating the importance of the sequence encoder and multi-layer predictor for grape; and (4) statistically significant improvements over existing methods on grape data. To our knowledge, the most similar prior work follows the PLM-embedding-plus-MLP paradigm, exemplified by ESMAraPPI, which uses a single predictor on top of pooled embeddings and does not include a dedicated sequence encoder to adapt representations to the PPI task. Our ablation study provides an explicit architectural contrast through the no_seq_encoder variant of GrapePPI (analogous to ESMAraPPI) and demonstrates that the sequence encoder and multi-layer interaction predictor are critical for adapting general-purpose embeddings to grape PPI prediction. GrapePPI advances beyond ESMAraPPI by introducing a task-specific sequence encoder, enabling substantially better performance on grape data (see Section 3). The rest of the paper is organized as follows: Materials and Methods describes the data, model architecture, and training; Results presents benchmark and grape-specific performance and ablations; Discussion addresses limitations and future work; and Conclusion summarizes the findings. Our results show that GrapePPI outperforms competing methods on the yeast benchmarks and on grape-specific data, with statistically significant advantages on the grape datasets, and performs close to the Arabidopsis-specific ESMAraPPI on the Arabidopsis benchmark.
2. Materials and Methods
2.1. Data Source and Preprocessing
We constructed the grape protein–protein interaction dataset from the STRING database version 12.0 [23], a widely used resource for protein association networks and functional enrichment. Physical interaction pairs with confidence scores ≥0.7 were extracted, yielding 24,477 interactions. To reduce sequence redundancy, we used CD-HIT [24] with a 40% sequence identity threshold; in addition, proteins shorter than 50 amino acids were filtered out. The impact of each quality control step on interaction counts was as follows. Length filtering (retaining only pairs in which both proteins have ≥50 amino acids) removed 542 interactions, leaving 23,935. Subsequently, CD-HIT redundancy reduction (keeping one representative per cluster at 40% sequence identity) removed a further 12,937 interactions, leaving 10,998 positive interactions. After all filtering steps, the dataset comprised 2465 unique proteins. The resulting positive PPI network has 10,998 edges, yielding an average node degree of approximately 8.9; the network has low density and a right-skewed degree distribution typical of biological PPI networks (Table S1). Two datasets with different class distributions were prepared to evaluate model performance under balanced and imbalanced conditions:
Grape-1: Balanced dataset with a 1:1 positive-to-negative ratio. The final Grape-1 dataset contained 10,998 positive and 10,998 negative interaction pairs (21,996 total).
Grape-10: Imbalanced dataset with a 1:10 positive-to-negative ratio, reflecting the typical scarcity of interacting pairs in large-scale PPI prediction. The final Grape-10 dataset contained 10,998 positive and 109,980 negative pairs (120,978 total).
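The counts reported in this subsection can be cross-checked with a few lines of arithmetic. The sketch below only restates the numbers given above; the variable names are illustrative, and the density formula assumes an undirected network without self-loops.

```python
# Counts reported in Section 2.1 (variable names are illustrative).
raw_pairs = 24_477          # STRING physical links at the reported confidence cutoff
removed_by_length = 542     # pairs containing a protein shorter than 50 aa
removed_by_cdhit = 12_937   # pairs lost to CD-HIT clustering at 40% identity
n_proteins = 2_465          # unique proteins after all filters

after_length = raw_pairs - removed_by_length
positives = after_length - removed_by_cdhit

# Each undirected edge adds one to the degree of two nodes.
avg_degree = 2 * positives / n_proteins

# Density = edges / number of possible unordered protein pairs
# (assumes an undirected network without self-loops).
possible_pairs = n_proteins * (n_proteins - 1) // 2
density = positives / possible_pairs

grape1_total = positives * (1 + 1)    # 1:1 positive-to-negative ratio
grape10_total = positives * (1 + 10)  # 1:10 positive-to-negative ratio

print(after_length, positives)
print(round(avg_degree, 2), round(density, 4))
print(grape1_total, grape10_total)
```

Under these assumptions the density is well below 1%, which is in line with the low-density description above, and the 109,980 negatives of Grape-10 remain a small fraction of all candidate non-positive pairs.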
Negative sampling. Negative samples were generated by randomly pairing proteins from the same filtered protein set such that the pair was not present in the set of known positive interactions (STRING physical links after our filters). Sampling was performed without replacement: each candidate pair was drawn by randomly selecting two distinct proteins and accepted only if it was not in the positive set and not already in the negative set. We did not use additional databases or subcellular localization to certify that negatives are non-interacting; thus negatives are “unknown” rather than validated non-interactors. A brief discussion of the limitations of random negative sampling, including the possibility that some negatives may be undiscovered true interactions, is given in the Discussion.
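The sampling rule described above can be sketched in plain Python as follows. This is a minimal illustration, not the released implementation: the function name `sample_negatives`, the toy protein IDs, and the choice to treat pairs as unordered are assumptions made for the example.

```python
import random

def sample_negatives(proteins, positives, n_negatives, seed=1234):
    """Draw random pairs that are absent from the positive set, without replacement."""
    rng = random.Random(seed)
    # Assumption: pairs are unordered, so (a, b) and (b, a) are the same pair.
    positive_set = {frozenset(p) for p in positives}
    # Conservative upper bound on the number of available negative pairs.
    max_negatives = len(proteins) * (len(proteins) - 1) // 2 - len(positive_set)
    if n_negatives > max_negatives:
        raise ValueError("not enough candidate pairs")
    negatives = set()
    while len(negatives) < n_negatives:
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        pair = frozenset((a, b))
        # Accept only pairs that are neither known positives nor already drawn.
        if pair not in positive_set and pair not in negatives:
            negatives.add(pair)
    # Sort for a reproducible output order.
    return sorted(tuple(sorted(p)) for p in negatives)

# Toy usage with hypothetical protein IDs.
proteins = [f"P{i}" for i in range(6)]
positives = [("P0", "P1"), ("P2", "P3")]
negatives = sample_negatives(proteins, positives, n_negatives=4)
print(negatives)
```

As noted above, pairs accepted by this procedure are only "not known to interact"; nothing in the rejection step certifies them as true non-interactors.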
2.2. Model Architecture
GrapePPI consists of four main components: ESM embedding, sequence encoder, feature combination, and interaction predictor (Figure 1).
Protein sequence embedding. We used ESM (esm2_t36_3B) to obtain embeddings for each protein. This model was pre-trained on UniRef sequences and outputs a 2560-dimensional vector per amino acid. Sequences were padded or truncated to 1024 residues, and average pooling was applied along the sequence length to produce a fixed-size 2560-dimensional representation per protein.
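The pooling step can be illustrated with a small, dependency-free sketch. It assumes that padding positions are excluded from the average (the text above does not state how padding is handled), and it uses 3-dimensional toy vectors in place of the 2560-dimensional ESM outputs; `mean_pool` is a hypothetical helper name.

```python
def mean_pool(residue_embeddings, max_len=1024):
    """Average per-residue vectors into one fixed-size protein vector."""
    # Truncate long sequences. Shorter sequences are averaged over their real
    # residues only, which is equivalent to masking out padding positions
    # (an assumption of this sketch).
    kept = residue_embeddings[:max_len]
    if not kept:
        raise ValueError("empty sequence")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy protein with two residues and 3-dimensional embeddings.
toy = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
pooled = mean_pool(toy)
print(pooled)  # [2.0, 3.0, 4.0]
```

Whatever the sequence length, the output has the same dimensionality as a single residue embedding, which is what allows a fixed-size encoder to follow.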
Sequence encoder. The sequence encoder is a two-layer fully connected (linear) network with no convolutional, recurrent, or transformer layers and no batch normalization. It transformed the 2560-dimensional ESM embeddings into a lower-dimensional space: the first layer maps 2560 to 512 dimensions with ReLU activation and dropout (rate 0.2), and the second layer maps 512 to 256 dimensions with ReLU. Both proteins were processed independently through the same encoder, yielding encoded representations $h_A$ and $h_B$ (each 256-dimensional). This design provides a task-specific projection from general-purpose ESM space to a compact representation suited to PPI prediction.
Feature combination. The two protein representations were merged by concatenation only: $h_{AB} = [h_A ; h_B]$, where $h_A$ and $h_B$ are the 256-dimensional encoded representations of the two proteins, resulting in a 512-dimensional combined feature vector. In preliminary experiments on grape PPI prediction, element-wise product, attention-based fusion, and their combinations did not consistently outperform simple concatenation, so the main model uses concatenation only.
Interaction predictor. A multi-layer fully connected network took the combined features and predicted the interaction probability. It comprised four layers with neuron counts 512 to 512, 512 to 256, 256 to 128, and 128 to 1. Each layer except the last was followed by ReLU activation and dropout (rate 0.2); no batch normalization was used. The final layer output was passed through a sigmoid function for binary classification.
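The layer sizes listed above imply a small trainable head relative to the frozen 3B-parameter embedding model. The following arithmetic sketch counts the trainable parameters, assuming every linear layer includes a bias term (the text does not say whether biases are used).

```python
def linear_params(n_in, n_out, bias=True):
    """Number of parameters in one fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

# Layer shapes as described in Section 2.2.
encoder_layers = [(2560, 512), (512, 256)]
predictor_layers = [(512, 512), (512, 256), (256, 128), (128, 1)]

# The predictor input is the concatenation of two 256-dimensional vectors.
assert predictor_layers[0][0] == 2 * encoder_layers[-1][1]

encoder_params = sum(linear_params(i, o) for i, o in encoder_layers)
predictor_params = sum(linear_params(i, o) for i, o in predictor_layers)
total_params = encoder_params + predictor_params

print(encoder_params, predictor_params, total_params)
# 1442560 427009 1869569
```

Under this assumption the encoder and predictor together hold roughly 1.9 million parameters, with most of them in the first encoder layer.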
2.3. Training Procedure
ESM parameters were kept frozen (not fine-tuned); only the sequence encoder and interaction predictor were trained. Models were trained using the AdamW optimizer with an initial learning rate of 0.001 and weight decay . A ReduceLROnPlateau scheduler reduced the learning rate by a factor of 0.5 if the validation loss did not improve for 15 consecutive epochs. Training was performed for up to 500 epochs with a batch size of 128. Early stopping was applied with a patience of 30 epochs based on validation loss. The loss function was binary cross-entropy. For Grape-1 (balanced data) we used no class weights; for Grape-10 (imbalanced, 1:10 ratio) we used the same unweighted loss so that the reported results reflect the model’s robustness without class weighting (exploration of class weighting or oversampling for imbalanced data is noted in the Discussion as future work). Gradient clipping (max norm 1.0) was applied. These hyperparameters were applied to all datasets (Grape-1, Grape-10, Yeast-1, Yeast-2, and Ath) to ensure a fair and consistent cross-species evaluation. For the baseline methods (DeepFE-PPI, PIPR, ESMAraPPI), hyperparameters were kept as reported in the original publications for all datasets, with no dataset-specific tuning.
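The interplay between learning rate reduction (patience 15) and early stopping (patience 30) can be traced with a simplified plain-Python simulation. This is an illustrative imitation, not the PyTorch scheduler itself: `simulate_schedule` is a hypothetical helper, and PyTorch's ReduceLROnPlateau counts non-improving epochs slightly differently, so exact epochs may differ by one.

```python
def simulate_schedule(val_losses, lr=1e-3, factor=0.5,
                      lr_patience=15, stop_patience=30):
    """Return (stop_epoch, final_lr, best_epoch) for a list of validation losses."""
    best = float("inf")
    best_epoch = None
    since_best = 0      # epochs since the best loss (drives early stopping)
    since_lr_event = 0  # non-improving epochs since the last LR reduction
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch = loss, epoch
            since_best = since_lr_event = 0
        else:
            since_best += 1
            since_lr_event += 1
        if since_lr_event >= lr_patience:
            lr *= factor          # halve the learning rate on a plateau
            since_lr_event = 0
        if since_best >= stop_patience:
            return epoch, lr, best_epoch
    return len(val_losses), lr, best_epoch

# Hypothetical loss curve: improves for three epochs, then plateaus.
losses = [1.0, 0.8, 0.6] + [0.7] * 100
stop_epoch, final_lr, best_epoch = simulate_schedule(losses)
print(stop_epoch, final_lr, best_epoch)
```

With these settings a plateau triggers at most two learning rate reductions before early stopping ends the run, and the checkpoint of interest is the one from `best_epoch`.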
We employed 5-fold cross-validation at the pair level (not at the protein level). The full dataset was partitioned into five equal, non-overlapping subsets. For each fold, one subset (~20% of all pairs) served as the validation set and the remaining four subsets (~80%) formed the training set. No pair appeared in both the training and validation sets within a fold. The validation set was used for early stopping, learning rate scheduling, and final performance evaluation. The model checkpoint with the best validation loss was retained for evaluation on the validation fold. The same five splits were applied identically to GrapePPI and all baseline models to ensure a fair comparison.
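A pair-level 5-fold split of this kind can be sketched as follows. The helper `kfold_pairs` and the toy pairs are assumptions for illustration; the sketch also makes visible a consequence of splitting by pair, namely that the same protein can occur in both the training and the validation pairs of a fold.

```python
import random

def kfold_pairs(pairs, k=5, seed=1234):
    """Yield (train, val) lists; each pair lands in exactly one validation fold."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k near-equal, disjoint subsets
    for i in range(k):
        val = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, val

# Hypothetical chain of 20 protein pairs.
pairs = [(f"P{i}", f"P{i + 1}") for i in range(20)]
splits = list(kfold_pairs(pairs))

train, val = splits[0]
# Proteins seen in both training and validation pairs of the first fold.
shared_proteins = ({p for pair in train for p in pair}
                   & {p for pair in val for p in pair})
print(len(train), len(val), len(shared_proteins) > 0)
```

No pair is shared between the two sides of a fold, but proteins are, so results from this protocol describe performance on new pairs among largely known proteins.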
All experiments were conducted on an NVIDIA L4 GPU (24 GB VRAM) with 52.96 GB system RAM using PyTorch 2.10 in a Google Colab environment. The random seed was fixed at 1234 for all experiments, including dataset splitting, negative sampling, model weight initialization, and dropout, to ensure full reproducibility.
2.4. Baseline Methods
We compared GrapePPI with three state-of-the-art PPI prediction methods:
DeepFE-PPI [
15]: Uses Res2vec embeddings and separate deep networks for each protein, followed by a joint prediction module.
PIPR [
9]: Employs a Siamese residual recurrent convolutional network with property-aware embeddings.
ESMAraPPI [
22]: An Arabidopsis-specific model that employs ESM embeddings for plant PPI prediction.
For fair comparison on grape-specific datasets (Grape-1 and Grape-10), all baseline models were retrained from scratch on the same grape data using the same 5-fold splits as GrapePPI. Hyperparameters and training procedures for the baselines were kept as in the original publications (no grape-specific tuning).
No modifications were made to the baseline model architectures or input formats: each method was applied with its own native input representation (DeepFE-PPI with Res2vec residue embeddings, PIPR with its property-aware sequence embeddings, and ESMAraPPI with ESM-1b embeddings), ensuring that each method is evaluated under the conditions for which it was designed.
2.5. Evaluation Metrics
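We used the following standard metrics:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where TP, TN, FP, FN denote true positives, true negatives, false positives, and false negatives, respectively.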
We also report:
ROC AUC: Area under the receiver operating characteristic curve.
PR AUC: Area under the precision-recall curve, especially informative for imbalanced data.
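The threshold-based metrics can be computed directly from confusion-matrix counts. The sketch below uses the standard definitions of accuracy, precision, recall, and F1; the counts are hypothetical and chosen to mimic a 1:10 class ratio, and the zero-division convention (returning 0.0) is an assumption of this example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0  # convention: 0 if undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical fold with 100 positives and 1000 negatives (1:10 ratio).
m = classification_metrics(tp=80, tn=980, fp=20, fn=20)

# A trivial classifier that predicts "non-interacting" for every pair.
trivial = classification_metrics(tp=0, tn=1000, fp=0, fn=100)

print(m)
print(trivial)
```

The trivial classifier still reaches about 91% accuracy while its F1 is zero, which is why F1 and PR AUC are the more informative metrics on imbalanced data.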
2.6. Statistical Analysis
One-way ANOVA was used to assess overall performance differences among models. Paired t-tests were conducted across the 5 folds for pairwise comparisons. All analyses were performed using scipy.stats in Python 3.12, with significance level set at p < 0.05.
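For intuition, the paired t statistic over five folds can be computed by hand with the standard library. The per-fold scores below are hypothetical (they are not results from this study), and the sketch stops at the t statistic and a comparison with the two-sided 5% critical value for 4 degrees of freedom; `scipy.stats.ttest_rel` would additionally return the p-value.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched fold scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Hypothetical per-fold F1 scores for two models on the same five folds.
model_a = [0.90, 0.89, 0.91, 0.88, 0.90]
model_b = [0.80, 0.78, 0.79, 0.77, 0.81]

t_stat, dof = paired_t_statistic(model_a, model_b)
T_CRIT_DF4 = 2.776  # two-sided 5% critical value of Student's t, 4 d.o.f.
significant = abs(t_stat) > T_CRIT_DF4
print(round(t_stat, 2), dof, significant)
```

Pairing by fold removes fold-to-fold variation that both models share, so the test is driven by the consistency of the per-fold differences rather than by the spread of the raw scores.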
2.7. Software Availability
To facilitate practical application, we provide Jupyter notebooks in the ‘grape-ppi’ directory of the GitHub repository (https://github.com/lchsz/grape-ppi, accessed on 9 February 2026). ‘train.ipynb’ allows users to train the GrapePPI model on custom datasets, while ‘predict.ipynb’ enables inference with a pre-trained GrapePPI model on new protein pairs. A companion dataset package is publicly available at https://drive.google.com/drive/folders/1pXfNQ98OtSXV-WY3ZFIsWZpqFjsguLEg (accessed on 9 February 2026). It includes pre-split training (‘train.tsv’), validation (‘val.tsv’), and test (‘test.tsv’; used to evaluate the pre-trained model via ‘predict.ipynb’) sets, together with pre-trained model weights (‘grape_1_model.pt’ and ‘grape_10_model.pt’ for Grape-1 and Grape-10, respectively). For this companion package, the dataset was partitioned into training, validation, and test sets at a 70/15/15 ratio using stratified random sampling (random seed 1234), splitting at the pair level so that no pair appears in more than one subset. The pre-trained model weights provided in this package were trained on the training and validation sets (70% and 15% of the data, respectively) using the same hyperparameters as described in the Training Procedure subsection above. Note that this fixed single split is provided solely to support end users of the notebooks; all experimental results reported in this paper were obtained using 5-fold cross-validation (train/validation partition per fold), as described in the Training Procedure subsection above.
3. Results
3.1. GrapePPI Demonstrates Strong Performance on Benchmark Datasets with Cross-Species Generalization
Although GrapePPI is designed for grape-specific PPI prediction, we first evaluated its generalization capability on widely used benchmark datasets: two yeast datasets (Yeast-1 and Yeast-2) and one Arabidopsis dataset (Ath). Yeast-1 and Yeast-2 are standard benchmarks from PIPR [9] and DeepFE-PPI [15], respectively. The Ath dataset is derived from ESMAraPPI [22], an Arabidopsis-specific model that also employs ESM embeddings. These datasets were constructed using different strategies, providing a robust test of model adaptability. We compared GrapePPI against three deep learning models: DeepFE-PPI, PIPR, and ESMAraPPI. Results are summarized in Table 1.
On Yeast-1, GrapePPI achieved the highest scores across all metrics: accuracy (96.17%), F1 (96.17%), ROC AUC (99.29%), and PR AUC (99.37%). PIPR performed second best (F1: 95.38%, PR AUC: 99.29%), followed by DeepFE-PPI. ESMAraPPI showed notably poor and variable results (F1: 52.57%, PR AUC: 69.73%), indicating limited generalization to non-plant species despite using ESM embeddings.
On Yeast-2, GrapePPI again outperformed all other models, attaining an accuracy of 95.58%, F1 of 95.54%, ROC AUC of 99.08%, and PR AUC of 99.21%. DeepFE-PPI ranked second, while ESMAraPPI showed moderate improvement over its Yeast-1 performance but still lagged behind the other methods.
On the Arabidopsis-specific Ath dataset, ESMAraPPI achieved the highest PR AUC (82.96%) and F1 (75.24%), with GrapePPI performing closely (PR AUC: 81.44%, F1: 73.16%). In contrast, the other models exhibited substantial drops in performance, particularly in recall and F1 score. This indicates that models designed primarily for non-plant data perform poorly on Arabidopsis-specific interaction patterns, whereas GrapePPI generalizes effectively across species.
GrapePPI consistently achieves competitive or superior performance across diverse benchmarks. It outperforms all compared methods on yeast data and attains performance comparable to a specialized plant model on Arabidopsis data, demonstrating robust generalization capability. Based on the above cross-dataset evaluation, GrapePPI demonstrated the potential to serve as a “universal PPI predictor”. Therefore, we selected GrapePPI as the foundation to proceed with training and evaluation for the grape PPI prediction task.
3.2. ESM2_t36_3B Yields Optimal Embeddings for Grape PPI Prediction
To optimize GrapePPI’s performance, we evaluated the impact of different protein language model embeddings. Given that ESMAraPPI achieved optimal performance using esm1b_t33_650M embeddings [22], we included this ESM1 variant in our comparison. Additionally, since ESM2 represents a more recent generation of pre-trained protein language models with improved capabilities compared to ESM1 [21], we also evaluated two ESM2 variants: esm2_t33_650M and esm2_t36_3B. Thus, we compared three pre-trained ESM variants: esm1b_t33_650M, esm2_t33_650M, and esm2_t36_3B. These models differ in architecture and scale: esm1b_t33_650M is an ESM1 variant with 33 transformer layers and 650 million parameters; esm2_t33_650M is an ESM2 variant with the same architecture scale (33 layers, 650M parameters) but with improved pre-training; and esm2_t36_3B is a larger ESM2 variant with 36 transformer layers and 3 billion parameters, offering increased model capacity. Performance was assessed on both the Grape-1 (balanced) and Grape-10 (imbalanced) datasets (Table 2).
On Grape-1, all three models performed well, with esm2_t36_3B achieving the highest F1 score (89.25%) and esm1b_t33_650M attaining the best PR AUC (95.11%). Differences among models were modest, indicating that GrapePPI can effectively utilize various ESM embeddings. On the more challenging Grape-10 dataset, esm2_t36_3B achieved the highest F1 (83.50%) and PR AUC (89.83%), outperforming the smaller models. The larger capacity of esm2_t36_3B (3B parameters) appears advantageous for capturing complex interaction patterns under imbalanced conditions. Overall, esm2_t36_3B provides the best balance of performance across both datasets and was therefore selected as the embedding source for GrapePPI.
3.3. GrapePPI Achieves Superior Performance on Grape-Specific Datasets
We evaluated GrapePPI on two grape-specific datasets with different class distributions: Grape-1 (balanced, 1:1 positive-to-negative ratio) and Grape-10 (imbalanced, 1:10 ratio). This dual evaluation assesses model robustness under both ideal and realistic (imbalanced) conditions.
3.3.1. Performance on the Balanced Grape-1 Dataset
On Grape-1, GrapePPI significantly outperformed all baseline models (Figure 2), achieving an F1 score of 89.34%, ROC AUC of 95.70%, and PR AUC of 95.29%.
Paired t-tests confirmed that GrapePPI’s improvements over ESMAraPPI, PIPR, and DeepFE-PPI were statistically significant (all p < 0.001; Table 3).
ROC and PR curves further illustrate GrapePPI’s superiority (Figure 3). Its ROC curve is closest to the top-left corner (AUC = 95.64%), and its PR curve maintains high precision across recall levels (PR AUC = 95.21%).
3.3.2. Performance on the Imbalanced Grape-10 Dataset
On the imbalanced Grape-10 dataset, GrapePPI again achieved the best overall performance (Table 4), with an F1 score of 85.43%, ROC AUC of 98.29%, and PR AUC of 90.87%. The PR AUC, which is especially informative for imbalanced data, was substantially higher than those of ESMAraPPI (56.43%), PIPR (79.80%), and DeepFE-PPI (54.35%). The performance gap between GrapePPI and other models widened under imbalance. For example, GrapePPI’s recall (85.48%) was more than double that of ESMAraPPI (37.44%). This demonstrates GrapePPI’s robustness to class imbalance, a common scenario in large-scale PPI prediction.
Statistical comparisons also demonstrated that GrapePPI significantly outperformed ESMAraPPI, PIPR, and DeepFE-PPI (all p < 0.01; Table 5).
ROC and PR curves for Grape-10 (Figure 4) further confirm GrapePPI’s advantage, with its curves consistently dominating those of other models.
3.3.3. Overall Assessment on Grape Data
GrapePPI consistently delivers superior performance on grape-specific data, under both balanced and imbalanced conditions. On Grape-1, it achieved statistically significant improvements in all metrics. On Grape-10, it maintained high recall and PR AUC, demonstrating particular robustness to class imbalance. These results validate GrapePPI as an effective and reliable framework for grape PPI prediction.
The baseline that is architecturally most similar to GrapePPI is ESMAraPPI, which also uses pre-trained ESM embeddings but feeds them into a single MLP without a dedicated sequence encoder. On Grape-1, GrapePPI outperformed ESMAraPPI by a large margin (F1: 89.34% vs. 78.0%). On Grape-10, the gap widened further (F1: 85.43% vs. 49.06%; PR AUC: 90.87% vs. 56.43%). This contrast highlights the novelty of GrapePPI: the sequence encoder and multi-layer predictor are not merely incremental additions but are critical for adapting general-purpose ESM representations to grape PPI prediction, especially under class imbalance where the simpler ESM and MLP architecture fails to maintain recall and precision.
3.3.4. Independent Held-Out Test Set Evaluation
To assess generalization on completely unseen data and to align with the companion package provided to users, we report GrapePPI performance on an independent held-out test set. As described in Section 2.7, the grape dataset was partitioned once at a 70/15/15 ratio (training, validation, test) using stratified random sampling at the pair level; the pre-trained models (grape_1_model.pt and grape_10_model.pt) were trained only on the training and validation portions. The test set (15% of pairs) was never used for training, validation, or model selection. Evaluation on this test set was performed using the released weights and the companion script (predict.ipynb). Results are summarized in Table 6.
On the balanced Grape-1 dataset, performance was high and consistent across metrics (accuracy 90.24%, F1 90.18%, precision and recall both around 90%), and both ROC AUC (95.86%) and PR AUC (95.52%) remained high, indicating strong discriminative performance. These results are consistent with the 5-fold cross-validation (e.g., F1 ~89.34% in Section 3.3.1), supporting that the model generalizes well without obvious overfitting. On the imbalanced Grape-10 dataset, accuracy reached 96.79% in part because the majority class (non-interacting pairs) dominates; the more informative metrics are recall (79.34%), F1 (82.33%), and PR AUC (89.18%), which show that the model still recovers a substantial fraction of true interactions and maintains a useful precision-recall trade-off under class imbalance. Precision (85.55%) was somewhat lower than on Grape-1, reflecting the inherent tension between recall and precision when positive samples are scarce.
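Two of the Grape-10 figures quoted above can be put in context with a short calculation: the F1 score follows from the reported precision and recall, and the accuracy of a trivial all-negative classifier on a 1:10 dataset gives a reference point for the 96.79% accuracy. The sketch assumes the held-out test set keeps the 1:10 ratio, as implied by stratified sampling.

```python
# Grape-10 held-out precision and recall quoted above (in %).
precision, recall = 85.55, 79.34

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of labelling every pair as non-interacting at a 1:10 ratio
# (assumes the test split preserves the class ratio).
majority_accuracy = 100 * 10 / (1 + 10)

print(round(f1, 2), round(majority_accuracy, 2))
# 82.33 90.91
```

The recomputed F1 matches the reported 82.33%, and the roughly 91% floor set by the majority class shows why accuracy alone says little on Grape-10.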
3.4. Synergistic Contributions of ESM Embeddings and Network Architecture
To clarify the individual contributions of ESM embeddings and the GrapePPI architecture, we conducted an ablation study on Grape-1, Grape-10, and Ath datasets using three model variants: (1) baseline: full GrapePPI architecture; (2) no_seq_encoder: no sequence encoder, using averaged ESM embeddings directly; (3) no_interaction_predictor: single linear layer after feature concatenation.
3.4.1. Sequence Encoder Is Essential for Grape PPI Prediction
The ablation study results reveal dataset-specific patterns regarding the importance of the sequence encoder (Table 7). On both Grape-1 and Grape-10 datasets, the baseline model (with full sequence encoder) consistently ranked first across all performance metrics, including accuracy, precision, recall, F1 score, ROC AUC, and PR AUC. In contrast, the no_seq_encoder variant, which uses averaged ESM embeddings directly without transformation, ranked second across all metrics on both grape datasets. This consistent ranking pattern indicates that the sequence encoder is essential for transforming general-purpose ESM embeddings into task-specific representations suitable for grape PPI prediction.
The performance gap between the baseline and no_seq_encoder variants was consistent across both balanced (Grape-1) and imbalanced (Grape-10) datasets, with the baseline model maintaining its advantage in all evaluated metrics. This robustness across different data distributions underscores that the sequence encoder’s role in adapting ESM embeddings is fundamental for grape PPI prediction, regardless of class balance. The two-layer fully connected network in the sequence encoder (2560 → 512 → 256) effectively reduces dimensionality while learning task-relevant features that enhance the discriminative power of the combined protein representations. Without this transformation, the raw ESM embeddings, despite their rich biological information, are less effective for distinguishing interacting from non-interacting protein pairs in grape datasets.
3.4.2. Dataset-Specific Architecture Requirements
Interestingly, the ablation results on the Ath dataset revealed a different pattern. On Arabidopsis data, the no_seq_encoder variant achieved the best overall performance, ranking first in recall, F1 score, ROC AUC, and PR AUC, while the baseline model ranked first only in accuracy and precision but third in recall. This suggests that for Arabidopsis, the raw ESM embeddings may already be well-suited for PPI prediction without additional transformation, possibly due to the model’s training data or Arabidopsis-specific protein characteristics. In contrast, the sequence encoder is crucial for grape PPI prediction, highlighting the species-specific nature of optimal architecture design.
3.4.3. Multi-Layer Interaction Predictor Is Critical for Learning Complex Patterns
The no_interaction_predictor variant consistently ranked last on both Grape-1 and Grape-10 datasets, with a pronounced performance drop on the imbalanced Grape-10 dataset. On the Ath dataset it also ranked last overall, though it achieved second place in recall. This underscores that a multi-layer predictor is necessary to effectively model complex interaction patterns from combined protein features across different species.
3.5. Comparison of Pooling Strategies
The GrapePPI model uses mean pooling over residue-level ESM embeddings to obtain one fixed-size vector per protein before the sequence encoder and interaction predictor. To assess whether a more expressive aggregation could help, we implemented GrapePPI_Seq, a variant that keeps residue-level embeddings and applies a configurable pooling module (mean, attention-based, or INT-FUP-like [25]) before the same sequence encoder and predictor. Attention-based pooling weights residues via a learned query and softmax over positions; INT-FUP-like pooling combines membership (attention-like scores) with hesitation (sequence-wise variance) to down-weight uncertain positions. Implementation details and full results on Grape-1 are provided in File S1. Paired t-tests showed no statistically significant difference between any two strategies (all p > 0.05). INT-FUP-like pooling achieved the highest mean precision and PR AUC, whereas mean pooling achieved the highest recall and F1. Given the lack of significant gains from attention-based or INT-FUP-like pooling and the simplicity of mean pooling, we retained mean pooling as the default in the main pipeline.
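The relation between attention-based and mean pooling can be illustrated with a dependency-free sketch. It shows a generic single-query softmax pooling, which is an assumption about the general form and not the GrapePPI_Seq implementation (see File S1 for that); `attention_pool` and the toy vectors are hypothetical.

```python
import math

def attention_pool(residues, query):
    """Softmax-weighted average of residue vectors, scored against a query."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in residues]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(residues[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, residues))
              for d in range(dim)]
    return pooled, weights

residues = [[1.0, 2.0], [3.0, 4.0]]  # toy residue embeddings

# A zero query scores every residue equally, which reduces to mean pooling.
pooled_uniform, w_uniform = attention_pool(residues, query=[0.0, 0.0])

# A non-zero query shifts weight toward residues aligned with it.
pooled_skewed, w_skewed = attention_pool(residues, query=[1.0, 1.0])

print(pooled_uniform, w_uniform)
print(w_skewed[1] > w_skewed[0])
```

Mean pooling is therefore a special case of this form of attention pooling, which may help explain why a learned query brings only modest changes when the uniform average is already informative.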
4. Discussion
In this study, we developed GrapePPI, a deep learning framework for grape protein–protein interaction prediction. Our results show that GrapePPI matches or outperforms state-of-the-art methods across multiple evaluation scenarios: it surpasses competing methods on yeast benchmarks, comes close to the Arabidopsis-specific ESMAraPPI on Arabidopsis, and leads on grape-specific datasets under both balanced and imbalanced class distributions. The success of GrapePPI stems from two key elements: the effective use of pre-trained ESM embeddings and a tailored network architecture that adapts these embeddings for PPI prediction.
GrapePPI showed strong generalization on benchmark datasets. On yeast datasets (Yeast-1 and Yeast-2), it consistently surpassed competing methods such as DeepFE-PPI and PIPR, and substantially outperformed ESMAraPPI. This cross-species performance is plausibly explained by two complementary factors. First, ESM embeddings trained on UniRef50 encode biochemical, evolutionary, and structural information that is conserved across organisms, providing a rich and transferable protein representation. Second, the GrapePPI sequence encoder further distills these representations into a task-adapted, PPI-relevant space, enabling effective discrimination of interacting from non-interacting pairs regardless of species. Notably, ESMAraPPI, although also ESM-based, performed poorly on yeast data (F1: 52.57% on Yeast-1), underscoring that embedding quality alone does not guarantee cross-species generalization; the architecture of the downstream predictor is equally critical. On the Arabidopsis dataset, GrapePPI achieved performance close to that of ESMAraPPI (PR AUC: 81.44% vs. 82.96%; F1: 73.16% vs. 75.24%), a method specifically designed and optimized for Arabidopsis PPI prediction. Non-plant methods (PIPR, DeepFE-PPI), by contrast, showed substantially lower recall and F1 on Ath, indicating limited adaptability to plant-specific interaction patterns. The contrasting behavior of ESMAraPPI across datasets (excellent on Arabidopsis but poor on yeast) illustrates that predictor training on a specific species can impose a strong species bias even when universal embeddings are used. Together, these results imply that both the choice of embedding model and the design of the downstream predictor must be validated for each target species rather than assumed to transfer.
GrapePPI’s consistent competitiveness across yeast and plant datasets suggests it can serve as a template for PPI prediction in other crops or less-studied species where experimental data are scarce, with the expectation that grape-trained weights provide a strong initialization that can be further fine-tuned with minimal species-specific data. Future work could systematically evaluate transfer learning from GrapePPI to species with even less experimental data (e.g., other fruit crops), and investigate whether fine-tuning the ESM encoder on plant-specific sequences further improves generalization.
The four compared methods (DeepFE-PPI, PIPR, ESMAraPPI, and GrapePPI) rely on different pre-trained or handcrafted amino acid (or residue-level) embeddings, which likely contribute to the observed performance differences. DeepFE-PPI uses Res2vec, a shallow distributional embedding analogous to word2vec for residues, learned from co-occurrence statistics in sequence corpora [15]. PIPR employs property-aware embeddings that encode physicochemical and structural properties of amino acids within a Siamese recurrent–convolutional architecture [9]. These are not large-scale language-model embeddings but task-oriented representations. ESMAraPPI and GrapePPI both use ESM protein language models, but from different model generations and scales: ESMAraPPI uses ESM-1b (esm1b_t33_650M, 650M parameters), whereas GrapePPI uses ESM2 (esm2_t36_3B, 3B parameters), which benefits from improved pre-training and greater capacity [19]. The superior performance of GrapePPI on grape data and its robustness under class imbalance are consistent with the hypothesis that richer, evolutionarily informed ESM2 embeddings, combined with the GrapePPI architecture, better capture interaction-relevant sequence patterns than Res2vec or property-aware embeddings when training data are limited or imbalanced. Because the four methods use different sequence representations, our direct comparison reflects both the choice of embedding and the design of the predictor; performance differences therefore cannot be attributed to architecture alone. Future work could fix the embedding type (e.g., re-implementing baselines with ESM2) to isolate the contribution of the prediction architecture.
Among the ESM variants tested, esm2_t36_3B was the most effective embedding source. This larger model (3B parameters) excelled on the imbalanced Grape-10 dataset, achieving the highest F1 score and PR AUC, which suggests that greater model capacity helps capture complex interaction patterns under challenging data conditions. Smaller ESM models still performed competitively on balanced data, suggesting that GrapePPI can be adapted to different resource constraints while retaining reasonable accuracy.
On grape-specific data, GrapePPI delivered superior performance. For the balanced Grape-1 dataset, it achieved statistically significant improvements over all baselines (F1: 89.34%, PR AUC: 95.29%). More importantly, on the imbalanced Grape-10 dataset, which better reflects the real-world scarcity of positive samples, GrapePPI remained robust. Its recall substantially exceeded that of other methods, demonstrating a strong ability to identify true interactions amid class imbalance, a crucial trait for large-scale PPI screening.
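The emphasis on recall, F1, and PR AUC rather than accuracy under class imbalance can be illustrated with a small worked example. The confusion-matrix counts below are hypothetical and are not results from this study; they assume a 1:10 positive-to-negative ratio, which the name Grape-10 suggests but this section does not state.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, F1, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Hypothetical test set: 100 interacting pairs and 1000 non-interacting pairs.

# A trivial classifier that labels every pair as non-interacting.
all_negative = classification_metrics(tp=0, fp=0, fn=100, tn=1000)

# A classifier that recovers 80 of the 100 true interactions with 40 false alarms.
useful_model = classification_metrics(tp=80, fp=40, fn=20, tn=960)
```

Here the trivial classifier reaches about 91% accuracy while finding no interactions at all (recall 0), whereas the second classifier has only slightly higher accuracy (about 95%) but an F1 of about 0.73. Accuracy alone would barely separate the two, which is why recall and F1 carry more information for screening-style tasks.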
Ablation studies clarified the contributions of each component. Both high-quality ESM embeddings and a carefully designed network architecture are needed for optimal performance, though the optimal architecture may vary by species. For grape PPI prediction, the sequence encoder was essential for adapting general ESM embeddings to the PPI task; variants without it showed marked performance drops, whereas on Arabidopsis the variant without the sequence encoder performed best on most metrics. Similarly, the multi-layer interaction predictor was indispensable for capturing intricate interaction patterns: the simplified variant ranked last on both Grape-1 and Grape-10 and last overall on Ath. The following paragraphs explain why, despite the prevalence of attention and graph-based designs in PPI prediction, GrapePPI keeps a compact fully connected head on top of frozen ESM2.
Why we use an MLP-only head instead of extra attention layers. Many recent PPI predictors use attention mechanisms. GrapePPI instead uses a deliberately simple pipeline: frozen ESM2 embeddings, mean pooling, a shared fully connected sequence encoder per protein, concatenation of the two encoded vectors, and a multi-layer interaction predictor. Experiments on how to fuse the two encoded protein vectors (Section 2.2) and on how to pool within each sequence at residue resolution (Section 3.5 and File S1) showed that, under our grape data and evaluation protocol, added attention on top of pooled vectors did not consistently improve accuracy.
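The pipeline described above (mean pooling, a shared fully connected encoder, concatenation, and a multi-layer predictor) can be sketched in a few lines of dependency-free Python. This is a structural illustration only, not the GrapePPI code: the layer sizes, the ReLU activation, the random untrained weights, and the helper names (`linear`, `init_layer`, `predict_pair`) are all assumptions, and random vectors stand in for frozen ESM2 embeddings.

```python
import math
import random


def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, for illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mean_pool(residues):
    n = len(residues)
    return [sum(r[d] for r in residues) / n for d in range(len(residues[0]))]


def predict_pair(residues_a, residues_b, encoder, predictor):
    """Return an interaction probability for one protein pair."""
    enc_w, enc_b = encoder
    # The same encoder weights are applied to both proteins (shared encoder).
    z_a = relu(linear(mean_pool(residues_a), enc_w, enc_b))
    z_b = relu(linear(mean_pool(residues_b), enc_w, enc_b))
    fused = z_a + z_b  # list concatenation, not element-wise addition
    (w1, b1), (w2, b2) = predictor
    hidden = relu(linear(fused, w1, b1))
    logit = linear(hidden, w2, b2)[0]
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
EMB_DIM, ENC_DIM, HID_DIM = 8, 4, 4  # toy sizes; esm2_t36_3B embeddings are 2560-dimensional

encoder = init_layer(EMB_DIM, ENC_DIM, rng)
predictor = (init_layer(2 * ENC_DIM, HID_DIM, rng), init_layer(HID_DIM, 1, rng))

# Random stand-ins for frozen residue-level embeddings of two proteins (5 and 7 residues).
protein_a = [[rng.gauss(0.0, 1.0) for _ in range(EMB_DIM)] for _ in range(5)]
protein_b = [[rng.gauss(0.0, 1.0) for _ in range(EMB_DIM)] for _ in range(7)]

score = predict_pair(protein_a, protein_b, encoder, predictor)
```

Because the two encoded vectors are concatenated, the output of this sketch can depend on the order in which the proteins are supplied; whether and how the actual model handles pair order is not stated in this section.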
Attention already in the embedding stage. ESM2 (esm2_t36_3B) is Transformer-based and applies multi-head self-attention over the full sequence. After mean pooling, each protein is therefore already represented by a vector that aggregates attention-weighted, context-dependent information. Stacking a second attention module on these pooled vectors risks redundancy and adds parameters where labels are scarce, which increases overfitting risk for grape relative to yeast or human benchmarks.
Inter-protein fusion and intra-protein pooling. Merging the two encoded protein representations (after the sequence encoder) via element-wise product or attention-based fusion, and combinations thereof, did not consistently outperform simple concatenation (Section 2.2). Separately, aggregating residue embeddings within each protein before the same encoder and predictor was compared in GrapePPI_Seq using mean, attention-based, and INT-FUP-like pooling; paired t-tests found no significant difference between strategies, and mean pooling gave the highest recall and F1 on Grape-1 (Section 3.5, File S1). Together, these results support retaining concatenation for the two-protein merge and mean pooling within each sequence, rather than introducing further attention layers.
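The kind of paired comparison referred to above can be sketched as follows. The example assumes that the paired t-test is computed over per-fold scores from 5-fold cross-validation, which is a guess about the protocol rather than a confirmed detail, and the per-fold F1 values are invented for illustration. Only the standard library is used, so significance is judged against the two-sided 5% critical value of Student's t distribution with 4 degrees of freedom (2.776) instead of an exact p-value.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t statistic, degrees of freedom) for paired samples."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Standard error of the mean difference, using the sample standard deviation.
    std_err = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / std_err, n - 1


# Hypothetical per-fold F1 scores for two pooling strategies (not results from the paper).
f1_mean_pooling = [0.890, 0.885, 0.895, 0.880, 0.892]
f1_attention_pooling = [0.888, 0.889, 0.890, 0.884, 0.887]

t_stat, dof = paired_t_statistic(f1_mean_pooling, f1_attention_pooling)

T_CRITICAL_DF4 = 2.776  # two-sided, alpha = 0.05, 4 degrees of freedom
significant = abs(t_stat) > T_CRITICAL_DF4
```

With only five folds the test has little power, so a non-significant result of this kind indicates that no difference was detected rather than that the strategies are equivalent.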
Graph-based and structure-aware alternatives. Graph neural approaches usually require a graph at training or inference. Here we target pair-level prediction from sequences alone and compare fairly to sequence baselines (DeepFE-PPI, PIPR, ESMAraPPI). Building graphs from STRING or sequence similarity would add assumptions, leakage risk, and complexity; methods that rely on AlphaFold contact maps also depend on structure coverage that is uneven for grape. We therefore leave graph-based or structure-aware extensions to future work when network context or uniform structural inputs are available, or when the task is framed as graph prediction rather than isolated pair classification (see also Limitations and future work).
Our negative samples were obtained by random pairing of proteins not known to interact in STRING. Such negatives may include undiscovered true interactions (false negatives), which can lead to underestimation of model performance: some “negative” pairs could be real PPIs not yet in the database. Alternative strategies—e.g., filtering by subcellular localization so that pairs in incompatible compartments are treated as negatives, or domain-based exclusion—could reduce this bias but were not used here because of limited annotation for grape and to keep the pipeline simple and reproducible. Future work could incorporate more sophisticated negative sampling or integrate additional biological constraints as they become available.
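The random-pairing strategy described above can be sketched as follows. This is a minimal illustration rather than the authors' sampling code: the function name `sample_negatives`, the protein identifiers, and the pair counts are hypothetical, and the set of known positives stands in for interactions retrieved from STRING.

```python
import random


def sample_negatives(proteins, positives, n_negatives, seed=0):
    """Draw random unordered protein pairs that are not listed as known interactions."""
    rng = random.Random(seed)
    pool = sorted(set(proteins))
    known = {frozenset(pair) for pair in positives}

    # Guard against asking for more negatives than there are unlabeled pairs.
    blocked = sum(1 for pair in known if len(pair) == 2 and pair <= set(pool))
    max_possible = len(pool) * (len(pool) - 1) // 2 - blocked
    if n_negatives > max_possible:
        raise ValueError("not enough unlabeled pairs to sample from")

    negatives = set()
    while len(negatives) < n_negatives:
        a, b = rng.sample(pool, 2)  # two distinct proteins, so no self-pairs
        pair = frozenset((a, b))
        if pair not in known:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


# Hypothetical identifiers and known interactions.
proteins = ["P1", "P2", "P3", "P4", "P5", "P6"]
positives = [("P1", "P2"), ("P2", "P3")]

negatives = sample_negatives(proteins, positives, n_negatives=4)
```

Nothing in this procedure can tell a true non-interaction from an interaction that is simply missing from the database, which is exactly the false-negative risk discussed above. A localization or domain filter would be added at the point where `pair not in known` is checked.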
Practically, GrapePPI can accelerate grape functional genomics by predicting interaction networks involved in development, stress response, and disease resistance. Such predictions can guide targeted experimental validation, reducing the cost and time of high-throughput screening. Moreover, identified networks can inform molecular breeding programs by highlighting candidate proteins and pathways for crop improvement. The model’s robustness to class imbalance makes it particularly suitable for genome-wide PPI prediction, where non-interacting pairs dominate.
Limitations and future work. Our approach relies on STRING for positive labels and on random negative sampling, with the limitations noted above. Regarding evaluation, our 5-fold cross-validation uses the validation fold both for early stopping and for final performance evaluation, which may introduce a slight optimistic bias in the reported absolute metrics; however, since all compared methods were evaluated under identical conditions, the relative comparisons and statistical conclusions should remain valid. Regarding limited data, we did not explicitly model data scarcity (e.g., via meta-learning or few-shot learning); instead we relied on transferable ESM representations and a compact architecture that performs well with limited grape data. To support interpretability and flexible aggregation of sequence features, we plan to add the GrapePPI sequence model (which accepts residue-level ESM embeddings, as described in File S1) to the next version of GrapePPI, so that users can perform per-residue interpretability analyses (e.g., saliency maps and residue importance) and optionally use configurable pooling strategies. Beyond this planned extension, future directions include: (1) interpretability analyses such as saliency maps on sequences, residue-level importance visualization, and functional enrichment of predicted PPIs to extract biological insights; (2) advanced pooling methods (e.g., intuitionistic fuzzy pooling) to capture sequence features more richly; (3) oversampling or hybrid sampling (e.g., FADA-SMOTE-based methods [26]) for imbalanced grape data; (4) explicit modeling of data scarcity for crop species; and (5) graph-based or structure-aware predictors (e.g., GNNs over PPI or similarity graphs, or integration of predicted contact maps when uniformly available) when the use case provides network context or when genome-wide screening is framed as graph prediction rather than isolated pair classification.
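One common way to avoid the optimistic bias noted above is to carve a separate early-stopping split out of the training portion of each fold, so that the held-out fold is used only for the final evaluation. The sketch below illustrates that idea on sample indices; it is not the protocol used in this study, and the function name `folds_with_validation`, the 10% validation fraction, and the sample count are assumptions. The split is also not stratified, which would matter for an imbalanced dataset.

```python
import random


def folds_with_validation(n_samples, k=5, val_fraction=0.1, seed=0):
    """Yield (train, val, test) index lists for k-fold cross-validation.

    The test fold is never used for early stopping; a separate validation
    split is taken from the remaining samples instead.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        rest = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        n_val = max(1, int(len(rest) * val_fraction))
        val, train = rest[:n_val], rest[n_val:]
        yield train, val, test


splits = list(folds_with_validation(n_samples=50))
```

Early stopping would then monitor the `val` indices, and the metrics reported for each fold would be computed once on `test`, at the cost of slightly less training data per fold.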
In summary, GrapePPI advances computational PPI prediction for crops by integrating pre-trained ESM embeddings with a task-specific architecture. It achieves state-of-the-art performance on grape data, generalizes well across species, and handles class imbalance effectively. As experimental PPI data expand and protein language models improve, GrapePPI and similar frameworks will become increasingly valuable for plant biology and agricultural innovation.