Modeling Peptide–Protein Interactions by a Logo-Based Method: Application in Peptide–HLA Binding Predictions

Peptide–protein interactions form a cornerstone of molecular biology, governing cellular signaling, structure, and enzymatic activities in living organisms. Improving computational models and experimental techniques to describe and predict these interactions remains an ongoing area of research. Here, we present a computational method for describing and predicting peptide–protein interactions based on amino acid frequencies within specific binding cores. Using normalized frequencies, we construct quantitative matrices (QMs), termed ‘logo models’ because they are derived from sequence logos. The method was developed to predict peptide binding to HLA-DQ2.5 and HLA-DQ8.1, proteins associated with susceptibility to celiac disease. The models were validated on more than 17,000 peptides, demonstrating their efficacy in discriminating between binding and non-binding peptides. The logo method could be applied to diverse peptide–protein interactions, offering a versatile tool for predictive analysis in molecular binding studies.


Introduction
Peptide-protein interactions are among the foundations of molecular biology, underlying a diverse array of cellular functions such as signaling, structural organization, and enzymatic activity within living organisms. These interactions are based on electrostatic attractions and repulsions, hydrogen bonding, van der Waals forces, and hydrophobic effects, and they determine the affinity between the molecules and the stability of peptide-protein complexes [1]. They not only dictate the folding and assembly of proteins but also play a crucial role in modulating their function. For example, the binding of the peptide insulin to its receptor regulates glucose metabolism [2]. Chaperone proteins, assisted by peptide interactions, facilitate the correct folding of newly synthesized proteins and prevent misfolding, contributing to cellular homeostasis [3]. Peptide-based therapeutics, such as peptide inhibitors or mimetics, target specific protein-protein interactions associated with diseases, offering promising opportunities for novel drug design and development [4].
Advancements in experimental techniques have significantly contributed to unraveling the complexities of peptide-protein interactions. Nuclear Magnetic Resonance (NMR) spectroscopy, X-ray crystallography, and cryo-electron microscopy provide detailed structural insights into these interactions at the atomic level, elucidating binding interfaces and conformational changes [5]. Additionally, computational approaches, including molecular dynamics simulations and docking studies, complement experimental data, facilitating the prediction and understanding of peptide-protein interactions in silico [6].
Here, we present a computational method for the quantitative description of peptide-protein interactions based on amino acid frequencies at each of the nine positions of the peptide binding core. The normalized frequencies enter a quantitative matrix (QM). Because the amino acid frequencies for a given binding motif are presented by a sequence logo [7], we named the QMs derived by this method 'logo models'. We describe the development of logo models for predicting peptide binding to the proteins HLA-DQ2.5 and HLA-DQ8.1. HLA-DQ2 and HLA-DQ8 proteins play a pivotal role in presenting gluten fragments to immune cells, setting off an immune response in individuals with celiac disease. These proteins bind gluten peptides and present them to T cells, which triggers the activation of the immune system and ultimately leads to damage of the small intestine [8,9]. The peptide binding sites of HLA-DQ2 and HLA-DQ8 are well-conserved binding grooves that accommodate peptide binding cores composed of nine amino acids with specific compositions. The assessment of novel proteins for their potential as HLA-DQ2 and/or HLA-DQ8 binders is important for the development of safe and nutritious products accessible to all consumers, including those managing celiac disease. The ability of the logo models to recognize peptides binding to these two alleles was evaluated on external test sets. Although the logo method was developed to predict peptide binding to DQ proteins, it is general and can, in principle, be applied to any peptide-protein interaction.

Sequence Logo-Based Model for Peptide Binding Prediction for HLA-DQ2.5
The training and test datasets contain known binders to HLA-DQ proteins (positive sets) and non-binders (negative sets). The training sets were used for the development of logo models. The test sets were used for the validation of the derived models.
The training set of binders to HLA-DQ2.5 was derived from Stepniak et al. [10]. It consisted of 125 nonamer binding cores. Despite our efforts to curate a reliable negative set of experimentally validated non-binders from various databases and references, we were unable to derive such a set. Instead, we generated a set of non-binders by taking all combinations of non-preferred amino acids across the nine positions of the binding core, according to the known binding motifs for DQ2.5 [11]. Because each such peptide carries an experimentally confirmed non-preferred amino acid at every position, the resulting sequences are classified as non-binders. The non-preferred amino acids for DQ2.5 are as follows: for position 1 (p1): Ala, Arg, Asn, Asp, Glu, Gly, Lys, Ser, and Thr; for position 2 (p2): Ala and Leu; for position 3 (p3): Arg, Asp, Glu, Ile, Leu, and Lys; for position 4 (p4): Ala, Arg, Gln, Gly, Ile, Leu, Lys, Met, Phe, Ser, Thr, Trp, and Val; for position 5 (p5): Arg, Leu, Lys, and Thr; for position 6 (p6): Arg, Asn, Gln, Gly, Leu, Lys, Pro, Ser, Thr, and Val; for position 7 (p7): Ala, Asn, Gln, Gly, Leu, Lys, Pro, Ser, Thr, and Val; for position 8 (p8): Ala, Gln, Ile, Leu, and Thr; for position 9 (p9): Ala, Arg, Gln, Lys, Met, Pro, Ser, and Thr [11]. Their total combination generated a pool of 24,710,400 non-binding nonamers. From this pool, a set of 125 nonamers was randomly selected and used as a negative training set for the development of the logo model for DQ2.5.
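The random selection of non-binders from the combinatorial pool can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `sample_non_binders`, the one-letter residue encoding, and the fixed random seeds are hypothetical, and the residue lists are transcribed from the paragraph above. Rather than enumerating the whole pool, the sketch draws one non-preferred residue per position, which samples uniformly from the set of all combinations, and discards repeats.

```python
import math
import random

# Non-preferred residues for DQ2.5 at p1..p9, transcribed from the text
# and converted from three-letter to one-letter codes.
NON_PREFERRED_DQ25 = [
    "ARNDEGKST",      # p1
    "AL",             # p2
    "RDEILK",         # p3
    "ARQGILKMFSTWV",  # p4
    "RLKT",           # p5
    "RNQGLKPSTV",     # p6
    "ANQGLKPSTV",     # p7
    "AQILT",          # p8
    "ARQKMPST",       # p9
]


def sample_non_binders(allowed, n, exclude=(), seed=0):
    """Draw n distinct nonamers built only from the allowed residues.

    `exclude` holds peptides that must not be drawn again, for example
    the negative training set when drawing a negative test set.
    """
    excluded = set(exclude)
    pool_size = math.prod(len(residues) for residues in allowed)
    if n + len(excluded) > pool_size:
        raise ValueError("requested more peptides than the pool can supply")
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < n:
        peptide = "".join(rng.choice(residues) for residues in allowed)
        if peptide not in excluded:
            chosen.add(peptide)
    return sorted(chosen)


negative_training = sample_non_binders(NON_PREFERRED_DQ25, 125)
negative_test = sample_non_binders(
    NON_PREFERRED_DQ25, 4249, exclude=negative_training, seed=1
)
```

Passing the training peptides as `exclude` keeps the two negative sets disjoint, which mirrors the requirement that test non-binders differ from those used in model development.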
The test set of binders to DQ2.5 was obtained from LC-MS data, comprising 4249 binders of varying lengths [12]. The same number of non-binding nonamers was randomly chosen from the non-binder pool, excluding the nonamers in the negative training set used for model development. Table 1 presents a summary of the peptides utilized in this study.
The amino acid frequencies at each position within a nonamer are normalized by their mean using the following formula:
$$X_{i,\mathrm{norm}} = \frac{X_i - X_{\mathrm{mean}}}{X_{\max} - X_{\min}}$$
where $X_i$ is the frequency of amino acid $i$ at a given position, $X_{\mathrm{mean}}$ is the mean frequency at the same position, and $X_{\max}$ and $X_{\min}$ are the maximum and the minimum frequencies, respectively, at the same position. The normalized values fall in the range [−1, 1]. The normalized values are organized into a quantitative matrix (QM) measuring 9 positions by 20 amino acids (Table 2). This QM is termed a 'logo model,' inspired by the graphical representation of amino acid sequence conservation in proteins [7]. In the sequence logo, each position is represented by a stack of letters, where the size of the letters reflects their frequency within the sequences. Figure 1 (left) presents the sequence logo for HLA-DQ2.5, derived from the training set comprising 125 binding nonamers.
We further develop this method by quantifying the amino acid frequencies at each position of the binding peptide. In the logo model, the size of the letters is substituted by positive and negative quantitative values (Table 2). The positive values correspond to preferred amino acids at a given peptide position, and the negative values correspond to non-preferred ones.
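A logo model can be built from a set of nonamers along the following lines. This is a hedged sketch rather than the authors' implementation. It assumes mean normalization of the form (X_i − X_mean) / (X_max − X_min), which matches the symbol definitions and the stated [−1, 1] range, and it assumes that the mean is taken over all 20 amino acids with absent residues counted as zero. The names `build_logo_model` and `toy_binders` are hypothetical. Because the normalization is scale-invariant, raw counts and relative frequencies give the same matrix.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_logo_model(nonamers):
    """Return a 9-element list of dicts: amino acid -> normalized frequency."""
    model = []
    for pos in range(9):
        counts = Counter(peptide[pos] for peptide in nonamers)
        # Relative frequency of every amino acid at this position.
        freqs = {aa: counts.get(aa, 0) / len(nonamers) for aa in AMINO_ACIDS}
        mean = sum(freqs.values()) / len(AMINO_ACIDS)
        spread = max(freqs.values()) - min(freqs.values())
        # A position with a flat distribution carries no preference.
        model.append(
            {aa: (f - mean) / spread if spread else 0.0 for aa, f in freqs.items()}
        )
    return model


# Toy data for illustration only; real training sets come from the literature.
toy_binders = ["AAAAAAAAA", "AAAAAAAAA", "CAAAAAAAA"]
toy_model = build_logo_model(toy_binders)
```

In the toy model, the most frequent residue at a position receives the largest positive value and every unobserved residue receives the same small negative value, which is the sign convention described for preferred and non-preferred amino acids.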
Similarly, a logo model was constructed for the set of non-binders (Table 3). The two logo models were used to calculate the binding score (BS) and the non-binding score (NBS) of a tested peptide by summing the values of the corresponding amino acids at each position. If BS is higher than NBS, the nonamer is classified as a binder. Otherwise, it is a non-binder. The predictive ability of the logo models was tested on an external test set consisting of 4249 binders and 4249 non-binders to HLA-DQ2.5. The performance of the logo models is given in Table 4. They recognize 90% of the binders and 100% of the non-binders.
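The scoring and classification rule can be illustrated with a short sketch. The function names `score` and `classify`, the helper `toy_model`, and the toy matrices are assumptions made for illustration; the real matrices are those of Tables 2 and 3. Each model is represented as a list of nine dictionaries mapping an amino acid to its normalized value.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def score(model, nonamer):
    """Sum the matrix values of the residues found at positions p1..p9."""
    if len(nonamer) != 9:
        raise ValueError("expected a nine-residue binding core")
    return sum(model[pos][aa] for pos, aa in enumerate(nonamer))


def classify(nonamer, binder_model, non_binder_model):
    """Return (label, BS, NBS); a binder requires BS strictly above NBS."""
    bs = score(binder_model, nonamer)
    nbs = score(non_binder_model, nonamer)
    label = "binder" if bs > nbs else "non-binder"
    return label, bs, nbs


def toy_model(favoured):
    """Hypothetical matrix favouring one residue at every position."""
    return [
        {aa: (0.9 if aa == favoured else -0.05) for aa in AMINO_ACIDS}
        for _ in range(9)
    ]


binder_qm = toy_model("E")      # stands in for the binder logo model
non_binder_qm = toy_model("K")  # stands in for the non-binder logo model

label, bs, nbs = classify("EEEAEEAEA", binder_qm, non_binder_qm)
```

A tie between BS and NBS falls on the non-binder side, following the rule that a nonamer is a binder only if BS is higher than NBS.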
When the BS/NBS ratio for the binders and the NBS/BS ratio for the non-binders approach unity (between −1 and +1), the prediction uncertainty increases: 47% of the false negatives (FNs) fall within this range, as depicted in Figure 2. Among the true positives (TPs), only 2% fall within this range, and among the true negatives (TNs) and false positives (FPs), the share drops to 0%. Optimal predictions occur at ratios below −1 and above +1.

Sequence Logo-Based Model for Peptide Binding Prediction to HLA-DQ8.1
A set of 463 strong binding nonamers was selected from Tran et al. [13] and used to derive the sequence logo for DQ8.1 (Figure 1, right). The logo derived in the present study is in good agreement with Nielsen's logo [14]. The only difference is the preference for Val and Ile at anchor p1, followed by Glu, according to our model. The p1 pocket is deep and polar, lined by two positively charged residues, His24 and Arg52, and two negatively charged ones, Glu31 and Glu86 [15]. Henderson et al. [15] have shown that Glu at p1 forms two salt bridges with Arg52 and a water-mediated network with His24, Glu31, and Arg53. The preference for the hydrophobic Val and Ile at p1 is a novel observation for HLA-DQ8.1.

The normalized amino acid frequencies for binders and non-binders enter the corresponding logo models (Tables 5 and 6). Their predictive ability was tested on an external test set. The test set of binders to HLA-DQ8.1 was collected from LC-MS data and contained 4339 known strong binders [13]. An equal number of non-binding nonamers was randomly chosen from the non-binder pool, excluding the nonamers present in the negative training set. The performance is given in Table 4. The models recognize 98% of the binders and 100% of the non-binders. Here again, as with the predictions for HLA-DQ2.5, uncertainty in prediction increases when the BS/NBS and NBS/BS ratios approach unity (between −1 and +1). Among the true positives, 40% show a BS/NBS ratio within this range, while 85% of the false negatives are within this range (Figure 3). All true negatives present an NBS/BS ratio below −1, and no false positives were observed.


Discussion
In the present study, we introduce a new method for evaluating and predicting interactions between peptides and proteins that can be applied universally. The method employs a scoring system based on amino acid frequencies at specific positions in binding and non-binding peptides. We trained the method using datasets of known binders and non-binders to HLA-DQ2.5 and HLA-DQ8.1 and evaluated its performance on external test sets of more than 17,000 peptides. The results demonstrated that our approach achieved a high level of accuracy in predicting peptide binding to these two proteins. The derived models could be applied for the in silico search of peptides binding to HLA-DQ2.5 and/or HLA-DQ8.1, according to the European Food Safety Authority's (EFSA) guidance on the risk assessment of novel proteins and their capacity to cause celiac disease [16]. The guidance outlines a framework for assessing the potential risk posed by new proteins introduced into the food supply, including the use of in silico, in vitro, and in vivo methods to evaluate their potential for triggering an immune response. According to this guidance, the tested protein is first compared to known proteins associated with celiac disease, and if any identity or similarity is observed, the protein is searched for binders to HLA-DQ2.5 and/or HLA-DQ8.1. If strong peptide binders to HLA-DQ2.5 and DQ8.1 are predicted, then HLA-DQ binding assays are performed to confirm or reject these predictions, together with in vitro digestibility tests and/or tests for identification of T-cell epitopes. The guidance aims to ensure the safety of novel proteins in the food supply and provides a useful tool for food manufacturers and regulatory agencies to evaluate the risk of celiac disease associated with new proteins.
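One way to search a protein for candidate binders is to slide a nine-residue window along its sequence and pass each window to a classifier. The sketch below is an assumption about how such a scan could be organized and is not a procedure specified in the text or in the EFSA guidance. The names `nonamer_windows` and `scan_protein`, the example sequence, and the stand-in predicate are all hypothetical; in practice the predicate would be the BS > NBS rule of the logo models.

```python
def nonamer_windows(protein, core_length=9):
    """Yield (zero-based start, window) for every overlapping window."""
    for start in range(len(protein) - core_length + 1):
        yield start, protein[start:start + core_length]


def scan_protein(protein, is_binder):
    """Return (one-based start, nonamer) for windows the predicate accepts."""
    return [
        (start + 1, core)
        for start, core in nonamer_windows(protein)
        if is_binder(core)
    ]


# Hypothetical stand-in for a logo-model classifier, for illustration only.
def toy_predicate(core):
    return core.count("E") >= 4


example_protein = "MKEELLEEPAQW"  # made-up sequence, not a real protein
hits = scan_protein(example_protein, toy_predicate)
```

A sequence shorter than nine residues yields no windows, so the scan returns an empty list rather than raising an error.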
The main binding motif for HLA-DQ2.5 includes bulky hydrophobic residues at positions p1 and p9 and negatively charged residues at positions p4, p6, and/or p7 [17,18]. Stepniak et al. have defined another binding motif consisting of proline or polar residues at p1; acidic or polar residues at p4, p6, and p7; and hydrophobic or polar residues at p9 [10]. Koşaloğlu-Yalçın et al. analyzed peptides eluted and identified by high-throughput mass spectrometry and found that 75% of the HLA-DQ2.5 binders did not conform to any known binding motif [11]. They conclude that HLA-DQ2.5 can bind peptides promiscuously using alternate modes.
The binding motif for HLA-DQ2.5 obtained in the present study is compatible with Stepniak's definition because the positive training set used to derive the logo model for DQ2.5 binders was compiled from their paper [10]. Nevertheless, the model recognized 90% of 4249 binders from the external test set. We attribute this high predictive ability to the unique selection of the negative set, which has been defined in a novel way.
The binding motif for HLA-DQ8.1 is more consistent. Specifically, HLA-DQ8.1 prefers peptides that contain negatively charged or polar residues at positions p1, p7, p8, and p9 and small aliphatic residues at positions p4 and p6 [13]. The predictive ability of the logo models here is even higher: 99% accuracy for a test set of 8678 peptides. Again, we attribute this to the combinatorial library of non-binders generated from the non-preferred amino acids at all nine positions of the binding core.
From the pool of DQ2.5 non-binders, a subset of 125 nonamers was randomly chosen as a negative training set for the development of the DQ2.5 logo model. Similarly, for DQ8.1, a subset of 463 nonamers was randomly selected for the development of the DQ8.1 logo model from the non-binder pool.
The test sets comprising binders were compiled from LC-MS data. For HLA-DQ2.5, the test set encompassed 4249 binders of varying lengths [12], while for HLA-DQ8.1, it comprised 4339 known strong binders [13]. To ensure unbiased testing, an equivalent number of non-binding nonamers were randomly drawn from the non-binder pool, excluding those from the negative training sets used in model development.

Measures of Validation Accuracy
The predictive capability of the logo models developed in this study was assessed by sensitivity, specificity, and accuracy. Sensitivity gauges the accurate prediction of binding peptides and is computed using the formula:
$$\mathrm{Sensitivity} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$$
Specificity measures the accurate prediction of non-binding peptides and is calculated using the formula:
$$\mathrm{Specificity} = \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}}$$
Accuracy reflects the overall performance of the models and is calculated using the formula:
$$\mathrm{Accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{true positives} + \text{true negatives} + \text{false positives} + \text{false negatives}}$$
True positives represent peptides correctly predicted as binders, while true negatives are peptides correctly identified as non-binders. False positives refer to non-binders incorrectly predicted as binders, and false negatives are binders incorrectly predicted as non-binders.
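The three measures follow directly from the four confusion-matrix counts, as the short sketch below shows. The function name `validation_metrics` is hypothetical, and the counts in the usage example are illustrative round numbers, not the values of Table 4.

```python
def validation_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, and accuracy from raw counts."""
    return {
        "sensitivity": tp / (tp + fn),   # share of binders recognized
        "specificity": tn / (tn + fp),   # share of non-binders recognized
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Illustrative counts only: 100 binders and 100 non-binders.
metrics = validation_metrics(tp=90, fn=10, tn=100, fp=0)
```

With a balanced test set, accuracy is the arithmetic mean of sensitivity and specificity, so the illustrative counts give an accuracy of 0.95.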

Table 4. Predictive ability of the logo models for HLA-DQ2.5 and HLA-DQ8.1 tested on external test sets. If the binding score (BS) is higher than the non-binding score (NBS), the nonamer is classified as a binder. Otherwise, it is a non-binder.

Figure 2. Percentages of true positives (TPs), false negatives (FNs), true negatives (TNs), and false positives (FPs) across ranges of the BS/NBS ratio for binders and the NBS/BS ratio for non-binders to HLA-DQ2.5, from predictions on the external test set. Optimal predictions occur at ratios below −1 and above +1.


Figure 3. Percentages of true positives (TPs), false negatives (FNs), true negatives (TNs), and false positives (FPs) across ranges of the BS/NBS ratio for binders and the NBS/BS ratio for non-binders to HLA-DQ8.1, from predictions on the external test set. Optimal predictions occur at ratios below −1 and above +1.


Table 1. Number of peptides in the training and test sets used in the study.
