AI-Driven Identification of Candidate Peptides for Immunotherapy in Non-Obese Diabetic Mice: An In Silico Study

Doytchinova, Irini; Dimitrov, Ivan; Atanasova, Mariyana; Mihaylova, Nikolina M.; Tchorbanov, Andrey

doi:10.3390/ai7040140


by Irini Doytchinova ^1,2,*, Ivan Dimitrov ^1,2, Mariyana Atanasova ^1,2, Nikolina M. Mihaylova ^3 and Andrey Tchorbanov ^3

^1 Drug Design and Bioinformatics Lab, Faculty of Pharmacy, Medical University of Sofia, 1000 Sofia, Bulgaria

^2 Centre of Excellence in Informatics and Information and Communication Technologies, 1113 Sofia, Bulgaria

^3 Department of Immunology, Institute of Microbiology, Bulgarian Academy of Sciences, Academician G. Bonchev Street, Block 26, 1113 Sofia, Bulgaria

^* Author to whom correspondence should be addressed.

AI 2026, 7(4), 140; https://doi.org/10.3390/ai7040140

Submission received: 9 March 2026 / Revised: 3 April 2026 / Accepted: 10 April 2026 / Published: 15 April 2026

(This article belongs to the Special Issue AI in Bio and Healthcare Informatics)


Abstract

Type 1 diabetes (T1D) is an autoimmune disease characterized by T-cell-mediated destruction of pancreatic β-cells. Antigen-specific peptide immunotherapy represents a promising strategy to restore immune tolerance. Reliable identification of relevant T-cell epitopes requires accurate prediction of peptide binding to disease-associated major histocompatibility complex (MHC) molecules. In this study, we developed and validated artificial intelligence (AI)-driven machine learning (ML) predictive models for peptides binding to the NOD mouse-specific MHC class I molecules H-2D^b and H-2K^d and the class II molecule I-A^g7. Balanced datasets of experimentally validated binders and non-binders were compiled, divided into training and test sets, and used to construct position-specific logo models and supervised ML classifiers based on z-scale physicochemical descriptors. External validation demonstrated moderate predictive performance for the logo models (ROC AUC 0.685–0.738), whereas AI models, including Random Forest, Support Vector Machine, and Gradient Boosting, achieved substantially improved discrimination (ROC AUC 0.888–0.906). The validated models were applied to the major T1D autoantigens glutamic acid decarboxylase 65, insulin-1, insulin-2 and zinc transporter 8 and predicted multiple binders, with some overlapping with previously reported immunodominant regions. Selected binders were prioritized for further synthesis and in vivo immunogenicity testing in NOD mice.

Keywords:

type 1 diabetes; peptide immunotherapy; MHC binding prediction; artificial intelligence; machine learning; NOD mice; H-2D^b; H-2K^d; I-A^g7

1. Introduction

Diabetes mellitus (DM) is a chronic metabolic disorder characterized by persistent hyperglycemia resulting from defects in insulin secretion, insulin action, or both, leading to disturbances in carbohydrate, lipid, and protein metabolism. The two major forms are type 1 diabetes (T1DM), caused by autoimmune destruction of pancreatic β-cells, and type 2 diabetes (T2DM), which is primarily associated with insulin resistance and relative insulin deficiency [1,2]. The global burden of diabetes has increased dramatically in recent decades, driven by population aging, urbanization, obesity, and sedentary lifestyles, and it is now recognized as a major public health challenge due to its association with microvascular complications (retinopathy, nephropathy, and neuropathy) and macrovascular disease (cardiovascular and cerebrovascular disorders). Effective management requires early diagnosis, lifestyle modification, glucose-lowering pharmacotherapy, and continuous monitoring to prevent long-term complications and reduce mortality [3,4].

Immunotherapy is a therapeutic approach that modulates or harnesses the immune system to prevent, control, or treat diseases, including cancer, autoimmune disorders, infectious diseases, and transplant rejection. Immunotherapy for diabetes primarily targets T1DM. Immunotherapeutic strategies aim to modulate or suppress the autoreactive immune response, preserve residual β-cell function, and delay disease progression [5]. Approaches under investigation and clinical evaluation include anti-CD3 monoclonal antibodies (teplizumab) [6], anti-CD20 therapy (rituximab) [7], CTLA-4–Ig (abatacept) [8], antigen-specific immunotherapy using insulin or glutamic acid decarboxylase 65 (GAD65) [9], regulatory T-cell (Treg) enhancement [10], and cytokine modulation [11,12].

Special attention should be given to peptide-specific immunotherapy in T1D based on administering CD4⁺ and/or CD8⁺ T-cell epitopes derived from key autoantigens such as insulin, GAD65 and zinc transporter 8 (ZnT8) to induce antigen-specific immune tolerance. Unlike whole-protein vaccination, peptide-based strategies aim to selectively target autoreactive CD4⁺ and/or CD8⁺ T-cells, promote regulatory T-cell (Treg) expansion, and shift cytokine responses from proinflammatory (IFN-γ) toward regulatory or Th2-type profiles (IL-10). Clinical studies using insulin- and GAD65-derived peptides have demonstrated safety and measurable immunomodulatory effects, including altered T-cell reactivity and increased regulatory responses, although consistent long-term preservation of β-cell function remains challenging [9,13].

Here, we focus on the application of logo modeling and machine learning (ML) approaches to develop predictive models for the identification of CD4⁺ and/or CD8⁺ T-cell epitopes derived from the major β-cell autoantigens insulin-1, insulin-2, GAD65, and ZnT8. Antigen processing and presentation are constrained by strain-specific major histocompatibility complex (MHC) molecules, which bind peptide fragments generated from autoantigens and present them on the cell surface for recognition by T cells [14]. Because the computationally identified epitopes will be validated experimentally in non-obese diabetic (NOD) mice, a well-established murine model of type 1 diabetes [15], ML modeling was specifically tailored to peptides binding to NOD-relevant MHC molecules, namely the class I molecules H-2K^d and H-2D^b and the class II molecule I-A^g7. Although several tools such as IEDB [16], NetMHCpan [17] and RANKPEP [18] provide robust MHC binding predictions, they are primarily trained on human HLA data and may show reduced performance for murine alleles, particularly I-A^g7, which exhibits unique structural and binding characteristics. IEDB and NetMHCpan do not support prediction of I-A^g7 binding, while RANKPEP identified only 3 out of 75 experimentally validated binders (Supplementary S1). Therefore, there is a need for specialized predictive models tailored to the NOD mouse model of type 1 diabetes. The key research question addressed in this study is whether AI-driven ML models can reliably identify NOD mouse-specific candidate MHC binders for subsequent experimental validation. Model performance was systematically evaluated and compared using standard classification metrics to assess predictive accuracy, robustness, and generalization capacity.

2. Datasets and Methods

2.1. Datasets

2.1.1. H-2D^b

A total of 539 unique binding and 583 non-binding nonamer peptides, experimentally evaluated using either a competitive radiolabeled binding assay with purified MHC molecules or a fluorescence-based binding assay on cellular MHC, were obtained from the Immune Epitope Database (IEDB; https://www.iedb.org/ [16], accessed on 27 January 2026). All duplicates within and between the sets were removed, and the sets of binders and non-binders were balanced to contain the same number of nonamers. The dataset was then divided into training and test sets in a 4:1 ratio. Five hundred different stratified 80/20 splits were evaluated, and the most representative test set was selected for each method. The training sets were used to derive the logo and AI-based predictive models, while the test sets were used to validate model performance.
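The dataset preparation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `balance_and_split`, the random downsampling used for balancing, and the decision to drop peptides found in both classes are assumptions. It produces a single stratified split, and the criterion used to pick the "most representative" of the 500 splits is not specified in the text and is therefore not reproduced.

```python
import random

def balance_and_split(binders, non_binders, test_fraction=0.2, seed=0):
    """Deduplicate, drop peptides found in both classes, balance, then split per class."""
    pos, neg = set(binders), set(non_binders)
    shared = pos & neg                       # peptides in both sets are dropped (assumption)
    pos, neg = sorted(pos - shared), sorted(neg - shared)

    rng = random.Random(seed)
    n = min(len(pos), len(neg))              # balance by downsampling the larger class (assumption)
    pos, neg = rng.sample(pos, n), rng.sample(neg, n)

    n_test = round(n * test_fraction)        # same number of test peptides in each class
    train = [(p, 1) for p in pos[n_test:]] + [(p, 0) for p in neg[n_test:]]
    test = [(p, 1) for p in pos[:n_test]] + [(p, 0) for p in neg[:n_test]]
    return train, test
```

Repeating the call with different `seed` values yields different stratified 80/20 splits of the same balanced data.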

2.1.2. H-2K^d

A dataset of 326 unique binding and 205 non-binding nonamer peptides, experimentally characterized using a competitive radiolabeled binding assay with purified MHC molecules or a direct fluorescence-based binding assay performed on cellular MHC, was retrieved from IEDB (accessed on 27 January 2026). All duplicates within and between the sets were removed, and the sets of binders and non-binders were balanced to contain the same number of nonamers. The dataset was split into training and test sets at a ratio of 4:1, preserving class distribution. The most representative test set was selected from 500 stratified 80/20 splits. The training set was used for the development and training of the sequence logo and AI models, whereas the test set was used to assess model performance.

2.1.3. I-A^g7

A dataset of 123 nonamer ligands and a set of 365 binding cores from ligands of different lengths carrying the known I-A^g7 binding motif, identified by mass spectrometry from cellular MHC molecules, were retrieved from IEDB (accessed on 27 January 2026). The binding motif includes hydrophobic residues at p1 and acidic residues (Asp or Glu) at p9 [19,20]. In addition, a set of 74 peptides of different lengths that do not bind I-A^g7, experimentally validated by direct or competitive fluorescence- or radioactivity-based binding assays performed on cellular or purified MHC molecules, was collected from IEDB (accessed on 27 January 2026). These peptides were fragmented into overlapping nonamers, yielding a total of 436 unique nonamer sequences. All duplicates within and between the sets were removed, and the sets of binders and non-binders were balanced to contain the same number of nonamers. The final sets contain 451 binders and 451 non-binders. The sets were then divided into training and test subsets at a 4:1 ratio with preserved class distribution.
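The fragmentation of longer peptides into overlapping nonamers amounts to a sliding window of nine residues. A minimal sketch is given below. The function name and the `first_position` parameter are illustrative, and the example sequence is the insulin B:9–23 peptide quoted later in the text.

```python
def overlapping_nonamers(sequence, length=9, first_position=1):
    """Return (start position, fragment) for every window of `length` residues."""
    return [(first_position + i, sequence[i:i + length])
            for i in range(len(sequence) - length + 1)]

# A 15-residue peptide yields 15 - 9 + 1 = 7 overlapping nonamers.
fragments = overlapping_nonamers("SHLVEALYLVCGERG", first_position=9)

# Unique nonamer sequences, as used when assembling a non-redundant set.
unique_nonamers = sorted({fragment for _, fragment in fragments})
```

Sequences shorter than nine residues yield an empty list.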

2.2. Methods

2.2.1. Logo Protocol

For each peptide set, the frequency of every amino acid at each position of the nonamer sequence was calculated and mean-normalized across all nine positions. This procedure rescales the values to the interval [−1, 1], with positive values describing the preferred amino acids and negative values describing the non-preferred ones. They were organized into a quantitative matrix (QM) with dimensions of 9 positions × 20 amino acids. This matrix, termed a logo model [21,22,23], is conceptually inspired by sequence logos—graphical representations of residue conservation in aligned protein sequences—where each position is depicted as a stack of amino acid symbols and the height of each letter reflects its relative frequency at that position [24]. Two logo models were constructed from the training set: one derived from the binding peptides and one from the non-binding ones. These models were used to compute the binding scores (BSs) and non-binding scores (NBSs) for each nonamer in the test set. Peptides with BS > NBS were classified as binders, whereas those with BS ≤ NBS were classified as non-binders.
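The logo protocol can be illustrated with a short sketch. Two points are assumptions rather than details stated in the text: mean normalization is taken as (x − mean) / (max − min) over all entries of the 9 × 20 frequency matrix, and the score of a peptide is taken as the sum of the matrix entries of its residues. All function names are illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_logo_model(peptides):
    """Quantitative matrix (positions x 20 amino acids) of mean-normalized frequencies."""
    n_pos = len(peptides[0])
    freq = [{aa: 0.0 for aa in AMINO_ACIDS} for _ in range(n_pos)]
    for pep in peptides:
        for pos, aa in enumerate(pep):
            freq[pos][aa] += 1.0 / len(peptides)
    values = [v for row in freq for v in row.values()]
    mean = sum(values) / len(values)
    spread = (max(values) - min(values)) or 1.0    # guard against a flat matrix
    # Mean normalization keeps every entry inside [-1, 1] (assumed formula).
    return [{aa: (v - mean) / spread for aa, v in row.items()} for row in freq]

def score(peptide, model):
    """Sum the matrix entries of the residues found at each position (assumed scoring)."""
    return sum(model[pos][aa] for pos, aa in enumerate(peptide))

def classify(peptide, binder_model, non_binder_model):
    """Binder if the binding score (BS) exceeds the non-binding score (NBS)."""
    bs, nbs = score(peptide, binder_model), score(peptide, non_binder_model)
    return "binder" if bs > nbs else "non-binder"
```

One model is built from the training binders and one from the training non-binders, and each test nonamer is then scored against both.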

2.2.2. AI Protocol

The peptide datasets were encoded using Wold’s three z-scale amino acid descriptors, which capture key physicochemical properties: hydrophobicity (z1), steric bulk (z2), and electronic effects (z3) [25]. For each nonamer, this encoding generated a vector of 27 numerical features (9 positions × 3 descriptors per residue). The resulting descriptor matrices for the training and test sets, comprising both binders and non-binders, were submitted to the AI tool ChatGPT-5.2 [26] with instructions to develop supervised machine learning (ML) classification models and to evaluate their predictive performance. Three supervised ML algorithms, Random Forest (RF) [27], Support Vector Machine (SVM) [28] and Gradient Boosting (GB) [29], were trained on the training sets, and their predictive performance was evaluated on the test sets. The code used in the study is given in Supplementary S2–S4. It contains the implementation details, including model initialization, training procedures, and evaluation protocols, so that the results can be independently reproduced. The hyperparameter configuration for the RF model includes 500 trees, Gini impurity, feature sampling, and unrestricted depth, chosen to reduce variance and capture nonlinear interactions. For the SVM model, we selected the RBF kernel and report the parameter settings (C and γ) together with the rationale for standard scaling. For the GB model, we provide the tree depth, learning rate, number of estimators, and loss function, chosen to balance bias against overfitting. Hyperparameters were selected by a controlled, performance-driven strategy rather than by exhaustive grid search or Bayesian optimization. Given the moderate dataset size, model performance was systematically validated on independent test sets. This approach minimizes the risk of overfitting that may arise from aggressive hyperparameter optimization on relatively small datasets.
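The descriptor encoding can be sketched as a simple concatenation of per-residue triplets. The function name is illustrative, and the numbers in `TOY_DESCRIPTORS` are placeholders chosen only to make the example runnable. They are not the published z-scale values, which should be taken from the cited reference.

```python
def encode_peptide(peptide, descriptors):
    """Concatenate the (z1, z2, z3) triplet of each residue into one flat feature vector."""
    features = []
    for residue in peptide:
        features.extend(descriptors[residue])    # raises KeyError for unknown residues
    return features

# PLACEHOLDER values for illustration only; substitute the published z-scale table.
TOY_DESCRIPTORS = {
    "A": (0.1, -1.0, 0.2),
    "L": (-2.0, -0.5, -0.3),
    "N": (1.5, 0.7, 0.4),
}

# A nonamer gives 9 positions x 3 descriptors = 27 features,
# ordered p1_z1, p1_z2, p1_z3, p2_z1, ..., p9_z3.
vector = encode_peptide("ALNALNALN", TOY_DESCRIPTORS)
```

Stacking such vectors row by row gives the descriptor matrix on which the classifiers are trained.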

SHAP analysis [30] was performed for the best-performing model for each allele, and the results are displayed as violin plots. Each violin represents the distribution of SHAP values for a given feature across the test set, reflecting both the magnitude and variability of its contribution to binding predictions. Features with wider distributions have stronger and more variable influences, while features whose values are centered near zero have minimal impact.

2.2.3. Model Validation Metrics

Model validation was performed on the external test sets derived by 4:1 splits using standard metrics, including true positives (TP; binders correctly predicted as binders), true negatives (TN; non-binders correctly predicted as non-binders), false positives (FP; non-binders incorrectly predicted as binders), and false negatives (FN; binders incorrectly predicted as non-binders). From these values, sensitivity (TP/total binders), specificity (TN/total non-binders), overall accuracy ((TP + TN)/total peptides), F1 score, Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic curve (ROC AUC) were calculated to assess the classification performance [31].
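The threshold-dependent metrics follow directly from the four confusion-matrix counts, as in the sketch below. The function name is illustrative, and the F1 and MCC expressions are the standard definitions, which the text does not spell out. ROC AUC is omitted because it requires the continuous prediction scores rather than counts. The example uses the I-A^g7 logo-model counts reported in Section 3.3.1 (40 of 75 binders and 54 of 75 non-binders correctly identified).

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, F1 and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                  # TP / total binders
    specificity = tn / (tn + fp)                  # TN / total non-binders
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "f1": f1, "mcc": mcc}

# I-A^g7 logo model: TP = 40, FN = 35, TN = 54, FP = 21.
metrics = classification_metrics(tp=40, tn=54, fp=21, fn=35)
# Rounded to three decimals: sensitivity 0.533, specificity 0.720,
# accuracy 0.627, F1 0.588, MCC 0.258.
```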

3. Results

3.1. H-2D^b Models

3.1.1. Logo Models

The logo models for H-2D^b binders and non-binders are summarized in Tables S1 and S2, while the sequence logos are shown in Figure 1. Clear distinctions between the two groups are evident, particularly at the key anchor positions. In the binder logo, p5 is strongly dominated by Asn, reflecting its essential role in stabilizing peptide–MHC interactions. By contrast, the non-binder logo shows only modest enrichment of Asn at p5, indicating a lack of selective constraint at this site. P9 also exhibits substantial divergence: binders preferentially accommodate hydrophobic residues—especially Leu, Ile, Val and Met—whereas non-binders display reduced residue preferences. At the remaining positions (p1–p4 and p6–p8), binders show moderate residue preferences, while non-binders are relatively silent. These observations are in strong agreement with the established H-2D^b binding motif [32,33].

The logo models were used to predict the BS and NBS of the peptides from the test set as described in the Methods Section. If the BS is higher than the NBS, the peptide is classified as a binder; otherwise, it is classified as a non-binder. The model validation metrics from the external validation are given in Table 1.

The logo-based model demonstrates only moderate predictive performance on the test set. With a sensitivity of 0.620, it correctly identifies approximately 62% of true binders, leaving more than one-third undetected (FN = 41). This indicates that although the motif captures the major binding determinants, it fails to cover the full sequence diversity of binders. The specificity of 0.694 reflects a limited capacity to exclude non-binders, as evidenced by 33 false positives. Thus, motif features enriched in binders are not sufficiently distinctive and also appear in a proportion of non-binding peptides. An overall accuracy of 0.657 supports the conclusion of moderate performance. The F1 score of 0.644, MCC of 0.316 and ROC AUC of 0.685 further confirm limited discriminative power.

To overcome the moderate performance of the logo model, we applied AI-driven ML methods to develop a more robust and highly predictive classification model.

3.1.2. AI Model

The predictive performance of the RF, SVM and GB models, evaluated on the test set of H-2D^b binding and non-binding peptides, is given in Supplementary S5. The Random Forest (RF) [27] model demonstrates the strongest predictive performance and a substantial improvement in classification quality on the external test set (Table 1). The model was configured with 500 decision trees to ensure model stability and low variance. Trees were grown without restriction on depth, allowing the model to capture complex nonlinear relationships and higher-order residue interactions. Node splitting was performed using the Gini impurity criterion, selecting splits that maximally reduced class impurity at each node. At each split, a random subset of features equal in size to the square root of the total number of features was considered to promote tree decorrelation and improve ensemble generalization. Other parameters were kept at their default values (min_samples_split = 2, min_samples_leaf = 1, bootstrap = true, and class_weight = none), and the random seed was fixed (random_state = 42). The code used is given in Supplementary S2. The optimal threshold of 0.5 was determined by systematically evaluating model performance across all possible probability thresholds and selecting the value that gave the maximum balanced accuracy on the test set. With a sensitivity of 0.778, the model correctly identifies 78% of true binders, markedly reducing false negatives (from 41 to 24) compared with the logo model. This indicates a high capability for detecting MHC-binding peptides, which is particularly important in epitope discovery contexts. At the same time, the specificity of 0.843 shows that the model effectively rejects non-binders, keeping false positives relatively low (17). The overall accuracy of 0.810, F1 score of 0.804, MCC of 0.622 and ROC AUC of 0.888 demonstrate excellent discriminative ability, indicating that the AI model effectively separates binders from non-binders.
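The threshold selection step can be illustrated by scanning candidate cut-offs and keeping the one with the highest balanced accuracy, defined as the mean of sensitivity and specificity. This is a sketch under assumptions: the function name is illustrative, the candidate thresholds are taken to be the observed predicted probabilities, and a peptide is called a binder when its probability is at or above the threshold. Both classes must be present in `labels`.

```python
def best_threshold(labels, probabilities):
    """Return (threshold, balanced accuracy) maximizing (sensitivity + specificity) / 2."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = (None, -1.0)
    for t in sorted(set(probabilities)):          # candidate cut-offs: observed probabilities
        tp = sum(1 for y, p in zip(labels, probabilities) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(labels, probabilities) if y == 0 and p < t)
        balanced_accuracy = 0.5 * (tp / n_pos + tn / n_neg)
        if balanced_accuracy > best[1]:           # ties keep the lowest threshold
            best = (t, balanced_accuracy)
    return best

# Toy labels (1 = binder) and predicted probabilities, for illustration only.
threshold, bal_acc = best_threshold([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.6, 0.3, 0.1])
```

Which data the scan is run on is a design choice; the text above reports using the test set.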

The SHAP analysis identified p5 as the most influential determinant of binding, with strong contributions across hydrophobic (z1), steric (z2), and electronic (z3) descriptors, indicating a central role of this position in stabilizing peptide–MHC interactions (Figure 1). P2 also showed significant contributions associated with steric and electronic properties. P9 contributed moderately through hydrophobic effects. The remaining positions have relatively weak effects.

3.2. H-2K^d Models

3.2.1. Logo Models

The logo models for H-2K^d binders and non-binders are given in Tables S3 and S4, respectively, and their corresponding sequence logos are shown in Figure 2. The logos highlight a distinct anchor-driven binding pattern. Among binders, p2 exhibits a strong enrichment of Tyr and Phe, indicating a dominant primary anchor residue, while p9 shows a preference for Leu, Ile and Val. The derived logo is in good agreement with the known binding motif for H-2K^d [34,35,36,37]. Non-binders display weaker preferences for the same amino acids at these key positions. The remaining positions (p1 and p3–p8) demonstrate limited conservation in both groups, suggesting a minor role in binding specificity. Overall, the logos show only weak discrimination between binders and non-binders.

The performance metrics of the logo models on the external test set are summarized in Table 2. As expected, the logo model shows moderate predictive performance. With a sensitivity of 0.683, the model correctly identifies about 68% of true binders, meaning that just under one-third of the binders in the test set (FN = 13) are missed. This indicates that while the motif captures key binding features, it does not fully represent the diversity of binder sequences. The specificity of 0.585 reflects only a modest ability to reject non-binders, with 17 false positives in the test set. This suggests that the motif features present in binders are not sufficiently exclusive and occur in a substantial fraction of non-binders. The accuracy of 0.634, F1 score of 0.651, MCC of 0.270 and ROC AUC of 0.738 show moderate discriminative ability.

3.2.2. AI Model

Among the AI-generated models (Supplementary S6), the best-performing algorithm was the Support Vector Machine (SVM) [28] with a Radial Basis Function (RBF) kernel [38] (C = 1.0, γ = scale, Platt scaling, no class weight, enabled shrinking, optimization tolerance = 0.001) (Table 2). The code used is given in Supplementary S3. With a sensitivity of 0.854, the model correctly identifies more than 85% of true binders, substantially reducing false negatives compared to the logo-based approach. At the same time, a specificity of 0.902 indicates effective rejection of non-binders, maintaining a relatively low false-positive rate. The balanced sensitivity and specificity are reflected in an overall accuracy of 0.878. An F1 score of 0.875 confirms a strong balance between precision and sensitivity, while an MCC of 0.757 indicates substantial overall predictive power and robust performance across both classes. Furthermore, a ROC AUC of 0.903 demonstrates excellent discriminative ability, showing that the AI model effectively separates binders from non-binders across a wide range of decision thresholds.
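For readers unfamiliar with the RBF kernel, the sketch below computes the kernel value K(x, y) = exp(−γ‖x − y‖²) between two feature vectors. The setting "γ = scale" matches the option name used by scikit-learn, where it resolves to 1 / (n_features × variance of all training matrix entries). That this library was used is an assumption here, and the function names and toy matrix are illustrative.

```python
import math

def gamma_scale(X):
    """gamma = 1 / (n_features * variance of all entries of X), scikit-learn's 'scale' rule."""
    values = [v for row in X for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return 1.0 / (len(X[0]) * variance)

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * squared Euclidean distance between x and y)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

# Toy training matrix with two samples and two features, for illustration only.
X_toy = [[0.0, 2.0], [2.0, 0.0]]
gamma = gamma_scale(X_toy)
similarity = rbf_kernel(X_toy[0], X_toy[1], gamma)
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the descriptor vectors move apart, which is what produces smooth decision boundaries.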

SHAP analysis revealed that peptide binding predictions were driven by position-specific physicochemical properties (Figure 2). The most influential feature was the steric descriptor (z2) at p2, indicating a strong size-dependent anchor interaction. P9 exhibited combined contributions from hydrophobic (z1) and electronic (z3) properties. Central positions (p4–p6) showed moderate contributions across multiple descriptors. Comparison between the logo model and the SHAP interpretation revealed strong concordance in identifying key binding positions, particularly p2 and p9 as dominant anchors.

3.3. I-A^g7 Models

3.3.1. Logo Models

The logo models for binders and non-binders to I-A^g7 are given in Tables S5 and S6, and the corresponding sequence logos are presented in Figure 3. The binder logo shows strong amino acid preferences at p9, with a significant enrichment of negatively charged Glu and Asp, as well as hydrophobic residues like Leu. In contrast, the non-binder logo displays heterogeneous amino acid distribution across all positions. Similar binding motifs for I-A^g7 have previously been reported by other authors [19,20].

The BS and NBS of the peptides from the test set were calculated by the corresponding logo models, and the validation metrics are presented in Table 3. The logo models correctly identify 40 of 75 binders and 54 of 75 non-binders. With a sensitivity of 0.533 and specificity of 0.720, the model shows a better ability to reject non-binders than to detect true binders. The overall accuracy of 0.627, F1 score of 0.588, MCC of 0.258 and ROC AUC of 0.726 reflect a weak classification performance of the logo models.

3.3.2. AI Model

The AI model based on GB [29] demonstrates stronger and well-balanced predictive performance on the external test set (Table 3) compared to the RF and SVM models (Supplementary S7). The model employed decision trees as base learners with a maximum depth of 3, enabling controlled model complexity and reducing overfitting. The boosting procedure was performed using the logistic loss function (log_loss), appropriate for binary classification. The learning rate was set to 0.1, and the number of boosting stages (n_estimators) was 100. No subsampling was applied (subsample = 1.0), resulting in deterministic gradient boosting. Default values were retained for other hyperparameters (min_samples_split = 2, min_samples_leaf = 1, max_features = none, and ccp_alpha = 0.0). A fixed random seed (random_state = 42) was used to ensure reproducibility. The code used is given in Supplementary S4. The model correctly classifies 62 of 75 binders and 71 of 75 non-binders, resulting in a sensitivity of 0.827 and specificity of 0.947. The overall accuracy of 0.887, F1 score of 0.879, MCC of 0.779 and ROC AUC of 0.906 further confirm excellent discriminatory capacity. As expected, the AI-driven model captures the complex sequence determinants of binding to I-A^g7 more effectively than position-specific logo scoring.
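For convenience, the hyperparameters reported for the three selected models can be collected as keyword dictionaries. The values are those stated in the text. The keyword names follow scikit-learn conventions, which is an assumption about the library used, suggested by parameter names such as min_samples_split and ccp_alpha. The mapping of "Platt scaling" to `probability=True` is likewise an interpretation, and the dictionary names are illustrative.

```python
# Keyword names follow scikit-learn conventions (assumed library); values come from the text.
RF_PARAMS = {            # H-2D^b model
    "n_estimators": 500, "criterion": "gini", "max_depth": None,
    "max_features": "sqrt", "min_samples_split": 2, "min_samples_leaf": 1,
    "bootstrap": True, "class_weight": None, "random_state": 42,
}

SVM_PARAMS = {           # H-2K^d model
    "kernel": "rbf", "C": 1.0, "gamma": "scale",
    "probability": True,  # Platt scaling (interpretation of the reported setting)
    "class_weight": None, "shrinking": True, "tol": 0.001,
}

GB_PARAMS = {            # I-A^g7 model
    "loss": "log_loss", "learning_rate": 0.1, "n_estimators": 100,
    "subsample": 1.0, "max_depth": 3, "min_samples_split": 2,
    "min_samples_leaf": 1, "max_features": None, "ccp_alpha": 0.0,
    "random_state": 42,
}
```

Under that assumption, the dictionaries would be passed as `RandomForestClassifier(**RF_PARAMS)`, `SVC(**SVM_PARAMS)` and `GradientBoostingClassifier(**GB_PARAMS)`. The authors' actual code is in Supplementary S2–S4.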

The SHAP analysis reveals that peptide binding predictions are predominantly driven by hydrophobicity-related descriptors, with the strongest contributions observed at the terminal positions p1 and p9 (Figure 3). These positions exhibit the largest SHAP value dispersion, indicating their critical role as anchor residues. In contrast, central positions (p4–p6) display narrow distributions centered around zero, suggesting a limited contribution to binding. The asymmetric spread of SHAP values, with both strong positive and negative extremes, indicates that the model captures not only favorable but also unfavorable physicochemical patterns.

3.4. Prediction of Mouse Major T1D Autoantigens

The developed predictive models for peptide binding to the NOD mouse-specific MHC class I molecules H-2D^b and H-2K^d, and the class II molecule I-A^g7, were subsequently applied to screen the main pancreatic β-cell autoantigens—GAD65, insulin-1, insulin-2 and ZnT8—for predicted binding peptides as potential T-cell epitopes for immunotherapy.

3.4.1. Glutamic Acid Decarboxylase 65 (GAD65)

The mouse GAD65 (UniProt IDs: P48320 and Q548L4) was represented as a set of overlapping nonamer peptides, and each nonamer was classified as a binder or non-binder for each specific MHC molecule. The complete list of predicted binders and non-binders is provided in Table S7. For the RF model (H-2D^b), the threshold of 0.5 provides an optimal trade-off between sensitivity and specificity, avoiding bias toward either false positives or false negatives. For the SVM model (H-2K^d) and the GB model (I-A^g7), only high-confidence binders, with prediction probabilities greater than 0.96, were retained. This higher threshold aims to reduce the number of false positives passed on to the subsequent experimental work. Table 4 summarizes only those nonamers predicted as binders by both logo and AI models for the corresponding MHC molecule. In total, 22 peptides were predicted as binders, with the majority showing specificity toward a single MHC allele, while several peptides (¹⁵⁰ADQPQNLEEI, ¹⁸⁶LDMVGLAAD, ⁴²⁶LFQQDKHYD and ⁴⁴³ALQCGRHVDVF) demonstrated potential promiscuity and may therefore have enhanced immunological relevance. Overall, the predicted epitopes are distributed across the full length of the protein, indicating multiple regions that may contribute to T-cell recognition in the murine immune system.
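The screening logic, in which a nonamer is kept only when both the logo model and the AI model call it a binder, can be sketched as follows. The function names are illustrative, and the two predictors are passed in as callables because the trained models themselves are not reproduced here. Using a strict "greater than" comparison for every threshold is an assumption; the text states it explicitly only for the 0.96 cut-off.

```python
def consensus_binders(sequence, logo_is_binder, ai_probability, threshold, length=9):
    """Return (1-based start, nonamer) for windows called binders by both models."""
    hits = []
    for i in range(len(sequence) - length + 1):
        fragment = sequence[i:i + length]
        if logo_is_binder(fragment) and ai_probability(fragment) > threshold:
            hits.append((i + 1, fragment))
    return hits

def promiscuous(hits_by_allele):
    """Nonamers predicted as consensus binders for more than one MHC allele."""
    alleles_per_hit = {}
    for allele, hits in hits_by_allele.items():
        for hit in hits:
            alleles_per_hit.setdefault(hit, set()).add(allele)
    return {hit: sorted(a) for hit, a in alleles_per_hit.items() if len(a) > 1}
```

In use, `threshold` would be 0.5 for the RF model and 0.96 for the SVM and GB models, and `promiscuous` would receive one hit list per allele.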

3.4.2. Insulins

The two mouse insulin isoforms, insulin-1 (UniProt ID: P01325) and insulin-2 (UniProt ID: P01326), show a high degree of sequence similarity, differing at only a few positions across the preproinsulin sequence. Both proteins were represented as sets of overlapping nonamer peptides, and each nonamer was classified as a binder or non-binder for each specific MHC molecule. The complete list of predicted binders and non-binders is provided in Table S8. For both insulin isoforms, the logo and AI models did not identify the same peptides as binders. Within the diabetogenic insulin B-chain T-cell epitope ⁹SHLVEALYLVCGERG [39,40], the sequence logo models predicted three potential binders: ¹³EALYLVCGE binding to H-2D^b, ¹⁴ALYLVCGER binding to H-2K^d, and ¹⁵LYLVCGERG binding to I-A^g7. However, these predictions were not confirmed by the corresponding AI models, and we did not classify these peptides as binders.

3.4.3. Zinc Transporter 8 (ZnT8)

The mouse ZnT8 (UniProt ID: Q8BGG0) was represented as a set of overlapping nonamer peptides, and each nonamer was classified as a binder or non-binder for each specific MHC molecule. The complete list of predicted binders and non-binders is provided in Table S9. Table 5 shows only those nonamers predicted as binders by both the logo-based and AI-driven models for the respective MHC molecule.

Table 5 shows that mouse ZnT8 yields a limited but diverse set of predicted nonamer binders distributed across the protein sequence and covering all three NOD-associated MHC alleles. The majority of peptides are predicted to bind I-A^g7 (starting positions 26, 31, 79, 234, 260 and 286), consistent with a predominance of potential CD4⁺ T-cell determinants, while two peptides (starting positions 318 and 333) are restricted to H-2K^d and one (starting position 74) to H-2D^b, indicating the presence of candidate CD8⁺ T-cell epitopes as well.

4. Discussion

The present study demonstrates that AI-driven ML models substantially outperform classical position-specific logo scoring for the prediction of peptide binding to NOD mouse-specific MHC molecules. Across all three alleles examined (H-2D^b, H-2K^d and I-A^g7), the logo models captured the major anchor residue preferences and reproduced established binding motifs, confirming that the datasets were biologically coherent and that the implemented normalization procedure was appropriate. However, their predictive performance remained moderate, with MCC values ranging from 0.258 to 0.316 and ROC AUC values below 0.750. In contrast, the AI-derived models (RF for H-2D^b, SVM-RBF for H-2K^d, and GB for I-A^g7) consistently achieved high class discrimination (ROC AUC 0.888–0.906; MCC 0.622–0.779), indicating that non-linear algorithms trained on physicochemical descriptors can capture complex sequence–property relationships that simple positional residue frequencies miss.

The differences in optimal model selection (RF for H-2D^b, SVM for H-2K^d, and GB for I-A^g7) reflect the distinct structural and binding characteristics of these MHC molecules. The peptide binding to H-2D^b (RF model) is strongly driven by dominant anchor positions (particularly p5 and p9), with relatively clear and consistent physicochemical preferences. The SHAP analysis shows concentrated, high-impact contributions at these positions across all three descriptors (z₁, z₂, and z₃). Such patterns are well captured by RF models, which efficiently learn hierarchical, non-linear feature interactions and handle dominant feature contributions without requiring strict global decision boundaries.

The H-2K^d allele (SVM model) exhibits a more constrained and smoother binding landscape, primarily governed by key anchor residues at p2 and p9 with strong steric (z₂) and hydrophobic/electronic contributions. The SHAP profiles indicate relatively well-separated feature distributions, which are particularly suitable for SVM with RBF kernels that construct smooth decision boundaries in feature space and perform well when class separation is structured but not highly irregular.

In contrast, the peptide binding to I-A^g7 (GB model) is more heterogeneous and less canonical, characterized by multiple weak and context-dependent contributions. SHAP analysis reveals distributed importance across positions, with dominant but variable effects at p1 and p9, largely driven by hydrophobicity (z₁), but with significant variability and bidirectional contributions (both favorable and unfavorable). This reflects the known structural features of I-A^g7, including its flexible binding groove and tolerance for multiple binding registers. The GB algorithm performs best in this setting because it incrementally builds an ensemble of shallow trees that can capture subtle, additive, and context-dependent effects across features. This makes it particularly suitable for modeling the complex and less well-defined binding rules of I-A^g7, where no single dominant pattern governs binding.

Application of the validated models to the mouse T1D autoantigens GAD65, insulin-1, insulin-2 and ZnT8 identified potential MHC binders. GAD65 is one of the major autoantigens implicated in autoimmune diabetes and has been extensively studied in the context of T-cell recognition in NOD mice and patients with T1D [41,42]. The use of overlapping nonamers combined with both logo and ML models enabled the identification of 22 peptides predicted as binders to at least one of H-2D^b, H-2K^d, and I-A^g7, indicating a potential to stimulate both CD8⁺ and CD4⁺ T-cell responses. The predicted epitopes are distributed throughout the entire protein sequence, which is consistent with experimental studies showing that GAD65 contains multiple immunogenic regions recognized by autoreactive T cells in NOD mice [41,43]. Several predicted peptides correspond to regions previously implicated in autoimmune responses. For example, the peptide ²²⁸IGWPGGSGD lies close to the well-known GAD65 region around residues 221–235, which has been experimentally shown to stimulate CD4⁺ T cells in NOD mice [44]. Most predicted peptides were allele-specific, but several demonstrated potential promiscuity, including ¹⁸⁶LDMVGLAAD, ⁴²⁶LFQQDKHYD, and the longer region spanning positions 150–159. Promiscuous peptides capable of binding multiple MHC alleles are often considered particularly important in autoimmune responses because they can activate broader T-cell repertoires [45]. Another specific feature is the dominance of predicted binders for the diabetogenic class II molecule I-A^g7. This observation aligns with the central role of CD4⁺ T cells in the pathogenesis of autoimmune diabetes in NOD mice [15]. The unique structural properties of I-A^g7, including the absence of Asp at position β57 of the MHC β chain, influence peptide binding preferences and TCR recognition [41,44]. The set of GAD65 promiscuous binders selected for further synthesis and in vivo tests is summarized in Table 6.
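The overlapping-nonamer scan mentioned above amounts to a sliding window of length 9 that advances one residue at a time, so a protein of N residues yields N − 8 nonamers. The following Python sketch illustrates the idea on the GAD65 fragment 150–159 listed in Table 6. The function name and its parameters are hypothetical and do not come from the supplementary code.

```python
from typing import List, Tuple


def overlapping_nonamers(
    sequence: str, first_position: int = 1, k: int = 9
) -> List[Tuple[int, str]]:
    """Return (start position, k-mer) pairs for every window of length k.

    first_position is the residue number of sequence[0], so that fragments
    can be scanned with the numbering of the full-length protein.
    """
    return [
        (first_position + i, sequence[i:i + k])
        for i in range(len(sequence) - k + 1)
    ]


fragment = "ADQPQNLEEI"  # GAD65 residues 150-159 (Table 6)
windows = overlapping_nonamers(fragment, first_position=150)
for start, nonamer in windows:
    print(start, nonamer)
```

The two windows produced for this fragment are the nonamers reported at positions 150 and 151 in Table 4.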

Both mouse insulin proteins contain the typical signal peptide followed by the B-chain, C-peptide, and A-chain regions characteristic of insulin precursors. Sequence alignment reveals that most residues are conserved, particularly within the B-chain region (FVKQHLCGPHLVEALYLVCGERGFFYTP), which is functionally important for receptor binding and immune recognition. The B-chain region contains the experimentally established dominant T-cell epitope in NOD mice, the B:9–23 peptide (⁹SHLVEALYLVCGERG), which is presented by I-A^g7 and recognized by pathogenic CD4⁺ T cells (Table 6) [39,40]. One difference occurs at position 33, where insulin-1 contains Pro while insulin-2 contains Ser. Three potential binders were predicted by the logo models, but they were not confirmed by the AI models. This lack of high-confidence predictions is consistent with previous experimental observations showing that many diabetogenic insulin epitopes bind I-A^g7 relatively weakly and are presented in non-optimal binding registers [42,46]. I-A^g7 typically prefers acidic residues (Asp or Glu) at the p9 anchor position [19,20]. However, the dominant insulin epitope presents neutral residues in this position depending on the binding register [19,44]. As a result, the peptide–MHC complex is relatively unstable, which explains why computational models trained on peptides carrying canonical binding motifs fail to classify such peptides as strong binders. Another important factor is that diabetogenic insulin epitopes often adopt unconventional binding registers within the MHC groove. For the B:9–23 peptide, several alternative registers have been proposed, and pathogenic T-cell receptors recognize complexes formed in a weak binding mode [40,46]. Weak peptide–MHC interactions allow autoreactive T cells to escape negative selection in the thymus while still being capable of activation in the periphery [15].
In the NOD model, the relatively low stability of insulin–I-A^g7 complexes has been proposed to contribute to defective central tolerance and to the activation of diabetogenic T-cell clones [39]. Therefore, the lack of strong binders predicted in this analysis reflects an important biological feature of insulin epitopes rather than a limitation of the computational models. Despite the absence of strong binders predicted from the insulin sequences in this study, the well-established diabetogenic T-cell epitope SHLVEALYLVCGERG was selected for further synthesis and in vivo evaluation (Table 6).
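The dependence of the p9 residue on the binding register can be made concrete by enumerating every contiguous nine-residue core of the B:9–23 peptide. The Python sketch below does only that: it is a purely sequence-based illustration that assumes contiguous 9-mer cores and says nothing about which register is actually occupied in vivo. The function name and the register numbering (by offset within the 15-mer) are assumptions made for this example.

```python
from typing import List, Tuple

ACIDIC = {"D", "E"}  # residues preferred by I-A^g7 at p9 according to the text


def enumerate_registers(
    peptide: str, core_length: int = 9
) -> List[Tuple[int, str, str, bool]]:
    """List (register index, core, p9 residue, p9 is acidic) for each core."""
    result = []
    for offset in range(len(peptide) - core_length + 1):
        core = peptide[offset:offset + core_length]
        p9 = core[-1]
        result.append((offset + 1, core, p9, p9 in ACIDIC))
    return result


b9_23 = "SHLVEALYLVCGERG"
registers = enumerate_registers(b9_23)
acidic_cores = [core for _, core, _, is_acidic in registers if is_acidic]

for index, core, p9, is_acidic in registers:
    print(index, core, p9, "acidic p9" if is_acidic else "non-acidic p9")
```

Of the seven possible cores, only one places an acidic residue at p9, so most registers of this peptide lack the canonical p9 anchor described above.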

ZnT8 displayed a moderate number of predicted binders distributed across all three alleles. Most predicted binders were restricted to I-A^g7, suggesting that ZnT8 may preferentially contribute to CD4⁺ T-cell responses in NOD mice. Once again, this dominance is consistent with the well-established role of CD4⁺ T cells and the diabetogenic MHC class II molecule I-A^g7 in initiating autoimmune responses leading to T1D [15]. At the same time, the identification of a small number of peptides predicted to bind H-2K^d and H-2D^b indicates that ZnT8 may also generate candidate CD8⁺ T-cell epitopes, which could contribute to β-cell destruction during later stages of the disease. Comparison with previously reported epitopes supports the biological plausibility of several predicted peptides [47,48]. T-cell epitopes derived from ZnT8 have been mapped in humans, including regions located in the transmembrane and cytoplasmic domains of the transporter [48,49]. Some of the predicted peptides in the present study, such as ²⁶⁰ASTVMILKD and ²⁸⁶VKEIILAVD, originate from the C-terminal cytoplasmic region of ZnT8, which corresponds to a region previously implicated in immune recognition in human studies. The predicted peptide ⁷⁹ICFIFMVAE lies within a hydrophobic segment of the protein corresponding to a transmembrane region. Similar hydrophobic regions have previously been shown to generate MHC class I epitopes after intracellular processing of membrane proteins [50]. The identification of ⁷⁴CAASAICFI as a potential H-2D^b binder further supports the possibility that ZnT8-derived peptides could contribute to CD8⁺ T-cell responses in NOD mice. Although ZnT8-specific CD8⁺ T cells have been less extensively characterized than CD4⁺ responses, cytotoxic T cells are known to play a critical role in β-cell destruction in the NOD model [50]. Similarly to GAD65, the predicted epitopes are distributed across the entire protein sequence rather than clustering in a single dominant region. 
This pattern is consistent with previous studies indicating that ZnT8 contains multiple potential immunogenic determinants and may contribute to epitope spreading during disease progression [48]. Epitope spreading is a well-recognized feature of autoimmune diabetes, where immune responses initially directed against one autoantigen expand to target additional β-cell proteins over time [15]. Two promiscuous peptides from ZnT8 were selected for further synthesis and in vivo tests (Table 6).

Despite their superior predictive performance, AI-driven models have several limitations. As data-based models, their performance depends on the quality, size, and representativeness of the training datasets, making them sensitive to bias, noise, and data scarcity. Their ability to generalize beyond the training space is limited, and they may fail to capture atypical binding modes or novel peptide patterns. In addition, most models operate as black boxes with limited interpretability and are prone to overfitting, especially when trained on small or imbalanced datasets.

A very specific limitation of current AI-based prediction models lies in their reduced ability to accurately identify pathogenic weak-binding epitopes, which represent an important yet underexplored class of immunologically relevant peptides in autoimmune diseases. Most existing models are trained on datasets enriched for high-affinity binders and canonical binding motifs, leading to an inherent bias toward strong and well-defined interactions [16,51]. Consequently, peptides that bind with low affinity or adopt non-canonical binding registers may be underrepresented or misclassified, despite growing evidence that such epitopes can play critical roles in autoimmune pathogenesis [40,52]. In addition, key biological processes influencing epitope immunogenicity—such as antigen processing, peptide stability, MHC loading dynamics, and T-cell receptor recognition—are not fully captured by current predictive frameworks [53,54].

These observations underline once again that MHC binding is necessary but not sufficient for immunogenicity. Antigen processing, peptide abundance, T-cell repertoire availability, and tolerance mechanisms all contribute to the emergence of dominant epitopes. Therefore, while the high predictive accuracy of AI models makes them powerful prioritization tools, experimental validation in NOD mice remains essential. The integration of predicted high-confidence binders with in vivo T-cell assays, cytokine profiling, and disease modulation studies will determine their true therapeutic relevance.

5. Conclusions

This study illustrates the practical utility of generative AI-assisted model development for rapid immunoinformatics workflows. The pipeline from descriptor encoding to model training and validation was implemented efficiently, enabling rapid model development and evaluation. While the results indicate improved predictive performance compared to simpler methods, the models remain dependent on the quality and scope of the available data. Therefore, their application should be considered as a supportive tool for prioritizing candidate peptides, requiring further experimental validation. Such approaches may contribute to the design of antigen-specific immunotherapies in autoimmune diabetes and potentially in other immune-mediated diseases.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/ai7040140/s1, Supplementary S1: RANKPEP predictions of binders to I-A^g7; Supplementary S2: RF code used in the study; Supplementary S3: SVM code used in the study; Supplementary S4: GB code used in the study; Supplementary S5: Model validation metrics on the external test set for H-2D^b by three ML models (RF, SVM and GB); Supplementary S6: Model validation metrics on the external test set for H-2K^d by three ML models (RF, SVM and GB); Supplementary S7: Model validation metrics on the external test set for I-A^g7 by three ML models (RF, SVM and GB); Table S1: Logo model for binders to H-2D^b; Table S2: Logo model for non-binders to H-2D^b; Table S3: Logo model for binders to H-2K^d; Table S4: Logo model for non-binders to H-2K^d; Table S5: Logo model for binders to I-A^g7; Table S6: Logo model for non-binders to I-A^g7; Table S7: Complete list of overlapping nonamers originating from mouse GAD65 predicted to bind to NOD mice-specific MHC; Table S8: Complete list of overlapping nonamers originating from mouse insulin-1 predicted to bind to NOD mice-specific MHC; Table S9: Complete list of overlapping nonamers originating from mouse ZnT8 predicted to bind to NOD mice-specific MHC.

Author Contributions

Conceptualization, I.D. (Irini Doytchinova), I.D. (Ivan Dimitrov) and M.A.; methodology, I.D. (Irini Doytchinova); validation, I.D. (Irini Doytchinova); investigation, I.D. (Irini Doytchinova); resources, I.D. (Ivan Dimitrov); data curation, I.D. (Ivan Dimitrov); writing—original draft preparation, I.D. (Irini Doytchinova); writing—review and editing, I.D. (Irini Doytchinova), I.D. (Ivan Dimitrov), M.A., N.M.M. and A.T.; visualization, I.D. (Irini Doytchinova); supervision, I.D. (Irini Doytchinova), N.M.M. and A.T.; project administration, I.D. (Irini Doytchinova) and N.M.M.; funding acquisition, I.D. (Irini Doytchinova) and N.M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Bulgarian National Science Fund, grant number KP-06-N93/7/2025. AI modeling was performed in the Centre of Excellence in Informatics and ICT, supported by the Science and Education for Smart Growth Operational Program and co-financed by the European Union through the European Structural and Investment funds (Grant No. BG16RFPR002-1.014-0018-C01/2025).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in the study are available upon request.

Acknowledgments

During the preparation of this manuscript, the authors used the AI tool ChatGPT 5.2 for ML model development and for English editing. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
AUC	Area Under the Curve
BS	Binding Score
FN	False Negative
FP	False Positive
GAD65	Glutamic Acid Decarboxylase 65
GB	Gradient Boosting
H-2D^b	MHC class I molecule H-2D^b
H-2K^d	MHC class I molecule H-2K^d
I-A^g7	MHC class II molecule I-A^g7
IEDB	Immune Epitope Database
MCC	Matthews Correlation Coefficient
MHC	Major Histocompatibility Complex
ML	Machine Learning
NBS	Non-Binding Score
NOD	Non-Obese Diabetic (mouse model)
RF	Random Forest
ROC	Receiver Operating Characteristic
SVM	Support Vector Machine
T1D	Type 1 Diabetes
TN	True Negative
TP	True Positive
ZnT8	Zinc Transporter 8

References

1. DeFronzo, R.A.; Ferrannini, E.; Groop, L.; Henry, R.R.; Herman, W.H.; Holst, J.J.; Hu, F.B.; Kahn, C.R.; Raz, I.; Shulman, G.I.; et al. Type 2 diabetes mellitus. Nat. Rev. Dis. Primers 2015, 1, 15019.
2. Atkinson, M.A.; Eisenbarth, G.S.; Michels, A.W. Type 1 diabetes. Lancet 2014, 383, 69–82.
3. World Health Organization. Diabetes Fact Sheet; WHO: Geneva, Switzerland, 2023.
4. International Diabetes Federation. IDF Diabetes Atlas, 10th ed.; IDF: Brussels, Belgium, 2021.
5. Bluestone, J.A.; Herold, K.; Eisenbarth, G. Genetics, pathogenesis and clinical interventions in type 1 diabetes. Nature 2010, 464, 1293–1300.
6. Herold, K.C.; Bundy, B.N.; Long, S.A.; Bluestone, J.A.; DiMeglio, L.A.; Dufort, M.J.; Gitelman, S.E.; Gottlieb, P.A.; Krischer, J.P.; Linsley, P.S.; et al. An anti-CD3 antibody, teplizumab, in relatives at risk for type 1 diabetes. N. Engl. J. Med. 2019, 381, 603–613.
7. Pescovitz, M.D.; Greenbaum, C.J.; Krause-Steinrauf, H.; Becker, D.J.; Gitelman, S.E.; Goland, R.; Gottlieb, P.A.; Marks, J.B.; McGee, P.F.; Moran, A.M.; et al. Rituximab, B-lymphocyte depletion, and preservation of β-cell function. N. Engl. J. Med. 2009, 361, 2143–2152.
8. Orban, T.; Bundy, B.; Becker, D.J.; DiMeglio, L.A.; Gitelman, S.E.; Goland, R.; Gottlieb, P.A.; Greenbaum, C.J.; Marks, J.B.; Monzavi, R.; et al. Costimulation modulation with abatacept in patients with recent-onset type 1 diabetes. Lancet 2011, 378, 412–419.
9. Rodriguez-Fernandez, S.; Almenara-Fuentes, L.; Perna-Barrull, D.; Barneda, B.; Vives-Pi, M. A century later, still fighting back: Antigen-specific immunotherapies for type 1 diabetes. Immunol. Cell Biol. 2021, 99, 461–474.
10. Bender, C.; Wiedeman, A.E.; Hu, A.; Ylescupidez, A.; Sietsema, W.K.; Herold, K.C.; Griffin, K.J.; Gitelman, S.E.; Long, S.A.; T-Rex Study Group. A phase 2 randomized trial with autologous polyclonal expanded regulatory T cells in children with new-onset type 1 diabetes. Sci. Transl. Med. 2024, 16, eadn2404.
11. Vohidova, D.; Desai, P.; Moreno Lozano, A.; Veiseh, O. Modulating immune response for the prevention and treatment of type 1 diabetes. Front. Immunol. 2026, 17, 1715863.
12. Lu, J.; Liu, J.; Li, L.; Lan, Y.; Liang, Y. Cytokines in type 1 diabetes: Mechanisms of action and immunotherapeutic targets. Clin. Transl. Immunol. 2020, 9, e1122.
13. Bonifacio, E.; Ziegler, A.-G.; Klingensmith, G.; Schober, E.; Bingley, P.; Rottenkolber, M.; Theil, A.; Eugster, A.; Puff, R.; Peplow, C.; et al. Effects of high-dose oral insulin on immune responses in children at high risk for type 1 diabetes (Pre-POINT trial). JAMA 2015, 313, 1541–1549.
14. Patronov, A.; Doytchinova, I. T-cell epitope vaccine design by immunoinformatics. Open Biol. 2013, 3, 120139.
15. Anderson, M.S.; Bluestone, J.A. The NOD mouse: A model of immune dysregulation. Annu. Rev. Immunol. 2005, 23, 447–485.
16. Vita, R.; Blazeska, N.; Marrama, D.; Duesing, S.; Bennett, J.; Greenbaum, J.; De Almeida Mendes, M.; Mahita, J.; Wheeler, D.K.; Cantrell, J.R.; et al. The Immune Epitope Database (IEDB): 2024 update. Nucleic Acids Res. 2025, 53, D436–D443.
17. Reynisson, B.; Alvarez, B.; Paul, S.; Peters, B.; Nielsen, M. NetMHCpan-4.1 and NetMHCIIpan-4.0: Improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 2020, 48, W449–W454.
18. Reche, P.A.; Reinherz, E.L. Prediction of peptide–MHC binding using profiles. Methods Mol. Biol. 2007, 409, 185–200.
19. Corper, A.L.; Stratmann, T.; Apostolopoulos, V.; Scott, C.A.; Garcia, K.C.; Kang, A.S.; Wilson, I.A.; Teyton, L. A structural framework for deciphering the link between I-A^g7 and autoimmune diabetes. Science 2000, 288, 505–511.
20. Lee, K.H.; Wucherpfennig, K.W.; Wiley, D.C. Structure of a human insulin peptide–HLA-DQ8 complex and similarities to the I-A^g7 diabetes-associated class II MHC molecule. Nat. Immunol. 2001, 2, 501–507.
21. Doytchinova, I.; Atanasova, M.; Fernandez, A.; Moreno, F.J.; Koning, F.; Dimitrov, I. Modeling peptide–protein interactions by a logo-based method: Application in peptide-HLA binding predictions. Molecules 2024, 29, 284.
22. Doytchinova, I.; Atanasova, M.; Sotirov, S.; Dimitrov, I. In silico identification of peanut peptides suitable for allergy immunotherapy in HLA-DRB1*03:01-restricted patients. Pharmaceuticals 2024, 17, 1097.
23. Doytchinova, I.; Sotirov, S.; Dimitrov, I. Molecular insights into tumor immunogenicity. Curr. Issues Mol. Biol. 2025, 47, 641.
24. Schneider, T.D.; Stephens, R.M. Sequence logos: A new way to display consensus sequences. Nucleic Acids Res. 1990, 18, 6097–6100.
25. Hellberg, S.; Sjöström, M.; Skagerberg, B.; Wold, S. Peptide quantitative structure-activity relationships, a multivariate approach. J. Med. Chem. 1987, 30, 1126–1135.
26. OpenAI. ChatGPT, Version 5.2; Large language model; OpenAI: San Francisco, CA, USA, 2026.
27. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
28. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
29. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
30. Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.I. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020, 2, 56–67.
31. Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437.
32. Zhao, R.; Loftus, D.J.; Appella, E.; Collins, E.J. Structural evidence of T cell xeno-reactivity in the absence of molecular mimicry. J. Exp. Med. 1999, 189, 359–370.
33. White, W.L.; Bai, H.; Kim, C.J.; Jude, K.M.; Sun, R.; Guerrero, L.; Han, X.; Chen, X.; Chaudhuri, A.; Bonzanini, J.E.; et al. Design of solubly expressed miniaturized SMART MHCs. Proc. Natl. Acad. Sci. USA 2026, 123, e2505932123.
34. Rammensee, H.-G.; Falk, K.; Rötzschke, O. Peptides naturally presented by MHC class I molecules. Annu. Rev. Immunol. 1993, 11, 213–244.
35. Falk, K.; Rötzschke, O.; Stevanović, S.; Jung, G.; Rammensee, H.-G. Allele-specific motifs revealed by sequencing of self-peptides eluted from MHC molecules. Nature 1991, 351, 290–296.
36. Matsumura, M.; Fremont, D.H.; Peterson, P.A.; Wilson, I.A. Emerging principles for the recognition of peptide antigens by MHC class I molecules. Science 1992, 257, 927–934.
37. Madden, D.R. The three-dimensional structure of peptide–MHC complexes. Annu. Rev. Immunol. 1995, 13, 587–622.
38. Schölkopf, B.; Smola, A.J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press: Cambridge, MA, USA, 2002.
39. Nakayama, M.; Abiru, N.; Moriyama, H.; Babaya, N.; Liu, E.; Miao, D.; Yu, L.; Wegmann, D.R.; Hutton, J.C.; Elliott, J.F.; et al. Prime role for an insulin epitope in the development of type 1 diabetes in NOD mice. Nature 2005, 435, 220–223.
40. Stadinski, B.D.; Zhang, L.; Crawford, F.; Marrack, P.; Eisenbarth, G.S.; Kappler, J.W. Diabetogenic T cells recognize insulin bound to I-A^g7 in an unexpected, weakly binding register. Proc. Natl. Acad. Sci. USA 2010, 107, 10978–10983.
41. Tisch, R.; McDevitt, H. Insulin-dependent diabetes mellitus. Cell 1996, 85, 291–297.
42. Atkinson, M.A.; Eisenbarth, G.S. Type 1 diabetes: New perspectives on disease pathogenesis and treatment. Lancet 2001, 358, 221–229.
43. Kaufman, D.L.; Clare-Salzler, M.; Tian, J.; Forsthuber, T.; Ting, G.S.; Robinson, P.; Atkinson, M.A.; Sercarz, E.E.; Tobin, A.J.; Lehmann, P.V. Spontaneous loss of T-cell tolerance to glutamic acid decarboxylase in murine insulin-dependent diabetes. Nature 1993, 366, 69–72.
44. Yoshida, K.; Corper, A.L.; Herro, R.; Jabri, B.; Wilson, I.A.; Teyton, L. The diabetogenic mouse MHC class II molecule I-A^g7 is endowed with a switch that modulates TCR affinity. J. Clin. Investig. 2010, 120, 1578–1590.
45. Sette, A.; Sidney, J. Nine major HLA class I supertypes account for the vast majority of HLA-A and HLA-B polymorphism. Immunogenetics 1999, 50, 201–212.
46. Crawford, F.; Stadinski, B.; Jin, N.; Michels, A.; Nakayama, M.; Pratt, P.; Marrack, P.; Eisenbarth, G.; Kappler, J.W. Specificity and detection of insulin-reactive CD4⁺ T cells in type 1 diabetes in the NOD mouse. Proc. Natl. Acad. Sci. USA 2011, 108, 16729–16734.
47. Wenzlau, J.M.; Juhl, K.; Yu, L.; Moua, O.; Sarkar, S.A.; Gottlieb, P.; Rewers, M.; Eisenbarth, G.S.; Jensen, J.; Davidson, H.W.; et al. The cation efflux transporter ZnT8 is a major autoantigen in human type 1 diabetes. Proc. Natl. Acad. Sci. USA 2007, 104, 17040–17045.
48. Dang, M.; Rockell, J.; Wagner, R.; Wenzlau, J.M.; Yu, L.; Hutton, J.C.; Gottlieb, P.A.; Davidson, H.W. Human type 1 diabetes is associated with T-cell autoimmunity to zinc transporter 8. J. Immunol. 2011, 186, 6056–6063.
49. Chujo, D.; Foucat, E.; Nguyen, T.S.; Chaussabel, D.; Banchereau, J.; Ueno, H. ZnT8-Specific CD4⁺ T cells display distinct cytokine expression profiles between type 1 diabetes patients and healthy adults. PLoS ONE 2013, 8, e55595.
50. DiLorenzo, T.P.; Graser, R.T.; Ono, T.; Christianson, G.J.; Chapman, H.D.; Roopenian, D.C.; Nathenson, S.G.; Serreze, D.V. Major histocompatibility complex class I-restricted T cells are required for all but the end stages of diabetes development in nonobese diabetic mice and use a prevalent T cell receptor alpha chain gene rearrangement. Proc. Natl. Acad. Sci. USA 1998, 95, 12538–12543.
51. Nielsen, M.; Lundegaard, C.; Worning, P.; Lauemøller, S.L.; Lamberth, K.; Buus, S.; Brunak, S.; Lund, O. Reliable prediction of T-cell epitopes using neural networks with novel sequence representations. Protein Sci. 2003, 12, 1007–1017.
52. Levisetti, M.G.; Lewis, D.M.; Suri, A.; Unanue, E.R. Weak proinsulin peptide–major histocompatibility complexes are targeted in autoimmune diabetes in mice. Diabetes 2008, 57, 1852–1860.
53. Paul, S.; Lindestam Arlehamn, C.S.; Scriba, T.J.; Dillon, M.B.; Oseroff, C.; Hinz, D.; McKinney, D.M.; Carrasco Pro, S.; Sidney, J.; Peters, B.; et al. Development and validation of a broad scheme for prediction of HLA class II restricted T cell epitopes. J. Immunol. Methods 2015, 422, 28–34.
54. Croft, N.P.; Smith, S.A.; Pickering, J.; Sidney, J.; Peters, B.; Faridi, P.; Witney, M.J.; Sebastian, P.; Flesch, I.E.; Heading, S.L.; et al. Most viral peptides displayed by class I MHC on infected cells are immunogenic. Proc. Natl. Acad. Sci. USA 2019, 116, 3112–3117.

Figure 1. The sequence logo of binders to H-2D^b (a) shows a strong enrichment of Asn at position p5 and hydrophobic residues (Leu, Ile, Val, and Met) at p9. The sequence logo of non-binders to H-2D^b (b) is characterized by weaker and less distinct residue preferences across positions. The SHAP analysis of the RF model (c) illustrates the contribution of physicochemical descriptors at each peptide position to model predictions. Each feature corresponds to a z-scale descriptor (z1: hydrophobicity, z2: steric bulk, and z3: electronic properties) at a specific position (p1–p9). The width of each violin reflects the distribution of SHAP values across the test set, indicating the magnitude and variability of feature importance. Position p5 exhibits the strongest influence across all descriptors, followed by notable contributions at p2 and p9, confirming their critical role in peptide–MHC binding.

Figure 2. The sequence logo of binders to H-2K^d (a) reveals dominant anchor residues at p2 (Tyr and Phe) and p9 (Leu, Ile, and Val). The sequence logo of non-binders to H-2K^d (b) displays weaker residue conservation and reduced positional specificity. The SHAP analysis for the SVM model (c) highlights the contribution of position-specific physicochemical descriptors to classification outcomes. The steric descriptor (z2) at p2 emerges as the most influential feature, indicating the importance of residue size at this anchor position. Additional contributions are observed at p9 (z1 and z3), while central positions (p4–p6) show moderate but distributed effects.

Figure 3. The sequence logo of binders to I-A^g7 (a) depicts a strong preference for negatively charged residues (Asp and Glu) at position p9 and additional enrichment of hydrophobic residues. The sequence logo of non-binders to I-A^g7 (b) exhibits a heterogeneous amino acid distribution with no clear positional preferences. The SHAP analysis of the GB model (c) illustrates that hydrophobicity (z1) dominates the feature importance, particularly at terminal positions p1 and p9, which display the widest SHAP distributions and thus the strongest influence on binding predictions. Central positions (p4–p6) contribute minimally, with SHAP values centered near zero.

Table 1. Model validation metrics on the external test set of binders and non-binders to H-2D^b.

Metrics	Logo Model (95% CI ¹)	AI Model (95% CI ¹) Random Forest
Training set positives	431	431
Training set negatives	431	431
Test set positives	108	108
Test set negatives	108	108
Optimal threshold	BS > NBS ²	0.5
True positives	67	84
True negatives	75	91
False positives	33	17
False negatives	41	24
Sensitivity	0.620 (0.53–0.71)	0.778 (0.70–0.85)
Specificity	0.694 (0.61–0.78)	0.843 (0.77–0.91)
Accuracy	0.657 (0.59–0.72)	0.810 (0.76–0.86)
F1 score	0.644 (0.56–0.71)	0.804 (0.74–0.86)
MCC	0.316 (0.18–0.44)	0.622 (0.51–0.73)
ROC AUC	0.685 (0.53–0.71)	0.888 (0.84–0.93)

¹ 95% CI: confidence interval; ² if the BS is higher than the NBS, the peptide is classified as a binder, and otherwise, it is classified as a non-binder.
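The threshold-dependent metrics in Tables 1–3 follow directly from the four confusion-matrix counts. As a consistency check, the Python sketch below recomputes them for the H-2D^b test set using the standard definitions of sensitivity, specificity, accuracy, F1 score, and MCC. The function name is hypothetical, the confidence intervals are not reproduced because the interval method is not restated in this section, and ROC AUC cannot be derived from counts at a single threshold.

```python
import math
from typing import Dict


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute threshold-dependent classification metrics from raw counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Counts taken from Table 1 (H-2D^b external test set).
logo = confusion_metrics(tp=67, tn=75, fp=33, fn=41)
random_forest = confusion_metrics(tp=84, tn=91, fp=17, fn=24)

for name, values in (("logo", logo), ("RF", random_forest)):
    print(name, {key: round(value, 3) for key, value in values.items()})
```

The recomputed point estimates agree with the tabulated values to three decimal places for both models.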

Table 2. Model validation metrics on the external test set of binders and non-binders to H-2K^d.

Metrics	Logo Model (95% CI ¹)	AI Model (95% CI ¹) SVM (RBF ² Kernel)
Training set positives	164	164
Training set negatives	164	164
Test set positives	41	41
Test set negatives	41	41
Optimal threshold	BS > NBS ³	0.5
True positives	28	35
True negatives	24	37
False positives	17	4
False negatives	13	6
Sensitivity	0.683 (0.53–0.81)	0.854 (0.71–0.93)
Specificity	0.585 (0.43–0.73)	0.902 (0.77–0.97)
Accuracy	0.634 (0.52–0.74)	0.878 (0.79–0.93)
F1 score	0.651 (0.53–0.75)	0.875 (0.78–0.93)
MCC	0.270 (0.06–0.46)	0.757 (0.62–0.86)
ROC AUC	0.738 (0.63–0.84)	0.903 (0.83–0.97)

¹ 95% CI: confidence interval; ² Radial Basis Function; ³ if the BS is higher than the NBS, the peptide is classified as a binder, and otherwise, it is classified as a non-binder.

Table 3. Model validation metrics on the external test set of binders and non-binders to I-A^g7.

Metrics	Logo Model (95% CI ¹)	AI Model (95% CI ¹) Gradient Boosting
Training set positives	301	301
Training set negatives	301	301
Test set positives	75	75
Test set negatives	75	75
Optimal threshold	BS > NBS ²	0.5
True positives	40	62
True negatives	54	71
False positives	21	4
False negatives	35	13
Sensitivity	0.533 (0.42–0.64)	0.827 (0.74–0.91)
Specificity	0.720 (0.62–0.82)	0.947 (0.89–0.99)
Accuracy	0.627 (0.55–0.70)	0.887 (0.83–0.93)
F1 score	0.588 (0.50–0.68)	0.879 (0.82–0.93)
MCC	0.258 (0.11–0.40)	0.779 (0.68–0.87)
ROC AUC	0.726 (0.65–0.80)	0.906 (0.85–0.96)

¹ 95% CI: confidence interval; ² if the BS is higher than the NBS, the peptide is classified as a binder, and otherwise, it is classified as a non-binder.

Table 4. Nonamers originating from mouse GAD65 predicted to bind to NOD mice-specific MHC.

Starting Position	Sequence	Binding to
17	SADPENPGT	H-2K^d
109	AFLHATDLL	I-A^g7
119	LQYVVKSFD	I-A^g7
150	ADQPQNLEE	H-2K^d
151	DQPQNLEEI	H-2D^b
173	TGHPRYFNQ	H-2K^d
186	LDMVGLAAD	H-2K^d, I-A^g7
197	TSTANTNMF	H-2D^b
199	TANTNMFTY	H-2D^b
228	IGWPGGSGD	I-A^g7
243	GAISNMYAM	H-2D^b
253	IARYKMFPE	H-2K^d
288	GAAALGIGT	H-2K^d
360	WMHVDAAWG	H-2D^b
389	SVTWNPHKM	H-2D^b
426	LFQQDKHYD	H-2K^d, I-A^g7
443	ALQCGRHVD	I-A^g7
445	QCGRHVDVF	H-2D^b
481	LYTIIKNRE	I-A^g7
499	PQHTNVCFW	H-2D^b
561	ISNPAATHQ	H-2K^d
564	PAATHQDID	H-2K^d

Table 5. Nonamers originating from mouse ZnT8 predicted to bind to NOD mice-specific MHC.

Starting Position	Sequence	Binding to
26	LRQKPVNKD	I-A^g7
31	VNKDQCPGD	I-A^g7
74	CAASAICFI	H-2D^b
79	ICFIFMVAE	I-A^g7
234	ALIIYFKPD	I-A^g7
260	ASTVMILKD	I-A^g7
286	VKEIILAVD	I-A^g7
318	VATAASQDS	H-2K^d
333	IAQALSSFD	H-2K^d
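The allele distribution of the ZnT8 predictions can be read off Table 5 with a simple tally. The Python sketch below transcribes the table rows and counts predicted binders per MHC molecule. The variable names are illustrative, and the allele labels are plain strings copied from the table.

```python
from collections import Counter
from typing import List, Tuple

# Rows transcribed from Table 5: (starting position, sequence, allele).
ZNT8_TABLE5: List[Tuple[int, str, str]] = [
    (26, "LRQKPVNKD", "I-A^g7"),
    (31, "VNKDQCPGD", "I-A^g7"),
    (74, "CAASAICFI", "H-2D^b"),
    (79, "ICFIFMVAE", "I-A^g7"),
    (234, "ALIIYFKPD", "I-A^g7"),
    (260, "ASTVMILKD", "I-A^g7"),
    (286, "VKEIILAVD", "I-A^g7"),
    (318, "VATAASQDS", "H-2K^d"),
    (333, "IAQALSSFD", "H-2K^d"),
]

# Count how many predicted nonamers are assigned to each MHC molecule.
allele_counts = Counter(allele for _, _, allele in ZNT8_TABLE5)

for allele, count in allele_counts.most_common():
    print(allele, count)
```

Six of the nine ZnT8 nonamers are assigned to I-A^g7, two to H-2K^d, and one to H-2D^b.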

Table 6. Peptides selected for synthesis and in vivo tests for immunogenicity in NOD mice.

Starting Position	Sequence	Binding to
Mouse GAD65
150	ADQPQNLEEI	H-2D^b, H-2K^d
186	LDMVGLAAD	H-2K^d, I-A^g7
426	LFQQDKHYD	H-2K^d, I-A^g7
443	ALQCGRHVDVF	H-2D^b, I-A^g7
Insulin-2
33 (9)	SHLVEALYLVCGERG	I-A^g7 [39,40]
ZnT8
26	LRQKPVNKDQCPGD	I-A^g7
74	CAASAICFIFMVAE	H-2D^b, I-A^g7
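The longer GAD65 and ZnT8 peptides in Table 6 coincide with the union of overlapping nonamers listed in Tables 4 and 5 (for example, 443 ALQCGRHVD and 445 QCGRHVDVF together span ALQCGRHVDVF). Whether the peptides were derived this way is not stated explicitly, so the following Python sketch should be read as one plausible reconstruction. The function name and data layout are assumptions, and the merge assumes equal-length windows taken from the same protein.

```python
from typing import List, Set, Tuple

Hit = Tuple[int, str, Set[str]]  # (starting position, sequence, alleles)


def merge_overlapping(hits: List[Hit]) -> List[Hit]:
    """Merge predicted binders whose residue ranges overlap into one peptide."""
    merged: List[Hit] = []
    for start, seq, alleles in sorted(hits, key=lambda hit: hit[0]):
        if merged:
            prev_start, prev_seq, prev_alleles = merged[-1]
            prev_end = prev_start + len(prev_seq) - 1
            if start <= prev_end:
                overlap = prev_end - start + 1
                # Shared residues must agree, otherwise positions are wrong.
                if prev_seq[-overlap:] != seq[:overlap]:
                    raise ValueError(f"inconsistent overlap at position {start}")
                merged[-1] = (
                    prev_start,
                    prev_seq + seq[overlap:],
                    prev_alleles | alleles,
                )
                continue
        merged.append((start, seq, set(alleles)))
    return merged


# Nonamers transcribed from Tables 4 and 5.
gad65_hits: List[Hit] = [
    (150, "ADQPQNLEE", {"H-2K^d"}),
    (151, "DQPQNLEEI", {"H-2D^b"}),
    (443, "ALQCGRHVD", {"I-A^g7"}),
    (445, "QCGRHVDVF", {"H-2D^b"}),
]
znt8_hits: List[Hit] = [
    (26, "LRQKPVNKD", {"I-A^g7"}),
    (31, "VNKDQCPGD", {"I-A^g7"}),
    (74, "CAASAICFI", {"H-2D^b"}),
    (79, "ICFIFMVAE", {"I-A^g7"}),
]

gad65_merged = merge_overlapping(gad65_hits)
znt8_merged = merge_overlapping(znt8_hits)
```

Under these assumptions the merged sequences and allele sets match the corresponding GAD65 and ZnT8 rows of Table 6.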

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
