1. Introduction
Influenza viruses are segmented, negative-sense ribonucleic acid (RNA) viruses belonging to the Orthomyxoviridae family and are responsible for recurrent respiratory infections in humans and animals [
1]. Based on antigenic differences in the nucleoprotein and matrix protein, influenza viruses are classified into three major types: A, B, and C. Among these, Influenza A virus is the most clinically significant due to its broad host range, high transmissibility, and ability to cause severe disease outbreaks [
2]. Seasonal influenza epidemics and periodic pandemics associated with Influenza A virus continue to result in substantial morbidity and mortality worldwide, making it a persistent public health concern.
A defining characteristic of Influenza A virus is its pronounced genetic variability. The virus evolves rapidly through molecular mechanisms such as antigenic drift, driven by point mutations, and antigenic shift, resulting from reassortment of gene segments when multiple viral strains co-infect the same host [
3]. These evolutionary processes can give rise to novel viral subtypes with altered antigenic profiles, as observed during historical influenza pandemics. Such genetic plasticity frequently leads to a mismatch between circulating viral strains and existing vaccines, thereby reducing vaccine effectiveness and necessitating continuous surveillance and vaccine reformulation [
4]. Given these challenges, the development of broadly protective and long-lasting vaccines remains a critical objective in influenza research. In recent years, increasing attention has been directed toward cell-mediated immunity, particularly T-cell-based responses, which have been shown to provide cross-strain protection by recognizing conserved viral regions [
5]. Central to this approach is the identification of T-cell epitopes (TCEs), short peptide fragments derived from viral proteins that are presented by major histocompatibility complex (MHC) molecules and recognized by T lymphocytes [
6]. However, the experimental identification of TCEs through immunological assays is labor-intensive, costly, and difficult to scale across the full viral proteome [
7]. As a result, computational methods have emerged as indispensable tools for accelerating epitope discovery and guiding experimental validation [
8]. These methods aim to rapidly screen large numbers of peptide sequences and prioritize candidates for downstream experimental validation, thereby reducing both cost and development time [
3].
Traditional computational approaches to epitope prediction have largely relied on handcrafted physicochemical features, such as amino acid composition, hydrophobicity indices, charge, and sequence-derived descriptors. While these features are biologically motivated and have shown utility in various machine learning (ML) frameworks, they suffer from inherent limitations [
9]. Most notably, such representations treat residues in a largely independent or weakly contextual manner and depend heavily on prior assumptions about which properties are most relevant [
10]. This limits their ability to capture complex sequence dependencies and often reduces generalization when applied to unseen peptides or evolving viral strains [
11].
Recent advances in deep learning (DL) have introduced new opportunities to address these limitations. In particular, protein language models (PLMs), inspired by natural language processing, have demonstrated the ability to learn rich and contextual representations directly from raw protein sequences [
12]. Trained on large-scale protein databases using self-supervised objectives, these models capture sequence patterns, residue relationships, and evolutionary constraints without requiring manual feature engineering [
13]. When combined with attention-based architectures such as Transformers, these representations enable models to reason about interactions across entire sequences, making them especially suitable for tasks involving short but information-dense peptides [
14]. Recent work has also explored adapting pretrained protein language models to specialized biological domains, including immunological sequence analysis, further highlighting their potential for epitope prediction and related immunoinformatics applications.
Recent advances in immunoinformatics have also introduced several DL-based frameworks for epitope prediction. Convolutional neural network (CNN) architectures have been widely explored for modeling sequence patterns associated with immunogenic peptides, demonstrating improved performance over traditional machine learning models that rely on handcrafted descriptors. More recently, hybrid architectures integrating multiple deep learning components have been proposed. For example, the MITNet framework combines convolutional layers with Transformer-based attention mechanisms to capture contextual relationships within peptide sequences [
13]. These studies highlight the increasing use of deep neural networks to model complex sequence dependencies relevant to immune recognition.
In this study, a transformer-based viroinformatics framework is proposed for the prediction of TCEs from Influenza A virus peptide sequences [
15,
16,
17]. The framework leverages a pretrained PLM to generate contextualized sequence embeddings that serve as an automated and biologically informed alternative to handcrafted features [
18]. Among available protein language models, Evolutionary Scale Modeling-2 (ESM-2) was selected due to its strong representational capability, compact architecture, and suitability for modeling short peptide sequences, making it well aligned with TCE prediction tasks [
19]. These embeddings are subsequently processed by a lightweight Transformer classifier that learns epitope-specific patterns through self-attention [
20]. To enhance practical utility, the framework incorporates uncertainty-aware prediction using Monte Carlo dropout, allowing the model to associate confidence estimates with its predictions [
21]. In addition, attention-based analysis is employed to provide residue-level interpretability, offering insight into sequence regions that influence model decisions.
The proposed approach is evaluated within a binary classification setting, distinguishing epitopes from non-epitopes, and is assessed using a comprehensive set of performance metrics and validation strategies [
22,
23]. Comparative and ablation analyses are conducted to examine the contribution of individual architectural components and to contextualize performance relative to existing methods [
24].
The main contributions of this study can be summarized as follows:
A protein language model-based framework is introduced for TCE prediction that eliminates the need for manual physicochemical feature engineering.
A task-specific Transformer classifier is employed to capture residue-level interactions relevant to epitope recognition.
An uncertainty-aware prediction mechanism is integrated to distinguish high-confidence predictions from ambiguous cases.
Attention-based interpretability is provided to support biological insight and transparency in model behavior.
The framework is systematically evaluated on Influenza A virus peptide sequences, demonstrating stable and reliable performance.
2. Materials and Methods
The overall workflow of the proposed transformer-based viroinformatics framework is illustrated in
Figure 1. Each component of the framework is described in detail in the following subsections.
2.1. Acquiring Dataset
The dataset used in this study consists of experimentally validated peptide sequences derived from Influenza A virus proteins, obtained from the Immune Epitope Database (IEDB) [
25]. Each peptide is annotated as either a TCE or a non-epitope, forming a binary classification dataset. The annotations are based on curated experimental evidence reported in publicly available immunological repositories. All peptide sequences were subjected to a standard curation pipeline prior to model development. Duplicate sequences were removed to avoid information leakage between training and evaluation phases [
26]. Because the peptide sequences analyzed in this study are relatively short (8–15 amino acids), additional clustering by sequence identity was not applied in order to preserve dataset diversity and avoid excessive reduction in available samples. Peptides containing ambiguous or non-standard amino acid symbols were excluded to preserve biological validity. No synthetic data generation, resampling, or class balancing strategies were applied, as the dataset exhibited an approximately balanced distribution of epitopes and non-epitopes. Peptide lengths spanned a biologically meaningful range consistent with known TCE characteristics. For model evaluation, stratified sampling strategies were employed to preserve the class distribution across training and validation subsets. In addition to fixed train-test splits, stratified cross-validation was used to assess robustness and reduce sensitivity to data partitioning.
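The curation steps described above can be sketched as a simple filtering pass. The rules below (restriction to the 20 standard amino acids, removal of duplicates, and an 8–15-mer length window) follow the description in the text, but the function and variable names are illustrative; GILGFVFTL is a well-characterized influenza M1-derived epitope used here only as a toy record.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical amino acids

def curate_peptides(records, min_len=8, max_len=15):
    """Deduplicate peptides, drop sequences with ambiguous or non-standard
    residues (e.g. X, B, Z), and keep only lengths in the expected range."""
    seen, curated = set(), []
    for seq, label in records:
        seq = seq.strip().upper()
        if seq in seen:                      # remove duplicates (leakage control)
            continue
        if not set(seq) <= STANDARD_AA:      # exclude ambiguous symbols
            continue
        if not (min_len <= len(seq) <= max_len):
            continue
        seen.add(seq)
        curated.append((seq, label))
    return curated

# Toy input: a valid 9-mer, its duplicate, a peptide with an ambiguous 'X',
# and a peptide that is too short. Only the first record survives curation.
data = [("GILGFVFTL", 1), ("GILGFVFTL", 1), ("ASNENXETM", 0), ("ACDE", 0)]
clean = curate_peptides(data)
```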
2.2. Problem Formulation and Learning Objective
The epitope prediction task is formulated as a binary sequence classification problem. Given an input peptide sequence S = (s1, s2, ..., sL), where si represents an amino acid residue, the objective is to learn a function
f : S → {0, 1},
where 1 denotes a T-cell epitope and 0 denotes a non-epitope.
Rather than producing a hard class label directly, the model outputs a probability score in the range [0, 1], representing the likelihood that a given peptide is an epitope. This probabilistic formulation enables downstream confidence analysis and uncertainty estimation while retaining a binary decision boundary for evaluation. For evaluation purposes, a default probability threshold of 0.5 was used to convert predicted probabilities into binary epitope or non-epitope class labels.
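This decision rule can be made concrete with a minimal sketch (function names illustrative): a sigmoid maps a raw classifier score into [0, 1], and the default 0.5 threshold binarizes the result.

```python
import math

def sigmoid(z: float) -> float:
    """Map a raw classifier score to an epitope probability in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-z))

def to_label(prob: float, threshold: float = 0.5) -> int:
    """Binarize an epitope probability: 1 = T-cell epitope, 0 = non-epitope."""
    return 1 if prob >= threshold else 0

# A positive score maps above 0.5 (epitope); a negative score maps below it.
p_pos = sigmoid(1.2)
p_neg = sigmoid(-1.2)
```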
2.3. Protein Language Model-Based Representation Learning
Protein Language Models (PLMs) form the representational foundation of the proposed framework. These models are trained on large collections of protein sequences using self-supervised objectives, allowing them to learn statistical and structural regularities inherent to biological sequences [
18]. Through this process, PLMs capture residue compatibility, sequence context, and evolutionary constraints without relying on manually defined biochemical features.
In this study, Evolutionary Scale Modeling 2 (ESM-2) was employed as the pretrained PLM backbone [
19]. ESM-2 is a Transformer-based model trained on large-scale protein sequence corpora using self-supervised learning objectives, enabling it to capture rich contextual and evolutionary information from amino acid sequences. Compared with earlier protein embedding approaches, ESM-2 provides compact yet expressive representations that are well suited to modeling short peptide sequences. In the proposed framework, ESM-2 serves strictly as a frozen, automated feature extractor; no task-specific fine-tuning of the language model parameters is performed. Each peptide sequence is tokenized at the amino acid level and passed through the PLM, which produces a high-dimensional embedding for every residue. These embeddings are contextualized, meaning that each residue representation depends on its surrounding amino acids rather than on its identity alone.
To enable downstream classification, a fixed-length representation is derived from the sequence-level embedding (CLS token) [
27]. This vector acts as a compact summary of the entire peptide, capturing both local residue properties and global sequence patterns. The PLM parameters are kept frozen during training to retain general protein knowledge, reduce overfitting, and improve computational efficiency. The role of the PLM within the overall pipeline is illustrated in the model architecture diagram.
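The frozen-extraction step can be sketched in PyTorch. The tiny `ToyPLM` below is a stand-in for the actual ESM-2 checkpoint (loading `facebook/esm2_t6_8M_UR50D` through the Hugging Face `transformers` library follows the same freeze-then-pool pattern); all class names, dimensions, and the token map are illustrative assumptions, not the real ESM-2 vocabulary.

```python
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"
TOKENS = {a: i + 2 for i, a in enumerate(AA)}  # 0 = <cls>, 1 = <pad>

class ToyPLM(nn.Module):
    """Stand-in for a pretrained PLM such as ESM-2: maps token ids to
    contextual residue embeddings via a small Transformer encoder."""
    def __init__(self, dim=32):
        super().__init__()
        self.embed = nn.Embedding(len(AA) + 2, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, ids):
        return self.encoder(self.embed(ids))   # (batch, seq_len, dim)

def cls_embedding(plm, peptide):
    """Tokenize a peptide, prepend <cls>, run the frozen PLM, and return
    the CLS vector as a fixed-length sequence-level representation."""
    ids = torch.tensor([[0] + [TOKENS[a] for a in peptide]])
    with torch.no_grad():                       # frozen PLM: no gradients
        out = plm(ids)
    return out[0, 0]                            # position 0 = CLS token

plm = ToyPLM()
for p in plm.parameters():
    p.requires_grad_(False)                     # freeze pretrained parameters

vec = cls_embedding(plm, "GILGFVFTL")           # fixed-length (dim,) summary
```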
2.4. Transformer-Based Epitope Classification Module
While the PLM provides biologically informed sequence representations, it is not trained to discriminate epitopes from non-epitopes. To address this, a task-specific Transformer-based classifier is employed [
16]. The Transformer architecture leverages self-attention to model interactions between different residue positions within a peptide, allowing the classifier to focus on residue combinations that are informative for epitope recognition.
The Transformer module used in this framework is a lightweight, task-specific encoder designed to operate on PLM-derived sequence representations rather than raw amino acid tokens. Unlike the pretrained Transformer architecture underlying the protein language model, this classifier is trained from scratch and optimized exclusively for the epitope prediction task [
17]. Its role is to refine the contextual embeddings produced by the PLM by learning epitope-relevant residue interactions through self-attention, thereby bridging general protein sequence knowledge and task-specific discrimination.
Self-attention allows the model to weigh residues according to their contribution to the classification decision. This is particularly relevant for epitope prediction, where immunogenicity often depends on motifs or positional relationships rather than on individual residues in isolation. Multi-head attention further allows the model to capture multiple interaction patterns simultaneously. The output of the Transformer encoder is passed through a fully connected layer with a sigmoid activation function, producing a probability score that reflects epitope likelihood. The complete computational workflow of the model is formally summarized in Algorithm 1.
Algorithm 1: PLM-Transformer-Based T-Cell Epitope Prediction
Input: Peptide dataset D = {(si, yi)}, where si is a peptide sequence and yi ∈ {0, 1}
Output: Predicted epitope probability ŷ, uncertainty u
1: Initialize pretrained Protein Language Model (PLM)
2: Freeze PLM parameters
3: Initialize Transformer-based classifier T
4: Define loss function L and optimizer O
5: Perform K-fold cross-validation
6: for k = 1 to K do
7:   Split D into training set Dtrain and test set Dtest
8:   for epoch = 1 to E do
9:     for each mini-batch (S, Y) in Dtrain do
10:      Tokenize peptide sequences S
11:      E ← PLM(S)            # contextual embeddings
12:      Z ← T(E)              # attention-based encoding
13:      Ŷ ← Sigmoid(Z)        # prediction probabilities
14:      Compute loss L(Ŷ, Y)
15:      Update T using optimizer O
16:    end for
17:  end for
18:  Evaluate T on Dtest and store performance metrics
19: end for
20: Compute mean and standard deviation of metrics across folds
21: Enable dropout layers in T
22: for each peptide s in D do
23:   Perform multiple stochastic forward passes
24:   Compute mean prediction ŷ and uncertainty u
25: end for
26: Return predictions, uncertainties, and performance statistics
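The self-attention computation at the heart of the classifier can be sketched numerically. The single-head formulation below is a deliberate simplification of the multi-head encoder described in the text; all dimensions and weight matrices are arbitrary illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over residue embeddings.
    X: (L, d) PLM-derived embeddings; returns (L, d) context vectors and
    the (L, L) attention weight matrix."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    A = softmax(scores, axis=-1)        # each row sums to 1: per-residue weights
    return A @ V, A

rng = np.random.default_rng(0)
L, d = 9, 16                            # a 9-mer peptide, 16-dim embeddings
X = rng.normal(size=(L, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
H, A = self_attention(X, Wq, Wk, Wv)

# Pool over positions and apply a sigmoid head to get an epitope probability.
w, b = rng.normal(size=d) * 0.1, 0.0
prob = 1.0 / (1.0 + np.exp(-(H.mean(axis=0) @ w + b)))
```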
2.5. Model Training Strategy and Optimization
Model training is restricted to the parameters of the Transformer-based classifier, while the PLM remains frozen [
15]. This separation allows the model to exploit pretrained biological knowledge while learning task-specific decision boundaries from labeled epitope data. Training uses mini-batch gradient descent optimized with binary cross-entropy loss, which measures the discrepancy between predicted probabilities and true labels.
An adaptive optimization algorithm is used to handle variability in peptide sequence patterns. Dropout layers are incorporated within the Transformer encoder as a regularization mechanism to mitigate overfitting. These dropout layers also serve a secondary role during inference for uncertainty estimation.
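The binary cross-entropy objective used for training can be written out directly; the clipping constant below is a standard numerical-stability device, and the example values are arbitrary.

```python
import numpy as np

def bce_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy between true labels and predicted probabilities:
    L = -mean( y*log(p) + (1-y)*log(1-p) )."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))

y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.1, 0.8, 0.2])
loss_good = bce_loss(y, p)            # confident, correct predictions
loss_bad = bce_loss(y, 1 - p)         # same confidence, wrong predictions
```

As expected, confidently wrong predictions are penalized far more heavily than confidently correct ones.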
2.6. Uncertainty-Aware Prediction Using Monte Carlo Dropout
To quantify prediction confidence, the proposed framework incorporates Monte Carlo dropout. During inference, dropout remains active and multiple stochastic forward passes are performed for each peptide sequence [
21]. The mean prediction across these passes is used as the final epitope probability, while the variability across predictions is interpreted as predictive uncertainty.
Low uncertainty indicates stable model behavior and high confidence in the classification outcome, whereas higher uncertainty highlights ambiguous peptides that may require further experimental validation. This uncertainty-aware formulation is particularly valuable in immunological applications, where computational predictions are often used to prioritize candidates for downstream laboratory testing.
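The Monte Carlo dropout procedure can be sketched as follows: dropout stays active at inference, several stochastic forward passes are made, and the spread of the resulting probabilities serves as the uncertainty estimate. The single-layer model and all parameter choices below are illustrative.

```python
import numpy as np

def mc_dropout_predict(x, weights, n_passes=100, p_drop=0.2, seed=0):
    """Monte Carlo dropout at inference: run several stochastic forward
    passes and report mean probability and std (predictive uncertainty)."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_passes):
        mask = rng.random(x.shape) >= p_drop          # random dropout mask
        h = (x * mask) / (1.0 - p_drop)               # inverted-dropout scaling
        z = h @ weights
        preds.append(1.0 / (1.0 + np.exp(-z)))        # sigmoid output
    preds = np.array(preds)
    return preds.mean(), preds.std()

rng = np.random.default_rng(1)
x, w = rng.normal(size=16), rng.normal(size=16) * 0.5
mean_p, uncertainty = mc_dropout_predict(x, w)
```

A low `uncertainty` flags a stable, high-confidence call, while a high value marks an ambiguous peptide for further scrutiny.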
2.7. Model Interpretability via Attention Analysis
To improve transparency, the self-attention weights learned by the Transformer classifier are extracted and analyzed. These weights reflect how strongly different residue positions influence one another during classification. Visualizing attention distributions helps identify residues or regions that contribute strongly to the model’s decision [
28].
Although attention weights do not establish causal relationships, they provide useful qualitative insight into model focus and allow assessment of whether predictions are driven by biologically plausible sequence regions [
29]. This interpretability layer helps bridge the gap between DL predictions and immunological reasoning.
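One simple way to turn an attention matrix into residue-level scores, consistent with the qualitative analysis described above, is to average the attention each position receives across all query positions; the summary statistic and the toy matrix below are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def residue_importance(A):
    """Summarize an (L, L) self-attention matrix as per-residue importance:
    the average attention each position receives, normalized to sum to 1."""
    imp = A.mean(axis=0)
    return imp / imp.sum()

# Toy attention matrix for a 5-mer in which position 2 dominates:
# every query position allocates 0.6 of its attention to residue 2.
A = np.full((5, 5), 0.1)
A[:, 2] = 0.6
scores = residue_importance(A)
top = int(np.argmax(scores))
```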
2.8. Performance Evaluation Protocol
Model performance is evaluated using a comprehensive set of binary classification metrics, including accuracy, precision, recall (sensitivity), specificity, F1-score, area under the Receiver Operating Characteristic (ROC) curve, the Matthews correlation coefficient, and the Gini coefficient [
30]. These metrics jointly capture predictive accuracy, class balance, and discrimination capability.
To assess robustness, stratified cross-validation is employed, ensuring that each fold maintains the original class distribution. Confusion matrix analysis is used to examine error patterns, while ROC analysis evaluates threshold-independent discrimination performance. Together, these evaluation strategies provide a rigorous assessment of model reliability.
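Two of the less common metrics can be computed directly from predictions: ROC AUC via the rank-sum (Mann-Whitney) statistic, from which the Gini coefficient follows as 2·AUC − 1, and the Matthews correlation coefficient from confusion-matrix counts. The example labels and scores are arbitrary.

```python
import numpy as np

def auc_score(y_true, y_prob):
    """ROC AUC via the rank-sum statistic: the fraction of positive-negative
    pairs in which the positive is scored higher (ties count half)."""
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    wins = ((pos[:, None] > neg[None, :]).sum()
            + 0.5 * (pos[:, None] == neg[None, :]).sum())
    return wins / (len(pos) * len(neg))

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

y = np.array([1, 1, 1, 0, 0, 0])
p = np.array([0.9, 0.8, 0.4, 0.6, 0.2, 0.1])
auc = auc_score(y, p)       # 8 of 9 pairs correctly ordered
gini = 2 * auc - 1          # Gini coefficient is a rescaling of ROC AUC
```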
2.9. Comparative Evaluation and Ablation Analysis
The proposed framework is evaluated against classical machine learning models trained on handcrafted physicochemical features to assess the benefit of representation learning. Additional comparisons with existing epitope prediction tools and DL baselines without pretrained embeddings are conducted to contextualize performance gains. Ablation experiments are performed by selectively removing or modifying key components, such as the protein language model or the attention mechanism [
31]. These experiments isolate the contribution of each component and clarify how architectural design choices influence predictive behavior.
2.10. Implementation and Computational Environment
All experiments are implemented in Python (version 3.12). DL components are developed using PyTorch (version 2.7.0), while the pretrained protein language model is integrated through Transformer-based frameworks [
32]. Evaluation procedures are implemented using standard scientific computing tools. Experiments are conducted on hardware equipped with Graphics Processing Unit (GPU) acceleration. Freezing the PLM parameters significantly reduces training time and memory requirements, making the framework suitable for practical deployment.
4. Discussion
This study examined the application of PLMs combined with attention-based Transformer architectures for the binary prediction of TCEs from Influenza A virus peptide sequences [
3]. Influenza A virus is characterized by rapid genetic evolution driven by antigenic drift and reassortment, which frequently undermines vaccine efficacy and complicates immune targeting [
41]. In this context, computational identification of conserved and immunologically relevant peptide regions represents an important step toward supporting T-cell-based vaccine strategies [
13]. Unlike traditional epitope prediction approaches that rely on handcrafted physicochemical descriptors, the proposed framework learns representations directly from raw amino acid sequences, enabling the model to capture contextual and relational information that is difficult to encode manually. The observed performance indicates that this representation-driven strategy offers both reliable predictive capability and improved interpretability.
A key observation from this work is the importance of sequence context in TCE recognition. For Influenza A virus, subtle sequence variations often arising from mutation or reassortment can significantly alter immune recognition while preserving overall protein composition. Classical ML models typically summarize peptides using averaged residue properties or frequency-based descriptors, which tend to overlook such contextual dependencies. In contrast, the PLM employed in this study generates embeddings in which each residue representation is conditioned on its surrounding sequence. This allows the downstream Transformer classifier to model residue interactions and motif-like patterns that are likely to be relevant for T-cell recognition of influenza-derived peptides.
The architectural separation between the pretrained PLM and the task-specific Transformer classifier proved to be an effective design choice. The language model contributes general protein sequence knowledge learned from large-scale pretraining, capturing biologically meaningful patterns that extend beyond virus-specific datasets. The Transformer classifier, in turn, adapts these representations to the epitope prediction task by learning decision boundaries specific to Influenza A virus peptides. Freezing the language model parameters reduced computational cost and limited overfitting, while preserving representational richness. This modular design also provides a flexible computational framework that may be adaptable to other viral pathogens; however, the present study evaluates the approach specifically on Influenza A virus peptide sequences.
An additional design consideration concerns the selection of the specific protein language model configuration used in this study. The ESM-2 (facebook/esm2_t6_8M_UR50D) model was selected as the pretrained backbone due to its compact architecture and favorable balance between representational capability and computational efficiency. Larger variants of ESM-2, such as esm2_t33_650M_UR50D, contain substantially more parameters and therefore provide higher model capacity; however, they also impose significantly greater computational and memory requirements. For peptide-level prediction tasks involving short sequences, the marginal performance gains obtained from very large models may not justify the associated computational overhead. In contrast, the smaller ESM-2 configuration used in this work is capable of capturing meaningful contextual relationships among amino acid residues while enabling efficient training and inference. This balance between performance and efficiency is particularly important for practical immunoinformatics pipelines, where large-scale screening of peptide candidates is often required.
Interpretability remains a critical consideration in the application of DL models to immunological problems. The attention-based analysis conducted in this study provides insight into which residues and sequence regions influence epitope predictions. While attention weights should not be interpreted as direct evidence of biological causality, the concentration of attention on specific regions suggests that the model relies on structured sequence patterns rather than arbitrary correlations. For Influenza A virus, such regions may correspond to conserved or immunologically relevant segments that are repeatedly recognized by T cells, lending additional biological plausibility to the model’s behavior. This level of transparency is particularly valuable when computational predictions are used to guide experimental validation. Although the attention patterns provide useful qualitative insight into sequence regions that influence model predictions, a systematic comparison between attention-highlighted residues and experimentally characterized immunodominant epitopes of Influenza A virus would require detailed immunological and structural analyses across multiple datasets, which represents an important direction for future investigation.
The incorporation of uncertainty estimation through Monte Carlo dropout further enhances the practical relevance of the proposed framework. In influenza research, where experimental validation is costly and time-sensitive, incorrect high-confidence predictions can lead to inefficient use of resources. By associating each prediction with an uncertainty estimate, the model distinguishes high-confidence epitope candidates from ambiguous peptides that may require additional scrutiny. This capability aligns the computational pipeline more closely with real-world decision-making processes in vaccine research and immunological screening.
It is worth noting that several recent sequence prediction studies employ pretrained protein language model embeddings followed by simple linear or shallow classification layers. While such approaches can achieve competitive performance, they often treat the PLM embedding as a fixed global representation without further modeling residue-level interactions relevant to the downstream task. In the proposed framework, the additional Transformer module serves as a task-specific refinement stage that operates on PLM-derived representations to capture positional relationships and interaction patterns within peptide sequences. This design allows the classifier to focus on sequence regions that contribute most strongly to epitope recognition, while maintaining a lightweight architecture that avoids the computational cost of full PLM fine-tuning. Consequently, the contribution of the present work lies in integrating contextual PLM representations with attention-based sequence modeling and uncertainty-aware prediction within a unified framework tailored for Influenza A T-cell epitope discovery.
Despite these strengths, several limitations should be acknowledged. The current framework operates exclusively at the sequence level and does not explicitly incorporate structural information, host-specific MHC allele context, or broader immunological features, all of which influence T-cell recognition. In addition, while freezing the protein language model improves stability, fine-tuning on larger and more diverse epitope datasets may further enhance task-specific performance. The dataset used in this study is derived from curated experimental records in IEDB, and potential dataset biases or incomplete coverage of allele-specific immune responses may influence model generalization. Finally, this study focuses on Influenza A virus to enable controlled analysis; extending the framework to multiple viruses will require careful consideration of domain shifts arising from differing evolutionary pressures and sequence characteristics. Future work may therefore explore (a) integrating allele-specific binding information, peptide-MHC interaction modeling, or structural context to further enhance biological realism; (b) more detailed calibration analyses of uncertainty estimates, including reliability diagrams and comparisons with alternative uncertainty quantification techniques; and (c) evaluation on independent external datasets, together with more detailed statistical analyses, to further assess the generalization and reliability of the proposed framework.
In addition, although stratified cross-validation provides a robust internal evaluation strategy, future studies may further examine the generalization capability of the framework using independent datasets or homology-aware data partitioning strategies.
5. Conclusions
In this study, a transformer-based viroinformatics framework was presented for the prediction of TCEs from Influenza A virus peptide sequences. Influenza A virus is characterized by rapid genetic evolution and antigenic variability, which continue to challenge conventional vaccine design strategies. By integrating a pretrained PLM with a task-specific Transformer classifier, the proposed framework moves beyond traditional feature engineering and learns directly from raw amino acid sequences. This design enables the model to capture contextual and relational information that is difficult to represent using handcrafted physicochemical descriptors and is particularly relevant for modeling immune recognition in highly variable viral pathogens.
The findings demonstrate that protein language model-derived embeddings provide a strong representational foundation for epitope prediction, while the attention-based Transformer effectively learns epitope-specific sequence patterns from these embeddings. The framework exhibited stable and competitive performance across multiple evaluation settings, suggesting reliable generalization rather than dependence on favorable data partitions. The integration of uncertainty estimation further strengthens the practical utility of the model by enabling differentiation between high-confidence epitope predictions and ambiguous peptide candidates, an important consideration when computational outputs are used to guide experimental validation in influenza research.
Beyond predictive performance, the proposed framework offers interpretability through attention-based analysis, providing residue-level insights into sequence regions that influence epitope recognition. Although such explanations should be interpreted with caution, they contribute to increased transparency and help align model behavior with biological reasoning. Future work may explore the incorporation of structural information, host-specific context, or larger multi-virus datasets to further improve predictive accuracy and biological relevance.