Tumor Neoepitope-Based Vaccines: A Scoping Review on Current Predictive Computational Strategies

Therapeutic cancer vaccines have been considered in recent decades as important immunotherapeutic strategies capable of leading to tumor regression. In the development of these vaccines, the identification of neoepitopes plays a critical role, and different computational methods have been proposed and employed to direct and accelerate this process. In this context, this review identified and systematically analyzed the most recent studies published in the literature on the computational prediction of epitopes for the development of therapeutic vaccines, outlining critical steps, along with the associated program’s strengths and limitations. A scoping review was conducted following the PRISMA extension (PRISMA-ScR). Searches were performed in databases (Scopus, PubMed, Web of Science, Science Direct) using the keywords: neoepitope, epitope, vaccine, prediction, algorithm, cancer, and tumor. Forty-nine articles published from 2012 to 2024 were synthesized and analyzed. Most of the identified studies focus on the prediction of epitopes with an affinity for MHC I molecules in solid tumors, such as lung carcinoma. Predicting epitopes with class II MHC affinity has been relatively underexplored. Besides neoepitope prediction from high-throughput sequencing data, additional steps were identified, such as the prioritization of neoepitopes and validation. Mutect2 is the most used tool for variant calling, while NetMHCpan is favored for neoepitope prediction. Artificial/convolutional neural networks are the preferred methods for neoepitope prediction. For prioritizing immunogenic epitopes, the random forest algorithm is the most used for classification. The performance values related to the computational models for the prediction and prioritization of neoepitopes are high; however, a large part of the studies still use microbiome databases for training. The in vitro/in vivo validations of the predicted neoepitopes were verified in 55% of the analyzed studies. 
Clinical trials that led to successful tumor remission were identified, highlighting that this immunotherapeutic approach can benefit these patients. Integrating high-throughput sequencing, sophisticated bioinformatics tools, and rigorous validation methods through in vitro/in vivo assays as well as clinical trials, the tumor neoepitope-based vaccine approach holds promise for developing personalized therapeutic vaccines that target specific tumors.


Introduction
Personalized cancer therapy has been recommended primarily when the patient's treatment prognosis is unfavorable, whether due to the type of tumor or the disease stage. Personalized therapy consists of identifying patient-specific tumor antigens and developing a treatment that stimulates a specific immune response against the tumor. This response can be achieved through the development of cancer vaccines, usually applied in combination with other therapies, such as chemotherapy [1].
Vaccines 2024, 12, 836
The formulation of a cancer vaccine comprises a series of approaches that aim to generate or amplify an antitumor immune response. To achieve this objective, the identification of neoepitopes plays a critical role and can be accomplished through prediction algorithms [2].
Neoepitopes arise from non-synonymous mutations in the tumor genome and result in small, mutated peptides that are presented on major histocompatibility complex (MHC) molecules exclusively on the surface of tumor cells [3].
The performance of these predictive algorithms can be enhanced when integrated with other computational methods addressing complementary, yet critical, biological aspects associated with the identification of immunogenic neoepitopes. Among them are the mutation burden, gene expression, peptide transport and cleavage, homolog search, B and T cell activation, and antigen-presenting cell (APC) prediction algorithms [4].
Among the prominent approaches for analyzing and integrating these data, immunoinformatics stands out as a sub-area of bioinformatics that includes the study and development of algorithms and programs for predicting potential T and B lymphocyte epitopes.
Reverse vaccinology, which is closely related and synergistic to the development of new vaccines, is an in silico method that uses genomic data to identify potential epitopes through computational algorithms.
This innovative approach enables researchers to analyze the entire genome of a pathogen or cancer cells to pinpoint antigenic peptides that could provoke an immune response. By systematically identifying these epitopes, reverse vaccinology facilitates the design of new vaccines that are more precise and effective. This method has revolutionized vaccine development by significantly accelerating the identification process and enhancing the potential for developing targeted immunotherapies [5].
It is important to highlight that epitopes are allele-specific, and the identification of different human leukocyte antigen (HLA) alleles represents a critical step in the process [6]. HLA molecules (MHC in humans) are encoded by highly polymorphic genes located on human chromosome 6 and are surface proteins involved in the induction and regulation of the immune response, serving as biochemical markers for each individual. While class I HLA molecules have three loci (HLA-A, HLA-B, and HLA-C) that have affinities with 8-12 amino acid epitopes, class II HLA molecules also have three loci (HLA-DR, HLA-DP, and HLA-DQ) that have affinities with 15-24 amino acid epitopes [7,8].
Particularly concerning cancer, much progress has been made in understanding Mendelian disorders, identifying prognostic biomarkers, determining optimal therapeutic drug combinations, and establishing correlations between identified variants and clinical phenotypes [9,10] through the integration of data from different initiatives such as "The Cancer Genome Atlas Program" [11] and UK10K [12], among others.
From an experimental point of view, these neoepitopes are generally administered in association with antigen-presenting cells (APCs) or other immune modulators. This involves immunization with shared tumor antigens expressed by many different patients' tumors and the identification of proteins that were overexpressed in tumor cells but minimally expressed in normal tissues [13]. Traditionally, empirical approaches have been employed; however, the utilization of computational methodologies has the potential to decrease the time required for vaccine development and minimize animal experimentation [14].
Considering the perspective of using NGS-derived data in cancer vaccine development and personalized medicine, the molecular characterization of cancer (whole genome, exome, and transcriptome sequencing, computational approaches determining mutational profile and gene expression) has revealed a complexity and tumor diversity that impact patient prognosis and recommended treatment [15]. It is important to highlight that different specific treatments vary according to disease stage, cancer type, and the mutational burden of the tumor tissue. As each tumor possesses a unique mutation profile [16], the latter has been more recently explored and the identification of tumor-specific mutations can lead to the identification of vaccine targets in personalized therapies [5].
Lastly, one additional challenging aspect that must be considered for the efficient development of a tumor neoepitope-based vaccine is the quality of the training data used by the algorithms. Neoepitope predictors mainly use machine learning methods (e.g., artificial neural network, convolutional neural network, random forest), which still exhibit a high false positive rate (estimated around 50% to 95%) [3].
A contributing factor to this performance variation is the utilization of pathogen training data by most algorithms, which does not accurately reflect the human cancer data necessary for effective vaccine prospecting [17]. Taking into account the aforementioned scenario and aiming to enhance the effective and rational application of computational methods for vaccine prospecting, a scoping review was conducted to offer a comprehensive analysis of scientific evidence, address emerging topics, and pinpoint knowledge gaps in this field.
In this scoping review, our objective is to identify the prevalent computational methods utilized by the scientific community for predicting neoepitopes in human high-throughput sequencing data. Additionally, we aim to explore how these methods incorporate the biological intricacies inherent in the immune response against tumor antigens and the development of therapeutic vaccines. To achieve this, we followed the guiding questions outlined as follows: What are the critical steps in identifying immunogenic neoepitopes? What are the main types of tumors studied? What algorithms and programs have been employed for predicting tumor neoepitopes? What dataset is used as a training model? How are the predictions validated? What are the strengths and limitations of the employed approaches? Can this prediction accurately guide the identification of potential targets for the development of therapeutic vaccines?

Methods
To achieve the proposed objective and address the previously described questions, a scoping review was conducted following the PRISMA extension guidelines for scoping reviews (PRISMA-ScR) [18].
We conducted a search and included studies that aimed to identify neoepitopes in human tumor sample sequencing data using computational approaches. Specifically, we sought articles that identified neoepitopes in human next-generation sequencing (NGS) data through predictive algorithms, as well as whether there was any prioritization step for immunogenic neoepitopes for vaccine composition. Additionally, we looked for articles that presented the development of new programs or algorithms that could be utilized for the identification of neoepitopes in human tumor sample sequencing data.
Articles that analyzed cancer data linked to viral infection were excluded since they involve epitopes of viral origin, distinct from neoepitopes originating from somatic mutations in human wild-type sequences. Additionally, articles were excluded if they did not utilize next-generation sequencing (NGS) data or focused solely on previously studied genes. Moreover, articles lacking detailed descriptions of the in silico analysis stage or employing methods irrelevant to this review were excluded. Studies involving murine models and those falling outside the scope of interest were also excluded.
The retrieved articles were categorized into two synthesis matrices: one focusing on studies conducted with existing algorithms and programs, and the other concentrating on studies that explore the development of novel algorithms and programs.
The articles were retrieved from four databases: Scopus, PubMed, Web of Science, and Science Direct. These databases were selected for their comprehensive coverage of the literature in medical, biological, and computational sciences. The search across all databases was conducted on 24 January 2024. For each database, articles published in English between January 2012 and January 2024 were included. The following filters were used:
To determine which studies should be included in the review, Boolean operators were utilized for the search, employing the keywords "neoepitope", "epitope", "vaccine", "prediction", "algorithm", "cancer", and "tumor": (neoepitope OR epitope) AND vaccine AND (prediction OR algorithm) AND (cancer OR tumor). Only complete, original, and open access publications were considered for inclusion. Any duplicate publications were promptly excluded. During this process, two reviewers independently screened each retrieved record. Any disagreements regarding study selection and data extraction were resolved through discussion and review of the search parameters. The selection of studies and determination of variables to extract were also conducted and discussed by these two independent reviewers. Data collection from the articles and the screening process were managed using the software Zotero (version 6, for Windows).
The screening process followed a two-phase protocol. In the first phase, the titles and abstracts of the articles were read to select those that would be subjected to a full reading.
In the subsequent phase, a comprehensive examination of the introduction, methods, results, and conclusion sections was conducted. Publications that did not present a computational methodology for neoepitope identification in next-generation sequencing (NGS) data or solely focused on a specific gene were excluded.
Finally, from the selected studies, a synthesis matrix was compiled, covering qualitative aspects such as study design (data type employed, type of tumor), key program and algorithm applications (variant calling and neoepitope prediction), evaluated immune attributes, validations, benefits, and drawbacks.
To determine the data to collect, we focused on aspects directly related to the process of identifying immunogenic neoepitopes and the success rate of the programs used. During the review process, two main types of studies were identified. One type presents the identification of neoepitopes using existing programs and algorithms from the literature, while the other type involves the development of new tools for neoepitope identification, both utilizing next-generation sequencing (NGS) data.
In studies that utilized programs already existing in the literature, the following data were collected:
1. Type of tumor
3. Alleles of MHC-I and II
6. Statistics (the score used to classify epitopes)
7. Evaluated immunological characteristics (biological and biochemical aspects affecting the immune response)
8. Whether a structural model was evaluated
9. Whether in vitro and in vivo validations were performed
10. Positive/negative aspects
In studies where new programs or algorithms were designed, the following data were collected:
1. Whether mutational data were evaluated
2. Length of MHC-I and II epitopes
3. Steps of prioritization of neoepitopes
6. Performance
Here, we differentiate between algorithms and programs as follows: An algorithm refers to the mathematical and statistical method used. A program is the code written to allow the algorithm's function. Software is an architecture built to utilize one or more programs and manipulate data. All data and methods used in each study were synthesized into a matrix for visualization and analysis. This matrix allowed for a comprehensive overview of the key variables and approaches employed across the selected studies, facilitating comparison and interpretation of the findings.

Results and Discussion
The number of articles retrieved from all databases totaled 1013, comprising 16 from Science Direct, 494 from PubMed, 257 from Web of Science, and 246 from Scopus. After eliminating duplicates across databases, 637 articles remained. An evaluation of the titles and abstracts of these articles was conducted, resulting in the exclusion of 487 articles that did not meet the established criteria. Consequently, 150 articles were selected for full-text reading, including an examination of the introduction, methods, results, discussion, and conclusion sections. Among these, 101 articles were excluded for not presenting a study model based on human tumor NGS data. All articles removed during the screening process are listed in Table S1. Ultimately, 49 articles were included in this review. Figure 1 illustrates the PRISMA flow diagram, summarizing the article selection and exclusion process. From the remaining 49 articles, a synthesis matrix was formulated and divided into two segments corresponding to the applications discussed in the studies.
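The selection counts reported above can be checked with a few lines of arithmetic. The following Python sketch uses only the numbers stated in this section; the variable names are illustrative, and the number of duplicates removed is derived from the reported totals rather than stated in the text.

```python
# Records retrieved per database, as reported in the text.
retrieved = {
    "Science Direct": 16,
    "PubMed": 494,
    "Web of Science": 257,
    "Scopus": 246,
}

total_retrieved = sum(retrieved.values())

# Records remaining after duplicates were eliminated (reported).
after_deduplication = 637
# Derived value: the text does not state this number directly.
duplicates_removed = total_retrieved - after_deduplication

# Title and abstract screening.
excluded_on_title_abstract = 487
full_text_assessed = after_deduplication - excluded_on_title_abstract

# Full-text screening.
excluded_on_full_text = 101
included = full_text_assessed - excluded_on_full_text

print(total_retrieved, duplicates_removed, full_text_assessed, included)
```

Each intermediate value matches the corresponding figure given in the text (1013 retrieved, 150 read in full, 49 included).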
The initial section encompasses articles that utilize tumor next-generation sequencing (NGS) data and implement computational methodologies.This includes the utilization of neoepitope prediction algorithms in testing these sequences to generate an immune response and/or in designing a therapeutic vaccine.
The subsequent section compiles articles that focus on pioneering a new computational methodology for neoepitope identification.
Table 1 provides an overview of the first section of the matrix, while Table 2 provides an overview of the second section of the matrix.

Critical Steps in Identifying Immunogenic Neoepitopes
To address the guiding question "What are the critical steps in identifying immunogenic neoepitopes?", we identify these critical steps and summarize the main aspects for each of them. According to the results of the first section of the synthesis matrix, the critical steps are (a) variant calling; (b) the prediction of neoepitopes; (c) the prioritization of neoepitopes; and (d) in vitro and in vivo validation. In contrast, the second section of the matrix, which focuses on articles addressing the development of new algorithms and programs, identified the main steps as (a) the prediction of neoepitopes; and (b) the prioritization of neoepitopes. Figure 2 presents a flowchart with the critical steps identified, challenges, and progress.
In variant calling, mutations with the potential to alter the gene product were identified.This step is crucial when working with sequencing data, such as whole-genome sequencing (WGS), whole-exome sequencing (WES), or RNA-seq, as it directly impacts epitope prediction.
The prediction of neoepitopes involves using algorithms to analyze genes with inserted mutations. These algorithms predict the binding affinity of the peptide to the MHC molecule and assess other variables that impact its immunogenicity.
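Before binding affinity can be predicted, candidate peptides spanning the mutated residue have to be enumerated. The following Python sketch shows one simple way to do this for a single amino acid substitution, using the 8-12 residue range mentioned earlier for class I epitopes. The function name, its parameters, and the toy sequence are hypothetical and are not taken from any of the reviewed programs.

```python
def mutant_peptides(protein, position, mutant_residue, lengths=range(8, 13)):
    """Return every peptide of the given lengths that spans the mutated residue.

    protein: wild-type amino acid sequence (one-letter codes)
    position: 0-based index of the substituted residue
    mutant_residue: amino acid introduced by the mutation
    lengths: peptide lengths to enumerate (8-12 by default, an assumption
             based on the class I range given in the text)
    """
    if not 0 <= position < len(protein):
        raise ValueError("position outside the protein sequence")
    mutated = protein[:position] + mutant_residue + protein[position + 1:]
    peptides = []
    for k in lengths:
        # Start indices of windows of length k that contain `position`.
        first = max(0, position - k + 1)
        last = min(position, len(mutated) - k)
        for start in range(first, last + 1):
            peptides.append(mutated[start:start + k])
    return peptides


# Toy example: substitute the residue at index 10 with W, 8-mers only.
toy_protein = "ACDEFGHIKLMNPQRSTVWY"
eight_mers = mutant_peptides(toy_protein, 10, "W", lengths=[8])
print(eight_mers)
```

Each resulting peptide would then be passed, together with the patient's HLA alleles, to a binding predictor; that step is outside this sketch.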
The prioritization of neoepitopes is the subsequent step, where the most promising neoepitopes are selected based on their predicted immunogenicity and other relevant factors.
Finally, the validation of the immunogenicity of neoepitopes is performed to test the validity of the in silico steps performed previously. This validation step is crucial for confirming the efficacy of the predicted neoepitopes in eliciting an immune response.
These critical stages in the neoepitope prediction process will be discussed in more detail below, connecting them with the review questions and objectives.

Main Types of Tumors Studied and Variant Calling
In the context of the guiding question "What are the main types of tumors studied?", we explore the most studied types of tumors and their mutation frequency and investigate the challenges posed by tumor heterogeneity in the critical step of variant calling.

Mutation Frequency
The mutation frequency of tumor cells varies depending on the tissue of origin, with some types of tumors exhibiting higher mutation frequencies than others. One of the factors that result in this increase in the mutation rate of a cell type is the degree of exposure of the tissue to mutagenic agents. For example, cells in the lung and skin, which are organs with a surface facing the external environment and are exposed to mutagenic agents like tobacco and UV radiation, respectively, typically show the highest mutation rates. Other factors such as toxins and chronic inflammation can also increase cellular stress, leading to increased mutation rates [68].
Consequently, tumors with a high mutation rate can generate a large number of potential neoantigen candidates for vaccine development [30]. However, even in tumors with a low mutation rate, neoepitopes still have the potential to induce specific immune responses through immunotherapy or therapeutic vaccines [36,69]. Figure 3 illustrates how the types of tumors were grouped (A, B, C, D, E, F) according to the number of articles where they were studied in the review. Each of the groups aggregated types of tumors with the same frequency of use among the articles. Supplementary Table S2 shows the articles where each type of tumor is found.

Figure 3. Types of tumors grouped according to the number of articles in which they were studied. (E) B-cell lymphocytic leukemia, glioma, neuroblastoma, prostate carcinoma; (F) Burkitt's lymphoma, cervical adenocarcinoma, cholangiocarcinoma, chronic lymphocytic leukemia, chronic myeloid leukemia, colorectal adenocarcinoma, embryonal, endometrial, mesothelioma, myeloblastoma, osteosarcoma, renal carcinoma, squamous cell carcinoma, teratoid/rhabdoid tumor.
Among the analyzed articles, lung carcinoma was the most extensively studied type of tumor, featured in seven articles (group A). Group B had five articles, and the focus was on breast carcinoma, colorectal carcinoma, liver carcinoma, melanoma, and ovarian carcinoma. In the case of group C, covered by four articles, the research centered around gastric adenocarcinoma. Group D, explored in three articles, encompassed glioblastoma and pancreatic carcinoma. Group E, covered by two articles, studied B-cell lymphocytic leukemia, glioma, neuroblastoma, and prostate carcinoma. Group F, with tumors investigated in one article each, encompassed a range of less commonly studied tumors, most of which came from a single article [19] that utilized database samples.
According to the average number of mutations per tumor sample in COSMIC reported by Martínez-Pérez in 2021 [70], many of the most extensively studied tumors also exhibit the highest mutation burdens. In groups A, B, and C, lung carcinoma, gastric adenocarcinoma, ovarian carcinoma, colorectal carcinoma, melanoma, and liver carcinoma exhibit averages of 40, 86, 72, 129, 170, and 76 mutations per sample, respectively. Meanwhile, within group F, the tumors less frequently cited in articles are characterized by the lowest average mutation rates per sample, including renal carcinoma and myeloid leukemia, with 25 and 6 mutations per sample, respectively.
Tumors with a low mutation burden can pose challenges in identifying significant vaccine targets. Conversely, an extremely high mutation load results in a high load of neoepitopes, increasing the likelihood of identifying an immunogenic neoepitope that enhances cytotoxic activity against the tumor. Nevertheless, research indicates that the quest for neoepitopes remains feasible even in low-mutation-burden tumors [71].
The reviewed literature clearly shows an association between mutation burden and certain types of cancers, such as lung, colorectal, and breast cancer. It is important to highlight that, according to the 2023 report from the World Health Organization [72], these cancers are among the most common types worldwide, and their high prevalence rates mean that a larger patient population is available for research, increasing the statistical power of studies and the relevance of the findings. Additionally, lung, breast, and colorectal cancers are leading causes of cancer-related death, so research in these areas has the potential to significantly impact public health by reducing mortality.
Due to their high prevalence and mortality rates, these cancers often receive substantial research funding from government agencies, non-profit organizations, and private sector entities. There is also a well-established body of knowledge and research infrastructure for these cancers, including clinical trials, biobanks, and patient registries, making it easier to conduct and replicate studies. Furthermore, lung, breast, and colorectal cancers exhibit significant heterogeneity and have various subtypes, presenting opportunities to study different aspects of cancer biology, treatment responses, and resistance mechanisms.
In contrast, less well-studied tumors present a unique set of challenges and opportunities for research. These cancers face limited research funding, which can hinder extensive research and development efforts. Smaller patient populations also represent an issue, as fewer patients with these types of tumors make it difficult to conduct large-scale trials and achieve statistically significant results. Additionally, less common tumors can be highly heterogeneous, making it challenging to identify common therapeutic targets and treatment approaches.
However, there are significant opportunities in researching less well-studied cancers. By addressing significant unmet medical needs, these efforts can provide hope and potentially life-saving treatments for patients with limited options. Furthermore, these tumors may offer unique biological insights that can contribute to the broader understanding of cancer mechanisms, leading to new discoveries that may also benefit more common cancers. The challenges mentioned above can drive innovative research approaches, such as the use of novel biomarkers, adaptive clinical trial designs, and advanced bioinformatics [73].
Another factor that directly impacts the effectiveness of computational neoepitope prediction is variant calling, which will be explored in a subsequent topic.

Variant Calling
To predict patient-specific immunogenic neoepitopes, it is essential to identify which peptide sequences have undergone somatic mutations. Variant calling is the step where these mutations are identified prior to the neoepitope prediction.
The main challenge in this step is to distinguish common (germline) variants from somatic tumor variants [74]. Programs such as "Mutect" and "Samtools" are employed to analyze aligned reads and to define critical cutoff points (the quality of the sequencing at the position of a nucleotide and the comparison of germline and tumor sequences), which are essential for determining which mutations are truly somatic. Figure 4 presents the variant calling programs and how frequently they were used among the selected studies.
The methodologies employed in the articles for variant calling can be categorized into two primary forms: (a) variant calling programs that compare sequencing data from normal samples with those from tumor samples [20-24,26,29-40]; and (b) the retrieval of recognized and validated variants from databases [19,25,27,28,41].
According to the best practices for variant calling defined by Koboldt [74], the prediction results of these programs can be influenced by factors such as the proportion of pathologically mutated subclones and sequencing depth. Sequencing depth represents a critical factor since the input data (normal and tumor sequencing data) are used to estimate the probability that the predicted variations are tumor somatic mutations. Subsequently, candidates for somatic variants undergo filtering and review to eliminate artifacts and germline variants.
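The role of sequencing depth and of the tumor/normal comparison can be illustrated with a deliberately simplified filter. The Python sketch below is not how Mutect2, Strelka, or any other reviewed program works internally (those rely on statistical models); it only shows the idea of comparing variant allele fractions between paired samples. The `Site` fields and all cutoff values are illustrative assumptions, not values taken from the reviewed studies.

```python
from dataclasses import dataclass


@dataclass
class Site:
    chrom: str
    pos: int
    tumor_alt: int     # reads supporting the variant in the tumor sample
    tumor_depth: int   # total reads covering the site in the tumor sample
    normal_alt: int    # reads supporting the variant in the normal sample
    normal_depth: int  # total reads covering the site in the normal sample


def vaf(alt, depth):
    """Variant allele fraction; 0.0 when the site has no coverage."""
    return alt / depth if depth else 0.0


def is_candidate_somatic(site, min_depth=20, min_tumor_vaf=0.05,
                         max_normal_vaf=0.02):
    """Flag a site as a candidate somatic variant.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    if site.tumor_depth < min_depth or site.normal_depth < min_depth:
        return False  # too little coverage to judge either sample
    return (vaf(site.tumor_alt, site.tumor_depth) >= min_tumor_vaf
            and vaf(site.normal_alt, site.normal_depth) <= max_normal_vaf)


sites = [
    Site("chr1", 100, 12, 80, 0, 60),   # variant seen only in the tumor
    Site("chr1", 200, 40, 80, 28, 60),  # variant also in normal: germline-like
    Site("chr2", 300, 3, 10, 0, 8),     # coverage too low in both samples
]
flags = [is_candidate_somatic(s) for s in sites]
print(flags)
```

In this toy input only the first site passes: the second is present at a high fraction in the normal sample, and the third lacks the depth needed to estimate either fraction reliably.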
Mutect2 [75] is among the most extensively utilized mutation-calling programs, providing enhanced flexibility in accommodating data from various sequencing platforms.Additionally, it has the capability to predict a repertoire of somatic mutations even when only tumor sequencing data are available, while also excluding germline mutations identified in databases.However, this approach can result in a significant number of false positive predictions of mutation sites [74].In a comparative study, Mutect2 demonstrated superior performance, especially when the mutation frequency was below 10% [76].In contrast, Strelka showed better performance when the mutation frequency reached or exceeded 20%.Additionally, Strelka boasted a processing speed 34-39 times faster than Mutect2.
Several studies have utilized an approach to identify somatic mutations by querying databases that provide verified variants categorized as originating from either normal or tumor tissues.Platforms such as CCLE [77], COSMIC [78], dbSNP [79], and TCGA [11] offer a diverse range of tools and store data on known human variants.
TCGA and COSMIC store human variants, providing related data such as genes, clinical cases, and types of tumors.CCLE primarily stores data from tumor cell line models for use in other studies.dbSNP houses single-nucleotide variations, microsatellites, insertions, and deletions referenced in the literature.Some studies utilize this data to construct specific databases [19,25].TCGA and dbSNP were utilized in studies related to personalized therapy [27,28,41].
When searching for variants in databases, numerous mutations that generate potential neoepitopes might remain undetected. This is particularly relevant when studying tumors with limited data and/or less frequent HLA subtypes [80].
Additionally, these databases can only confirm established mutations. Therefore, variant calling programs play a vital role in pinpointing novel and sample-specific mutations. They conduct statistical analyses that identify sites likely to harbor somatic mutations with substantial confidence, disregarding reads with numerous mismatches or very low quality scores, which tend to introduce more noise than signal [74].
Furthermore, the studies included in this review mainly investigated single-nucleotide polymorphisms (SNPs). These mutations, involving a single nucleotide base variation, are the most prevalent mutations that can arise in tumor cells [81], which accounts for this outcome.
However, some studies confirmed that insertion or deletion variations (INDELs), in which one or more bases are inserted or removed, are the second most abundant form of genetic variation in humans after SNPs (79% of the variants detected were SNPs and 21% were INDELs) [81,82]. INDELs also have the potential to generate neoepitopes [83]: depending on the alteration to the genomic reading frame, insertions or deletions can lead to neoepitopes with multiple modified amino acids, entirely distinct from the wild-type form.
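The effect of a frameshift on the encoded peptide can be illustrated with a minimal sketch. It is not taken from any reviewed program: the coding sequence is a made-up toy example, the function name `translate` is hypothetical, and the standard genetic code is assumed.

```python
# Standard genetic code, with codons ordered by the bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(cds: str) -> str:
    """Translate a coding sequence until a stop codon or an incomplete codon."""
    protein = []
    for start in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[start:start + 3]]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)

# Toy wild-type coding sequence (hypothetical, for illustration only).
wild_type_cds = "ATGGCTGAAAAGCTTGGATAA"
# Deleting the single base at index 4 shifts the reading frame.
frameshift_cds = wild_type_cds[:4] + wild_type_cds[5:]

wild_type_protein = translate(wild_type_cds)
frameshift_protein = translate(frameshift_cds)
```

In this toy case, every residue downstream of the deletion differs from the wild-type sequence, which is why a single frameshift can yield several candidate neoepitopes.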
Several studies used a combination of multiple programs rather than relying on a single one [22–24,31,34–36]. Combining programs aims to reduce the occurrence of false negative variants [84]. Limitations associated with variant calling are discussed in the next subtopic.
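As a minimal sketch of how call sets from several programs might be combined (not the pipeline of any reviewed study), variants can be keyed by position and allele and then merged. The caller names and records below are hypothetical. Taking the union favors sensitivity (fewer false negatives), whereas requiring support from at least two callers favors precision.

```python
from collections import Counter

# A variant is keyed by (chromosome, position, reference allele, alternate allele).
Variant = tuple[str, int, str, str]

def combine_calls(call_sets: dict[str, set[Variant]], min_callers: int = 1) -> set[Variant]:
    """Return the variants reported by at least `min_callers` programs."""
    support = Counter(v for calls in call_sets.values() for v in calls)
    return {v for v, n in support.items() if n >= min_callers}

# Made-up call sets for illustration only.
calls = {
    "caller_a": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
    "caller_b": {("chr1", 100, "A", "T"), ("chr3", 300, "C", "CT")},
}

union = combine_calls(calls, min_callers=1)      # any caller: maximizes sensitivity
consensus = combine_calls(calls, min_callers=2)  # both callers: maximizes precision
```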

Limitations of Variant Calling
This section specifically addresses the guiding question "What are the strengths and limitations of the employed approaches?" by analyzing the limitations of variant calling. Most of these limitations are associated with the samples themselves, including samples with low sequencing quality [33]. Factors such as contamination, low purity, and tumor heterogeneity can compromise the quality of the generated reads and make it harder to reach the necessary minimum coverage [83].
Tumor heterogeneity specifically involves various subclones within a single tumor, each with unique spatial and mutational characteristics. This diversity influences the expression and immunogenicity of neoantigens. Different types and stages of tumors can exhibit high heterogeneity and diverse mutation profiles, impacting the effectiveness of variant calling [85].
Furthermore, the length of sequencing reads can introduce challenges and biases in genome assembly, particularly in repetitive regions, thereby impeding the identification of specific mutation types such as gene fusions, rearrangements, and inversions. While SNPs and INDELs are readily detectable in shorter reads, larger mutations may go unnoticed [86].
Moreover, the variant calling process itself can influence the selection of ineffective neoepitopes, depending on its execution [87]. Although INDEL-type mutations are less extensively studied, they also have the potential to yield immunogenic neoepitopes and have begun to be included in recent investigations. However, the identification of gene fusions remains challenging and necessitates specific tools like FACTERA [34]. Lastly, certain studies validate mutations using databases instead of sequence alignment, potentially overlooking significant tumor-specific and patient-specific mutations [19,25,27,28,41].
Advanced sequencing techniques like single-cell RNA sequencing (scRNA-seq) provide detailed genomic data that reveal the clonal structure and mutation history of tumor subclones, helping in the identification of optimal neoantigens for targeted therapy. However, challenges such as cell isolation and viability, cost, and data integration issues remain hurdles to its widespread application [88].
In summary, the accuracy of variant calling hinges on factors such as sample quality, purity, and validation parameters, which persist as limiting factors in numerous studies.

Algorithms Employed for Prediction and Prioritization of Neoepitopes
In response to the guiding question "What algorithms and programs have been employed for predicting tumor neoepitopes?", this section summarizes the algorithms and programs employed for predicting neoepitopes, examines the inherent limitations of current in silico methods, and suggests potential strategies to overcome these challenges. First, we present the programs for predicting epitopes with a binding affinity to MHC-I and MHC-II. The importance of allele specificity is also discussed. Next, the computational algorithms used by prediction programs are summarized.

Neoepitope Prediction Programs
After genomic mutations within tumor cells are identified, the functional consequences of these variants are assessed. This process yields peptide sequences containing the mutations, which are then used to predict neoepitopes. The second crucial step involves selecting or prioritizing those with greater immunogenic potential. Figure 5 shows the neoepitope prediction programs identified in the reviewed studies. In response to the guiding question "What are the strengths and limitations of the employed approaches?", Table 3 lists all neoepitope predictors identified, with their advantages and disadvantages. This review encompassed 49 studies, categorized into two segments (Tables 1 and 2). The first segment (23 articles) included studies employing pre-existing neoepitope prediction programs already described in the literature. The second segment (26 articles) included studies that developed novel neoepitope prediction programs.
The second segment of the review, with 26 studies focusing on the development of new programs/algorithms, is outlined in Table 2. Among the programs identified in Table 2,

HLA-I Restriction
In the context of personalized cancer therapy, neoepitope prediction primarily revolves around the use and advancement of algorithms that forecast neoepitopes binding to HLA-I. This emphasis arises from the intracellular origin of tumor antigens. Such predictions are allele-specific, targeting a particular HLA allele. With over 25,000 distinct alleles of class I HLA genes (HLA-A, B, and C) cataloged [89], determining the specific HLA allele is important for predicting potentially immunogenic epitopes.
The majority of epitope prediction programs use a 4-digit resolution of alleles, which denotes the level of detail in identifying HLA alleles (e.g., A*02:01, where "02" is the type and "01" is the subtype). In Figure 6, HLA-I alleles were grouped (Groups A, B, C, D, E, F, G, H, and I) according to the number of articles in which they were studied (in light gray) and the number of alleles in each group (in black). Each group aggregates HLA-I alleles with the same frequency of use among the articles. The comprehensive list of alleles is provided in Supplementary Table S3.
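The 4-digit notation can be handled programmatically. The sketch below is a hypothetical helper, not part of any reviewed program: it parses an allele name into gene, type, and subtype and truncates higher-resolution names to the 4-digit form. The regular expression is an assumption that covers common names such as "A*02:01" or "HLA-DRB1*15:01".

```python
import re

# Optional "HLA-" prefix, gene name, "*", then the first two colon-separated fields.
HLA_PATTERN = re.compile(r"^(?:HLA-)?([A-Z]+\d*)\*(\d{2,3}):(\d{2,3})")

def parse_hla(name: str) -> tuple[str, str, str]:
    """Split an allele name into (gene, type, subtype)."""
    match = HLA_PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"Unrecognized HLA allele name: {name!r}")
    return match.group(1), match.group(2), match.group(3)

def to_four_digit(name: str) -> str:
    """Reduce an allele name to 4-digit resolution, e.g. 'A*02:01'."""
    gene, allele_type, subtype = parse_hla(name)
    return f"{gene}*{allele_type}:{subtype}"
```

For example, `to_four_digit("HLA-A*02:01:01:01")` drops the additional fields and keeps only the type and subtype.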
For predicting neoepitopes associated with HLA-I alleles, Group A emerged as the most prevalent, used in 14 studies. This group includes the allele A*02:01, which is among the most widespread and extensively studied alleles worldwide [90]. Similarly, Groups B and C encompass two and three alleles, respectively (B: A*03:01 and A*01:01; C: A*11:01, B*07:02, C*07:02), all highly prevalent with significant frequencies across continents. The alleles in Groups D, E, F, G, and H (featured in six, five, four, three, and two articles, respectively) also represent some of the most common alleles globally. In Group I, the alleles were less frequently used in studies (one article). Notably, this group also includes rarer alleles with heightened frequencies within the population [91].
The precise identification of the HLA allele is of significant importance in neoepitope prediction. However, the distribution of these alleles is not uniform within the population, which affects algorithm training. Rarer alleles have limited available data in the existing literature, which also poses challenges in algorithm training. Consequently, the prediction accuracy for these rarer alleles tends to be lower, resulting in an elevated false positive rate [92].
In summary, since the selection of major histocompatibility complex (MHC) alleles is crucial for neoepitope prediction due to their essential role in presenting peptide fragments to T cells, the rationales cited in the reviewed articles for selecting specific MHC alleles focus on several factors. These include allele frequency and ethnicity-specific alleles, with studies commonly referencing the Allele Frequency Net Database [91] to choose alleles prevalent in specific populations. Binding affinity is also important, as alleles that bind a wide range of peptides with high affinity are preferred for presenting a diverse set of neoepitopes. Clinical relevance plays a role as well; certain MHC alleles are associated with specific diseases or conditions, and selecting these alleles can aid in understanding and targeting disease-specific neoepitopes. The availability of comprehensive sequence data for certain alleles can also facilitate more accurate neoepitope prediction. Finally, prior research influences allele selection, and well-studied alleles documented in the literature are often chosen due to the existing body of knowledge and available comparative data.
The articles discussing neoepitope prediction with HLA-I affinity primarily focused on processing peptides with a length of 8 to 11 amino acids. Specific studies adhered to a 9-amino-acid size [26–29,33], while others slightly expanded the prediction scope to encompass 7-11 amino acids [35] or 8-15 amino acids [39]. In these articles, emphasis was placed on the analysis of 9-amino-acid peptides due to their optimal binding stability with the MHC-I molecule [93].
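A minimal sketch of how candidate peptides of these lengths might be enumerated around a mutated residue is given below. It is not drawn from any reviewed program: the function name, the toy protein, and the default length range of 8 to 11 residues are assumptions for illustration. Every window that overlaps the mutated position is returned.

```python
def mutant_peptides(protein: str, mut_pos: int, lengths=range(8, 12)) -> list[str]:
    """List all peptides of the given lengths that contain position `mut_pos` (0-based)."""
    peptides = []
    for k in lengths:
        first = max(0, mut_pos - k + 1)          # leftmost window still covering mut_pos
        last = min(mut_pos, len(protein) - k)    # rightmost window that fits in the protein
        for start in range(first, last + 1):
            peptides.append(protein[start:start + k])
    return peptides

# Toy mutant protein (hypothetical); the mutated residue "M" sits at index 10.
toy_protein = "ACDEFGHIKLMNPQRSTVWYA"
all_candidates = mutant_peptides(toy_protein, 10)
nine_mers = mutant_peptides(toy_protein, 10, lengths=[9])
```

Each candidate would then be scored against the patient's HLA alleles by a binding predictor; that step is outside the scope of this sketch.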

HLA-II Restriction
In Figure 7, HLA-II alleles were grouped (Groups A and B) according to the number of articles in which they were studied (in light gray) and the number of alleles in each group (in black). Each group aggregates HLA-II alleles with the same frequency of use among the articles. The comprehensive list of alleles is provided in Supplementary Table S4.
The prediction of neoepitopes restricted by HLA-II is less extensively studied compared to HLA-I. In this context, the A allele group was employed in two articles, while the B allele group was used in one article. The majority of alleles from the B group were featured in a single article [34], which encompassed various distinct samples.
HLA class II molecules encompass three loci (HLA-DR, HLA-DP, and HLA-DQ) and exhibit affinities with epitopes spanning 15 to 24 amino acids, as highlighted in the works of GOSH (2019) [15] and WANG (2010) [16]. These characteristics were investigated in three recent articles [34,35,41] that underline the significance of identifying neoepitopes with an affinity for MHC class II, yet this aspect remains relatively underexplored in the context of therapeutic vaccine development. Predicting MHC-II-restricted neoepitopes poses challenges due to the limited understanding of the class II antigen processing and presentation pathway. Additionally, class II epitopes tend to be larger and more versatile [94].
Limitations associated with HLA specificity will be discussed in a subsequent section.

Prediction Algorithms for Neoepitopes
The discovery of peptide sequences bound to HLA molecules in the early 1990s sparked investigations into the sequence patterns responsible for allele-specific peptide-HLA binding. Given the complexities of the binding mechanism, various algorithms and software rooted in pattern recognition and machine learning principles were subsequently developed [95–97].
The most recent and important algorithms for neoepitope prediction highlighted in this review predominantly employ neural networks. Neural networks are a sophisticated learning approach capable of capturing the nonlinearity inherent in the binding process, along with the interrelation among distinct amino acid binding positions, ultimately resulting in enhanced performance [96].
Supplementary Table S5 presents the algorithms and matrices identified within the set of neoepitope predictors. Here, in response to the guiding question "How are the predictions validated?", Table 4 presents the strengths and limitations of the neoepitope prediction algorithms. Among the 23 studies covered in the initial section of this review, the SYFPEITHI predictor stands out as the earliest one, with its algorithm coded in Object Pascal, an object-oriented programming language [96]. It remains the only program identified that uses this language. In contrast, the other programs identified (NetMHC, NetMHCpan, NetMHCII, NetMHCIIpan, pVACtools, and MHCflurry) were coded in the Python programming language and employ neural networks as their predictive methodology. Lastly, MixMHCpred and MixMHC2pred were coded in the C++ programming language.
The programs NetMHC, NetMHCpan, NetMHCII, and NetMHCIIpan are trained using data from in vitro peptide binding affinity tests. These programs are designed to be allele-specific, targeting the prevalent HLA alleles in the human population that have shown a connection with immunogenicity. The "pan" versions of these tools are particularly suited for analyzing cancer-related data. These versions incorporate mass spectrometry data, leading to a reduction in false positive results, and they encompass training data derived from tumor origins. Furthermore, the "pan" versions enable predictions for HLA alleles that have been less extensively studied [57].
The pVACtools program [22,29,35,64] offers a versatile toolkit that integrates with other prominent tools for detecting neoepitopes. It not only works with programs like NetMHCpan and MHCflurry but also directly incorporates variant calling information (including SNPs, INDELs, and gene fusions). This capability enables pVACtools to perform neoepitope prediction and prioritization while considering factors such as gene expression and homology analysis.
The MHCflurry program [21,52] demonstrates significant improvements in both processing speed and training. It integrates tumor epitopes detected through mass spectrometry, thereby enhancing accuracy and minimizing the risk of missing potential candidates. While the programs MixMHCpred and MixMHC2pred [31] do not predict neoepitopes, they are capable of evaluating various peptide scores and ranking the most probable HLA ligands. Trained on naturally occurring peptides, these programs provide a score rather than a predicted affinity value.
For predicting neoepitopes that account for both HLA-I and HLA-II affinity, the programs Ivax [51] and Ancer [55] employ the EpiMatrix algorithm [98]. The strength of this algorithm lies in its ability to explore HLA families that share common characteristics, thereby reducing redundancy in epitope mapping and addressing HLA polymorphisms. Both programs use six HLA-I alleles and eight HLA-II alleles for neoepitope prediction. Ivax [51] uses overlapping 9-mer frames of amino acids, whereas Ancer [55] employs 9- and 10-mer frames. Additionally, both programs use JanusMatrix to evaluate the homology of peptide sequences.
Moreover, in the context of predicting neoepitopes for both HLA classes, the program MS2Rescore [54] leverages the XGBoost [99] algorithm for its efficiency and rapid processing. MS2Rescore stands out from other neoepitope predictors by integrating newly trained immunopeptidome MS2PIP models, DeepLC, and Percolator into a single software package. This integration significantly enhances neoepitope identification rates and specificity. Additionally, the program ProTECT [62] predicts neoepitopes using the NetMHC programs. It distinguishes itself in neoepitope prioritization by integrating variant calling data and peptide sequence homology with the rankboost technique [100].
Programs that predict neoepitopes with only HLA-I affinity primarily use neural networks. ACME [44] predicts peptide transport and cleavage and evaluates T cell response using the BLOSUM50 matrix [101]. ACME offers significant improvements over other methods, with an increase of approximately 5 percentage points in the Spearman correlation coefficient (SRCC). Furthermore, ACME outperforms the NetMHCpan 3.0 model on several metrics such as PCC and AUROC, particularly for peptides of 9, 10, and 11 amino acids. ACME can also predict new alleles not available in the training data. INeo-Epp [50] prioritizes neoepitopes by evaluating their physicochemical characteristics using random forest and Boruta algorithms, which help to eliminate false positive neoantigens, significantly reducing the number of candidates requiring experimental verification. MHCSeqNet [53] also employs neural networks for peptide transport and cleavage prediction. It is flexible, allowing predictions for peptides of any length and for any MHC class I allele with a known amino acid sequence. MHCSeqNet is reported to outperform tools such as NetMHCpan and MHCflurry. OnionMHC [58] differentiates itself by incorporating both structure- and sequence-based features to predict the binding affinity of peptides with the HLA-I allele A*02:01. While models such as NetMHCpan 4.0 rely solely on peptide sequences, OnionMHC combines information from both sources, resulting in improved performance. SHERPA [59] evaluates physicochemical characteristics with the BLOSUM62 matrix, increasing the selectivity of immunogenic neoepitopes. PTuneos [63] integrates variant calling data, uses the NetMHCpan program for neoepitope prediction, and employs random forest to assess T cell activation. Seq2Neo [66] integrates variant calling data, epitope prediction with convolutional neural networks, T cell activation, and gene expression.
Other programs employ various algorithms for neoepitope prediction. 3pHLA [42] uses the CART and random forest algorithms [102] for direct neoepitope prediction, integrated with structural analysis. These methods can handle binary and multi-class classification problems and are widely used in various biological applications. PLATO [65] uses Bootstrap [103] and random forest algorithms, integrating neoepitope prediction with peptide transport and cleavage, variant calling integration, gene expression, and a T cell activation score. Neopepsee [56] uses neural networks for neoepitope prediction and prioritizes using six physicochemical characteristics, gene expression, homology, peptide transport, and cleavage, employing naïve Bayes, random forest, and support vector machine algorithms.
For programs predicting neoepitopes with HLA-II affinity [43,45,48,60,61], neural networks are also the predominant method. FIONA [43] uses convolutional neural networks. The program developed by Mettu and collaborators [45] employs NetMHCII and prioritizes neoepitopes with a score that assesses binding stability. ITcell not only predicts neoepitopes using neural networks but also evaluates peptide transport and cleavage and creates a neoepitope model to assess T cell recognition. MARIA [61] uses neural networks and also assesses gene expression.
Optimizing these algorithms involves incorporating additional variables into epitope prediction to account for antigen processing and presentation pathways, thereby increasing the accuracy of predictions. However, studies still exhibit a high rate of false positives in both in vitro and in vivo tests, estimated to be around 50% to 95% [3]. This indicates that not all predicted epitopes presented by MHC-I on the surface of tumor cells and/or APCs will induce effective CD8+ T cell responses, necessitating further methods of neoepitope prioritization.

Limitations in MHC-I Neoepitope Prediction
To address the guiding question "What are the strengths and limitations of the employed approaches?", we present the limitations identified in MHC-I neoepitope prediction in the subsequent discussion. Machine learning is the most commonly used method for predicting neoepitopes and is well established for epitope prediction in sequences derived from microbial samples. However, when addressing tumor-origin samples, a key challenge arises concerning the composition of the database used to train the algorithm [104]. Given that the method relies on sequence similarity and pattern recognition, it is imperative that algorithm training incorporates tumor-origin data. Yet, there remains a scarcity of data for constructing these databases, which diminishes the predictor's performance. In terms of training, another issue is allele-specific prediction. Common and extensively studied HLA alleles, such as HLA-A2, benefit from ample training data. Conversely, for rarer alleles, data availability is limited, significantly reducing reliability in these cases [90].
These limitations are evident in the reviewed programs that rely heavily on microbial data for algorithm training. Many studies confine their analyses to the HLA alleles identified in the NGS data used. This necessitates caution when extrapolating results to other alleles. Programs like OnionMHC [58] validated their results for only one allele (A*02:01) and may perform differently for other alleles. Thus far, most algorithms have been trained with data obtained by testing predicted epitopes. In other words, the database is confined to data acquired from in silico prediction, disregarding other potential immunogenic epitopes. One approach to circumvent this is to incorporate data generated by mass spectrometry, which directly identifies sequences presented on the cell plasma membrane. However, this approach was scarcely explored in the reviewed studies [49,52,57,65,67], mainly due to its inherent difficulties, including sample quality dependency and low yield. Nonetheless, mass spectrometry has been pivotal in understanding physicochemical properties in epitope-MHC binding and holds the potential to identify rare sequences overlooked by standard techniques. The inclusion of these data in algorithm training has enhanced our understanding of antigen processing and presentation [105].
The majority of neoepitope prediction programs solely consider the binding affinity of the peptide to class I MHC molecules as the metric to evaluate immunogenicity [105]. However, it is known that numerous biological factors, many of which are still unknown, can influence or determine the immunogenicity of these sequences. Cleavage, transport, expression level, and recognition by T cells are among the factors that notably influence peptide immunogenicity. Additionally, other factors related to structural analysis are beginning to be explored, such as the physicochemical characteristics of sequences influencing T cell activation and the affinity of epitopes to class II MHC molecules [106].

Limitations in MHC-II Neoepitope Prediction
In line with the previous discussion, the subsequent analysis presents the limitations identified in MHC-II neoepitope prediction to answer the guiding question "What are the strengths and limitations of the employed approaches?". CD4+ T cells play a crucial role in generating specific and long-lasting cellular immunity. Within the tumor microenvironment, some studies [107] have demonstrated that the activation of CD4+ T cells can impact the immune response against tumors. This response can occur even in instances where tumor cells downregulate class I MHC expression. Consequently, precise identification of class II neoepitopes can prove critical for optimizing neoepitope prediction and devising therapeutic cancer vaccines.
Predicting epitopes with class II MHC affinity has been relatively underexplored in the reviewed studies and presents additional challenges. These epitopes tend to be longer and more variable in length than class I MHC peptides, spanning from 15 to 24 amino acids. Moreover, besides exhibiting more promiscuous peptide binding, unstable regions contribute to epitope non-specificity, complicating the identification of immunogenic epitopes, particularly in terms of pattern recognition by the algorithms used. When these molecular factors are combined with the lack of validation for these epitopes as tumor neoepitopes, algorithm training becomes challenging, resulting in diminished performance [107].
Similar to predicting neoepitopes binding to class I MHC, the utilization of mass spectrometry data for class II MHC affinity epitopes has facilitated the development of more reliable binding prediction programs, as evidenced by NetMHCIIpan [108].
Peptide immunogenicity is influenced by various factors, including those related to structural analysis. These factors are beginning to be explored, such as the physicochemical characteristics of sequences influencing T cell activation and the affinity of epitopes to class II MHC molecules [106].
One of the most important prioritization methods is the assessment of gene expression [56,59,61,63,65–67], as a selected neoepitope may be contained within a non-expressed gene. There is no consensus on the optimal gene expression cutoff point. Moreover, even if the gene encoding a neoepitope is overexpressed, the neoepitope may not trigger an effective immune response, as this relies on various other factors.
Another prioritization method explored by some programs [42,50,56,60] is structural analysis, which evaluates intrinsic physicochemical characteristics of the neoepitope, such as polarity, amino acid position, pair interactions, and entropy. This prioritization process seeks to assess which amino acid sequence parameters determine or influence T cell recognition and activation.
Additional prioritization parameters commonly used include peptide transport and cleavage analysis, and homology assessment. Peptide transport and cleavage are important for antigen processing and presentation pathways, but some studies suggest that analyzing them separately has a limited impact on the results [110]. This is because current epitope predictors, such as NetMHCpan and MHCflurry, take into account processed sequences (transported and cleaved). Homology searches are performed between selected neoepitopes and naturally expressed peptide sequences in the genome. Neoepitopes highly similar to natural epitopes increase the risk of cross-reactivity in vaccine formulation.
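A simplified sketch of a homology screen is shown below. It only counts residue mismatches between a candidate and every same-length window of a set of self proteins. This is an assumption made for illustration and is much cruder than dedicated tools such as JanusMatrix. The function names and the toy protein are hypothetical.

```python
from typing import Iterable, Optional

def self_kmers(proteins: Iterable[str], k: int) -> set[str]:
    """Collect every k-residue window found in the given self proteins."""
    return {p[i:i + k] for p in proteins for i in range(len(p) - k + 1)}

def min_mismatches(peptide: str, proteins: Iterable[str]) -> Optional[int]:
    """Smallest number of differing residues between the peptide and any self window.

    Returns None when no self protein is long enough to compare against.
    """
    windows = self_kmers(proteins, len(peptide))
    if not windows:
        return None
    return min(sum(a != b for a, b in zip(peptide, w)) for w in windows)

# Toy self proteome (hypothetical sequence, for illustration only).
self_proteome = ["MKTAYIAKQRQISFVK"]
exact_match = min_mismatches("AYIAKQRQI", self_proteome)  # identical to a self window
one_change = min_mismatches("AYIAKERQI", self_proteome)   # differs at a single residue
```

A candidate with zero or very few mismatches to a self peptide could be deprioritized. The cutoff is a design choice that this sketch does not settle.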
Since the programs identified in this review use machine learning for neoepitope prediction, an important factor to consider is the data used for algorithm training.

Training Datasets Models
To answer the guiding question "What dataset is used as a training model?", we delve into the findings concerning the training of algorithms outlined in the review, aiming to address the datasets utilized in model training.
Incorporating tumor-origin data can aid in reducing the false positive rate. Additionally, employing data obtained from mass spectrometry analysis can further mitigate this rate. Research has demonstrated that including mass spectrometry data in the training set is crucial for developing a high-performance predictor [109]. Without these data, both the size of the training set and allele coverage are significantly diminished. Moreover, achieving a balance in the training set, combining mass spectrometry data with peptide-MHC class I layer depth, is imperative, as mass spectrometry data alone do not accurately represent antigen processing [111].
Despite these advancements, in silico methods have limitations, ranging from biases in sequencing technologies and variant calling to epitope prediction algorithms that fail to consider other factors influencing immunogenicity. Notably, there is a lack of exploration into neoepitopes binding to class II MHC molecules.

Performance and In Silico Validation
The guiding question "How are the predictions validated?" is addressed in the following paragraphs, which cover the in silico validation of predicted neoepitopes and the performance evaluation of the algorithms and programs. In silico analyses, in addition to accelerating the identification of potential immunogenic neoepitopes, are important for identifying key points and genetic mechanisms that trigger an immune response against the tumor. Across the studies reviewed in this manuscript, the methodologies employed to assess performance varied. Typically, the aim is to evaluate the likelihood of a predicted peptide sequence being a true neoepitope with affinity for an HLA allele, as well as to determine whether this neoepitope genuinely elicits an immune response. However, a recurring shortcoming in this domain is the lack of validation of the acquired data [112].
Many studies report predictor performance using the AUC (area under the curve). This metric is commonly used to evaluate classification models, particularly in binary classification scenarios where the goal is to distinguish between two classes, such as "positive" and "negative". The AUC is the area under the receiver operating characteristic (ROC) curve, a standard tool for assessing binary classification models. This curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) at various classification thresholds. The AUC value quantifies the overall ability of the model to differentiate between the two classes. A higher AUC generally signifies a better-performing model, with an AUC of 1 representing a perfect classifier and an AUC of 0.5 representing a random classifier. Neoepitope predictors use this metric to define the performance of the algorithm when subjected to a test group after training.
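To make the definition concrete, the following minimal Python sketch builds the ROC curve by sweeping the classification threshold over every distinct score and then integrates it with the trapezoidal rule. The function names, labels, and scores are illustrative assumptions and are not taken from any of the reviewed predictors.

```python
def roc_curve_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    labels: 1 for a true (e.g., experimentally confirmed) epitope, 0 otherwise.
    scores: predictor output, where a higher score means "more likely positive".
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step classifies more peptides as positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve, integrated with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Illustrative toy data (hypothetical, not from any study in this review).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc_trapezoid(roc_curve_points(labels, scores)))
```

In this toy example, one negative peptide outscores one positive peptide, so the AUC falls between the random (0.5) and perfect (1) cases.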
The predictors identified in this review exhibited AUC values spanning from 0.65 [55] to 0.99 [53]. However, each predictor possesses unique characteristics, such as the algorithm employed, biological traits covered, HLA alleles utilized for training, and the composition of the training data. Consequently, depending on the object of study, the predictor with the highest AUC value may not necessarily be the optimal predictor of immunogenic neoepitopes. This is due to the specific nuances of each predictor, such as the training dataset and the parameters defined in the program.
ACME [44] demonstrated an AUC value of 0.9 for HLA-A and 0.88 for HLA-B, shedding light on performance discrepancies among alleles. However, no performance breakdown is available for HLA-C alleles within the same study. Additionally, the study revealed variations in predictions concerning neoepitopes of different sizes, with 9 amino acids (9AA) exhibiting the most favorable performance.
MHCSeqNet [53] exhibited an impressive AUC performance of 0.99 for alleles prevalent in the population and 0.79 for what are termed "invisible" alleles. This suggests that the predictor possesses robust predictive capabilities for epitopes binding to any patient's HLA allele, a highly desirable trait given that neoepitopes capable of binding to multiple HLA alleles are generally more likely to be immunogenic.
MHCflurry [52] achieved an AUC performance of 0.8 and highlighted its superior performance when dealing with peptide sequences longer than nine amino acids compared to other tools. This distinction may be influenced by the algorithm's training data, which amalgamates microbiome and tumor data alongside mass spectrometry data.
The predictor 3pHLA [42] also demonstrated outstanding performance, attaining a precision rate of 0.99, validated across 28 HLA-I alleles.
A significant leap in predicting neoepitopes with HLA-II affinity comes from the FIONA predictor [43], boasting an AUC value of 0.94. Utilizing convolutional neural networks and integrating tumor data into its training algorithm, FIONA achieves superior performance compared to other predictors like MARIA [61] (0.87) and NetMHCIIpan (0.76).
It is worth noting that the majority of predictors primarily assess the binding of the peptide sequence to the HLA molecule, without considering the subsequent recognition and activation of T cells. This gap is currently being explored by numerous studies, predominantly through the structural analysis and physicochemical characterization of neoepitopes.
OnionMHC [58] achieves an AUC performance of 0.83 and utilizes molecular modeling to prioritize neoepitopes. However, its predictions are limited to a single allele and a specific neoepitope size.
These studies highlight the varied factors that differentiate neoepitope prediction programs and algorithms. The choice of which one to utilize should consider the nature of the data under examination. Depending on variables such as type of tumor, sequencing data quality, patient HLA alleles, and the predictor's approach to these factors, certain programs may prove more suitable than others.
Programs like Ivax [51], Neopepsee [56], ProTECT [62], and pVACtools [64] boast expansive scopes, encompassing numerous critical parameters within the neoepitope prediction process. They directly integrate variant calling data, predict neoepitopes for both HLA-I and -II, employ prioritization techniques such as gene expression analysis, consider aspects such as transport and cleavage, assess physicochemical properties, and explore homology.
Five studies conducted clinical tests [34,36,38,40,41]. In one study [34], immunogenic peptides were identified and used for the vaccine design. The results observed in patients receiving personalized neoantigen therapy included a complete and durable response beyond 29 months in a patient with metastatic thymoma. A patient with metastatic pancreatic cancer showed a partial response related to the immune system. The other four patients achieved prolonged disease stabilization, with a median progression-free survival of 8.6 months.
In a separate study [36], despite generating specific immune responses against both systemic and intratumoral neoantigens after vaccination, all patients experienced tumor recurrence and died of progressive disease. This indicates that the T cell responses elicited still need to overcome considerable challenges to produce clinically relevant antitumor activity, including tumor-intrinsic damage and immunosuppressive factors in the microenvironment. Low MHC class I expression in the tumor and lack of class II expression may hinder tumor recognition by neoantigen-specific T cells.
In the study reported in [38], 41 months after the start of vaccination, the radiological imaging results (February 2016) showed no evidence of tumor recurrence. In yet another study [40], approximately 55% (61 sequences) of the predicted neoepitopes were confirmed to bind to HLA-I molecules, with three of them eliciting an IFN-γ response. The administered vaccine led to tumor remission alongside bone marrow transplantation.
Finally, in [41], a neoantigen vaccine developed against melanoma has demonstrated that neoantigen vaccination, in combination with checkpoint blockade, can significantly enhance vaccine-induced immune responses and lead to lasting clinical effects in some patients.
Nonetheless, numerous challenges persist within the pipelines established by the scientific community. The lack of standardization in computational methods, applied parameters, utilized datasets, and even the presentation of results complicates the formulation of a comprehensive gold standard protocol, particularly given the genomic diversity of tumors and human alleles, alongside the incomplete understanding of biochemical mechanisms mediating the immune response. Furthermore, many studies do not validate their predictions through in vivo or in vitro methods, despite the importance of this validation given that the error rate for predictions can be as high as 95%. It is also crucial to note that reports of cytokine production in response to neoepitopes in these tests do not always translate to effective tumor inhibition or reduction. Therefore, in addition to considering the specific alleles and types of tumors studied, it is essential to evaluate the combination of identified neoepitopes with adjuvants and conventional chemotherapy drugs for effective treatment.
Despite these limitations, based on the results extracted from in vitro, in vivo, and some clinical tests in humans in the selected studies for this review and the preceding discussions, the prediction of immunogenic neoepitopes appears promising, offering robust evidence for the development of therapeutic vaccines against cancer.

Restrictions Regarding the Article Selection Process
Regarding the constraints of this scoping review, first, we have solely included articles written in English and publicly accessible. Second, it is conceivable that certain programs and algorithms for neoepitope prediction remain proprietary, rendering both their code and reference articles inaccessible to the public.
Additionally, articles focusing solely on specific genes have been excluded. The methodologies employed in these studies vary considerably based on their objectives, gene types, and available datasets. The divergence in the target-specific approach adopted by those studies distinguishes them from our scope of focus, leading to their exclusion.
Furthermore, given that our review focuses on computational approaches applied to human data for personalized therapy, articles that did not utilize human high-throughput sequencing data were not incorporated.

Conclusions and Perspectives
Here, we provide an overview of the approaches for identifying immunogenic neoepitopes and future directions in neoepitope prediction research. This includes advancements in computational models and the need for cross-disciplinary collaboration.
Personalized cancer therapy has emerged as a burgeoning field of research, focusing on understanding the biological mechanisms underlying the immune response to disease. This field integrates computational approaches aimed at developing neoepitope predictors and cancer vaccines. Given the intricate nature of cancer and the limitations inherent in specific in silico methods in addressing this complexity, studies have ventured into exploring diverse parameters and data integration techniques to refine the selection of cancer vaccine targets.
The design of therapeutic vaccines has been facilitated by a deeper understanding of the diversity of tumor-associated neoantigens, the native immune response, and the advancements in genomic sequencing technologies. Consequently, therapeutic cancer vaccines hold the potential to induce tumor regression and eradicate minimal residual disease.
The focus of this scoping review was guided by primary investigations aiming to define the most commonly used computational methods and approaches currently available for predicting tumor neoepitopes, crucial for the development of therapeutic vaccines. We assess the application of neoepitope prediction algorithms, pivotal steps in in silico approaches facilitating the selection of potential immunogenic tumor targets, algorithm performance, and associated limitations spanning from 2012 to 2024. To structure this review, we followed seven guiding questions: "What are the critical steps in identifying immunogenic neoepitopes?"; "What are the main types of tumors studied?"; "What algorithms and programs have been employed for predicting tumor neoepitopes?"; "What dataset is used as a training model?"; "How are the predictions validated?"; "What are the strengths and limitations of the employed approaches?"; and "Can this prediction accurately guide the identification of potential targets for the development of therapeutic vaccines?", aiming to answer them using data obtained from the selected studies.
The critical steps identified in this review for selecting immunogenic neoepitopes (variant calling, the prediction of neoepitopes and prioritization, and validation) represent the primary approach for identifying immunogenic neoepitopes in next-generation sequencing data. The first step, variant calling, plays a crucial role in facilitating the selection of patient-specific tumor mutations. Programs that compare normal and tumor sequences emerge as the optimal strategy at this stage, enhancing the likelihood of identifying novel mutations and tumor-specific mutations and evaluating parameters such as tumor purity. Mutect stands out as the most commonly utilized tool in this category; however, as evidenced, the choice of program may vary depending on the dataset, with some outperforming others. An essential consideration at this stage is the tumor origin of the sequencing data, as it significantly influences mutation rates and purity. Limitations associated with this stage center around the quality of sequenced samples, encompassing the sample collection process and the choice of the sequencing approach.
Challenges such as errors in identifying true mutations (false positives/negatives), tumor heterogeneity, and sequence depth can be addressed through several strategies. Enhanced next-generation sequencing (NGS) platforms offering improved accuracy and coverage can reduce errors. Additionally, advanced bioinformatics pipelines integrating algorithms and machine learning techniques can more effectively distinguish genuine variants from sequencing artifacts. Multi-sample analysis, which integrates multiple regions of both tumor and normal tissue, allows for a comprehensive assessment of mutation spectra and accommodates tumor heterogeneity. Finally, employing cross-validation techniques such as digital droplet PCR or Sanger sequencing can validate key variants identified by NGS, ensuring their accuracy and reliability.
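The idea behind comparing tumor and normal sequences at the variant calling step can be illustrated with a deliberately simplified Python sketch. It is not the algorithm of Mutect or of any other caller, which typically rely on statistical models rather than fixed cutoffs; the class name, function name, and all thresholds below are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteCounts:
    """Read counts at one genomic position in one sample (hypothetical structure)."""
    ref_reads: int  # reads supporting the reference allele
    alt_reads: int  # reads supporting the alternate (mutant) allele

    @property
    def depth(self) -> int:
        return self.ref_reads + self.alt_reads

    @property
    def vaf(self) -> float:
        """Variant allele fraction; 0.0 when the site has no coverage."""
        return self.alt_reads / self.depth if self.depth else 0.0


def is_candidate_somatic(tumor: SiteCounts, normal: SiteCounts,
                         min_depth: int = 10,
                         min_tumor_vaf: float = 0.05,
                         max_normal_vaf: float = 0.01) -> bool:
    """Toy filter: variant is supported in the tumor and (nearly) absent in the normal.

    All thresholds are illustrative assumptions, not recommended settings.
    """
    if tumor.depth < min_depth or normal.depth < min_depth:
        return False  # too little coverage to compare the two samples
    return tumor.vaf >= min_tumor_vaf and normal.vaf <= max_normal_vaf


# A variant seen in 20% of tumor reads and in no normal reads passes the filter,
# whereas one also present in the matched normal is treated as likely germline.
print(is_candidate_somatic(SiteCounts(40, 10), SiteCounts(50, 0)))
print(is_candidate_somatic(SiteCounts(40, 10), SiteCounts(30, 20)))
```

The depth check reflects the point made above that sequence depth limits the ability to distinguish true mutations from artifacts.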
The second and third critical steps, neoepitope prediction and prioritization, respectively, are directly interconnected and represent the stages that receive the most emphasis in the development of novel algorithms and in silico approaches aimed at enhancing the performance of these predictors.
In the context of neoepitope prediction, despite the diversity of machine learning algorithms available, neural networks and, more recently, convolutional neural networks dominate as the preferred choices. These algorithms offer advantages such as the ability to discern intricate patterns, high accuracy compared to other methods, scalability, adaptability, processing speed, and a reduced rate of false positives. Consequently, they hold the potential to deepen our understanding of the immune response and identify novel targets for immunotherapy.
In the context of neoepitope prioritization, which involves identifying and selecting the most promising neoepitopes, the critical steps include ranking neoepitopes based on their predicted binding affinity and immunogenicity, prioritizing neoepitopes unique to the tumor and not present in normal tissues, and considering the mutational load and expression levels of the mutant protein in the tumor. These approaches could be enhanced with the development of more accurate prediction algorithms that incorporate machine learning and deep learning techniques to improve binding affinity and immunogenicity predictions. Integrating sequencing data with proteomics and transcriptomics data to validate the expression of neoepitopes, along with using multiple prediction tools to provide a consensus prediction for better reliability, would also help enhance performance. Additionally, high-throughput screening techniques such as peptide-MHC tetramer assays and in vitro T cell activation assays can be employed to experimentally validate predicted neoepitopes. Using patient-specific HLA typing and tumor heterogeneity data to tailor predictions to individual patients, incorporating structural modeling to better understand binding interactions of neoepitopes/HLA molecules, and applying network-based approaches to study the neoepitope landscape to identify key immunogenic hotspots within the tumor would further improve the effectiveness of these approaches.
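A minimal Python sketch of such a prioritization step is given below, assuming that each candidate carries binding predictions from several tools, an expression estimate, and a flag for presence in normal tissue. The field names, the use of the median as a consensus, the cutoff values, and the peptide identifiers are all hypothetical and do not reproduce the scoring scheme of any program covered in this review.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    """One predicted neoepitope (hypothetical record layout)."""
    peptide: str                        # placeholder identifier, not a real sequence
    percentile_ranks: Tuple[float, ...]  # one binding rank per tool; lower = stronger binder
    expression_tpm: float               # expression of the mutant transcript
    in_normal_proteome: bool            # True if the peptide also occurs in normal tissue


def prioritize(candidates: List[Candidate],
               max_rank: float = 2.0,
               min_tpm: float = 1.0) -> List[Candidate]:
    """Filter and rank candidates; both cutoffs are illustrative assumptions."""
    kept = [
        c for c in candidates
        if not c.in_normal_proteome               # keep tumor-specific peptides only
        and c.expression_tpm >= min_tpm           # mutant transcript must be expressed
        and median(c.percentile_ranks) <= max_rank  # consensus across prediction tools
    ]
    # Best consensus binder first; higher expression breaks ties.
    return sorted(kept, key=lambda c: (median(c.percentile_ranks), -c.expression_tpm))


candidates = [
    Candidate("pep_A", (0.4, 0.6, 0.5), 25.0, False),
    Candidate("pep_B", (1.5, 3.0, 1.8), 40.0, False),
    Candidate("pep_C", (0.2, 0.3, 0.1), 0.2, False),  # dropped: low expression
    Candidate("pep_D", (0.3, 0.2, 0.4), 12.0, True),  # dropped: found in normal tissue
    Candidate("pep_E", (4.0, 5.0, 1.0), 30.0, False),  # dropped: weak consensus binding
]
for c in prioritize(candidates):
    print(c.peptide, median(c.percentile_ranks))
```

Taking the median across tools is only one possible consensus rule; it limits the influence of a single tool that disagrees with the others, as with the last candidate above.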
In brief, future directions should focus on expanding HLA databases to include diverse and rare HLA alleles, enhancing the predictive power for all populations. Employing machine learning models trained on large, diverse datasets can improve the prediction of peptide-HLA binding affinities and immunogenicity. Additionally, developing integrative models that consider multiple factors, such as proteasomal cleavage, TAP transport, peptide-MHC stability, and T cell receptor affinity, will further refine predictions. Utilizing high-throughput experimental techniques like mass spectrometry to validate and refine computational predictions can create feedback loops for continuous model improvement.
The fourth critical step, in vitro and in vivo validation of predicted neoepitopes, highlights additional issues such as a high rate of false positives, indicating that the ability to induce the production of cytokines does not guarantee that there will be an immune response against the tumor. Current limitations include the relatively small number of experimentally validated predicted neoepitopes (ensuring that predicted neoepitopes elicit a robust T cell response) and the translation of findings from in vitro and in vivo models to human patients. To address these challenges, future efforts should concentrate on developing high-throughput assays, such as peptide-MHC binding assays and T cell activation assays, to efficiently validate large numbers of predicted neoepitopes. Using humanized mouse models, organoid systems, and organ-on-chip systems can help assess the immunogenicity and therapeutic potential of predicted neoepitopes in a more clinically relevant context. Finally, applying iterative cycles of prediction and validation, where computational predictions are continually refined based on experimental results, will enhance the accuracy and reliability of neoepitope prediction.
Translational approaches and cross-disciplinary collaborations among bioinformaticians, immunologists, and clinicians should also be pursued to develop prediction models that are more clinically relevant. Such collaborations can facilitate the creation of prediction models that are both accurate and comprehensive. Another crucial aspect, of continued relevance, is standardization and benchmarking. Establishing standardized benchmarks and datasets for assessing the performance of neoepitope prediction tools should be addressed through the establishment of open access repositories, enabling researchers to share validated neoepitopes and associated data.
In this context, in obtaining and sequencing tumor data, an important challenge faced in the development of therapeutic vaccines is tumor heterogeneity. Tumor heterogeneity refers to the existence of different subclones within a single tumor, which are separated not only spatially but also by distinct mutational patterns. This intratumoral heterogeneity can affect the expression of neoantigens and their immunogenicity. In other words, the type of tumor and the stage of the disease may present tumors with high heterogeneity, diverse mutation profiles, and the expression of neoepitopes, leading to greater complexity in targeting therapies, since the predicted neoepitopes may only be effective against one subset of tumor cells.
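The consequence of heterogeneity for target selection can be illustrated with a small Python sketch that estimates what fraction of tumor cells carries at least one of the neoepitopes chosen for a vaccine. The subclone fractions and neoepitope labels are invented for illustration, and the function name is hypothetical; real subclonal reconstruction is considerably more involved.

```python
def covered_fraction(subclones, vaccine_neoepitopes):
    """Fraction of tumor cells in subclones that carry at least one targeted neoepitope.

    subclones: list of (cell_fraction, set_of_neoepitope_labels) pairs, where the
               cell fractions are assumed to sum to 1 (hypothetical input format).
    vaccine_neoepitopes: labels of the neoepitopes included in the vaccine.
    """
    targets = set(vaccine_neoepitopes)
    return sum(fraction for fraction, neoepitopes in subclones
               if targets & set(neoepitopes))


# Invented example: "neo1" is shared by the two largest subclones (clonal-like),
# whereas "neo2" and "neo4" are private to single subclones.
subclones = [
    (0.6, {"neo1", "neo2"}),
    (0.3, {"neo1", "neo3"}),
    (0.1, {"neo4"}),
]
print(covered_fraction(subclones, {"neo2"}))          # private target: one subclone only
print(covered_fraction(subclones, {"neo1"}))          # shared target: two subclones
print(covered_fraction(subclones, {"neo1", "neo4"}))  # combined targets: all subclones
```

Under these assumptions, a vaccine built only on a private neoepitope leaves a substantial share of tumor cells untargeted, which is the situation described above.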
Currently, sequencing and bioinformatics tools, such as single-cell RNA sequencing (scRNA-seq), can be used to acquire high-resolution genomic sequence information, revealing the clonal architecture and mutational ancestry of subclones and helping to map the ideal neoantigens to combat the tumor. However, despite being promising, this technique still faces some problems of its own, such as the difficulty of isolating cells and maintaining their integrity and viability for future analyses, the high cost, and challenges in data integration.
In summary, overcoming the limitations in the critical steps of variant calling, neoepitope prediction and prioritization, and validation requires a multifaceted approach. Future studies should leverage advances in sequencing technologies, machine learning, high-throughput screening, and collaborative frameworks to enhance the accuracy, efficiency, and clinical relevance of neoepitope prediction for therapeutic vaccine development. By integrating these approaches, the field can move closer to developing effective personalized cancer vaccines.

Figure 2. Critical steps flowchart. The figure presents the challenges (left) and current progress (right) of each of the critical steps in the identification of immunogenic neoepitopes.

Figure 4. Variant calling software usage in the selected studies. The "x" axis represents the software or database, while the "y" axis represents the frequency of software utilization across the articles identified in this review.

Figure 5. Neoepitope prediction usage in the selected studies. The horizontal (x) axis represents the neoepitope prediction programs, while the vertical (y) axis indicates the frequency of usage of each program across the articles.

3.6. Can This Prediction Accurately Guide the Identification of Potential Targets for the Development of Therapeutic Vaccines?

• Web of Science: open access; articles; English language.
• Science Direct: open access and open archive; research article (search in "title, abstract, keywords").

Table 1. Synthesis matrix of neoepitope identification studies in NGS data.

Table 2. Synthesis matrix of studies addressing development of new programs for neoepitope prediction.

Table 4. Prediction and prioritization algorithms, advantages, and disadvantages.