Predicting Immunogenicity Risk in Biopharmaceuticals

: The assessment of immunogenicity of biopharmaceuticals is a crucial step in the process of their development. Immunogenicity is related to the activation of adaptive immunity. The complexity of the immune system manifests through numerous different mechanisms, which allows the use of different approaches for predicting the immunogenicity of biopharmaceuticals. The direct experimental approaches are sometimes expensive and time consuming, or their results need to be conﬁrmed. In this case, computational methods for immunogenicity prediction appear as an appropriate complement in the process of drug design. In this review, we analyze the use of various In silico methods and approaches for immunogenicity prediction of biomolecules: sequence alignment algorithms, predicting subcellular localization, searching for major histocompatibility complex (MHC) binding motifs, predicting T and B cell epitopes based on machine learning algorithms, molecular docking, and molecular dynamics simulations. Computational tools for antigenicity and allergenicity prediction also are considered. Ranks peptides using the position-speciﬁc scoring matrix coefﬁcients. The method is validated with proteins known to bear MHC class I T cell epitopes. Performance and result analysis show that more than 80% of these epitopes are among the top 2% of scoring peptides.


Introduction
Biopharmaceuticals are drug therapy products of biological origins such as living cells or viruses, usually derived with the methods of biotechnology. One of the main obstacles in using biological drugs for new therapies is the unwanted immunogenicity. On the other hand, immunogenicity is a keystone in vaccine development. When a foreign substance is exposed for the first time to the immune systems, it induces a specific immune response. This ability is known as immunogenicity [1]. The development of methods for the assessment of immunogenicity requires deep understanding of the mechanisms of immune response activation.
The immune system can be divided in two main branches: innate and adaptive immunity. The main difference between them is in how they recognize the pathogens. The innate immunity system uses nonspecific defense mechanisms and is represented by eosinophils, monocytes, macrophages, natural killer cells, complement system. The chemical properties of the antigens activate the innate immune response, which induces inflammation on the site.
The adaptive immunity is induced by the activation of cellular and/or humoral responses [1,2]. The T cell activation is a decisive step in both responses. The T cell receptor (TCR) is a specific receptor, which enables the recognition of antigens bounded to major histocompatibility complex (MHC) molecules. The MHC molecules present antigens (intracellular or extracellular) on the surface of the antigen-presenting cells (APCs). MHC proteins in humans are extremely polymorphic. There are three gene loci that encode the MHC class I proteins in humans: HLA-A, HLA-B and HLA-C, and other three loci that encode the MHC class II proteins: HLA-DR, HLA-DQ and HLA-DP. The MHC class I molecules bind to intracellular antigens and present them on the surface of all nucleated cells [3][4][5]. The MHC class II molecules present extracellular antigens and can be found on structure-based and sequence-based. Structure-based methods use the data from the threedimensional (3D) structure of the proteins. Sequence-based methods analyze the amino acid (aa) sequence. The list of approaches includes QSAR (quantitative structure-activity relationship) analysis, matrix-driven methods, protein threading, homology modeling, identification of structural binding motifs, docking techniques, and the application of machine-learning algorithms and tools [5].

Sequence-Based Methods
The first computational methods used for sequence-based binding affinity prediction were sequence motifs [15] and quantitative matrices [16]. They were quickly replaced by the gold standard-machine learning algorithms [17][18][19]. Sequence-based methods have some limitations. In the statistical learning methods, in order to train the model, an experimental (training) dataset is required. The composition of the dataset is very important because it can bias the predictions [20,21]. Therefore, it is more reliable to predict MHC variants (i.e., alleles) with larger datasets available for training than to predict less studied alleles.

Structure-Based Methods
A training dataset is not required, because these methods rely on the peptide-MHC (pMHC) interaction and the biochemical properties of the amino acids involved in it [20]. The structure-based methods can be used to analyze the impact on the binding affinity and immunogenicity of MHC-binding peptides after phosphorylation [22], citrullination [23], and glycosylation [24]. The latter are post-translational amino acid modifications, which may affect the binding affinity. In addition, the structural basis of TCR/pMHC interactions can be detailed. One of the main challenges in the structure-based methods is the obtaining and processing of three-dimensional structural data. In order to conduct personalized structure-based analyses, the structural prediction and/or modeling should be performed using computational methods and tools. However, from a computational perspective, the pMHC modeling and structure-based binding affinity prediction is challenged by the size and flexibility of the ligands involved [25]. In order to perform the required structural analyses in an efficient way, we should rely on the expert knowledge and/or experimental data that we already have [20,25].

Methods for Immunogenicity Prediction
The complexity of the immune system allows the application of different approaches for immunogenicity predictions.

Sequence Alignment Algorithms
The sequence alignment algorithms compare two or more sequences trying to highlight similarities between them. These similarities can be a result of different relationships between the compared sequences-structural, functional, or evolutionary. The aim of the sequence alignments is to organize, visualize, search, and analyze the sequences. The sequence alignment algorithms are applied for immunogenicity prediction by screening databases and comparison of proteins or peptides with known antigens.
The application of sequence alignment algorithms for the identification of antigens is problematic for several reasons and may produce ambiguous results or fail. For example, some proteins which do not have obvious sequence similarity and are formed through evolution (divergent or convergent), may share similar properties and structure [26]. Moreover, in some cases the antigenicity (as a property) may not be available for direct identification using sequence alignment methods because it is encoded in a subtle sequence.

Predicting Subcellular Localization
One of the basic assumptions for the search of vaccine antigen candidates is that a potential antigen is located at the bacterial surface where it is available for antibody recogni-tion. The N-terminal of a protein contains an indication of its cellular destination. A "leader sequence" is an indication that the protein will be exported to extracytoplasmic compartments. Additional signature is when the cleavage site is just after the leader peptide-a sign that the protein in Gram-positive bacteria will be released to the extracellular space, or in Gram-negative bacteria into the periplasmic space [27].
A few tools for the prediction of the subcellular localization exist: • PsortB (https://www.psort.org/psortb/ (accessed on 27 February 2021)) consists of multiple analytical modules, each of which analyzes one biological feature known to influence or be characteristic of subcellular localization. This tool provides predictions of protein subcellular localization for both Gram-positive and Gram-negative bacteria [28]; • SignaIP (http://www.cbs.dtu.dk/services/SignalP/ (accessed on 27 February 2021)) is a neural network-based method, which can discriminate signal peptides from trans membrane regions. It predicts signal peptide cleavage sites, their presence and location in the amino acid sequence [29]; • PSLpred (http://crdd.osdd.net/raghava/pslpred/ (accessed on 27 February 2021)) is a web server for predicting the subcellular localization of Gram-negative bacterial proteins based on support vector machine modules [30].

T and B Cell Epitopes Prediction
The epitope is a part of an antigen, comprising a few amino acids residues and can be recognized by the host immune system (B or T cells). The epitopes can elicit an immune response, which is why it is important to identify or predict them [31].
The main goal when predicting the epitopes is to investigate their binding affinity to the MHC molecules, which is a prerequisite for peptide immunogenicity [2]. The complex of peptide and MHC proteins is involved in the known mechanisms of both cellular and humoral immune response. For the contemporary identification of epitopes, several In Silico methodologies have been created and constantly improved, because the experimental techniques were found to be time-consuming, difficult, and expensive. There are different databases storing the immunological data, which is necessary for the development of computational methods. Some of the most reliable and up-to-date are presented in Table 1.

T Cell Epitope Prediction
The MHC class II binding groove is open-ended, and peptides of varying sizes can accommodate in it. Therefore, it is considered important the influence of flanking residues (p. 1, p. 10, and p. 11) which are positioned outside of the core 9-mer [38,39].
The potency of a peptide to stimulate a T cell response In Vivo is not determined by the affinity of a peptide to MHC class II, although a limited number of models have demonstrated a correlation between predicted MHC class II binding and T cell epitopes [40]. All T cell epitopes can bind to MHC class II, however not all MCH binders are T cell epitopes. The In Silico methods are a useful aid to identify peptide sequences which potentially are MHC class II binders. The number of actual T cell epitopes in a certain protein sequence are constantly over-predicted by the In Silico methods. Therefore, normally an Ex Vivo or In Vivo T cell activation analysis is required in order to determine which are the actual T cell epitopes and to rank the epitopes based on their potency to induce a T cell response [41].
There are many obstacles for the high prediction accuracy of MHC-II-binding epitopes: insufficient or low-quality data, the limitation of the binding core length and difficulty in its identification, the influence of the residues flanking the core, and the permissiveness of the binding groove [42,43].

Sequence-Based Methods
Some of the web tools for MHC binding and T cell epitope prediction are listed in Table 2. Integrates prediction of binding to 12 MHC class I alleles and proteasomal C-terminal cleavage using artificial neural networks (ANN) and TAP transport efficiency using weight matrix.
ANN [52]  Calculates the immune response. The lymphocyte receptors and the pathogens are represented by their amino acid sequences.
ANN [66] Motif Search-Based Approach A motif is a preferred amino acid combination at some of the peptide binding positions. Motifs can be searched in the peptide sequence by using a motif library [41]. If we know and compare the binders and non-binders for a given peptide, we can identify the MHCbinding motifs [67]. The motif-based algorithm's accuracy is about 60-70%. The reason for this relatively low accuracy is an absence of recognizable motifs in some binding peptides [68]. Quantitative matrices-driven methods (QMs) look like an extended motif, where for every amino acid in a peptide a coefficient is imputed [69]. The estimation of QM is made by specific binding profiles, which are derived experimentally.

Prediction by Statistical and Machine Learning Algorithms
The development of computer science and the need to analyze an increased amount of data has led to the development of machine learning algorithms. The main challenges in working with proteins as sequence data are to represent amino acid sequence using appropriate numerical descriptors, and how sequences with different lengths should be compared. Once there is success with these challenges, a huge arsenal of machine learning methods can be applied in order to find reliable models.

Artificial Neural Networks (ANN)
The ANNs provide a convenient method to find relationships and describe nonlinear data [70]. ANNs are based on a simple model of a neuron: dendrites that collect inputs from other neurons, a soma that performs a nonlinear processing step, and an axon that transmit the signal to another neuron. The base for the learning process is the connections between the elements: dendrites (summation of input), soma (threshold, nonlinearity), and axon (transmission of output). The peptide length can be highly variable when we are applying it for epitope prediction. A specific anchor position is assigned to the sequences which are included in the training set. The model prediction for the MHC I model is less challenging than those for MHC class II; the reason is that for MHC I, the difference in the peptide length is negligible, while for MHC class II the length variability is larger [5]. In order to have reliable models based on ANNs, huge amounts of data are required for proper training of the neural network, this is one major limitation in the application of ANNs.

Support Vector Machines (SVMs)
SVMs, or support vector machines, handle non-linearity by using a kernel function to map the input variables into a very high-dimensional (possibly infinite-dimensional) space in which a hyperplane can be used to perform separation. The goal in SVM modeling is an optimal hyperplane, which ensures that the clusters are separated in such a way that the observations from two different classes are correctly separated between the two sides of the plane [71].

Structure-Based Methods
Structure-based drug design relies on the 3D structure of the biological target, which can be obtained with X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy [72].

Molecular Docking
The most frequently used method among the structure-based methods is molecular docking. Molecular docking allows the interaction between a protein and a small molecule (ligand) to be modeled at atomic level. The molecular docking process includes two main steps: the ligand conformation is sampled in the active site of the protein, and then the obtained conformations are ranked using a scoring function. Ideally, the experimental binding mode would be reproduced by the sampling algorithms, and among all generated conformations the scoring function would rank the highest [73].
A central challenge in molecular docking is the ligand flexibility-more flexible bonds mean more conformations that can be adopted. Consequently, a good docking method must consider not only the position and orientation of the ligand in the receptor's binding cleft but also all possible conformations [74]. The drug-like ligands usually have fewer than 10 flexible bonds; however, the MHC-binding peptides in most of the cases have more than 30 flexible bonds (reaching to more than 50 for MHC class II). The increased number of flexible bonds requires the usage of more computational resources. These limitations of the molecular docking have triggered the development of protein-peptide docking techniques.
Protein-peptide docking techniques include template-based docking; local and global docking [75].
In template-based docking methods, known structures (templates) are used as scaffolds in order to build a model of the complex. GalaxyPepDock (http://galaxy.seoklab.org/ cgi-bin/submit.cgi?type=PEPDOCK (accessed on 27 February 2021)) searches for templates based on structure and interaction similarity [76].
Local docking methods apply a search for a peptide binding pose in the proximity of a user-defined binding site. Usually, the criteria for improvement of the initial model are the backbone-root-mean-square deviation (RMSD) from the experimental model. Rosetta FlexPepDock (http://flexpepdock.furmanlab.cs.huji.ac.il/cite.php (accessed on 27 February 2021)) [77], and PepCrawler (http://bioinfo3d.cs.tau.ac.il/PepCrawler/php.php (accessed on 27 February 2021)) [78] requires the user to prepare an initial model of the complex. HADDOC (http://milou.science.uu.nl/services/HADDOCK2.2/haddock.php (accessed on 27 February 2021)) [79] automatically places the peptide in the proximity of the binding site defined by a user-provided list of interface residues. Some tools such as AutoDock Vina [80], Gold [81], Glide [82] or Surflex-Dock [83], used for the docking of small molecules can be used for the docking of peptides (up to a few amino acids). For those tools, there is also a requirement for manual placement of the peptide conformation within the binding site. DINC (http://dinc.kavrakilab.org (accessed on 27 February 2021)) [84] divides the peptide into segments of increasing length, and in this way overcomes the short peptide limitation. DINC is based on AutoDock protocol. The receptor structure remains rigid during the docking procedure.
Global docking methods perform a coupled search for peptide binding sites and the respective pose. The protein and the ligand conformations are represented as rigid and usually an exhaustive rigid-body docking is performed [75]. However, some docking tools allows flexibility in ligands or proteins by using different approaches: PeppAttract (http://www.attract.ph.tum.de/services/ATTRACT/peptide.html (accessed on 27 February 2021)) [85] performs global rigid-body docking of peptide structures within binding pockets with subsequent scoring and flexible refinement of models; CABS-dock (http://biocomp.chem.uw.edu.pl/CABSdock (accessed on 27 February 2021)) [86] generates peptide conformations combined with global docking in one explicit simulation. ClusPro PeptiDock (https://cluspro.org/peptide/index.php (accessed on 27 February 2021)) [87] predicts the peptide conformation based on motifs, followed by rigid-body docking, structural clustering used for scoring; and subsequent final structure minimization.
Although the great variety of methods and tools for protein-ligand docking, there are only a few MHC protein-ligand docking tools. Hlaffy (http://proline.biochem.iisc. ernet.in/HLaffy (accessed on 27 February 2021)) [88] estimates the binding strengths of peptide-MHCs and predicts the epitopes for any MHC class I allele. The assessment of the binding strength of peptide-MHCs is achieved through learning pair-potentials, which are important for the peptide binding and provide an estimation of the ligand space (total) for each allele. DockTope (http://tools.iedb.org/docktope (accessed on 27 February 2021)) [89] is a web-based tool for pMHC class I modeling. It is based on crystal structures of peptide complexes with two human and three mice MHC class I alleles DB. EpiDOCK (http://epidock.ddg-pharmfac.net (accessed on 27 February 2021)) [90] predicts the binding affinity to the most frequent human MHC class II alleles. Docking score-based quantitative matrices represent the prediction models.

Molecular Dynamics Simulations
Molecular dynamic (MD) simulation is another structure-based method. MD analyzes the physical movements of atoms and molecules. The evolution of a molecular system is tracked over time by using a potential energy function. MD simulations require huge computational resources. The MD simulations are more appropriate to study the binding between TCR, epitope and MHC protein in detail, rather than for pMHC interaction predictions. The T cell epitope engineering and the rational design to fit to the peptide binding cleft of new MHC inhibitors may be facilitated by using MD simulation [91].

B Cell Epitope Prediction
The B cell receptors and/or antibodies recognize the B cell epitopes in their native structure. B cell epitopes are classified into linear (continuous) and conformational (discontinuous). Linear epitopes are short peptides, composed of sequential amino acids in the antigen. The greatest part of B cell epitopes are discontinuous (90% of B-cell epitopes). The amino acids of an epitope may not be contiguous in the primary sequence and the protein folding can bring them into spatial proximity. The continuous B cell epitopes are predicted using amino acid properties such as secondary protein structure, amino acid charge, hydrophilicity, and exposed surface area. The discontinuous B cell epitopes are predicted using the 3D structure of the antigen [92,93].
The main methodologies of B cell epitope prediction involve sequence-based methods, machine-learning methods, and structure-based methods ( Table 3). The epitope surface accessibility to antibody binding is generally used by the sequence-based methods [94]. These B cell prediction methods can be widely used because the antigen sequence can be obtained easily; however, the predicted epitopic residues are not grouped into the corresponding epitopes [93]. The machine learning-based linear B cell epitope prediction method principle includes several steps: collection of a dataset with comprehensive and clean data; extraction of antigen features of the sequences (e.g., amino acid composition, physicochemical properties, evolutionary information); and training the model using machine learning algorithms [92]. Structure-based methods for conformational B cell epitope prediction identify epitopes by antigen structure and epitope-related propensity scales, including specific physicochemical properties and geometric attributes. The surface accessibility and novel epitope propensity amino acid score are determined using the 3D structure of proteins. The combination of propensity residues scores in spatial proximity and the contact numbers is used to calculate the final score. Maps conformational epitopes on the antigen protein surface. Then performs patch analysis to identify the spatial contiguous clusters of residues on the antigen surface with similar physico-chemical properties, as found in the phage display sequences.
Discontinuous [101]  Based on machine-learning algorithm, which is trained to distinguish antigenic features within a given protein.
Both linear and discontinuous [102] PepSurf http://pepitope.tau.ac.il/ (accessed on 27 February 2021) Uses a mimotope based methods. Maps each of the affinity selected peptides onto the surface of antigen and aims to find interacting interfaces by only considering solvent-exposed residues.
Both linear and discontinuous [103] ElliPro http://tools.iedb.org/ellipro/ (accessed on 27 February 2021) Implements three algorithms: approximation of the protein shape as an ellipsoid, calculation of the residue protrusion index; and clustering of neighboring residues based on their protrusion index values.
Both linear and discontinuous [104] EPSVR http://sysbio.unl.edu/EPSVR/ (accessed on 27 February 2021) Based on support vector regression method in order to integrate six scoring terms: residue epitope propensity, conservation, side chain energy score, contact number, secondary structure protein composition and surface planarity score.
Both linear and discontinuous [105] Mimotope-based methods use peptides called mimotopes which are usually obtained by random peptide library screening using a monoclonal antibody [106]. The genuine epitopes can be mimicked by the mimotopes on the corresponding antigen. The high sequential similarity of the selected mimotope implies existence of key binding motifs and physicochemical preferences [107].

Antigenicity Prediction
Antigenicity of immunogens is among the most important criteria for efficient protein design. In order to have better immunization results in the empirical phase of immunological studies, we need a protein with a high antigenicity score. In silico antigenicity prediction can be the first step in the search of the potential vaccine candidates. The employment of In Silico methods and bioinformatics in the vaccinology led to the definition of the term reverse vaccinology, invented by Rino Rappuoli [108] and applied for the first time in the development of a vaccine against serogroup B meningococcus [109].
A few computational tools for antigenicity prediction are available online: • AntigenPro (http://scratch.proteomics.ics.uci.edu./ (accessed on 27 February 2021)) uses machine learning methods to predicts the protein likelihood to be protective antigen. The predictive models are built using feature sets from six microbial species and their known antigens and non-antigens [110]

Allergenicity Prediction
Allergen is a protein or a glycoprotein that produces an abnormally vigorous immune response and cause an allergic reaction. The potential immunogenicity of partly foreign biopharmaceuticals can develop allergic reactions. The immune response reaction and the severity to a biological drug may range from life-threatening (e.g., anaphylactic) to severe or effects with no clinical significance [114].
Several online tools for allergenicity prediction exist (Table 4). Allergenicity prediction tools can be divided into two groups according to the used approach: assessment according to the rules of the guidance of Codex Alimentarius, created by the Food and Agriculture Organization of the United Nations [115], and assessment based on machine learning methods. According to the Codex Alimentarius guidance, a novel protein can be considered as a putative allergen if it shares greater than 35% sequence identity with a known allergen over a sliding window of at least 80 amino acids [115]. Provides six different methods for allergenicity assessment: IgE epitopes mapping, motif search, SVM, and sequence alignment. [117] AllerTop http: //www.ddg-pharmfac.net/AllerTOP/ (accessed on 27 February 2021) A machine learning-based tool. The amino acids of the query proteins are represented by descriptors-size, residue, abundance, hydrophobicity, helix-and β-strand forming propensities. [118] AllergenFP http://ddg-pharmfac.net/AllergenFP (accessed on 27 February 2021) A four-step algorithm: a protein sequence is transformed into numerical strings, the strings are converted by auto-and cross-covariance (ACC) transformation into vectors with equal length, the vectors are transformed into binary fingerprints, and the binary fingerprints are compared using Tanimoto similarity coefficient. [119]

Conclusions
In silico methods for predicting antigenicity, allergenicity and MHC binding have become an imminent part of the process of discovery and development of new biopharmaceuticals. The In Silico methods play a crucial role in drug design and development, being safe and resource-and time-efficient. They significantly reduce the subsequent experimental work but cannot yet fully replace and be an alternative of the In Vivo animal testing. The robustness of the In Silico models depends on the quality of the biological data used in the models development; therefore, the compilation and assembling of consistent biological data is of utmost importance. It is expected that the In Silico techniques will influence the search for new biopharmaceuticals as much as they have influenced the design and discovery of new drugs and small molecules.

Conflicts of Interest:
The authors declare no conflict of interest.