Article

Epitope Prediction Based on Random Peptide Library Screening: Benchmark Dataset and Prediction Tools Evaluation

1
Faculty of Chemistry, Northeast Normal University, Changchun 130024, China
2
School of Computer Science and Information Technology, Northeast Normal University, Changchun 130117, China
3
National Engineering Laboratory for Druggable Gene and Protein Screening, Northeast Normal University, Changchun 130024, China
*
Authors to whom correspondence should be addressed.
Molecules 2011, 16(6), 4971-4993; https://doi.org/10.3390/molecules16064971
Submission received: 28 February 2011 / Revised: 1 June 2011 / Accepted: 10 June 2011 / Published: 16 June 2011

Abstract

Epitope prediction based on random peptide library screening has become a focus of immunoinformatics research as a promising method. Several novel software tools and web-based servers have been proposed in recent years and have succeeded on their own test cases. However, since the number of available mimotopes with a relevant structure of the template-target complex is limited, a systematic evaluation of these methods is still absent. In this study, a new benchmark dataset was defined. Using this benchmark dataset and a representative dataset, five of the most popular epitope prediction tools based on random peptide library screening were evaluated. On the benchmark dataset, no method exceeded a precision of 0.42 or a sensitivity of 0.37, and the MCC scores suggest that the predictions of these tools are better than random prediction by only about 0.09–0.13; on the representative dataset, most of the performance measures improve slightly, but the overall performance is still not satisfactory. Many test cases in the benchmark dataset cannot be applied to these tools because of software limitations. Moreover, chances are that these tools are overfitted to the small dataset and will fail in other cases. Therefore, the correlation between mimotopes and genuine epitope residues is still far from resolved, and a much larger dataset for mimotope-based epitope prediction is desirable.

1. Introduction

A B-cell epitope is defined as the part of a protein antigen recognized by either a particular B-cell receptor (BCR) or a particular antibody molecule of the immune system [1]. It may be either a short contiguous stretch of amino acids, called a linear or continuous epitope, or consist of sequence segments that are brought together in spatial proximity when the protein is folded, called a conformational or discontinuous epitope [2]. It has been suggested that more than 90% of B-cell epitopes are conformational epitopes [2,3].
The identification of B-cell epitopes is important to many immunodetection and immunotherapeutic applications because epitopes elicit humoral immune responses [4,5]. The objective of epitope prediction is to design a molecule that can mimic the structure and function of a genuine epitope and replace it in medical diagnostics and therapeutics, as well as in vaccine design [1,6]. The most reliable methods of mapping epitopes are X-ray crystallography and Nuclear Magnetic Resonance (NMR) [7,8,9]; however, these techniques are demanding and time-consuming. By comparison, computational methods for detecting epitopes are much more efficient and cheaper, but they still cannot achieve results as satisfactory as those of experimental methods [10]. Thus, the combination of experimental and computational approaches seems to be the best choice.
For a long time, computational methods for B-cell epitope prediction focused mainly on linear epitopes. These methods are based on various amino acid propensity scales, such as hydrophilicity, β-turns, solvent accessibility, etc. [11,12,13,14,15,16,17], but in 2005 Blythe and Flower showed that “single-scale amino acid propensity profiles cannot be used to predict epitope location reliably” by testing combinations of propensity scales and feature parameters in more than 10^6 experiments [18]. Later, some methods based on machine learning and artificial intelligence were proposed [19,20,21], but in 2007 Greenbaum et al. demonstrated that “the combination of scales and experimentation with several machine learning algorithms showed little improvement over single scale-based methods” [22]. The 3D structure of a protein carries more information than its amino acid sequence, so epitope prediction based on the 3D structure of the antigen is expected to give better results. Methods proposed to detect linear epitopes from the 3D structure of the protein before the first Antigen-Antibody (Ag-Ab) complex was solved confirmed this point [11,23,24,25,26]. With the availability of more Ag-Ab crystal complexes, epitope prediction entered a new era, and many algorithms for identifying conformational rather than linear epitopes were proposed based on this significant information [27,28,29,30].
Since phage display technology was first proposed by Smith in 1985 as a systematic method for presenting, selecting and evolving proteins and peptides displayed on the surface of filamentous phage [31], it has developed rapidly both in basic research, such as the identification of protein-protein interactions [32,33], and in applied research, such as the development of new diagnostics, therapeutics and vaccines [4,34,35]. There are usually two steps in epitope prediction based on phage display technology. The first is the determination of mimotopes. The random peptides that bind to a monoclonal antibody with a certain degree of affinity are screened, eluted and then amplified. After 3–5 rounds of “adsorption-elution-amplification”, the resulting peptides become fewer but have higher affinity. These affinity-selected peptides are defined as mimotopes. They not only have high sequence similarity with the native epitope but can also mimic the essential features of the genuine epitope [36,37]. The second step is mapping the mimotopes to the antigen. In recent years, several tools have been developed to implement this step [27,28,38,39,40,41,42,43,44,45,46], and the mapping algorithms can be classified into two main categories according to their dependency on antigen structures: one works with the sequence of the antigen only, and the other works with both the sequence and the 3D structure of the antigen. Whichever way they work, the algorithms can predict both linear and conformational epitopes.
Epitope prediction based on random peptide library screening is expected to improve the accuracy of epitope prediction considerably by combining biological experiments with computational methods [40,41,42]. Though all the existing methods have succeeded on their given test cases, a systematic evaluation of these methods is still absent because of the scarcity of useful test data [38]. Moreover, since the number of available mimotopes with a corresponding structure of the template-target complex is limited, the existing software is prone to overfitting. Constructing benchmark datasets will help establish a standard for the evaluation and comparison of the existing algorithms, and such a standard is badly needed for the development of new epitope prediction algorithms. In this paper, we collected a group of reliable datasets from MimoDB [47] and PDB [48] as a benchmark dataset to facilitate the further development of this standard. In addition, using the new benchmark dataset and a representative dataset, we evaluated five recently developed epitope prediction tools: Mapitope [40], PepSurf [41], Pep-3D-Search [42], Pepitope (the combination of Mapitope and PepSurf; for a detailed description see the Material and Methods section) [43] and EpiSearch [44]. All these tools are either open source or have freely available web services.

2. Results and Discussion

2.1. Datasets Compilation

Sixty-two test sets were collected from the November 2010 release of MimoDB, and the corresponding structures of the template-target complexes were obtained from the PDB. Every test set was carefully checked by hand. If a test set had several corresponding complex structures in the PDB, we selected the one with the latest release date and the highest resolution (there are 24 such sets among the 62). For sets that are identical in everything but the round of biopanning, the set from the last round was retained and the others were excluded. There are 14 such sets [MS01153, MS01101, MS01102, MS01103, MS01104, MS01106, MS01107, MS01108, MS01109, MS01111, MS01112, MS01113, MS01114 and MS01153]. If sets had exactly the same data, they were combined into one independent set, and their mimotope sequences were merged. MS01105, MS01110 and MS01115 meet this condition, and we combined them as MS01105/10/15. If a template-target complex structure had several pairs of template and target chains with the same structure and binding sites, only one pair was selected and the template chain was extracted. The genuine epitopes of these mimotope sets were determined by CED, IEDB-3D or CMA. In this way, 47 test sets with 28 corresponding complex structures formed the final benchmark dataset. These 47 test sets comprise 18 sets with antigen-antibody complexes and 29 sets with protein-protein interactions. Then, for each template-target complex, we selected only one test set according to the prediction results to form the representative dataset (see the Material and Methods section). Finally, 30 test sets formed the representative dataset.
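The structure-selection rule above (latest release date, highest resolution) can be illustrated with a minimal sketch. The text does not say which criterion takes priority when the two disagree, so the sketch assumes that the release date is compared first and resolution only breaks ties; the class name, field names and the identifiers in the example are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ComplexStructure:
    """One candidate complex structure for a test set (hypothetical record)."""
    pdb_id: str
    released: date
    resolution: float  # in angstroms; a lower value means a higher resolution


def pick_structure(candidates):
    """Return the candidate with the latest release date.

    Assumption: ties on the release date are broken by the best
    (numerically lowest) resolution.
    """
    return max(candidates, key=lambda s: (s.released, -s.resolution))


# Made-up candidates for a single test set.
candidates = [
    ComplexStructure("XXX1", date(2003, 5, 1), 2.8),
    ComplexStructure("XXX2", date(2008, 9, 12), 2.5),
    ComplexStructure("XXX3", date(2008, 9, 12), 1.9),
]
chosen = pick_structure(candidates)
print(chosen.pdb_id)
```

If resolution were meant to take priority over the release date, only the order of the two entries in the sort key would change.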

2.2. Performance Evaluation for the Existing Epitope Prediction Software

We reviewed and analyzed 11 recently developed methods (see Table 1) and chose five publicly available ones to test using the benchmark dataset. Detailed information on the five methods is provided in the Material and Methods section.
Table 1. Comparison of 11 epitope prediction methods. A brief introduction was given.
| Method | Publication year | Language | Operating System | Service | Notes |
|---|---|---|---|---|---|
| FINDMAP | 2003 | C++ | not stated | no | FINDMAP is a method to acquire information on the 3D structure of the protein by identifying discontinuous epitopes; it maps one mimotope sequence to the protein at a time. |
| SiteLight | 2005 | C++ | Linux | no | SiteLight is a method of predicting the binding site on a 3D structure using random peptide library screening. |
| 3DEX | 2005 | VB | Windows | no | 3DEX allows the analysis of single amino acid of a linear peptide sequence with regard to their spatial neighborhood in the 3D structures of PDB files based on preselectable parameters like distance, string length (frame size) and surface exposure. It maps mimotopes to the protein one by one, one sequence at a time. |
| MIMOP | 2006 | PHP | Independent | Upon request | MIMOP provides an environment for mimotope characterization which integrates two main approaches, MimAlign and MimCons, which deliver to the user mimotope analysis results. |
| MIMOX | 2006 | Perl | Independent | Web | MIMOX has two sections, the first is to derive the consensus sequence, and the second is to map the single sequence to the target protein. |
| Mapitope | 2007 | C++ | Windows | Web | Mapitope is based on that epitope determinants shared by the entire set of peptides are detected. Both web service and source code is available. |
| PepSurf | 2007 | C++ | Linux | Web | PepSurf is an algorithm for mapping a set of affinity-selected peptides to the solved surface of the antigen. Both web service and source code is available. |
| Pepitope | 2007 | C++ | Windows | Web | Pepitope is a combination algorithm of PepSurf and Mapitope, the web service is available freely on the Pepitope server. |
| MEPS | 2007 | Java | Independent | Web | MEPS provides two services, one is to evaluate the likelihood that a given peptide to mimic exposed regions of the protein, and the other one is to generate all peptides of a given length to mimic exposed regions of the protein. |
| Pep-3D-Search | 2008 | VB | Windows | Graphic interface | Pep-3D-Search is an epitope mapping algorithm based on both mimotope and motif analysis. The source code is available freely. |
| EpiSearch | 2009 | not stated | not stated | Web | EpiSearch is an automated detection of conformational epitopes using random peptide library screening. It provides web service freely. |

2.2.1. Criteria and datasets used in methods evaluation

There is no commonly accepted standard for evaluating mimotope-based B-cell epitope prediction methods. Some authors report only the number of predicted epitope residues and the number of true epitope residues on a few test cases [40,45]. Others report the sensitivity, positive predictive value (PPV) and Matthews correlation coefficient (MCC) [42]. Several problems arise when evaluating and comparing different prediction methods. Using the benchmark dataset, we found that some test sets yield several predicted clusters while others yield only one, and that the highest-scoring cluster does not always perform best. To address this problem, we defined the best resulting cluster so that test sets with different numbers of resulting clusters can be compared fairly. Different methods also adopt different algorithms and have different restrictions. Pep-3D-Search builds an empirical background distribution for the alignment score of every mimotope against the antigen; if the P-value of the alignment score of every mimotope is greater than 10^-3, Pep-3D-Search gives no prediction result. The online PepSurf and Pepitope require that a mimotope sequence be no longer than 14 residues, and EpiSearch requires that the number of mimotope sequences not exceed 30. Considering these restrictions, we defined a representative dataset for further evaluation of the prediction methods. Being aware of the above problems and limitations, we also applied the following performance measures in this study to provide a complete and fair evaluation of the methods:
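$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
$$\text{PPV} = \frac{TP}{TP + FP}$$
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$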
In our study, TP is the number of predicted epitope residues that are true epitope residues. FP is the number of predicted epitope residues that are not true epitope residues. TN is the number of predicted non-epitope residues that are not true epitope residues. FN is the number of predicted non-epitope residues that are true epitope residues. Among these performance measures, MCC is in essence a correlation coefficient between the predicted and the true epitope residues; it returns a value between –1 and +1. The value +1 represents a perfect prediction, 0 represents an average random prediction and –1 represents an inverse prediction.
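As a worked example, the sketch below computes the four measures from the residue counts using their standard definitions. The counts are not reported in the paper; they are inferred here from the Mapitope values for MS00029 in Tables 2 and 3 (15 antigen residues, 11 epitope residues, sensitivity 0.636, PPV 1.000), which imply TP = 7, FP = 0, FN = 4 and TN = 4. Returning 0 when a denominator is zero is an assumption of this sketch, not something stated in the text.

```python
import math


def performance(tp, fp, tn, fn):
    """Return sensitivity, specificity, PPV and MCC from residue counts.

    Assumption: a measure whose denominator is zero is reported as 0.0.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    ppv = ratio(tp, tp + fp)
    mcc = ratio(tp * tn - fp * fn,
                math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return sensitivity, specificity, ppv, mcc


# Counts inferred from the Mapitope entries for MS00029 (see lead-in).
sens, spec, ppv, mcc = performance(tp=7, fp=0, tn=4, fn=4)
print(f"{sens:.3f} {spec:.3f} {ppv:.3f} {mcc:.3f}")  # 0.636 1.000 1.000 0.564
```

The printed MCC of 0.564 agrees with the tabulated Mapitope value for MS00029, which is consistent with the four counts summing to the residue number of the whole antigen.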
Because the five epitope prediction methods adopt different strategies to define the surface amino acids of the antigen, to be fair we took the number of amino acids of the whole antigen, rather than the number of surface amino acids, as TP+FP+FN+TN when calculating the above performance measures. Comparing the test results with the true epitopes, we found that the first resulting cluster did not always perform best. In this paper we therefore defined the best resulting cluster as the cluster with the highest TP with respect to the true epitope, so that each method is evaluated at its best performance.
The performance measures were applied as follows. All methods were tested on the benchmark dataset, and among the first three resulting clusters the best one, that is, the one that correctly predicted the most epitope residues, was retained. MCC was calculated for all test sets to evaluate the methods on the benchmark dataset; scatter diagrams of sensitivity against 1-specificity were then plotted to compare the methods with random prediction. Finally, the averages of all performance measures on the two datasets were calculated to give a direct view of the prediction performance of all methods on the different datasets.
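The selection of the best resulting cluster can be sketched as follows. The function name, the representation of clusters as collections of residue labels and the example residues are all hypothetical; resolving ties in favour of the higher-ranked cluster is an assumption, since the text does not say how ties are handled.

```python
def best_cluster(clusters, true_epitope, top_n=3):
    """Pick the cluster with the most true positives among the first top_n.

    clusters: predicted residue collections, ordered by the tool's own score.
    true_epitope: set of true epitope residues.
    Returns (rank, tp) with rank starting at 1, or None if nothing was predicted.
    Assumption: ties go to the higher-ranked cluster.
    """
    candidates = clusters[:top_n]
    if not candidates:
        return None
    tp_counts = [len(set(cluster) & set(true_epitope)) for cluster in candidates]
    best_index = max(range(len(candidates)), key=lambda i: tp_counts[i])
    return best_index + 1, tp_counts[best_index]


# Made-up residue labels for illustration only.
epitope = {"A10", "A11", "A12", "A40", "A41"}
predicted = [
    ["A10", "A90", "A91"],                # rank 1: 1 true positive
    ["A10", "A11", "A40", "A77"],         # rank 2: 3 true positives
    ["A12", "A41", "A55"],                # rank 3: 2 true positives
    ["A10", "A11", "A12", "A40", "A41"],  # rank 4: ignored, beyond the first three
]
print(best_cluster(predicted, epitope))  # (2, 3)
```

Because the choice uses the true epitope, this is an evaluation device that shows each tool at its best, not something a user could apply to an antigen whose epitope is unknown.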

2.2.2. Evaluation through MCC

We evaluated the methods on the benchmark dataset using the MCC, and the results are shown in Table 2. Of the five methods evaluated in this paper, only EpiSearch can return more than three prediction clusters, and the number varies between test sets; the other methods all return at most three clusters. For a fair evaluation, we retained only the three highest-scoring clusters for EpiSearch. All resulting test clusters are provided in Supplementary File 2.
Figure 1 gives the MCC of all methods on the benchmark dataset. Comparison and analysis reveal the following. First, small antigens produce several high points. The antigens of MS00012, MS00029, MS00030, MS00062, MS00242, MS00384, MS01190, MS01191 and MS01192 all have fewer than 30 amino acids. Under the same conditions a short antigen can be expected to give better performance, but a result on so short an antigen is less significant. Second, large antigens (more than 500 amino acids) result in low MCC for most methods, and some values are negative (see the low points in the figure). MS00048, MS00099, MS00276, MS00277, MS00278 and MS00279 all belong to this group, with one exception: MS00049, on which every method performs better. MS00049 and MS00099 share the same target-template complex [PDB id: 1N8Z], so the large difference in MCC may result from the number of mimotope sequences: MS00049 has five mimotopes and MS00099 has two. A larger number of mimotopes may capture more features of the epitope, which would explain the relatively good prediction result for MS00049. Lastly, medium antigens (30 to 500 amino acids) with between 1 and 41 mimotope sequences give widely varying MCC values. There are 31 such test sets in the benchmark dataset. Among them, MS00058 and MS00060 obtained low MCC for all methods, whereas MS00056 and MS00139 obtained high MCC for all methods. Since the template chains in these four test sets are of medium length and the numbers of mimotope sequences are all large, we carried out the following analysis to find the specific reason.
Table 2. MCC and PPV of each method on the benchmark datasets. The data which belong to the representative dataset were marked with *. (1) ‘NA’ means that the results for epitopes were not obtained; (2) ‘—’ means that the data has restriction of the sequence length or the sequence number; (3) The best resulting cluster that was found in the second-ranked cluster is shown in bold; (4) The best resulting cluster that was found in the third-ranked cluster is shown in italic; (5) “ALL” means the residue number of the antigen.
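| MimoID | Antigen | ALL | Mapitope MCC | Mapitope PPV | PepSurf MCC | PepSurf PPV | Pepitope MCC | Pepitope PPV | Pep-3D-Search MCC | Pep-3D-Search PPV | EpiSearch MCC | EpiSearch PPV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Antigen-Antibody | | | | | | | | | | | | |
| MS00012* | 2OSL_P | 25 | NA | NA | 0.182 | 0.250 | NA | NA | 0.145 | 0.200 | 0.190 | 0.267 |
| MS00013* | 3IU3_I | 223 | 0.146 | 0.667 | 0.045 | 0.231 | 0.071 | 0.500 | 0.168 | 0.471 | 0.134 | 0.500 |
| MS00029* | 1TET_P | 15 | 0.564 | 1.000 | 0.617 | 1.000 | 0.564 | 1.000 | 0.510 | 0.900 | 0.856 | 1.000 |
| MS00030 | 1TET_P | 15 | 0.471 | 1.000 | 0.679 | 1.000 | 0.471 | 1.000 | NA | NA | NA | NA |
| MS00048* | 1YY9_A | 624 | -0.066 | 0.000 | -0.003 | 0.000 | -0.033 | 0.000 | 0.049 | 0.400 | -0.005 | 0.000 |
| MS00049* | 1N8Z_C | 607 | 0.100 | 0.692 | 0.059 | 0.455 | 0.041 | 1.000 | 0.114 | 0.500 | 0.096 | 0.48 |
| MS00052 | 2ADF_A | 196 | 0.074 | 0.429 | 0.164 | 0.556 | — | — | 0.157 | 0.359 | NA | NA |
| MS00053* | 2ADF_A | 196 | -0.015 | 0.000 | 0.032 | 0.167 | -0.010 | 0.000 | 0.189 | 0.889 | 0.106 | 0.500 |
| MS00054* | 1IQD_C | 156 | 0.093 | 0.142 | -0.006 | 0.091 | 0.113 | 1.000 | 0.023 | 0.130 | 0.129 | 0.360 |
| MS00055* | 2GHW_A | 203 | 0.100 | 0.400 | 0.029 | 0.208 | — | — | -0.082 | 0.000 | 0.080 | 0.320 |
| MS00056 | 2GHW_A | 203 | 0.110 | 0.444 | 0.110 | 0.385 | — | — | -0.089 | 0.000 | — | — |
| MS00057 | 2NY7_G | 317 | 0.004 | 0.100 | 0.026 | 0.222 | — | — | 0.006 | 0.097 | 0.062 | 0.300 |
| MS00058 | 2NY7_G | 317 | -0.015 | 0.000 | — | — | — | — | 0.000 | 0.083 | — | — |
| MS00059* | 2NY7_G | 317 | 0.152 | 0.560 | 0.052 | 0.205 | 0.088 | 0.556 | 0.001 | 0.086 | 0.085 | 0.333 |
| MS00099 | 1N8Z_C | 607 | 0.076 | 0.600 | -0.066 | 0.000 | NA | NA | -0.005 | 0.000 | 0.005 | 0.059 |
| MS00185* | 1G9M_G | 321 | 0.102 | 0.324 | 0.063 | 0.226 | 0.091 | 0.412 | -0.001 | 0.044 | -0.015 | 0.000 |
| MS00186* | 1E6J_P | 210 | 0.021 | 0.167 | 0.158 | 0.478 | 0.036 | 0.333 | 0.119 | 0.275 | 0.114 | 0.364 |
| MS00242 | 2OSL_P | 25 | NA | NA | 0.145 | 0.200 | NA | NA | 0.145 | 0.200 | NA | NA |
| Protein-Protein | | | | | | | | | | | | |
| MS00041* | 1OC0_B | 51 | 0.226 | 0.364 | 0.166 | 0.375 | 0.166 | 0.375 | 0.101 | 0.310 | 0.254 | 0.440 |
| MS00047* | 1HX1_B | 114 | 0.028 | 0.238 | 0.114 | 0.360 | — | — | -0.022 | 0.167 | 0.190 | 0.480 |
| MS00060* | 1WLP_B | 138 | -0.073 | 0.000 | -0.033 | 0.160 | -0.040 | 0.000 | 0.065 | 0.279 | — | — |
| MS00062* | 1WLP_A | 25 | 0.496 | 0.789 | 0.180 | 0.714 | 0.499 | 1.000 | NA | NA | 0.530 | 1.000 |
| MS00139* | 1K4U_S | 62 | 0.247 | 0.778 | 0.217 | 1.000 | 0.247 | 0.778 | 0.391 | 0.611 | — | — |
| MS00276 | 2GRX_A | 725 | -0.007 | 0.000 | 0.041 | 0.261 | -0.006 | 0.000 | -0.006 | 0.026 | 0.004 | 0.069 |
| MS00277* | 2GRX_A | 725 | -0.007 | 0.000 | 0.039 | 0.429 | -0.006 | 0.000 | 0.006 | 0.071 | 0.029 | 0.222 |
| MS00278 | 2GSK_A | 590 | -0.008 | 0.000 | 0.070 | 0.545 | -0.003 | 0.000 | 0.038 | 0.263 | -0.018 | 0.000 |
| MS00279* | 2GSK_A | 590 | 0.033 | 0.184 | 0.047 | 0.400 | 0.047 | 0.400 | -0.011 | 0.000 | -0.016 | 0.000 |
| MS00357* | 1FLT_X | 95 | 0.228 | 0.727 | 0.140 | 0.750 | 0.140 | 0.750 | 0.005 | 0.227 | 0.259 | 0.688 |
| MS00384* | 3DOW_B | 12 | 0.527 | 1.000 | 0.764 | 1.000 | 0.527 | 1.000 | NA | NA | NA | NA |
| MS00405* | 1SHY_A | 234 | 0.008 | 0.125 | 0.036 | 0.200 | 0.017 | 0.200 | -0.005 | 0.088 | -0.020 | 0.045 |
| MS00464 | 1SQ0_A | 214 | 0.077 | 0.444 | 0.021 | 0.188 | NA | NA | 0.037 | 0.194 | 0.080 | 0.333 |
| MS00465* | 1SQ0_A | 214 | -0.029 | 0.000 | 0.002 | 0.133 | -0.025 | 0.000 | 0.046 | 0.231 | 0.071 | 0.357 |
| MS00671* | 1D4V_B | 163 | -0.046 | 0.000 | 0.117 | 0.381 | -0.021 | 0.000 | -0.039 | 0.000 | -0.017 | 0.083 |
| MS00976* | 3BT1_A | 135 | 0.240 | 0.786 | -0.073 | 0.114 | — | — | 0.240 | 0.593 | 0.025 | 0.240 |
| MS00984* | 1EER_A | 166 | 0.078 | 0.455 | 0.006 | 0.250 | -0.033 | 0.000 | 0.089 | 0.500 | 0.001 | 0.231 |
| MS01004 | 1MQ8_B | 177 | -0.031 | 0.000 | -0.030 | 0.000 | NA | NA | -0.014 | 0.069 | NA | NA |
| MS01036* | 3EZE_B | 85 | 0.071 | 0.400 | 0.331 | 0.917 | 0.230 | 0.857 | 0.393 | 0.850 | 0.313 | 0.909 |
| MS01037 | 3EZE_B | 85 | -0.109 | 0.000 | 0.336 | 0.727 | NA | NA | 0.408 | 0.750 | 0.198 | 0.550 |
| MS01038 | 3EZE_B | 85 | 0.288 | 0.543 | 0.364 | 0.704 | — | — | 0.375 | 0.731 | 0.086 | 0.429 |
| MS01061* | 1MQ8_B | 177 | -0.051 | 0.000 | -0.020 | 0.000 | -0.041 | 0.000 | 0.009 | 0.081 | -0.030 | 0.000 |
| MS01062 | 1MQ8_B | 177 | -0.030 | 0.000 | -0.062 | 0.000 | — | — | 0.024 | 0.135 | -0.041 | 0.000 |
| MS01063 | 1MQ8_B | 177 | -0.027 | 0.000 | -0.039 | 0.000 | — | — | -0.018 | 0.065 | -0.031 | 0.000 |
| MS01105/10/15* | 1II4_A | 155 | 0.059 | 0.385 | 0.163 | 0.545 | 0.095 | 0.571 | 0.233 | 0.523 | 0.274 | 0.750 |
| MS01154* | 1HX1_A | 400 | -0.006 | 0.000 | 0.010 | 0.111 | -0.004 | 0.000 | 0.040 | 0.156 | 0.029 | 0.154 |
| MS01190* | 1G1S_D | 28 | 0.386 | 0.750 | 0.388 | 0.636 | — | — | NA | NA | NA | NA |
| MS01191 | 1G1S_D | 28 | 0.386 | 0.750 | 0.274 | 0.556 | — | — | NA | NA | NA | NA |
| MS01192 | 1G1S_D | 28 | NA | NA | NA | NA | — | — | NA | NA | NA | NA |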
Figure 1. The MCC of each method on the benchmark dataset. The data which belong to the representative dataset were marked with *.
We reexamined the test sets in the benchmark dataset whose antigens have 30 to 500 amino acids and whose number of mimotope sequences is at least 3 and less than 30. The MCC values of the following test sets are still low: MS00465, MS00671, MS01061, MS01062 and MS01154. MS00465 and MS00464 have the same target-template complex; MS00464 has two mimotope sequences while MS00465 has three, yet across the five methods the overall performance on MS00464 is slightly better. The degree of similarity between mimotopes and epitopes may be the cause of this difference. MS00671 and MS01154 have 163 and 400 amino acids in the antigen and 13 and eight mimotopes, respectively, and the low MCC may again result from the degree of similarity between mimotopes and epitopes. MS01061 and MS01062 share the same target-template complex [PDB id: 1MQ8]; the antigen is neither small nor large (177 amino acids), and the numbers of mimotopes in these two test sets are 13 and eight, respectively. Two more test sets, MS01004 and MS01063, have the same target-template complex as MS01061 and MS01062, but they were removed at the beginning of this step because each has only one mimotope. The MCC values of these four test sets are all low, close to zero or negative for all methods. The reasons for these low values seem to be complicated: the small numbers of mimotope sequences, the low affinities between mimotopes and antibodies, and the complicated structures of the antigens.
Using MCC, we evaluated the methods on the benchmark dataset and tried to find the reasons for the differences in MCC across methods. The analysis shows that results on small antigens are less significant, and that large antigens with a small number of mimotope sequences fare worse in all methods because the mimotopes carry relatively few epitope features. For medium antigens with a reasonable number of mimotope sequences, low MCC values may be caused by the degree of similarity between mimotopes and epitopes and by the complicated structures of the antigens. In addition, mimotope sequences cannot reflect all the features of the genuine epitope; for example, a mimotope does not contain the structural features of the epitope. Methods that consider other features of epitopes may work better. To sum up, finding the relationship between mimotope sequences and the genuine epitope remains an open problem for further research.

2.2.3. Evaluation through sensitivity/1-specificity

Table 3 gives the sensitivity and specificity values of all methods on the benchmark dataset, and Figures 2–6 show the relation between sensitivity and 1-specificity for each of the five methods. For test sets whose prediction result is NA or that cannot be predicted because of a restriction, both sensitivity and 1-specificity are set to 0. For all methods, very few test sets in the representative dataset lack a prediction result.
As seen from Figures 2–6, Mapitope has three points at the origin, 28 points above the diagonal and 16 points below. PepSurf has two points at the origin, 36 points above the diagonal and nine points below. Pepitope is the combined method of Mapitope and PepSurf. It was tested on the web server, and many test sets gave null results or exceeded the length restriction on mimotopes; it has 19 points at the origin, 17 points above the diagonal and 11 below. Pep-3D-Search has six points at the origin, 31 points above the diagonal and 10 points below. EpiSearch restricts the number of mimotope sequences, and so it has 12 points at the origin, 25 points above the diagonal and 10 points below.
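The counting of points relative to the diagonal can be sketched as below. In a plot of sensitivity against 1-specificity, the diagonal corresponds to random prediction, so a point above it has sensitivity greater than 1-specificity. The function name is hypothetical, missing results are represented by `None` (an assumption of this sketch) and, following the convention stated earlier in this section, are placed at the origin. The example values are Mapitope entries taken from Table 3.

```python
def classify_point(sensitivity, specificity):
    """Locate a test set relative to the random-prediction diagonal.

    None stands for an 'NA' or restricted entry, which is plotted at the origin.
    """
    if sensitivity is None or specificity is None:
        return "origin"
    false_positive_rate = 1.0 - specificity
    if sensitivity == 0.0 and false_positive_rate == 0.0:
        return "origin"
    if sensitivity > false_positive_rate:
        return "above"
    if sensitivity < false_positive_rate:
        return "below"
    return "on diagonal"


# (sensitivity, specificity) of Mapitope for four test sets in Table 3.
mapitope = {
    "MS00012": (None, None),    # NA
    "MS00013": (0.286, 0.979),
    "MS00048": (0.000, 0.952),
    "MS00054": (0.938, 0.350),
}
for mimo_id, (sen, spe) in mapitope.items():
    print(mimo_id, classify_point(sen, spe))
```

A prediction with zero sensitivity and perfect specificity would also land at the origin under this rule, so the origin count is not strictly limited to missing results.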
Figure 2. Sensitivity vs. 1-specificity scores of Mapitope on the benchmark dataset and the representative dataset.
Figure 3. Sensitivity vs. 1-specificity scores of PepSurf on the benchmark dataset and the representative dataset.
Table 3. Sensitivity and specificity of each method on the benchmark dataset. The data which belong to the representative dataset were marked with *. (1) ‘NA’ means that the results for epitope prediction were not obtained; (2) ‘—’ means that the data has restriction of the sequence length or the sequence number; (3) The best resulting cluster that was found in the second-ranked cluster is shown in bold; (4) The best resulting cluster that was found in the third-ranked cluster is shown in italic; (5) “EPI” means the residue number of the epitope.
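| MimoID | Antigen | EPI | Mapitope Sen | Mapitope Spe | PepSurf Sen | PepSurf Spe | Pepitope Sen | Pepitope Spe | Pep-3D-Search Sen | Pep-3D-Search Spe | EpiSearch Sen | EpiSearch Spe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Antigen-Antibody | | | | | | | | | | | | |
| MS00012* | 2OSL_P | 4 | NA | NA | 1.000 | 0.429 | NA | NA | 1.000 | 0.238 | 1.000 | 0.400 |
| MS00013* | 3IU3_I | 28 | 0.286 | 0.979 | 0.214 | 0.897 | 0.107 | 0.985 | 0.571 | 0.908 | 0.357 | 0.949 |
| MS00029* | 1TET_P | 11 | 0.636 | 1.000 | 0.727 | 1.000 | 0.636 | 1.000 | 0.818 | 0.750 | 1.000 | 1.000 |
| MS00030 | 1TET_P | 11 | 0.455 | 1.000 | 0.818 | 1.000 | 0.455 | 1.000 | NA | NA | NA | NA |
| MS00048* | 1YY9_A | 15 | 0.000 | 0.952 | 0.000 | 0.982 | 0.000 | 0.984 | 0.267 | 0.990 | 0.000 | 0.959 |
| MS00049* | 1N8Z_C | 20 | 0.450 | 0.993 | 0.250 | 0.990 | 0.050 | 1.000 | 0.800 | 0.973 | 0.660 | 0.978 |
| MS00052 | 2ADF_A | 15 | 0.200 | 0.978 | 0.667 | 0.956 | — | — | 0.933 | 0.862 | NA | NA |
| MS00053* | 2ADF_A | 15 | 0.000 | 0.967 | 0.200 | 0.917 | 0.000 | 0.983 | 0.533 | 0.994 | 0.333 | 0.972 |
| MS00054* | 1IQD_C | 16 | 0.938 | 0.350 | 0.125 | 0.857 | 0.125 | 1.000 | 0.375 | 0.714 | 0.562 | 0.886 |
| MS00055* | 2GHW_A | 29 | 0.276 | 0.931 | 0.172 | 0.891 | — | — | 0.000 | 0.810 | 0.276 | 0.902 |
| MS00056 | 2GHW_A | 29 | 0.276 | 0.943 | 0.345 | 0.908 | — | — | 0.000 | 0.787 | — | — |
| MS00057 | 2NY7_G | 26 | 0.038 | 0.969 | 0.077 | 0.976 | — | — | 0.115 | 0.904 | 0.231 | 0.952 |
| MS00058 | 2NY7_G | 26 | 0.000 | 0.973 | — | — | — | — | 0.077 | 0.924 | — | — |
| MS00059* | 2NY7_G | 26 | 0.538 | 0.962 | 0.308 | 0.893 | 0.192 | 0.986 | 0.115 | 0.890 | 0.346 | 0.938 |
| MS00099 | 1N8Z_C | 20 | 0.300 | 0.993 | 0.000 | 0.969 | NA | NA | 0.000 | 0.981 | 0.050 | 0.973 |
| MS00185* | 1G9M_G | 15 | 0.733 | 0.924 | 0.467 | 0.922 | 0.467 | 0.967 | 0.133 | 0.859 | 0.000 | 0.922 |
| MS00186* | 1E6J_P | 11 | 0.091 | 0.975 | 1.000 | 0.940 | 0.091 | 0.990 | 1.000 | 0.854 | 0.727 | 0.930 |
| MS00242 | 2OSL_P | 4 | NA | NA | 1.000 | 0.238 | NA | NA | 1.000 | 0.238 | NA | NA |
| Protein-Protein | | | | | | | | | | | | |
| MS00041* | 1OC0_B | 13 | 0.923 | 0.447 | 0.692 | 0.605 | 0.692 | 0.605 | 0.692 | 0.474 | 0.846 | 0.632 |
| MS00047* | 1HX1_B | 22 | 0.227 | 0.826 | 0.409 | 0.826 | — | — | 0.227 | 0.728 | 0.545 | 0.859 |
| MS00060* | 1WLP_B | 29 | 0.000 | 0.917 | 0.138 | 0.807 | 0.000 | 0.972 | 0.414 | 0.716 | — | — |
| MS00062* | 1WLP_A | 16 | 0.938 | 0.556 | 0.625 | 0.556 | 0.500 | 1.000 | NA | NA | 0.562 | 1.000 |
| MS00139* | 1K4U_S | 24 | 0.292 | 0.947 | 0.125 | 1.000 | 0.292 | 0.947 | 0.917 | 0.632 | — | — |
| MS00276 | 2GRX_A | 36 | 0.000 | 0.980 | 0.167 | 0.975 | 0.000 | 0.987 | 0.028 | 0.946 | 0.056 | 0.961 |
| MS00277* | 2GRX_A | 36 | 0.000 | 0.981 | 0.083 | 0.994 | 0.000 | 0.987 | 0.083 | 0.943 | 0.111 | 0.980 |
| MS00278 | 2GSK_A | 42 | 0.000 | 0.989 | 0.143 | 0.991 | 0.000 | 0.998 | 0.119 | 0.974 | 0.000 | 0.947 |
| MS00279* | 2GSK_A | 42 | 0.167 | 0.943 | 0.095 | 0.989 | 0.095 | 0.989 | 0.000 | 0.980 | 0.000 | 0.958 |
| MS00357* | 1FLT_X | 21 | 0.381 | 0.959 | 0.143 | 0.986 | 0.143 | 0.986 | 0.238 | 0.770 | 0.524 | 0.932 |
| MS00384* | 3DOW_B | 7 | 0.571 | 1.000 | 1.000 | 1.000 | 0.571 | 1.000 | NA | NA | NA | NA |
| MS00405* | 1SHY_A | 27 | 0.087 | 0.934 | 0.174 | 0.924 | 0.043 | 0.981 | 0.130 | 0.853 | 0.043 | 0.900 |
| MS00464 | 1SQ0_A | 27 | 0.148 | 0.973 | 0.111 | 0.930 | NA | NA | 0.259 | 0.845 | 0.259 | 0.925 |
| MS00465* | 1SQ0_A | 27 | 0.000 | 0.957 | 0.074 | 0.930 | 0.000 | 0.968 | 0.222 | 0.893 | 0.185 | 0.952 |
| MS00671* | 1D4V_B | 19 | 0.000 | 0.889 | 0.421 | 0.910 | 0.000 | 0.972 | 0.000 | 0.917 | 0.105 | 0.847 |
| MS00976* | 3BT1_A | 27 | 0.407 | 0.972 | 0.148 | 0.713 | — | — | 0.593 | 0.898 | 0.222 | 0.824 |
| MS00984* | 1EER_A | 38 | 0.132 | 0.953 | 0.053 | 0.953 | 0.000 | 0.984 | 0.132 | 0.961 | 0.079 | 0.922 |
| MS01004 | 1MQ8_B | 17 | 0.000 | 0.919 | 0.000 | 0.925 | NA | NA | 0.118 | 0.831 | NA | NA |
| MS01036* | 3EZE_B | 25 | 0.240 | 0.850 | 0.440 | 0.983 | 0.240 | 0.983 | 0.680 | 0.950 | 0.400 | 0.983 |
| MS01037 | 3EZE_B | 25 | 0.000 | 0.917 | 0.640 | 0.900 | NA | NA | 0.840 | 0.883 | 0.440 | 0.850 |
| MS01038 | 3EZE_B | 25 | 0.760 | 0.733 | 0.760 | 0.867 | — | — | 0.760 | 0.883 | 0.240 | 0.867 |
| MS01061* | 1MQ8_B | 17 | 0.000 | 0.825 | 0.000 | 0.963 | 0.000 | 0.875 | 0.176 | 0.787 | 0.000 | 0.925 |
| MS01062 | 1MQ8_B | 17 | 0.000 | 0.925 | 0.000 | 0.769 | — | — | 0.294 | 0.800 | 0.000 | 0.875 |
| MS01063 | 1MQ8_B | 17 | 0.000 | 0.938 | 0.000 | 0.881 | — | — | 0.118 | 0.819 | 0.000 | 0.919 |
| MS01105/10/15* | 1II4_A | 37 | 0.135 | 0.932 | 0.324 | 0.915 | 0.108 | 0.975 | 0.622 | 0.822 | 0.486 | 0.949 |
| MS01154* | 1HX1_A | 21 | 0.000 | 0.989 | 0.048 | 0.979 | 0.000 | 0.995 | 0.333 | 0.900 | 0.190 | 0.942 |
| MS01190* | 1G1S_D | 7 | 0.857 | 0.905 | 1.000 | 0.810 | — | — | NA | NA | NA | NA |
| MS01191 | 1G1S_D | 7 | 0.857 | 0.905 | 0.714 | 0.810 | — | — | NA | NA | NA | NA |
| MS01192 | 1G1S_D | 7 | NA | NA | NA | NA | — | — | NA | NA | NA | NA |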
Figure 4. Sensitivity vs. 1-specificity scores of Pepitope on the benchmark dataset and the representative dataset.
Figure 5. Sensitivity vs. 1-specificity scores of Pep-3D-Search on the benchmark dataset and the representative dataset.
Figure 6. Sensitivity vs. 1-specificity scores of EpiSearch on the benchmark dataset and the representative dataset.
Using the scatter diagrams of sensitivity against 1-specificity, we compared each method with random prediction. The analysis above shows that, for most test cases, all five epitope prediction methods can localize epitope regions, and their predictions are overall better than random prediction. However, owing to the restrictions of each method and the nature of the mimotope sequences themselves, the prediction precision is still far from satisfactory. Moreover, the quantity and diversity of this benchmark dataset are still insufficient for a systematic evaluation of epitope prediction methods; more reliable experimental data are needed.
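The comparison with random prediction can be made concrete with a small sketch. In a plot of sensitivity against 1-specificity, random prediction falls on the diagonal, so a point beats random when its sensitivity exceeds 1 - specificity. The function names below are hypothetical, and the quantity sensitivity + specificity - 1 (commonly called the Youden index) is not a measure used in this paper; it is shown only to illustrate the comparison. The two example pairs are the first sensitivity/specificity pair of the MS00357* and MS00671* rows of the per-test-set table above.

```python
def youden_index(sensitivity, specificity):
    """Vertical distance of the point (1 - specificity, sensitivity) above the diagonal."""
    return sensitivity - (1.0 - specificity)


def better_than_random(sensitivity, specificity):
    """True when the point lies strictly above the random-prediction diagonal."""
    return youden_index(sensitivity, specificity) > 0.0


# (sensitivity, specificity) pairs copied from two rows of the table above.
examples = {"MS00357*": (0.381, 0.959), "MS00671*": (0.000, 0.889)}
for name, (sens, spec) in examples.items():
    print(name, round(youden_index(sens, spec), 3), better_than_random(sens, spec))
```

A test set with sensitivity 0 can never lie above the diagonal, whatever its specificity, which is why the many zero-sensitivity rows in the table count against the methods even when their specificity is high.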

2.2.4. Overall performance evaluation

We calculated the average sensitivity, specificity, PPV and MCC on the benchmark dataset and on the representative dataset. We also report the average values of these performance measures separately for antigen-antibody and protein-protein interactions for each method. Table 4 gives the overall performance of each method.
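As a reminder of how these four measures relate to residue-level counts, the following minimal sketch computes them from true/false positives and negatives using the standard definitions. The function names and the example counts are hypothetical. How test sets without a result (NA) enter the averages is not stated here, so skipping them in `mean_of_available` is an assumption, not the authors' documented procedure.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Residue-level sensitivity, specificity, PPV and MCC from confusion counts."""
    nan = float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else nan
    specificity = tn / (tn + fp) if (tn + fp) else nan
    ppv = tp / (tp + fp) if (tp + fp) else nan
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else nan
    return {"sensitivity": sensitivity, "specificity": specificity,
            "PPV": ppv, "MCC": mcc}


def mean_of_available(values):
    """Average the values that exist; None or NaN marks a missing result (assumed handling)."""
    kept = [v for v in values if v is not None and not math.isnan(v)]
    return sum(kept) / len(kept) if kept else float("nan")


# Hypothetical counts for one test set: 5 true positives, 5 false positives,
# 85 true negatives and 5 false negatives.
print(confusion_metrics(tp=5, fp=5, tn=85, fn=5))
print(mean_of_available([0.2, None, 0.4]))
```

With these counts half of the epitope is recovered and half of the predicted residues are correct, yet the MCC is only about 0.44, which illustrates how strongly MCC penalizes false positives and false negatives together.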
Table 4. Overall performance of each method.
Statistics | Mapitope | PepSurf | Pepitope | Pep-3D-Search | EpiSearch
Antigen-Antibody
sensitivity | 0.326 | 0.434 | 0.212 | 0.455 | 0.426
specificity | 0.931 | 0.869 | 0.990 | 0.804 | 0.905
PPV | 0.407 | 0.334 | 0.580 | 0.273 | 0.345
MCC | 0.120 | 0.134 | 0.143 | 0.085 | 0.141
Protein-Protein
sensitivity | 0.254 | 0.305 | 0.149 | 0.333 | 0.241
specificity | 0.895 | 0.889 | 0.956 | 0.842 | 0.907
PPV | 0.311 | 0.409 | 0.330 | 0.288 | 0.317
MCC | 0.105 | 0.127 | 0.099 | 0.099 | 0.099
Benchmark dataset (Representative dataset)
sensitivity | 0.280 (0.320) | 0.339 (0.326) | 0.172 (0.174) | 0.368 (0.387) | 0.289 (0.342)
specificity | 0.908 (0.890) | 0.892 (0.901) | 0.968 (0.965) | 0.841 (0.845) | 0.921 (0.922)
PPV | 0.346 (0.377) | 0.384 (0.398) | 0.419 (0.429) | 0.284 (0.322) | 0.329 (0.378)
MCC | 0.112 (0.127) | 0.129 (0.126) | 0.116 (0.112) | 0.092 (0.101) | 0.112 (0.139)
As Table 4 shows, on the benchmark dataset Pep-3D-Search gives the best sensitivity but low PPV and MCC. It predicts many genuine epitope residues, but the total number of residues it predicts is larger still, so more false positive residues are included. Pepitope gives the best specificity and PPV because it is a combination algorithm of Mapitope and PepSurf, so the total number of predicted epitope residues is small. The performance measures of EpiSearch fall in the middle of the five methods. As seen from Figure 8, PepSurf was rated best, with an MCC of 0.129; the other four methods achieved an MCC of about 0.1. The best-performing methods were Pep-3D-Search, with a sensitivity of 0.368 and a PPV of 0.284, and PepSurf, with a sensitivity of 0.339 and a PPV of 0.384. On the representative dataset, the sensitivity and PPV of most methods improve slightly, whereas the specificity and MCC are more or less the same as on the benchmark dataset (see Figure 7).
Figure 7. Overall performance of each method. The average values of sensitivity, specificity, PPV and MCC were calculated for each method using the benchmark dataset and the representative dataset. The performance measures of the representative dataset are marked with *.
These results show that the overall performance was poor for all methods, and that it improved only slightly on the representative dataset. The poor performance has two causes. The first is the methods themselves: each of the five has different restrictions, which prevent it from working on some test cases. The second is the dataset. The per-test-set performance measures for the benchmark dataset (Table 2 and Table 3) show that some test sets gave satisfactory results with all methods, whereas others gave poor results with all of them. In the latter cases the mimotopes have low similarity to the antigen and therefore carry few features of the genuine epitope, so mapping epitopes from them does not give a satisfactory result. Improving the performance of epitope prediction based on random peptide library screening therefore requires improvements in both the mimotopes and the algorithms.

3. Materials and Methods

3.1. Construction of the Datasets

The datasets were derived from MimoDB and PDB:
Benchmark dataset: test sets with a corresponding 3D structure of the template-target complex. This dataset is intended for comparing the performance of existing epitope prediction methods with each other and for developing new methods. The benchmark dataset contains 47 test sets with 28 3D structures of template-target complexes.
Representative dataset: a subset of the benchmark dataset that contains only one test set for each template-target complex in the benchmark dataset. This dataset is intended for further verifying the validity of the epitope prediction methods and assessing their performance. The representative dataset contains 30 test sets with 28 3D structures of template-target complexes. Steps 1–4 below relate to the benchmark dataset; steps 1–5 relate to the representative dataset.
Step 1: 62 test sets related to 53 solved structures of template-target complexes were collected from MimoDB. The corresponding crystal structures of the template-target complexes were downloaded from the PDB. If a test set had several complex structures, we selected one according to resolution and release date. Among these test sets, 18 have solved 3D structures of Ag-Ab complexes and 44 have solved 3D structures of protein-protein complexes. The five methods evaluated in this paper were all developed for epitope prediction, so the test sets with protein-protein interactions were taken as supplementary data to verify the efficiency of the methods. The number of peptide sequences in these test sets ranges from 1 to 41.
Step 2: all the test sets were checked and filtered carefully. Test sets with either different library names or different experimental methods were considered different sets. Test sets published in different papers by different research teams were also considered different, while those published in different papers by the same research team were combined into one test set. For test sets that differ only in the round of biopanning, we chose the last round. MS01153 and MS01154 are such a case: MS01153 was obtained from two rounds of biopanning and MS01154 from three rounds, so MS01153 was excluded. The test sets from MS01101 to MS01115 were slightly more complicated. These sets share the same template, the targets were obtained in three different ways, and the 15 data sets were obtained over five rounds of biopanning. Based on background knowledge we chose the last-round sets as results, so MS01105, MS01110 and MS01115 were retained and the others were excluded. The targets of these three sets are the same antibody from different organisms rather than different antibodies, so we combined them into one test set, MS01105/10/15. We then checked the sequence identity within the test sets that contain more than 20 peptide sequences; there are six such test sets. Using pairwise alignment within each set with an identity threshold of 70%, we found no more than three sequences that met the threshold. We then excluded one sequence from each such pair according to peptide binding affinity, which changed the mimotope sequences in three sets: MS00054, MS00058 and MS01105/10/15. After this step, 47 test sets were retained.
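The redundancy check at the end of Step 2 can be sketched as follows. This is an illustration under several assumptions, not the procedure actually used: the alignment tool and the exact identity definition are not given in the text, so identity is approximated here by the length of the longest common subsequence divided by the longer peptide length; pairs at or above 70% are treated as redundant; and the peptide with the higher (hypothetical) affinity value is the one kept.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def identity(a, b):
    """Approximate pairwise identity: matched positions over the longer length."""
    return lcs_length(a, b) / max(len(a), len(b))


def filter_redundant(peptides, affinity, threshold=0.70):
    """Visit peptides from strongest to weakest binder and drop near-duplicates."""
    kept = []
    for pep in sorted(peptides, key=lambda p: affinity[p], reverse=True):
        if all(identity(pep, other) < threshold for other in kept):
            kept.append(pep)
    return kept


# Made-up peptides and affinity values, for illustration only.
peptides = ["ACDEFGHIKL", "ACDEFGHIKV", "WWWWYYYYPP"]
affinity = {"ACDEFGHIKL": 1.0, "ACDEFGHIKV": 2.0, "WWWWYYYYPP": 0.5}
print(filter_redundant(peptides, affinity))
```

The first two peptides differ only in their last residue (identity 0.9), so only the stronger binder of the pair survives, while the unrelated third peptide is kept.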
Step 3: for each 3D structure of a template-target complex, the template chain was extracted and the target chain that interacts with it was recorded in order to locate the epitope. There are two cases. In the first, the 3D structure has one template chain, so we extracted that chain directly and recorded the target chain. In the second, the 3D structure has several template-target pairs; since the templates and targets are identical and the binding sites are the same, we chose one pair at random, extracted its template chain and recorded its target chain.
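Extracting a single chain from a PDB-format file, as in Step 3, can be sketched with plain text processing. The sketch relies only on the fixed-column layout of the PDB format, in which the chain identifier of ATOM, HETATM and TER records sits in column 22. The function names are hypothetical, the example records use made-up coordinates, and nothing here is taken from the authors' own scripts.

```python
def extract_chain(pdb_lines, chain_id):
    """Keep the coordinate records of one chain from PDB-format text lines."""
    kept = []
    for line in pdb_lines:
        record = line[:6].strip()
        # Column 22 (index 21) holds the chain identifier.
        if record in ("ATOM", "HETATM", "TER") and len(line) > 21 and line[21] == chain_id:
            kept.append(line.rstrip("\n"))
    return kept


def toy_atom_line(serial, chain_id, res_seq):
    """Build a minimal ATOM record (made-up coordinates) with the chain ID in column 22."""
    return "ATOM  %5d  CA  GLY %s%4d    %8.3f%8.3f%8.3f" % (
        serial, chain_id, res_seq, 0.0, 0.0, 0.0)


toy_structure = [
    "HEADER    TOY COMPLEX",
    toy_atom_line(1, "A", 1),
    toy_atom_line(2, "B", 1),
    toy_atom_line(3, "A", 2),
]
for line in extract_chain(toy_structure, "A"):
    print(line)
```

For a real complex one would read the lines from the downloaded PDB file and pass the template chain identifier, for example the chain letter that follows the PDB ID in the per-test-set tables.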
Step 4: at this point the mimotope sequences, the 3D structure of the template-target complex and the structure of the template are available for each test set, and the epitopes were determined. We confirmed the epitopes with three tools: CED [53], IEDB-3D [54] and CMA (Contact Map Analysis) [55]. Definitions of an epitope inferred from the 3D structure of an Ag-Ab complex are mainly based on either ASA (accessible surface area) or the contact area between residues of the antigen and the antibody. In this study, we define an epitope as the residues of an antigen that have a contact area above 4 Å² upon interaction with the antibody. Accordingly, when applying CMA to locate the epitopes, the contact area threshold was set to 4 Å² and the other parameters were left at their defaults. We located the binding sites of protein-protein interactions in the same way. In this way, 47 data sets with their mimotope sequences, 28 3D structures of template-target complexes, the structures of the template chains and the genuine epitope residues were all determined. The resulting dataset is denoted the benchmark dataset and is listed in Table 5. The detailed datasets are given in the Supplementary Material section.
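The epitope definition used in Step 4 amounts to a threshold on per-residue contact area. The sketch below assumes that the contact areas have already been computed by a tool such as CMA and are available as a mapping from residue to area in Å²; the mapping format, the function name and the example values are hypothetical.

```python
CONTACT_AREA_THRESHOLD = 4.0  # Å², the cutoff stated in the text


def epitope_residues(contact_area_by_residue, threshold=CONTACT_AREA_THRESHOLD):
    """Return residues whose contact area is strictly above the threshold."""
    return sorted(res for res, area in contact_area_by_residue.items()
                  if area > threshold)


# Made-up contact areas keyed by (chain, residue number).
areas = {("A", 10): 12.3, ("A", 11): 4.0, ("A", 12): 0.0, ("A", 15): 4.1}
print(epitope_residues(areas))
```

Because the definition says "above 4 Å²", a residue with exactly 4.0 Å² is excluded in this sketch; whether the tool treats the boundary the same way is not stated.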
Table 5. The dataset compiled from MimoDB; test sets that belong to the representative dataset are marked with *. (1) Number of peptides × peptide length. (2) PMID of the reference. (3) The combination of MS01105, MS01110 and MS01115.
Mimo_ID | PDB_ID | Template | Target | Library (1) | Ref (2)
Antigen-Antibody
MS00012* | 2OSL | B-lymphocyte antigen CD20 | Anti-CD20 monoclonal antibody rituximab | 13 × 9 | 16705086
MS00013* | 3IU3 | Interleukin-2 receptor subunit alpha | Anti-CD25 monoclonal antibody basiliximab | 6 × 9 | 17440057
MS00029* | 1TET | Heat-labile enterotoxin B chain | Anti-LTP-B monoclonal antibody TE33 | 10 × 9 | 16273596
MS00030 | 1TET | Heat-labile enterotoxin B chain | Anti-LTP-B monoclonal antibody TE33 | 5 × 11 | 16273596
MS00048* | 1YY9 | Epidermal growth factor receptor | Cetuximab | 4 × 12 | 16288119
MS00049* | 1N8Z | Receptor tyrosine-protein kinase erbB-2 | Trastuzumab | 5 × 12 | 15210798
MS00052 | 2ADF | von Willebrand factor | Anti-vWF monoclonal antibody 82D6A3 | 2 × 15 | 12855771
MS00053* | 2ADF | von Willebrand factor | Anti-vWF monoclonal antibody 82D6A3 | 3 × 8 | 12855771
MS00054* | 1IQD | Coagulation factor VIII | Anti-coagulation factor VIII monoclonal antibody BO2C11 | 27 × 12 | 12676786
MS00055* | 2GHW | Spike glycoprotein | Anti-spike glycoprotein monoclonal antibody 80R | 18 × 15 | 16630634
MS00056 | 2GHW | Spike glycoprotein | Anti-spike glycoprotein monoclonal antibody 80R | 9 × 16, 9 × 15, 19 × 14, 4 × 13 | 16630634
MS00057 | 2NY7 | Surface protein gp120 | Anti-gp120 monoclonal antibody b12 | 1 × 12, 1 × 15 | 16940148
MS00058 | 2NY7 | Surface protein gp120 | Anti-gp120 monoclonal antibody b12 | 1 × 6, 1 × 12, 1 × 13, 1 × 16, 1 × 18, 2 × 14, 2 × 20, 2 × 22, 8 × 15, 13 × 21 | 16940148
MS00059* | 2NY7 | Surface protein gp120 | Anti-gp120 monoclonal antibody b12 | 1 × 10, 1 × 13, 17 × 14 | 16940148
MS00099 | 1N8Z | Receptor tyrosine-protein kinase erbB-2 | Trastuzumab | 2 × 12 | 15536075
MS00185* | 1G9M | Envelope glycoprotein gp120 | Anti-gp120 monoclonal antibody 17b | 10 × 12, 1 × 10 | 14596802
MS00186* | 1E6J | Capsid protein p24 | Anti-p42 monoclonal antibody 13b5 | 14 × 14, 2 × 7 | 14596802
MS00242 | 2OSL | B-lymphocyte antigen CD20 | Anti-CD20 monoclonal antibody rituximab | 7 × 12 | 16814270
Protein-Protein
MS00041* | 1OC0 | Vitronectin | Plasminogen activator inhibitor 1 | 8 × 13, 1 × 7, 1 × 11 | 16813566
MS00047* | 1HX1 | BAG family molecular chaperone regulator 1 | Heat shock cognate 71 kDa protein | 8 × 15 | 7649995
MS00060* | 1WLP | Neutrophil cytosol factor 1 | Cytochrome b-245 | 2 × 8, 31 × 9 | 7592831
MS00062* | 1WLP | Cytochrome b-245 light chain | Neutrophil cytosol factor 1 | 4 × 5, 3 × 9, 1 × 10, 1 × 8 | 7624379
MS00139* | 1K4U | Neutrophil cytosol factor 1 | Neutrophil cytosol factor 2 | 28 × 9, 2 × 10, 4 × 12, 2 × 6, 1 × 8 | 8663333
MS00276 | 2GRX | Ferrichrome-iron receptor | Protein tonB | 12 × 12 | 16414071
MS00277* | 2GRX | Ferrichrome-iron receptor | Protein tonB | 6 × 9 | 16414071
MS00278 | 2GSK | Vitamin B12 transporter btuB | Protein tonB | 2 × 12 | 16414071
MS00279* | 2GSK | Vitamin B12 transporter btuB | Protein tonB | 6 × 9 | 16414071
MS00357* | 1FLT | Vascular endothelial growth factor receptor 1 | Vascular endothelial growth factor A | 4 × 7 | 17401149
MS00384* | 3DOW | Calreticulin | Gamma-aminobutyric acid receptor-associated protein | 5 × 12 | 17916189
MS00405* | 1SHY | Hepatocyte growth factor | Hepatocyte growth factor receptor | 2 × 12, 1 × 13 | 17947467
MS00464 | 1SQ0 | von Willebrand factor | Platelet glycoprotein Ib alpha chain | 2 × 11 | 18363340
MS00465* | 1SQ0 | von Willebrand factor | Platelet glycoprotein Ib alpha chain | 3 × 11 | 18363340
MS00671* | 1D4V | Tumor necrosis factor ligand superfamily member 10 | Tumor necrosis factor receptor superfamily member 10B | 13 × 9 | 20156289
MS00976* | 3BT1 | Urokinase-type plasminogen activator | Urokinase plasminogen activator surface receptor | 19 × 15 | 8041758
MS00984* | 1EER | Erythropoietin | Erythropoietin receptor | 1 × 10 | 8662529
MS01004 | 1MQ8 | Integrin alpha-L beta-2 | Intercellular adhesion molecule 1 | 1 × 14 | 8953648
MS01036* | 3EZE | Phosphocarrier protein HPr | Phosphoenolpyruvate-protein phosphotransferase | 11 × 6 | 9350871
MS01037 | 3EZE | Phosphocarrier protein HPr | Phosphoenolpyruvate-protein phosphotransferase | 9 × 10 | 9350871
MS01038 | 3EZE | Phosphocarrier protein HPr | Phosphoenolpyruvate-protein phosphotransferase | 6 × 15 | 9350871
MS01061* | 1MQ8 | Integrin alpha-L beta-2 | Intercellular adhesion molecule 1 | 12 × 9, 1 × 8 | 11532073
MS01062 | 1MQ8 | Integrin alpha-L beta-2 | Intercellular adhesion molecule 1 | 1 × 9, 7 × 16 | 12963036
MS01063 | 1MQ8 | Integrin alpha-L beta-2 | Intercellular adhesion molecule 1 | 1 × 16 | 12963036
MS01105/10/15* (3) | 1II4 | Heparin-binding growth factor 2 | Fibroblast growth factor receptor 2 | 30 × 7 | 12032665
MS01154* | 1HX1 | Heat shock cognate 71 kDa protein | BAG family molecular chaperone regulator 1 | 8 × 12 | 11121403
MS01190* | 1G1S | P-selectin glycoprotein ligand 1 | P-selectin | 5 × 17 | 12393589
MS01191 | 1G1S | P-selectin glycoprotein ligand 1 | P-selectin | 2 × 15 | 12393589
MS01192 | 1G1S | P-selectin glycoprotein ligand 1 | P-selectin | 1 × 13, 1 × 18 | 12393589
Step 5: in the benchmark dataset, some test sets share the same template-target complex but cannot be combined because different research teams used different experimental methods or different library names. For each complex structure, we retained only the test set that gave better prediction results for most methods and excluded the others. Complexes with the same structure but a different template chain are treated as different (there are two such complexes among the 28: 1HX1 and 1WLP). In this way, 30 test sets with 28 3D structures of template-target complexes were collected. The final dataset is defined as the representative dataset.
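One way to read the selection rule in Step 5 is as a vote across methods. The sketch below groups test sets by complex and template chain and keeps the set that scores best for the largest number of methods. The text does not say which performance measure was compared or how ties were broken, so the single score per method, the tie handling and all identifiers and values in the example are assumptions made for illustration.

```python
from collections import defaultdict


def pick_representatives(test_sets):
    """Pick, per complex, the test set that wins on the most methods.

    Each test set is a dict with keys 'id', 'complex' (PDB ID plus template
    chain) and 'scores' ({method: value, or None when there is no result}).
    """
    groups = defaultdict(list)
    for ts in test_sets:
        groups[ts["complex"]].append(ts)

    chosen = {}
    for complex_id, members in groups.items():
        wins = {ts["id"]: 0 for ts in members}
        methods = sorted({m for ts in members for m in ts["scores"]})
        for method in methods:
            scored = [(ts["scores"][method], ts["id"]) for ts in members
                      if ts["scores"].get(method) is not None]
            if scored:
                wins[max(scored)[1]] += 1
        # On equal win counts, max() keeps the first test set in input order.
        chosen[complex_id] = max(members, key=lambda ts: wins[ts["id"]])["id"]
    return chosen


# Made-up identifiers and scores.
toy_sets = [
    {"id": "SET_1", "complex": "XXXX_A", "scores": {"m1": 0.3, "m2": 0.2, "m3": None}},
    {"id": "SET_2", "complex": "XXXX_A", "scores": {"m1": 0.1, "m2": 0.4, "m3": 0.5}},
    {"id": "SET_3", "complex": "YYYY_B", "scores": {"m1": 0.2, "m2": None, "m3": 0.1}},
]
print(pick_representatives(toy_sets))
```

Because the representative sets are chosen by prediction quality, averages over the representative dataset are expected to be at least as favourable as those over the benchmark dataset, which is worth keeping in mind when comparing the two.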

3.2. Algorithm Evaluation

To the best of our knowledge, we reviewed all epitope mapping algorithms based on random peptide library screening published thus far (Table 1). All methods were identified through PubMed and web searches. Through comparative analysis, we chose five of them for evaluation on our benchmark dataset. All five tools are either open source or provide a freely accessible web service. We used the default parameters provided by each tool for all test sets.
Mapitope is based on an alternative computational approach in which epitope determinants shared by the entire set of peptides are detected. Mapitope defines AAPs (amino acid pairs) by calculating the distance between the α carbon atoms of adjacent residues. Each peptide sequence obtained from the phage library is decomposed into AAPs; Mapitope then ranks the AAPs by occurrence to obtain a set of major SSPs (statistically significant pairs), searches for them on the 3D structure of the antigen and links the SSPs into clusters on the antigen surface as the predicted epitope. The Mapitope algorithm is implemented in C++, and the source code and binaries are freely available online, so we tested this algorithm locally. MS00139 gave no result when tested locally but gave results when tested online, so the result for MS00139 was obtained from the web service while the results for the other sets were obtained locally.
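The first stage described above, decomposing peptides into amino acid pairs and counting them, can be sketched in a few lines. This is a simplified illustration and not the Mapitope implementation: it treats a pair as unordered (an assumption), counts raw occurrences only, and leaves out the statistical significance test and the search on the antigen structure. The peptides in the example are made up.

```python
from collections import Counter


def adjacent_pairs(peptide):
    """Unordered pairs of neighbouring residues in one peptide."""
    return [tuple(sorted(peptide[i:i + 2])) for i in range(len(peptide) - 1)]


def count_pairs(peptides):
    """Count how often each amino acid pair occurs across a set of peptides."""
    counts = Counter()
    for pep in peptides:
        counts.update(adjacent_pairs(pep))
    return counts


# Made-up peptides, for illustration only.
counts = count_pairs(["ACDA", "CAD"])
for pair, n in counts.most_common():
    print("".join(pair), n)
```

A peptide of length n contributes n - 1 pairs, so short libraries with few peptides yield few observations per pair, which is one plausible reason why small test sets are hard for a pair-statistics approach.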
PepSurf is an algorithm for mapping a set of affinity-selected peptides onto the solved surface of the antigen. It efficiently searches virtually all possible 3D paths, using the color-coding technique, for those that exhibit high similarity to the peptide sequences. A modified BLOSUM62 matrix is used to score amino acid similarities in the mapping step. The best alignment of each peptide to antigen residues brought into proximity by folding is then obtained. The most significant alignments are clustered and the epitope location is inferred. The PepSurf algorithm is also implemented in C++, and the source code and binaries are freely available online. To avoid the online restriction on peptide length (shorter than 15 amino acids), we tested PepSurf locally. The result for MS01105/10/15 was obtained online because no result was returned locally. MS00058 was less fortunate: it also returned no result locally, and because 27 of its 32 mimotope sequences are longer than 14 residues, the web service could not be used either.
Pepitope is a web server that aims at predicting B-cell epitopes based on random peptide library screening; it provides three algorithms for epitope mapping: Mapitope, PepSurf and a combination of the two. The predicted clusters of the combination algorithm include only residues predicted to be part of the epitope by both algorithms. In this paper, Pepitope refers to the combination algorithm. The Pepitope algorithm is freely available, but only online through the Pepitope server, and the online service also restricts peptide sequences to fewer than 15 amino acids; 13 sets of the benchmark dataset therefore gave no results.
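Since the combination algorithm keeps only residues predicted by both Mapitope and PepSurf, its output is a set intersection. The following minimal sketch, with a hypothetical function name and made-up residue numbers, shows why the combined prediction can never be larger than either input.

```python
def combined_prediction(mapitope_residues, pepsurf_residues):
    """Residues predicted as epitope by both algorithms."""
    return set(mapitope_residues) & set(pepsurf_residues)


# Made-up residue numbers for one antigen.
mapitope = {12, 13, 14, 40, 41}
pepsurf = {13, 14, 15, 60}
print(sorted(combined_prediction(mapitope, pepsurf)))
```

Fewer predicted residues means fewer false positives but also fewer true positives, which is consistent with Pepitope showing the highest specificity and PPV and the lowest sensitivity in Table 4.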
Pep-3D-Search is an epitope mapping algorithm based on both mimotope and motif analysis. An ACO (Ant Colony Optimization) algorithm is used to align a 1D mimotope sequence (or a motif sequence) to the 3D structure of an antigen, and a screening strategy based on P-value calculation and a clustering strategy based on the DFS (Depth-First Search) algorithm are employed to localize candidate epitope regions. Pep-3D-Search is implemented in VB, and the source code and executable program are freely available. We tested the benchmark dataset established in this paper locally with Pep-3D-Search. Since the algorithm relies on an empirical background distribution of the alignment score for every mimotope and antigen, it cannot handle test sets in which the P-value of the alignment score of every mimotope exceeds 10⁻³. MS00030, MS00062, MS00384, MS01190, MS01191 and MS01192 fall into this group and gave no results. Pep-3D-Search also offers epitope mapping based on motif analysis, whose output can give the 10 best fits. When a test set has fewer than three mimotope sequences, no motif can be derived and no result is returned. MS00057, MS00099, MS00278 and MS00464 each have two mimotope sequences, while MS01004 and MS01063 have one, so there is no motif-based result for these data. Among all algorithms evaluated in this paper, only Pep-3D-Search supports epitope mapping based on motifs, so we do not compare its motif-based results with the other algorithms; the testing results are provided in Supplementary File 2.
EpiSearch is based on a patch analysis that identifies spatially contiguous clusters of residues on the surface of the antigen with physicochemical properties similar to those found in the mimotopes. The amino acid compositions of the 1D and 3D profiles are compared through three matrices and quantified in a score function for each patch on the protein surface. Similarity of residues is measured by a physicochemical property distance derived from five descriptors of amino acid residues. The highest-scoring patches are listed in the output files and are also displayed on the surface of the protein. The benchmark dataset was tested with the online version of EpiSearch. EpiSearch limits the number of peptide sequences and cannot process more than 30 mimotope sequences at a time. MS00056, MS00058, MS00060 and MS00139 each have more than 30 mimotope sequences, so they returned no results.
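The input restrictions mentioned in this section can be checked directly against the library column of Table 5. The sketch below parses a library string of the form "count × length" and reports whether a test set satisfies three of the stated limits: peptides shorter than 15 residues for the online Pepitope service, at most 30 peptides for EpiSearch, and at least three peptides for the motif analysis of Pep-3D-Search. The function names are hypothetical, and other reasons for failure (such as the P-value screening of Pep-3D-Search) are not modelled.

```python
def parse_library(spec):
    """Turn a Table 5 library string such as '9 × 16, 9 × 15' into [(count, length), ...]."""
    pairs = []
    for item in spec.split(","):
        count, length = item.split("×")
        pairs.append((int(count), int(length)))
    return pairs


def applicable_tools(spec):
    """Check a test set against the size limits stated in the text."""
    library = parse_library(spec)
    n_peptides = sum(count for count, _ in library)
    longest = max(length for _, length in library)
    return {
        "Pepitope (online)": longest < 15,
        "EpiSearch (online)": n_peptides <= 30,
        "Pep-3D-Search (motif mode)": n_peptides >= 3,
    }


# Library strings copied from Table 5.
print("MS00056", applicable_tools("9 × 16, 9 × 15, 19 × 14, 4 × 13"))
print("MS00057", applicable_tools("1 × 12, 1 × 15"))
print("MS00012", applicable_tools("13 × 9"))
```

Applied to every row of Table 5, the length check flags the 13 test sets that contain at least one peptide of 15 or more residues, which matches the number of sets reported as having no Pepitope result.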

4. Conclusions

B-cell epitope prediction is important for vaccine design, for the development of diagnostic reagents and for studies that elucidate the interactions between antigen and antibody at the molecular level. Epitope prediction methods based on random peptide library screening are expected to greatly improve the accuracy of epitope prediction by combining experiments with computational methods, and have therefore attracted the attention of many researchers. In this paper, a benchmark dataset for evaluating B-cell epitope prediction methods based on random peptide library screening has been constructed and made available. Using this benchmark dataset and a representative dataset, five publicly available methods were evaluated. Several schemes were implemented for evaluating the methods.
First, we evaluated the methods through the MCC measure using the benchmark dataset, and tried to find what kind of data is more suitable for further development of epitope prediction based on random peptide library screening. We found that the number of antigen amino acids and the number of mimotopes influence the prediction result, and that the relationship between the mimotope and the genuine epitope may be another factor that influences prediction performance. Second, the sensitivity of each method was plotted against 1-specificity to evaluate the methods and compare them with random prediction. This comparison shows that all methods perform better than random prediction, but the results are still unsatisfactory on both datasets because of the diverse restrictions of each method and poor precision. Finally, the average sensitivity, specificity, PPV and MCC of all methods were computed on the two datasets to give a view of the overall performance of the evaluated methods. The overall performance of all methods was poor: on the benchmark dataset they did not exceed 42% precision and 37% sensitivity, and the average MCC was about 0.11 for Mapitope, Pepitope and EpiSearch, about 0.09 for Pep-3D-Search and about 0.13 for PepSurf. On the representative dataset, the average sensitivity and precision improved a little, while the average specificity and MCC were nearly the same.
The epitope prediction problem using phage display library screening is thus far from resolved. A mimotope set is usually a set of peptide sequences, and these sequences cannot reflect the real structural features of genuine epitopes. A mimotope is functionally similar to the genuine epitope, but this similarity may be reflected in physicochemical properties rather than in sequence similarity. In addition, the available data are still limited for both method evaluation and new algorithm development. Finding the relationship between the mimotopes and the genuine epitopes of antigens is still an open problem for further research.
Given the current results and the shortcomings of the existing methods, how can epitope prediction be further improved? On the one hand, it is necessary to find the correlation between mimotope and epitope and to obtain enough mimotopes by designing appropriate experiments. On the other hand, new algorithms for mapping protein epitopes based on mimotopes are desirable. The interaction site of an antigen and an antibody is usually a spatial region rather than a stretch of sequence, so epitope prediction based on mimotope analysis differs from classical sequence alignment. Prediction performance might be improved by combining more features that discriminate epitopes from non-epitopes besides sequence similarity, for example the evolutionary conservation score, side-chain energy score and planarity score. The application of these features in the context of conformational epitope prediction has been raised by others [56]. In addition, finding the correlation between mimotopes and genuine epitopes with machine learning methods requires a larger dataset for training. Therefore, more experimental data, including available mimotopes and the corresponding structures of template-target complexes, are strongly needed.

Authors’ Contribution

PPS conceived and designed the research. PPS, WHC, HYW and YXH performed the research including data collection, test and analysis. ZQM suggested extension and modifications to the research. YHL supervised the whole research and revised the manuscript critically. All authors have read and approved the final manuscript.

Additional Materials

Supplementary File 1

The representative test sets with the mimotope sequences, the template-target complexes and the represented epitopes. The data provide curated information on 48 test sets with the corresponding 28 3D structures of template-target complexes available in MimoDB and the PDB as of November 2010 and used in this work. Please see the Supplementary Material.

Supplementary File 2

The detailed prediction results for the benchmark dataset. These data include the results of all methods. For Pep-3D-Search, the data also include the prediction results based on motif analysis. For EpiSearch, the three highest-scoring solutions were retained as the results; for the other methods, all results were retained. Please see the Supplementary Material.

Acknowledgements

This work was supported by the Natural Science Foundation of Jilin Province (20101506 and 20101503) and the Scientific and Technical Project of the Administration of Traditional Chinese Medicine of Jilin Province (No. 2010pt067). Many thanks to O. V. Tsodikov, who kindly helped with the problems we encountered in the test stage.

References and Notes

  1. Peters, B.; Sidney, J.; Bourne, P.; Bui, H.H.; Buus, S.; Doh, G.; Fleri, W.; Kronenberg, M.; Kubo, R.; Lund, O.; Nemazee, D.; Ponomarenko, J.V.; Sathiamurthy, M.; Schoenberger, S.P.; Stewart, S.; Surko, P.; Way, S.; Wilson, S.; Sette, A. The design and implementation of the immune epitope database and analysis resource. Immunogenetics 2005, 57, 326–336. [Google Scholar] [CrossRef]
  2. Van Regenmortel, M.H. Antigenicity and immunogenicity of synthetic peptides. Biologicals 2001, 29, 209–213. [Google Scholar] [CrossRef]
  3. Barlow, D.J.; Edwards, M.S.; Thornton, J.M. Continuous and discontinuous protein antigenic determinants. Nature 1986, 322, 747–748. [Google Scholar] [CrossRef]
  4. Irving, M.B.; Pan, O.; Scott, J.K. Random-peptide libraries and antigen-fragment libraries for epitope mapping and the developmentof vaccines and diagnostics. Curr. Opin. Chem. Biol. 2001, 5, 314–324. [Google Scholar] [CrossRef]
  5. Westwood, O.M.R.; Hay, F.C. Epitope Mapping: A Practical Approach; Oxford University Press: Oxford, UK, 2001. [Google Scholar]
  6. Gomara, M.J.; Haro, I. Synthetic peptides for the immunodiagnosisof human diseases. Curr. Med. Chem. 2007, 14, 531–546. [Google Scholar] [CrossRef]
  7. Camacho, C.J.; Vajda, S. Protein docking along smooth association pathways. Proc. Natl. Acad. Sci. USA 2001, 98, 10636–10641. [Google Scholar] [CrossRef]
  8. Rus, J.J.; Burnett, R.M. Type-specific epitope locations revealed by X-ray crystallographic study of adenovirus type 5 hexon. Mol. Ther. 2000, 1, 3–4. [Google Scholar] [CrossRef]
  9. Mayer, M.; Meyer, B. Group epitope mapping by saturation transfer difference NMR to identify segments of a ligand in direct contact with a protein receptor. J. Am. Chem. Soc. 2001, 123, 6108–6117. [Google Scholar] [CrossRef]
  10. Ponomarenko, J.; Bui, H.-H.; Li, W.; Fusseder, N.; Bourne, P.E.; Sette, A.; Peters, B. ElliPro: a new structure-based tool for the prediction of antibody epitopes. BMC Bioinformatics 2008, 9, 514. [Google Scholar] [CrossRef]
  11. Novotny, J.; Handschumacher, M.; Haber, E.; Bruccoleri, R.E.; Carlson, W.B.; Fanning, D.W.; Smith, J.A.; Rose, G.D. Antigenic determinants in proteins coincide with surface regions accessible to large probes (antibody domains). Proc. Natl. Acad. Sci. USA 1989, 83, 226–230. [Google Scholar]
  12. Hopp, T.P.; Woods, K.P. Prediction of Protein Antigenetic Determinants from Amino Acid Sequences. Proc. Natl. Acad. Sci. USA 1981, 78, 3824–3828. [Google Scholar] [CrossRef]
  13. Emini, E.A.; Hughes, J.V.; Perlow, D.S.; Boger, J. Induction of hepatitis avirus-neutralizing antibody by a virus-specific synthetic peptide. J. Virol. 1985, 55, 836–839. [Google Scholar]
  14. Garnier, J.; Osguthorpe, D.J.; Robson, B. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 1978, 120, 97–120. [Google Scholar] [CrossRef]
  15. Pellequer, J.L.; Westhof, E.; Van Regenmortel, M.H. Correlation between the location of antigenic sites and the prediction of turns in proteins. Immunol. Lett. 1993, 36, 83–99. [Google Scholar] [CrossRef]
  16. Levitt, M. Conformational preferences of amino acids in globular proteins. Biochemistry 1978, 17, 4277–4285. [Google Scholar] [CrossRef]
  17. Chou, P.Y.; Fasman, G.D. Conformational parameters for amino acids in helical, beta-sheet, and random coil regions calculated from proteins. Biochemistry 1974, 13, 211–222. [Google Scholar] [CrossRef]
  18. Blythe, M.J.; Flower, D.R. Benchmarking B cell epitope prediction: underperformance of existing methods. Protein Sci. 2005, 14, 246–248. [Google Scholar]
  19. Larsen, J.E.; Lund, O.; Nielsen, M. Improved method for predicting linear B-cell epitopes. Immunome Res. 2006, 2, 2. [Google Scholar] [CrossRef]
  20. Saha, S.; Raghava, G.P. Prediction of continuous B-cell epitopes in an antigen using recurrent neural network. Proteins 2006, 65, 40–48. [Google Scholar] [CrossRef]
  21. Chen, R.; Li, L.; Weng, Z. ZDOCK: an initial-stage protein-docking algorithm. Proteins 2003, 52, 80–87. [Google Scholar] [CrossRef]
  22. Greenbaum, J.A.; Andersen, P.H.; Blythe, M.; Bui, H.H.; Cachau, R.E.; Crowe, J.; Davies, M.; Kolaskar, A.S.; Lund, O.; Morrison, S.; Mumey, B.; Ofran, Y.; Pellequer, J.L.; Pinilla, C.; Ponomarenko, J.V.; Raghava, G.P.; van, Regenmortel, M.H.; Roggen, E.L.; Sette, A.; Schlessinger, A.; Sollner, J.; Zand, M.; Peters, B. Towards a consensus on datasets and evaluation metrics for developing B-cell epitope prediction tools. J. Mol. Recognit. 2007, 20, 75–82. [Google Scholar] [CrossRef]
  23. Westhof, E.; Altschuh, D.; Moras, D.; Bloomer, A.C.; Mondragon, A.; Klug, A.; Van Regenmortel, M.H. Correlation between segmentalmobility and the location of antigenic determinants in proteins. Nature 1984, 311, 123–126. [Google Scholar] [CrossRef]
  24. Tainer, J.A.; Getzoff, E.D.; Alexander, H.; Houghten, R.A.; Olson, A.J.; Lerner, R.A.; Hendrickson, W.A. The reactivity of anti-peptide antibodiesis a function of the atomic mobility of sites in a protein. Nature 1984, 312, 127–134. [Google Scholar] [CrossRef]
  25. Thornton, J.M.; Edwards, M.S.; Taylor, W.R.; Barlow, D.J. Location of 'continuous' antigenic determinants in the protruding regions of proteins. EMBO J. 1986, 5, 409–413. [Google Scholar]
  26. Amit, A.G.; Mariuzza, R.A.; Phillips, S.E.; Poljak, R.J. Three-dimensional structure of an antigen-antibody complex at 2.8 Å resolution. Science 1986, 233, 747–753. [Google Scholar]
  27. Halperin, I.; Wolfson, H. SiteLight: Binding-site prediction using phage display libraries. Protein Sci. 2003, 12, 1344–1359. [Google Scholar] [CrossRef]
  28. Mumey, B.M. A new method for mapping discontinuous antibody epitopes to reveal structural features of proteins. J. Comput. Biol. 2003, 10, 555–567. [Google Scholar] [CrossRef]
  29. Kulkarni-Kale, U.; Bhosle, S.; Kolaskar, A.S. CEP: a conformational epitope prediction server. Nucleic Acids Res. 2005, 33, W168–W171. [Google Scholar] [CrossRef]
  30. Andersen, P.H.; Nielsen, M.; Lund, O. Prediction of residues in discontinuous B-cell epitopes using protein 3D structures. Protein Sci. 2006, 15, 2558–2567. [Google Scholar] [CrossRef]
  31. Smith, G.P. Filamentous fusion phage: novel expression vectors that display cloned antigens on the virion surface. Science 1985, 228, 1315–1317. [Google Scholar]
  32. Tong, A.H.Y.; Drees, B.; Nardelli, G.; Bader, G.D.; Brannetti, B.; Castagnoli, L.; Evangelista, M.; Ferracuti, S.; Nelson, B.; Paoluzi, S.; Quondam, M.; Zucconi, A.; Hogue, C.W.; Fields, S.; Boone, C.; Cesareni, G. A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 2002, 295, 321–324. [Google Scholar]
  33. Thom, G.; Cockroft, A.C.; Buchanan, A.G.; Candotti, C.J.; Cohen, E.S.; Lowne, D.; Monk, P.; Shorrock-Hart, C.P.; Jermutus, L.; Minter, R.R. Probing a protein-protein interaction by in vitro evolution. Proc. Natl. Acad. Sci. USA 2006, 103, 7619–7624. [Google Scholar]
  34. Wang, L.F.; Yu, M. Epitope identification and discovery using phage display libraries: Applications in vaccine development and diagnostics. Curr. Drug Targets 2004, 5, 1–15. [Google Scholar] [CrossRef]
  35. Thie, H.; Meyer, T.; Schirrmann, T.; Hust, M.; Dubel, S. Phage display derived therapeutic antibodies. Curr. Pharm. Biotechnol. 2008, 9, 439–446. [Google Scholar] [CrossRef]
  36. Geysen, H.M.; Rodda, S.J.; Mason, T.J. A priori delineation of a peptide which mimics a discontinuous antigenic determinant. Mol. Immunol. 1986, 23, 709–715. [Google Scholar] [CrossRef]
  37. Moreau, V.; Granier, C.; Villard, S.; Laune, D.; Molina, F. Discontinuous epitope prediction based on mimotope analysis. Bioinformatics 2006, 22, 1088–1095. [Google Scholar] [CrossRef]
  38. Huang, J.; Gutteridge, A.; Honda, W.; Kanehisa, M. MIMOX: a web tool for phage display based epitope mapping. BMC Bioinformatics 2006, 7, 451. [Google Scholar]
  39. Schreiber, A.; Humbert, M.; Benz, A.; Dietrich, U. 3D-Epitope-Explorer (3DEX): Localization of Conformational Epitopes within Three-Dimensional Structures of Proteins. J. Comput. Chem. 2005, 26, 879–887. [Google Scholar] [CrossRef]
  40. Bublil, E.M.; Freund, N.T.; Mayrose, I.; Penn, O.; Roitburd-Berman, A.; Rubinstein, N.D.; Pupko, T.; Gershoni, J.M. Stepwise Prediction of Conformational Discontinuous B-Cell Epitopes Using the Mapitope Algorithm. Proteins 2007, 68, 294–304. [Google Scholar]
  41. Mayrose, I.; Shlomi, T.; Rubinstein, N.D.; Gershoni, J.M.; Ruppin, E.; Sharan, R.; Pupko, T. Epitope mapping using combinatorial phage-display libraries: a graph-based algorithm. Nucleic Acids Res. 2007, 35, 69–78. [Google Scholar] [CrossRef]
  42. Huang, Y.X.; Bao, Y.L.; Guo, S.Y.; Wang, Y.; Zhou, C.G.; Li, Y.X. Pep-3D-Search: a method for B-cell epitope prediction based on mimotope analysis. BMC Bioinformatics 2008, 9, 538. [Google Scholar] [CrossRef]
  43. Mayrose, I.; Penn, O.; Erez, E.; Rubinstein, N.D.; Shlomi, T.; Freund, N.T.; Bublil, E.M.; Ruppin, E.; Sharan, R.; Gershoni, J.M.; Martz, E.; Pupko, T. Pepitope: epitope mapping from affinity-selected peptides. Bioinformatics 2007, 23, 3244–3246. [Google Scholar] [CrossRef]
  44. Negi, S.S.; Braun, W. Automated detection of conformational epitopes using phage display peptide sequences. Bioinform. Biol. Insights 2009, 3, 71–81. [Google Scholar]
  45. Enshell-Seijffers, D.; Denisov, D.; Groisman, B.; Smelyanski, L.; Meyuhas, R.; Gross, G.; Denisova, G.; Gershoni, J.M. The mapping and reconstitution of a conformational discontinuous B-cell epitope of HIV-1. J. Mol. Biol. 2003, 334, 87–101. [Google Scholar] [CrossRef]
  46. Castrignanò, T.; D'Onorio De Meo, P.; Carrabino, D.; Orsini, M.; Floris, M.; Tramontano, A. The MEPS server for identifying protein conformational epitopes. BMC Bioinformatics 2007, 8, S6. [Google Scholar]
  47. Ru, B.; Huang, J.; Dai, P.; Li, S.; Xia, Z.; Ding, H.; Lin, H.; Guo, F.B.; Wang, X. MimoDB: A New Repository for Mimotope Data Derived from Phage Display Technology. Molecules 2010, 15, 8279–8288. [Google Scholar] [CrossRef]
  48. Berman, H.M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T.N.; Weissig, H.; Shindyalov, I.N.; Bourne, P.E. The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235–242. [Google Scholar] [CrossRef]
  49. Hubbard, S.J. NACCESS Computer Program. Available online: http://www.bioinf.manchester.ac.uk/naccess/ (accessed on 2 February 2011).
  50. Tsodikov, O.V.; Record, M.T.J.; Sergeev, Y.V. A novel computer program for fast exact calculation of accessible and molecular surface areas and average surface curvature. J. Comput. Chem. 2002, 23, 600–609. [Google Scholar] [CrossRef]
  51. Ponomarenko, J.V.; Bourne, P.E. Antibody-protein interactions: benchmark datasets and prediction tools evaluation. BMC Struct. Biol. 2007, 7, 64. [Google Scholar] [CrossRef]
  52. Huang, J.; Honda, W. CED: a conformational epitope database. BMC Immunol. 2006, 7, 7. [Google Scholar] [CrossRef]
  53. Ponomarenko, J.; Papangelopoulos, M.; Zajonc, D.M.; Peters, B.; Sette, A.; Bourne, P.E. IEDB-3D: structural data within the immune epitope database. Nucleic Acids Res. 2010, 39, D1164–D1170. [Google Scholar]
  54. Sobolev, V.; Eyal, E.; Gerzon, S.; Potapov, V.; Babor, M.; Prilusky, J.; Edelman, M. SPACE: a suite of tools for protein structure prediction and analysis based on complementarity and environment. Nucleic Acids Res. 2005, 33, W39–W43. [Google Scholar] [CrossRef]
  55. Shide, L.; Dandan, Z.; Chi, Z.; Martin, Z. Prediction of antigenic epitopes on protein surfaces by consensus scoring. BMC Bioinformatics 2009, 10, 302. [Google Scholar] [CrossRef]
  • Sample Availability: Not available.

Share and Cite

MDPI and ACS Style

Sun, P.; Chen, W.; Huang, Y.; Wang, H.; Ma, Z.; Lv, Y. Epitope Prediction Based on Random Peptide Library Screening: Benchmark Dataset and Prediction Tools Evaluation. Molecules 2011, 16, 4971-4993. https://doi.org/10.3390/molecules16064971
