<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">ijms</journal-id>
<journal-title>International Journal of Molecular Sciences</journal-title>
<abbrev-journal-title>Int. J. Mol. Sci.</abbrev-journal-title>
<issn pub-type="epub">1422-0067</issn>
<publisher>
<publisher-name>Molecular Diversity Preservation International (MDPI)</publisher-name></publisher></journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3390/ijms10052190</article-id>
<article-id pub-id-type="publisher-id">ijms-10-02190</article-id>
<article-categories>
<subj-group>
<subject>Article</subject></subj-group></article-categories>
<title-group>
<article-title>Identifying Protein-Protein Interaction Sites Using Covering Algorithm</article-title></title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Du</surname><given-names>Xiuquan</given-names></name><xref ref-type="corresp" rid="c1-ijms-10-02190">*</xref></contrib>
<contrib contrib-type="author">
<name><surname>Cheng</surname><given-names>Jiaxing</given-names></name></contrib>
<contrib contrib-type="author">
<name><surname>Song</surname><given-names>Jie</given-names></name></contrib>
<aff id="af1-ijms-10-02190">The Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Anhui, China; E-Mails:
<email>cjx@ahu.edu.cn</email> (J.-X.C.);
<email>jsong@ahu.edu.cn</email> (J.S.)</aff></contrib-group>
<author-notes>
<corresp id="c1-ijms-10-02190">
<label>*</label>Author to whom correspondence should be addressed; E-Mail:
<email>dxqllp@sohu.com</email>; Tel. +86-13721058041</corresp></author-notes>
<pub-date pub-type="collection">
<month>5</month>
<year>2009</year></pub-date>
<pub-date pub-type="epub">
<day>15</day>
<month>5</month>
<year>2009</year></pub-date>
<volume>10</volume>
<issue>5</issue>
<fpage>2190</fpage>
<lpage>2202</lpage>
<history>
<date date-type="received">
<day>17</day>
<month>3</month>
<year>2009</year></date>
<date date-type="rev-recd">
<day>30</day>
<month>4</month>
<year>2009</year></date>
<date date-type="accepted">
<day>13</day>
<month>5</month>
<year>2009</year></date></history>
<permissions>
<copyright-statement>© 2009 by the authors; licensee Molecular Diversity Preservation International, Basel, Switzerland.</copyright-statement>
<copyright-year>2009</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/3.0">
<p>This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).</p></license></permissions>
<abstract>
<p>Identification of protein-protein interface residues is crucial for structural biology. This paper proposes a covering algorithm for predicting protein-protein interface residues from features that include the protein sequence profile and residue accessible area. The method exploits the characteristics of the covering algorithm, namely its simplicity, low complexity and high accuracy on high-dimensional data. On our dataset the covering algorithm achieves performance comparable to a support vector machine and maximum entropy, with an overall accuracy of 69.62% on the Complete dataset and 60.86% on the Trim dataset. In identifying interface residues by 5-fold cross-validation on 61 protein chains, it obtains a correlation coefficient (CC) of 0.2893, 58.83% specificity and 56.12% sensitivity on the Complete dataset, and a CC of 0.2144, 53.34% specificity and 65.59% sensitivity on the Trim dataset. These results indicate that the covering algorithm is a powerful and robust method for predicting protein-protein interaction sites that can guide biologists in designing specific experiments on proteins. Examination of the predictions in the context of the 3-dimensional structures of proteins demonstrates the effectiveness of this method.</p></abstract>
<kwd-group>
<kwd>protein-protein interaction</kwd>
<kwd>covering algorithm</kwd>
<kwd>sequence profile</kwd>
<kwd>residue accessible area</kwd>
<kwd>maximum entropy</kwd></kwd-group></article-meta></front>
<body>
<sec sec-type="intro">
<label>1.</label>
<title>Introduction</title>
<p>Protein-protein interactions and protein-DNA interactions are among the most ubiquitous types of macromolecular interactions in biological systems. Revealing the mechanisms of protein-protein interactions is crucial for understanding the functions of biological systems. Furthermore, the ability to predict interfacial sites is also important in mutant and drug design [<xref ref-type="bibr" rid="b1-ijms-10-02190">1</xref>]. Structural knowledge at the residue and atom level is one of the keys to understanding the mechanisms of protein interactions. X-ray crystallography and NMR are without doubt the best methods to obtain such information. In recent years, high-throughput technologies have provided experimental tools to identify protein-protein interactions systematically and have generated a tremendous amount of protein interaction data. However, high-throughput experiments are often associated with high numbers of false positives and false negatives [<xref ref-type="bibr" rid="b2-ijms-10-02190">2</xref>]. The experiments are also tedious and labor-intensive, and they cannot meet the requirements of proteomics, since there can be many thousands of protein-protein interactions even for a relatively primitive organism. The need therefore arises for complementary <italic>in silico</italic> methods that are capable of accurately predicting interactions.</p>
<p>The availability of more and more protein structures in the Protein Data Bank (PDB) [<xref ref-type="bibr" rid="b3-ijms-10-02190">3</xref>] makes prediction of protein-protein interaction sites possible. A series of computational efforts to identify interaction sites or interfaces in proteins have been undertaken. For example, hydrophobic residues have been found to cluster at some interfaces [<xref ref-type="bibr" rid="b4-ijms-10-02190">4</xref>,<xref ref-type="bibr" rid="b5-ijms-10-02190">5</xref>], and Jones and Thornton have distinguished two kinds of complexes, ‘permanent’ and ‘transient’ [<xref ref-type="bibr" rid="b6-ijms-10-02190">6</xref>]. Current biophysical theories about protein interacting regions highlight the role of the shape, chemical complementarity and flexibility of the molecules involved [<xref ref-type="bibr" rid="b7-ijms-10-02190">7</xref>]. In parallel, a growing number of machine learning methods for inferring protein interactions have been proposed; neural networks (ANN) [<xref ref-type="bibr" rid="b8-ijms-10-02190">8</xref>–<xref ref-type="bibr" rid="b10-ijms-10-02190">10</xref>] and support vector machines (SVMs) [<xref ref-type="bibr" rid="b11-ijms-10-02190">11</xref>–<xref ref-type="bibr" rid="b15-ijms-10-02190">15</xref>] have been successfully applied in this field. These studies consider sequential, structural or evolutionary features such as amino acid residue composition [<xref ref-type="bibr" rid="b8-ijms-10-02190">8</xref>,<xref ref-type="bibr" rid="b10-ijms-10-02190">10</xref>,<xref ref-type="bibr" rid="b13-ijms-10-02190">13</xref>,<xref ref-type="bibr" rid="b14-ijms-10-02190">14</xref>,<xref ref-type="bibr" rid="b16-ijms-10-02190">16</xref>], spatial neighboring residues [<xref ref-type="bibr" rid="b15-ijms-10-02190">15</xref>,<xref ref-type="bibr" rid="b16-ijms-10-02190">16</xref>], accessible surface area, structural conservation score and residue evolutionary information. In contrast, Res I. <italic>et al.</italic> [<xref ref-type="bibr" rid="b14-ijms-10-02190">14</xref>] use protein sequential and evolutionary information to predict protein interaction sites without structural information.</p>
<p>Traditional methods take protein-protein interaction site prediction as a classification task and separately study each residue. Li Ming-Hui <italic>et al.</italic> [<xref ref-type="bibr" rid="b17-ijms-10-02190">17</xref>] take it as a sequence labeling task using conditional random fields (CRFs) in their research.</p>
<p>In this study, we focus on a novel method for detecting interacting surfaces in proteins starting from their three-dimensional structure. This is important in determining protein function, particularly for proteins of known structure but unknown function. Ofran <italic>et al.</italic> [<xref ref-type="bibr" rid="b18-ijms-10-02190">18</xref>] investigated the sequence neighborhood of protein-protein interface residues in a set of 333 proteins and found that 98% of these interface residues have at least one additional interface residue within their local sequence vicinity. We therefore expect that the tendency of protein interface residues to form spatial clusters can be an important factor in solving our problem. A new method, a covering algorithm system, is constructed to learn association rules at the protein surface. We also compare the prediction power of support vector machines (SVMs), the covering algorithm (CA) and maximum entropy (ME) [<xref ref-type="bibr" rid="b19-ijms-10-02190">19</xref>].</p></sec>
<sec sec-type="results|discussion">
<label>2.</label>
<title>Results and Discussion</title>
<sec>
<label>2.1.</label>
<title>Cross-validation</title>
<p>The covering algorithm method is trained to predict whether or not a surface residue is located in the interface, based on the identity of the target residue and its sequence neighbors. A five-fold cross-validation strategy is adopted for our experiments. Specifically, each dataset is divided into five parts. Four parts form the training set and the remaining part is the testing set, which gives five pairs of training and testing sets, and the experiment is carried out on each pair. For each dataset (see Collection of data sets), this procedure is repeated ten times. In total, 2 × 5 × 10 = 100 experiments are run, and the average performance is used to evaluate each method.</p></sec>
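<p>The repeated 5-fold protocol described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the random shuffling, the dataset sizes and the choice to split by sample index are all assumptions made for the example.</p>
<preformat><![CDATA[
```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for one shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal parts.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def count_experiments(dataset_sizes, n_folds=5, n_repeats=10):
    """Count the train/test runs: datasets x repeats x folds."""
    runs = 0
    for n_samples in dataset_sizes.values():
        for repeat in range(n_repeats):
            # A different seed per repeat gives a different partition.
            for train, test in five_fold_splits(n_samples, n_folds, seed=repeat):
                assert not set(train) & set(test)  # no sample is in both sets
                runs += 1
    return runs


# Hypothetical dataset sizes, used only to count the runs.
total_runs = count_experiments({"Complete": 50, "Trim": 30})
```
]]></preformat>
<p>With two datasets, ten repetitions and five folds, the sketch performs 2 × 5 × 10 = 100 training and testing runs, matching the count given in the text.</p>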
<sec>
<label>2.2.</label>
<title>Evaluation measures of the covering algorithm (CA)</title>
<p>The covering algorithm (CA) classifier is evaluated using 5-fold cross-validation on two kinds of datasets. <xref ref-type="table" rid="t1-ijms-10-02190">Table 1</xref> shows the classification performance as measured by correlation coefficient, accuracy, specificity, sensitivity and F1-measure. Of the residues predicted to be interface residues, 58.83% (Complete) and 53.34% (Trim) are actually interface residues, and 56.12% (Complete) and 65.59% (Trim) of interface residues are identified as such. We also investigate the fraction of interface residues in each protein that are correctly identified by the covering algorithm (CA) classifier. In eight out of 12 (~ 75%) proteins the classifier can recognize the interaction surface by identifying at least half of the interface residues, and in 92% of the proteins at least 40% of the interface residues are correctly identified.</p>
<p>In order to examine whether the covering algorithm (CA) method learns sequence characteristics that are predictive of target residue functions, we run a control experiment in which the class labels are randomly shuffled. The correlation coefficient (CC) obtained on the class-shuffled dataset is 0.0604 on the Complete data (versus 0.2893 for our method) and −0.0065 on the Trim data (versus 0.2124 for our method), which shows that the covering algorithm performs better than a random predictor (CC ≈ 0). <xref ref-type="table" rid="t1-ijms-10-02190">Table 1</xref> compares the covering algorithm with the random classifier. According to this table, the covering algorithm outperforms the random classifier by 5% to 10% in sensitivity, 7% to 11% in specificity, 10% to 14% in accuracy, 13% to 14% in F1-measure and 0.21 to 0.23 in CC.</p></sec>
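<p>For reference, the measures named above can be computed from the counts of true and false positives and negatives, as in the sketch below. It follows the wording of this section, in which specificity is the fraction of predicted interface residues that are actual interface residues and sensitivity is the fraction of actual interface residues that are identified. Taking the correlation coefficient to be the Matthews correlation coefficient is an assumption, and the counts in the example are hypothetical.</p>
<preformat><![CDATA[
```python
import math


def evaluate(tp, fp, tn, fn):
    """Return the evaluation measures for one confusion matrix."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Sensitivity: fraction of actual interface residues that are identified.
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    # Specificity, as worded in this section: fraction of predicted
    # interface residues that are actual interface residues.
    specificity = tp / (tp + fp) if tp + fp else 0.0
    if sensitivity + specificity:
        f1 = 2 * sensitivity * specificity / (sensitivity + specificity)
    else:
        f1 = 0.0
    # Assumption: CC is the Matthews correlation coefficient.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "cc": cc,
    }


# Hypothetical counts chosen only to illustrate the arithmetic.
scores = evaluate(tp=45, fp=36, tn=100, fn=10)
```
]]></preformat>
<p>With these hypothetical counts, the sensitivity is 45/55 (about 0.818) and the specificity is 45/81 (about 0.556).</p>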
<sec>
<label>2.3.</label>
<title>FP rate versus TP rate tradeoff</title>
<p>In some situations (e.g. key interface residue recognition for site-specific mutagenesis), higher sensitivity is needed, even at the cost of lower specificity. This requirement can be met by modifying the parameters used by the covering algorithm (CA). <xref ref-type="fig" rid="f1-ijms-10-02190">Figure 1</xref> shows the specificity-sensitivity graph and ROC curves for the Complete dataset, and <xref ref-type="fig" rid="f2-ijms-10-02190">Figure 2</xref> shows them for the Trim dataset. The area under the ROC curve (AUC) of the covering algorithm on the Complete dataset (0.9167) is higher than that of the random classifier (0.3307), and the AUC of the covering algorithm on the Trim dataset (0.8298) is likewise higher than that of the random classifier (0.2847). However, the AUC of the covering algorithm drops by about 8% from the Complete to the Trim dataset. A possible reason is that the non-interfacial residues removed from the training set of the Complete dataset contain information that is useful for predicting interaction sites, so that removing them reduces the performance of the method.</p></sec>
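<p>An AUC value of the kind reported above can be obtained from a set of (FP rate, TP rate) operating points by trapezoidal integration. The sketch below is a generic illustration under that assumption; how the operating points were produced from the CA parameters is not specified here, and the function name is hypothetical.</p>
<preformat><![CDATA[
```python
def roc_auc(points):
    """Trapezoidal area under a ROC curve.

    points: iterable of (fp_rate, tp_rate) pairs, one per parameter setting.
    The end points (0, 0) and (1, 1) are added automatically.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Area of the trapezoid between two neighboring operating points.
        area += (x1 - x0) * (y0 + y1) / 2
    return area


diagonal_auc = roc_auc([(0.5, 0.5)])  # a classifier on the chance diagonal
perfect_auc = roc_auc([(0.0, 1.0)])   # a classifier with no errors
```
]]></preformat>
<p>A classifier on the chance diagonal gives an area of 0.5 and an error-free classifier gives 1.0, which are the reference values against which the AUCs above are usually read.</p>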
<sec sec-type="methods">
<label>2.4.</label>
<title>Comparison with other methods</title>
<p>Support vector machines (SVMs) and the maximum entropy model (ME) are selected for comparison with our method. Both are discriminative classification methods. SVMs are a state-of-the-art method for predicting protein-protein interaction sites [<xref ref-type="bibr" rid="b11-ijms-10-02190">11</xref>,<xref ref-type="bibr" rid="b13-ijms-10-02190">13</xref>,<xref ref-type="bibr" rid="b15-ijms-10-02190">15</xref>,<xref ref-type="bibr" rid="b16-ijms-10-02190">16</xref>,<xref ref-type="bibr" rid="b28-ijms-10-02190">28</xref>]. ME is implemented in [<xref ref-type="bibr" rid="b17-ijms-10-02190">17</xref>]. We evaluate these methods using 5-fold cross-validation on the same dataset for direct comparison with our method. LIBSVM is used as the SVM implementation, with a radial basis function kernel and the default values of C and γ. The Stanford Classifier is used for ME and can be downloaded freely from <ext-link xlink:href="http://www-nlp.stanford.edu/software/classifier.shtml" ext-link-type="uri">http://www-nlp.stanford.edu/software/classifier.shtml</ext-link>.</p>
<p><xref ref-type="table" rid="t2-ijms-10-02190">Table 2</xref> shows the results obtained with the covering algorithm (CA), support vector machine (SVM) and maximum entropy (ME) on the Trim and Complete datasets. The table shows that our classifier performs well on our dataset. On the Complete dataset, the covering algorithm (CA) performs best according to sensitivity, F1-measure, accuracy and CC, but its specificity is slightly lower than that of SVM and ME. On the Trim dataset, the sensitivity, F1-measure and CC achieved by the CA method are higher than those of SVM (7.52% better sensitivity, 2.27% better F1-measure and 0.92% better CC), albeit with 5.47% lower specificity. If judged only by sensitivity, the CA slightly outperforms the ME (by 4%) on both datasets. Experiments on our dataset show that CA is an effective method for recognizing protein interaction sites, especially on the Complete dataset.</p>
<p>In order to illustrate the effectiveness of our approach, we plotted the ROC curves for the Complete and Trim datasets. As shown in <xref ref-type="fig" rid="f3-ijms-10-02190">Figure 3</xref>, on the Complete dataset the CA method achieves a higher AUC (0.9167) than SVM (0.7754) and ME (0.7486). After some negative samples are removed (i.e. on the Trim dataset), the AUC of the CA method (0.8298) is slightly lower, but still larger than that of SVM (0.7654) and ME (0.7488).</p></sec>
<sec>
<label>2.5.</label>
<title>Some experimental examples</title>
<p>Here we give two examples that are predicted by the CA, SVM and ME classifiers. The first example is the 2.8 Å structure of an αβ T cell receptor (TCR) heterodimer complexed with an anti-TCR Fab fragment derived from a mitogenic antibody [<xref ref-type="bibr" rid="b21-ijms-10-02190">21</xref>]. Our classifier correctly identifies 45 of the 55 actual interface residues (81.82% sensitivity) with 55.56% specificity (<xref ref-type="fig" rid="f5-ijms-10-02190">Figure 5A</xref>). SVM identifies 38 interface residues (69.09% sensitivity) with 58.46% specificity (<xref ref-type="fig" rid="f5-ijms-10-02190">Figure 5B</xref>). ME identifies 34 interface residues (61.81% sensitivity) with 52.30% specificity (<xref ref-type="fig" rid="f5-ijms-10-02190">Figure 5C</xref>). The 55 actual interface residues are shown in <xref ref-type="fig" rid="f5-ijms-10-02190">Figure 5D</xref>.</p>
<p>The second example is the Jel42 Fab fragment/HPr complex [<xref ref-type="bibr" rid="b22-ijms-10-02190">22</xref>]. This interface region is accurately identified by CA, which covers ~ 83% of the actual binding site with a specificity of 63.93% (<xref ref-type="fig" rid="f6-ijms-10-02190">Figure 6A</xref>). The prediction by SVM covers only 78.26% of the actual binding site with a specificity of 62.06% (<xref ref-type="fig" rid="f6-ijms-10-02190">Figure 6B</xref>). ME identifies 34 interface residues (73.91% sensitivity) with 37.78% specificity (<xref ref-type="fig" rid="f6-ijms-10-02190">Figure 6C</xref>). The 46 actual interface residues are shown in <xref ref-type="fig" rid="f6-ijms-10-02190">Figure 6D</xref>.</p></sec></sec>
<sec>
<label>3.</label>
<title>Experimental</title>
<p>Each surface residue is predicted to belong or not to belong to an interaction site on the basis of the spatial clustering of interface residues. Interaction site residues and non-interaction residues are used as positive and negative data, respectively.</p>
<sec sec-type="methods">
<label>3.1.</label>
<title>Collection of data sets</title>
<p>In our experiments, protein-protein interaction data are extracted from a set of 70 protein-protein complexes used in an independent study [<xref ref-type="bibr" rid="b20-ijms-10-02190">20</xref>]; the set contains X-ray diffraction structures of protein-protein complexes determined at a resolution of 1.6 Å or better. The dataset excludes homo-complexes, whose interacting surfaces are characterized by hydrophobicity. In order to obtain non-redundant protein chains of hetero-complexes, we apply two steps. First, all chains of the 70 protein complexes are compared and clustered using the BLASTCLUST program of NCBI BLAST 2.0. Two chains are assigned to the same cluster if (1) over 90% of their sequences are aligned and (2) the sequence identity is &gt; 30%. The first chain of each cluster is selected. Second, protein chains shorter than 40 residues are removed, and we select protein chain pairs with ≥ 20 interfacial residues. A residue is considered to form an interfacial contact if the distance between its α-carbon atom and any α-carbon atom of the interacting protein is &lt; 1.2 nm [<xref ref-type="bibr" rid="b9-ijms-10-02190">9</xref>]. For protein chains that interact with multiple partners, only the partner with the most interfacial residues is selected. According to the above definitions, the final dataset is composed of 61 hetero-complexes, which include 12 protease-inhibitor complexes, five antibody-antigen complexes, eight enzyme complexes, eight G-proteins, cell cycle, signal transduction and seven miscellaneous complexes. The dataset used is available online as Supplementary Material at <italic>IJMS</italic>.</p>
<p>Interfaces are formed mostly by residues that are exposed to the solvent if the partner chain is removed, so we mainly focus on surface residues. The solvent accessible surface area (ASA) is computed for each residue in the unbound molecule (MASA) and in the complex (CASA) using the DSSP program [<xref ref-type="bibr" rid="b23-ijms-10-02190">23</xref>]. We emphasize that only the coordinates of the unbound chain are used when computing the ASA of the unbound molecule; if other chains present in the complex were included, their influence would cause this ASA to be calculated incorrectly. In this paper, a residue is considered to be a surface residue if its relative accessible surface area is at least 16% of its nominal maximum area as defined in [<xref ref-type="bibr" rid="b24-ijms-10-02190">24</xref>]. As a result, a total of 6,567 residues (~ 64.03%) are collected as surface residues from all these chains. A surface residue is defined to be an interface residue if it forms an interfacial contact. According to this definition, 2,465 surface residues are interface residues, which corresponds to about 24.03% of all residues in these chains.</p>
<p>The fact that there are more non-interface residues than interface residues in the training set leads to higher specificity and lower sensitivity for many classifiers such as SVMs and ANN [<xref ref-type="bibr" rid="b8-ijms-10-02190">8</xref>,<xref ref-type="bibr" rid="b13-ijms-10-02190">13</xref>]. In order to evaluate the robustness and performance of the different methods, we conduct experiments on both the Trim dataset and the Complete dataset, which are described in <xref ref-type="table" rid="t3-ijms-10-02190">Table 3</xref>. The entire cross-validation procedure is repeated ten times, and the resulting average performances are used to evaluate our method.</p></sec>
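<p>The interfacial contact criterion can be sketched as follows. This is only an illustration of the distance rule stated above: the function name and the toy coordinates are hypothetical, and coordinates are assumed to be given in nanometers (PDB files store them in Å, where 1.2 nm corresponds to 12 Å).</p>
<preformat><![CDATA[
```python
import math

CONTACT_CUTOFF_NM = 1.2  # distance threshold between alpha-carbon atoms


def interface_contact_indices(chain_ca, partner_ca, cutoff_nm=CONTACT_CUTOFF_NM):
    """Return indices of residues of chain_ca that form an interfacial contact.

    chain_ca, partner_ca: lists of (x, y, z) alpha-carbon coordinates in nm.
    A residue is in contact if its alpha-carbon lies closer than cutoff_nm
    to the alpha-carbon of any residue of the interacting chain.
    """
    contacts = []
    for index, ca in enumerate(chain_ca):
        if any(math.dist(ca, other) < cutoff_nm for other in partner_ca):
            contacts.append(index)
    return contacts


# Toy coordinates (hypothetical): only the first residue is near the partner.
chain = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
partner = [(1.0, 0.0, 0.0)]
contacts = interface_contact_indices(chain, partner)
```
]]></preformat>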
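<p>The surface residue criterion amounts to a threshold on the relative ASA, as in the following sketch. The nominal maximum areas come from [<xref ref-type="bibr" rid="b24-ijms-10-02190">24</xref>] and are not reproduced here, so the maximum area is passed in as a parameter; the function names and the numeric values in the example are hypothetical.</p>
<preformat><![CDATA[
```python
SURFACE_THRESHOLD = 0.16  # relative ASA of at least 16% marks a surface residue


def relative_asa(asa, max_asa):
    """Scale an absolute ASA by the nominal maximum area of the residue type."""
    return asa / max_asa


def is_surface_residue(asa, max_asa, threshold=SURFACE_THRESHOLD):
    """Apply the surface criterion to the ASA computed on the unbound chain."""
    return relative_asa(asa, max_asa) >= threshold


# Hypothetical values: 40 of a nominal 200 square angstroms is 20% exposed.
exposed = is_surface_residue(40.0, 200.0)
buried = is_surface_residue(20.0, 200.0)
```
]]></preformat>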
<sec>
<label>3.2.</label>
<title>Generation of the character vector</title>
<p>Interface prediction relies on characteristics of residues found in the interfaces of protein complexes. Interface residues differ from other residues in several respects, the most prominent of which are sequence conservation, the proportions of the 20 types of amino acids, secondary structure, solvent accessibility and side-chain conformational entropy. Most of these characteristics are structural information. In this article, we choose the sequence profile and the residue accessible surface area as our features.</p>
<sec>
<label>3.2.1.</label>
<title>Protein sequence profile feature</title>
<p>A sequence profile is sequence-derived information that reflects potential structural homologs of a protein. Protein function information is embedded in the protein sequence, but how to extract it is a pivotal problem. A good candidate technique for extracting such information is multiple sequence alignment (MSA). A protein sequence profile is the result of an MSA and shows which kinds of amino acids appear at a given position of the protein primary structure. Herein, the protein sequence profiles are extracted from the HSSP database [<xref ref-type="bibr" rid="b27-ijms-10-02190">27</xref>]. Each residue is coded as a vector of 20 elements, which denote the relative frequency of each of the 20 amino acid residues at a given sequence position, obtained by counting the residues at that position in each of the aligned sequences, including the test sequence.</p></sec>
<sec>
<label>3.2.2.</label>
<title>ASA feature</title>
<p>The accessible surface area (ASA) feature represents the relative accessible surface area (scaled by the nominal maximum area of each residue). For convenience, we use ASA to denote the relative accessible surface area of a residue. The ASA of each residue is calculated using the DSSP program [<xref ref-type="bibr" rid="b23-ijms-10-02190">23</xref>].</p>
<p>In order to include the environment of the target residue, the profiles of sequentially neighboring residues within a window of n residues are also included in the character vector. <xref ref-type="disp-formula" rid="FD1">Equation (1)</xref> is an example of a vector with a window of 11 residues, as used in our experiment:
<disp-formula id="FD1">
<label>(1)</label>
<mml:math display="block">
<mml:msub>
<mml:mtext>V</mml:mtext>
<mml:mtext>n</mml:mtext></mml:msub>
<mml:mi> </mml:mi>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mtext>p</mml:mtext>
<mml:mrow>
<mml:mtext>n</mml:mtext>
<mml:mo>−</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mtext>p</mml:mtext>
<mml:mrow>
<mml:mtext>n</mml:mtext>
<mml:mo>−</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>20</mml:mn></mml:mrow></mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mtext>p</mml:mtext>
<mml:mrow>
<mml:mtext>n</mml:mtext>
<mml:mo>−</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>21</mml:mn></mml:mrow></mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mtext>p</mml:mtext>
<mml:mrow>
<mml:mtext>n</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mtext>p</mml:mtext>
<mml:mrow>
<mml:mtext>n</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mn>21</mml:mn></mml:mrow></mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mtext>p</mml:mtext>
<mml:mrow>
<mml:mtext>n</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mtext>p</mml:mtext>
<mml:mrow>
<mml:mtext>n</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>20</mml:mn></mml:mrow></mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mtext>p</mml:mtext>
<mml:mrow>
<mml:mtext>n</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>21</mml:mn></mml:mrow></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>and:
<disp-formula>
<mml:math display="block">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msub>
<mml:mtext>p</mml:mtext>
<mml:mrow>
<mml:mtext>n</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mtext>j</mml:mtext></mml:mrow></mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mtext>N</mml:mtext>
<mml:mrow>
<mml:mtext>n</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mtext>j</mml:mtext></mml:mrow></mml:msub></mml:mrow>
<mml:mrow>
<mml:munder>
<mml:mo>∑</mml:mo>
<mml:mtext>j</mml:mtext></mml:munder>
<mml:mrow>
<mml:msub>
<mml:mtext>N</mml:mtext>
<mml:mrow>
<mml:mtext>n</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mtext>j</mml:mtext></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mtext>j</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>…</mml:mo>
<mml:mn>20</mml:mn></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msub>
<mml:mtext>p</mml:mtext>
<mml:mrow>
<mml:mtext>n</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mtext>j</mml:mtext></mml:mrow></mml:msub>
<mml:mo>=</mml:mo>
<mml:mtext>ASA</mml:mtext>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mtext>x</mml:mtext>
<mml:mtext>n</mml:mtext></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mtext>j</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mn>21</mml:mn></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:math></disp-formula>where N<sub>n,j</sub> is the number of occurrences of amino acid j at position n of the alignment, x<sub>n</sub> is the residue at position n and ASA(x<sub>n</sub>) denotes the relative accessible surface area of residue x<sub>n</sub>.</p></sec></sec>
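<p>As an illustration of Equation (1), the sketch below builds the 21 features of one residue (20 relative frequencies plus the ASA) and concatenates them over a window of 11 residues, which gives 11 × 21 = 231 elements. The function names are hypothetical, and padding positions beyond the chain ends with zeros is an assumption, since the treatment of chain termini is not described here.</p>
<preformat><![CDATA[
```python
def residue_features(counts, rel_asa):
    """Return the 21 features of one position.

    counts: occurrences of the 20 amino acids at this alignment position.
    rel_asa: relative accessible surface area of the residue.
    """
    total = sum(counts)
    freqs = [c / total if total else 0.0 for c in counts]
    return freqs + [rel_asa]


def window_vector(per_residue, n, half_window=5):
    """Concatenate the features of residues n-5 .. n+5 (window of 11)."""
    width = len(per_residue[0])
    vector = []
    for pos in range(n - half_window, n + half_window + 1):
        if 0 <= pos < len(per_residue):
            vector.extend(per_residue[pos])
        else:
            # Assumption: positions outside the chain are zero-padded.
            vector.extend([0.0] * width)
    return vector


# Hypothetical chain of 15 residues with uniform counts and ASA = 0.3.
chain_features = [residue_features([1] * 20, 0.3) for _ in range(15)]
vector = window_vector(chain_features, n=7)
```
]]></preformat>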
<sec>
<label>3.3.</label>
<title>Covering algorithm (CA) for classification</title>
<p>Data-based machine learning derives rules from observed data in order to predict new data. The covering algorithm was proposed by Zhang Ling and Zhang Bo for classification. Suppose we are given an input set <italic>K</italic> = {<italic>X</italic><sub>1</sub>, <italic>X</italic><sub>2</sub>, ..., <italic>X</italic><italic><sub>k</sub></italic>}, a set of points in N-dimensional Euclidean space, where <italic>X</italic><sub>1</sub> = (<italic>x</italic><sub>1</sub>, <italic>y</italic><sub>1</sub>), <italic>X</italic><sub>2</sub> = (<italic>x</italic><sub>2</sub>, <italic>y</italic><sub>2</sub>), ..., <italic>X</italic><italic><sub>k</sub></italic> = (<italic>x</italic><italic><sub>k</sub></italic><italic>, y</italic><italic><sub>k</sub></italic>). Here <italic>x</italic><sub>1</sub>, <italic>x</italic><sub>2</sub>, ..., <italic>x</italic><italic><sub>k</sub></italic> denote the input vectors of the covering algorithm and <italic>y</italic><sub>1</sub>, <italic>y</italic><sub>2</sub>, ..., <italic>y</italic><italic><sub>k</sub></italic> ∈ {1, −1} denote their labels. Now suppose K is divided into s subsets; in this paper, we discuss s = 2 (i.e. two classes corresponding to interface residues and non-interface residues). First, the original input space (<italic>K</italic><sub>1</sub>, <italic>K</italic><sub>2</sub>) is transformed into a quadratic space by a global projection function, as illustrated in <xref ref-type="fig" rid="f7-ijms-10-02190">Figure 7</xref>. Then, the well-known point set covering method is used to partition the data in the transformed space.</p>
<sec>
<label>3.3.1.</label>
<title>Algorithm 1</title>
<p>Step 1. Make a cover C(i) (i = 1 at the beginning) that covers only points of K<sub>1</sub>, and add these points to the set D.</p>
<p>Step 2. Take a point p of K/D and suppose p belongs to K<sub>j</sub> (j = 1, 2). Make a cover C(i) that covers only points of K<sub>j</sub>, add these points to D and set i = i + 1. Repeat Step 2 until K/D = Ø.</p>
<p>Step 3. Suppose we obtain the cover set C = {C<sub>1</sub>, C<sub>2</sub>, ..., C<sub>k</sub>}. For a test point, if it lies in a cover C<sub>i</sub> that covers points of K<sub>1</sub>, output 1; otherwise output −1.</p>
<p>In fact, C(i) is a sphere domain with center w and radius r<sub>i</sub>.</p></sec>
<sec>
<label>3.3.2.</label>
<title>Algorithm 2 for making a cover C(i)</title>
<p>Step 1. If K<sub>1</sub> or K<sub>2</sub> is empty, then stop. Otherwise, suppose that K<sub>1</sub> ≠ Ø and randomly select a<sub>i</sub> ∈ K<sub>1</sub> (j = 1, i = 1 at the beginning).</p>
<p>Step 2. Seek a sphere domain C(a<sub>i</sub>) with center <italic>a</italic><italic><sub>i</sub></italic>. Suppose C(a<sub>i</sub>) ∩ K<sub>1</sub> = D<sub>i</sub>, i = 1, 2, ..., D<sub>0</sub> = Ø.</p>
<disp-formula id="FD2">
<label>(2)</label>
<mml:math display="block">
<mml:msub>
<mml:mtext>d</mml:mtext>
<mml:mn>1</mml:mn></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mtext>max</mml:mtext></mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>∉</mml:mo>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:munder>
<mml:mo>{</mml:mo>
<mml:mo>&lt;</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:mi> </mml:mi>
<mml:mi>x</mml:mi>
<mml:mo>&gt;</mml:mo>
<mml:mo>}</mml:mo></mml:math></disp-formula>
<disp-formula id="FD3">
<label>(3)</label>
<mml:math display="block">
<mml:msub>
<mml:mtext>d</mml:mtext>
<mml:mn>2</mml:mn></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mtext>min</mml:mtext></mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:munder>
<mml:mo>{</mml:mo>
<mml:mo>&lt;</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:mi> </mml:mi>
<mml:mi>x</mml:mi>
<mml:mo>&gt;</mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>&lt;</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>&gt;</mml:mo>
<mml:mo>&gt;</mml:mo>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mn>1</mml:mn></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>}</mml:mo></mml:math></disp-formula>
<disp-formula id="FD4">
<label>(4)</label>
<mml:math display="block">
<mml:mtext>d</mml:mtext>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mn>2</mml:mn></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mn>1</mml:mn></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow>
<mml:mn>2</mml:mn></mml:mfrac></mml:math></disp-formula>
<disp-formula id="FD5">
<label>(5)</label>
<mml:math display="block">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msub>
<mml:mi>θ</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>d</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi>ω</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>Step 3. Set C<sub>j</sub> = C(a<sub>i</sub>), K<sub>1,j</sub> = C<sub>j</sub> ∩ K<sub>1</sub>, K<sub>2</sub> ← K<sub>1</sub> \ K<sub>1,j</sub>, K<sub>1</sub> ← K<sub>2</sub>, j ← j + 1, and go to Step 1 of Algorithm 1.</p>
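<p>The threshold computation of Step 2 and the removal of covered samples in Step 3 can be illustrated with a short Python sketch. It is a minimal illustration rather than the implementation used in this work. It assumes that the samples are unit-length vectors, that the bracket notation denotes the ordinary inner product, and that d<sub>2</sub>(i) is the smallest inner product among same-class samples that still exceeds d<sub>1</sub>(i). The function names (dot, sphere_threshold, covered, unit) and the toy data are hypothetical.</p>
<preformat><![CDATA[
```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(u, v))

def sphere_threshold(center, same_class, other_class):
    """Return (d1, d2, theta) for one sphere centred at `center`."""
    # d1: largest inner product with a sample of another class, i.e. the
    # closest sample that must stay outside the sphere (Equation (2)).
    d1 = max(dot(center, x) for x in other_class)
    # d2: smallest inner product among same-class samples that are still
    # closer to the centre than that outsider (assumed reading of Equation (3)).
    d2 = min(s for s in (dot(center, x) for x in same_class) if s > d1)
    # theta: midpoint of d1 and d2 (Equations (4) and (5)).
    return d1, d2, (d1 + d2) / 2.0

def covered(center, theta, samples):
    """Flags for the samples that fall inside the sphere."""
    return [dot(center, x) > theta for x in samples]

def unit(deg):
    """Hypothetical toy input: a 2-D unit vector at the given angle."""
    r = math.radians(deg)
    return (math.cos(r), math.sin(r))

k1 = [unit(0), unit(10), unit(20), unit(80)]   # class currently being covered
rest = [unit(40), unit(120)]                    # samples of the other class
a_i = k1[0]                                     # chosen centre

d1, d2, theta = sphere_threshold(a_i, k1, rest)
inside = covered(a_i, theta, k1)
# Step 3: drop the covered samples and continue with the remainder.
k1_remaining = [x for x, hit in zip(k1, inside) if not hit]
```
]]></preformat>
<p>In this toy example the sphere around the first sample covers the three same-class vectors at 0, 10 and 20 degrees and excludes both vectors of the other class; the vector at 80 degrees remains uncovered and would be handled by a later sphere. The sketch does not handle the degenerate case in which a sample of another class coincides with the center.</p>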
<p>Further details of the covering algorithm can be found in [<xref ref-type="bibr" rid="b25-ijms-10-02190">25</xref>,<xref ref-type="bibr" rid="b26-ijms-10-02190">26</xref>].</p>
<p>Hence, using the training set, we can calculate all the parameters <italic>W</italic> = {<italic>ω</italic> = (<italic>a</italic><italic><sub>i</sub></italic>), <italic>θ</italic> = (<italic>θ</italic><italic><sub>i</sub></italic>)} from the above equations, and using the testing set we can evaluate the performance of our algorithm.</p></sec></sec>
<sec>
<label>3.4.</label>
<title>Predictor construction</title>
<p>In our experiment, predictors are generated using the covering algorithm (CA) to judge whether a residue is located on an interface or not. The CA is simple, has low computational complexity, and frequently achieves high accuracy on high-dimensional data. It can also handle large feature spaces and condense the information given by the training dataset. Only surface residues are considered in the training process: the target value is 1 (positive sample) for an interface residue and −1 (negative sample) for a non-interface residue.</p>
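<p>The input encoding used by the predictor, an 11-residue window that yields a 231-component vector, can be sketched in Python as follows. This is an illustrative sketch only. It assumes that each residue carries 21 values (20 sequence-profile values plus one ASA value), which would be consistent with 231 = 11 × 21, and it pads window positions that fall beyond the chain ends with zeros, a choice that the text does not specify. The names window_vector, WINDOW, PER_RESIDUE and the toy chain are hypothetical.</p>
<preformat><![CDATA[
```python
WINDOW = 11          # target residue plus five neighbours on each side
PER_RESIDUE = 21     # assumed: 20 profile values + 1 ASA value

def window_vector(per_residue, index):
    """Concatenate the features of the 11 residues centred on `index`."""
    half = WINDOW // 2
    vec = []
    for pos in range(index - half, index + half + 1):
        if 0 <= pos < len(per_residue):
            vec.extend(per_residue[pos])
        else:
            # Assumed zero padding for windows that run past the chain ends.
            vec.extend([0.0] * PER_RESIDUE)
    return vec

# Hypothetical chain of 30 residues; residue r has every feature equal to r + 1.
chain = [[float(r + 1)] * PER_RESIDUE for r in range(30)]
v_mid = window_vector(chain, 15)    # window fully inside the chain
v_start = window_vector(chain, 0)   # window overlapping the chain start
```
]]></preformat>
<p>Both vectors have 231 components. For the first residue of the chain, the leading 105 components (five missing neighbours of 21 values each) are padding under the zero-padding assumption.</p>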
<p>We construct our CA predictor using sequence profile and ASA attributes. Following the method used by Fariselli <italic>et al.</italic> [<xref ref-type="bibr" rid="b9-ijms-10-02190">9</xref>], the input to the CA is a window of 11 residues, centered on the target residue and including the five neighboring residues in sequence on each side, organized as in <xref ref-type="disp-formula" rid="FD1">formula (1)</xref>. Each residue is therefore represented in the predictor by a 231-component vector based on the sequence profile and ASA of the residue and its sequence neighbors.</p></sec>
<sec>
<label>3.5.</label>
<title>Evaluation of performance</title>
<p>Interface prediction has to fulfill two competing demands. The predictor should cover as many of the real interface residues as possible, but at the same time should predict as few false positives as possible. These two demands are measured by sensitivity and specificity, respectively. Let TP = the number of true positives (residues predicted to be interface residues that actually are interface residues); FP = the number of false positives (residues predicted to be interface residues that are in fact not interface residues); TN = the number of true negatives; FN = the number of false negatives; and N = TP + TN + FP + FN (the total number of examples). Then sensitivity is:
<disp-formula>
<mml:math display="block">
<mml:mtext>sensitivity</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mtext>TP</mml:mtext></mml:mrow>
<mml:mrow>
<mml:mtext>TP</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mtext>FN</mml:mtext></mml:mrow></mml:mfrac></mml:math></disp-formula>and specificity is:
<disp-formula>
<mml:math display="block">
<mml:mtext>specificity</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mtext>TP</mml:mtext></mml:mrow>
<mml:mrow>
<mml:mtext>TP</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mtext>FP</mml:mtext></mml:mrow></mml:mfrac></mml:math></disp-formula>and correlation coefficient (CC) is:
<disp-formula>
<mml:math display="block">
<mml:mtext>CC</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mtext>TP</mml:mtext>
<mml:mo>*</mml:mo>
<mml:mtext>TN</mml:mtext>
<mml:mo>−</mml:mo>
<mml:mtext>FP</mml:mtext>
<mml:mo>*</mml:mo>
<mml:mtext>FN</mml:mtext></mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mtext>TP</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mtext>FN</mml:mtext>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>*</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mtext>TP</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mtext>FP</mml:mtext>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>*</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mtext>TN</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mtext>FP</mml:mtext>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>*</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mtext>TN</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mtext>FN</mml:mtext>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:math></disp-formula>and F1-measure is:
<disp-formula>
<mml:math display="block">
<mml:mtext>F1-measure</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>*</mml:mo>
<mml:mtext>sensitivity</mml:mtext>
<mml:mo>*</mml:mo>
<mml:mtext>specificity</mml:mtext></mml:mrow>
<mml:mrow>
<mml:mtext>sensitivity</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mtext>specificity</mml:mtext></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
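<p>The four measures defined above can be computed directly from the confusion-matrix counts. The following Python sketch is illustrative: the function name interface_metrics and the example counts are hypothetical and are not results from this study. It does not guard against zero denominators.</p>
<preformat><![CDATA[
```python
import math

def interface_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, CC and F1-measure as defined in this section."""
    sensitivity = tp / (tp + fn)
    # Specificity follows the definition used here: TP / (TP + FP).
    specificity = tp / (tp + fp)
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom
    f1 = 2 * sensitivity * specificity / (sensitivity + specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "cc": cc,
        "f1": f1,
    }

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
m = interface_metrics(tp=60, fp=40, tn=160, fn=40)
```
]]></preformat>
<p>With these counts, sensitivity and specificity are both 60/100, the F1-measure therefore equals the same value, and CC = (60·160 − 40·40) / 20000 = 0.4.</p>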
<p>Sensitivity measures the fraction of interface residues that are identified as such. Specificity, as defined here, measures the fraction of the predicted interface residues that are actually interface residues (a quantity often called precision). The correlation coefficient measures how well the predicted class labels correlate with the actual class labels. It ranges from −1 to 1, where 1 corresponds to perfect prediction and 0 corresponds to random guessing.</p></sec></sec>
<sec sec-type="conclusions">
<label>4.</label>
<title>Conclusions</title>
<p>Identifying residues in protein-protein interaction sites is an extremely difficult task, especially in the absence of any information about partner chains. In this paper, without using such information, we propose a new approach to predict interface sites from protein sequence and structure characteristics. The method exploits the properties of the covering algorithm: simplicity, low complexity, and high accuracy on high-dimensional data. A relatively high false positive ratio is a troublesome problem in the prediction of protein-protein interaction sites. Some investigators reduce the false positive ratio by eliminating isolated raw positive predictions [<xref ref-type="bibr" rid="b13-ijms-10-02190">13</xref>]. In our experiment, false positive predictions are decreased by using a covering algorithm based on different features of protein-protein interaction. The results obtained in this paper show that our proposed method is a promising approach for studying protein-protein interaction, although its sensitivity is not yet satisfactory. Choosing proper features may improve the results; in future work we will investigate more effective features and also consider information about the binding protein chains.</p></sec></body>
<back>
<ack>
<p>We would like to thank Dr. Chih-Jen Lin from National Taiwan University for providing the LIBSVM tool, and Christopher Manning and Dan Klein from Stanford University for providing the maximum entropy software package. This work is also supported by the Project of Provincial Natural Scientific Fund from the Bureau of Education of Anhui Province (KJ2007B239) and the Project of Doctoral Foundation of Ministry of Education (200403057002).</p></ack>
<ref-list>
<title>References</title>
<ref id="b1-ijms-10-02190"><label>1.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhou</surname><given-names>HX</given-names></name></person-group><article-title>Improving the understanding of human genetic diseases through predictions of protein structures and protein-protein interaction sites</article-title><source>Curr. Med. Chem</source><year>2004</year><volume>11</volume><fpage>539</fpage><lpage>549</lpage><pub-id pub-id-type="doi">10.2174/0929867043455800</pub-id><pub-id pub-id-type="pmid">15032602</pub-id></citation></ref>
<ref id="b2-ijms-10-02190"><label>2.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Mrowka</surname><given-names>R</given-names></name><name><surname>Patzak</surname><given-names>A</given-names></name><name><surname>Herzel</surname><given-names>H</given-names></name></person-group><source>Is there a bias in proteome research?</source><publisher-name>Cold Spring Harbor Laboratory Press</publisher-name><publisher-loc>New York, NY, USA</publisher-loc><year>2001</year><volume>11</volume><fpage>1971</fpage><lpage>1973</lpage></citation></ref>
<ref id="b3-ijms-10-02190"><label>3.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Berman</surname><given-names>HM</given-names></name><name><surname>Battistuz</surname><given-names>T</given-names></name><name><surname>Bhat</surname><given-names>TN</given-names></name><name><surname>Bluhm</surname><given-names>WF</given-names></name><name><surname>Bourne</surname><given-names>PE</given-names></name><name><surname>Burkhardt</surname><given-names>K</given-names></name><name><surname>Feng</surname><given-names>Z</given-names></name><name><surname>Gilliland</surname><given-names>GL</given-names></name><name><surname>Iype</surname><given-names>L</given-names></name><name><surname>Jain</surname><given-names>S</given-names></name></person-group><article-title>The protein data bank</article-title><source>Acta Crystallogr. D</source><year>2002</year><volume>D58</volume><fpage>899</fpage><lpage>907</lpage></citation></ref>
<ref id="b4-ijms-10-02190"><label>4.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Glaser</surname><given-names>F</given-names></name><name><surname>Steinberg</surname><given-names>DM</given-names></name><name><surname>Vakser</surname><given-names>IA</given-names></name><name><surname>Ben-Tal</surname><given-names>N</given-names></name></person-group><article-title>Residue frequencies and pairing preferences at protein-protein interfaces</article-title><source>Proteins: Struct. Funct. Bioinf</source><year>2001</year><volume>43</volume><fpage>89</fpage><lpage>102</lpage><pub-id pub-id-type="doi">10.1002/1097-0134(20010501)43:2&lt;89::AID-PROT1021&gt;3.0.CO;2-H</pub-id></citation></ref>
<ref id="b5-ijms-10-02190"><label>5.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Young</surname><given-names>L</given-names></name><name><surname>Jernigan</surname><given-names>RL</given-names></name><name><surname>Covell</surname><given-names>DG</given-names></name></person-group><article-title>A role for surface hydrophobicity in protein-protein recognition</article-title><source>Protein Sci</source><year>1994</year><volume>3</volume><fpage>717</fpage><lpage>729</lpage><pub-id pub-id-type="pmid">8061602</pub-id></citation></ref>
<ref id="b6-ijms-10-02190"><label>6.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jones</surname><given-names>S</given-names></name><name><surname>Thornton</surname><given-names>JM</given-names></name></person-group><article-title>Principles of protein-protein interactions</article-title><source>Proc. Natl. Acad. Sci. USA</source><year>1996</year><volume>93</volume><fpage>13</fpage><lpage>20</lpage><pub-id pub-id-type="doi">10.1073/pnas.93.1.13</pub-id><pub-id pub-id-type="pmid">8552589</pub-id></citation></ref>
<ref id="b7-ijms-10-02190"><label>7.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Conte</surname><given-names>LL</given-names></name><name><surname>Chothia</surname><given-names>C</given-names></name><name><surname>Janin</surname><given-names>J</given-names></name></person-group><article-title>The atomic structure of protein-protein recognition sites</article-title><source>J. Mol. Biol</source><year>1999</year><volume>285</volume><fpage>2177</fpage><lpage>2198</lpage><pub-id pub-id-type="doi">10.1006/jmbi.1998.2439</pub-id><pub-id pub-id-type="pmid">9925793</pub-id></citation></ref>
<ref id="b8-ijms-10-02190"><label>8.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname><given-names>H</given-names></name><name><surname>Zhou</surname><given-names>HX</given-names></name></person-group><article-title>Prediction of interface residues in protein-protein complexes by a consensus neural network method: test against NMR data</article-title><source>Proteins: Struct. Funct. Bioinf</source><year>2005</year><volume>61</volume><fpage>21</fpage><lpage>35</lpage><pub-id pub-id-type="doi">10.1002/prot.20514</pub-id></citation></ref>
<ref id="b9-ijms-10-02190"><label>9.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fariselli</surname><given-names>P</given-names></name><name><surname>Pazos</surname><given-names>F</given-names></name><name><surname>Valencia</surname><given-names>A</given-names></name><name><surname>Casadio</surname><given-names>R</given-names></name></person-group><article-title>Prediction of protein-protein interaction sites in heterocomplexes with neural networks</article-title><source>Euro. J. Biochem</source><year>2002</year><volume>269</volume><fpage>1356</fpage><lpage>1361</lpage><pub-id pub-id-type="doi">10.1046/j.1432-1033.2002.02767.x</pub-id></citation></ref>
<ref id="b10-ijms-10-02190"><label>10.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhou</surname><given-names>HX</given-names></name><name><surname>Shan</surname><given-names>Y</given-names></name></person-group><article-title>Prediction of Protein Interaction Sites From Sequence Profile and Residue Neighbor List</article-title><source>Proteins: Struct. Funct. Bioinf</source><year>2001</year><volume>44</volume><fpage>336</fpage><lpage>343</lpage><pub-id pub-id-type="doi">10.1002/prot.1099</pub-id></citation></ref>
<ref id="b11-ijms-10-02190"><label>11.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bradford</surname><given-names>JR</given-names></name><name><surname>Westhead</surname><given-names>DR</given-names></name></person-group><article-title>Improved prediction of protein-protein binding sites using a support vector machines approach</article-title><source>Bioinformatics</source><year>2005</year><volume>21</volume><fpage>1487</fpage><lpage>1494</lpage><pub-id pub-id-type="doi">10.1093/bioinformatics/bti242</pub-id><pub-id pub-id-type="pmid">15613384</pub-id></citation></ref>
<ref id="b12-ijms-10-02190"><label>12.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chung</surname><given-names>JL</given-names></name><name><surname>Wang</surname><given-names>W</given-names></name><name><surname>Bourne</surname><given-names>PE</given-names></name></person-group><article-title>Exploiting sequence and structure homologs to identify protein-protein binding sites</article-title><source>Proteins: Struct. Funct. Bioinf</source><year>2006</year><volume>62</volume><fpage>630</fpage><lpage>640</lpage></citation></ref>
<ref id="b13-ijms-10-02190"><label>13.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Koike</surname><given-names>A</given-names></name><name><surname>Takagi</surname><given-names>T</given-names></name></person-group><article-title>Prediction of protein-protein interaction sites using support vector machines</article-title><source>Protein Eng. Des. Sel</source><year>2004</year><volume>17</volume><fpage>165</fpage><lpage>173</lpage><pub-id pub-id-type="doi">10.1093/protein/gzh020</pub-id><pub-id pub-id-type="pmid">15047913</pub-id></citation></ref>
<ref id="b14-ijms-10-02190"><label>14.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Res</surname><given-names>I</given-names></name><name><surname>Mihalek</surname><given-names>I</given-names></name><name><surname>Lichtarge</surname><given-names>O</given-names></name></person-group><article-title>An evolution based classifier for prediction of protein interfaces without using protein structures</article-title><source>Bioinformatics</source><year>2005</year><volume>21</volume><fpage>2496</fpage><lpage>2501</lpage><pub-id pub-id-type="doi">10.1093/bioinformatics/bti340</pub-id><pub-id pub-id-type="pmid">15728113</pub-id></citation></ref>
<ref id="b15-ijms-10-02190"><label>15.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname><given-names>B</given-names></name><name><surname>San Wong</surname><given-names>H</given-names></name><name><surname>Huang</surname><given-names>DS</given-names></name></person-group><article-title>Inferring protein-protein interacting sites using residue conservation and evolutionary information</article-title><source>Protein Pept. Lett</source><year>2006</year><volume>13</volume><fpage>999</fpage><lpage>1005</lpage><pub-id pub-id-type="doi">10.2174/092986606778777498</pub-id><pub-id pub-id-type="pmid">17168822</pub-id></citation></ref>
<ref id="b16-ijms-10-02190"><label>16.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname><given-names>B</given-names></name><name><surname>Chen</surname><given-names>P</given-names></name><name><surname>Huang</surname><given-names>DS</given-names></name><name><surname>Li</surname><given-names>J</given-names></name><name><surname>Lok</surname><given-names>TM</given-names></name><name><surname>Lyu</surname><given-names>MR</given-names></name></person-group><article-title>Predicting protein interaction sites from residue spatial sequence profile and evolution rate</article-title><source>FEBS Lett</source><year>2006</year><volume>580</volume><fpage>380</fpage><lpage>384</lpage><pub-id pub-id-type="doi">10.1016/j.febslet.2005.11.081</pub-id><pub-id pub-id-type="pmid">16376878</pub-id></citation></ref>
<ref id="b17-ijms-10-02190"><label>17.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname><given-names>MH</given-names></name><name><surname>Lin</surname><given-names>L</given-names></name><name><surname>Wang</surname><given-names>XL</given-names></name><name><surname>Liu</surname><given-names>T</given-names></name></person-group><article-title>Protein-protein interaction site prediction based on conditional random fields</article-title><source>Bioinformatics</source><year>2007</year><volume>23</volume><fpage>597</fpage><pub-id pub-id-type="doi">10.1093/bioinformatics/btl660</pub-id><pub-id pub-id-type="pmid">17234636</pub-id></citation></ref>
<ref id="b18-ijms-10-02190"><label>18.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ofran</surname><given-names>Y</given-names></name><name><surname>Rost</surname><given-names>B</given-names></name></person-group><article-title>Predicted protein–protein interaction sites from local sequence information</article-title><source>FEBS Lett</source><year>2003</year><volume>544</volume><fpage>236</fpage><lpage>239</lpage><pub-id pub-id-type="doi">10.1016/S0014-5793(03)00456-3</pub-id><pub-id pub-id-type="pmid">12782323</pub-id></citation></ref>
<ref id="b19-ijms-10-02190"><label>19.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jaynes</surname><given-names>ET</given-names></name></person-group><article-title>Information theory and statistical mechanics</article-title><source>Phys. Rev</source><year>1957</year><volume>106</volume><fpage>620</fpage><lpage>630</lpage><pub-id pub-id-type="doi">10.1103/PhysRev.106.620</pub-id></citation></ref>
<ref id="b20-ijms-10-02190"><label>20.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chakrabarti</surname><given-names>P</given-names></name><name><surname>Janin</surname><given-names>J</given-names></name></person-group><article-title>Dissecting protein-protein recognition sites</article-title><source>Proteins: Struct. Funct. Bioinf</source><year>2002</year><volume>47</volume><fpage>334</fpage><lpage>343</lpage><pub-id pub-id-type="doi">10.1002/prot.10085</pub-id></citation></ref>
<ref id="b21-ijms-10-02190"><label>21.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname><given-names>J</given-names></name><name><surname>Lim</surname><given-names>K</given-names></name><name><surname>Smolyar</surname><given-names>A</given-names></name><name><surname>Teng</surname><given-names>M</given-names></name><name><surname>Liu</surname><given-names>J</given-names></name><name><surname>Tse</surname><given-names>AGD</given-names></name><name><surname>Hussey</surname><given-names>RE</given-names></name><name><surname>Chishti</surname><given-names>Y</given-names></name><name><surname>Thomson</surname><given-names>CT</given-names></name></person-group><article-title>Atomic structure of an alpha beta T cell receptor (TCR) heterodimer in complex with an anti-TCR Fab fragment derived from a mitogenic antibody</article-title><source>EMBO J</source><year>1998</year><volume>17</volume><fpage>10</fpage><lpage>26</lpage><pub-id pub-id-type="doi">10.1093/emboj/17.1.10</pub-id><pub-id pub-id-type="pmid">9427737</pub-id></citation></ref>
<ref id="b22-ijms-10-02190"><label>22.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Prasad</surname><given-names>L</given-names></name><name><surname>Waygood</surname><given-names>EB</given-names></name><name><surname>Lee</surname><given-names>JS</given-names></name><name><surname>Delbaere</surname><given-names>LTJ</given-names></name></person-group><article-title>The 2.5 Å resolution structure of the jel42 Fab fragment/HPr complex</article-title><source>J. Mol. Biol</source><year>1998</year><volume>280</volume><fpage>829</fpage><lpage>845</lpage><pub-id pub-id-type="doi">10.1006/jmbi.1998.1888</pub-id><pub-id pub-id-type="pmid">9671553</pub-id></citation></ref>
<ref id="b23-ijms-10-02190"><label>23.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kabsch</surname><given-names>W</given-names></name><name><surname>Sander</surname><given-names>C</given-names></name></person-group><article-title>Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features</article-title><source>Biopolymers</source><year>1983</year><volume>22</volume><fpage>2577</fpage><lpage>2637</lpage><pub-id pub-id-type="doi">10.1002/bip.360221211</pub-id><pub-id pub-id-type="pmid">6667333</pub-id></citation></ref>
<ref id="b24-ijms-10-02190"><label>24.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rost</surname><given-names>B</given-names></name><name><surname>Sander</surname><given-names>C</given-names></name></person-group><article-title>Conservation and prediction of solvent accessibility in protein families</article-title><source>Proteins: Struct. Funct. Bioinf</source><year>1994</year><volume>20</volume><fpage>216</fpage><lpage>226</lpage><pub-id pub-id-type="doi">10.1002/prot.340200303</pub-id></citation></ref>
<ref id="b25-ijms-10-02190"><label>25.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname><given-names>L</given-names></name><name><surname>Zhang</surname><given-names>B</given-names></name></person-group><article-title>A geometrical representation of McCulloch-Pitts neural model and its applications</article-title><source>IEEE Trans Neural Netw</source><year>1999</year><volume>10</volume><fpage>925</fpage><lpage>929</lpage><pub-id pub-id-type="doi">10.1109/72.774263</pub-id><pub-id pub-id-type="pmid">18252589</pub-id></citation></ref>
<ref id="b26-ijms-10-02190"><label>26.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname><given-names>L</given-names></name><name><surname>Zhang</surname><given-names>B</given-names></name><name><surname>Yin</surname><given-names>HF</given-names></name></person-group><article-title>An alternative covering design algorithm of multi-layer neural networks</article-title><source>J. Soft</source><year>1999</year><volume>10</volume><fpage>737</fpage><lpage>742</lpage></citation></ref>
<ref id="b27-ijms-10-02190"><label>27.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dodge</surname><given-names>C</given-names></name><name><surname>Schneider</surname><given-names>R</given-names></name><name><surname>Sander</surname><given-names>C</given-names></name></person-group><article-title>The HSSP database of protein structure-sequence alignments and family profiles</article-title><source>Nucleic Acids Res</source><year>1998</year><volume>26</volume><fpage>313</fpage><pub-id pub-id-type="doi">10.1093/nar/26.1.313</pub-id><pub-id pub-id-type="pmid">9399862</pub-id></citation></ref>
<ref id="b28-ijms-10-02190"><label>28.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Burgoyne</surname><given-names>NJ</given-names></name><name><surname>Jackson</surname><given-names>RM</given-names></name></person-group><article-title>Predicting protein interaction sites: binding hot-spots in protein-protein and protein-ligand interfaces</article-title><source>Bioinformatics</source><year>2006</year><volume>22</volume><fpage>1335</fpage><lpage>1342</lpage><pub-id pub-id-type="doi">10.1093/bioinformatics/btl079</pub-id><pub-id pub-id-type="pmid">16522669</pub-id></citation></ref></ref-list>
<sec sec-type="display-objects">
<title>Figures and Tables</title>
<fig id="f1-ijms-10-02190" position="float">
<label>Figure 1.</label>
<caption>
<p>Specificity-sensitivity and ROC curves on the Complete dataset.</p></caption>
<graphic xlink:href="ijms-10-02190f1.gif"/></fig>
<fig id="f2-ijms-10-02190" position="float">
<label>Figure 2.</label>
<caption>
<p>Specificity-sensitivity and ROC curves on the Trim dataset.</p></caption>
<graphic xlink:href="ijms-10-02190f2.gif"/></fig>
<fig id="f3-ijms-10-02190" position="float">
<label>Figure 3.</label>
<caption>
<p>Specificity-sensitivity and ROC curves on the Complete data based on SVM, ME and CA.</p></caption>
<graphic xlink:href="ijms-10-02190f3.gif"/></fig>
<fig id="f4-ijms-10-02190" position="float">
<label>Figure 4.</label>
<caption>
<p>Specificity-sensitivity and ROC curves on the Trim data based on SVM, ME and CA.</p></caption>
<graphic xlink:href="ijms-10-02190f4.gif"/></fig>
<fig id="f5-ijms-10-02190" position="float">
<label>Figure 5.</label>
<caption>
<p>Predicted interface residues (red color) on protein (PDB: 1NFD_C) identified by (A) CA, (B) SVM, and (C) ME, together with (D) the actual interface residues. Red denotes true positive residues, blue denotes false negative residues, yellow denotes false positive residues, and pink denotes true negative residues.</p></caption>
<graphic xlink:href="ijms-10-02190f5.gif"/></fig>
<fig id="f6-ijms-10-02190" position="float">
<label>Figure 6.</label>
<caption>
<p>Predicted interface residues (red color) on protein (PDB: 2JEL_H) identified by (A) CA, (B) SVM, and (C) ME, together with (D) the actual interface residues. Red denotes true positive residues, blue denotes false negative residues, yellow denotes false positive residues, and pink denotes true negative residues.</p></caption>
<graphic xlink:href="ijms-10-02190f6.gif"/></fig>
<fig id="f7-ijms-10-02190" position="float">
<label>Figure 7.</label>
<caption>
<p>(a) A sphere neighborhood; (b) input vectors and their projection.</p></caption>
<graphic xlink:href="ijms-10-02190f7.gif"/></fig>
<table-wrap id="t1-ijms-10-02190" position="float">
<label>Table 1.</label>
<caption>
<p>Performances on a dataset of 61 protein chains using 5-fold cross-validation.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="middle" align="center"><bold>Dataset</bold></th>
<th valign="middle" align="center"><bold>Method</bold></th>
<th valign="middle" align="center"><bold>Sensitivity</bold></th>
<th valign="middle" align="center"><bold>Specificity</bold></th>
<th valign="middle" align="center"><bold>Accuracy</bold></th>
<th valign="middle" align="center"><bold>F1-measure</bold></th>
<th valign="middle" align="center"><bold>CC</bold></th></tr></thead>
<tbody>
<tr>
<td valign="top" align="center">Complete</td>
<td valign="top" align="center">CA</td>
<td valign="top" align="center">0.5612</td>
<td valign="top" align="center">0.5883</td>
<td valign="top" align="center">0.6962</td>
<td valign="top" align="center">0.5916</td>
<td valign="top" align="center">0.2893</td></tr>
<tr>
<td valign="top" align="center"/>
<td valign="top" align="center">Random</td>
<td valign="top" align="center">0.4535</td>
<td valign="top" align="center">0.4764</td>
<td valign="top" align="center">0.5582</td>
<td valign="top" align="center">0.4462</td>
<td valign="top" align="center">0.0604</td></tr>
<tr>
<td valign="top" align="center">Trim</td>
<td valign="top" align="center">CA</td>
<td valign="top" align="center">0.6559</td>
<td valign="top" align="center">0.5334</td>
<td valign="top" align="center">0.6086</td>
<td valign="top" align="center">0.5863</td>
<td valign="top" align="center">0.2124</td></tr>
<tr>
<td valign="top" align="center"/>
<td valign="top" align="center">Random</td>
<td valign="top" align="center">0.5036</td>
<td valign="top" align="center">0.4555</td>
<td valign="top" align="center">0.4955</td>
<td valign="top" align="center">0.4550</td>
<td valign="top" align="center">−0.0065</td></tr></tbody></table></table-wrap>
<table-wrap id="t2-ijms-10-02190" position="float">
<label>Table 2.</label>
<caption>
<p>Performances of SVM, CA and ME based on 5-fold cross-validation.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="middle" align="center"><bold>Data set</bold></th>
<th valign="middle" align="center"><bold>Method</bold></th>
<th valign="middle" align="center"><bold>Sensitivity</bold></th>
<th valign="middle" align="center"><bold>Specificity</bold></th>
<th valign="middle" align="center"><bold>F1-measure</bold></th>
<th valign="middle" align="center"><bold>Accuracy</bold></th>
<th valign="middle" align="center"><bold>CC</bold></th></tr></thead>
<tbody>
<tr>
<td valign="top" align="center"/>
<td valign="top" align="center">SVM</td>
<td valign="top" align="center">0.5547</td>
<td valign="top" align="center">0.6294</td>
<td valign="top" align="center">0.5796</td>
<td valign="top" align="center">0.6896</td>
<td valign="top" align="center">0.2443</td></tr>
<tr>
<td valign="top" align="center">Complete</td>
<td valign="top" align="center">ME</td>
<td valign="top" align="center">0.5011</td>
<td valign="top" align="center">0.6734</td>
<td valign="top" align="center">0.5408</td>
<td valign="top" align="center">0.6761</td>
<td valign="top" align="center">0.2719</td></tr>
<tr>
<td valign="top" align="center"/>
<td valign="top" align="center">CA</td>
<td valign="top" align="center">0.5612</td>
<td valign="top" align="center">0.5883</td>
<td valign="top" align="center">0.5916</td>
<td valign="top" align="center">0.6962</td>
<td valign="top" align="center">0.2893</td></tr>
<tr>
<td valign="top" align="center"/>
<td valign="top" align="center">SVM</td>
<td valign="top" align="center">0.5807</td>
<td valign="top" align="center">0.5883</td>
<td valign="top" align="center">0.5639</td>
<td valign="top" align="center">0.6662</td>
<td valign="top" align="center">0.2032</td></tr>
<tr>
<td valign="top" align="center">Trim</td>
<td valign="top" align="center">ME</td>
<td valign="top" align="center">0.6103</td>
<td valign="top" align="center">0.6101</td>
<td valign="top" align="center">0.6576</td>
<td valign="top" align="center">0.5860</td>
<td valign="top" align="center">0.2417</td></tr>
<tr>
<td valign="top" align="center"/>
<td valign="top" align="center">CA</td>
<td valign="top" align="center">0.6559</td>
<td valign="top" align="center">0.5334</td>
<td valign="top" align="center">0.5863</td>
<td valign="top" align="center">0.6086</td>
<td valign="top" align="center">0.2124</td></tr></tbody></table></table-wrap>
<table-wrap id="t3-ijms-10-02190" position="float">
<label>Table 3.</label>
<caption>
<p>Two types of data sets.</p></caption>
<table frame="hsides" rules="none">
<thead>
<tr>
<th valign="middle" align="center"><bold>Data set</bold></th>
<th valign="middle" align="center"><bold>Chains</bold></th>
<th valign="middle" align="center"><bold>Residues</bold></th>
<th valign="middle" align="center"><bold>Surface residues</bold></th>
<th valign="middle" align="center"><bold>Interface residues</bold></th></tr></thead>
<tbody>
<tr>
<td valign="top" align="center">Complete<xref ref-type="table-fn" rid="tfn1-ijms-10-02190">a</xref></td>
<td valign="top" align="center">61</td>
<td valign="top" align="center">10,256</td>
<td valign="top" align="center">6,567</td>
<td valign="top" align="center">2,465</td></tr>
<tr>
<td valign="top" align="center">Trim<xref ref-type="table-fn" rid="tfn2-ijms-10-02190">b</xref></td>
<td valign="top" align="center">61</td>
<td valign="top" align="center">10,256</td>
<td valign="top" align="center">2,465</td>
<td valign="top" align="center">2,465</td></tr></tbody></table>
<table-wrap-foot><fn id="tfn1-ijms-10-02190">
<label>a</label>
<p>Includes all surface residues.</p></fn><fn id="tfn2-ijms-10-02190">
<label>b</label>
<p>Non-interface residues were randomly removed so that their number equals the number of interface residues.</p></fn></table-wrap-foot></table-wrap></sec></back></article>
