<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">ijms</journal-id>
<journal-title>International Journal of Molecular Sciences</journal-title>
<abbrev-journal-title>Int. J. Mol. Sci.</abbrev-journal-title>
<issn pub-type="epub">1422-0067</issn>
<publisher>
<publisher-name>Molecular Diversity Preservation International (MDPI)</publisher-name></publisher></journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3390/ijms12128347</article-id>
<article-id pub-id-type="publisher-id">ijms-12-08347</article-id>
<article-categories>
<subj-group>
<subject>Article</subject></subj-group></article-categories>
<title-group>
<article-title>Prediction of Lysine Ubiquitylation with Ensemble Classifier and Feature Selection</article-title></title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Zhao</surname><given-names>Xiaowei</given-names></name><xref ref-type="aff" rid="af1-ijms-12-08347">1</xref><xref ref-type="aff" rid="af2-ijms-12-08347">2</xref></contrib>
<contrib contrib-type="author">
<name><surname>Li</surname><given-names>Xiangtao</given-names></name><xref ref-type="aff" rid="af2-ijms-12-08347">2</xref></contrib>
<contrib contrib-type="author">
<name><surname>Ma</surname><given-names>Zhiqiang</given-names></name><xref ref-type="aff" rid="af1-ijms-12-08347">1</xref><xref ref-type="aff" rid="af2-ijms-12-08347">2</xref><xref ref-type="corresp" rid="c1-ijms-12-08347">*</xref></contrib>
<contrib contrib-type="author">
<name><surname>Yin</surname><given-names>Minghao</given-names></name><xref ref-type="aff" rid="af2-ijms-12-08347">2</xref><xref ref-type="corresp" rid="c1-ijms-12-08347">*</xref></contrib></contrib-group>
<aff id="af1-ijms-12-08347">
<label>1</label>College of Life Science, Northeast Normal University, 5268 Renmin Street, Changchun 130024, China; E-Mail: <email>zhaoxw303@nenu.edu.cn</email></aff>
<aff id="af2-ijms-12-08347">
<label>2</label>College of Computer Science, Northeast Normal University, 2555 Jingyue Street, Changchun 130117, China; E-Mail: <email>lixt314@nenu.edu.cn</email></aff>
<author-notes>
<corresp id="c1-ijms-12-08347">
<label>*</label>Authors to whom correspondence should be addressed; E-Mails: <email>zhiqiang.ma967@gmail.com</email> (Z.M.); <email>ymh@nenu.edu.cn</email> (M.Y.); Tel.: +86-0431-8453-6338 (Z.M.); Fax: +86-0431-8453-6338 (Z.M.).</corresp></author-notes>
<pub-date pub-type="collection">
<year>2011</year></pub-date>
<pub-date pub-type="epub">
<day>28</day>
<month>11</month>
<year>2011</year></pub-date>
<volume>12</volume>
<issue>12</issue>
<fpage>8347</fpage>
<lpage>8361</lpage>
<history>
<date date-type="received">
<day>27</day>
<month>7</month>
<year>2011</year></date>
<date date-type="rev-recd">
<day>14</day>
<month>11</month>
<year>2011</year></date>
<date date-type="accepted">
<day>15</day>
<month>11</month>
<year>2011</year></date></history>
<permissions>
<copyright-statement>© 2011 by the authors; licensee MDPI, Basel, Switzerland.</copyright-statement>
<copyright-year>2011</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/3.0">
<p>This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).</p></license></permissions>
<abstract>
<p>Ubiquitylation is an important post-translational modification. Correct identification of protein lysine ubiquitylation sites is of fundamental importance for understanding the molecular mechanism of lysine ubiquitylation in biological systems. This paper develops a novel computational method, based on an ensemble approach, to identify lysine ubiquitylation sites. In the proposed method, 468 ubiquitylation sites from 323 proteins retrieved from the Swiss-Prot database were encoded into feature vectors using four kinds of protein sequence information. An effective feature selection method was then applied to extract informative feature subsets. Different feature subsets, obtained by setting different starting points in the search procedure, were used to train multiple random forest classifiers, which were then aggregated into a consensus classifier by majority voting. Evaluated by jackknife tests and independent tests, respectively, the accuracy of the proposed predictor reached 76.82% for the training dataset and 79.16% for the test dataset, indicating that this predictor is a useful tool for predicting lysine ubiquitylation sites. Furthermore, site-specific feature analysis showed that ubiquitylation is closely correlated with the features of the surrounding sites in addition to features derived from the lysine site itself. The feature selection method is available upon request.</p></abstract>
<kwd-group>
<kwd>ubiquitylation</kwd>
<kwd>ensemble classifier</kwd>
<kwd>support vector machine</kwd>
<kwd>lysine ubiquitylation sites</kwd></kwd-group></article-meta></front>
<body>
<sec sec-type="intro">
<title>1. Introduction</title>
<p>Ubiquitylation is a universal and important post-translational modification in which ubiquitin is linked to certain lysine residues of target proteins [<xref ref-type="bibr" rid="b1-ijms-12-08347">1</xref>–<xref ref-type="bibr" rid="b3-ijms-12-08347">3</xref>], forming an isopeptide bond between the ɛ-amino group of a lysine residue of the substrate protein and the carboxy group of the <italic>C</italic>-terminal double-glycine motif of ubiquitin [<xref ref-type="bibr" rid="b4-ijms-12-08347">4</xref>,<xref ref-type="bibr" rid="b5-ijms-12-08347">5</xref>]. Note that only the last glycine of ubiquitin is linked to the substrate lysine residue in this process. Three main kinds of enzymes participate in this highly collaborative process: ubiquitin-activating enzymes, ubiquitin-conjugating enzymes, and ubiquitin ligases [<xref ref-type="bibr" rid="b6-ijms-12-08347">6</xref>,<xref ref-type="bibr" rid="b7-ijms-12-08347">7</xref>]. During the past decade, the known functions of ubiquitylation have been extended far beyond directing protein degradation [<xref ref-type="bibr" rid="b1-ijms-12-08347">1</xref>,<xref ref-type="bibr" rid="b2-ijms-12-08347">2</xref>], for example to the control of signal transduction, the regulation of DNA repair and transcription, and involvement in endocytosis and sorting [<xref ref-type="bibr" rid="b8-ijms-12-08347">8</xref>].</p>
<p>Since identification of protein lysine ubiquitylation sites is of fundamental importance for understanding the molecular mechanism of lysine ubiquitylation in biological systems, many researchers in the post-genome era have focused on this field [<xref ref-type="bibr" rid="b9-ijms-12-08347">9</xref>–<xref ref-type="bibr" rid="b11-ijms-12-08347">11</xref>]. Meanwhile, some high-throughput experimental technologies have been developed to analyze and model the lysine ubiquitylation process at a genomic scale, such as proteolytic digestion, three-step affinity purification, and analysis using mass spectrometry [<xref ref-type="bibr" rid="b12-ijms-12-08347">12</xref>]. However, these conventional experimental approaches are labor-intensive and time-consuming, especially for large-scale data sets. Accordingly, several computational approaches have been developed to predict lysine ubiquitylation sites effectively and accurately. Tung and Ho built a prediction model, UbiPred, by using an informative subset of 531 physicochemical properties and a support vector machine. A new algorithm was then proposed for selecting an informative subset of physicochemical properties, which can significantly improve the accuracy [<xref ref-type="bibr" rid="b13-ijms-12-08347">13</xref>]. Radivojac <italic>et al</italic>. developed a random forest predictor of ubiquitylation sites, UbPred. In their method, amino acid compositions, physicochemical properties and evolutionary information are first used to represent a protein sequence, and then a t-test attribute selection filter is applied to retain only statistically significant attributes [<xref ref-type="bibr" rid="b14-ijms-12-08347">14</xref>]. Cai <italic>et al</italic>. proposed a predictor based on the nearest neighbor algorithm. In that algorithm, they extract conservation scores and disorder scores from a protein sequence, and then use the maximum relevance and minimum redundancy principle to identify the key features [<xref ref-type="bibr" rid="b15-ijms-12-08347">15</xref>]. Nevertheless, the prediction performance of these approaches is not always satisfactory.</p>
<p>In this study, an ensemble computational method is developed to predict lysine ubiquitylation sites based on amino acid sequence features. Firstly, four kinds of useful features, which describe the lysine site and each of its surrounding sites, are extracted from each protein sequence: amino acid composition [<xref ref-type="bibr" rid="b16-ijms-12-08347">16</xref>]; evolutionary information [<xref ref-type="bibr" rid="b17-ijms-12-08347">17</xref>,<xref ref-type="bibr" rid="b18-ijms-12-08347">18</xref>]; amino acid factors [<xref ref-type="bibr" rid="b19-ijms-12-08347">19</xref>]; and disorder score [<xref ref-type="bibr" rid="b20-ijms-12-08347">20</xref>]. Secondly, in order to reduce the computational complexity and enhance the overall accuracy of the predictor, an effective feature selection method is used to select several optimal feature subsets. Finally, the ensemble classifier is established using the vectors of the resulting feature subsets as input. For the newly constructed ubiquitylation site dataset, the accuracy of the proposed predictor is 76.82% for the training dataset and 79.16% for the test dataset, which is higher than that of state-of-the-art ubiquitylation site predictors.</p>
<p>Our feature analysis shows that flanking residues influence the properties and structure of a central residue. That is, information about the sequence environment helps to enhance prediction accuracy. In lysine ubiquitylation site prediction, the position-specific scoring matrix (PSSM) conservation scores play a more important role than the other descriptors. The other three descriptors, <italic>i.e.</italic>, amino acid composition, disorder score and amino acid factors, show almost equal relevance to ubiquitylation. When the window size is 21, the amino acid residues at positions 8, 11, 12, 14, 16 and 20 contribute many more features to the optimal feature subset than the other positions.</p></sec>
<sec sec-type="materials|methods">
<title>2. Materials and Methods</title>
<sec sec-type="methods">
<title>2.1. Data Sets</title>
<p>In this study, a dataset consisting of 468 ubiquitylation sites from 323 proteins is constructed by retrieving annotated proteins from the UniProt database [<xref ref-type="bibr" rid="b21-ijms-12-08347">21</xref>] at [<xref ref-type="bibr" rid="b22-ijms-12-08347">22</xref>]. To avoid homology bias, these proteins were filtered with the program cd-hit [<xref ref-type="bibr" rid="b23-ijms-12-08347">23</xref>] so that the sequence identity is lower than 0.6. After mapping the experimentally verified ubiquitylation sites to the corresponding 323 protein sequences, the 920 lysine residues with no ubiquitylation annotation are regarded as non-ubiquitylation sites. The benchmark dataset is then divided into a training dataset and a test dataset: 65 proteins are randomly selected to construct the test dataset, and the remaining proteins make up the training dataset. According to the work of Tung and Ho and of Cai <italic>et al</italic>. [<xref ref-type="bibr" rid="b13-ijms-12-08347">13</xref>,<xref ref-type="bibr" rid="b15-ijms-12-08347">15</xref>], the best window size for ubiquitylation site prediction is 21, so we adopt it in this study as well, with 10 residues upstream and 10 residues downstream of the lysine residue in the protein sequence. As a result, the training dataset includes 298 ubiquitylation sites and 563 non-ubiquitylation sites, and the test dataset includes 170 ubiquitylation sites and 357 non-ubiquitylation sites.</p>
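<p>As a minimal sketch of the windowing step described above (not the authors' code), the following Python function extracts the 21-residue peptide centred on each lysine of a sequence. The function name <monospace>lysine_windows</monospace> and the padding character "X" for positions beyond the sequence termini are assumptions made for illustration; the text does not state how lysines near the termini are handled.</p>
<preformat preformat-type="code"><![CDATA[
```python
def lysine_windows(sequence, flank=10, pad="X"):
    """Return (position, peptide) pairs for every lysine (K) in `sequence`.

    `position` is 1-based and each peptide has length 2 * flank + 1.
    Padding the termini with `pad` is an assumption, not taken from the paper.
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i of `sequence` sits at index i + flank in `padded`,
            # so the window with `flank` residues on each side starts at i.
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


if __name__ == "__main__":
    demo = "MASKTTLLQEWKRGAPLD"  # hypothetical sequence, for illustration only
    for position, peptide in lysine_windows(demo):
        print(position, peptide)
```
]]></preformat>
<p>With the default <monospace>flank=10</monospace>, every returned peptide has 21 residues and the lysine sits at the central (11th) position.</p>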
<p>To evaluate the ensemble classifier’s performance and compare it with existing methods, a publicly available dataset [<xref ref-type="bibr" rid="b15-ijms-12-08347">15</xref>], which includes 14 ubiquitylation sites and 267 non-ubiquitylation sites, is also adopted here. We refer to it as the “independent dataset”. <xref ref-type="table" rid="t1-ijms-12-08347">Table 1</xref> gives the number of ubiquitylation and non-ubiquitylation sites in each dataset.</p></sec>
<sec>
<title>2.2. Representation of Peptides</title>
<p>In this study, amino acid composition, PSSM conservation scores, disorder scores and amino acid factors are used to transform the peptides into feature vectors.</p>
<sec>
<title>2.2.1. Amino Acid Compositions</title>
<p>Many encoding methods are available for protein sequences, e.g., amino acid composition [<xref ref-type="bibr" rid="b16-ijms-12-08347">16</xref>,<xref ref-type="bibr" rid="b24-ijms-12-08347">24</xref>], pseudo amino acid composition [<xref ref-type="bibr" rid="b25-ijms-12-08347">25</xref>] and amino acid identity [<xref ref-type="bibr" rid="b13-ijms-12-08347">13</xref>]. Here we use the amino acid composition to represent each peptide, based on normalized counts of single amino acids and of amino acid pairs. Firstly, each peptide is represented by a feature vector of length 141 that includes 20 features for the average amino acid composition and 121 dipeptide features. The number 121 arises because, in order to reduce the dimensionality of the dipeptides, the 20 amino acids are clustered into 11 groups according to similar physicochemical or structural properties [<xref ref-type="bibr" rid="b26-ijms-12-08347">26</xref>], which gives 11 × 11 = 121 pairwise group combinations. Secondly, these 121 combinations are reduced to 66 by placing dipeptides with the same amino acid composition, i.e., the same pair of groups regardless of order, into one category. This leaves 20 + 66 = 86 amino acid composition features.</p></sec>
<sec>
<title>2.2.2. PSSM Conservation Scores</title>
<p>Evolutionary conservation, one of the most important types of information for assessing functionality in biological analysis, has been used successfully in many studies [<xref ref-type="bibr" rid="b17-ijms-12-08347">17</xref>,<xref ref-type="bibr" rid="b18-ijms-12-08347">18</xref>]. In biology, conserved sequences are similar or identical sequences that occur within protein sequences, nucleic acid sequences or within different molecules produced by the same organism. Highly conserved proteins are often required for basic cellular function, stability or reproduction. The evolutionary conservation of protein sequences serves as evidence for structural and functional conservation. Therefore, the position-specific scoring matrix (PSSM) extracted from sequence profiles generated by PSI-BLAST is selected as the second type of feature descriptor in this study. Here, each sample is used as a query to search and align homologous sequences from NCBI’s NR database [<xref ref-type="bibr" rid="b27-ijms-12-08347">27</xref>] using the PSI-BLAST program [<xref ref-type="bibr" rid="b28-ijms-12-08347">28</xref>] with three iterations (−j 3) and an e-value threshold of 0.0001 for inclusion in the multi-pass model (−h 0.0001).</p>
<p>As shown in <xref ref-type="fig" rid="f1-ijms-12-08347">Figure 1</xref>, the PSSM is composed of <italic>L</italic> × 20 elements, where <italic>L</italic> is the total number of residues in a peptide; the rows of the matrix represent the protein residues and the columns represent the 20 amino acids. The residue at position <italic>i</italic> is encoded by an evolutionary information vector of 20 dimensions given by the <italic>i</italic>th row of the PSSM. Before use, the values of the PSSM are normalized to the range [0, 1] using the formula (value − minimum)/(maximum − minimum). To account for the neighboring effect of the residues surrounding each ubiquitylation site, a sliding window of size <italic>w</italic> is used to combine the evolutionary information from upstream and downstream neighbors. A ubiquitylation site K at sequence position <italic>i</italic> is represented by a feature vector <italic>P</italic><italic><sub>i</sub></italic>, defined as follows, where <italic>w</italic> is an odd number that stands for the size of the sliding window and <italic>p</italic><italic><sub>[i]</sub></italic> is the <italic>i</italic>th row of the normalized PSSM. The length of vector <italic>P</italic><italic><sub>i</sub></italic> is <italic>w</italic> × 20.</p>
<disp-formula id="FD1">
<label>(1)</label>
<mml:math id="mm1" display="block">
<mml:semantics id="sm1">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi></mml:mrow>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi></mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>-</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi></mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi></mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msub>
<mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:semantics></mml:math></disp-formula></sec>
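<p>The normalization and the window vector of Equation (1) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the PSSM is assumed to be already parsed into a list of <italic>L</italic> rows of 20 numbers, the minimum and maximum are taken over the whole matrix (the text does not say whether scaling is per matrix or per row), window rows that fall outside the sequence are filled with zeros, and the function names are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
def normalize_pssm(pssm):
    """Min-max scale all PSSM entries to [0, 1].

    Using the global minimum and maximum of the matrix is an assumption.
    """
    flat = [value for row in pssm for value in row]
    lowest, highest = min(flat), max(flat)
    span = (highest - lowest) or 1  # guard against a constant matrix
    return [[(value - lowest) / span for value in row] for row in pssm]


def window_features(norm_pssm, i, w=21):
    """Concatenate rows i-(w-1)/2 .. i+(w-1)/2 of the normalized PSSM.

    `i` is a 0-based row index. The result has length w * 20. Rows outside
    the sequence are replaced by zeros (an assumption, not from the paper).
    """
    if w % 2 == 0:
        raise ValueError("window size w must be odd")
    half = (w - 1) // 2
    n_cols = len(norm_pssm[0])
    vector = []
    for j in range(i - half, i + half + 1):
        if 0 <= j < len(norm_pssm):
            vector.extend(norm_pssm[j])
        else:
            vector.extend([0.0] * n_cols)
    return vector
```
]]></preformat>
<p>For the window size <italic>w</italic> = 21 used in this study, <monospace>window_features</monospace> returns 21 × 20 = 420 values per site, which matches the number of PSSM conservation score features.</p>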
<sec>
<title>2.2.3. Disorder Scores</title>
<p>In recent decades, the functional importance of disordered regions has been increasingly recognized [<xref ref-type="bibr" rid="b29-ijms-12-08347">29</xref>,<xref ref-type="bibr" rid="b30-ijms-12-08347">30</xref>]. Protein disorder in nonglobular segments allows for more modification sites and interaction partners, and is of great importance for predicting protein structures and functions [<xref ref-type="bibr" rid="b29-ijms-12-08347">29</xref>,<xref ref-type="bibr" rid="b31-ijms-12-08347">31</xref>,<xref ref-type="bibr" rid="b32-ijms-12-08347">32</xref>]. In this paper, we use the disorder score calculated by VSL2 [<xref ref-type="bibr" rid="b33-ijms-12-08347">33</xref>] to represent the disorder status of each amino acid in the given protein sequence. The VSL2 predictor can accurately identify both long and short disordered regions [<xref ref-type="bibr" rid="b34-ijms-12-08347">34</xref>,<xref ref-type="bibr" rid="b35-ijms-12-08347">35</xref>]. The disorder score features are composed of the disorder scores of the lysine site and its surrounding sites.</p></sec>
<sec>
<title>2.2.4. Amino Acid Factors</title>
<p>The structure and function of proteins depend largely on the various properties of each of the 20 amino acids. The physicochemical properties of individual amino acids have been successfully used in lysine ubiquitylation identification [<xref ref-type="bibr" rid="b36-ijms-12-08347">36</xref>,<xref ref-type="bibr" rid="b37-ijms-12-08347">37</xref>]. AAIndex [<xref ref-type="bibr" rid="b38-ijms-12-08347">38</xref>] is a well-known database of the biochemical and physicochemical properties of amino acids. Atchley <italic>et al</italic>. [<xref ref-type="bibr" rid="b19-ijms-12-08347">19</xref>] conducted a multivariate statistical analysis of this database and summarized it in five highly interpretable, multi-dimensional numeric indices that represent electrostatic charge, codon diversity, molecular volume, secondary structure, and polarity. We therefore use these five numerical index scores (also called “amino acid factors”) to encode each amino acid in this study.</p></sec>
<sec>
<title>2.2.5. Feature Space</title>
<p>For every sample in the dataset, the feature space is composed of amino acid composition features, PSSM conservation scores, amino acid factors and disorder scores. In total, each sample is encoded by 627 features: 86 amino acid composition features, 420 (20 × 21 = 420) PSSM conservation score features, 100 (20 × 5 = 100) amino acid factor features and 21 disorder score features.</p></sec></sec>
<sec>
<title>2.3. Feature Selection Based on Normalized Conditional Mutual Information</title>
<p>Most popular feature selection methods find a single feature subset, whose discriminative capability for classification is limited [<xref ref-type="bibr" rid="b39-ijms-12-08347">39</xref>]. In fact, there are many feature subsets with good discriminative power, so we use an effective feature selection method [<xref ref-type="bibr" rid="b40-ijms-12-08347">40</xref>], Feature Selection based on Normalized Conditional Mutual Information (FSNCMI), which handles multiple feature subsets simultaneously, to predict lysine ubiquitylation sites. Unlike other ensemble methods, which use different classifiers or different sample subsets to strengthen the final prediction accuracy, FSNCMI obtains multiple feature subsets by applying the same selection technique with different starting points in its search process.</p>
<p>To measure the information shared by two features, mutual information (MI) is used here, defined as follows:</p>
<disp-formula id="FD2">
<label>(2)</label>
<mml:math id="mm2" display="block">
<mml:semantics id="sm2">
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>Y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi>d</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:munder>
<mml:mrow>
<mml:munder>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi>d</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>Y</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:munder>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mi> </mml:mi>
<mml:mtext>log</mml:mtext>
<mml:mfrac>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac></mml:mrow></mml:mrow></mml:mrow></mml:semantics></mml:math></disp-formula>
<p>where <italic>X</italic> and <italic>Y</italic> are discrete random variables with domains <italic>dom</italic>(<italic>X</italic>) and <italic>dom</italic>(<italic>Y</italic>), <italic>p</italic>(<italic>x, y</italic>) is the joint probability distribution, and <italic>p</italic>(<italic>x</italic>) and <italic>p</italic>(<italic>y</italic>) are the marginal probability distributions. According to the concepts of Shannon entropy, joint entropy and conditional entropy in information theory [<xref ref-type="bibr" rid="b41-ijms-12-08347">41</xref>], the following equalities hold:</p>
<disp-formula id="FD3">
<label>(3)</label>
<mml:math id="mm3" display="block">
<mml:semantics id="sm3">
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>Y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>H</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>-</mml:mo>
<mml:mi>H</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo>/</mml:mo>
<mml:mi>Y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>H</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>Y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>-</mml:mo>
<mml:mi>H</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>Y</mml:mi>
<mml:mo>/</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>H</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>H</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>Y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>-</mml:mo>
<mml:mi>H</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>Y</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:semantics></mml:math></disp-formula>
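<p>To make Equations (2) and (3) concrete, the sketch below estimates mutual information from two columns of discrete values and checks it against the entropy identity. It is an illustration only: probabilities are estimated by relative frequencies, the logarithm is taken to base 2 (the text leaves the base unspecified), and the function names are hypothetical. Continuous features such as PSSM scores would first have to be discretized, a step the text does not describe.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def entropy(*columns):
    """Shannon entropy in bits of the joint distribution of the given columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """I(X;Y) computed directly from the double sum of Equation (2)."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(
        c / n * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in joint.items()
    )


if __name__ == "__main__":
    x = [0, 0, 1, 1, 0, 1, 1, 0]
    y = [0, 0, 1, 1, 1, 1, 0, 0]
    direct = mutual_information(x, y)
    via_entropies = entropy(x) + entropy(y) - entropy(x, y)  # Equation (3)
    print(direct, via_entropies)
```
]]></preformat>
<p>Both routes give the same value up to floating-point rounding; two identical binary columns with balanced classes share exactly one bit, and two independent columns share none.</p>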
<p>Similarly, the conditional mutual information between <italic>X</italic> and <italic>Y</italic> given <italic>Z</italic> is defined as:</p>
<disp-formula id="FD4">
<label>(4)</label>
<mml:math id="mm4" display="block">
<mml:semantics id="sm4">
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>Y</mml:mi>
<mml:mo>/</mml:mo>
<mml:mi>Z</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>H</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>Y</mml:mi>
<mml:mo>/</mml:mo>
<mml:mi>Z</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>-</mml:mo>
<mml:mi>H</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>Y</mml:mi>
<mml:mo>/</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>Z</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:semantics></mml:math></disp-formula>
<p>where <italic>Z</italic> is a discrete random variable.</p>
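<p>Equation (4) can be sketched in the same way by writing each conditional entropy as a difference of joint entropies, <italic>H</italic>(<italic>A</italic>/<italic>B</italic>) = <italic>H</italic>(<italic>A</italic>,<italic>B</italic>) − <italic>H</italic>(<italic>B</italic>). As before, this is a hedged illustration with relative-frequency estimates, base-2 logarithms and hypothetical function names.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def entropy(*columns):
    """Shannon entropy in bits of the joint distribution of the given columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def conditional_mutual_information(x, y, z):
    """I(X;Y/Z) = H(Y/Z) - H(Y/X,Z), following Equation (4)."""
    h_y_given_z = entropy(y, z) - entropy(z)
    h_y_given_xz = entropy(x, y, z) - entropy(x, z)
    return h_y_given_z - h_y_given_xz


if __name__ == "__main__":
    x = [0, 0, 1, 1]
    z = [0, 1, 0, 1]
    y = [a ^ b for a, b in zip(x, z)]  # y is the XOR of x and z
    print(conditional_mutual_information(x, y, z))
    # If z merely copies x, x adds nothing about y beyond what z provides.
    print(conditional_mutual_information(x, x, x))
```
]]></preformat>
<p>The two cases illustrate why conditioning matters for feature selection: a variable can be informative only in combination with another (the XOR case), or can carry nothing new once a redundant variable is already known (the copy case).</p>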
<p><xref rid="FD4" ref-type="disp-formula">Equation 4</xref> represents the reduction in the uncertainty of <italic>Y</italic> due to <italic>X</italic> when <italic>Z</italic> is known. From the definition of <italic>conditional mutual information</italic>, we know that <italic>I</italic>(<italic>f;C</italic>/<italic>FS</italic>) measures the amount of information shared by the feature <italic>f</italic> and the class labels C that has not been captured by the already selected features <italic>FS</italic>. Therefore, the conditional mutual information <italic>I</italic>(<italic>f;C</italic>/<italic>FS</italic>) can be taken as the evaluation criterion <italic>J</italic>(<italic>f</italic>) that measures the significance of feature <italic>f</italic> in feature selection, and at each step the feature with maximal <italic>I</italic>(<italic>f;C</italic>/<italic>FS</italic>) is selected. Because estimating <italic>I</italic>(<italic>f;C</italic>/<italic>FS</italic>) from a multivariate probability distribution is usually unreliable, <italic>FS</italic> is normally replaced by one of its members <italic>f</italic><italic><sub>s</sub></italic> [<xref ref-type="bibr" rid="b42-ijms-12-08347">42</xref>]. Thus we have</p>
<disp-formula id="FD5">
<label>(5)</label>
<mml:math id="mm5" display="block">
<mml:semantics id="sm5">
<mml:mrow>
<mml:mi>J</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mtext>min</mml:mtext></mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi></mml:mrow>
<mml:mi>s</mml:mi></mml:msub>
<mml:mo>∈</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>S</mml:mi></mml:mrow></mml:munder>
<mml:mi> </mml:mi>
<mml:mi>I</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>f</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>C</mml:mi>
<mml:mo>/</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi></mml:mrow>
<mml:mi>s</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:semantics></mml:math></disp-formula>
<p>Note that the conditional mutual information criterion may tend to favor features that take more values [<xref ref-type="bibr" rid="b43-ijms-12-08347">43</xref>], so we normalize it by the joint entropy <italic>H</italic>(<italic>f</italic>,<italic>C</italic>) and refer to the result as normalized conditional mutual information, <italic>i.e.</italic>,</p>
<disp-formula id="FD6">
<label>(6)</label>
<mml:math id="mm6" display="block">
<mml:semantics id="sm6">
<mml:mrow>
<mml:msup>
<mml:mi>J</mml:mi>
<mml:mo>′</mml:mo></mml:msup>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mtext>min</mml:mtext></mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi></mml:mrow>
<mml:mi>s</mml:mi></mml:msub>
<mml:mo>∈</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>S</mml:mi></mml:mrow></mml:munder>
<mml:mfrac>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>f</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>C</mml:mi>
<mml:mo>/</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi></mml:mrow>
<mml:mi>s</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:mrow>
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>f</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>C</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac></mml:mrow></mml:semantics></mml:math></disp-formula>
<p>Based on the above analysis, the normalized form of conditional mutual information can also be used to evaluate the correlation between a feature and the target classes when other features are known. Using this criterion, the ensemble feature selection method is constructed as follows:</p>
<p><bold>Algorithm 1.</bold> FSNCMI: feature selection based on normalized conditional mutual information.</p>
<p><bold>Input:</bold> A ubiquitylation site dataset M = (<italic>D</italic>, <italic>F</italic>); the index of the starting point <italic>t</italic>; the number of features to select <italic>k</italic>;</p>
<p><bold>Output:</bold> A set of selected features <italic>FS</italic><italic><sub>i</sub></italic>;</p>
<table-wrap id="t5-ijms-12-08347" position="anchor">
<table frame="void" rules="none">
<tbody>
<tr>
<td align="left" valign="top">(1)</td>
<td align="left" valign="top">Initialize related parameters, <italic>FS</italic><italic><sub>i</sub></italic> =Ø, F = <italic>F</italic>, <italic>t</italic> = 0;</td></tr>
<tr>
<td align="left" valign="top">(2)</td>
<td align="left" valign="top"><bold>For</bold> each feature <italic>f</italic> ∈ F <bold>do</bold></td></tr>
<tr>
<td align="left" valign="top">(3)</td>
<td align="left" valign="top"> calculate its mutual information with the target classes <italic>C</italic>;</td></tr>
<tr>
<td align="left" valign="top">(4)</td>
<td align="left" valign="top">Sort them in a descending order;</td></tr>
<tr>
<td align="left" valign="top">(5)</td>
<td align="left" valign="top"><italic>FS</italic><italic><sub>i</sub></italic> = {<italic>f</italic><italic><sub>t</sub></italic>}, F = F − {<italic>f</italic><italic><sub>t</sub></italic>}, where <italic>f</italic><italic><sub>t</sub></italic> is the starting point;</td></tr>
<tr>
<td align="left" valign="top">(6)</td>
<td align="left" valign="top"><bold>While</bold> |<italic>FS</italic><italic><sub>i</sub></italic>| &lt; <italic>k</italic> <bold>do</bold></td></tr>
<tr>
<td align="left" valign="top">(7)</td>
<td align="left" valign="top"> <bold>For</bold> each feature <italic>f</italic> ∈ F <bold>do</bold></td></tr>
<tr>
<td align="left" valign="top">(8)</td>
<td align="left" valign="top">  Calculate its criterion <italic>J</italic>(<italic>f</italic>) according to <xref rid="FD4" ref-type="disp-formula">Equation (4)</xref>;</td></tr>
<tr>
<td align="left" valign="top">(9)</td>
<td align="left" valign="top">If <italic>J</italic>(<italic>f</italic>) = 0 then F = <italic>F</italic> − {<italic>f</italic>};</td></tr>
<tr>
<td align="left" valign="top">(10)</td>
<td align="left" valign="top">  Select the feature with the largest <italic>J</italic>(<italic>f</italic>);</td></tr>
<tr>
<td align="left" valign="top">(11)</td>
<td align="left" valign="top">  <italic>FS</italic><italic><sub>i</sub></italic> = <italic>FS</italic><italic><sub>i</sub></italic> ∪{<italic>f</italic>}, F = <italic>F</italic> − {<italic>f</italic>};</td></tr>
<tr>
<td align="left" valign="top">(12)</td>
<td align="left" valign="top"><bold>End</bold></td></tr></tbody></table></table-wrap>
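<p>The steps of Algorithm 1 can be sketched in Python as a greedy forward selection. This is a minimal sketch rather than the authors' implementation: the function name <monospace>fsncmi_sketch</monospace> is hypothetical, the mutual information with the classes and the criterion <italic>J</italic>(<italic>f</italic>) are passed in as caller-supplied functions because their formulas are defined elsewhere in the paper, and the toy scores are invented purely for illustration.</p>
<preformat><![CDATA[
```python
from typing import Callable, Hashable, List, Sequence


def fsncmi_sketch(
    features: Sequence[Hashable],
    relevance: Callable[[Hashable], float],
    criterion: Callable[[Hashable, List[Hashable]], float],
    k: int,
    t: int = 0,
) -> List[Hashable]:
    """Greedy forward selection that follows the step order of Algorithm 1."""
    # Steps (2)-(4): rank the features by mutual information with the classes.
    ranked = sorted(features, key=relevance, reverse=True)
    # Step (5): the feature at rank t is the starting point.
    selected = [ranked[t]]
    candidates = ranked[:t] + ranked[t + 1:]
    # Steps (6)-(11): grow the subset until it holds k features.
    while len(selected) < k and candidates:
        scores = {f: criterion(f, selected) for f in candidates}
        # Step (9): drop candidates whose criterion value is zero.
        candidates = [f for f in candidates if scores[f] != 0]
        if not candidates:
            break
        # Steps (10)-(11): move the best-scoring candidate into the subset.
        best = max(candidates, key=lambda f: scores[f])
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical toy scores, for illustration only (not data from the paper).
TOY_RELEVANCE = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.1}
TOY_REDUNDANCY = {frozenset("ab"): 0.7, frozenset("ad"): 0.1}


def toy_criterion(f, selected):
    # Stand-in for J(f): relevance minus the worst redundancy, floored at 0.
    worst = max(TOY_REDUNDANCY.get(frozenset((f, s)), 0.0) for s in selected)
    return max(0.0, TOY_RELEVANCE[f] - worst)


subset = fsncmi_sketch(list(TOY_RELEVANCE), TOY_RELEVANCE.get, toy_criterion, k=3)
```
]]></preformat>
<p>With these toy scores, feature "b" ranks second by relevance but is selected last because it is largely redundant with "a", and feature "d" is discarded because its criterion value is zero.</p>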
<p>The algorithm is straightforward. For the stopping condition <italic>k</italic>, the iterative procedure is terminated when the difference between <italic>J</italic>(<italic>f</italic><italic><sub>i</sub></italic>) and <italic>J</italic>(<italic>f</italic><italic><sub>i−1</sub></italic>) falls below a very small value. For the starting point, we choose <italic>t</italic> = 0, since the top-ranked features have higher mutual information and may be highly correlated with each other. To some extent, this strengthens the stability and classification performance of the classifier [<xref ref-type="bibr" rid="b42-ijms-12-08347">42</xref>].</p></sec>
<sec>
<title>2.4. Random Forests Classification</title>
<p>Random Forests (RF) is a classification algorithm that combines an ensemble of tree-structured classifiers [<xref ref-type="bibr" rid="b44-ijms-12-08347">44</xref>], and it has been applied successfully to several problems in bioinformatics [<xref ref-type="bibr" rid="b45-ijms-12-08347">45</xref>–<xref ref-type="bibr" rid="b47-ijms-12-08347">47</xref>]. In RF, each tree is grown using a subset of the possible attributes in the input vectors [<xref ref-type="bibr" rid="b48-ijms-12-08347">48</xref>]. The results in [<xref ref-type="bibr" rid="b46-ijms-12-08347">46</xref>] showed that combining multiple trees produced in randomly selected subspaces can enhance prediction performance. RF also provides an estimate of the prediction error based on out-of-bag (OOB) samples. For each tree, the OOB sample, which includes approximately one-third of the training dataset, is used to test the decision tree constructed, without pruning, from the remaining training data. The overall prediction error is then calculated by combining the results from the trees via voting, which helps to avoid overfitting on the training set while preserving accuracy. The RF algorithm is available via the link at [<xref ref-type="bibr" rid="b49-ijms-12-08347">49</xref>]. An RF implementation for MATLAB on Windows is also available at [<xref ref-type="bibr" rid="b50-ijms-12-08347">50</xref>]; it provides two functions, “classRF_train” for establishing a prediction model and “classRF_predict” for predicting the test dataset using that model. The classifier in this study is built on this RF implementation.</p></sec>
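<p>The "approximately one-third" size of the OOB sample can be checked with a short calculation. The sketch below assumes the standard bootstrap used by RF, in which each tree draws as many samples as the training set contains, with replacement; the function name is hypothetical, and the training-set size is the one reported in Section 3.1.</p>
<preformat><![CDATA[
```python
import math


def expected_oob_fraction(n_samples: int) -> float:
    """Probability that one sample is never drawn in n draws with replacement."""
    return (1.0 - 1.0 / n_samples) ** n_samples


# 298 ubiquitylation sites plus 563 non-ubiquitylation sites.
n_train = 298 + 563
fraction = expected_oob_fraction(n_train)

# For large n the fraction approaches exp(-1), i.e. a little over one-third.
limit = math.exp(-1.0)
```
]]></preformat>
<p>For a training set of this size the expected OOB fraction is already very close to the limit exp(−1) ≈ 0.368.</p>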
<sec>
<title>2.5. Model Building</title>
<p>An ensemble classifier combines the decisions of several individual classifiers into a single decision. Compared with traditional methods, an ensemble classifier can effectively improve on the classification performance, reliability and stability of an individual classifier. <xref ref-type="fig" rid="f2-ijms-12-08347">Figure 2</xref> describes the whole framework of our model.</p>
<p>As shown in <xref ref-type="fig" rid="f2-ijms-12-08347">Figure 2</xref>, the method proceeds in three steps. Firstly, each peptide is transformed into a feature vector by using amino acid composition, PSSM conservation scores, disorder scores and amino acid factors. Secondly, the FSNCMI feature selection method is used to extract <italic>P</italic> informative feature subsets, and an RF predictor is trained on each subset. Finally, the corresponding RF predictors are aggregated into a consensus using the majority voting strategy.</p></sec>
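<p>The aggregation step can be illustrated with a minimal majority-voting sketch. The function names, the label encoding (1 for ubiquitylation, 0 for non-ubiquitylation) and the tie-breaking rule are assumptions made for this example; the paper does not state how ties are resolved.</p>
<preformat><![CDATA[
```python
from collections import Counter
from typing import List, Sequence


def majority_vote(votes: Sequence[int]) -> int:
    """Return the most frequent label among the base-classifier votes."""
    counts = Counter(votes)
    top = max(counts.values())
    # Assumption: a tie goes to the smallest label (0 = non-ubiquitylation).
    return min(label for label, n in counts.items() if n == top)


def ensemble_predict(per_classifier: Sequence[Sequence[int]]) -> List[int]:
    """Combine predictions given as one row per base classifier."""
    return [majority_vote(site_votes) for site_votes in zip(*per_classifier)]


# Hypothetical predictions of three base classifiers for four sites.
predictions = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
consensus = ensemble_predict(predictions)
```
]]></preformat>
<p>Using an odd number of base classifiers avoids ties altogether for a two-class problem.</p>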
<sec>
<title>2.6. Evaluation</title>
<p>The jackknife test [<xref ref-type="bibr" rid="b51-ijms-12-08347">51</xref>] is a rigorous and objective statistical test that has been widely used to examine the performance of various predictors [<xref ref-type="bibr" rid="b52-ijms-12-08347">52</xref>–<xref ref-type="bibr" rid="b55-ijms-12-08347">55</xref>]. We therefore use it to evaluate our method as well: proteins are singled out from the dataset one by one as the testing protein, and the classifier is trained on the remaining proteins. Besides the jackknife test on the training set, we also use sub-sampling (e.g., 5- or 10-fold cross validation) and an independent test [<xref ref-type="bibr" rid="b56-ijms-12-08347">56</xref>] to evaluate our model. Since the numbers of ubiquitylation sites and non-ubiquitylation sites are imbalanced in both the training set and the independent set, the Matthews correlation coefficient (<italic>MCC</italic>) is used to measure the performance of our ensemble classifier objectively. <italic>MCC</italic> is usually regarded as a balanced measure for imbalanced data [<xref ref-type="bibr" rid="b53-ijms-12-08347">53</xref>]. Sensitivity (<italic>S</italic><italic><sub>n</sub></italic>), specificity (<italic>S</italic><italic><sub>p</sub></italic>) and accuracy (<italic>AC</italic>) are also reported. These measures are defined by the following formulas:</p>
<disp-formula id="FD7">
<label>(7)</label>
<mml:math id="mm7" display="block">
<mml:semantics id="sm7">
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi></mml:mrow>
<mml:mi>n</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi></mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi></mml:mrow>
<mml:mi>p</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi></mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
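<mml:mi>P</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>A</mml:mi>
<mml:mi>C</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi></mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>M</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>C</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi></mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>(</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>(</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>(</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>)</mml:mo></mml:mrow></mml:msqrt></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:semantics></mml:math></disp-formula>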
<p>where <italic>TP</italic>, <italic>TN</italic>, <italic>FP</italic> and <italic>FN</italic> stand for true positive, true negative, false positive and false negative, respectively.</p></sec></sec>
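<p>As an illustration, the four measures can be computed from the confusion-matrix counts as sketched below. The definitions of <italic>AC</italic> and <italic>MCC</italic> used here are the standard ones; the function names, the zero-denominator convention and the example counts are assumptions made for this sketch and are not results from this study.</p>
<preformat><![CDATA[
```python
import math


def ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero (a convention)."""
    return numerator / denominator if denominator else 0.0


def evaluate(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute Sn, Sp, AC and MCC from confusion-matrix counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": ratio(tp, tp + fn),
        "Sp": ratio(tn, tn + fp),
        "AC": ratio(tp + tn, tp + tn + fp + fn),
        "MCC": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts, for illustration only (not results from the paper).
scores = evaluate(tp=8, tn=6, fp=2, fn=4)
```
]]></preformat>
<p>For these example counts, <italic>S</italic><italic><sub>n</sub></italic> is 8/12, <italic>S</italic><italic><sub>p</sub></italic> is 6/8, <italic>AC</italic> is 14/20 and <italic>MCC</italic> is 40 divided by the square root of 9600.</p>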
<sec sec-type="results|discussion">
<title>3. Results and Discussion</title>
<sec>
<title>3.1. Prediction Performance of Our Method</title>
<p>Here, we evaluate the prediction performance of the ensemble classifier on the training set constructed in this study, which consists of 298 ubiquitylation sites and 563 non-ubiquitylation sites. The number of selected features (the stopping condition <italic>k</italic> in Algorithm 1) is an important parameter of the ensemble predictor, so <italic>k</italic> must be assigned an appropriate value. We compare the difference between <italic>J</italic>(<italic>f</italic><italic><sub>i</sub></italic>) and <italic>J</italic>(<italic>f</italic><italic><sub>i−1</sub></italic>) for <italic>i</italic> = 1 to 627, and find that at <italic>i</italic> = 13 the difference falls below the threshold of 0.015. The number of selected features in the ensemble selector is therefore set to 12. This choice is reasonable because <italic>f</italic><sub>13</sub> adds little information to the features already selected in <italic>FS</italic>.</p>
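<p>The stopping rule can be sketched as follows. The function name and the example values of <italic>J</italic> are hypothetical, and taking the absolute value of the difference is an assumption; only the threshold of 0.015 comes from the text.</p>
<preformat><![CDATA[
```python
from typing import Sequence


def choose_k(j_values: Sequence[float], threshold: float = 0.015) -> int:
    """Return how many features to keep; j_values[0] holds J(f_1)."""
    for index in range(1, len(j_values)):
        # Assumption: the "difference" is taken as an absolute value.
        if abs(j_values[index] - j_values[index - 1]) < threshold:
            # Position `index` is feature f_(index+1); keep the ones before it.
            return index
    return len(j_values)


# Hypothetical criterion values, for illustration only.
toy_j = [0.30, 0.22, 0.15, 0.145, 0.05]
k = choose_k(toy_j)
```
]]></preformat>
<p>In this toy sequence the first small difference occurs at the fourth feature, so three features are kept, which mirrors the step from <italic>i</italic> = 13 to <italic>k</italic> = 12 described above.</p>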
<p>The quantity of base classifiers (Qbc) is another factor that affects the prediction performance of an ensemble classifier, and it must also be considered when implementing the ensemble model. Intuitively, the more base classifiers an ensemble contains, the higher the accuracy the model obtains. To illustrate the relationship between Qbc and prediction performance, we run our model with various values of Qbc. The results on the training set (298 ubiquitylation sites and 563 non-ubiquitylation sites) are presented in <xref ref-type="fig" rid="f3-ijms-12-08347">Figure 3</xref>. As can be seen in the figure, the prediction performance of the ensemble model becomes stable when Qbc reaches 16. Although increasing Qbc improves the classification performance and stability of the ensemble model to some extent, it is not appropriate to employ as many base classifiers as possible: once Qbc reaches a certain value, the performance increases only slightly, whereas the computational cost of building the base classifiers rises sharply. As shown in <xref ref-type="fig" rid="f3-ijms-12-08347">Figure 3</xref>, the ensemble model obtains its highest accuracy of 76.82% when Qbc is 10.</p>
<p>Since the evaluation criterion of FSNCMI is normalized conditional mutual information, the maximum relevance and minimum redundancy (mRMR) [<xref ref-type="bibr" rid="b57-ijms-12-08347">57</xref>] feature selection method, which is also based on mutual information, is taken as the baseline. mRMR chooses features that have high relevance to the class labels and, at the same time, low redundancy with the features already selected. In our experiments, mRMR chooses the same number of features as FSNCMI for both the training dataset and the independent dataset. All experiments report <italic>S</italic><italic><sub>n</sub></italic>, <italic>S</italic><italic><sub>p</sub></italic>, <italic>AC</italic> and <italic>MCC</italic>. The results of the two feature selection methods under the jackknife test on the training set are compared in <xref ref-type="table" rid="t2-ijms-12-08347">Table 2</xref>, and their results under the 5-fold cross validation test on the test dataset are compared in <xref ref-type="table" rid="t3-ijms-12-08347">Table 3</xref>. For the training set and the independent test set, our ensemble predictor achieves accuracies of 76.82% and 79.16%, respectively, which are 9.4% and 9.96% higher than those obtained with the mRMR feature selection method. This may be due to the use of an effective feature selection method (FSNCMI), which can manipulate multiple feature subsets simultaneously.</p>
<p>Recently, two groups identified a large number of endogenous ubiquitylation sites in human cells using mass spectrometry [<xref ref-type="bibr" rid="b58-ijms-12-08347">58</xref>,<xref ref-type="bibr" rid="b59-ijms-12-08347">59</xref>]. We selected 300 lysine ubiquitylation sites from [<xref ref-type="bibr" rid="b58-ijms-12-08347">58</xref>], none of which appears in the training dataset or the test dataset. Similarly, 300 lysine residues with no annotation of ubiquitylation are regarded as non-ubiquitylation sites. On this dataset, the ensemble model obtains an accuracy of 75.26% and an <italic>MCC</italic> of 0.623. In the future, we will use these identified sites as a training dataset to further improve the prediction accuracy of our ensemble model.</p></sec>
<sec sec-type="methods">
<title>3.2. Comparison with Existing Methods</title>
<p>In this section, the proposed ensemble predictor is further compared with three recently reported predictors [<xref ref-type="bibr" rid="b13-ijms-12-08347">13</xref>–<xref ref-type="bibr" rid="b15-ijms-12-08347">15</xref>] on a publicly available dataset [<xref ref-type="bibr" rid="b15-ijms-12-08347">15</xref>], which includes 14 ubiquitylation sites and 267 non-ubiquitylation sites. The numbers of ubiquitylation sites and non-ubiquitylation sites in this dataset are highly imbalanced, a situation that is close to reality. The comparison results are shown in <xref ref-type="table" rid="t4-ijms-12-08347">Table 4</xref>. As can be seen from the table, the ensemble predictor proposed in this study obtains an <italic>MCC</italic> of 0.153, higher than those of the other three methods, with an accuracy of 71.32%. In <xref ref-type="table" rid="t4-ijms-12-08347">Table 4</xref>, “NA” means that the corresponding value is not available.</p></sec>
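<p>A short calculation shows why accuracy alone is not informative on a dataset this imbalanced. The sketch below considers a hypothetical predictor, not one of the compared methods, that labels every site as non-ubiquitylated; the variable names and the convention of setting <italic>MCC</italic> to 0 when its denominator is zero are assumptions of this example.</p>
<preformat><![CDATA[
```python
import math

n_pos, n_neg = 14, 267  # class sizes of the comparison dataset

# Hypothetical predictor that labels every site as non-ubiquitylated.
tp, fn = 0, n_pos
tn, fp = n_neg, 0

accuracy = (tp + tn) / (n_pos + n_neg)
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
# MCC is undefined when the denominator is zero; 0 is the usual convention.
mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
```
]]></preformat>
<p>This trivial predictor reaches an accuracy of 267/281, about 95%, while finding no ubiquitylation site at all, and its <italic>MCC</italic> is 0.</p>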
<sec sec-type="methods">
<title>3.3. Feature Analysis</title>
<p>As described in Section 2.2, we use four kinds of attributes to represent a peptide: amino acid composition, PSSM conservation scores, disorder scores and amino acid factors. In this section, feature analyses are performed on the training dataset. Firstly, the number of features of each type in the 10 selected subsets is counted and shown in <xref ref-type="fig" rid="f4-ijms-12-08347">Figure 4</xref>. There are 12 amino acid composition features, 75 PSSM conservation score features, 10 disorder score features and 24 amino acid factor features in the selected subsets. From <xref ref-type="fig" rid="f4-ijms-12-08347">Figure 4</xref>, we can conclude that the PSSM conservation scores play the most important role in lysine ubiquitylation prediction. In addition, most disorder score features are extracted from site 13, which suggests that the disorder status of the amino acids around the ubiquitylation site may affect the ubiquitylation process.</p>
<p>Secondly, the number of features at each amino acid site in the 10 selected subsets is counted and shown in <xref ref-type="fig" rid="f5-ijms-12-08347">Figure 5</xref>. There are clearly many more selected features at sites 7, 8, 11, 12, 14, 16, 20 and 21 than at the other sites. This phenomenon may explain why Tung and Ho [<xref ref-type="bibr" rid="b13-ijms-12-08347">13</xref>] found the best window size for ubiquitylation site prediction to be 21.</p>
<p>Finally, the number of PSSM features at each amino acid site in the 10 selected subsets is counted and shown in <xref ref-type="fig" rid="f6-ijms-12-08347">Figure 6</xref>. It can be seen that the conservation of the lysine residue plays a key role in the ubiquitylation process; moreover, the nearby sites 8, 12 and 14 and the remote sites 2, 7, 16, 17 and 20 have more PSSM conservation score features than the others.</p></sec></sec>
<sec sec-type="conclusions">
<title>4. Conclusions</title>
<p>Prediction of lysine ubiquitylation sites is important for understanding the molecular mechanism of lysine ubiquitylation in biological systems. Although some researchers have focused on this problem, the prediction accuracy is still not satisfactory. In this study, we develop an ensemble predictor for lysine ubiquitylation sites based on a new feature selection method that uses information on sequence conservation, amino acid composition, amino acid physicochemical properties and residue disorder status. The accuracy of our predictor is higher than those of state-of-the-art ubiquitylation site prediction tools. The experimental results show that our method is promising and may be a useful supplement to existing methods. Moreover, the conclusions derived from this paper might help to improve the understanding of the ubiquitylation mechanism and guide related experimental validations.</p>
<p>Since user-friendly and publicly accessible web-servers represent the future direction for developing more practically useful models, simulated methods or predictors, in our future work we will attempt to provide a web-server for the method presented in this paper.</p></sec></body>
<back>
<ack>
<title>Acknowledgments</title>
<p>This research is partially supported by the National Natural Science Foundation of China under Grant Nos. 60803102, 61070084, and also funded by NSFC Major Research Program 60496321: Basic Theory and Core Techniques of Non Canonical Knowledge.</p></ack>
<ref-list>
<title>References</title>
<ref id="b1-ijms-12-08347"><label>1</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pickart</surname><given-names>C.M.</given-names></name></person-group><article-title>Ubiquitin enters the new millennium</article-title><source>Mol. Cell</source><year>2001</year><volume>8</volume><fpage>499</fpage><lpage>504</lpage><pub-id pub-id-type="doi">10.1016/S1097-2765(01)00347-1</pub-id><pub-id pub-id-type="pmid">11583613</pub-id></citation></ref>
<ref id="b2-ijms-12-08347"><label>2</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Aguilar</surname><given-names>R.C.</given-names></name><name><surname>Wendland</surname><given-names>B.</given-names></name></person-group><article-title>Ubiquitin: Not just for proteasomes anymore</article-title><source>Curr. Opin. Cell Biol</source><year>2003</year><volume>15</volume><fpage>184</fpage><lpage>190</lpage><pub-id pub-id-type="doi">10.1016/S0955-0674(03)00010-3</pub-id><pub-id pub-id-type="pmid">12648674</pub-id></citation></ref>
<ref id="b3-ijms-12-08347"><label>3</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Saghatelian</surname><given-names>A.</given-names></name><name><surname>Cravatt</surname><given-names>B.F.</given-names></name></person-group><article-title>Assignment of protein function in the postgenomic era</article-title><source>Nat. Chem. Biol</source><year>2005</year><volume>1</volume><fpage>130</fpage><lpage>142</lpage><pub-id pub-id-type="doi">10.1038/nchembio0805-130</pub-id><pub-id pub-id-type="pmid">16408016</pub-id></citation></ref>
<ref id="b4-ijms-12-08347"><label>4</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Herrmann</surname><given-names>J.</given-names></name><name><surname>Lerman</surname><given-names>L.O.</given-names></name><name><surname>Lerman</surname><given-names>A.</given-names></name></person-group><article-title>Ubiquitin and ubiquitin-like proteins in protein regulation</article-title><source>Circ. Res</source><year>2007</year><volume>100</volume><fpage>1276</fpage><lpage>1291</lpage><pub-id pub-id-type="doi">10.1161/01.RES.0000264500.11888.f0</pub-id><pub-id pub-id-type="pmid">17495234</pub-id></citation></ref>
<ref id="b5-ijms-12-08347"><label>5</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hicke</surname><given-names>L.</given-names></name><name><surname>Dunn</surname><given-names>R.</given-names></name></person-group><article-title>Regulation of membrane protein transport by ubiquitin and ubiquitin-binding proteins</article-title><source>Annu. Rev. Cell Dev. Biol</source><year>2003</year><volume>19</volume><fpage>141</fpage><lpage>172</lpage><pub-id pub-id-type="doi">10.1146/annurev.cellbio.19.110701.154617</pub-id><pub-id pub-id-type="pmid">14570567</pub-id></citation></ref>
<ref id="b6-ijms-12-08347"><label>6</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Welchman</surname><given-names>R.L.</given-names></name><name><surname>Gordon</surname><given-names>C.</given-names></name><name><surname>Mayer</surname><given-names>R.J.</given-names></name></person-group><article-title>Ubiquitin and ubiquitin-like proteins as multifunctional signals</article-title><source>Nat. Rev. Mol. Cell Biol</source><year>2005</year><volume>6</volume><fpage>599</fpage><lpage>609</lpage><pub-id pub-id-type="doi">10.1038/nrm1700</pub-id><pub-id pub-id-type="pmid">16064136</pub-id></citation></ref>
<ref id="b7-ijms-12-08347"><label>7</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hershko</surname><given-names>A.</given-names></name><name><surname>Ciechanover</surname><given-names>A.</given-names></name></person-group><article-title>The ubiquitin system</article-title><source>Annu. Rev. Biochem</source><year>1998</year><volume>67</volume><fpage>425</fpage><lpage>479</lpage><pub-id pub-id-type="doi">10.1146/annurev.biochem.67.1.425</pub-id><pub-id pub-id-type="pmid">9759494</pub-id></citation></ref>
<ref id="b8-ijms-12-08347"><label>8</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hicke</surname><given-names>L.</given-names></name></person-group><article-title>Protein regulation by monoubiquitin</article-title><source>Nat. Rev. Mol. Cell Biol</source><year>2001</year><volume>2</volume><fpage>195</fpage><lpage>201</lpage><pub-id pub-id-type="doi">10.1038/35056583</pub-id><pub-id pub-id-type="pmid">11265249</pub-id></citation></ref>
<ref id="b9-ijms-12-08347"><label>9</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Denis</surname><given-names>N.J.</given-names></name><name><surname>Vasilescu</surname><given-names>J.</given-names></name><name><surname>Lambert</surname><given-names>J.P.</given-names></name><name><surname>Smith</surname><given-names>J.C.</given-names></name><name><surname>Figeys</surname><given-names>D.</given-names></name></person-group><article-title>Tryptic digestion of ubiquitin standards reveals an improved strategy for identifying ubiquitinated proteins by mass spectrometry</article-title><source>Proteomics</source><year>2007</year><volume>7</volume><fpage>868</fpage><lpage>874</lpage><pub-id pub-id-type="doi">10.1002/pmic.200600410</pub-id><pub-id pub-id-type="pmid">17370265</pub-id></citation></ref>
<ref id="b10-ijms-12-08347"><label>10</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hitchcock</surname><given-names>A.L.</given-names></name><name><surname>Auld</surname><given-names>K.</given-names></name><name><surname>Gygi</surname><given-names>S.P.</given-names></name><name><surname>Silver</surname><given-names>P.A.</given-names></name></person-group><article-title>A subset of membrane-associated proteins is ubiquitinated in response to mutations in the endoplasmic reticulum degradation machinery</article-title><source>Proc. Natl. Acad. Sci. USA</source><year>2003</year><volume>100</volume><fpage>12735</fpage><lpage>12740</lpage><pub-id pub-id-type="doi">10.1073/pnas.2135500100</pub-id><pub-id pub-id-type="pmid">14557538</pub-id></citation></ref>
<ref id="b11-ijms-12-08347"><label>11</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jeon</surname><given-names>H.B.</given-names></name><name><surname>Choi</surname><given-names>E.S.</given-names></name><name><surname>Yoon</surname><given-names>J.F.</given-names></name><name><surname>Hwang</surname><given-names>J.H.</given-names></name><name><surname>Chang</surname><given-names>J.W.</given-names></name><name><surname>Lee</surname><given-names>E.K.</given-names></name><name><surname>Choi</surname><given-names>H.W.</given-names></name><name><surname>Park</surname><given-names>Z.Y.</given-names></name><name><surname>Yoo</surname><given-names>Y.J.</given-names></name></person-group><article-title>A proteomics approach to identify the ubiquitinated proteins in mouse heart</article-title><source>Biochem. Biophys. Res. Commun</source><year>2007</year><volume>357</volume><fpage>731</fpage><lpage>736</lpage><pub-id pub-id-type="doi">10.1016/j.bbrc.2007.04.015</pub-id><pub-id pub-id-type="pmid">17451654</pub-id></citation></ref>
<ref id="b12-ijms-12-08347"><label>12</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kirkpatrick</surname><given-names>D.S.</given-names></name><name><surname>Weldon</surname><given-names>S.F.</given-names></name><name><surname>Tsaprailis</surname><given-names>G.</given-names></name><name><surname>Liebler</surname><given-names>D.C.</given-names></name><name><surname>Gandolfi</surname><given-names>A.J.</given-names></name></person-group><article-title>Proteomic identification of ubiquitinated proteins from human cells expressing His-tagged ubiquitin</article-title><source>Proteomics</source><year>2005</year><volume>5</volume><fpage>2104</fpage><lpage>2111</lpage><pub-id pub-id-type="doi">10.1002/pmic.200401089</pub-id><pub-id pub-id-type="pmid">15852347</pub-id></citation></ref>
<ref id="b13-ijms-12-08347"><label>13</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tung</surname><given-names>C.W.</given-names></name><name><surname>Ho</surname><given-names>S.Y.</given-names></name></person-group><article-title>Computational identification of ubiquitylation sites from protein sequences</article-title><source>BMC Bioinf</source><year>2008</year><volume>9</volume><fpage>310</fpage><lpage>324</lpage><pub-id pub-id-type="doi">10.1186/1471-2105-9-310</pub-id></citation></ref>
<ref id="b14-ijms-12-08347"><label>14</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Radivojac</surname><given-names>P.</given-names></name><name><surname>Vacic</surname><given-names>V.</given-names></name><name><surname>Haynes</surname><given-names>C.</given-names></name><name><surname>Cocklin</surname><given-names>R.R.</given-names></name><name><surname>Mohan</surname><given-names>A.</given-names></name><name><surname>Heyen</surname><given-names>J.W.</given-names></name><name><surname>Goebl</surname><given-names>M.G.</given-names></name><name><surname>Iakoucheva</surname><given-names>L.M.</given-names></name></person-group><article-title>Identification, analysis, and prediction of protein ubiquitination sites</article-title><source>Proteins</source><year>2010</year><volume>78</volume><fpage>365</fpage><lpage>380</lpage><pub-id pub-id-type="doi">10.1002/prot.22555</pub-id><pub-id pub-id-type="pmid">19722269</pub-id></citation></ref>
<ref id="b15-ijms-12-08347"><label>15</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cai</surname><given-names>Y.</given-names></name><name><surname>Huang</surname><given-names>T.</given-names></name><name><surname>Hu</surname><given-names>L.</given-names></name><name><surname>Shi</surname><given-names>X.</given-names></name><name><surname>Xie</surname><given-names>L.</given-names></name><name><surname>Li</surname><given-names>Y.</given-names></name></person-group><article-title>Prediction of lysine ubiquitination with mRMR feature selection and analysis</article-title><source>Amino Acids</source><year>2011</year><volume>17</volume><fpage>273</fpage><lpage>281</lpage></citation></ref>
<ref id="b16-ijms-12-08347"><label>16</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Roy</surname><given-names>S.</given-names></name><name><surname>Martinez</surname><given-names>A.D.</given-names></name><name><surname>Platero</surname><given-names>H.</given-names></name><name><surname>Lane</surname><given-names>T.</given-names></name><name><surname>Werner-Washburne</surname><given-names>M.</given-names></name></person-group><article-title>Exploiting amino acid composition for predicting protein-protein interactions</article-title><source>PLoS One</source><year>2009</year><volume>4</volume><pub-id pub-id-type="doi">10.1371/journal.pone.0007813</pub-id></citation></ref>
<ref id="b17-ijms-12-08347"><label>17</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jones</surname><given-names>D.T.</given-names></name></person-group><article-title>Improving the accuracy of transmembrane protein topology prediction using evolutionary information</article-title><source>Bioinformatics</source><year>2007</year><volume>23</volume><fpage>538</fpage><lpage>544</lpage><pub-id pub-id-type="doi">10.1093/bioinformatics/btl677</pub-id><pub-id pub-id-type="pmid">17237066</pub-id></citation></ref>
<ref id="b18-ijms-12-08347"><label>18</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kaur</surname><given-names>H.</given-names></name><name><surname>Raghava</surname><given-names>G.P.</given-names></name></person-group><article-title>A neural network method for prediction of beta-turn types in proteins using evolutionary information</article-title><source>Bioinformatics</source><year>2004</year><volume>20</volume><fpage>2751</fpage><lpage>2758</lpage><pub-id pub-id-type="doi">10.1093/bioinformatics/bth322</pub-id><pub-id pub-id-type="pmid">15145798</pub-id></citation></ref>
<ref id="b19-ijms-12-08347"><label>19</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Atchley</surname><given-names>W.R.</given-names></name><name><surname>Zhao</surname><given-names>J.</given-names></name><name><surname>Fernandes</surname><given-names>A.D.</given-names></name><name><surname>Druke</surname><given-names>T.</given-names></name></person-group><article-title>Solving the protein sequence metric problem</article-title><source>Proc. Natl. Acad. Sci. USA</source><year>2005</year><volume>102</volume><fpage>6395</fpage><lpage>6400</lpage><pub-id pub-id-type="doi">10.1073/pnas.0408677102</pub-id><pub-id pub-id-type="pmid">15851683</pub-id></citation></ref>
<ref id="b20-ijms-12-08347"><label>20</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Peng</surname><given-names>K.</given-names></name><name><surname>Radivojac</surname><given-names>P.</given-names></name><name><surname>Vucetic</surname><given-names>S.</given-names></name><name><surname>Dunker</surname><given-names>A.K.</given-names></name><name><surname>Obradovic</surname><given-names>Z.</given-names></name></person-group><article-title>Length-dependent prediction of protein intrinsic disorder</article-title><source>BMC Bioinf</source><year>2006</year><volume>7</volume><fpage>208</fpage><lpage>216</lpage><pub-id pub-id-type="doi">10.1186/1471-2105-7-208</pub-id></citation></ref>
<ref id="b21-ijms-12-08347"><label>21</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Boeckmann</surname><given-names>B.</given-names></name><name><surname>Bairoch</surname><given-names>A.</given-names></name><name><surname>Apweiler</surname><given-names>R.</given-names></name><name><surname>Blatter</surname><given-names>M.C.</given-names></name><name><surname>Estreicher</surname><given-names>A.</given-names></name><name><surname>Gasteiger</surname><given-names>E.</given-names></name><name><surname>Martin</surname><given-names>M.J.</given-names></name><name><surname>Michoud</surname><given-names>K.</given-names></name><name><surname>O’Donovan</surname><given-names>C.</given-names></name><name><surname>Phan</surname><given-names>I.</given-names></name><name><surname>Pilbout</surname><given-names>S.</given-names></name><name><surname>Schneider</surname><given-names>M.</given-names></name></person-group><article-title>The SWISS-PROT protein knowledgebase and its supplement TrEMBL</article-title><source>Nucleic Acids Res</source><year>2003</year><volume>31</volume><fpage>365</fpage><lpage>370</lpage><pub-id pub-id-type="doi">10.1093/nar/gkg095</pub-id><pub-id pub-id-type="pmid">12520024</pub-id></citation></ref>
<ref id="b22-ijms-12-08347"><label>22</label><citation citation-type="web"><source>UniProt database</source><comment>Available online: <ext-link xlink:href="http://www.uniprot.org/" ext-link-type="uri">http://www.uniprot.org/</ext-link></comment><access-date>Accessed on 26 May 2010</access-date></citation></ref>
<ref id="b23-ijms-12-08347"><label>23</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname><given-names>W.</given-names></name><name><surname>Godzik</surname><given-names>A.</given-names></name></person-group><article-title>Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences</article-title><source>Bioinformatics</source><year>2006</year><volume>22</volume><fpage>1658</fpage><lpage>1659</lpage><pub-id pub-id-type="doi">10.1093/bioinformatics/btl158</pub-id><pub-id pub-id-type="pmid">16731699</pub-id></citation></ref>
<ref id="b24-ijms-12-08347"><label>24</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Anand</surname><given-names>A.</given-names></name><name><surname>Pugalenthi</surname><given-names>G.</given-names></name><name><surname>Suganthan</surname><given-names>P.N.</given-names></name></person-group><article-title>Predicting Protein Structural Class by SVM with Class-wise Optimized Features and Decision Probabilities</article-title><source>J. Theor. Biol</source><year>2008</year><volume>253</volume><fpage>375</fpage><lpage>380</lpage><pub-id pub-id-type="doi">10.1016/j.jtbi.2008.02.031</pub-id><pub-id pub-id-type="pmid">18423492</pub-id></citation></ref>
<ref id="b25-ijms-12-08347"><label>25</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xiao</surname><given-names>X.</given-names></name><name><surname>Wang</surname><given-names>P.</given-names></name><name><surname>Chou</surname><given-names>K.C.</given-names></name></person-group><article-title>Predicting protein structural classes with pseudo amino acid composition: An approach using geometric moments of cellular automaton image</article-title><source>J. Theor. Biol</source><year>2008</year><volume>254</volume><fpage>691</fpage><lpage>696</lpage><pub-id pub-id-type="doi">10.1016/j.jtbi.2008.06.016</pub-id><pub-id pub-id-type="pmid">18634802</pub-id></citation></ref>
<ref id="b26-ijms-12-08347"><label>26</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pugalenthi</surname><given-names>G.</given-names></name><name><surname>Tang</surname><given-names>K.</given-names></name><name><surname>Suganthan</surname><given-names>P.N.</given-names></name><name><surname>Archunan</surname><given-names>G.</given-names></name><name><surname>Sowdhamini</surname><given-names>R.</given-names></name></person-group><article-title>A machine learning approach for the identification of odorant binding proteins from sequence-derived properties</article-title><source>BMC Bioinf</source><year>2007</year><volume>19</volume><fpage>351</fpage><lpage>362</lpage></citation></ref>
<ref id="b27-ijms-12-08347"><label>27</label><citation citation-type="web"><source>NR database</source><comment>Available online: <ext-link xlink:href="ftp://ftp.ncbi.nih.gov/blast/db/nr" ext-link-type="ftp">ftp://ftp.ncbi.nih.gov/blast/db/nr</ext-link></comment><access-date>Accessed on 23 June 2011</access-date></citation></ref>
<ref id="b28-ijms-12-08347"><label>28</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Altschul</surname><given-names>S.F.</given-names></name><name><surname>Madden</surname><given-names>T.L.</given-names></name><name><surname>Schaffer</surname><given-names>A.A.</given-names></name><name><surname>Zhang</surname><given-names>J.</given-names></name><name><surname>Miller</surname><given-names>W.</given-names></name><name><surname>Lipman</surname><given-names>D.J.</given-names></name></person-group><article-title>Gapped BLAST and PSI-BLAST: A new generation of protein database search programs</article-title><source>Nucleic Acids Res</source><year>1997</year><volume>25</volume><fpage>3389</fpage><lpage>3402</lpage><pub-id pub-id-type="doi">10.1093/nar/25.17.3389</pub-id><pub-id pub-id-type="pmid">9254694</pub-id></citation></ref>
<ref id="b29-ijms-12-08347"><label>29</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wright</surname><given-names>P.E.</given-names></name><name><surname>Dyson</surname><given-names>H.J.</given-names></name></person-group><article-title>Intrinsically unstructured proteins: Reassessing the protein structure-function paradigm</article-title><source>J. Mol. Biol</source><year>1999</year><volume>293</volume><fpage>321</fpage><lpage>331</lpage><pub-id pub-id-type="doi">10.1006/jmbi.1999.3110</pub-id><pub-id pub-id-type="pmid">10550212</pub-id></citation></ref>
<ref id="b30-ijms-12-08347"><label>30</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dunker</surname><given-names>A.K.</given-names></name><name><surname>Brown</surname><given-names>C.J.</given-names></name><name><surname>Lawson</surname><given-names>J.D.</given-names></name><name><surname>Iakoucheva</surname><given-names>L.M.</given-names></name><name><surname>Obradovic</surname><given-names>Z.</given-names></name></person-group><article-title>Intrinsic disorder and protein function</article-title><source>Biochemistry</source><year>2002</year><volume>41</volume><fpage>6573</fpage><lpage>6582</lpage><pub-id pub-id-type="doi">10.1021/bi012159+</pub-id><pub-id pub-id-type="pmid">12022860</pub-id></citation></ref>
<ref id="b31-ijms-12-08347"><label>31</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname><given-names>J.</given-names></name><name><surname>Tan</surname><given-names>H.</given-names></name><name><surname>Rost</surname><given-names>B.</given-names></name></person-group><article-title>Loopy proteins appear conserved in evolution</article-title><source>J. Mol. Biol</source><year>2002</year><volume>322</volume><fpage>53</fpage><lpage>64</lpage><pub-id pub-id-type="doi">10.1016/S0022-2836(02)00736-2</pub-id><pub-id pub-id-type="pmid">12215414</pub-id></citation></ref>
<ref id="b32-ijms-12-08347"><label>32</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tompa</surname><given-names>P.</given-names></name></person-group><article-title>Intrinsically unstructured proteins</article-title><source>Trends Biochem. Sci</source><year>2002</year><volume>27</volume><fpage>527</fpage><lpage>533</lpage><pub-id pub-id-type="doi">10.1016/S0968-0004(02)02169-2</pub-id><pub-id pub-id-type="pmid">12368089</pub-id></citation></ref>
<ref id="b33-ijms-12-08347"><label>33</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Peng</surname><given-names>K.</given-names></name><name><surname>Radivojac</surname><given-names>P.</given-names></name><name><surname>Vucetic</surname><given-names>S.</given-names></name><name><surname>Dunker</surname><given-names>A.K.</given-names></name><name><surname>Obradovic</surname><given-names>Z.</given-names></name></person-group><article-title>Length-dependent prediction of protein intrinsic disorder</article-title><source>BMC Bioinf</source><year>2006</year><volume>7</volume><fpage>208</fpage><lpage>217</lpage><pub-id pub-id-type="doi">10.1186/1471-2105-7-208</pub-id></citation></ref>
<ref id="b34-ijms-12-08347"><label>34</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bordoli</surname><given-names>L.</given-names></name><name><surname>Kiefer</surname><given-names>F.</given-names></name><name><surname>Schwede</surname><given-names>T.</given-names></name></person-group><article-title>Assessment of disorder predictions in CASP7</article-title><source>Proteins</source><year>2007</year><volume>69</volume><fpage>129</fpage><lpage>136</lpage><pub-id pub-id-type="doi">10.1002/prot.21671</pub-id><pub-id pub-id-type="pmid">17680688</pub-id></citation></ref>
<ref id="b35-ijms-12-08347"><label>35</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>He</surname><given-names>B.</given-names></name><name><surname>Wang</surname><given-names>K.</given-names></name><name><surname>Liu</surname><given-names>Y.</given-names></name><name><surname>Xue</surname><given-names>B.</given-names></name><name><surname>Uversky</surname><given-names>V.N.</given-names></name><name><surname>Dunker</surname><given-names>A.K.</given-names></name></person-group><article-title>Predicting intrinsic disorder in proteins: an overview</article-title><source>Cell Res</source><year>2009</year><volume>19</volume><fpage>929</fpage><lpage>949</lpage><pub-id pub-id-type="doi">10.1038/cr.2009.87</pub-id><pub-id pub-id-type="pmid">19597536</pub-id></citation></ref>
<ref id="b36-ijms-12-08347"><label>36</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Matsumoto</surname><given-names>M.</given-names></name><name><surname>Hatakeyama</surname><given-names>S.</given-names></name><name><surname>Oyamada</surname><given-names>K.</given-names></name><name><surname>Oda</surname><given-names>Y.</given-names></name><name><surname>Nishimura</surname><given-names>T.</given-names></name><name><surname>Nakayama</surname><given-names>K.I.</given-names></name></person-group><article-title>Large-scale analysis of the human ubiquitin-related proteome</article-title><source>Proteomics</source><year>2005</year><volume>5</volume><fpage>4145</fpage><lpage>4151</lpage><pub-id pub-id-type="doi">10.1002/pmic.200401280</pub-id><pub-id pub-id-type="pmid">16196087</pub-id></citation></ref>
<ref id="b37-ijms-12-08347"><label>37</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Peng</surname><given-names>J.</given-names></name><name><surname>Schwartz</surname><given-names>D.</given-names></name><name><surname>Elias</surname><given-names>J.E.</given-names></name><name><surname>Thoreen</surname><given-names>C.C.</given-names></name><name><surname>Cheng</surname><given-names>D.</given-names></name><name><surname>Marsischky</surname><given-names>G.</given-names></name><name><surname>Roelofs</surname><given-names>J.</given-names></name><name><surname>Finley</surname><given-names>D.</given-names></name><name><surname>Gygi</surname><given-names>S.P.</given-names></name></person-group><article-title>A proteomics approach to understanding protein ubiquitination</article-title><source>Nat. Biotechnol</source><year>2003</year><volume>21</volume><fpage>921</fpage><lpage>926</lpage><pub-id pub-id-type="doi">10.1038/nbt849</pub-id><pub-id pub-id-type="pmid">12872131</pub-id></citation></ref>
<ref id="b38-ijms-12-08347"><label>38</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kawashima</surname><given-names>S.</given-names></name><name><surname>Pokarowski</surname><given-names>P.</given-names></name><name><surname>Pokarowska</surname><given-names>M.</given-names></name><name><surname>Kolinski</surname><given-names>A.</given-names></name><name><surname>Katayama</surname><given-names>T.</given-names></name><name><surname>Kanehisa</surname><given-names>M.</given-names></name></person-group><article-title>AAindex: amino acid index database, progress report</article-title><source>Nucleic Acids Res</source><year>2008</year><volume>36</volume><fpage>202</fpage><lpage>205</lpage><pub-id pub-id-type="doi">10.1093/nar/gkn255</pub-id></citation></ref>
<ref id="b39-ijms-12-08347"><label>39</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Levi</surname><given-names>D.</given-names></name><name><surname>Ullman</surname><given-names>S.</given-names></name></person-group><article-title>Learning to classify by ongoing feature selection</article-title><source>Image Vis. Comput</source><year>2010</year><volume>28</volume><fpage>715</fpage><lpage>723</lpage><pub-id pub-id-type="doi">10.1016/j.imavis.2008.10.010</pub-id></citation></ref>
<ref id="b40-ijms-12-08347"><label>40</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname><given-names>H.W.</given-names></name><name><surname>Liu</surname><given-names>L.</given-names></name><name><surname>Zhang</surname><given-names>H.J.</given-names></name></person-group><article-title>Ensemble gene selection for cancer classification</article-title><source>Pattern Recognit</source><year>2010</year><volume>43</volume><fpage>2763</fpage><lpage>2772</lpage><pub-id pub-id-type="doi">10.1016/j.patcog.2010.02.008</pub-id></citation></ref>
<ref id="b41-ijms-12-08347"><label>41</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Cover</surname><given-names>T.M.</given-names></name><name><surname>Thomas</surname><given-names>J.A.</given-names></name></person-group><source>Elements of Information Theory</source><publisher-name>Wiley</publisher-name><publisher-loc>New York, NY, USA</publisher-loc><year>1991</year></citation></ref>
<ref id="b42-ijms-12-08347"><label>42</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fleuret</surname><given-names>F.</given-names></name></person-group><article-title>Fast binary feature selection with conditional mutual information</article-title><source>J. Mach. Learn. Res</source><year>2004</year><volume>5</volume><fpage>1531</fpage><lpage>1555</lpage></citation></ref>
<ref id="b43-ijms-12-08347"><label>43</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yu</surname><given-names>L.</given-names></name><name><surname>Liu</surname><given-names>H.</given-names></name></person-group><article-title>Efficient feature selection via analysis of relevance and redundancy</article-title><source>J. Mach. Learn. Res</source><year>2004</year><volume>5</volume><fpage>1205</fpage><lpage>1224</lpage></citation></ref>
<ref id="b44-ijms-12-08347"><label>44</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Breiman</surname><given-names>L.</given-names></name></person-group><article-title>Random forests</article-title><source>Mach. Learn</source><year>2001</year><volume>45</volume><fpage>5</fpage><lpage>32</lpage><pub-id pub-id-type="doi">10.1023/A:1010933404324</pub-id></citation></ref>
<ref id="b45-ijms-12-08347"><label>45</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sikic</surname><given-names>M.</given-names></name><name><surname>Tomic</surname><given-names>S.</given-names></name><name><surname>Vlahovicek</surname><given-names>K.</given-names></name></person-group><article-title>Prediction of protein-protein interaction sites in sequences and 3D structures by random forests</article-title><source>PLoS Comput. Biol</source><year>2009</year><volume>5</volume><fpage>e1000278:1</fpage><lpage>e1000278:9</lpage></citation></ref>
<ref id="b46-ijms-12-08347"><label>46</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname><given-names>J.</given-names></name><name><surname>Liu</surname><given-names>H.</given-names></name><name><surname>Duan</surname><given-names>X.</given-names></name><name><surname>Ding</surname><given-names>Y.</given-names></name><name><surname>Wu</surname><given-names>H.</given-names></name><name><surname>Bai</surname><given-names>Y.</given-names></name><name><surname>Sun</surname><given-names>X.</given-names></name></person-group><article-title>Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature</article-title><source>Bioinformatics</source><year>2009</year><volume>25</volume><fpage>30</fpage><lpage>35</lpage><pub-id pub-id-type="doi">10.1093/bioinformatics/btn583</pub-id><pub-id pub-id-type="pmid">19008251</pub-id></citation></ref>
<ref id="b47-ijms-12-08347"><label>47</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ma</surname><given-names>X.</given-names></name><name><surname>Guo</surname><given-names>J.</given-names></name><name><surname>Wu</surname><given-names>J.</given-names></name><name><surname>Liu</surname><given-names>H.</given-names></name><name><surname>Yu</surname><given-names>J.</given-names></name><name><surname>Xie</surname><given-names>J.</given-names></name><name><surname>Sun</surname><given-names>X.</given-names></name></person-group><article-title>Prediction of RNA-binding residues in proteins from primary sequence using an enriched random forest model with a novel hybrid feature</article-title><source>Proteins</source><year>2011</year><volume>79</volume><fpage>1230</fpage><lpage>1239</lpage><pub-id pub-id-type="doi">10.1002/prot.22958</pub-id><pub-id pub-id-type="pmid">21268114</pub-id></citation></ref>
<ref id="b48-ijms-12-08347"><label>48</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Skurichina</surname><given-names>M.</given-names></name><name><surname>Kuncheva</surname><given-names>L.I.</given-names></name><name><surname>Duin</surname><given-names>R.P.W.</given-names></name></person-group><article-title>Bagging, Boosting, and the Random Subspace Method for Linear Classifier</article-title><source>Pattern Anal. Appl</source><year>2002</year><volume>5</volume><fpage>102</fpage><lpage>112</lpage><pub-id pub-id-type="doi">10.1007/s100440200009</pub-id></citation></ref>
<ref id="b49-ijms-12-08347"><label>49</label><citation citation-type="web"><person-group person-group-type="author"><name><surname>Breiman</surname><given-names>L.</given-names></name><name><surname>Cutler</surname><given-names>A.</given-names></name></person-group><source>Random Forests</source><comment>Available online: <ext-link xlink:href="http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm" ext-link-type="uri">http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm</ext-link></comment><access-date>Accessed on 12 June 2011</access-date></citation></ref>
<ref id="b50-ijms-12-08347"><label>50</label><citation citation-type="web"><source>Randomforest-matlab</source><comment>Available online: <ext-link xlink:href="http://code.google.com/p/randomforest-matlab/" ext-link-type="uri">http://code.google.com/p/randomforest-matlab/</ext-link></comment><access-date>Accessed on 12 June 2011</access-date></citation></ref>
<ref id="b51-ijms-12-08347"><label>51</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname><given-names>T.L.</given-names></name><name><surname>Zheng</surname><given-names>X.Q.</given-names></name><name><surname>Wang</surname><given-names>J.</given-names></name></person-group><article-title>Prediction of protein structural class for low-similarity sequences using support vector machine and PSI-BLAST profile</article-title><source>Biochimie</source><year>2010</year><volume>92</volume><fpage>1330</fpage><lpage>1334</lpage><pub-id pub-id-type="doi">10.1016/j.biochi.2010.06.013</pub-id></citation></ref>
<ref id="b52-ijms-12-08347"><label>52</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chou</surname><given-names>K.C.</given-names></name><name><surname>Zhang</surname><given-names>C.T.</given-names></name></person-group><article-title>Prediction of protein structural classes</article-title><source>Crit. Rev. Biochem. Mol. Biol</source><year>1995</year><volume>30</volume><fpage>275</fpage><lpage>349</lpage></citation></ref>
<ref id="b53-ijms-12-08347"><label>53</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chou</surname><given-names>K.C.</given-names></name><name><surname>Shen</surname><given-names>H.B.</given-names></name></person-group><article-title>Recent progress in protein subcellular location prediction</article-title><source>Anal. Biochem</source><year>2007</year><volume>370</volume><fpage>1</fpage><lpage>16</lpage><pub-id pub-id-type="doi">10.1016/j.ab.2007.07.006</pub-id><pub-id pub-id-type="pmid">17698024</pub-id></citation></ref>
<ref id="b54-ijms-12-08347"><label>54</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zheng</surname><given-names>X.</given-names></name><name><surname>Liu</surname><given-names>T.</given-names></name><name><surname>Wang</surname><given-names>J.</given-names></name></person-group><article-title>A complexity-based method for predicting protein subcellular location</article-title><source>Amino Acids</source><year>2009</year><volume>37</volume><fpage>427</fpage><lpage>433</lpage><pub-id pub-id-type="doi">10.1007/s00726-008-0172-0</pub-id><pub-id pub-id-type="pmid">18719852</pub-id></citation></ref>
<ref id="b55-ijms-12-08347"><label>55</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shen</surname><given-names>H.B.</given-names></name><name><surname>Chou</surname><given-names>K.C.</given-names></name></person-group><article-title>Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition</article-title><source>Biochem. Biophys. Res. Commun</source><year>2005</year><volume>337</volume><fpage>752</fpage><lpage>756</lpage><pub-id pub-id-type="doi">10.1016/j.bbrc.2005.09.117</pub-id><pub-id pub-id-type="pmid">16213466</pub-id></citation></ref>
<ref id="b56-ijms-12-08347"><label>56</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chou</surname><given-names>K.C.</given-names></name><name><surname>Shen</surname><given-names>H.B.</given-names></name></person-group><article-title>Cell-PLoc: A package of web-servers for predicting subcellular localization of proteins in various organisms</article-title><source>Nat. Protoc</source><year>2008</year><volume>3</volume><fpage>153</fpage><lpage>162</lpage><pub-id pub-id-type="doi">10.1038/nprot.2007.494</pub-id><pub-id pub-id-type="pmid">18274516</pub-id></citation></ref>
<ref id="b57-ijms-12-08347"><label>57</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Peng</surname><given-names>H.</given-names></name><name><surname>Long</surname><given-names>F.</given-names></name><name><surname>Ding</surname><given-names>C.</given-names></name></person-group><article-title>Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy</article-title><source>IEEE Trans. Pattern Anal. Mach. Intell</source><year>2005</year><volume>27</volume><fpage>1226</fpage><lpage>1238</lpage><pub-id pub-id-type="doi">10.1109/TPAMI.2005.159</pub-id><pub-id pub-id-type="pmid">16119262</pub-id></citation></ref>
<ref id="b58-ijms-12-08347"><label>58</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wagner</surname><given-names>S.A.</given-names></name><name><surname>Beli</surname><given-names>P.</given-names></name><name><surname>Weinert</surname><given-names>B.T.</given-names></name><name><surname>Nielsen</surname><given-names>M.L.</given-names></name><name><surname>Cox</surname><given-names>J.</given-names></name><name><surname>Mann</surname><given-names>M.</given-names></name><name><surname>Choudhary</surname><given-names>C.</given-names></name></person-group><article-title>A proteome-wide, quantitative survey of <italic>in vivo</italic> ubiquitylation sites reveals widespread regulatory roles</article-title><source>Mol. Cell. Proteomics</source><year>2011</year><pub-id pub-id-type="doi">10.1074/mcp.M111.013284</pub-id></citation></ref>
<ref id="b59-ijms-12-08347"><label>59</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname><given-names>W.</given-names></name><name><surname>Bennett</surname><given-names>E.J.</given-names></name><name><surname>Huttlin</surname><given-names>E.L.</given-names></name><name><surname>Guo</surname><given-names>A.</given-names></name><name><surname>Li</surname><given-names>J.</given-names></name><name><surname>Possemato</surname><given-names>A.</given-names></name><name><surname>Sowa</surname><given-names>M.E.</given-names></name><name><surname>Rad</surname><given-names>R.</given-names></name><name><surname>Rush</surname><given-names>J.</given-names></name><name><surname>Comb</surname><given-names>M.J.</given-names></name><etal/></person-group><article-title>Systematic and quantitative assessment of the Ubiquitin-modified proteome</article-title><source>Mol. Cell</source><year>2011</year><volume>44</volume><fpage>325</fpage><lpage>340</lpage><pub-id pub-id-type="doi">10.1016/j.molcel.2011.08.025</pub-id><pub-id pub-id-type="pmid">21906983</pub-id></citation></ref></ref-list>
<sec sec-type="display-objects">
<title>Figures and Tables</title>
<fig id="f1-ijms-12-08347" position="float">
<label>Figure 1</label>
<caption>
<p>Schematic representation of the transformation of each protein sequence into an <italic>L</italic> × 20 position-specific scoring matrix (PSSM); the rows represent the protein residues and the columns represent the 20 amino acids.</p></caption>
<graphic xlink:href="ijms-12-08347f1.gif"/></fig>
<fig id="f2-ijms-12-08347" position="float">
<label>Figure 2</label>
<caption>
<p>The framework of the ensemble model.</p></caption>
<graphic xlink:href="ijms-12-08347f2.gif"/></fig>
<fig id="f3-ijms-12-08347" position="float">
<label>Figure 3</label>
<caption>
<p>The relationship between prediction performance and the number of base classifiers.</p></caption>
<graphic xlink:href="ijms-12-08347f3.gif"/></fig>
<fig id="f4-ijms-12-08347" position="float">
<label>Figure 4</label>
<caption>
<p>The number of each type of feature in the 10 selected subsets.</p></caption>
<graphic xlink:href="ijms-12-08347f4.gif"/></fig>
<fig id="f5-ijms-12-08347" position="float">
<label>Figure 5</label>
<caption>
<p>The number of all features at each site in the 10 selected subsets.</p></caption>
<graphic xlink:href="ijms-12-08347f5.gif"/></fig>
<fig id="f6-ijms-12-08347" position="float">
<label>Figure 6</label>
<caption>
<p>The number of PSSM features at each site in the 10 selected subsets.</p></caption>
<graphic xlink:href="ijms-12-08347f6.gif"/></fig>
<table-wrap id="t1-ijms-12-08347" position="float">
<label>Table 1</label>
<caption>
<p>The number of ubiquitylation and non-ubiquitylation sites in each dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center" valign="bottom">Dataset</th>
<th align="center" valign="bottom">Number of ubiquitylation sites</th>
<th align="center" valign="bottom">Number of non-ubiquitylation sites</th></tr></thead>
<tbody>
<tr>
<td align="center" valign="top">Training dataset</td>
<td align="center" valign="top">298</td>
<td align="center" valign="top">563</td></tr>
<tr>
<td align="center" valign="top">Test dataset</td>
<td align="center" valign="top">170</td>
<td align="center" valign="top">357</td></tr>
<tr>
<td align="center" valign="top">Independent dataset</td>
<td align="center" valign="top">14</td>
<td align="center" valign="top">267</td></tr></tbody></table></table-wrap>
<table-wrap id="t2-ijms-12-08347" position="float">
<label>Table 2</label>
<caption>
<p>The performance comparison of the two feature selection methods on the training dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center" valign="bottom">Method</th>
<th align="center" valign="bottom"><italic>S</italic><italic><sub>n</sub></italic> (%)</th>
<th align="center" valign="bottom"><italic>S</italic><italic><sub>p</sub></italic> (%)</th>
<th align="center" valign="bottom"><italic>AC</italic> (%)</th>
<th align="center" valign="bottom"><italic>MCC</italic></th></tr></thead>
<tbody>
<tr>
<td align="center" valign="top">mRMR [<xref ref-type="bibr" rid="b57-ijms-12-08347">57</xref>]</td>
<td align="center" valign="top">64.76 ± 2.12</td>
<td align="center" valign="top">68.21 ± 3.52</td>
<td align="center" valign="top">67.42 ± 1.37</td>
<td align="center" valign="top">0.282 ± 0.13</td></tr>
<tr>
<td align="center" valign="top">This paper</td>
<td align="center" valign="top">76.85 ± 1.84</td>
<td align="center" valign="top">76.91 ± 2.09</td>
<td align="center" valign="top">76.82 ± 1.03</td>
<td align="center" valign="top">0.519 ± 0.08</td></tr></tbody></table></table-wrap>
<table-wrap id="t3-ijms-12-08347" position="float">
<label>Table 3</label>
<caption>
<p>The performance comparison of the two feature selection methods on the test dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center" valign="bottom">Method</th>
<th align="center" valign="bottom"><italic>S</italic><italic><sub>n</sub></italic> (%)</th>
<th align="center" valign="bottom"><italic>S</italic><italic><sub>p</sub></italic> (%)</th>
<th align="center" valign="bottom"><italic>AC</italic> (%)</th>
<th align="center" valign="bottom"><italic>MCC</italic></th></tr></thead>
<tbody>
<tr>
<td align="center" valign="top">mRMR [<xref ref-type="bibr" rid="b57-ijms-12-08347">57</xref>]</td>
<td align="center" valign="top">51.68 ± 1.35</td>
<td align="center" valign="top">74.22 ± 0.92</td>
<td align="center" valign="top">69.20 ± 1.06</td>
<td align="center" valign="top">0.229 ± 0.09</td></tr>
<tr>
<td align="center" valign="top">This paper</td>
<td align="center" valign="top">72.61 ± 2.34</td>
<td align="center" valign="top">81.27 ± 0.76</td>
<td align="center" valign="top">79.16 ± 0.98</td>
<td align="center" valign="top">0.503 ± 0.07</td></tr></tbody></table></table-wrap>
<table-wrap id="t4-ijms-12-08347" position="float">
<label>Table 4</label>
<caption>
<p>The performance comparison of different predictors on the independent dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center" valign="bottom">Predictor</th>
<th align="center" valign="bottom"><italic>S</italic><italic><sub>n</sub></italic> (%)</th>
<th align="center" valign="bottom"><italic>S</italic><italic><sub>p</sub></italic> (%)</th>
<th align="center" valign="bottom"><italic>AC</italic> (%)</th>
<th align="center" valign="bottom"><italic>MCC</italic></th></tr></thead>
<tbody>
<tr>
<td align="center" valign="top">mRMRPred [<xref ref-type="bibr" rid="b15-ijms-12-08347">15</xref>]</td>
<td align="center" valign="top">34.34</td>
<td align="center" valign="top">79.67</td>
<td align="center" valign="top">68.34</td>
<td align="center" valign="top">0.139</td></tr>
<tr>
<td align="center" valign="top">UbiPred [<xref ref-type="bibr" rid="b13-ijms-12-08347">13</xref>]</td>
<td align="center" valign="top">NA</td>
<td align="center" valign="top">NA</td>
<td align="center" valign="top">NA</td>
<td align="center" valign="top">0.135</td></tr>
<tr>
<td align="center" valign="top">UbPred [<xref ref-type="bibr" rid="b14-ijms-12-08347">14</xref>]</td>
<td align="center" valign="top">NA</td>
<td align="center" valign="top">NA</td>
<td align="center" valign="top">NA</td>
<td align="center" valign="top">0.117</td></tr>
<tr>
<td align="center" valign="top">This paper</td>
<td align="center" valign="top">57.14 ± 1.39</td>
<td align="center" valign="top">74.15 ± 0.95</td>
<td align="center" valign="top">71.32 ± 1.26</td>
<td align="center" valign="top">0.153 ± 0.06</td></tr></tbody></table></table-wrap></sec></back></article>
