<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">ijms</journal-id>
<journal-title>International Journal of Molecular Sciences</journal-title>
<abbrev-journal-title>Int. J. Mol. Sci.</abbrev-journal-title>
<issn pub-type="epub">1422-0067</issn>
<publisher>
<publisher-name>Molecular Diversity Preservation International (MDPI)</publisher-name></publisher></journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3390/ijms13033650</article-id>
<article-id pub-id-type="publisher-id">ijms-13-03650</article-id>
<article-categories>
<subj-group>
<subject>Article</subject></subj-group></article-categories>
<title-group>
<article-title>Prediction of Bioluminescent Proteins Using Auto Covariance Transformation of Evolutional Profiles</article-title></title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Zhao</surname><given-names>Xiaowei</given-names></name><xref ref-type="aff" rid="af1-ijms-13-03650">1</xref><xref ref-type="aff" rid="af2-ijms-13-03650">2</xref></contrib>
<contrib contrib-type="author">
<name><surname>Li</surname><given-names>Jiakui</given-names></name><xref ref-type="aff" rid="af1-ijms-13-03650">1</xref></contrib>
<contrib contrib-type="author">
<name><surname>Huang</surname><given-names>Yanxin</given-names></name><xref ref-type="aff" rid="af3-ijms-13-03650">3</xref><xref ref-type="corresp" rid="c1-ijms-13-03650">*</xref></contrib>
<contrib contrib-type="author">
<name><surname>Ma</surname><given-names>Zhiqiang</given-names></name><xref ref-type="aff" rid="af2-ijms-13-03650">2</xref><xref ref-type="corresp" rid="c1-ijms-13-03650">*</xref></contrib>
<contrib contrib-type="author">
<name><surname>Yin</surname><given-names>Minghao</given-names></name><xref ref-type="aff" rid="af1-ijms-13-03650">1</xref><xref ref-type="corresp" rid="c1-ijms-13-03650">*</xref></contrib></contrib-group>
<aff id="af1-ijms-13-03650">
<label>1</label>School of Computer Science and Information Technology, Northeast Normal University, Changchun 130117, China; E-Mails: <email>zhaoxw303@nenu.edu.cn</email> (X.Z.); <email>lijk136@126.com</email> (J.L.)</aff>
<aff id="af2-ijms-13-03650">
<label>2</label>School of Life Sciences, Northeast Normal University, Changchun 130024, China</aff>
<aff id="af3-ijms-13-03650">
<label>3</label>National Engineering Laboratory for Druggable Gene and Protein Screening, Northeast Normal University, Changchun 130024, China</aff>
<author-notes>
<corresp id="c1-ijms-13-03650">
<label>*</label>Authors to whom correspondence should be addressed; E-Mails: <email>huangyx356@nenu.edu.cn</email> (Y.H.); <email>zhiqiang.ma967@gmail.com</email> (Z.M.); <email>minghao.yin197@gmail.com</email> (M.Y.); Tel./Fax: +86-0431-845-36338 (Z.M.).</corresp></author-notes>
<pub-date pub-type="collection">
<year>2012</year></pub-date>
<pub-date pub-type="epub">
<day>19</day>
<month>3</month>
<year>2012</year></pub-date>
<volume>13</volume>
<issue>3</issue>
<fpage>3650</fpage>
<lpage>3660</lpage>
<history>
<date date-type="received">
<day>10</day>
<month>1</month>
<year>2012</year></date>
<date date-type="rev-recd">
<day>21</day>
<month>2</month>
<year>2012</year></date>
<date date-type="accepted">
<day>05</day>
<month>3</month>
<year>2012</year></date></history>
<permissions>
<copyright-statement>© 2012 by the authors; licensee Molecular Diversity Preservation International, Basel, Switzerland.</copyright-statement>
<copyright-year>2012</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/3.0">
<p>This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).</p></license></permissions>
<abstract>
<p>Bioluminescent proteins are important tools in applications such as gene expression analysis, drug discovery, bioluminescent imaging, toxicity determination, and DNA sequencing studies. Hence, the correct identification of bioluminescent proteins is of great importance, both for aiding genome annotation and for supplementing experimental research into the functions of bioluminescent proteins. However, few computational methods are available for identifying bioluminescent proteins. In this paper we therefore develop a new method to predict bioluminescent proteins using a model based on the position specific scoring matrix and auto covariance. The accuracy of the proposed model reaches 85.17% in 10-fold cross-validation on the training dataset and 90.71% on the independent testing dataset. These results indicate that our predictor is a useful tool for predicting bioluminescent proteins. This is the first study in which evolutionary information and local sequence environment information have been successfully integrated for predicting bioluminescent proteins. A web server (BLPre) that implements the proposed predictor is freely available.</p></abstract>
<kwd-group>
<kwd>bioluminescent proteins</kwd>
<kwd>position specific scoring matrix</kwd>
<kwd>support vector machine</kwd>
<kwd>evolutionary information</kwd></kwd-group></article-meta></front>
<body>
<sec sec-type="intro">
<title>1. Introduction</title>
<p>Bioluminescence is a process in which light is produced in an organism by means of a chemical reaction [<xref ref-type="bibr" rid="b1-ijms-13-03650">1</xref>,<xref ref-type="bibr" rid="b2-ijms-13-03650">2</xref>]. Bioluminescence has been found in various organisms, such as squid, bacteria, fungi, ctenophores, algae and fish [<xref ref-type="bibr" rid="b3-ijms-13-03650">3</xref>,<xref ref-type="bibr" rid="b4-ijms-13-03650">4</xref>]. All bioluminescent reactions occur in the presence of oxygen. At least two chemicals are required in the bioluminescence process. The one that produces the light is generically called a luciferin and the one that catalyzes the reaction is called a luciferase [<xref ref-type="bibr" rid="b5-ijms-13-03650">5</xref>]. In the basic reaction, the luciferase catalyzes the oxidation of luciferin, resulting in light and an inactive oxyluciferin. In order to produce more luciferin, energy must be provided to the reaction system. Sometimes the luciferin and luciferase (as well as a co-factor such as oxygen) are bound together in a single unit called a photoprotein. When a particular type of ion is added to the system, this molecule can be triggered to produce light.</p>
<p>Bioluminescence serves various functions, such as attracting mates, attracting prey, camouflage, finding food, signaling to other members of the same species and illuminating prey [<xref ref-type="bibr" rid="b3-ijms-13-03650">3</xref>–<xref ref-type="bibr" rid="b5-ijms-13-03650">5</xref>]. Applications of bioluminescence can greatly advance progress in medical and commercial fields. Thus, identification of bioluminescent proteins could help to discover many still unknown functions and to design new commercial and medical applications.</p>
<p>Until now, both experimental and computational methods [<xref ref-type="bibr" rid="b6-ijms-13-03650">6</xref>,<xref ref-type="bibr" rid="b7-ijms-13-03650">7</xref>] have been developed to investigate bioluminescent proteins. However, <italic>in vitro</italic> and <italic>in vivo</italic> methods are often time-consuming and expensive, and their scope is very limited because of restrictions on many enzymatic reactions. In contrast, <italic>in silico</italic> prediction of bioluminescent proteins may provide fast and automatic annotation of candidate bioluminescent proteins. However, few studies have used computational approaches to discriminate bioluminescent proteins from non-bioluminescent proteins. Kandaswamy <italic>et al.</italic> [<xref ref-type="bibr" rid="b8-ijms-13-03650">8</xref>] addressed this problem using a support vector machine (SVM). To the best of our knowledge, that is the first and only paper to apply a machine learning technique to the prediction of bioluminescent proteins. A list of 544 physicochemical properties [<xref ref-type="bibr" rid="b9-ijms-13-03650">9</xref>] was used to encode each protein sequence. With the resulting model, BLProt, they obtained 80% accuracy on the training dataset and 80.06% accuracy on the test dataset. The problem is worthy of further investigation because this prediction performance leaves room for improvement and no online web server has been available until now.</p>
<p>In this study, we develop a new computational method to predict bioluminescent proteins. First, sequential evolution information in the form of a position specific scoring matrix (PSSM) is generated for each query sequence by PSI-BLAST. Second, the PSSM is transformed into a fixed-length feature vector by the auto covariance (<italic>AC</italic>) transformation. This encoding strategy (PSSM-AC) has been successfully used to predict protein structural classes [<xref ref-type="bibr" rid="b10-ijms-13-03650">10</xref>] and to discriminate membrane proteins [<xref ref-type="bibr" rid="b11-ijms-13-03650">11</xref>]. Finally, the resulting vectors are input to an SVM classifier to perform the prediction. The accuracy of the proposed predictor reaches 85.17% in 10-fold cross-validation on the training dataset and 90.71% on the independent testing dataset, both higher than the accuracies reported for the existing predictor. We attribute this improvement largely to the discrimination capability of the PSSM-AC feature extraction strategy and the learning capability of the SVM. The proposed predictor is freely accessible to the public at the web server BLPre [<xref ref-type="bibr" rid="b12-ijms-13-03650">12</xref>].</p></sec>
<sec sec-type="materials|methods">
<title>2. Materials and Methods</title>
<sec sec-type="methods">
<title>2.1. Datasets</title>
<p>To evaluate the prediction model proposed in this study and compare it with state-of-the-art methods, two publicly available datasets are used here [<xref ref-type="bibr" rid="b8-ijms-13-03650">8</xref>]. They can be freely downloaded from [<xref ref-type="bibr" rid="b13-ijms-13-03650">13</xref>]. The training dataset contains 300 bioluminescent proteins and 300 non-bioluminescent proteins, and the test dataset contains 139 bioluminescent proteins and 18,202 non-bioluminescent proteins.</p>
<p>To avoid homology bias and remove redundant sequences from a benchmark dataset, a cutoff threshold of 25% was imposed in [<xref ref-type="bibr" rid="b14-ijms-13-03650">14</xref>,<xref ref-type="bibr" rid="b15-ijms-13-03650">15</xref>] to remove those proteins that have ≥25% sequence similarity. However, we do not use such a stringent criterion in this study because the number of available protein sequences does not allow us to do so; a cutoff of 40% is used in this paper instead. In addition, protein sequences containing fewer than 50 amino acids are screened out.</p></sec>
<sec>
<title>2.2. Position Specific Scoring Matrix</title>
<p>Evolutionary information, one of the most important types of information in assessing functionality in biological analysis, has been widely used in many studies [<xref ref-type="bibr" rid="b16-ijms-13-03650">16</xref>–<xref ref-type="bibr" rid="b21-ijms-13-03650">21</xref>]. To extract the evolutionary information, the profile of each protein sequence is generated by running the Position Specific Iterated BLAST (PSI-BLAST) program [<xref ref-type="bibr" rid="b22-ijms-13-03650">22</xref>–<xref ref-type="bibr" rid="b24-ijms-13-03650">24</xref>]. This information can then be represented as a two-dimensional matrix, known as the PSSM of the protein. The PSSM has been widely used to predict protein fold pattern [<xref ref-type="bibr" rid="b25-ijms-13-03650">25</xref>], protein quaternary structural attribute [<xref ref-type="bibr" rid="b26-ijms-13-03650">26</xref>], disulfide connectivity [<xref ref-type="bibr" rid="b27-ijms-13-03650">27</xref>,<xref ref-type="bibr" rid="b28-ijms-13-03650">28</xref>], half-sphere exposure [<xref ref-type="bibr" rid="b29-ijms-13-03650">29</xref>], protein fold recognition and superfamily discrimination [<xref ref-type="bibr" rid="b30-ijms-13-03650">30</xref>], ATP binding residues of a protein [<xref ref-type="bibr" rid="b31-ijms-13-03650">31</xref>], and catalytic residues [<xref ref-type="bibr" rid="b32-ijms-13-03650">32</xref>]. Accordingly, we also use it to predict bioluminescent proteins.</p>
<p>In this paper, the PSSM of each protein sequence in the constructed dataset is generated against the non-redundant Swiss-Prot database (version 56, released on 22 July 2008) using the PSI-BLAST program with three iterations (−<italic>j</italic> 3) and an e-value threshold of 0.0001 (−<italic>h</italic> 0.0001). This matrix is composed of <italic>L</italic> × 20 elements, where <italic>L</italic> is the total number of residues in the protein sequence; the rows of the matrix represent the protein residues and the columns of the matrix represent the 20 amino acids.</p>
<p>Because an SVM requires fixed-length feature vectors as its input for training [<xref ref-type="bibr" rid="b10-ijms-13-03650">10</xref>], we generate a vector of 400 dimensions, called PSSM-400, from the PSSM. PSSM-400 is the composition of occurrences of each type of amino acid corresponding to each type of amino acid in the protein sequence: for each of the 20 columns of the PSSM we obtain a vector of dimension 20, giving 20 × 20 = 400 values in total. <xref ref-type="fig" rid="f1-ijms-13-03650">Figure 1</xref> shows a schematic representation of the transformation of each protein sequence into PSSM-400. Besides the PSSM-AC encoding strategy, PSSM-400 is also used to encode each protein sequence in this study.</p></sec>
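<p>The PSSM-400 encoding described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in this study: the names <monospace>pssm_400</monospace> and <monospace>AMINO_ACIDS</monospace> are hypothetical, the columns are assumed to follow the usual PSI-BLAST amino acid order, and the choices to sum the scores without normalization and to skip non-standard residues are illustrative.</p>
<preformat><![CDATA[
```python
# Assumed column order of the PSSM (the usual PSI-BLAST order).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def pssm_400(sequence, pssm):
    """Sum the PSSM rows of all positions occupied by each residue type.

    sequence: string of length L.
    pssm: list of L rows, each holding 20 scores.
    Returns a flat list of 400 values (20 residue types x 20 PSSM columns).
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    index = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
    table = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        k = index.get(residue)
        if k is None:  # illustrative choice: ignore non-standard residues
            continue
        for j in range(20):
            table[k][j] += row[j]
    # Flatten the 20 x 20 table row by row into one 400-dimensional vector.
    return [value for block in table for value in block]
```
]]></preformat>
<p>Whatever the length <italic>L</italic> of the input sequence, the returned vector always has 400 entries, which is the property the SVM requires.</p>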
<sec>
<title>2.3. Auto Covariance</title>
<p>Auto covariance (<italic>AC</italic>) is a correlation factor coupling adjacent residues along the protein sequence [<xref ref-type="bibr" rid="b11-ijms-13-03650">11</xref>]. It is a variant of auto cross covariance. As a powerful statistical tool used to analyze sequences of vectors [<xref ref-type="bibr" rid="b33-ijms-13-03650">33</xref>], the <italic>AC</italic> transformation has been widely applied in various fields of bioinformatics [<xref ref-type="bibr" rid="b34-ijms-13-03650">34</xref>–<xref ref-type="bibr" rid="b39-ijms-13-03650">39</xref>]. Using <italic>AC</italic> variables avoids producing too many variables. In the PSSM-AC encoding strategy, the <italic>AC</italic> transformation is applied to each column of the PSSM to incorporate the local sequence-order information. In this study, <italic>AC</italic> is employed to transform PSSMs of different sizes into vectors of equal length. Given a protein sequence, <italic>AC</italic> variables describe the average interactions between residues separated by a given <italic>lag</italic>. Here, <italic>lag</italic> is the distance between one residue and its neighbor along the protein sequence. The <italic>AC</italic> variables can be calculated by <xref rid="FD1" ref-type="disp-formula">Equation (1)</xref>.</p>
<disp-formula id="FD1">
<label>(1)</label>
<mml:math id="mm1" display="block">
<mml:semantics id="sm1">
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>C</mml:mi></mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>g</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi></mml:mrow></mml:msub>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>-</mml:mo>
<mml:mi>l</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>g</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>-</mml:mo>
<mml:mi>l</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>g</mml:mi></mml:mrow></mml:munderover>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi></mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi></mml:mrow></mml:msub>
<mml:mo>-</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>n</mml:mi></mml:munderover>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi></mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow>
<mml:mo stretchy="false">)</mml:mo></mml:mrow>
<mml:mo>×</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi></mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>l</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>g</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi></mml:mrow></mml:msub>
<mml:mo>-</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>n</mml:mi></mml:munderover>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi></mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:semantics></mml:math></disp-formula>
<p>where <italic>P</italic> represents the PSSM generated by running the PSI-BLAST program, <italic>i</italic> represents the position, <italic>j</italic> represents one descriptor and <italic>n</italic> is the length of the sequence. Thus, the number of <italic>AC</italic> variables <italic>D</italic> can be calculated as <italic>D</italic> = <italic>lg</italic> × <italic>q</italic>, where <italic>lg</italic> is the maximum <italic>lag</italic> (<italic>lag</italic> = 1, 2, …, <italic>lg</italic>) and <italic>q</italic> is the number of descriptors. Using <xref rid="FD1" ref-type="disp-formula">Equation (1)</xref>, each protein sequence can be represented by a vector of <italic>AC</italic> variables whose length equals <italic>D</italic>. Here, the value of <italic>q</italic> is 20, which corresponds to the number of columns of the PSSM. Ultimately, each protein sequence is characterized by the PSSM-AC model.</p></sec>
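<p>As an illustration, Equation (1) can be transcribed directly into Python. The following is a minimal sketch rather than the code used in this study; the function name <monospace>auto_covariance</monospace> and the ordering of the output (all columns for <italic>lag</italic> = 1, then all columns for <italic>lag</italic> = 2, and so on) are assumptions made for the example.</p>
<preformat><![CDATA[
```python
def auto_covariance(pssm, lg):
    """Apply the AC transformation of Equation (1) to every PSSM column.

    pssm: list of n rows (one per residue), each holding q scores.
    lg: maximum lag; must satisfy 1 <= lg < n.
    Returns a list of lg * q AC variables.
    """
    n = len(pssm)
    if not 1 <= lg < n:
        raise ValueError("lg must be at least 1 and smaller than the sequence length")
    q = len(pssm[0])
    # Column means: (1/n) * sum over all positions of P[i][j].
    means = [sum(row[j] for row in pssm) / n for j in range(q)]
    features = []
    for lag in range(1, lg + 1):
        for j in range(q):
            total = sum(
                (pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                for i in range(n - lag)
            )
            features.append(total / (n - lag))
    return features
```
]]></preformat>
<p>For a PSSM with <italic>q</italic> = 20 columns, a call with <italic>lg</italic> = 30 returns 30 × 20 = 600 values regardless of the sequence length, in agreement with <italic>D</italic> = <italic>lg</italic> × <italic>q</italic>.</p>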
<sec>
<title>2.4. Support Vector Machine</title>
<p>The support vector machine (SVM) is a popular learning approach mainly used in pattern recognition [<xref ref-type="bibr" rid="b40-ijms-13-03650">40</xref>–<xref ref-type="bibr" rid="b42-ijms-13-03650">42</xref>]. The SVM [<xref ref-type="bibr" rid="b43-ijms-13-03650">43</xref>] belongs to the family of margin-based classifiers and is regarded as a very powerful method for prediction, classification, and regression problems. An SVM looks for the optimal hyperplane, which maximizes the distance between the hyperplane and the nearest samples from each of the two classes. Let <italic>x</italic><italic><sub>i</sub></italic> ∈ <italic>R</italic><italic><sup>n</sup></italic> be the training instances and <italic>y</italic><italic><sub>i</sub></italic> ∈ {−1, +1} be the corresponding class labels, <italic>i</italic> = 1, ..., <italic>m</italic>. The class label for a new instance <italic>x</italic> can be determined by the sign of the following function.</p>
<disp-formula id="FD2">
<label>(2)</label>
<mml:math id="mm2" display="block">
<mml:semantics id="sm2">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>m</mml:mi></mml:munderover>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi></mml:mrow>
<mml:mi>i</mml:mi></mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>α</mml:mi></mml:mrow>
<mml:mi>i</mml:mi></mml:msub>
<mml:mi>K</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi></mml:mrow>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>b</mml:mi></mml:mrow></mml:mrow></mml:semantics></mml:math></disp-formula>
<p>where <italic>m</italic> is the number of training instances, α<italic><sub>i</sub></italic> are the coefficients obtained by solving an optimization problem on the input instances, and <italic>b</italic> is the bias term. In this paper, the LIBSVM package [<xref ref-type="bibr" rid="b44-ijms-13-03650">44</xref>] with the radial basis function (RBF) kernel is used:</p>
<disp-formula id="FD3">
<label>(3)</label>
<mml:math id="mm3" display="block">
<mml:semantics id="sm3">
<mml:mrow>
<mml:mi>K</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi></mml:mrow>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi></mml:mrow>
<mml:mi>j</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mtext>exp</mml:mtext>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo>-</mml:mo>
<mml:mi>γ</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>‖</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi></mml:mrow>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>-</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi></mml:mrow>
<mml:mi>j</mml:mi></mml:msub></mml:mrow>
<mml:mo>‖</mml:mo></mml:mrow></mml:mrow></mml:mrow>
<mml:mn>2</mml:mn></mml:msup>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:semantics></mml:math></disp-formula>
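<p>To make Equations (2) and (3) concrete, the following Python sketch evaluates the RBF kernel and the sign of the decision function for given coefficients. It is an illustration only: in practice the coefficients α and the bias <italic>b</italic> are produced by the LIBSVM training procedure, and the function names used here (<monospace>rbf_kernel</monospace>, <monospace>decision_value</monospace>, <monospace>predict</monospace>) are hypothetical and are not part of the LIBSVM interface.</p>
<preformat><![CDATA[
```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """Equation (3): K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


def decision_value(x, instances, labels, alphas, b, gamma):
    """Equation (2): f(x) = sum_i y_i * alpha_i * K(x_i, x) + b."""
    return sum(
        y * alpha * rbf_kernel(x_i, x, gamma)
        for x_i, y, alpha in zip(instances, labels, alphas)
    ) + b


def predict(x, instances, labels, alphas, b, gamma):
    """Return the class label (+1 or -1) given by the sign of f(x)."""
    return 1 if decision_value(x, instances, labels, alphas, b, gamma) >= 0 else -1
```
]]></preformat>
<p>Training instances whose coefficient α is zero do not contribute to the sum, so only the support vectors influence the prediction.</p>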
<p>Two parameters, the regularization parameter <italic>C</italic> and the kernel width parameter γ, are optimized by a grid search evaluated with 10-fold cross-validation.</p></sec>
<sec>
<title>2.5. Model Construction</title>
<p>The workflow of the proposed model is described in <xref ref-type="fig" rid="f2-ijms-13-03650">Figure 2</xref>. The left part of <xref ref-type="fig" rid="f2-ijms-13-03650">Figure 2</xref> shows model construction. First, sequential evolution information in the form of PSSM profiles is obtained for the training dataset by PSI-BLAST. Second, the <italic>AC</italic> transformation is applied to the obtained PSSMs with a range of candidate values of <italic>lg</italic> to incorporate local sequence-order information. Finally, an SVM is trained and evaluated with ten-fold cross validation. Different values of <italic>lg</italic> yield different prediction models; we select the one with the highest accuracy as the final model. The right part of <xref ref-type="fig" rid="f2-ijms-13-03650">Figure 2</xref> shows how a query protein sequence is predicted using the BLPre predictor.</p></sec>
<sec>
<title>2.6. Performance Evaluation</title>
<p>Ten-fold cross validation [<xref ref-type="bibr" rid="b45-ijms-13-03650">45</xref>] is used in this work. The dataset is randomly divided into ten equal sets, of which nine are used for training and the remaining one for testing. This procedure is repeated ten times, so that each set serves once as the testing set, and the final prediction result is the average accuracy over the ten testing sets. Besides the ten-fold cross validation on the training set, we also use an independent dataset test [<xref ref-type="bibr" rid="b46-ijms-13-03650">46</xref>] to evaluate our model.</p>
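<p>The partitioning step of this procedure can be sketched as follows. This is a minimal illustration, not the procedure provided by LIBSVM; the function names <monospace>ten_fold_splits</monospace> and <monospace>mean_accuracy</monospace> and the fixed random seed are assumptions, and the split shown here is a plain random split (whether the folds were stratified by class is not stated in this paper).</p>
<preformat><![CDATA[
```python
import random


def ten_fold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for n-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random division of the dataset
    # Deal the shuffled indices into n_folds sets of (nearly) equal size.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for m, fold in enumerate(folds) if m != k for i in fold]
        yield train, test


def mean_accuracy(fold_accuracies):
    """Average the accuracies obtained on the individual testing sets."""
    return sum(fold_accuracies) / len(fold_accuracies)
```
]]></preformat>
<p>For the 600 training sequences used here, each of the ten rounds would train on 540 sequences and test on the remaining 60.</p>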
<p>Three parameters, sensitivity (<italic>S</italic><italic><sub>n</sub></italic>), specificity (<italic>S</italic><italic><sub>p</sub></italic>), and accuracy (<italic>AC</italic>) are used to measure the performance of our model. They are defined by the following formulas:</p>
<disp-formula id="FD4">
<label>(4)</label>
<mml:math id="mm4" display="block">
<mml:semantics id="sm4">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi></mml:mrow>
<mml:mi>n</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi></mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mrow></mml:semantics></mml:math></disp-formula>
<disp-formula id="FD5">
<label>(5)</label>
<mml:math id="mm5" display="block">
<mml:semantics id="sm5">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi></mml:mrow>
<mml:mi>p</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi></mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi></mml:mrow></mml:mfrac></mml:mrow></mml:semantics></mml:math></disp-formula>
<disp-formula id="FD6">
<label>(6)</label>
<mml:math id="mm6" display="block">
<mml:semantics id="sm6">
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>C</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi></mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mrow></mml:semantics></mml:math></disp-formula>
<p>where <italic>TP</italic>, <italic>TN</italic>, <italic>FP</italic> and <italic>FN</italic> stand for the numbers of true positives, true negatives, false positives and false negatives, respectively. Moreover, we plot receiver operating characteristic (ROC) curves for all of the models in order to evaluate the performance of models using different encoding strategies.</p></sec></sec>
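<p>Equations (4)–(6) translate directly into code. The sketch below is illustrative; the function name <monospace>evaluate</monospace> is hypothetical. The usage example does not report a result of this study: it only uses the sizes of the independent test set (139 bioluminescent and 18,202 non-bioluminescent proteins) to show the outcome for a hypothetical classifier that labels every protein as non-bioluminescent.</p>
<preformat><![CDATA[
```python
def evaluate(tp, tn, fp, fn):
    """Return (Sn, Sp, AC) as defined in Equations (4), (5) and (6)."""
    sn = tp / (tp + fn)                   # sensitivity
    sp = tn / (tn + fp)                   # specificity
    ac = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    return sn, sp, ac


# Hypothetical classifier that predicts "non-bioluminescent" for every
# protein of the independent test set (139 positives, 18202 negatives).
sn, sp, ac = evaluate(tp=0, tn=18202, fp=0, fn=139)
print(f"Sn = {sn:.2%}, Sp = {sp:.2%}, AC = {ac:.2%}")
```
]]></preformat>
<p>On such an imbalanced dataset this trivial classifier attains an accuracy above 99% with a sensitivity of zero, which is why sensitivity and specificity are reported together with accuracy.</p>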
<sec sec-type="results|discussion">
<title>3. Results and Discussion</title>
<sec>
<title>3.1. Selecting the Optimal <italic>lg</italic> for the Prediction Model</title>
<p>As mentioned in Section 2.5, the value of <italic>lg</italic> in the <italic>AC</italic> transformation is an important parameter for prediction performance. Generally, the optimal value of <italic>lg</italic> varies between datasets, and it must be smaller than the length of the shortest protein sequence in the corresponding dataset. Since all the protein sequences collected in this paper contain more than 50 amino acids, a series of <italic>lg</italic> values (<italic>lg</italic> = 1, 5, 10, 15, …, 50) is investigated to construct the optimal prediction model. The results on the training set constructed in this study are presented in <xref ref-type="fig" rid="f3-ijms-13-03650">Figure 3</xref>.</p>
<p>As can be seen in <xref ref-type="fig" rid="f3-ijms-13-03650">Figure 3</xref>, the prediction accuracy increases from 79.33% to 85.17% as the value of <italic>lg</italic> increases from 1 to 30, and decreases when the value of <italic>lg</italic> is larger than 30. The accuracy of the prediction model becomes stable once the value of <italic>lg</italic> reaches 45. The best value of <italic>lg</italic> is therefore 30, corresponding to the peak accuracy of 85.17%, and <italic>lg</italic> is set to 30 in the rest of this study.</p></sec>
<sec>
<title>3.2. Comparison with Simple PSI-BLAST Search Method</title>
<p>In this section, we compare the PSSM-AC encoding strategy with the PSSM-400 encoding strategy mentioned in Section 2.2 to highlight the advantage of our prediction model. The results evaluated by ten-fold cross validation on the training dataset are shown in <xref ref-type="table" rid="t1-ijms-13-03650">Table 1</xref> and <xref ref-type="fig" rid="f4-ijms-13-03650">Figure 4</xref>. As can be seen in <xref ref-type="table" rid="t1-ijms-13-03650">Table 1</xref>, the accuracy obtained by our PSSM-AC method is 85.17%, compared with 79.32% for the PSSM-400 method.</p>
<p>As shown in <xref ref-type="fig" rid="f4-ijms-13-03650">Figure 4</xref>, PSSM-AC achieves an area under the ROC curve (<italic>AUC</italic>) of 0.92, which is higher than the <italic>AUC</italic> of 0.88 obtained with the PSSM-400 method. These results indicate the superior performance of the <italic>AC</italic> transformation when it is applied to the PSSM to incorporate local sequence-order information.</p></sec>
<sec sec-type="methods">
<title>3.3. Comparison with Other Methods</title>
<p>In this section, the proposed predictor is further compared with a recently reported predictor, BLProt [<xref ref-type="bibr" rid="b8-ijms-13-03650">8</xref>], on the training dataset and the independent test dataset. As can be seen in <xref ref-type="table" rid="t1-ijms-13-03650">Table 1</xref>, our model achieves an accuracy of 85.17% on the training dataset, which is about 5 percentage points higher than that of the BLProt method. The numbers of bioluminescent proteins and non-bioluminescent proteins in the test dataset are highly imbalanced, and this situation is close to reality. On this dataset, the accuracy obtained by our method is 90.71%, compared with the accuracy of 80.06% reported by Kandaswamy <italic>et al.</italic> [<xref ref-type="bibr" rid="b8-ijms-13-03650">8</xref>]. The better prediction performance may be credited to the protein sequence encoding strategy adopted in our prediction model.</p></sec></sec>
<sec sec-type="conclusions">
<title>4. Conclusions</title>
<p>Prediction of bioluminescent proteins could help to discover many still unknown functions and to design new commercial and medical applications. Although some researchers have addressed this problem, the prediction accuracy is still not satisfactory. In this study, <italic>AC</italic> is applied to the PSSM, and the resulting PSSM-AC encoding strategy captures both sequential evolution information and local sequence-order information, which together reflect the local sequence environment during evolution. The accuracy of our prediction model is higher than that of the existing bioluminescent protein prediction tool. The experimental results show that our method is promising and may be a useful supplement to existing methods.</p></sec></body>
<back>
<ack>
<title>Acknowledgments</title>
<p>This research is partially supported by the National Natural Science Foundation of China (Nos. 61172183 and 61070084), the Natural Science Foundation of Jilin Province (Nos. 20101506 and 20101503), and the Scientific and Technical Project of Administration of Traditional Chinese Medicine of Jilin Province (Nos. 2010pt067 and 2011-zd16).</p></ack>
<ref-list>
<title>References</title>
<ref id="b1-ijms-13-03650"><label>1</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Hastings</surname><given-names>J.W.</given-names></name></person-group><source>Bioluminescence</source><publisher-name>Academic Press</publisher-name><publisher-loc>New York, NY, USA</publisher-loc><year>1995</year></citation></ref>
<ref id="b2-ijms-13-03650"><label>2</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wilson</surname><given-names>T.</given-names></name></person-group><article-title>Comments on the mechanisms of chemi- and bioluminescence</article-title><source>Photochem. Photobiol</source><year>1995</year><volume>62</volume><fpage>601</fpage><lpage>606</lpage><pub-id pub-id-type="doi">10.1111/j.1751-1097.1995.tb08706.x</pub-id></citation></ref>
<ref id="b3-ijms-13-03650"><label>3</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Haddock</surname><given-names>S.H.D.</given-names></name><name><surname>Moline</surname><given-names>M.A.</given-names></name><name><surname>Case</surname><given-names>J.F.</given-names></name></person-group><article-title>Bioluminescence in the Sea</article-title><source>Ann. Rev. Mar. Sci</source><year>2010</year><volume>2</volume><fpage>293</fpage><lpage>343</lpage></citation></ref>
<ref id="b4-ijms-13-03650"><label>4</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Lloyd</surname><given-names>J.E.</given-names></name></person-group><source>Insect Bioluminescence</source><publisher-name>Academic Press</publisher-name><publisher-loc>New York, NY, USA</publisher-loc><year>1978</year></citation></ref>
<ref id="b5-ijms-13-03650"><label>5</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>White</surname><given-names>E.H.</given-names></name><name><surname>Rapaport</surname><given-names>E.</given-names></name><name><surname>Seliger</surname><given-names>H.H.</given-names></name><name><surname>Hopkins</surname><given-names>T.A.</given-names></name></person-group><article-title>The chemi- and bioluminescence of firefly luciferin: An efficient chemical production of electronically excited states</article-title><source>Bioorg. Chem</source><year>1971</year><volume>1</volume><fpage>92</fpage><lpage>122</lpage><pub-id pub-id-type="doi">10.1016/0045-2068(71)90009-5</pub-id></citation></ref>
<ref id="b6-ijms-13-03650"><label>6</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shimomura</surname><given-names>O.</given-names></name><name><surname>Johnson</surname><given-names>F.</given-names></name><name><surname>Saiga</surname><given-names>Y.</given-names></name></person-group><article-title>Extraction, purification and properties of aequorin, a bioluminescent protein from the luminous hydromedusan, aequorea</article-title><source>J. Cell. Phys</source><year>1962</year><volume>59</volume><fpage>223</fpage><lpage>239</lpage><pub-id pub-id-type="doi">10.1002/jcp.1030590302</pub-id></citation></ref>
<ref id="b7-ijms-13-03650"><label>7</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pierre</surname><given-names>A.V.</given-names></name><name><surname>Val</surname><given-names>J.W.</given-names></name></person-group><article-title>Fluorescent and bioluminescent protein-fragment complementation assays in the study of G protein-coupled receptor oligomerization and signaling</article-title><source>Mol. Pharmacol</source><year>2009</year><volume>75</volume><fpage>733</fpage><lpage>739</lpage><pub-id pub-id-type="doi">10.1124/mol.108.053819</pub-id><pub-id pub-id-type="pmid">19141658</pub-id></citation></ref>
<ref id="b8-ijms-13-03650"><label>8</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kandaswamy</surname><given-names>K.K.</given-names></name><name><surname>Ganesan</surname><given-names>P.</given-names></name><name><surname>Mehrnaz</surname><given-names>K.H.</given-names></name><name><surname>Kai</surname><given-names>K.</given-names></name><name><surname>Martinetz</surname><given-names>T.</given-names></name></person-group><article-title>BLProt: Prediction of bioluminescent proteins based on Support Vector Machine and ReliefF feature selection</article-title><source>BMC Bioinformatics</source><year>2011</year><volume>12</volume><pub-id pub-id-type="doi">10.1186/1471-2105-12-345</pub-id></citation></ref>
<ref id="b9-ijms-13-03650"><label>9</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kawashima</surname><given-names>S.</given-names></name><name><surname>Ogata</surname><given-names>H.</given-names></name><name><surname>Kanehisa</surname><given-names>M.</given-names></name></person-group><article-title>AAindex: Amino acid index database</article-title><source>Nucleic Acids Res</source><year>1999</year><volume>27</volume><fpage>368</fpage><lpage>369</lpage><pub-id pub-id-type="doi">10.1093/nar/27.1.368</pub-id><pub-id pub-id-type="pmid">9847231</pub-id></citation></ref>
<ref id="b10-ijms-13-03650"><label>10</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname><given-names>T.G.</given-names></name><name><surname>Geng</surname><given-names>X.B.</given-names></name><name><surname>Zheng</surname><given-names>X.Q.</given-names></name><name><surname>Li</surname><given-names>R.S.</given-names></name><name><surname>Wang</surname><given-names>J.</given-names></name></person-group><article-title>Accurate prediction of protein structural class using auto covariance transformation of PSI-BLAST profiles</article-title><source>Amino Acids</source><year>2011</year><pub-id pub-id-type="doi">10.1007/s00726-011-0964-5</pub-id></citation></ref>
<ref id="b11-ijms-13-03650"><label>11</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname><given-names>L.</given-names></name><name><surname>Li</surname><given-names>Y.Z.</given-names></name><name><surname>Xiao</surname><given-names>R.Q.</given-names></name><name><surname>Zeng</surname><given-names>Y.H.</given-names></name><name><surname>Xiao</surname><given-names>J.M.</given-names></name><name><surname>Tan</surname><given-names>F.Y.</given-names></name><name><surname>Li</surname><given-names>M.L.</given-names></name></person-group><article-title>Using auto covariance method for functional discrimination of membrane proteins based on evolution information</article-title><source>Amino Acids</source><year>2010</year><volume>38</volume><fpage>1497</fpage><lpage>1503</lpage><pub-id pub-id-type="doi">10.1007/s00726-009-0362-4</pub-id><pub-id pub-id-type="pmid">19820894</pub-id></citation></ref>
<ref id="b12-ijms-13-03650"><label>12</label><citation citation-type="web"><collab>BLPre</collab><comment>Available online: <ext-link xlink:href="http://59.73.198.144/AFP_PSSM/" ext-link-type="uri">http://59.73.198.144/AFP_PSSM/</ext-link></comment><access-date>accessed on 10 February 2012</access-date></citation></ref>
<ref id="b13-ijms-13-03650"><label>13</label><citation citation-type="web"><source>BLProt dataset</source><comment>Available online: <ext-link xlink:href="http://www.inb.uni-luebeck.de/tools-demos/bioluminescent%20protein/BLProt" ext-link-type="uri">http://www.inb.uni-luebeck.de/tools-demos/bioluminescent%20protein/BLProt</ext-link></comment><access-date>accessed on 23 December 2011</access-date></citation></ref>
<ref id="b14-ijms-13-03650"><label>14</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chou</surname><given-names>K.C.</given-names></name><name><surname>Shen</surname><given-names>H.B.</given-names></name></person-group><article-title>Plant-mPLoc: A top-down strategy to augment the power for predicting plant protein subcellular localization</article-title><source>PLoS One</source><year>2010</year><volume>5</volume><pub-id pub-id-type="doi">10.1371/journal.pone.0011335</pub-id></citation></ref>
<ref id="b15-ijms-13-03650"><label>15</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chou</surname><given-names>K.C.</given-names></name><name><surname>Wu</surname><given-names>Z.C.</given-names></name><name><surname>Xiao</surname><given-names>X.</given-names></name></person-group><article-title>iLoc-Euk: A multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins</article-title><source>PLoS One</source><year>2011</year><volume>6</volume><pub-id pub-id-type="doi">10.1371/journal.pone.0018258</pub-id></citation></ref>
<ref id="b16-ijms-13-03650"><label>16</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kumar</surname><given-names>M.</given-names></name><name><surname>Gromiha</surname><given-names>M.M.</given-names></name><name><surname>Raghava</surname><given-names>G.P.</given-names></name></person-group><article-title>Identification of DNA-binding proteins using support vector machines and evolutionary profiles</article-title><source>BMC Bioinformatics</source><year>2007</year><volume>8</volume><pub-id pub-id-type="doi">10.1186/1471-2105-8-463</pub-id></citation></ref>
<ref id="b17-ijms-13-03650"><label>17</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Song</surname><given-names>J.</given-names></name><name><surname>Burrage</surname><given-names>K.</given-names></name><name><surname>Yuan</surname><given-names>Z.</given-names></name><name><surname>Huber</surname><given-names>T.</given-names></name></person-group><article-title>Prediction of <italic>cis</italic>/<italic>trans</italic> isomerization in proteins using PSI-BLAST profiles and secondary structure information</article-title><source>BMC Bioinformatics</source><year>2006</year><volume>7</volume><pub-id pub-id-type="doi">10.1186/1471-2105-7-124</pub-id></citation></ref>
<ref id="b18-ijms-13-03650"><label>18</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jones</surname><given-names>D.T.</given-names></name></person-group><article-title>Improving the accuracy of transmembrane protein topology prediction using evolutionary information</article-title><source>Bioinformatics</source><year>2007</year><volume>23</volume><fpage>538</fpage><lpage>544</lpage><pub-id pub-id-type="doi">10.1093/bioinformatics/btl677</pub-id><pub-id pub-id-type="pmid">17237066</pub-id></citation></ref>
<ref id="b19-ijms-13-03650"><label>19</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Biswas</surname><given-names>A.K.</given-names></name><name><surname>Noman</surname><given-names>N.</given-names></name><name><surname>Sikder</surname><given-names>A.R.</given-names></name></person-group><article-title>Machine learning approach to predict protein phosphorylation sites by incorporating evolutionary information</article-title><source>BMC Bioinformatics</source><year>2010</year><volume>11</volume><pub-id pub-id-type="doi">10.1186/1471-2105-11-273</pub-id></citation></ref>
<ref id="b20-ijms-13-03650"><label>20</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ruchi</surname><given-names>V.</given-names></name><name><surname>Grish</surname><given-names>C.V.</given-names></name><name><surname>Raghava</surname><given-names>G.P.S.</given-names></name></person-group><article-title>Prediction of mitochondrial proteins of malaria parasite using split amino acid composition and PSSM profile</article-title><source>Amino Acids</source><year>2010</year><volume>39</volume><fpage>101</fpage><lpage>110</lpage><pub-id pub-id-type="doi">10.1007/s00726-009-0381-1</pub-id><pub-id pub-id-type="pmid">19908123</pub-id></citation></ref>
<ref id="b21-ijms-13-03650"><label>21</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname><given-names>X.W.</given-names></name><name><surname>Li</surname><given-names>X.T.</given-names></name><name><surname>Ma</surname><given-names>Z.Q.</given-names></name><name><surname>Yin</surname><given-names>M.H.</given-names></name></person-group><article-title>Prediction of lysine ubiquitylation with ensemble classifier and feature selection</article-title><source>Int. J. Mol. Sci</source><year>2011</year><volume>12</volume><fpage>8347</fpage><lpage>8361</lpage><pub-id pub-id-type="doi">10.3390/ijms12128347</pub-id><pub-id pub-id-type="pmid">22272076</pub-id></citation></ref>
<ref id="b22-ijms-13-03650"><label>22</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Altschul</surname><given-names>S.</given-names></name><name><surname>Wootton</surname><given-names>J.</given-names></name><name><surname>Gertz</surname><given-names>E.</given-names></name><name><surname>Agarwala</surname><given-names>R.</given-names></name><name><surname>Morgulis</surname><given-names>A.</given-names></name><name><surname>Schaffer</surname><given-names>A.</given-names></name><name><surname>Yu</surname><given-names>Y.</given-names></name></person-group><article-title>Protein database searches using compositionally adjusted substitution matrices</article-title><source>FEBS J</source><year>2005</year><volume>272</volume><fpage>5101</fpage><lpage>5109</lpage><pub-id pub-id-type="doi">10.1111/j.1742-4658.2005.04945.x</pub-id><pub-id pub-id-type="pmid">16218944</pub-id></citation></ref>
<ref id="b23-ijms-13-03650"><label>23</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Altschul</surname><given-names>S.</given-names></name><name><surname>Madden</surname><given-names>T.</given-names></name><name><surname>Schaffer</surname><given-names>A.</given-names></name><name><surname>Zhang</surname><given-names>J.</given-names></name><name><surname>Zhang</surname><given-names>Z.</given-names></name><name><surname>Miller</surname><given-names>W.</given-names></name><name><surname>Lipman</surname><given-names>D.</given-names></name></person-group><article-title>Gapped BLAST and PSI-BLAST: A new generation of protein database search programs</article-title><source>Nucleic Acids Res</source><year>1997</year><volume>25</volume><fpage>3389</fpage><lpage>3402</lpage><pub-id pub-id-type="doi">10.1093/nar/25.17.3389</pub-id><pub-id pub-id-type="pmid">9254694</pub-id></citation></ref>
<ref id="b24-ijms-13-03650"><label>24</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schaffer</surname><given-names>A.</given-names></name><name><surname>Aravind</surname><given-names>L.</given-names></name><name><surname>Madden</surname><given-names>T.</given-names></name><name><surname>Shavirin</surname><given-names>S.</given-names></name><name><surname>Spouge</surname><given-names>J.</given-names></name><name><surname>Wolf</surname><given-names>Y.</given-names></name><name><surname>Koonin</surname><given-names>E.</given-names></name><name><surname>Altschul</surname><given-names>S.</given-names></name></person-group><article-title>Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements</article-title><source>Nucleic Acids Res</source><year>2001</year><volume>29</volume><fpage>2994</fpage><lpage>3005</lpage><pub-id pub-id-type="doi">10.1093/nar/29.14.2994</pub-id><pub-id pub-id-type="pmid">11452024</pub-id></citation></ref>
<ref id="b25-ijms-13-03650"><label>25</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shen</surname><given-names>H.B.</given-names></name><name><surname>Chou</surname><given-names>K.C.</given-names></name></person-group><article-title>Predicting protein fold pattern with functional domain and sequential evolution information</article-title><source>J. Theor. Biol</source><year>2009</year><volume>256</volume><fpage>441</fpage><lpage>446</lpage><pub-id pub-id-type="doi">10.1016/j.jtbi.2008.10.007</pub-id><pub-id pub-id-type="pmid">18996396</pub-id></citation></ref>
<ref id="b26-ijms-13-03650"><label>26</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shen</surname><given-names>H.B.</given-names></name><name><surname>Chou</surname><given-names>K.C.</given-names></name></person-group><article-title>QuatIdent: A web server for identifying protein quaternary structural attribute by fusing functional domain and sequential evolution information</article-title><source>J. Proteome Res</source><year>2009</year><volume>8</volume><fpage>1577</fpage><lpage>1584</lpage><pub-id pub-id-type="doi">10.1021/pr800957q</pub-id><pub-id pub-id-type="pmid">19226167</pub-id></citation></ref>
<ref id="b27-ijms-13-03650"><label>27</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Song</surname><given-names>J.</given-names></name><name><surname>Yuan</surname><given-names>Z.</given-names></name><name><surname>Tan</surname><given-names>H.</given-names></name><name><surname>Huber</surname><given-names>T.</given-names></name><name><surname>Burrage</surname><given-names>K.</given-names></name></person-group><article-title>Predicting disulfide connectivity from protein sequence using multiple sequence feature vectors and secondary structure</article-title><source>Bioinformatics</source><year>2007</year><volume>23</volume><fpage>3147</fpage><lpage>3154</lpage><pub-id pub-id-type="doi">10.1093/bioinformatics/btm505</pub-id><pub-id pub-id-type="pmid">17942444</pub-id></citation></ref>
<ref id="b28-ijms-13-03650"><label>28</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhu</surname><given-names>L.</given-names></name><name><surname>Yang</surname><given-names>J.</given-names></name><name><surname>Song</surname><given-names>J.N.</given-names></name><name><surname>Chou</surname><given-names>K.C.</given-names></name><name><surname>Shen</surname><given-names>H.B.</given-names></name></person-group><article-title>Improving the accuracy of predicting disulfide connectivity by feature selection</article-title><source>J. Comput. Chem</source><year>2010</year><volume>31</volume><fpage>1478</fpage><lpage>1485</lpage><pub-id pub-id-type="pmid">20127740</pub-id></citation></ref>
<ref id="b29-ijms-13-03650"><label>29</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Song</surname><given-names>J.</given-names></name><name><surname>Tan</surname><given-names>H.</given-names></name><name><surname>Takemoto</surname><given-names>K.</given-names></name><name><surname>Akutsu</surname><given-names>T.</given-names></name></person-group><article-title>HSEpred: Predict half-sphere exposure from protein sequence</article-title><source>Bioinformatics</source><year>2008</year><volume>24</volume><fpage>1489</fpage><lpage>1497</lpage><pub-id pub-id-type="doi">10.1093/bioinformatics/btn222</pub-id><pub-id pub-id-type="pmid">18467349</pub-id></citation></ref>
<ref id="b30-ijms-13-03650"><label>30</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lobley</surname><given-names>A.</given-names></name><name><surname>Sadowski</surname><given-names>M.I.</given-names></name><name><surname>Jones</surname><given-names>D.T.</given-names></name></person-group><article-title>pGenTHREADER and pDomTHREADER: New methods for improved protein fold recognition and superfamily discrimination</article-title><source>Bioinformatics</source><year>2009</year><volume>25</volume><fpage>1761</fpage><lpage>1767</lpage><pub-id pub-id-type="doi">10.1093/bioinformatics/btp302</pub-id><pub-id pub-id-type="pmid">19429599</pub-id></citation></ref>
<ref id="b31-ijms-13-03650"><label>31</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chauhan</surname><given-names>J.S.</given-names></name><name><surname>Mishra</surname><given-names>N.K.</given-names></name><name><surname>Raghava</surname><given-names>G.P.</given-names></name></person-group><article-title>Identification of ATP binding residues of a protein from its primary sequence</article-title><source>BMC Bioinformatics</source><year>2009</year><volume>10</volume><pub-id pub-id-type="doi">10.1186/1471-2105-10-434</pub-id></citation></ref>
<ref id="b32-ijms-13-03650"><label>32</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname><given-names>T.</given-names></name><name><surname>Zhang</surname><given-names>H.</given-names></name><name><surname>Chen</surname><given-names>K.</given-names></name><name><surname>Shen</surname><given-names>S.</given-names></name><name><surname>Ruan</surname><given-names>J.</given-names></name><name><surname>Kurgan</surname><given-names>L.</given-names></name></person-group><article-title>Accurate sequence-based prediction of catalytic residues</article-title><source>Bioinformatics</source><year>2008</year><volume>24</volume><fpage>2329</fpage><lpage>2338</lpage><pub-id pub-id-type="doi">10.1093/bioinformatics/btn433</pub-id><pub-id pub-id-type="pmid">18710875</pub-id></citation></ref>
<ref id="b33-ijms-13-03650"><label>33</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wold</surname><given-names>S.</given-names></name><name><surname>Jonsson</surname><given-names>J.</given-names></name><name><surname>Sjostrom</surname><given-names>M.</given-names></name><name><surname>Rannar</surname><given-names>S.</given-names></name></person-group><article-title>DNA and peptide sequences and chemical processes multivariately modeled by principal component analysis and partial least squares projection to latent structures</article-title><source>Anal. Chim. Acta</source><year>1993</year><volume>277</volume><fpage>239</fpage><lpage>253</lpage><pub-id pub-id-type="doi">10.1016/0003-2670(93)80437-P</pub-id></citation></ref>
<ref id="b34-ijms-13-03650"><label>34</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guo</surname><given-names>Y.</given-names></name><name><surname>Li</surname><given-names>M.</given-names></name><name><surname>Lu</surname><given-names>M.</given-names></name><name><surname>Wen</surname><given-names>Z.</given-names></name><name><surname>Huang</surname><given-names>Z.</given-names></name></person-group><article-title>Predicting G-protein coupled receptors-G-protein coupling specificity based on autocross-covariance transform</article-title><source>Proteins</source><year>2006</year><volume>65</volume><fpage>55</fpage><lpage>60</lpage><pub-id pub-id-type="doi">10.1002/prot.21097</pub-id><pub-id pub-id-type="pmid">16865706</pub-id></citation></ref>
<ref id="b35-ijms-13-03650"><label>35</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guo</surname><given-names>Y.Z.</given-names></name><name><surname>Yu</surname><given-names>L.Z.</given-names></name><name><surname>Wen</surname><given-names>Z.N.</given-names></name><name><surname>Li</surname><given-names>M.L.</given-names></name></person-group><article-title>Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences</article-title><source>Nucleic Acids Res</source><year>2008</year><volume>36</volume><fpage>3025</fpage><lpage>3030</lpage><pub-id pub-id-type="doi">10.1093/nar/gkn159</pub-id><pub-id pub-id-type="pmid">18390576</pub-id></citation></ref>
<ref id="b36-ijms-13-03650"><label>36</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dong</surname><given-names>Q.W.</given-names></name><name><surname>Zhou</surname><given-names>S.G.</given-names></name><name><surname>Guan</surname><given-names>J.H.</given-names></name></person-group><article-title>A new taxonomy-based protein folds recognition approach based on autocross-covariance transformation</article-title><source>Bioinformatics</source><year>2009</year><volume>25</volume><fpage>2655</fpage><lpage>2662</lpage><pub-id pub-id-type="doi">10.1093/bioinformatics/btp500</pub-id><pub-id pub-id-type="pmid">19706744</pub-id></citation></ref>
<ref id="b37-ijms-13-03650"><label>37</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname><given-names>J.</given-names></name><name><surname>Li</surname><given-names>M.</given-names></name><name><surname>Yu</surname><given-names>L.</given-names></name><name><surname>Wang</surname><given-names>C.</given-names></name></person-group><article-title>An ensemble classifier of support vector machines used to predict protein structural classes by fusing auto covariance and pseudo-amino acid composition</article-title><source>Protein J</source><year>2010</year><volume>29</volume><fpage>62</fpage><lpage>67</lpage><pub-id pub-id-type="doi">10.1007/s10930-009-9222-z</pub-id><pub-id pub-id-type="pmid">20049515</pub-id></citation></ref>
<ref id="b38-ijms-13-03650"><label>38</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zeng</surname><given-names>Y.H.</given-names></name><name><surname>Guo</surname><given-names>Y.Z.</given-names></name><name><surname>Xiao</surname><given-names>R.Q.</given-names></name><name><surname>Yang</surname><given-names>L.</given-names></name><name><surname>Yu</surname><given-names>L.Z.</given-names></name><name><surname>Li</surname><given-names>M.L.</given-names></name></person-group><article-title>Using the augmented Chou’s pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach</article-title><source>J. Theor. Biol</source><year>2009</year><volume>259</volume><fpage>366</fpage><lpage>372</lpage><pub-id pub-id-type="doi">10.1016/j.jtbi.2009.03.028</pub-id><pub-id pub-id-type="pmid">19341746</pub-id></citation></ref>
<ref id="b39-ijms-13-03650"><label>39</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname><given-names>T.G.</given-names></name><name><surname>Zheng</surname><given-names>X.Q.</given-names></name><name><surname>Wang</surname><given-names>C.H.</given-names></name><name><surname>Wang</surname><given-names>J.</given-names></name></person-group><article-title>Prediction of subcellular location of apoptosis proteins using pseudo amino acid composition: An approach from auto covariance transformation</article-title><source>Protein Pept. Lett</source><year>2010</year><volume>17</volume><fpage>1263</fpage><lpage>1269</lpage><pub-id pub-id-type="doi">10.2174/092986610792231528</pub-id><pub-id pub-id-type="pmid">20670213</pub-id></citation></ref>
<ref id="b40-ijms-13-03650"><label>40</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Khan</surname><given-names>A.</given-names></name><name><surname>Javed</surname><given-names>S.J.</given-names></name></person-group><article-title>Predicting regularities in lattice constants of GdFeO3-type perovskites</article-title><source>Acta Crystallogr</source><year>2008</year><volume>B64</volume><fpage>120</fpage><lpage>122</lpage></citation></ref>
<ref id="b41-ijms-13-03650"><label>41</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Qiu</surname><given-names>J.D.</given-names></name><name><surname>Huang</surname><given-names>J.H.</given-names></name><name><surname>Liang</surname><given-names>R.P.</given-names></name><name><surname>Lu</surname><given-names>X.Q.</given-names></name></person-group><article-title>Prediction of G-protein-coupled receptors based on the concept of Chou’s pseudo amino acid composition: An approach from discrete wavelet transform</article-title><source>Anal. Biochem</source><year>2009</year><volume>390</volume><fpage>68</fpage><lpage>73</lpage><pub-id pub-id-type="doi">10.1016/j.ab.2009.04.009</pub-id><pub-id pub-id-type="pmid">19364489</pub-id></citation></ref>
<ref id="b42-ijms-13-03650"><label>42</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname><given-names>S.</given-names></name><name><surname>Ding</surname><given-names>S.</given-names></name><name><surname>Wang</surname><given-names>T.</given-names></name></person-group><article-title>High-accuracy prediction of protein structural class for low-similarity sequences based on predicted secondary structure</article-title><source>Biochimie</source><year>2011</year><volume>4</volume><fpage>710</fpage><lpage>714</lpage></citation></ref>
<ref id="b43-ijms-13-03650"><label>43</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Vapnik</surname><given-names>V.</given-names></name></person-group><source>Statistical Learning Theory</source><publisher-name>Wiley-Interscience</publisher-name><publisher-loc>New York, NY, USA</publisher-loc><year>1998</year></citation></ref>
<ref id="b44-ijms-13-03650"><label>44</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chang</surname><given-names>C.C.</given-names></name><name><surname>Lin</surname><given-names>C.J.</given-names></name></person-group><article-title>LIBSVM: A library for support vector machines</article-title><source>ACM Trans. Intell. Syst. Technol</source><year>2011</year><volume>2</volume><fpage>1</fpage><lpage>27</lpage></citation></ref>
<ref id="b45-ijms-13-03650"><label>45</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chou</surname><given-names>K.C.</given-names></name><name><surname>Zhang</surname><given-names>C.T.</given-names></name></person-group><article-title>Review: Prediction of protein structural classes</article-title><source>Crit. Rev. Biochem. Mol. Biol</source><year>1995</year><volume>30</volume><fpage>275</fpage><lpage>349</lpage><pub-id pub-id-type="doi">10.3109/10409239509083488</pub-id><pub-id pub-id-type="pmid">7587280</pub-id></citation></ref>
<ref id="b46-ijms-13-03650"><label>46</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chou</surname><given-names>K.C.</given-names></name><name><surname>Shen</surname><given-names>H.B.</given-names></name></person-group><article-title>Cell-PLoc: A package of web-servers for predicting subcellular localization of proteins in various organisms</article-title><source>Nat. Protoc</source><year>2008</year><volume>3</volume><fpage>153</fpage><lpage>162</lpage><pub-id pub-id-type="doi">10.1038/nprot.2007.494</pub-id><pub-id pub-id-type="pmid">18274516</pub-id></citation></ref></ref-list>
<sec sec-type="display-objects">
<title>Figures and Table</title>
<fig id="f1-ijms-13-03650" position="float">
<label>Figure 1</label>
<caption>
<p>Schematic representation of the transformation of each protein sequence into a PSSM-400 matrix.</p></caption>
<graphic xlink:href="ijms-13-03650f1.gif"/></fig>
<fig id="f2-ijms-13-03650" position="float">
<label>Figure 2</label>
<caption>
<p>Detailed system flow of the prediction system.</p></caption>
<graphic xlink:href="ijms-13-03650f2.gif"/></fig>
<fig id="f3-ijms-13-03650" position="float">
<label>Figure 3</label>
<caption>
<p>Accuracy of the prediction model using <italic>AC</italic> with different values of <italic>lg</italic>.</p></caption>
<graphic xlink:href="ijms-13-03650f3.gif"/></fig>
<fig id="f4-ijms-13-03650" position="float">
<label>Figure 4</label>
<caption>
<p>ROC curves obtained from ten-fold cross-validation of the PSSM and PSSM-AC encoding strategies.</p></caption>
<graphic xlink:href="ijms-13-03650f4.gif"/></fig>
<table-wrap id="t1-ijms-13-03650" position="float">
<label>Table 1</label>
<caption>
<p>The performance comparison of different encoding strategies on the training dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="bottom">Method</th>
<th align="center" valign="bottom"><italic>S</italic><italic><sub>n</sub></italic> (%)</th>
<th align="center" valign="bottom"><italic>S</italic><italic><sub>p</sub></italic> (%)</th>
<th align="center" valign="bottom"><italic>AC</italic> (%)</th></tr></thead>
<tbody>
<tr>
<td align="left" valign="top">PSSM-400</td>
<td align="center" valign="top">72.00</td>
<td align="center" valign="top">86.33</td>
<td align="center" valign="top">79.32</td></tr>
<tr>
<td align="left" valign="top">PSSM-AC</td>
<td align="center" valign="top">79.33</td>
<td align="center" valign="top">91.00</td>
<td align="center" valign="top">85.17</td></tr>
<tr>
<td align="left" valign="top">BLProt [<xref ref-type="bibr" rid="b8-ijms-13-03650">8</xref>]</td>
<td align="center" valign="top">74.47</td>
<td align="center" valign="top">84.21</td>
<td align="center" valign="top">80.00</td></tr></tbody></table></table-wrap></sec></back></article>
