Genes 2019, 10(1), 57; https://doi.org/10.3390/genes10010057
A Multi-Label Supervised Topic Model Conditioned on Arbitrary Features for Gene Function Prediction
School of Information, Yunnan Normal University, Kunming 650500, China
Key Laboratory of Educational Informatization for Nationalities Ministry of Education, Yunnan Normal University, Kunming 650500, China
School of Software, Yunnan University, Kunming 650091, China
Authors to whom correspondence should be addressed.
Received: 30 November 2018 / Accepted: 10 January 2019 / Published: 17 January 2019
With the continuous accumulation of biological data, more and more machine learning algorithms have been introduced into the field of gene function prediction, which is of great significance for decoding the secrets of life. Recently, a multi-label supervised topic model named labeled latent Dirichlet allocation (LLDA) has been applied to gene function prediction, and obtained more accurate and explainable predictions than conventional methods. Nonetheless, the LLDA model can only use a bag of amino acid words as its classification feature, and does not support other features, such as hydrophobicity, which has a profound impact on gene function. To achieve more accurate probabilistic modeling of gene function, we propose a multi-label supervised topic model conditioned on arbitrary features, named Dirichlet multinomial regression LLDA (DMR-LLDA), which introduces multiple types of features into the process of topic modeling. Based on the DMR framework, DMR-LLDA places an exponential prior, constructed from weighted features, on the hyper-parameters of the gene-topic distribution, so as to reflect the effects of extra features on the function probability distribution. In a five-fold cross validation experiment on a yeast dataset, DMR-LLDA significantly outperforms the compared models. These experiments demonstrate the effectiveness and potential value of DMR-LLDA for predicting gene function.
Keywords: multi-label classification; topic model; gene function; probability distribution; Dirichlet-multinomial regression
As the main component of a cell, proteins are the most essential and versatile material of life. Thus, research on protein functions is of great importance for the development of new drugs, better crops, and synthetic biochemicals. In recent years, new protein function prediction methods using machine learning algorithms have proliferated, based on various known information about proteins, and have become an important and long-standing research topic in the post-genomic era. From the point of view of molecular biology, a protein is the product of a gene after transcription, translation, and post-translational modification. Even though the real function of a gene is to encode one or more proteins that execute practical functions, the function of a gene product has usually been regarded as the native function of the gene in gene-level experiments. Therefore, we do not distinguish between gene function and protein function in this paper, and refer to both collectively as gene function.
The most common computational approach to gene function prediction is to transfer function annotations between genes on the basis of sequence or structure similarity, for example with BLAST. In addition to sequence similarity, many gene function prediction methods developed in recent years exploit additional information extracted from proteins, such as protein structure, protein motifs, biophysical properties, and integrated heterogeneous data sources. For example, Evangelia et al. extract novel shape features from protein structures in the form of local (per amino acid) distributions of angles and amino acid distances, respectively. Each of the multi-channel feature maps is fed into a deep convolutional neural network (CNN) for function prediction, and the outputs are fused through support vector machines or a correlation-based k-nearest neighbor classifier. In addition, automatic prediction using protein–protein similarity information can be further supplemented by experimental data [6,7]; this kind of method assumes that closely related proteins (or genes) share similar functional annotations on the basis of network structure information. Reviews of computational methods for gene function prediction can be found in references [8,9,10].
From the point of machine learning algorithms, predicting gene function based on various data sources is a problem of classification in nature. A gene can be viewed as an instance to be classified—various kinds of data sources (such as an amino acid sequence, textual repositories, and motifs) can be organized into a feature space, so that each gene is represented as a set of attribute values; a function (such as a gene ontology (GO) term ) is regarded as a label. As a gene is always annotated by several functions, gene function prediction is actually a process of multi-label classification: a multi-label classifier is trained firstly on constructed attribute features and annotated genes, and then is used to predict function annotations for unannotated genes. From the above analysis, we believe that many multi-label classification algorithms have great potential to predict gene function, such as a support vector machine (SVM), neural network, and decision tree. In reference , Celine Vens et al. proposed three multi-label classifiers based on a hierarchical decision tree, and the experimental results from 24 datasets show that these classifiers are powerful and effective for gene function prediction.
In addition to traditional machine learning algorithms, topic models, a kind of probabilistic generative model, have been applied to gene function prediction. Liu et al. introduced a typical multi-label supervised topic model, labeled latent Dirichlet allocation (LLDA), which was originally proposed for text mining, into gene function prediction. This research was the first effort to apply a multi-label supervised topic model to gene function prediction. Compared with traditional multi-label classification models, LLDA models each function label as a topic, and thus can not only work out the function probability distributions over gene instances effectively, but can also directly provide the word probability distributions over functions. Nonetheless, the direct application of LLDA to a gene function dataset can only utilize protein sequence data, by formalizing the sequences into a bag of words (BoW) that is then used for topic modeling. In other words, due to the restrictions of BoW construction in topic modeling, the feature space is constructed from sequence data alone rather than from multiple kinds of biological data. However, as discussed above, there are various protein features, such as hydrophobicity and the polarity of amino acids, which have a profound impact on gene structure and function. Introducing multiple kinds of gene features into a multi-label supervised topic model can therefore be expected to improve the accuracy of gene function prediction.
Inspired by the application of a multi-label topic model to gene function prediction and by a topic model conditioned on arbitrary features, named Dirichlet multinomial regression latent Dirichlet allocation (DMR-LDA), we propose a DMR-LLDA model, which introduces the DMR framework into the LLDA model. First, we formulate the gene function prediction problem for DMR-LLDA. Then the generative process and the inference algorithms of DMR-LLDA are described. The model is fully compatible with both discrete and continuous features, and its inference is relatively simple. In a five-fold cross validation experiment on verified gene function prediction, DMR-LLDA significantly outperformed LLDA. In addition, the impact of feature variables on prior parameters and the comparison between two kinds of inference algorithms are reported in the experiments. All these experimental results demonstrate the effectiveness and potential value of DMR-LLDA for predicting gene function.
2.1. Related Definitions and Notations
In this paper, we adopt the previously reported topic modeling formulation of gene function prediction. We consider each gene to be a document, and GO terms (topics) are shared across the document collection. Meanwhile, we view the extra gene features, that is, everything other than the bag of amino acid words, as metadata, analogous to the authors and dates of documents. Therefore, introducing extra gene features into topic modeling is similar to introducing metadata into the topic modeling of documents, and the metadata may be discrete or continuous. To clarify the practical application of our method, the relationship between text topic modeling and gene function prediction is illustrated in Figure 1.
In Figure 1, the right part describes the topic modeling concept of text data, and the left part describes the related concept of gene function data. For all topic models, there are three key concepts: “documents”, “words”, and “topics”. In addition, the supervised topic model introduces “labels” for each document, and the proposed DMR-LLDA model introduces “features” for each document. Therefore, these concepts can now be reformulated with more detail, as follows.
For text data (right part of Figure 1), a document collection is composed of several documents, numbered D1 to Dn. On the other side (left part of Figure 1), the gene dataset is composed of several protein sequences, numbered G1 to Gn. Therefore, a document is equivalent to a gene in our model. We suppose that there are genes in a gene set, which compose the gene space , and the gene sample set including genes can be represented as , and denotes a gene sample.
For text data (right part of Figure 1), each document is labeled by one or more tags, such as “programming” and “language”. On the other side (left part of Figure 1), each gene is annotated by several GO terms, such as “GO:0003012” and “GO:0003547”. Therefore, a document tag is equivalent to a GO term in our model, and all of them are called “labels”. In this paper, the gene function label space is expressed as . Meanwhile, the observed labels of each gene are described by a sparse binary vector , which is defined as follows:where, represents the label sub-space of gene : .
For text data (right part of Figure 1), word terms are the main components of a document, such as the words “table” and “database”. On the other side (left part of Figure 1), we consider a protein sequence to be a text string over a fixed alphabet of 20 amino acid letters (G, A, V, L, I, F, P, Y, S, C, M, N, Q, T, D, E, K, R, H, W). Correspondingly, amino acid blocks, each composed of two or more amino acid letters, such as “MS” and “TS”, are the main components of a protein sequence. Therefore, a word term is equivalent to an amino acid block in our model, and both are called “words”. All of the words together constitute a vocabulary. In this paper, the amino acid words space is represented as . For a gene , denotes that the th gene is composed of observed word samples, and is one of word samples.
For text data (right part of Figure 1) and gene function data (left part of Figure 1), a “topic” is viewed as a probability distribution over a fixed vocabulary. Taking the text data as an example, the probability of the word “table” under “topic 1” is 0.05. For the gene function data, the probability of the amino acid block “MS” under “topic 1” is 0.21. Topics are latent and need to be inferred by topic modeling. In this paper, the global topic space includes topics, which is represented as . According to the definition of an LLDA model, there is a one-to-one correspondence between labels and topics; therefore, ( represents equivalent relationship between two space), .
For text data (right part of Figure 1), the metadata of a document can be viewed as document features, such as “author” and “publish year of document”. On the other side (left part of Figure 1), each gene has several extra features in addition to its sequence string, such as molecular weight and hydrophobicity. Therefore, the metadata of a document is equivalent to the extra features of a gene in our model, and both are called “features”. In this paper, the feature space composed by gene features is expressed as . Therefore, there is a set of observed features for gene , which can be represented as a feature vector: .
In addition to the above five concepts, three other concepts are illustrated in Figure 1. The first is the BoW, which is a word–document matrix and the input of the topic model. For instance, in the right part of Figure 1, the word “table” appears twice in document D1. Likewise, the word “MS” appears once in gene G1. In other words, each element of the BoW is the number of times a word occurs in a document. The other two are the probability matrices that appear in Figure 1: the topic (label)–word probability matrix and the document (gene)–topic probability matrix. Each of them is represented in the topic model as a set of parameter vectors, one per topic or per gene.
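The BoW described above can be illustrated with a minimal Python sketch. It assumes that words are overlapping amino acid blocks of length two up to a maximum length, read with a sliding window; the exact construction follows the cited reference and may differ. The function names and the toy sequences are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "GAVLIFPYSCMNQTDEKRHW"  # the 20-letter alphabet listed in the text

def amino_acid_words(sequence, max_len=2):
    """Count overlapping amino acid blocks of length 2..max_len (assumed scheme)."""
    counts = Counter()
    for k in range(2, max_len + 1):
        for i in range(len(sequence) - k + 1):
            block = sequence[i:i + k]
            # Skip blocks that contain letters outside the 20-letter alphabet.
            if all(ch in AMINO_ACIDS for ch in block):
                counts[block] += 1
    return counts

def build_bow(genes, max_len=2):
    """Return (vocabulary, bow), where bow[g][v] is the count of word v in gene g."""
    per_gene = {name: amino_acid_words(seq, max_len) for name, seq in genes.items()}
    vocabulary = sorted(set().union(*per_gene.values()))
    bow = {name: [c[w] for w in vocabulary] for name, c in per_gene.items()}
    return vocabulary, bow

genes = {"G1": "MSTSK", "G2": "TSKK"}  # toy sequences, not real proteins
vocab, bow = build_bow(genes)
```

Under this assumed scheme the full vocabulary has at most 20² = 400 words for a maximum length of two and 400 + 20³ = 8400 words for a maximum length of three.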
A topic corresponds to a multinomial distribution of word space , whose parameter vector is , and is the probability of word under topic ; a gene corresponds to a multinomial distribution of the topics space , whose parameter vector is , and is the topic weight of topic under gene . Finally, we utilize a feature parameter vector to represent the relationship between features (f) and topics (t) in making features influence the choice of topic.
Note that the shared parameters of a whole gene set, such as topic–word parameter , are called “global parameters” in this paper. Correspondingly, the parameter of one gene is called a local parameter, such as gene–topic (label) parameter .
2.2. Overview of the Dirichlet–Multinomial Regression Latent Dirichlet Allocation Topic Modeling Process
Based on the above notation, we can provide the description of a gene function dataset as follows.
The gene is composed of , which are observed samples, and the word index of each sample comes from the vocabulary . Thus, the gene can also be represented as , where is the local word subspace of . In addition, the latent variables of gene is its topic subset , where , and is the local topic subspace, , and . Specifically, each gene shares the global topic space , where . In this case, we suppose that each word of each gene shares the same feature vector: .
Then, the topic modeling process of our model can be interpreted as follows: for the training set, learning the unknown parameter , , and from the observed variables , , and ; for the testing set, predicting and from known parameters and , and the observed variables and . Obviously, and are global parameters, which are shared by the whole dataset. These two steps are called model training and prediction, and are realized by learning and inference algorithms, such as Gibbs sampling  and variational inference .
Moreover, there are two steps before model training and prediction: BoW construction and model description. Since we constructed the BoW of the genes in exactly the same way as reference , this step is not repeated in this paper. For model description, there are usually two ways to describe a probabilistic graphical model, namely the generative process and the graphical model, which are discussed in the next sections. An overview of our topic modeling process is depicted in Figure 2.
2.3. Description of the Dirichlet–Multinomial Regression Latent Dirichlet Allocation Model
This section provides the description of DMR-LLDA, including its generative process and graphical model. It is worth noting that DMR-LLDA introduces the DMR framework for gene features on top of the LLDA model, so this paper emphasizes the DMR part rather than the classic LLDA.
According to the DMR framework, each word sample of gene is an “individual”, and all of the samples are divided into groups by word number . A bag of words is composed of (the number of times word appears in gene ) samples of the -th group, and corresponds to a feature vector , which influences the latent topic choice of all samples.
We suppose that
In Equation (2), represents feature parameters that correspond to topic . Likewise, each bag of words of gene shares the same clustering random variable:where is the selecting probability of the n-th word sample of gene , which chooses topic and maximizes the utility selection . In addition, is the topic weight vector of gene , which obeys the Dirichlet distribution of parameter :where is the hyper-parameter of :
The description of DMR-LLDA from the global and local perspective is shown below.
From the global perspective, each topic can be represented as a multinomial distribution over vocabulary , whose parameter is expressed as vector , and we suppose that obeys Dirichlet conjugate prior distribution. Each topic corresponds to a feature weight parameter vector , which obeys the normal distribution of parameter .
From the local perspective, each gene is composed by observed samples, which corresponds to local word number subset and local latent topic number subset , where obeys multinomial distribution of parameter . The local observed word subspace of gene is , the local observed label subspace is , and the local observed feature subspace is . Each label corresponds to a topic , where and . The dimension of topic weight corresponds to , which is . At the same time, the range of topics on feature weight parameter vector is limited to . In addition, decides the hyper-parameter of , which is the dot-product of feature vector , corresponding to feature subspace and its weighted parameter vector .
Above all, the Dirichlet prior hyper-parameter of can be expressed aswhere is computed by Equation (5). The local topic weight can be also represented as
Given the above, the generative process of DMR-LLDA can be described as follows. The corresponding graphical model is shown in Figure 3.
For each global topic , we can
(a) Generate a feature weighted parameter vector of topic from dimension’s normal distribution of parameter :
(b) Generate a multinomial parameter vector from a dimension Dirichlet distribution:
For each gene , . This means that
(a) We suppose that () as the Dirichlet prior hyper-parameter of the topic weight
(b) The binary vector limits the prior hyper-parameter of local topic weight
(c) We can generate local weight topic vector of topic from a Dirichlet distribution:
(d) For each word sample , we can
i. Generate topic number of from dimensions’ multinomial distribution of parameter :
ii. Generate word number of from dimensions’ multinomial distribution of parameter :
As we can see from Figure 3, is computed by feature vector and its weighted parameter. Therefore, is a parameter rather than a random variable in the LLDA.
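The generative process above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes the usual DMR form in which the Dirichlet hyper-parameter of a topic is the exponential of the dot product between the gene's feature vector and the topic's feature weights, and it restricts topics to the gene's observed labels. All names (`dmr_prior`, `generate_gene`, `lam`, `phi`) and the toy sizes are hypothetical.

```python
import math
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution via normalized Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def dmr_prior(features, lam, labels):
    """alpha_t = exp(x . lambda_t) for labeled topics (assumed DMR form); None otherwise."""
    return [
        math.exp(sum(x * l for x, l in zip(features, lam[t]))) if labels[t] else None
        for t in range(len(lam))
    ]

def generate_gene(features, labels, lam, phi, n_words, rng):
    """Generate (topic index, word index) pairs for one gene."""
    alpha = dmr_prior(features, lam, labels)
    active = [t for t, a in enumerate(alpha) if a is not None]  # label restriction
    theta = sample_dirichlet([alpha[t] for t in active], rng)   # local topic weights
    samples = []
    for _ in range(n_words):
        z = rng.choices(active, weights=theta)[0]               # topic of this sample
        w = rng.choices(range(len(phi[z])), weights=phi[z])[0]  # word of this sample
        samples.append((z, w))
    return samples

rng = random.Random(0)
T, V, F = 3, 5, 2                # topics (labels), vocabulary size, features (toy sizes)
mu, sigma, beta = 0.0, 1.0, 0.5  # illustrative hyper-parameter values
lam = [[rng.gauss(mu, sigma) for _ in range(F)] for _ in range(T)]  # global step (a)
phi = [sample_dirichlet([beta] * V, rng) for _ in range(T)]         # global step (b)
samples = generate_gene([0.5, 1.0], [1, 0, 1], lam, phi, n_words=10, rng=rng)
```

Because the second label is switched off in the toy binary vector, only topics 0 and 2 can be drawn for this gene.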
In our DMR-LLDA model, the unknown parameters to be estimated are the global feature parameter , the global topic–word multinomial distribution parameter , and the local topic weight . The hidden variable to be estimated is . The known data are the observed word samples and binary vector . The joint distribution of is shown in Equation (15):
Above all, the proposed method utilizes extra features as the prior knowledge of the related distribution, which is able to gain more reliable prior distribution for the LLDA; then a more precise estimation of posterior distributions is obtained.
2.4. Inference Algorithm of Dirichlet–Multinomial Regression Latent Dirichlet Allocation
The core learning task of DMR-LLDA is to compute the parameters and posterior distribution . The posterior estimation represents the estimated value of the parameter on the training set. In the prediction process of DMR-LLDA, on the basis of the three estimated parameters and a hidden variable, we update the unknown local parameter and hidden variable of the test gene while fixing the learned global parameters and ; then, we obtain the corresponding relationship between the label and the topic. Gibbs sampling and variational Bayes are two general-purpose approximate inference algorithms for probabilistic graphical models. In order to compare the impact of different inference algorithms on model performance, we designed a collapsed Gibbs sampling algorithm (CGS), a collapsed variational Bayesian algorithm (CVB), and a zero-order collapsed variational Bayesian algorithm (CVB0) for DMR-LLDA, as detailed below.
2.4.1. The Collapsed Construction of Dirichlet–Multinomial Regression Latent Dirichlet Allocation
First of all, after the integration of model parameters in a joint distribution, a semi-collapsed joint distribution is obtained:
The predictive probability distribution for the topic assignment of sample is
is the number of samples that are assigned to the corresponding topic of gene , except for sample . is the number of samples that are assigned to the word of topic , except for sample ; therefore, .
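As a hedged illustration of the predictive distribution built from these counts, the sketch below uses the standard collapsed form for LDA-type models: the gene–topic count plus the DMR hyper-parameter, multiplied by the smoothed topic–word ratio, restricted to the gene's labeled topics. The function and variable names are hypothetical, and the counts passed in are assumed to already exclude the current sample.

```python
import random

def topic_conditional(n_dt, n_tw, n_t, alpha_d, beta, word, allowed_topics):
    """Normalized p(z = t | rest) over the labeled topics of one gene.

    n_dt[t]    : samples of this gene assigned to topic t
    n_tw[t][w] : samples of word w assigned to topic t (whole corpus)
    n_t[t]     : samples assigned to topic t (whole corpus)
    """
    vocab_size = len(n_tw[0])
    weights = [
        (n_dt[t] + alpha_d[t]) * (n_tw[t][word] + beta) / (n_t[t] + vocab_size * beta)
        for t in allowed_topics
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Toy counts for one gene labeled with topics 0 and 2.
n_dt = [2, 0, 1]
n_tw = [[1, 1], [0, 0], [0, 1]]
n_t = [2, 0, 1]
alpha_d = [1.0, 1.0, 3.0]
probs = topic_conditional(n_dt, n_tw, n_t, alpha_d, beta=0.5, word=1, allowed_topics=[0, 2])

# One Gibbs step: draw the new topic of the sample from the conditional.
new_topic = random.Random(0).choices([0, 2], weights=probs)[0]
```

In a full sampler this step would be repeated for every sample, with the counts decremented before the draw and incremented afterwards.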
In Equation (17), is optimized by local observed feature vector and global feature parameter , whose updating equation is Equation (5). To simplify the updating equation, we first suppose that , and then an item of hidden global feature parameter is added for global feature parameter , which corresponds to a “fake” observed feature . Thus, the updating equation of is
2.4.2. The Optimization of the Feature Parameters of Dirichlet–Multinomial Regression Latent Dirichlet Allocation
For Gibbs sampling or variational Bayes, we need to update the global feature parameter in the inference process. We adopted gradient descent for optimizing .
In Equation (16), the -related section is
Based on the logarithm of Equation (20), we take the derivative with respect to global feature parameter and set it to zero. The updating equation of is
Finally, is updated by Equation (18).
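For orientation, the gradient commonly used in the DMR framework can be written as follows. This is the standard DMR form, given here under assumed notation rather than as the paper's own Equations (20) and (21): $\mathbf{x}_d$ is the feature vector of gene $d$, $\boldsymbol{\lambda}_t$ the feature weight vector of topic $t$, $n_{d,t}$ the number of samples of gene $d$ assigned to topic $t$, $n_d = \sum_{t'} n_{d,t'}$, $\Psi$ the digamma function, and $\mu$, $\sigma^2$ the mean and variance of the normal prior. In DMR-LLDA the sums over $t'$ would run only over the labeled topics of gene $d$.

```latex
\alpha_{d,t} = \exp\!\left(\mathbf{x}_d^{\top}\boldsymbol{\lambda}_t\right),
\qquad
\frac{\partial \ell}{\partial \lambda_{t,k}}
= \sum_{d} x_{d,k}\,\alpha_{d,t}
\Big[
  \Psi\Big(\sum_{t'}\alpha_{d,t'}\Big)
  - \Psi\Big(\sum_{t'}\alpha_{d,t'} + n_d\Big)
  + \Psi\big(\alpha_{d,t} + n_{d,t}\big)
  - \Psi\big(\alpha_{d,t}\big)
\Big]
- \frac{\lambda_{t,k}-\mu}{\sigma^{2}}
```

Here $\ell$ denotes the log of the terms of the collapsed joint distribution that depend on $\boldsymbol{\lambda}$; a gradient-based optimizer ascends $\ell$ or, equivalently, descends $-\ell$.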
2.4.3. The Collapsed Gibbs Sampling Algorithm of the Dirichlet–Multinomial Regression Latent Dirichlet Allocation
To determine the initial state of the Markov chain, we initiate the hidden topic number of each sample first; then, we utilize the predictive probability of hidden variable from Equation (17) as the state transition probability of the Markov chain. In the process of Gibbs sampling, the topic number of each sample is updated, and the hyper-parameter is also updated by Equation (18). Finally, the global feature parameter is updated by Equation (21).
After several iterations of burn-in, the Markov chain converges to the target distribution, and then the posterior distribution is estimated. The posterior estimates of the local topic weight and topic–word multinomial distribution parameter are
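The posterior estimates can be sketched as smoothed, normalized counts. The snippet assumes the usual collapsed-sampler estimators (counts plus hyper-parameters, normalized), which may differ in detail from the paper's equations; the function names and toy counts are hypothetical.

```python
def estimate_theta(n_dt, alpha_d, allowed_topics):
    """Local topic weights of one gene over its labeled topics."""
    denom = sum(n_dt[t] + alpha_d[t] for t in allowed_topics)
    return {t: (n_dt[t] + alpha_d[t]) / denom for t in allowed_topics}

def estimate_phi(n_tw, beta):
    """Topic-word distributions from corpus-wide topic-word counts."""
    rows = []
    for counts in n_tw:
        denom = sum(counts) + len(counts) * beta
        rows.append([(c + beta) / denom for c in counts])
    return rows

theta = estimate_theta([2, 0, 1], [1.0, 1.0, 3.0], allowed_topics=[0, 2])
phi = estimate_phi([[1, 1], [0, 0], [0, 1]], beta=0.5)
```

In practice these estimates would be averaged over the states recorded after burn-in.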
2.4.4. The Collapsed Variational Bayesian Inference Algorithm of the Dirichlet–Multinomial Regression Latent Dirichlet Allocation
The whole variational objective function before being collapsed is
After marginalizing out the model parameters , the objective function is given below, where represents the term that is unrelated to the variational distribution . There are two kinds of construction below:
In Equation (26), the updating equation of optimal variational parameter by a CVB algorithm is
Each expectation of the above equation is
In Equation (27), the updating equation of the optimal variational parameter by CVB0 algorithm is
The sufficient statistics of the samples in Equation (35) are , , and , and their expectations under the variational distribution are
The in the above equation can be adapted to , because the bag of words not only shares a similar word number , but also shares the same topic number . Then, optimal variational distribution can be adapted to .
The inference equation difference between CVB and CVB0 shows that CVB only retains the zero-order information of the Taylor expansion; however, CVB0 is the re-collapse of a hidden variable space based on Jensen inequality. Therefore, CVB0 is much more precise than CVB. The corresponding algorithm of CVB and CVB0 are shown in Table 1 and Table 2 respectively.
3. Materials and Results
This section provides a concise and precise description of the experimental results, their interpretation, and the experimental conclusions that can be drawn.
In this paper, the validity and accuracy of the proposed models are tested on the S. cerevisiae (S.C) dataset, which is introduced in reference . This dataset covers several aspects of the yeast genome, such as sequence statistics, phenotype, expression, secondary structure, and homology. Two function annotation standards, FunCat and GO, are used to annotate gene function. Owing to the universality of GO, the GO-annotated version of the dataset is adopted in our experiments. As described in Section 2.1, the construction of the BoW is based on amino acid composition, so we mainly use the dataset that is based on the sequence statistics. In addition, we construct a dataset named S.C-CC from S.C, which only includes the GO terms belonging to the cellular component (CC) ontology. Therefore, there are fewer GO terms in the S.C-CC dataset than in the S.C dataset, and both are used in our experiments to investigate the influence of the number of labels on prediction performance. The statistics of the S.C and S.C-CC datasets are shown in Table 3, where denotes the number of GO terms, denotes the number of genes, and denotes the size of the vocabulary.
As shown in Table 3, there are 1692 genes and 4133 function labels in the S.C dataset; in the S.C-CC dataset, there are 1692 genes and 547 function labels. Due to the large number of GO terms in the gene function dataset, we utilized a Boolean matrix decomposition (BMD) method to reduce the dimensionality of the function labels. BMD is a kind of label space dimension reduction (LSDR) method , which addresses the multi-label classification problem with many labels. LSDR approaches use a compression step to transform the original high dimension label space into a lower dimension label space, and then multi-label classifiers are trained on a dataset with fewer labels, which can reduce the computation burden of the classifier. The existing studies about LSDR show that LSDR approaches are useful for optimizing the running time and accuracy of multi-label classification. In our BMD process, original label matrix ( denotes the number of genes, and denotes the number of features) is decomposed into the product of two matrices, ( denotes the number of labels) and , where ( denotes Boolean product) is satisfied. We also called it exact BMD and adopted this algorithm, which is proposed in reference . Compared with other LSDR algorithms, an exact BMD can retain the interpretability of low dimension label space and restore the low dimension-predicted label matrix to the original label matrix by matrix . At last, the number of function labels is reduced into a smaller dimension, and denotes the number of GO terms after label space dimension reducing. Then DMR-LLDA actually needs to process 1358 GO terms of the S.C dataset and 319 GO terms of the S.C-CC dataset. Nonetheless, the lower dimensional label space can be recovered by a Boolean product after predicting, so we still get the whole function labels sets in the prediction results.
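The role of the Boolean product in exact BMD can be illustrated with a toy example. The matrices below are made up for illustration, and the names `Y`, `C`, and `B` are hypothetical stand-ins for the original label matrix and its two factors; the decomposition algorithm itself is not shown.

```python
def boolean_product(C, B):
    """Boolean matrix product: out[i][j] = OR over k of (C[i][k] AND B[k][j])."""
    return [
        [int(any(C[i][k] and B[k][j] for k in range(len(B)))) for j in range(len(B[0]))]
        for i in range(len(C))
    ]

# Toy data: 3 genes, 3 original labels, 2 reduced labels.
Y = [[1, 1, 0],
     [1, 1, 1],
     [0, 0, 1]]   # genes x original labels
C = [[1, 0],
     [1, 1],
     [0, 1]]      # genes x reduced labels (what the classifier is trained on)
B = [[1, 1, 0],
     [0, 0, 1]]   # reduced labels x original labels (used to restore predictions)

restored = boolean_product(C, B)
```

In this toy case the decomposition is exact, so predictions made in the two-label space map back to the three original labels without loss.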
DMR-LLDA’s advantage over LLDA lies in the introduction of extra features. In the S.C and S.C-CC datasets, there are six extra gene features for each gene, including the molecular weight of the gene, the isoelectric point, the average hydrophilicity coefficient, the number of exons, the codon adaptation index, the number of motifs, and the open reading frame (ORF) number of chromosomes. The statistics of the extra features are shown in Table 4.
As the maximum word length of the S.C dataset is two amino acid letters, a human dataset that we constructed ourselves is adopted to evaluate the impact of word length on topic model performance. This human dataset is constructed in a similar way to that in reference . We constructed two human datasets with different word lengths: the maximum word length of the Human1 dataset is two amino acid letters, and that of the Human2 dataset is three. For the Human2 dataset, the original number of words is 8400, but we filtered out several high-frequency words. The statistics of the S.C and S.C-CC dataset are shown in Table 5.
3.2. Parameter Settings and Evaluation Criteria
The DMR-LLDA learning framework involves four different parameters: , , , and . The and are the parameters of the two-Dirichlet distribution, where the larger the value of , the more balanced the probability of a word in a topic. The setting of the value has been discussed in reference . Nonetheless, the value of is optimized by a protein feature, so its initial value does not have a big effect on model performance. According to the experience, we set as the initial value, and set , with . In addition, and are respectively the mean and variance of normal distribution, obeyed by feature weighted parameter , and we set .
In the Gibbs sampling process of model training, we run a single Markov chain with a maximum of 2000 iterations, of which the first 1000 are burn-in. We then record the state of the converged Markov chain every 50 iterations, giving 20 recorded samples. In the process of model prediction, we set the number of iterations to 1000. After 500 burn-in iterations, we record the state every 50 iterations. In the variational Bayesian inference process of model training, we initialize the global variational parameter through random number and hyper-parameter : ; in each local variational inference, we set the convergence threshold to 0.00001 and the maximum number of local inference iterations to 100. The number of global scanning iterations is 1000.
The five-fold cross validation is conducted to measure and compare the performance of DMR-LLDA and the comparative algorithms. Five representative multi-label learning evaluation criteria are used in this paper, including hamming loss (HL), average precision (AP), one error, and micro-averaged and macro-averaged F1 scores (Micro-F1 and Macro-F1). In addition, three kinds of areas under a precision–recall curve are also used, including , , and , which is proposed in reference . Finally, we repeat the random partition and evaluation in five independent rounds, and report the average results.
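Several of the criteria above can be sketched in a few lines of Python. The definitions used here are the common ones (fraction of wrong label entries for HL, fraction of genes whose top-ranked label is not a true label for one error, pooled versus per-label averaging for the F1 scores) and are assumptions to the extent that the paper's exact variants are not reproduced in the text; labels with no positives are assigned an F1 of zero. Function names and toy data are hypothetical.

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of gene-label entries that are predicted incorrectly."""
    total = sum(len(row) for row in Y_true)
    wrong = sum(t != p for rt, rp in zip(Y_true, Y_pred) for t, p in zip(rt, rp))
    return wrong / total

def one_error(Y_true, scores):
    """Fraction of genes whose highest-scored label is not a true label."""
    misses = 0
    for rt, rs in zip(Y_true, scores):
        top = max(range(len(rs)), key=rs.__getitem__)
        misses += (rt[top] == 0)
    return misses / len(Y_true)

def _f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(Y_true, Y_pred):
    """Micro-F1 pools counts over labels; Macro-F1 averages per-label F1."""
    stats = []
    for j in range(len(Y_true[0])):
        pairs = [(rt[j], rp[j]) for rt, rp in zip(Y_true, Y_pred)]
        tp = sum(t == 1 and p == 1 for t, p in pairs)
        fp = sum(t == 0 and p == 1 for t, p in pairs)
        fn = sum(t == 1 and p == 0 for t, p in pairs)
        stats.append((tp, fp, fn))
    micro = _f1(*(sum(s[i] for s in stats) for i in range(3)))
    macro = sum(_f1(*s) for s in stats) / len(stats)
    return micro, macro

# Toy example: 2 genes, 3 labels.
Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 0, 0], [0, 1, 1]]
scores = [[0.9, 0.1, 0.4], [0.2, 0.3, 0.8]]
hl = hamming_loss(Y_true, Y_pred)
oe = one_error(Y_true, scores)
micro, macro = micro_macro_f1(Y_true, Y_pred)
```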
3.3. The Impact of Word Length on Model Performance
Firstly, the performance comparison of the LLDA model between the Human1 and Human2 datasets is shown in Figure 4. As shown in Figure 4, we find that the value of and in Human1 is higher than that in Human2; the value of the AP on Human1 is lower than that of Human2; and the value of one error, HL, and is almost equal to that of Human1 and Human2. These results show that the classification performance of the LLDA on Human1 and Human2 is almost the same, which reveals that the larger word space might not obtain a better classifying performance.
Moreover, related studies suggested that a word length of more than four amino acid alphabets would not improve the classification accuracy, and would only increase the complexity of computation . Therefore, in the following experiments, we only adopt the S.C and S.C-CC datasets whose word length is two amino acid alphabets.
3.4. Gene Function Prediction with Cross Validation
In addition to LLDA, we also compared against three widely used methods: multi-label k-nearest neighbor (MLKNN) , back propagation for multi-label learning (BPMLL) , and support vector machines (SVMs). MLKNN and BPMLL are two representative multi-label classifiers, and both are available in an open source tool called Mulan . The SVMs adopt a “one-versus-all” scheme, which trains an independent binary SVM for each label, and are implemented using the LibLinear software package . These five models are trained and used for prediction on the S.C and S.C-CC datasets. Figure 5 shows the HL, AP, one error, Micro-F1, Macro-F1, , , and values of all models in the two datasets, respectively. For AP, Micro-F1, Macro-F1, , , and , the larger the value, the better the performance. Conversely, for HL and one error, the smaller the value, the better the performance. The red asterisks in Figure 5 mark the best results in each dataset. It is worth noting that the experimental results of this section are obtained with the CGS inference algorithm.
As shown in Figure 5, DMR-LLDA can achieve better results in almost all evaluation criteria for the two datasets. The concrete analysis is introduced as follows.
For the S.C dataset, the DMR-LLDA achieves the best performance for AP, , , , Micro-F1, and Macro-F1. For example, with AP, the DMR-LLDA achieves 94%, 3.3%, 96%, and 26% improvements over MLKNN, LLDA, BPMLL, and SVMs, respectively. With , the DMR-LLDA achieves 109%, 2.3%, 89%, and 24% improvements over MLKNN, LLDA, BPMLL, and SVMs, respectively. For , the DMR-LLDA achieves 31%, 39%, 44%, and 25% improvements over MLKNN, LLDA, BPMLL, and SVMs, respectively. For , the DMR-LLDA achieves 33%, 8.1%, 48%, and 10% improvements over MLKNN, LLDA, BPMLL, and SVMs, respectively. With Micro-F1, the DMR-LLDA achieves 116%, 6.1%, 123%, and 29% improvements over MLKNN, LLDA, BPMLL, and SVMs, respectively. On Macro-F1, DMR-LLDA achieves 22%, 2.9%, 24%, and 25% improvements over MLKNN, LLDA, BPMLL, and SVMs, respectively. Nevertheless, for one error and HL, SVMs get better results than the DMR-LLDA.
For the S.C-CC dataset, the DMR-LLDA obtains better performance in terms of AP, , , , Micro-F1, and Macro-F1. For AP, the DMR-LLDA achieves 36%, 1.7%, 39%, and 30% improvements over MLKNN, LLDA, BPMLL, and SVMs, respectively. For , the DMR-LLDA achieves 68%, 4%, 64%, and 20% improvements over MLKNN, LLDA, BPMLL, and SVMs, respectively. For , the DMR-LLDA achieves 73%, 35%, 62%, and 34% improvements over MLKNN, LLDA, BPMLL, and SVMs, respectively. For , the DMR-LLDA achieves 67%, 6.4%, 92%, and 23% improvements over MLKNN, LLDA, BPMLL, and SVMs, respectively. For Micro-F1, the DMR-LLDA achieves 101%, 4.1%, 114%, and 26% improvements over MLKNN, LLDA, BPMLL, and SVMs, respectively. For Macro-F1, the DMR-LLDA achieves 18%, 1.8%, 16%, and 20% improvements over MLKNN, LLDA, BPMLL, and SVMs, respectively. Nevertheless, for one error, BPMLL gets better results than the DMR-LLDA; for HL, SVMs get better results than the DMR-LLDA.
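The percentages above read most naturally as relative improvements over each baseline, although the text does not state the formula explicitly. Under that assumption, a minimal sketch of the arithmetic is given below; the two metric values are hypothetical and are not read from Figure 5.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent (assumed definition)."""
    return 100.0 * (new - baseline) / baseline


# Hypothetical metric values, used only to illustrate the arithmetic.
example_gain = relative_improvement(0.62, 0.60)
```

With these toy values the gain is roughly 3.3%, which shows that a small absolute difference over a strong baseline and a large percentage over a weak baseline are not directly comparable.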
For both datasets, the improvements on are more significant than those on and , which indicates that the DMR-LLDA has a stronger effect on the overall accuracy of gene function prediction, irrespective of label weights. Comparing the S.C and S.C-CC datasets, we find that the values of AP, , , and are lower on the S.C dataset than on the S.C-CC dataset, while the values of one error and HL are higher on S.C than on S.C-CC. This is because the two datasets share the same word space but differ in the number of labels: the smaller label set of the S.C-CC dataset favors higher classification performance.
Overall, these results indicate that the DMR-LLDA further improves the accuracy of gene function prediction by introducing the DMR framework into the LLDA model, which optimizes the hyper-parameters of the topic weights. The DMR-LLDA also shows a clear advantage in overall prediction accuracy.
3.5. The Impact of Feature Variables on Prior Parameters
In the DMR-LLDA model, gene features are introduced through the feature weight parameter. The operation on the topic parameter vector and the feature vector is then reflected in the Dirichlet hyper-parameter. Table 6 shows the impact of different feature values on the prior parameters.
For the LLDA, the hyper-parameter is set to a fixed value. In the DMR-LLDA, by contrast, Table 6 shows that changing the values of mol_wt, theo_pI, hydro, and Cai (the only features that differ between the two settings) produces a markedly different hyper-parameter value, which is the main way in which gene features influence label allocation.
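The dependence of the prior on gene features can be sketched as follows. In the DMR framework, the Dirichlet hyper-parameter of a topic is commonly written as the exponential of the inner product between the feature vector of a gene and the weight vector of that topic. The code below is a minimal sketch of that construction only; the names `features` and `weights`, the intercept term, the assumption that features are rescaled beforehand, and all numeric values are hypothetical.

```python
import math
from typing import List


def dmr_alpha(features: List[float], weights: List[List[float]]) -> List[float]:
    """Return one Dirichlet hyper-parameter per topic: exp(features . weights[t])."""
    return [math.exp(sum(x * w for x, w in zip(features, w_t))) for w_t in weights]


# Hypothetical, already rescaled features: [intercept, hydro, Cai].
# Raw values such as a molecular weight in the tens of thousands would need
# rescaling first, otherwise exp() overflows or saturates.
gene_a = [1.0, 0.5, -1.0]
gene_b = [1.0, -0.5, 1.0]

# Hypothetical weights for two topics (labels).
weights = [
    [0.0, 1.0, 0.0],   # topic 0 responds to hydro
    [-1.0, 0.0, 1.0],  # topic 1 responds to Cai, with a negative intercept
]

alpha_a = dmr_alpha(gene_a, weights)
alpha_b = dmr_alpha(gene_b, weights)
```

Because the exponential is always positive, every gene receives a valid Dirichlet hyper-parameter, and two genes with different feature values receive different priors over the same topics, in contrast to the single fixed value used by the LLDA.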
3.6. The Comparison Results of Inference Algorithms
We designed three inference algorithms for the DMR-LLDA: CGS, CVB, and CVB0. This section compares CGS with CVB0 on the S.C and S.C-CC datasets. As shown in Figure 6, the overall performance of CVB0 is better than that of CGS. A detailed analysis follows.
For the S.C dataset, CVB0 achieves the best results in AP, , , and , and achieves similar results in HL. However, CVB0 has a worse value in one error.
For the S.C-CC dataset, CVB0 achieves the best results in AP, one error, , , and , and achieves similar results in HL. The above results demonstrate the validity of the inference algorithms designed for the DMR-LLDA. The experimental results also indicate that the CVB0 inference algorithm can obtain more precise predictions through the re-collapse of the hidden variable space based on Jensen's inequality.
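For readers unfamiliar with CVB0, the sketch below shows the zero-order update for a single word token as it is commonly written for LDA-style models, restricted to the labels of the gene as in labeled LDA. It is not a transcription of the update equations of this paper, which are not reproduced in this section; the function name, argument names, and toy counts are all hypothetical.

```python
from typing import Dict, List


def cvb0_token_update(
    labels: List[int],
    doc_topic: Dict[int, float],
    topic_word: Dict[int, Dict[str, float]],
    topic_total: Dict[int, float],
    alpha: Dict[int, float],
    beta: float,
    vocab_size: int,
    word: str,
    old_gamma: Dict[int, float],
) -> Dict[int, float]:
    """Return the new topic responsibilities of one token over the gene's labels."""
    unnormalized = {}
    for k in labels:
        prev = old_gamma.get(k, 0.0)
        # Expected counts with this token's previous contribution removed.
        n_dk = doc_topic[k] - prev
        n_kw = topic_word[k].get(word, 0.0) - prev
        n_k = topic_total[k] - prev
        unnormalized[k] = (n_dk + alpha[k]) * (n_kw + beta) / (n_k + vocab_size * beta)
    norm = sum(unnormalized.values())
    return {k: v / norm for k, v in unnormalized.items()}


# Hypothetical toy state: one gene with labels 0 and 1, one token of the word "ST".
gamma = cvb0_token_update(
    labels=[0, 1],
    doc_topic={0: 2.0, 1: 1.0},
    topic_word={0: {"ST": 1.5}, 1: {"ST": 0.5}},
    topic_total={0: 10.0, 1: 10.0},
    alpha={0: 1.0, 1: 1.0},  # in a DMR-style model these would depend on gene features
    beta=0.1,
    vocab_size=400,
    word="ST",
    old_gamma={0: 0.5, 1: 0.5},
)
```

Unlike CGS, which samples one topic per token, this update keeps a full distribution over the allowed labels and works with expected counts, so it is deterministic for a fixed initialization.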
Finally, because prior knowledge is often lacking, the prior distributions of a Bayesian model are usually chosen for convenience, and their parameters are set to fixed values based on experience, which leads to inaccurate estimation of the posterior distributions. In our DMR-LLDA model, gene feature information is introduced into the LLDA as prior knowledge through the DMR framework. The hyper-parameter of the prior distribution is updated during inference rather than held at a fixed constant, which improves the estimation of the posterior distributions and, in turn, the accuracy of gene function prediction.
In this paper, we introduce multiple types of features into gene function prediction based on a multi-label supervised topic model, and propose a multi-label supervised topic model conditioned on arbitrary features, named the DMR-LLDA. By applying an exponential prior, constructed from weighted features, to the hyper-parameters of the gene-topic (or label) distribution, this model can utilize the observed features of each gene in multi-label topic modeling. Furthermore, three learning algorithms are designed for this model: CGS, CVB inference, and CVB0 inference. The predictive performance of this model is measured by the AP, one error, Hamming loss, , , , Micro-F1, and Macro-F1. Experiments on a standard dataset show that the DMR-LLDA is superior to the LLDA, MLKNN, BPMLL, and SVM models. The experimental results also show that the DMR-LLDA obtains a more accurate estimate of the posterior distribution, because it uses gene feature information in addition to the amino acid sequence.
Conceptualization: L.L. and W.Z.; methodology: L.L.; software: L.L., L.T., and X.J.; validation: L.L. and L.T.; formal analysis: L.L.; writing (original draft preparation): L.L. and L.T.; writing (review and editing): L.T. and X.J.
This research was funded by the National Natural Science Foundation of China (grant number 61862067) and the Doctor Science Foundation of Yunnan Normal University (no. 01000205020503090, no. 2016zb009).
We would like to thank the researchers in the State Key Laboratory of Conservation and Utilization of Bio-Resources, Yunnan University (Kunming, China). Their very helpful comments and suggestions have led to an improved version of the paper.
Conflicts of Interest
The authors declare no conflict of interest.
- Pandey, G.; Kumar, V.; Steinbach, M. Computational Approaches for Gene Function Prediction: A Survey; Department of Computer Science and Engineering, University of Minnesota: Minneapolis, MN, USA, 2006.
- Altschul, S.F.; Gish, W.; Miller, W.; Myers, E.W.; Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 1990, 215, 403–410.
- Zacharaki, E.I. Prediction of gene function using a deep convolutional neural network ensemble. PeerJ Comput. Sci. 2017, 3, e124.
- Ofer, D.; Linial, M. ProFET: Feature engineering captures high-level protein functions. Bioinformatics 2015, 31, 3429–3436.
- Yu, G.; Rangwala, H.; Domeniconi, C.; Zhang, G.; Zhang, Z. Predicting gene function using multiple kernels. IEEE/ACM Trans. Comput. Biol. Bioinform. 2015, 12, 219–233.
- Cao, R.; Cheng, J. Integrated protein function prediction by mining function associations, sequences, and protein-protein and gene-gene interaction networks. Methods 2016, 93, 84–91.
- Vascon, S.; Frasca, M.; Tripodi, R.; Valentini, G.; Pelillo, M. Protein Function Prediction as a Graph-Transduction Game. Pattern Recogn. Lett. 2018.
- Radivojac, P.; Clark, W.T.; Oron, T.R.; Schnoes, A.M.; Wittkop, T.; Sokolov, A.; Graim, K.; Funk, C.; Verspoor, K.; Ben-Hur, A.; et al. A large-scale evaluation of computational gene function prediction. Nat. Methods 2013, 10, 221–227.
- Shehu, A.; Barbará, D.; Molloy, K. A Survey of Computational Methods for Gene Function Prediction. In Big Data Analytics in Genomics; Springer International Publishing: New York, NY, USA, 2016; pp. 225–298.
- Lobb, B.; Doxey, A.C. Novel function discovery through sequence and structural data mining. Curr. Opin. Struct. Biol. 2016, 38, 53–61.
- Njah, H.; Jamoussi, S.; Mahdi, W.; Elati, M. A Bayesian approach to construct Context-Specific Gene Ontology: Application to protein function prediction. In Proceedings of the 2016 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Chiang Mai, Thailand, 5–7 October 2016.
- Vens, C.; Struyf, J.; Schietgat, L.; Džeroski, S.; Blockeel, H. Decision trees for hierarchical multi-label classification. Mach. Learn. 2008, 73, 185–214.
- Liu, L.; Tang, L.; He, L.; Yao, S.; Zhou, W. Predicting gene function via multi-label supervised topic model on gene ontology. Biotechnol. Biotechnol. Equip. 2017, 31, 1–9.
- Ramage, D.; Hall, D.; Nallapati, R.; Manning, C. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, Singapore, 6–7 August 2009; pp. 248–256.
- Mimno, D.; McCallum, A. Topic Models Conditioned on Arbitrary Features with Dirichlet-Multinomial Regression; University of Massachusetts: Amherst, MA, USA, 2012; pp. 411–418.
- La Rosa, M.; Fiannaca, A.; Rizzo, R.; Urso, A. Probabilistic topic modeling for the analysis and classification of genomic sequences. BMC Bioinform. 2015, 16, S2.
- Casella, G.; George, E.I. Explaining the Gibbs Sampler. Am. Stat. 1992, 46, 167–174.
- Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational Inference: A Review for Statisticians. J. Am. Stat. Assoc. 2018, 112, 859–877.
- Tai, F.; Lin, H.T. Multilabel Classification with Principal Label Space Transformation. Neural Comput. 2012, 24, 2508.
- Sun, Y.; Ye, S.; Sun, Y.; Kameda, T. Improved algorithms for exact and approximate Boolean matrix decomposition. In Proceedings of the IEEE International Conference on Data Science and Advanced Analytics (DSAA), Paris, France, 19–21 October 2015; pp. 1–10.
- Yang, Y. Research on Biological Sequence Classification Based on Machine Learning Methods; Shanghai Jiao Tong University: Shanghai, China, 2009.
- Zhang, M.L.; Zhou, Z.H. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recogn. 2007, 40, 2038–2048.
- Zhang, M.L.; Zhou, Z.H. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Trans. Knowl. Data Eng. 2006, 18, 1338–1351.
- Tsoumakas, G.; Katakis, I.; Vlahavas, I. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook; Springer: New York, NY, USA, 2009; pp. 667–685.
- Fan, R.E.; Chang, K.W.; Hsieh, C.J.; Wang, X.R.; Lin, C.J. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res. 2008, 9, 1871–1874.
Figure 1. The relationship between protein function prediction and text topic modeling. IP, CP, TS, MS, and so on represent ‘words’, each of which is composed of two amino acid letters. Each GO term starts with ‘GO:’.
Figure 2. An overview of the topic modeling process.
Figure 3. The graphical model of Dirichlet multinomial regression labeled latent Dirichlet allocation (DMR-LLDA).
Figure 4. The comparisons between the Human1 and Human2 datasets.
Figure 5. The comparisons between DMR-LLDA, LLDA, back propagation for multi-label learning (BPMLL), support vector machines (SVMs), and multi-label k-nearest neighbor (MLKNN) in two datasets. The red asterisk represents the best results in each dataset.
Figure 6. The comparison results with CVB0 and CGS. The red asterisk represents the best results in each dataset.
Table 1. Collapsed variational Bayesian (CVB) algorithm of DMR-LLDA.
|CVB algorithm of DMR-LLDA|
|1||Initialize global variational parameters|
|2||While the number of iterations or is not converged do|
|4||Initialize local variational parameters to constant|
|5||Repeat (the local variational inference of gene )|
|7||Update and by Equations (29)~(30)|
|9||Until is converged:|
|11||, , and by Equations (31)~(34)|
|12||Update by Equation (21)|
Table 2. Zero-order collapsed variational Bayesian (CVB0) algorithm of DMR-LLDA.
|CVB0 algorithm of DMR-LLDA|
|1||Initialize global variational parameters|
|2||While the number of iterations or is not converged do|
|4||Initialize local variational parameters to constant|
|5||Repeat: (the local variational inference of gene )|
|9||Until is converged:|
|12||Update by Equation (21)|
Table 3. The statistics of the S.cerevisiae (S.C) and S.cerevisiae-cellular component (S.C-CC) datasets.
Table 4. The statistics of extra features in the S.C dataset.
| Feature | Variable | Value type |
|---|---|---|
| Isoelectric point | theo_pI | Real number |
| Average hydrophilicity coefficient | hydro | Real number |
| Number of exons | position | Integer |
| Codon adaptation index | Cai | Real number |
| Number of motifs | motifs | Integer |
| ORF number of chromosomes | chromosome | Integer |
ORF: open reading frame.
Table 5. The statistics of the two human datasets.
Table 6. The impact of feature variables on prior parameters.
The words under each topic when mol_wt = 49629.3, theo_pI = 8.96, hydro = 0.1, position = 1, Cai = 0.17, motifs = 2, chromosome = 16:
| Prior parameter | Words under the topic |
|---|---|
| 1.88 | GM IH LH VH LK IG GC IC AK VM FG AM LW IK VG VW FC IG FH GK |
| 1.32 | LM ST SM LT KM LL IM KL LF SL EM LP DM IT LK EF KT LE SK SP |
| 0.79 | GH VC AC KC GC AL GM LH AH AF AM VW AW GW EC KH TH GF AT GT |
| 0.64 | IL VG GE FM YK QW YM VW GP TL KT LW RP LQ IR FH NW NX FS PM |
| 0.23 | TT SV TV ST SW SQ TQ PT PV IV SP TM CT QT AV TP TC SC VV NV |
The words under each topic when mol_wt = 85873.7, theo_pI = 9.74, hydro = 0.664, position = 1, Cai = 0.1, motifs = 2, chromosome = 16:
| Prior parameter | Words under the topic |
|---|---|
| 4.23 | KR TF KE QS LW EW DM YF QT SM LX SF IN QW LR VL VS QG MC QC |
| 3.77 | LM SM LS RC DW EM LE QT LV EW FM QI RM NE DT IE FT AR QC GP |
| 0.23 | QM KR AP EF LF QR HP EC RE RF DS VE EW KF FE LT TL QV QC AR |
| 0.11 | CF PI ED QY GQ HN RI HD HI SN YQ TQ PW RH YL PQ PN SI QE RS |
| 0.09 | SW VF NW AC DF TW EQ LW EH MC DM AW PS GV VQ AQ ID TG RF VE |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).