A Multi-Label Supervised Topic Model Conditioned on Arbitrary Features for Gene Function Prediction

With the continuous accumulation of biological data, more and more machine learning algorithms have been introduced into the field of gene function prediction, which has great significance in decoding the secret of life. Recently, a multi-label supervised topic model named labeled latent Dirichlet allocation (LLDA) has been applied to gene function prediction, and obtained more accurate and explainable predictions than conventional methods. Nonetheless, the LLDA model is only able to construct a bag of amino acid words as a classification feature, and does not support any other features, such as hydrophobicity, which has a profound impact on gene function. To achieve more accurate probabilistic modeling of gene function, we propose a multi-label supervised topic model conditioned on arbitrary features, named Dirichlet multinomial regression LLDA (DMR-LLDA), for introducing multiple types of features into the process of topic modeling. Based on DMR framework, DMR-LLDA applies an exponential a priori construction, previously with weighted features, on the hyper-parameters of gene-topic distribution, so as to reflect the effects of extra features on function probability distribution. In the five-fold cross validation experiment of a yeast datasets, DMR-LLDA outperforms the compared model significantly. All of these experiments demonstrate the effectiveness and potential value of DMR-LLDA for predicting gene function.


Introduction
As the main component of a cell, proteins are the most essential and versatile material of life. Thus, the research on protein functions is of great importance for the development of new drugs, better crops, and the development of synthetic biochemical [1]. In recent years, new protein function prediction methods using machine learning algorithms have proliferated, based on various known information about proteins, and have increasingly become important long-standing research works in the post-genomic era. From the point of molecular biology, a protein is the product of a gene after the process of transcribing, translating, and post-translational modifying. Even though the real function of a gene is to encode one or more proteins executing practical functions, the function of a gene product has usually been regarded as the native function of the gene in gene-level experiments. Therefore, we do not distinguish between gene function and protein function in this paper, which are known collectively as gene function.
The most common computational approach for gene function prediction is to transfer the gene function into some specific features from their sequence or structure similarity, such as BLAST [2]. In addition to sequence similarity, many gene function prediction methods have been exploited in Genes 2019, 10, 57 2 of 20 recent years as the additional information extracted from proteins, such as protein structure [3], protein motif, biophysical properties [4], and integrated heterogeneous data sources [5]. In reference [3], Evangelia et al. extract novel shape features from protein structures in the form of local (per amino acid) distribution of angles and amino acid distances, respectively. Each of the multi-channel feature maps is introduced into a deep convolutional neural network (CNN) for function prediction, and the outputs are fused through support vector machines or a correlation-based k-nearest neighbor classifier. In addition, automatic prediction using protein-protein similarity information can be further supplemented by experimental data [6,7]; this kind of method assumes that the closely related proteins (or genes) share similar functional annotations on the basis of network structure information. Researchers have made the relevant literature reviews of computational methods on gene function prediction in references [8][9][10].
From the point of machine learning algorithms, predicting gene function based on various data sources is a problem of classification in nature. A gene can be viewed as an instance to be classified-various kinds of data sources (such as an amino acid sequence, textual repositories, and motifs) can be organized into a feature space, so that each gene is represented as a set of attribute values; a function (such as a gene ontology (GO) term [11]) is regarded as a label. As a gene is always annotated by several functions, gene function prediction is actually a process of multi-label classification: a multi-label classifier is trained firstly on constructed attribute features and annotated genes, and then is used to predict function annotations for unannotated genes. From the above analysis, we believe that many multi-label classification algorithms have great potential to predict gene function, such as a support vector machine (SVM), neural network, and decision tree. In reference [12], Celine Vens et al. proposed three multi-label classifiers based on a hierarchical decision tree, and the experimental results from 24 datasets show that these classifiers are powerful and effective for gene function prediction.
In addition to traditional machine learning algorithms, a topic model is a kind of probabilistic generative model that has been applied into gene function prediction. In reference [13], Liu et al. introduced a typical multi-label supervised topic model into gene function prediction, which was called labeled latent Dirichlet allocation (LLDA) and is proposed in reference [14] for text mining. This research is the first effort to apply a multi-label supervised topic model into gene function prediction. Compared with traditional multi-label classification models, LLDA can model a function label as a topic, and thus can not only work out the function probability distributions over gene instances effectively, but can also directly provide the word probability distributions over functions. Nonetheless, the direct application of LLDA on a gene function dataset can only utilize protein sequence data by formalizing the sequences into a bag of words (BoW), and then the constructed bag of words is used for topic modeling. In other words, due to the restrictions of BoW construction in topic modeling, the feature space was constructed on sequence data rather than multiple biological data. However, we can see from the above paragraph that there are various protein features, such as hydrophobicity and the polarity of amino acids, which have a profound impact on gene structure and function. Apparently, the introduction of multiple kinds of gene features in a multi-label supervised topic model can improve the accuracy of gene function prediction.
Inspired by the application of a multi-label topic model in gene function prediction and a topic model conditioned on arbitrary features named the Dirichlet multinomial regression latent Dirichlet allocation (DMR-LDA) [15], we propose a DMR-LLDA model, which introduces a DMR framework into an LLDA model. Firstly, we describe DMR-LLDA for gene function prediction problem formulation. Then the generative process and the inference algorithm of DMR-LLDA are described. This model is fully compatible with both discrete and continuous features, whose inference is relatively simple. In a five-fold cross validation experiment on verified gene function prediction, DMR-LLDA significantly outperformed LLDA. In addition, the impact of feature variables on prior parameters and the comparison between two kinds of inference algorithms are shown in experimental data. All these experimental results demonstrate the effectiveness and potential value of DMR-LLDA for predicting gene function.

Related Definitions and Notations
In this paper, the topic modeling method of gene function prediction reported by reference [13] is utilized. We consider each gene to be a document [16], and GO terms (topics) are shared by a document collection. Meanwhile, we view the extra gene features, except for the bag of amino acid words, as the metadata, such as authors and dates of documents. Therefore, the introduction of extra gene features into topic modeling is similar to introducing metadata into the topic modeling of documents, and the type of metadata may be discrete or continuous. To better understand the practical application of our method, the relationship of text topic modeling and gene function predicting is illustrated by Figure 1.

Related Definitions and Notations
In this paper, the topic modeling method of gene function prediction reported by reference [13] is utilized. We consider each gene to be a document [16], and GO terms (topics) are shared by a document collection. Meanwhile, we view the extra gene features, except for the bag of amino acid words, as the metadata, such as authors and dates of documents. Therefore, the introduction of extra gene features into topic modeling is similar to introducing metadata into the topic modeling of documents, and the type of metadata may be discrete or continuous. To better understand the practical application of our method, the relationship of text topic modeling and gene function predicting is illustrated by Figure 1. In Figure 1, the right part describes the topic modeling concept of text data, and the left part describes the related concept of gene function data. For all topic models, there are three key concepts: "documents", "words", and "topics". In addition, the supervised topic model introduces "labels" for each document, and the proposed DMR-LLDA model introduces "features" for each document. Therefore, these concepts can now be reformulated with more detail, as follows.

Documents
For text data (right part of Figure 1), document collection is composed of several documents numbered D1 to Dn. In the other side (left part of Figure 1), the gene dataset is composed of several protein sequences, numbered G1 to Gn. Therefore, a document is equivalent to a gene in our model.
We suppose that there are D genes in a gene set, which compose the gene space , and the gene sample set X including D genes can be represented as , and d X denotes a gene sample. In Figure 1, the right part describes the topic modeling concept of text data, and the left part describes the related concept of gene function data. For all topic models, there are three key concepts: "documents", "words", and "topics". In addition, the supervised topic model introduces "labels" for each document, and the proposed DMR-LLDA model introduces "features" for each document. Therefore, these concepts can now be reformulated with more detail, as follows.

Documents
For text data (right part of Figure 1), document collection is composed of several documents numbered D1 to Dn. In the other side (left part of Figure 1), the gene dataset is composed of several protein sequences, numbered G1 to Gn. Therefore, a document is equivalent to a gene in our model. We suppose that there are D genes in a gene set, which compose the gene space D = {1, . . . , D}, and the gene sample set X including D genes can be represented as X = {X d } D d=1 , and X d denotes a gene sample.

Labels
For text data (right part of Figure 1), each document is labeled by one or more tags, such as "programming" and "language". On the other side (left part of Figure 1), each gene is annotated by several GO terms, such as "GO:0003012" and "GO:0003547". Therefore, a document tag is equivalent to a GO term in our model, and all of them are called "labels". In this paper, the gene function label space is expressed as L = {1, . . . , L}. Meanwhile, the observed labels of each gene are described by a sparse binary vector Λ d = {Λ dl } L l=1 , which is defined as follows: where, L d represents the label sub-space of gene X d : L d ⊆ L.

Words
For text data (right part of Figure 1), word terms are the main component of a document, such as the words "table" and "database". On the other side (left part of Figure 1), we consider a protein sequence to be a text string, which is defined by a fixed 20 amino acid alphabet (G,A,V,L,I,F,P, Y,S,C,M,N,Q,T,D,E,K,R,H,W). Correspondingly, amino acid blocks are the main components of a protein sequence, which is composed by two or more amino acid alphabets, such as "MS" and "TS". Therefore, a word term is equivalent to an amino acid block in our model, and all of them are called "words". Meanwhile, all of the words constitute a vocabulary. In this paper, the amino acid words space is represented as W = {1, . . . , W}. For a gene X d , X d = {x dn } N d n=1 denotes that the dth gene is composed by N d observed word samples, and x dn is one of word samples.

Topics
For text data (right part of Figure 1) and gene function data (left part of Figure 1), a "topic" is viewed as a probability distribution over a fixed vocabulary. Taking the text data as an example, the probabilities of the word "table" over "topic 1" are 0.05. For the gene function data, the probabilities of amino acid block MS over "topic 1" are 0.21. Obviously, topics are latent and needed to be inferred by topic modeling. In this paper, the global topic space includes T topics, which is represented as T = {1, . . . , T}. According to the definition of an LLDA model, there is a one-to-one correspondence between label and topic-therefore, L T ( represents equivalent relationship between two space), T =|T|= L =|L|.

Features
For text data (right part of Figure 1), the metadata of a document can be viewed as document features, such as the tags "author" and "publish year of document". On the other side (left part of Figure 1), each gene has several extra features, except for its sequence string, such as molecular weight and hydrophobicity. Therefore, the metadata of a document tag is equivalent to an extra feature of the gene in our model, and all of them are called "features". In this paper, the feature space composed by gene features is expressed as F = {1, . . . , F}. Therefore, there is a set of observed features for gene X d , which can be represented as a feature vector:

Others
In addition to the above five concepts, there are three other concepts illustrated in Figure 1. Firstly, the BoW, which is a word-document matrix and the input of the topic model. In an instance in the  Figure 1, the word "table" appears two times in document D1. Likewise, the word "MS" appears one time in gene G1. In other words, the element of the BoW represents the times of each word in each document. Meanwhile, there are two probability matrices that appear in Figure 1: one is the topic (label)-word probability matrix, and the other is the document (gene)-topic probability matrix. All of them are represented as parameter vectors for each topic or gene in the topic model.
A topic corresponds to a multinomial distribution of word space W, whose parameter vector is θ t = {θ tw } W w=1 , and θ tw is the probability of word w under topic t; a gene X d corresponds to a multinomial distribution of the topics space T, whose parameter vector is π d = {π dt } T t=1 , and π dt is the topic weight of topic t under gene X d . Finally, we utilize a feature parameter vector β t = β t f F f =1 to represent the relationship between features (f ) and topics (t) in making features influence the choice of topic.
Note that the shared parameters of a whole gene set, such as topic-word parameter θ, are called "global parameters" in this paper. Correspondingly, the parameter of one gene is called a local parameter, such as gene-topic (label) parameter π.

Overview of the Dirichlet-Multinomial Regression Latent Dirichlet Allocation Topic Modeling Process
Based on the above notation, we can provide the description of a gene function dataset as follows.
The gene X d is composed of N d , which are observed samples, and the word index of each sample x dn comes from the vocabulary w dn ∈ W d . Thus, the gene can also be represented as where W d is the local word subspace of X d . In addition, the latent variables of gene X d is its topic In this case, we suppose that each word w ∈ W d of each gene X d shares the same feature vector: Then, the topic modeling process of our model can be interpreted as follows: for the training set, learning the unknown parameter θ t , π d , and β t from the observed variables W d , T d , and y d ; for the testing set, predicting T d and π d from known parameters θ t and β t , and the observed variables W d and y d . Obviously, θ t and β t are global parameters, which are shared by the whole dataset. The above two steps are also called model training and predicting, and are realized by learning and inference algorithms, such as Gibbs sampling [17] and variable inference [18].
Moreover, there are two steps before model training and predicting: BoW construction and model description. Since we constructed the BoW of the gene in exactly the same way as reference [13], this step will be not repeated in this paper. For model description, there are usually two ways to describe a probabilistic graphical model, including the generative process and the graphic model, which are discussed in the next sections. The overview of our topic modeling process is depicted in Figure 2. one is the topic (label)-word probability matrix, and the other is the document (gene)-topic probability matrix. All of them are represented as parameter vectors for each topic or gene in the topic model. A topic corresponds to a multinomial distribution of word space  , whose parameter vector is , and tw  is the probability of word w under topic t ; a gene d X corresponds to a multinomial distribution of the topics space , whose parameter vector is , and dt  is the topic weight of topic t under gene d X . Finally, we utilize a feature parameter vector to represent the relationship between features (f) and topics (t) in making features influence the choice of topic. Note that the shared parameters of a whole gene set, such as topic-word parameter θ , are called "global parameters" in this paper. Correspondingly, the parameter of one gene is called a local parameter, such as gene-topic (label) parameter π .

Overview of the Dirichlet-Multinomial Regression Latent Dirichlet Allocation Topic Modeling Process
Based on the above notation, we can provide the description of a gene function dataset as follows.
Then, the topic modeling process of our model can be interpreted as follows: for the training set, learning the unknown parameter t θ , d π , and t β from the observed variables d W , d T , and d y ; for the testing set, predicting d T and d π from known parameters t θ and t β , and the observed variables d W and d y . Obviously, t θ and t β are global parameters, which are shared by the whole dataset. The above two steps are also called model training and predicting, and are realized by learning and inference algorithms, such as Gibbs sampling [17] and variable inference [18].
Moreover, there are two steps before model training and predicting: BoW construction and model description. Since we constructed the BoW of the gene in exactly the same way as reference [13], this step will be not repeated in this paper. For model description, there are usually two ways to describe a probabilistic graphical model, including the generative process and the graphic model, which are discussed in the next sections. The overview of our topic modeling process is depicted in Figure 2.

Description of the Dirichlet-Multinomial Regression Latent Dirichlet Allocation Model
This section provides the description of DMR-LLDA, including its generative process and graphic model. It is worth noting that our DMR-LLDA introduces the DMR framework for gene features based on the LLDA model, so this paper emphasizes the DMR part rather than the classic LLDA.
According to DMR framework, each word sample x dn of gene X d is a "individual", and all of the is composed by N dw (the number of word w appeared in gene X d ) samples of the w-th group, and corresponds to a feature vector y dw , which influences the latent topic t ∈ T choice of all samples.
We suppose that In Equation (2), represents feature parameters that correspond to topic t. Likewise, each bag of words w of gene X d shares the same clustering random variable: where π dt is the selecting probability of the n-th word sample of gene X d , which chooses topic t and maximizes the utility selection U dnt . In addition, where α dt is the hyper-parameter of π dt : The description of DMR-LLDA from the global and local perspective is shown below. From the global perspective, each topic t ∈ T can be represented as a multinomial distribution over vocabulary W, whose parameter is expressed as vector θ t = {θ tw } W w=1 , and we suppose that θ t obeys Dirichlet conjugate prior distribution. Each topic t ∈ T corresponds to a feature weight parameter vector β t , which obeys the normal distribution of parameter (µ, σ 2 ).
From the local perspective, each gene X d is composed by N d observed samples, which corresponds to local word number At the same time, the range of topics on feature weight parameter vector β t is limited to t ∈ T d . In addition, which is the dot-product of feature vector y d , corresponding to feature subspace F d and its weighted parameter vector β t .
Above all, the Dirichlet prior hyper-parameter α d of π d can be expressed as where α dt is computed by Equation (5). The local topic weight π d can be also represented as Given the above, the generative process of DMR-LLDA can be described as follows. The corresponding graphical model is shown in Figure 3.
For each global topic t ∈ T = {1, . . . , T}, we can (a) Generate a feature weighted parameter vector as the Dirichlet prior hyper-parameter of the topic weight (c) We can generate local weight topic vector of topic t from a Dirichlet distribution: (d) For each word sample x dn , we can i. Generate topic number t dn of x dn from T dimensions' multinomial distribution of parameter π d : ii. Generate word number w dn of x dn from W dimensions' multinomial distribution of parameter θ t dn : As we can see from Figure 3, α d is computed by feature vector y d and its weighted parameter. Therefore, α d is a parameter rather than a random variable in the LLDA.
Given the above, the generative process of DMR-LLDA can be described as follows. The corresponding graphical model is shown in Figure 3.
For each global topic For each gene d X , This means that (a) We suppose that (c) We can generate local weight topic vector of topic t from a Dirichlet distribution: As we can see from Figure 3, d α is computed by feature vector d y and its weighted parameter. Therefore, d α is a parameter rather than a random variable in the LLDA. In our DMR-LLDA model, the unknown parameters to be estimated are the global feature parameter β, the global topic-word multinomial distribution parameter θ, and the local topic weight π. The hidden variable to be estimated is T . The known data are the observed word samples W and binary vector Λ. The joint distribution of (β, π, θ, T , W ) is shown in Equation (15): Above all, the proposed method utilizes extra features as the prior knowledge of the related distribution, which is able to gain more reliable prior distribution for the LLDA; then a more precise estimation of posterior distributions is obtained.

Inference Algorithm of Dirichlet-Multinomial Regression Latent Dirichlet Allocation
The core learning task of DMR-LLDA is to compute the parameters (π, θ, β) and posterior distribution p(π, θ, β, T |W ). The posterior estimation represents the estimating value of the parameter under the training set. The prediction process of DMR-LLDA is that on the basis of the estimated three parameters and a hidden variable, we update the unknown local parameter π and hidden variable of the test gene by fixing the learned global parameters β and θ; then, we get the corresponding relationship between the label and the topic. The Gibbs sampling algorithm and the variable Bayesian algorithm are two essentially approximate inference algorithms of a probabilistic graphic model, and the purpose of them is universal. In order to compare their impact on the model performance of difference inference algorithms, we designed a collapsed Gibbs sampling algorithm (CGS), a collapsed variable Bayesian algorithm (CVB), and a zero-order variational Bayesian algorithm (CVB0) for DMR-LLDA, with detail as follows.

The Collapsed Construction of Dirichlet-Multinomial Regression Latent Dirichlet Allocation
First of all, after the integration of model parameters (π, θ) in a joint distribution, a semi-collapsed (β, T , W ) joint distribution is obtained: The predictive probability distribution for the topic assignment of sample x dn is In Equation (17), α dt is optimized by local observed feature vector , whose updating equation is Equation (5). To simplify the updating equation, we first suppose that log δ −1 d = y d f de f ault β t f de f ault , and then an item of hidden global feature parameter β t f de f ault is added for global feature parameter β t = β t f F f =1 , which corresponds to a "fake" observed feature y dF de f ault = 1. Thus, the updating equation of α dt is

The Optimization of the Feature Parameters of Dirichlet-Multinomial Regression Latent Dirichlet Allocation
For Gibbs sampling or variable Bayesian, we need to update the global feature parameterβ t in the inference process. We adopted the method of gradient descent for optimizingβ t .
In Equation (16), theβ-related section is Based on the logarithm of Equation (20), we take the derivative with respect to global feature parameter β t f and adjust it to zero. The updated equation of β t f is Finally, α new dt is updated by Equation (18).

The Collapsed Gibbs Sampling Algorithm of the Dirichlet-Multinomial Regression Latent Dirichlet Allocation
To determine the initial state of the Markov chain, we initiate the hidden topic number t dn of each sample x dn first; then, we utilize the predictive probability of hidden variable t dn from Equation (17) as the state transition probability of the Markov chain. In the process of Gibbs sampling, the topic number t dn of each sample x dn is updated, and the hyper-parameter α d = {α dt Λ dt } T t=1 is also updated by Equation (18). Finally, the global feature parameter β t f is updated by Equation (21).
After several iterations in the burn-in time, the Markov chain is attracted to objective distribution, and then the posterior distribution p β, T W, µ, σ 2 , α, λ is estimated. The posterior estimation of the local topic weight π d = {π dt Λ dt } T t=1 and topic-word multinomial distribution parameter iŝ

The Collapsed Variable Bayesian Inference Algorithm of the Dirichlet-Multinomial Regression Latent Dirichlet Allocation
The whole variational objective function before being collapsed is After margining the model parameters (π, θ), the objective function is where Const q(t dn ) represents the unrelated item with variational distribution q(t dn ). There are two kinds of construction below: In Equation (26), the updating equation of optimal variational parameter η * dnt by a CVB algorithm is Each expectation of the above equation is In Equation (27), the updating equation of the optimal variational parameter η * dnt by CVB0 algorithm is The plenitude statistic of samples in Equation (35) The \dn in the above equation can be adapted to \dw, because the bag of words dw not only shares a similar word number w dn = w, but also shares the same topic number t dn = t. Then, optimal variational distribution η * dnt can be adapted to η * dwt .
The inference equation difference between CVB and CVB0 shows that CVB only retains the zero-order information of the Taylor expansion; however, CVB0 is the re-collapse of a hidden variable space based on Jensen inequality. Therefore, CVB0 is much more precise than CVB. The corresponding algorithm of CVB and CVB0 are shown in Tables 1 and 2 respectively. While the number of iterations r < r max or F is not converged do 3 For d = 1 : D do 4 Initialize local variational parameters to constant 5 Repeat (the local variational inference of gene d) Update β (r) t f by Equation (21)  13 End while Table 2. Zero-order variational Bayesian (CVB0) algorithm of DMR-LLDA.

Initialize global variational parameters 2
While the number of iterations r < r max or F is not converged do 3 For d ∈ D do 4 Initialize local variational parameters to constant 5 Repeat: (the local variational inference of gene d) t f by Equation (21)  13 End while

Materials and Results
This section provides a concise and precise description of the experimental results, their interpretation, and the experimental conclusions that can be drawn.

Dataset
In this paper, the validity and accuracy of proposed models are tested on the S.cerevisiae (S.C) dataset, which is introduced in reference [12]. This dataset includes several aspects of the yeast genome, such as sequence statistics, phenotype, expression, secondary structure, and homology. Meanwhile, two kinds of function annotation standard, including FunCat and GO, are used to annotate gene function. Due the universality of GO, the dataset depends on the GO that is adopted in our experiments. As described in Section 2.1, the construction of the BoW is based on amino acid composition, so we mainly use one of datasets that depends on the sequence statistics. In addition, we construct a dataset named S.C-CC from S.C, which only includes the GO terms belonging to the cellular component (CC). Therefore, there are fewer GO terms in the S.C-CC dataset when compared with the S.C dataset, and both of them are used in our experiments for investigating the influence of different label numbers on prediction performance. The statistics of the S.C and S.C-CC dataset is shown in Table 3. In this set, F denotes the number of GO terms, D denotes the number of genes, and W denotes the size of the vocabulary. As shown in Table 3, there are 1692 genes and 4133 function labels in the S.C dataset; in the S.C-CC dataset, there are 1692 genes and 547 function labels. Due to the large number of GO terms in the gene function dataset, we utilized a Boolean matrix decomposition (BMD) method to reduce the dimensionality of the function labels. BMD is a kind of label space dimension reduction (LSDR) method [19], which addresses the multi-label classification problem with many labels. LSDR approaches use a compression step to transform the original high dimension label space into a lower dimension label space, and then multi-label classifiers are trained on a dataset with fewer labels, which can reduce the computation burden of the classifier. The existing studies about LSDR show that LSDR approaches are useful for optimizing the running time and accuracy of multi-label classification. In our BMD process, original label matrix Y ∈ {0, 1} D×F (D denotes the number of genes, and F denotes the number of features) is decomposed into the product of two matrices, C ∈ {0, 1} D×L (L denotes the number of labels) and B ∈ {0, 1} L×F , where Y = C • B (• denotes Boolean product) is satisfied. We also called it exact BMD and adopted this algorithm, which is proposed in reference [20]. Compared with other LSDR algorithms, an exact BMD can retain the interpretability of low dimension label space and restore the low dimension-predicted label matrix to the original label matrix by matrix B. At last, the number of function labels is reduced into a smaller dimension, and L denotes the number of GO terms after label space dimension reducing. Then DMR-LLDA actually needs to process 1358 GO terms of the S.C dataset and 319 GO terms of the S.C-CC dataset. Nonetheless, the lower dimensional label space can be recovered by a Boolean product after predicting, so we still get the whole function labels sets in the prediction results.
DMR-LLDA's advantage here over LLDA lies in the introduction of extra features. In the S.C and S.C-CC dataset, there are six extra gene features for each gene, including the molecular weight of the gene, the isoelectric point, the average coefficients of hydrophilic, the number of exons, the adaptability index of the codon, the number of motifs, and the open reading frame (ORF) number of chromosomes. The statistics of extra features are shown in Table 4. As the max word length of the S.C dataset is two amino acid alphabets (G,A,V,L,I,F,P, Y,S,C,M,N,Q,T,D,E,K,R,H,W), a human dataset constructed by ourselves is adopted to evaluate the impact on the topic model performance of word length. This human dataset is constructed in a similar way as in reference [13]. In addition, we also constructed two human datasets for different word lengths, where the max word length of the Human1 dataset is two amino acid alphabets, and that of the Human2 dataset is three amino acid alphabets. For the Human2 dataset, the original number of words is 8400, but we filtered several words that have a high frequency. Then, the statistic of the S.C and S.C-CC dataset is shown in Table 5.

Parameter Settings and Evaluation Criterias
The DMR-LLDA learning framework involves four different parameters: µ, σ 2 , α, and λ. The α and λ are the parameters of the two-Dirichlet distribution, where the larger the value of λ, the more balanced the probability of a word in a topic. The setting of the λ value has been discussed in reference [13]. Nonetheless, the value of α is optimized by a protein feature, so its initial value does not have a big effect on model performance. According to the experience, we set α = 50/T as the initial value, and set λ = 200/W, with T = L. In addition, µ and σ 2 are respectively the mean and variance of normal distribution, obeyed by feature weighted parameter β, and we set µ = 0, σ 2 = 1.
In the Gibbs sampling process of model training, we set the number of the Markov chain as 1 and the maximum number of iterations is 2000 times, where the number of iterations of burn-in time is set to 1000. We record the state space at intervals of 50 times on the converged Markov chain, and 20 times per record is conducted. In the process of model predicting, we set the number of iterations as 1000 times. After 500 iterations for the burn-in time, we record the state space at intervals of 50 times. In the variable Bayesian inferring process of model training, we initialize the global variable parameter µ tw through random number s and hyper-parameter λ w : µ tw = λ w + (s * λ w )/10; in each local variable inference, we set the converged threshold as 0.00001, and the maximum number of times of the local variable inference as 100. The number of global scanning iterations is 1000.
The five-fold cross validation is conducted to measure and compare the performance of DMR-LLDA and the comparative algorithms. Five representative multi-label learning evaluation criteria are used in this paper, including hamming loss (HL), average precision (AP), one error, and micro-averaged and macro-averaged F1 scores (Micro-F1 and Macro-F1). In addition, three kinds of areas under a precision-recall curve are also used, including AUPRC, AU(PRC), and AUPRCw, which is proposed in reference [12]. Finally, we repeat the random partition and evaluation in five independent rounds, and report the average results.

The Impact of Word Length on Model Performance
Firstly, the performance comparison of the LLDA model between the Human1 and Human2 datasets is shown in Figure 4. As shown in Figure 4, we find that the value of AUPRC and AUPRCw in Human1 is higher than that in Human2; the value of the AP on Human1 is lower than that of Human2; and the value of one error, HL, and AU(PRC) is almost equal to that of Human1 and Human2. These results show that the classification performance of the LLDA on Human1 and Human2 is almost the same, which reveals that the larger word space might not obtain a better classifying performance.
Moreover, related studies suggested that a word length of more than four amino acid alphabets would not improve the classification accuracy, and would only increase the complexity of computation [21]. Therefore, in the following experiments, we only adopt the S.C and S.C-CC datasets whose word length is two amino acid alphabets. Moreover, related studies suggested that a word length of more than four amino acid alphabets would not improve the classification accuracy, and would only increase the complexity of computation [21]. Therefore, in the following experiments, we only adopt the S.C and S.C-CC datasets whose word length is two amino acid alphabets.

Gene Function Prediction with Cross Validation
In addition to LLDA, we also adopted three widely adopted methods: multi-label k-nearest neighbor (MLKNN) [22], back propagation for multi-label learning (BPMLL) [23], and support vector machines (SVMs) for performance comparison. MLKNN and BPMLL are two representative multilabel classifiers, and can be performed by an open source tool called Mulan [24]. SVMs adopt a "oneversus-all" scheme, which trains each label by a binary SVM independently and is implemented using the LibLinear software package [25]. These five models are trained and used to predict with the S.C and S.C-CC datasets. Figure  Conversely, for HL and one error, the smaller the value, the better the performance. The red asterisk of Figure 5 represents the best results in each dataset. It is worth noting that the experimental results of this section are obtained by a CGS inference algorithm.
As shown in Figure 5, DMR-LLDA can achieve better results in almost all evaluation criteria for the two datasets. The concrete analysis is introduced as follows.
For the S.C dataset, the DMR-LLDA achieves the best performance for AP,

Gene Function Prediction with Cross Validation
In addition to LLDA, we also adopted three widely adopted methods: multi-label k-nearest neighbor (MLKNN) [22], back propagation for multi-label learning (BPMLL) [23], and support vector machines (SVMs) for performance comparison. MLKNN and BPMLL are two representative multi-label classifiers, and can be performed by an open source tool called Mulan [24]. SVMs adopt a "one-versus-all" scheme, which trains each label by a binary SVM independently and is implemented using the LibLinear software package [25]. These five models are trained and used to predict with the S.C and S.C-CC datasets. Figure 5 shows the HL, AP, one error, Micro-F1, Macro-F1, AU(PRC), AUPRC, and AUPRCw values of all models in the two datasets, respectively. For AP, Micro-F1, Macro-F1, AU(PRC), AUPRC, and AUPRCw, the larger the value, the better the performance. Conversely, for HL and one error, the smaller the value, the better the performance. The red asterisk of Figure 5 represents the best results in each dataset. It is worth noting that the experimental results of this section are obtained by a CGS inference algorithm.
As shown in Figure 5, DMR-LLDA can achieve better results in almost all evaluation criteria for the two datasets. The concrete analysis is introduced as follows.
For the S.C dataset, the DMR-LLDA achieves the best performance for AP, AU(PRC), AUPRC, Figure 5. The comparisons between DMR-LLDA, LLDA, back propagation for multi-label learning (BPMLL), support vector machines (SVMs), and multi-label k-nearest neighbor (MLKNN) in two datasets. The red asterisk represents the best results in each dataset.

The Impact on Prior Parameters of Feature Variables
In the DMR-LLDA model, the introduction of gene features is realized by feature weight parameter β . Then the operation on topic parameter vector t β and feature vector d y are reflected in Dirichlet hyper parameter d α . Table 6 shows the impact of different feature values on prior parameters d α For both of the datasets, we can find that the improvements on AUPRC are more significant than AU(PRC) and AUPRCw, which indicates that the DMR-LLDA has a stronger effect on improving the overall accuracy of gene function prediction without respect to label weights. In the comparisons of the S.C and S.C-CC datasets, we find that the values of AP, AU(PRC), AUPRC, and AUPRCw in the S.C dataset is lower than in the S.C-CC dataset, and the value of one error and HL in the S.C is higher than in the S.C-CC dataset. This is due to the same word space and different label number between these two datasets. The fewer labels of the S.C-CC dataset can promote a higher classifying performance.
Above all, these results indicate that the DMR-LLDA can further improve the accuracy of gene function prediction by introducing the DMR framework into the LLDA model, which optimizes the hyper-parameters of the topic weight. Meanwhile, the DMR-LLDA has an apparent advantage in improving the overall prediction accuracy.

The Impact on Prior Parameters of Feature Variables
In the DMR-LLDA model, the introduction of gene features is realized by feature weight parameter β. Then the operation on topic parameter vector β t and feature vector y d are reflected in Dirichlet hyper parameter α d . Table 6 shows the impact of different feature values on prior parameters α d . For the LLDA, the hyper-parameter value is set as a fixed value. However, Table 6 shows that only the different values on mol_wt, theo_pI, hydro, and Cai make a significant difference of hyper-parameter value in the DMR-LLDA, which is also the main way for gene features to impact label allocation.

The Comparison Results of Inference Algorithms
We designed three kinds of inference algorithm for the DMR-LLDA, including CGS, CVB, and CVB0. This section compares CGS with CVB0 in the S.C dataset. The experimental results are shown in Figure 6. As shown in Figure 6, the overall performance of CVB0 is better than the performance of CGS. Concrete analysis is represented as follows.
For the S.C dataset, CVB0 achieves the best results in AP, AUPRC, AU(PRC), and AUPRCw, and achieves almost similar results in HL. However, CVB0 has a worse value in one error.
For the S.C-CC dataset, CVB0 achieves the best results in AP, one error, AUPRC, AU(PRC), and AUPRCw, and achieves almost similar results in HL. The above results demonstrate the validity of the designed inference algorithms for the DMR-LLDA. Meanwhile, the experimental results indicate that the CVB0 inference algorithm can obtain more precise prediction results by the re-collapse of hidden variable space based on Jensen inequality. Above all, due to the lack of prior knowledge, the prior distributions of the Bayesian model are usually set for convenience. Meanwhile, the parameters of prior distribution are also set as a fixed value based on experience, which makes the inaccurate estimation of posterior distributions. In our DMR-LLDA model, the gene feature information is introduced into the LLDA as the prior knowledge by the DMR framework. The hyper-parameter of the prior distribution is updated in the inference process rather than by a fixed constant, which can improve the estimation precision of posterior distributions, so as to improve the accuracy of gene function prediction.

Conclusions
In this paper, we introduce multiple types of features into gene function prediction based on a multi-label surprised topic model, and propose a multi-label supervised topic model conditioned on arbitrary features named the DMR-LLDA. By applying an exponential a priori constructed previously with weighted features on the hyper-parameters of gene-topic (or label) distribution, this model can utilize the observed features of each gene in multi-label topic modeling. Furthermore, three learning algorithms are designed for this model, including CGS, CVB inference, and CVB0 inference. The predictive performance of this model is measured by the AP, one error, Hamming loss,  Above all, due to the lack of prior knowledge, the prior distributions of the Bayesian model are usually set for convenience. Meanwhile, the parameters of prior distribution are also set as a fixed value based on experience, which makes the inaccurate estimation of posterior distributions. In our DMR-LLDA model, the gene feature information is introduced into the LLDA as the prior knowledge by the DMR framework. The hyper-parameter of the prior distribution is updated in the inference process rather than by a fixed constant, which can improve the estimation precision of posterior distributions, so as to improve the accuracy of gene function prediction.

Conclusions
In this paper, we introduce multiple types of features into gene function prediction based on a multi-label surprised topic model, and propose a multi-label supervised topic model conditioned on arbitrary features named the DMR-LLDA. By applying an exponential a priori constructed previously with weighted features on the hyper-parameters of gene-topic (or label) distribution, this model can utilize the observed features of each gene in multi-label topic modeling. Furthermore, three learning algorithms are designed for this model, including CGS, CVB inference, and CVB0 inference. The predictive performance of this model is measured by the AP, one error, Hamming loss, AUPRC, AU(PRC), AUPRCw, Micro-F1, and Macro-F1. Experiments on a standard dataset show that the DMR-LLDA is superior to the LLDA, MLKNN, BPMLL, and SVM models. Meanwhile, experimental results show that the DMR-LLDA can get a much more accurate estimation of posterior distribution, due to using the gene feature information in addition to the amino acid sequence.