This section provides a concise and precise description of the experimental results, their interpretation, and the experimental conclusions that can be drawn.
3.1. Dataset
In this paper, the validity and accuracy of the proposed models are tested on the S. cerevisiae (S.C) dataset introduced in reference [12]. This dataset covers several aspects of the yeast genome, such as sequence statistics, phenotype, expression, secondary structure, and homology, and two kinds of function annotation standard, FunCat and GO, are used to annotate gene function. Owing to the universality of GO, the GO-annotated version of the dataset is adopted in our experiments. As described in Section 2.1, the construction of the BoW is based on amino acid composition, so we mainly use the part of the dataset that is based on sequence statistics. In addition, we construct a dataset named S.C-CC from S.C, which only includes the GO terms belonging to the cellular component (CC) ontology. The S.C-CC dataset therefore contains fewer GO terms than the S.C dataset, and both datasets are used in our experiments to investigate the influence of the number of labels on prediction performance. The statistics of the S.C and S.C-CC datasets are shown in
Table 3, where L denotes the number of GO terms, N denotes the number of genes, and V denotes the size of the vocabulary.
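For concreteness, the following is a minimal sketch (not the authors' code) of how such a CC-restricted label set can be derived from the full GO annotation; the label matrix Y, the list go_terms, and the mapping go_namespace are assumed inputs rather than names taken from the paper.

```python
import numpy as np

def restrict_to_cc(Y, go_terms, go_namespace):
    """Keep only the GO-term columns that belong to the cellular component ontology.

    Y: (n_genes, n_terms) binary gene-by-GO-term label matrix.
    go_terms: list of GO IDs, one per column of Y.
    go_namespace: dict mapping each GO ID to its namespace
                  ('biological_process', 'molecular_function', 'cellular_component').
    """
    keep = [j for j, term in enumerate(go_terms)
            if go_namespace.get(term) == "cellular_component"]
    return Y[:, keep], [go_terms[j] for j in keep]
```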
As shown in
Table 3, there are 1692 genes and 4133 function labels in the S.C dataset; in the S.C-CC dataset, there are 1692 genes and 547 function labels. Due to the large number of GO terms in the gene function dataset, we utilized a Boolean matrix decomposition (BMD) method to reduce the dimensionality of the function labels. BMD is a kind of label space dimension reduction (LSDR) method [
19], which addresses the multi-label classification problem when the number of labels is large. LSDR approaches use a compression step to transform the original high-dimensional label space into a lower-dimensional one, and multi-label classifiers are then trained on a dataset with fewer labels, which reduces the computational burden of the classifier. Existing studies on LSDR show that such approaches are useful for optimizing both the running time and the accuracy of multi-label classification. In our BMD process, the original label matrix Y (of size N × L, where N denotes the number of genes and L denotes the number of function labels) is decomposed into the product of two matrices, A (of size N × K, where K denotes the number of labels after reduction) and B (of size K × L), such that Y = A ∘ B is satisfied, where ∘ denotes the Boolean product. This is known as exact BMD, and we adopt the algorithm proposed in reference [20]. Compared with other LSDR algorithms, exact BMD retains the interpretability of the low-dimensional label space and can restore the low-dimensional predicted label matrix to the original label matrix through the matrix B. In this way, the number of function labels is reduced to the smaller dimension K, the number of GO terms after label space dimension reduction. DMR-LLDA thus actually needs to process only 1358 GO terms for the S.C dataset and 319 GO terms for the S.C-CC dataset. Because the original label space can be recovered from the low-dimensional predictions by a Boolean product, we still obtain the complete sets of function labels in the prediction results.
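The recovery step relies only on the Boolean matrix product. A minimal sketch under the notation used above is given below; the exact BMD algorithm of reference [20] itself is not reproduced here, and the variable names are ours.

```python
import numpy as np

def boolean_product(A, B):
    """Boolean matrix product: (A o B)[i, j] = OR over k of (A[i, k] AND B[k, j])."""
    # Integer matrix multiplication counts the matching k's; any positive count means "true".
    return (A.astype(int) @ B.astype(int) > 0).astype(int)

# After the classifier predicts the reduced (N x K) label matrix A_pred for test genes,
# the full (N x L) GO-term predictions are recovered as Y_pred = boolean_product(A_pred, B).
```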
DMR-LLDA’s advantage over LLDA here lies in the introduction of extra features. In the S.C and S.C-CC datasets, there are six extra features for each gene, including the molecular weight, the isoelectric point, the average hydrophilicity coefficient, the number of exons, the codon adaptation index, the number of motifs, and the chromosome number of the open reading frame (ORF). The statistics of the extra features are shown in
Table 4.
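To illustrate how these extra features enter the model, the sketch below follows the standard Dirichlet-multinomial regression formulation, in which each gene's Dirichlet hyper-parameters are an exponentiated linear function of its features; the names x and lam are illustrative and not the paper's notation.

```python
import numpy as np

def dmr_alpha(x, lam):
    """Per-gene Dirichlet hyper-parameters alpha[d, k] = exp(x[d] . lam[k]).

    x:   (n_genes, n_features) extra gene features (e.g., the features in Table 4,
         typically with a constant column appended for an intercept term).
    lam: (n_topics, n_features) feature-weight vectors, one per topic/label,
         each drawn a priori from a normal distribution N(mu, sigma^2).
    Returns the (n_genes, n_topics) matrix of Dirichlet hyper-parameters.
    """
    return np.exp(x @ lam.T)
```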
As the maximum word length of the S.C dataset is two amino acid letters (drawn from the alphabet G, A, V, L, I, F, P, Y, S, C, M, N, Q, T, D, E, K, R, H, W), human datasets constructed by ourselves are adopted to evaluate the impact of word length on topic model performance. These human datasets are constructed in a similar way to reference [13]. Specifically, we constructed two human datasets with different word lengths: the maximum word length of the Human1 dataset is two amino acid letters, and that of the Human2 dataset is three amino acid letters. For the Human2 dataset, the original number of words is 8400, but we filtered out several words with a high frequency. The statistics of these datasets are shown in
Table 5.
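A minimal sketch of the word-counting step is given below, assuming (as in Section 2.1 and reference [13]) that a "word" is a short substring of the amino acid sequence; the function name and the choice to count all lengths up to the maximum are ours.

```python
from collections import Counter

def amino_acid_bow(sequence, max_word_len=2):
    """Count all amino-acid words of length 1..max_word_len in one protein sequence."""
    counts = Counter()
    for k in range(1, max_word_len + 1):
        for i in range(len(sequence) - k + 1):
            counts[sequence[i:i + k]] += 1
    return counts

# Example: amino_acid_bow("MKTAYIAK", max_word_len=2) counts single residues and residue pairs;
# for the Human2 setting, max_word_len would be 3, and very frequent words are then filtered out.
```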
3.2. Parameter Settings and Evaluation Criteria
The DMR-LLDA learning framework involves four different parameters: α, β, μ, and σ². Here, α and β are the parameters of the two Dirichlet distributions, where the larger the value of β, the more balanced the probabilities of the words within a topic. The setting of β has been discussed in reference [13]. The value of α, in turn, is optimized from the gene features, so its initial value does not have a large effect on model performance; the initial value of α and the value of β are therefore set according to experience. In addition, μ and σ² are, respectively, the mean and variance of the normal distribution obeyed by the feature weighting parameter λ, and their values are likewise set empirically.
In the Gibbs sampling process of model training, we use a single Markov chain and set the maximum number of iterations to 2000, of which the first 1000 iterations are the burn-in period. On the converged Markov chain, we record the state space every 50 iterations, so that 20 samples are recorded in total. In the prediction process, we set the number of iterations to 1000; after a burn-in period of 500 iterations, we record the state space every 50 iterations. In the variational Bayesian inference process of model training, we initialize the global variational parameters with random numbers and the hyper-parameters; in each local variational inference step, we set the convergence threshold to 0.00001 and the maximum number of local inference iterations to 100. The number of global scanning iterations is 1000.
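The sampling schedule described above can be summarized by the following sketch; gibbs_sweep and record_state are hypothetical hooks standing in for the actual CGS update and state snapshot, and are passed in as callables.

```python
def run_chain(gibbs_sweep, record_state, model, n_iter=2000, burn_in=1000, thin=50):
    """Run one Markov chain and record the state space every `thin` iterations after burn-in.

    With the training settings above (n_iter=2000, burn_in=1000, thin=50) this records
    20 samples; for prediction the text uses n_iter=1000 and burn_in=500.
    """
    samples = []
    for it in range(1, n_iter + 1):
        gibbs_sweep(model)  # one collapsed Gibbs sampling pass over all tokens
        if it > burn_in and (it - burn_in) % thin == 0:
            samples.append(record_state(model))
    return samples
```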
Five-fold cross validation is conducted to measure and compare the performance of DMR-LLDA and the comparative algorithms. Five representative multi-label learning evaluation criteria are used in this paper, including hamming loss (HL), average precision (AP), one-error, and the micro-averaged and macro-averaged F1 scores (Micro-F1 and Macro-F1). In addition, three kinds of area under the precision–recall curve (denoted AU(PRC), AUPRC, and AUPRC_w) are also used, as proposed in reference [
12]. Finally, we repeat the random partition and evaluation in five independent rounds, and report the average results.
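As a sketch of this evaluation protocol, the snippet below computes the label-based criteria with standard scikit-learn implementations and a simple one-error; it assumes the binary prediction matrix and the per-label scores are available, and it does not reproduce the precision–recall-curve measures of reference [12].

```python
import numpy as np
from sklearn.metrics import hamming_loss, f1_score

def one_error(Y_true, scores):
    """Fraction of test genes whose top-ranked label is not one of their true labels."""
    top = np.argmax(scores, axis=1)
    return float(np.mean([Y_true[i, top[i]] == 0 for i in range(Y_true.shape[0])]))

def evaluate(Y_true, Y_pred, scores):
    return {
        "HL": hamming_loss(Y_true, Y_pred),
        "Micro-F1": f1_score(Y_true, Y_pred, average="micro"),
        "Macro-F1": f1_score(Y_true, Y_pred, average="macro"),
        "one-error": one_error(Y_true, scores),
    }

# These per-fold values are averaged over the five folds and over the five
# independent random partitions, as described above.
```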
3.4. Gene Function Prediction with Cross Validation
In addition to LLDA, we also adopted three widely used methods for performance comparison: multi-label k-nearest neighbor (MLKNN) [22], back propagation for multi-label learning (BPMLL) [23], and support vector machines (SVMs). MLKNN and BPMLL are two representative multi-label classifiers and can be run with the open-source tool Mulan [24]. The SVMs adopt a “one-versus-all” scheme, which trains an independent binary SVM for each label and is implemented with the LibLinear software package [25]. These five models are trained and used for prediction on the S.C and S.C-CC datasets.
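For the SVM baseline, the “one-versus-all” scheme can be sketched with scikit-learn's LinearSVC, which is itself backed by the LIBLINEAR library of reference [25]; the hyper-parameter values here are illustrative, not those used in the experiments.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def build_ova_svm(C=1.0):
    # One independent binary linear SVM is trained per GO-term label.
    return OneVsRestClassifier(LinearSVC(C=C))

# Usage sketch: clf = build_ova_svm(); clf.fit(X_train, Y_train); Y_pred = clf.predict(X_test)
```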
Figure 5 shows the HL, AP, one-error, Micro-F1, Macro-F1, AU(PRC), AUPRC, and AUPRC_w values of all models on the two datasets. For AP, Micro-F1, Macro-F1, AU(PRC), AUPRC, and AUPRC_w, the larger the value, the better the performance; conversely, for HL and one-error, the smaller the value, the better the performance. The red asterisk in
Figure 5 represents the best results in each dataset. It is worth noting that the experimental results of this section are obtained by a CGS inference algorithm.
As shown in
Figure 5, DMR-LLDA achieves better results on almost all evaluation criteria for both datasets. A detailed analysis is given below.
For the S.C dataset, DMR-LLDA achieves the best performance on AP, AU(PRC), AUPRC, AUPRC_w, Micro-F1, and Macro-F1. For example, on AP, DMR-LLDA achieves 94%, 3.3%, 96%, and 26% improvements over MLKNN, LLDA, BPMLL, and SVMs, respectively. On AU(PRC), it achieves improvements of 109%, 2.3%, 89%, and 24%; on AUPRC, 31%, 39%, 44%, and 25%; on AUPRC_w, 33%, 8.1%, 48%, and 10%; on Micro-F1, 116%, 6.1%, 123%, and 29%; and on Macro-F1, 22%, 2.9%, 24%, and 25%, over the same four methods. Nevertheless, for one-error and HL, the SVMs obtain better results than DMR-LLDA.
For the S.C-CC dataset, DMR-LLDA obtains the better performance on AP, AU(PRC), AUPRC, AUPRC_w, Micro-F1, and Macro-F1. On AP, it achieves 36%, 1.7%, 39%, and 30% improvements over MLKNN, LLDA, BPMLL, and SVMs, respectively; on AU(PRC), 68%, 4%, 64%, and 20%; on AUPRC, 73%, 35%, 62%, and 34%; on AUPRC_w, 67%, 6.4%, 92%, and 23%; on Micro-F1, 101%, 4.1%, 114%, and 26%; and on Macro-F1, 18%, 1.8%, 16%, and 20%. Nevertheless, for one-error, BPMLL obtains better results than DMR-LLDA, and for HL, the SVMs obtain better results than DMR-LLDA.
For both datasets, we find that the improvements on AU(PRC) are more significant than those on AUPRC and AUPRC_w, which indicates that DMR-LLDA has a stronger effect on improving the overall accuracy of gene function prediction irrespective of label weights. Comparing the S.C and S.C-CC datasets, the values of AP, AU(PRC), AUPRC, and AUPRC_w on the S.C dataset are lower than those on the S.C-CC dataset, while the one-error and HL values on S.C are higher than those on S.C-CC. This is because the two datasets share the same word space but differ in the number of labels: the smaller label set of the S.C-CC dataset allows a higher classification performance.
Overall, these results indicate that DMR-LLDA can further improve the accuracy of gene function prediction by introducing the DMR framework into the LLDA model, which optimizes the hyper-parameters of the topic weights. In particular, DMR-LLDA shows a clear advantage in improving the overall prediction accuracy.
3.6. The Comparison Results of Inference Algorithms
We designed three kinds of inference algorithms for DMR-LLDA: CGS, CVB, and CVB0. This section compares CGS with CVB0 on the S.C and S.C-CC datasets. The experimental results are shown in Figure 6, where the overall performance of CVB0 is better than that of CGS. A detailed analysis is given below.
For the S.C dataset, CVB0 achieves the best results on AP, AU(PRC), AUPRC, and AUPRC_w, and achieves nearly the same result on HL; however, it obtains a worse value on one-error.
For the S.C-CC dataset, CVB0 achieves the best results on AP, one-error, AU(PRC), AUPRC, and AUPRC_w, and achieves nearly the same result on HL. These results demonstrate the validity of the designed inference algorithms for DMR-LLDA. Meanwhile, they indicate that the CVB0 inference algorithm can obtain more precise predictions through the re-collapse of the hidden variable space based on Jensen's inequality.
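For reference, a minimal sketch of the standard zero-order collapsed variational (CVB0) update for one token in an LDA-style model is shown below; the DMR-LLDA version additionally restricts the topics to a gene's labels and uses the feature-derived hyper-parameters, and all variable names here are ours.

```python
import numpy as np

def cvb0_update(n_dk, n_kw, n_k, alpha_d, beta, V):
    """Zero-order collapsed variational update for one token.

    n_dk: (K,) expected topic counts of the current gene (document), this token excluded.
    n_kw: (K,) expected counts of the current word under each topic, this token excluded.
    n_k:  (K,) expected total counts of each topic, this token excluded.
    alpha_d: (K,) Dirichlet hyper-parameters for this gene; beta: symmetric word prior;
    V: vocabulary size. Returns the updated (K,) topic responsibilities for the token.
    """
    unnorm = (n_dk + alpha_d) * (n_kw + beta) / (n_k + V * beta)
    return unnorm / unnorm.sum()
```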
In summary, due to the lack of prior knowledge, the prior distributions of a Bayesian model are usually chosen for convenience, and their parameters are typically fixed to empirically chosen values, which leads to inaccurate estimates of the posterior distributions. In our DMR-LLDA model, gene feature information is introduced into LLDA as prior knowledge through the DMR framework: the hyper-parameter of the prior distribution is updated during inference rather than being held at a fixed constant, which improves the estimation precision of the posterior distributions and thus the accuracy of gene function prediction.