Adapting Hidden Naive Bayes for Text Classification

Abstract: Due to its simplicity, efficiency, and effectiveness, multinomial naive Bayes (MNB) has been widely used for text classification. As in naive Bayes (NB), its assumption of the conditional independence of features is often violated and, therefore, reduces its classification performance. Of the numerous approaches to alleviating this assumption, structure extension has attracted less attention from researchers. To the best of our knowledge, only structure-extended MNB (SEMNB) has been proposed so far. SEMNB averages all weighted super-parent one-dependence multinomial estimators; therefore, it is an ensemble learning model. In this paper, we propose a single model called hidden MNB (HMNB) by adapting the well-known hidden NB (HNB). HMNB creates a hidden parent for each feature, which synthesizes all the other qualified features' influences. To learn HMNB, we propose a simple but effective learning algorithm that does not incur a high-computational-complexity structure-learning process. Our improved idea can also be used to improve complement NB (CNB) and the one-versus-all-but-one model (OVA), and the resulting models are simply denoted as HCNB and HOVA, respectively. The extensive experiments on eleven benchmark text classification datasets validate the effectiveness of HMNB, HCNB, and HOVA.


Introduction
Due to its simplicity, efficiency, and effectiveness, naive Bayes (NB) has been widely used to analyze and solve many scientific and engineering problems, such as text classification [1,2], resistance of buildings [3], identification of areas susceptible to flooding [4], and urban flooding prediction [5]. Text classification is the task of assigning a text document to a pre-specified class, and it has been widely used in many real-world fields, such as spam filtering and short message service (SMS) filtering [2,6]. With the exponential growth of text data in various fields, text classification has attracted more and more attention from researchers in recent years. To address text classification tasks, text documents are generally represented by all of the words that occur in them. Because of the large numbers of documents, the large numbers of words, and the strong dependencies among these words, accurate and fast text classification presents unique challenges.
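As a minimal illustration of this word-based representation (the `featurize` function, toy vocabulary, and document below are hypothetical, not from the paper), each document can be mapped to a vector of per-word statistics over a fixed vocabulary:

```python
from collections import Counter

def featurize(tokens, vocabulary):
    """Turn one tokenized document into a term-frequency vector
    (multinomial view) and a binary occurrence vector (Bernoulli view)."""
    counts = Counter(tokens)
    freq = [counts.get(w, 0) for w in vocabulary]
    binary = [1 if f > 0 else 0 for f in freq]
    return freq, binary

# Toy example: a three-word vocabulary and one short document.
vocab = ["free", "meeting", "prize"]
freq, binary = featurize(["free", "prize", "free"], vocab)
# freq   -> [2, 0, 1]  (keeps frequency information)
# binary -> [1, 0, 1]  (keeps only occurrence information)
```

The two output vectors preview the distinction between the count-based and occurrence-based language models discussed next.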
Unquestionably, treating each word as a Boolean variable is the simplest approach to applying machine learning to text classification. Based on this idea, multi-variate Bernoulli naive Bayes (BNB) [7] was proposed as the first statistical language model. BNB represents a document using a vector of binary feature variables that indicates whether or not each word occurs in the document and, thus, ignores the frequency information of each occurring word. To capture this frequency information, multinomial naive Bayes (MNB) [8] was proposed. Ref. [8] showed that MNB achieves, on average, a 27% reduction in the error rate compared to BNB at any vocabulary size. However, when the number of training documents of one class is much greater than that of the others, MNB tends to select poor weights for the decision boundary. To balance the number of training documents and to address the problem of skewed training data, a complement variant of MNB called complement naive Bayes (CNB) was proposed [9]. As a combination of MNB and CNB, the one-versus-all-but-one model (OVA) [9] was also proposed.
Given a test document d, MNB, CNB, and OVA classify it by using Equations (1)-(3), respectively:

$$c(d) = \arg\max_{c \in C} P(c) \prod_{i=1}^{m} P(w_i \mid c)^{f_i}, \quad (1)$$

$$c(d) = \arg\max_{c \in C} \left[ \log P(c) - \sum_{i=1}^{m} f_i \log P(w_i \mid \bar{c}) \right], \quad (2)$$

$$c(d) = \arg\max_{c \in C} \left[ \log P(c) + \sum_{i=1}^{m} f_i \log P(w_i \mid c) - \log P(\bar{c}) - \sum_{i=1}^{m} f_i \log P(w_i \mid \bar{c}) \right], \quad (3)$$

where c is each possible class label, C is the set of all classes, c̄ denotes the complement classes of c (i.e., all classes other than c), m is the number of different words in the text collection, w_i (i = 1, 2, · · · , m) is the ith word that occurs in d, and f_i is the frequency count of the word w_i in d. The prior probabilities P(c) and P(c̄) are computed with Equations (4) and (5), respectively, and the conditional probabilities P(w_i|c) and P(w_i|c̄) are computed with Equations (6) and (7), respectively:

$$P(c) = \frac{\sum_{j=1}^{n} \delta(c_j, c) + 1}{n + s}, \quad (4)$$

$$P(\bar{c}) = \frac{\sum_{j=1}^{n} \delta(c_j, \bar{c}) + 1}{n + s}, \quad (5)$$

$$P(w_i \mid c) = \frac{\sum_{j=1}^{n} f_{ji}\, \delta(c_j, c) + 1}{\sum_{j=1}^{n} \sum_{i=1}^{m} f_{ji}\, \delta(c_j, c) + m}, \quad (6)$$

$$P(w_i \mid \bar{c}) = \frac{\sum_{j=1}^{n} f_{ji}\, \delta(c_j, \bar{c}) + 1}{\sum_{j=1}^{n} \sum_{i=1}^{m} f_{ji}\, \delta(c_j, \bar{c}) + m}, \quad (7)$$
where s is the number of classes, n is the number of training documents, c_j is the class label of the jth training document, f_ji is the frequency count of the ith word in the jth training document, and δ(c_j, c) and δ(c_j, c̄) are two indicator functions defined by:

$$\delta(c_j, c) = \begin{cases} 1, & c_j = c \\ 0, & \text{otherwise,} \end{cases} \quad (8)$$

$$\delta(c_j, \bar{c}) = \begin{cases} 1, & c_j \neq c \\ 0, & \text{otherwise.} \end{cases} \quad (9)$$

Due to their simplicity, efficiency, and efficacy, MNB and its variants, including CNB and OVA, have been widely used for text classification. However, as in naive Bayes (NB), the assumption of the conditional independence of the attributes (i.e., features) that they require is usually violated and, therefore, reduces their classification accuracy. To alleviate this assumption, many approaches have been proposed. These approaches can be divided into five categories [10,11]: (1) feature weighting; (2) feature selection; (3) instance weighting; (4) instance selection; and (5) structure extension.
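The Laplace-smoothed prior and conditional probability estimates described above can be sketched in Python as follows (function and variable names are illustrative; F stands for an n × m term-count matrix):

```python
def mnb_estimates(F, y, classes):
    """Laplace-smoothed MNB estimates.
    F: n x m term-count matrix (list of lists); y: per-document class labels.
    Returns prior P(c) and conditional P(w_i|c) for each class c."""
    n, m, s = len(F), len(F[0]), len(classes)
    prior, cond = {}, {}
    for c in classes:
        idx = [j for j in range(n) if y[j] == c]
        prior[c] = (len(idx) + 1) / (n + s)            # add-one smoothed prior
        totals = [sum(F[j][i] for j in idx) for i in range(m)]
        denom = sum(totals) + m                        # total count in c, plus m
        cond[c] = [(t + 1) / denom for t in totals]    # add-one smoothed P(w_i|c)
    return prior, cond

# Toy data: three documents, two words, two classes.
F = [[2, 0], [0, 3], [1, 1]]
y = ["spam", "ham", "spam"]
prior, cond = mnb_estimates(F, y, ["spam", "ham"])
# prior["spam"] == (2 + 1) / (3 + 2) == 0.6
```

The complement estimates used by CNB follow the same pattern with the class-membership test inverted.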
Among these approaches, structure extension has attracted far less attention from researchers. To the best of our knowledge, only structure-extended multinomial naive Bayes (SEMNB) [12] has been proposed so far. SEMNB averages all weighted super-parent one-dependence multinomial estimators and, therefore, is an ensemble learning model. In this paper, we propose a single model called hidden multinomial naive Bayes (HMNB). HMNB creates a hidden parent for each feature, which synthesizes all of the other qualified features' influences. To learn HMNB, we propose a simple but effective learning algorithm that does not incur a high-computational-complexity structure-learning process. Our improved idea can also be used to improve CNB and OVA, and the resulting models are simply denoted as HCNB and HOVA, respectively. The extensive experiments on eleven benchmark text classification datasets show that the proposed HMNB, HCNB, and HOVA significantly outperform their state-of-the-art competitors.
To sum up, the main contributions of our work include the following:
• We conducted a comprehensive survey on MNB extensions. Based on the survey, existing work can be divided into five categories: feature weighting, feature selection, instance weighting, instance selection, and structure extension.
• We found that structure extension has attracted much less attention from researchers and that only SEMNB has been proposed so far. However, it is an ensemble learning model.
• We propose a single model called hidden MNB (HMNB) by adapting the well-known hidden NB (HNB). HMNB creates a hidden parent for each feature, which synthesizes all of the other qualified features' influences. To learn HMNB, we propose a simple but effective learning algorithm that does not incur a high-computational-complexity structure-learning process. At the same time, we propose HCNB and HOVA.
• The extensive experiments on eleven benchmark text classification datasets validate the effectiveness of HMNB, HCNB, and HOVA.
The remainder of this paper is organized as follows. Section 2 presents a compact survey of the five categories of existing approaches. Section 3 describes our proposed models in detail. Section 4 presents the experimental setup and results. Section 5 draws conclusions and outlines the main directions for future study.

Feature Weighting
The feature weighting approach assigns different weights W_i (i = 1, 2, · · · , m) to different features (i.e., attributes) in building MNB, CNB, and OVA. To learn W_i (i = 1, 2, · · · , m), Ref. [13] proposed χ² statistic-based feature weighting, which is denoted by R_{w,c}. When R_{w,c} is used to improve MNB, CNB, and OVA, the resulting models are simply denoted by R_{w,c}MNB, R_{w,c}CNB, and R_{w,c}OVA, respectively. In addition, Ref. [14] proposed a deep feature weighting approach, simply denoted by DFW, which incorporates the learned weights W_i (i = 1, 2, · · · , m) into not only the classification formula but also the conditional probability estimates. When DFW is applied to MNB, CNB, and OVA, the resulting models are simply denoted by DFWMNB, DFWCNB, and DFWOVA, respectively.
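To make the role of the learned weights concrete, the following sketch (hypothetical names; a simplification of the cited methods, not their exact formulation) shows how a per-feature weight W_i typically enters the MNB decision rule: in log space, each word's contribution is scaled by its weight. Deep feature weighting additionally pushes the same weights into the conditional probability estimates themselves.

```python
import math

def weighted_mnb_log_score(prior_c, cond_c, freqs, weights):
    """Weighted MNB log-score for one class:
    log P(c) + sum_i W_i * f_i * log P(w_i | c)."""
    return math.log(prior_c) + sum(
        w * f * math.log(p) for w, f, p in zip(weights, freqs, cond_c)
    )

# With all weights equal to 1, this reduces to the ordinary MNB log-score.
score = weighted_mnb_log_score(0.5, [0.5, 0.25], [1, 2], [1.0, 0.5])
```

The class with the highest score is predicted, exactly as in unweighted MNB.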
Based on the idea of deep feature weighting, Ref. [15] adapted two other deep feature weighting approaches: gain-ratio-based feature weighting (GRW) and decision-tree-based feature weighting (DTW). GRW sets the weight of each feature to its gain ratio relative to the average gain ratio across all features. When GRW is applied to MNB, CNB, and OVA, the resulting models are denoted by GRWMNB, GRWCNB, and GRWOVA, respectively. DTW sets the weight of each feature to be inversely proportional to the minimum depth at which it is tested in the built tree. When DTW is applied to MNB, CNB, and OVA, the resulting models are denoted by DTWMNB, DTWCNB, and DTWOVA, respectively.

Feature Selection
The feature selection approach trains MNB, CNB, and OVA on only the selected features instead of all features. Feature selection is not new to the machine learning community; in this paper, we focus our attention on text classification problems, in which the very high dimensionality of the features is a major characteristic and difficulty. Even a moderate-sized text collection may contain a vocabulary that is too large for many machine learning algorithms. Therefore, it is indeed desirable to reduce the dimensionality without harming the classification accuracy. To execute feature selection, many approaches have been proposed. Ref. [16] conducted a comparative survey on five feature selection approaches. In addition, Ref. [17] proposed another feature selection approach based on a two-stage Markov blanket.
Generally, wrapper approaches have superior accuracy compared to filter approaches, but filter approaches always run faster than wrapper approaches. To integrate their advantages, Ref. [18] proposed gain-ratio-based feature selection (GRS). GRS takes advantage of base classifiers to evaluate the selected feature subsets like wrappers, but it does not need to repeatedly search feature subsets and train base classifiers. When GRS is applied to MNB, CNB, and OVA, the resulting models are simply denoted by GRSMNB, GRSCNB, and GRSOVA, respectively.

Instance Weighting
The instance weighting approach assigns different weights W_j (j = 1, 2, · · · , n) to different instances (i.e., documents) in building MNB, CNB, and OVA. To learn W_j (j = 1, 2, · · · , n), the simplest way may be boosting [19]. More specifically, the weights of the training instances misclassified by the base classifiers trained in the last iteration are increased, and then the base classifiers are trained on the re-weighted instances in the next iteration. After a predefined number of rounds, this iteration process stops.
Different from boosting [19], Ref. [20] proposed a discriminative instance weighting approach, simply denoted by DW. In each iteration of DW, each training instance is discriminatively assigned a different weight according to its computed conditional probability loss. This iteration process is repeated for a predefined number of rounds. When DW is applied to MNB, CNB, and OVA, the resulting models are simply denoted by DWMNB, DWCNB, and DWOVA, respectively.
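A single DW-style reweighting pass might look as follows. This is a schematic sketch: the step size `eta` and the exact form of the conditional probability loss are assumptions for illustration, not the precise update rule of Ref. [20].

```python
def discriminative_reweight(weights, true_class_probs, eta=1.0):
    """One reweighting pass: raise the weight of instances the current
    model finds hard, i.e., those with a low predicted probability of
    their true class (a simple conditional-probability-loss proxy)."""
    return [w + eta * (1.0 - p) for w, p in zip(weights, true_class_probs)]

# A confidently classified instance (p = 0.9) gains little weight;
# a poorly classified one (p = 0.4) gains much more.
w = discriminative_reweight([1.0, 1.0], [0.9, 0.4])
```

Repeating such passes for a fixed number of rounds, with the model retrained on the reweighted instances each time, mirrors the iterative scheme described above.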

Instance Selection
The instance selection approach builds MNB, CNB, and OVA on the selected training instances rather than on all of the training instances. For conducting instance selection, the k-nearest neighbor algorithm (KNN) is the most widely accepted. KNN selects the training instances that fall into the neighborhood of a test instance, and it helps to alleviate the assumption of the features' conditional independence required by MNB, CNB, and OVA. Therefore, combining KNN with MNB, CNB, and OVA is quite straightforward. When an instance is required for classification, a local MNB, CNB, or OVA is built on the k-nearest neighbors of the test instance and then used to classify it. Based on this idea, Ref. [21] proposed locally weighted MNB, CNB, and OVA; the resulting models are simply denoted by LWMNB, LWCNB, and LWOVA, respectively.
Instead of the k-nearest neighbor algorithm, Ref. [22] applied the decision tree learning algorithm to find test instances' nearest neighbors and then deployed MNB, CNB, or OVA on each leaf node of the built decision trees. The resulting models are simply denoted by MNBTree, CNBTree, and OVATree, respectively. MNBTree, CNBTree, and OVATree build binary trees in which the split features' values are viewed as zero and nonzero. In addition, to reduce the time consumption, the information gain measure is used to build the decision trees. Unlike LWMNB, LWCNB, and LWOVA, which are lazy learning models, MNBTree, CNBTree, and OVATree are all eager learning models.

Structure Extension
The structure extension approach uses directed arcs to explicitly represent the dependencies among features. That is to say, we need to find an optimal feature parent set Π w i for each feature w i . However, learning an optimal feature parent set Π w i for each w i is almost an NP-hard problem [23]. In addition, when the training data are limited, the variance of a complex Bayesian network is high [24], and therefore, its probability estimations are poor. Thus, a multinomial Bayesian network without structure learning that can also represent feature dependencies is desirable.
Inspired by the weighted average of one-dependence estimators (WAODE) [25], Ref. [12] proposed structure-extended multinomial naive Bayes (SEMNB). SEMNB builds a one-dependence multinomial estimator for each present word, i.e., this word is all of the other present words' parent. Then, SEMNB averages all weighted super-parent one-dependence multinomial estimators, and therefore, it is an ensemble learning model. If we apply the structure extension approach to CNB and OVA, we can easily obtain their structure-extended versions. For the sake of convenience, we denote them as SECNB and SEOVA, respectively.

The Proposed Models
Structure extension is not new to the Bayesian learning community, and especially not to the semi-naive Bayesian learning community [26,27]. Researchers have proposed many state-of-the-art structure-extended naive Bayes models, such as tree-augmented naive Bayes (TAN) [24] and its variants [28,29]. However, when the structure extension approach is applied to high-dimensional text classification data, a key issue that must be addressed is its high-computational-complexity structure-learning process. This is why structure extension has attracted less attention from researchers. To the best of our knowledge, only structure-extended multinomial naive Bayes (SEMNB) [12] has been proposed so far. SEMNB averages all weighted super-parent one-dependence multinomial estimators and, thus, skillfully avoids a high-computational-complexity structure-learning process.
The extensive experiments on a large number of text classification datasets validate its effectiveness. However, SEMNB is unquestionably an ensemble learning model. Therefore, a simple but effective single model that does not incur a high-computational-complexity structure-learning process is still desirable. This is the main motivation of our paper.
To maintain NB's simplicity and efficiency while alleviating its assumption of the attributes' conditional independence, hidden naive Bayes (HNB) [30] has achieved remarkable classification performance. Inspired by the success of HNB, in this paper, we adapt it to text classification tasks. We call the adapted model hidden multinomial naive Bayes (HMNB). In HMNB, a hidden parent w_hpi is created for each present word w_i, which combines the influences from all of the other present qualified words w_t (t = 1, 2, · · · , m ∧ t ≠ i). Now, given a test document d, HMNB classifies it by using Equation (10).
$$c(d) = \arg\max_{c \in C} P(c) \prod_{i=1}^{m} P(w_i \mid w_{hp_i}, c)^{f_i}, \quad (10)$$

where P(w_i|w_hpi, c) is computed by:

$$P(w_i \mid w_{hp_i}, c) = \sum_{t=1, t \neq i}^{m} \frac{W_t}{\sum_{t=1, t \neq i}^{m} W_t} P(w_i \mid w_t, c), \quad (11)$$

where W_t (t = 1, 2, · · · , m ∧ t ≠ i) indicates the importance of each possible parent word w_t in the hidden parent w_hpi. Therefore, for simplicity, we define it as the gain ratio GainRatio(w_t) of the word w_t that splits the training data D. However, at the same time, we only select the words w_t whose gain ratios are above the average aveGR of all words as the potential parents. The detailed calculation formulas are:

$$W_t = \begin{cases} GainRatio(w_t), & GainRatio(w_t) > aveGR \\ 0, & \text{otherwise,} \end{cases} \quad (12)$$

$$aveGR = \frac{1}{m} \sum_{t=1}^{m} GainRatio(w_t), \quad (13)$$

where the gain ratio treats each word w_t as a binary variable f_t ∈ {0, 0̄}: f_t = 0 indicates the absence of w_t, and f_t = 0̄ indicates the presence of w_t.

Now, the only thing left is the efficient calculation of P(w_i|w_t, c), the conditional probability that w_i appears given w_t and c. It is well known that the space complexity of estimating P(w_i|w_t, c) directly from D is O(sm²). For text classification tasks, m (the vocabulary size of the text collection) is often too large to store the tables of joint word-pair and class frequencies from which the conditional probability P(w_i|w_t, c) is estimated. At the same time, text data are usually in the form of a sparse matrix, and therefore, the number of different words present in a given document d (simply denoted by |d|) is much smaller than m. Therefore, as in SEMNB [12], we also transform a part of the training space consumption into classification time consumption. In more detail, we move the step of computing P(w_i|w_t, c) from the training stage to the classification stage. At the classification stage, when a test document d is predicted, P(w_i|w_t, c) is computed according to D and d. More specifically, given a word w_t in d, we only select the documents in which w_t occurs to compute P(w_i|w_t, c) by using Equation (14), which has a space complexity of only O(s|d|):

$$P(w_i \mid w_t, c) = \frac{\sum_{j: f_{jt} \neq 0} f_{ji}\, \delta(c_j, c) + 1}{\sum_{j: f_{jt} \neq 0} \sum_{i=1}^{m} f_{ji}\, \delta(c_j, c) + m}. \quad (14)$$
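Under the assumption that the pairwise conditionals P(w_i|w_t, c) are already available, the hidden-parent construction described above amounts to a weighted average over the other present words; unqualified words (gain ratio not above average) simply carry weight zero. The names below are illustrative, not from the paper:

```python
def hidden_parent_prob(i, present_words, W, pairwise, c):
    """P(w_i | w_hpi, c) as a W_t-weighted average of P(w_i | w_t, c)
    over the other present words t; words with weight 0 are skipped."""
    num = den = 0.0
    for t in present_words:
        wt = W.get(t, 0.0)
        if t == i or wt <= 0.0:      # skip the word itself and unqualified parents
            continue
        num += wt * pairwise[(i, t, c)]
        den += wt
    return num / den

# Toy example: words 1 and 2 are qualified parents of word 0.
p = hidden_parent_prob(
    0, [0, 1, 2],
    W={1: 2.0, 2: 1.0},
    pairwise={(0, 1, "c1"): 0.3, (0, 2, "c1"): 0.6},
    c="c1",
)
# p = (2.0 * 0.3 + 1.0 * 0.6) / 3.0, i.e., approximately 0.4
```

Because the weights are normalized inside the average, the result stays a valid probability whenever the pairwise conditionals are.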
In summary, the whole algorithm for learning HMNB is partitioned into a training algorithm (HMNB-Training) and a classification algorithm (HMNB-Classification). They are described by Algorithms 1 and 2, respectively. Algorithm 1 takes the time complexity of O(nm + sm), and Algorithm 2 takes the time complexity of O(n|d| 2 + s|d| 2 + s|d|), where n is the number of training documents, m is the number of different words in the text collection, s is the number of classes, and |d| is the number of different words present in a given document d.
Algorithm 2: HMNB-Classification (d, D, P(c), W_t).
Input: d (a test document), D (the training data), and the computed P(c) and W_t
Output: c(d)
1: for each word w_i (i = 1, 2, · · · , |d|) in d do
2:   for each word w_t (t = 1, 2, · · · , |d| ∧ t ≠ i) in d do
3:     Denote all training documents in which w_t occurs as D_wt;
4:     for each class c do
5:       Compute P(w_i|w_t, c) from D_wt using Equation (14);
6:     end for
7:   end for
8: end for
9: Use W_t and P(w_i|w_t, c) to compute P(w_i|w_hpi, c) with Equation (11);
10: Use P(c) and P(w_i|w_hpi, c) to predict the class label of d with Equation (10);
11: Return the predicted class label c(d)

Our improved idea can also be used to improve CNB and OVA. The resulting models are denoted as HCNB and HOVA, respectively. Given a test document d, HCNB and HOVA use Equations (15) and (16), respectively, to classify it.
$$c(d) = \arg\max_{c \in C} \left[ \log P(c) - \sum_{i=1}^{m} f_i \log P(w_i \mid w_{hp_i}, \bar{c}) \right], \quad (15)$$

$$c(d) = \arg\max_{c \in C} \left[ \log P(c) + \sum_{i=1}^{m} f_i \log P(w_i \mid w_{hp_i}, c) - \log P(\bar{c}) - \sum_{i=1}^{m} f_i \log P(w_i \mid w_{hp_i}, \bar{c}) \right], \quad (16)$$

where P(w_i|w_hpi, c̄) is computed by:

$$P(w_i \mid w_{hp_i}, \bar{c}) = \sum_{t=1, t \neq i}^{m} \frac{W_t}{\sum_{t=1, t \neq i}^{m} W_t} P(w_i \mid w_t, \bar{c}), \quad (17)$$

where P(w_i|w_t, c̄) is computed by:

$$P(w_i \mid w_t, \bar{c}) = \frac{\sum_{j: f_{jt} \neq 0} f_{ji}\, \delta(c_j, \bar{c}) + 1}{\sum_{j: f_{jt} \neq 0} \sum_{i=1}^{m} f_{ji}\, \delta(c_j, \bar{c}) + m}. \quad (18)$$

Similarly to HMNB, the algorithms for learning HCNB and HOVA are also partitioned into training algorithms (HCNB-Training and HOVA-Training) and classification algorithms (HCNB-Classification and HOVA-Classification). They are described by Algorithms 3-6, respectively. From Algorithms 3-6, we can see that the time complexities of HCNB and HOVA are almost the same as that of HMNB.
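The classification-time estimation shared by these models can be sketched as follows. This is a loose rendering of the Equation (14)-style estimate, in which only the training documents containing the parent word w_t are scanned and Laplace smoothing is assumed; the names are illustrative:

```python
def cond_prob_given_parent(i, t, c, F, y, m):
    """Laplace-smoothed P(w_i | w_t, c), estimated only from training
    documents in which word t occurs (f_jt > 0) and whose class is c.
    F: n x m term-count matrix; y: class labels; m: vocabulary size."""
    docs = [j for j in range(len(F)) if F[j][t] > 0 and y[j] == c]
    num = sum(F[j][i] for j in docs) + 1
    den = sum(sum(F[j]) for j in docs) + m
    return num / den

# Toy example: documents 0 and 2 contain word 1 and belong to class "a".
r = cond_prob_given_parent(0, 1, "a", [[2, 1], [0, 3], [1, 1]], ["a", "b", "a"], 2)
# r == 4/7: numerator 2 + 1 + 1, denominator 3 + 2 + 2
```

Because only the |d| words present in the test document ever serve as parents, the per-document tables stay small, matching the O(s|d|) space argument above.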

Algorithm 4: HCNB-Classification (d, D, P(c), W t ).
Input: d (a test document), D (the training data), and the computed P(c) and W_t
Output: c(d)
1: for each word w_i (i = 1, 2, · · · , |d|) in d do
2:   for each word w_t (t = 1, 2, · · · , |d| ∧ t ≠ i) in d do
3:     Denote all training documents in which w_t occurs as D_wt;
4:     for each class c do
5:       Compute P(w_i|w_t, c̄) from D_wt using Equation (18);
6:     end for
7:   end for
8: end for
9: Use W_t and P(w_i|w_t, c̄) to compute P(w_i|w_hpi, c̄) with Equation (17);
10: Use P(c) and P(w_i|w_hpi, c̄) to predict the class label of d with Equation (15);
11: Return the predicted class label c(d)
Algorithm 6: HOVA-Classification (d, D, P(c), P(c̄), W_t).
Input: d (a test document), D (the training data), and the computed P(c), P(c̄), and W_t
Output: c(d)
1: for each word w_i (i = 1, 2, · · · , |d|) in d do
2:   for each word w_t (t = 1, 2, · · · , |d| ∧ t ≠ i) in d do
3:     Denote all training documents in which w_t occurs as D_wt;
4:     for each class c do
5:       Compute P(w_i|w_t, c) from D_wt using Equation (14);
6:       Compute P(w_i|w_t, c̄) from D_wt using Equation (18);
7:     end for
8:   end for
9: end for
10: Use W_t and P(w_i|w_t, c) to compute P(w_i|w_hpi, c) with Equation (11);
11: Use W_t and P(w_i|w_t, c̄) to compute P(w_i|w_hpi, c̄) with Equation (17);
12: Use P(c), P(c̄), P(w_i|w_hpi, c), and P(w_i|w_hpi, c̄) to predict the class label of d with Equation (16);
13: Return the predicted class label c(d)

Experiments and Results
To validate the effectiveness of the proposed HMNB, HCNB, and HOVA, we designed and completed three groups of experiments. The first group compared HMNB with MNB, R_{w,c}MNB, GRSMNB, DWMNB, MNBTree, and SEMNB. The second group compared HCNB with CNB, R_{w,c}CNB, GRSCNB, DWCNB, CNBTree, and SECNB. The third group compared HOVA with OVA, R_{w,c}OVA, GRSOVA, DWOVA, OVATree, and SEOVA. We used the existing implementations of MNB and CNB in the Waikato Environment for Knowledge Analysis (WEKA) platform [31] and implemented all of the other models on the same platform.
We conducted our three groups of experiments on eleven well-known text classification tasks published on the homepage of the WEKA platform [31], which cover a wide range of text classification characteristics. Table 1 lists the detailed information of these eleven datasets. All of them were obtained from OHSUMED-233445, Reuters-21578, TREC, and the WebACE project. Ref. [32] originally converted them into term counts. Tables 2-4 show the comparison of the accuracy of each model on each dataset after averaging the classification accuracies from ten runs of 10-fold cross-validation, respectively. Then, we used two-tailed t-tests at the 95% confidence level [33] to compare the proposed HMNB, HCNB, and HOVA to each of their competitors. In these tables, the symbols • and ◦ denote statistically significant improvement and degradation with respect to the competitors, respectively. The averaged classification accuracies and the Win/Tie/Lose (W/T/L) values are summarized at the bottom of the tables. The averaged classification accuracy of each model across all datasets provides a gross indicator of the relative classification performance in addition to the other statistics. Each W/T/L value in these tables indicates that, compared to their competitors, HMNB, HCNB, and HOVA won on W datasets, tied on T datasets, and lost on L datasets. Based on the accuracy comparisons presented in Tables 2-4, we then used the KEEL software [34] to complete the Wilcoxon signed-rank test [35,36] in order to thoroughly compare each pair of models. The Wilcoxon signed-rank test ranks the differences in the performance of two classification models for each dataset, ignoring the signs, and compares the ranks for the positive R+ and the negative R− differences [35,36].
According to the table of the exact critical values for the Wilcoxon test, for a confidence level of α = 0.05 and N = 11 datasets, we speak of two classification models as being "significantly different" if the smaller of R+ and R− is equal to or less than 11, and thus, we reject the null hypothesis. Tables 5-7 summarize the related comparison results. In these tables, • denotes that the model in the column improves on the model in the corresponding row, and ◦ denotes that the model in the row improves on the model in the corresponding column. In the lower diagonal, the significance level is α = 0.05; in the upper diagonal, the significance level is α = 0.1. From all of the above comparison results, we can draw the following highlights:
• Finally, we conducted the Wilcoxon signed-rank test [35,36] to compare each pair of HMNB, HCNB, and HOVA. The detailed comparison results are shown in Table 8, from which we can see that HMNB was almost tied with HCNB and HOVA, while HCNB was notably better than HOVA. Considering the simplicity of the models, HMNB and HCNB could be appropriate choices.
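The R+ and R− rank sums underlying the Wilcoxon signed-rank test can be computed as in the following sketch (the accuracy vectors are hypothetical; in practice, a library routine such as scipy.stats.wilcoxon or the KEEL software would also supply the p-value):

```python
def wilcoxon_ranks(acc_a, acc_b):
    """Rank the absolute per-dataset accuracy differences (zero differences
    dropped, ties sharing their mean rank), then sum the ranks of positive
    (A better) and negative (B better) differences."""
    diffs = [a - b for a, b in zip(acc_a, acc_b) if a != b]
    order = sorted(range(len(diffs)), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * len(diffs)
    k = 0
    while k < len(order):
        j = k
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[k]]):
            j += 1                                  # extend the tie run
        mean_rank = (k + j) / 2 + 1                 # 1-based mean rank of the run
        for idx in order[k:j + 1]:
            ranks[idx] = mean_rank
        k = j + 1
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return r_plus, r_minus

# Hypothetical accuracies of two models on four datasets (one exact tie).
r_plus, r_minus = wilcoxon_ranks([0.9, 0.8, 0.7, 0.6], [0.85, 0.82, 0.7, 0.5])
# r_plus == 5.0, r_minus == 1.0
```

The smaller of the two sums is then compared against the exact critical value for the given N, as described above.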

Conclusions and Future Study
To alleviate MNB's assumption of the features' conditional independence, this paper proposed a single model called hidden MNB (HMNB) by adapting the well-known hidden NB (HNB). HMNB creates a hidden parent for each feature that synthesizes all of the other qualified features' influences. To learn HMNB, we proposed a simple but effective learning algorithm that does not incur a high-computational-complexity structure-learning process. Our improved idea can also be used to improve CNB and OVA, and the resulting models are simply denoted as HCNB and HOVA, respectively. The extensive experiments show that the proposed HMNB, HCNB, and HOVA significantly outperform their state-of-the-art competitors.
In the proposed HMNB, HCNB, and HOVA, how the weight (importance) of each possible parent word is defined is crucial. Currently, we directly use the gain ratio of each possible parent word that splits the training data to define the weight, which is somewhat rough. We believe that using more sophisticated methods, such as the expectation-maximization (EM) algorithm, could further improve their classification performance and strengthen their superiority. This is a main topic for future study. In addition, to reduce the training space complexity, we transform a part of the training space consumption into classification time consumption, which leads to a relatively high classification time complexity. Therefore, improving the efficiency of the proposed models is another interesting topic for future study.