Article

Adapting Hidden Naive Bayes for Text Classification

1 College of Computer, Hubei University of Education, Wuhan 430205, China
2 School of Computer Science, China University of Geosciences, Wuhan 430074, China
* Author to whom correspondence should be addressed.
Mathematics 2021, 9(19), 2378; https://doi.org/10.3390/math9192378
Submission received: 5 September 2021 / Revised: 18 September 2021 / Accepted: 22 September 2021 / Published: 25 September 2021
(This article belongs to the Special Issue Machine Learning and Data Mining: Techniques and Tasks)

Abstract

Due to its simplicity, efficiency, and effectiveness, multinomial naive Bayes (MNB) has been widely used for text classification. As in naive Bayes (NB), its assumption of the conditional independence of features is often violated, which reduces its classification performance. Among the numerous approaches to alleviating this assumption, structure extension has attracted less attention from researchers. To the best of our knowledge, only structure-extended MNB (SEMNB) has been proposed so far. SEMNB averages all weighted super-parent one-dependence multinomial estimators; therefore, it is an ensemble learning model. In this paper, we propose a single model called hidden MNB (HMNB) by adapting the well-known hidden NB (HNB). HMNB creates a hidden parent for each feature, which synthesizes all the other qualified features' influences. To learn HMNB, we propose a simple but effective learning algorithm that does not incur a high-computational-complexity structure-learning process. Our improved idea can also be used to improve complement NB (CNB) and the one-versus-all-but-one model (OVA), and the resulting models are simply denoted as HCNB and HOVA, respectively. Extensive experiments on eleven benchmark text classification datasets validate the effectiveness of HMNB, HCNB, and HOVA.

1. Introduction

Due to its simplicity, efficiency, and effectiveness, naive Bayes (NB) has been widely used to analyze and solve many scientific and engineering problems, such as text classification [1,2], the resistance of buildings [3], the identification of areas susceptible to flooding [4], and urban flooding prediction [5]. Text classification is the task of assigning a text document to a pre-specified class, and it has been widely applied in many real-world fields, such as spam filtering and short message service (SMS) filtering [2,6]. With the exponential growth of text data in various fields, text classification has attracted more and more attention from researchers in recent years. To address text classification tasks, text documents are generally represented by all of the words that occur in them. Because of the large number of documents, the large number of words, and the strong dependencies among these words, accurate and fast text classification presents unique challenges.
Unquestionably, treating each word as a Boolean variable is the simplest approach to applying machine learning to text classification. Based on this idea, multi-variate Bernoulli naive Bayes (BNB) [7] was proposed as the first statistical language model. BNB represents a document as a vector of binary feature variables indicating whether or not each word occurs in the document and, thus, ignores how often each word occurs. To capture the frequency information of each occurring word, multinomial naive Bayes (MNB) [8] was proposed. Ref. [8] showed that MNB achieves, on average, a 27% reduction in the error rate compared to BNB at any vocabulary size. However, when the number of training documents of one class is much greater than those of the others, MNB tends to select poor weights for the decision boundary. To balance the numbers of training documents and to address the problem of skewed training data, a complement variant of MNB called complement NB (CNB) was proposed [9]. The one-versus-all-but-one model (OVA) [9] was then proposed as a combination of MNB and CNB.
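To make the difference between the two event models concrete, the following minimal Python sketch (an illustration of ours, not code from the paper; the toy vocabulary and document are hypothetical) builds both a binary (Bernoulli) and a count (multinomial) representation of the same document.

```python
from collections import Counter

# Hypothetical toy vocabulary and document.
vocabulary = ["money", "free", "meeting", "report"]
document = "free money free money now"

counts = Counter(document.split())

# Multinomial (MNB) representation: term frequencies f_i.
multinomial_features = [counts.get(w, 0) for w in vocabulary]        # [2, 2, 0, 0]

# Bernoulli (BNB) representation: binary occurrence indicators.
bernoulli_features = [int(counts.get(w, 0) > 0) for w in vocabulary]  # [1, 1, 0, 0]

print(multinomial_features, bernoulli_features)
```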
Given a test document d, which is generally represented by a word vector $\langle w_1, w_2, \ldots, w_m \rangle$, MNB, CNB, and OVA classify it with Equations (1)–(3), respectively.

$c(d) = \arg\max_{c \in C} \left( P(c) \prod_{i=1}^{m} P(w_i \mid c)^{f_i} \right)$ (1)

$c(d) = \arg\max_{c \in C} \left( -P(\bar{c}) \prod_{i=1}^{m} P(w_i \mid \bar{c})^{f_i} \right)$ (2)

$c(d) = \arg\max_{c \in C} \left( P(c) \prod_{i=1}^{m} P(w_i \mid c)^{f_i} - P(\bar{c}) \prod_{i=1}^{m} P(w_i \mid \bar{c})^{f_i} \right)$ (3)

where c is each possible class label, C is the set of all class labels, $\bar{c}$ is the complement class of c (i.e., all classes other than c), m is the number of different words in the text collection, $w_i$ $(i = 1, 2, \ldots, m)$ is the ith word that occurs in d, and $f_i$ is the frequency count of the word $w_i$ in d. The prior probabilities $P(c)$ and $P(\bar{c})$ are computed with Equations (4) and (5), respectively, and the conditional probabilities $P(w_i \mid c)$ and $P(w_i \mid \bar{c})$ are computed with Equations (6) and (7), respectively.
$P(c) = \dfrac{\sum_{j=1}^{n} \delta(c_j, c) + 1}{n + s}$ (4)

$P(\bar{c}) = \dfrac{\sum_{j=1}^{n} \delta(c_j, \bar{c}) + 1}{n + s}$ (5)

$P(w_i \mid c) = \dfrac{\sum_{j=1}^{n} f_{ji}\,\delta(c_j, c) + 1}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ji}\,\delta(c_j, c) + m}$ (6)

$P(w_i \mid \bar{c}) = \dfrac{\sum_{j=1}^{n} f_{ji}\,\delta(c_j, \bar{c}) + 1}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ji}\,\delta(c_j, \bar{c}) + m}$ (7)

where s is the number of classes, n is the number of training documents, $c_j$ is the class label of the jth training document, $f_{ji}$ is the frequency count of the ith word in the jth training document, and $\delta(c_j, c)$ and $\delta(c_j, \bar{c})$ are two indicator functions defined by:
$\delta(c_j, c) = \begin{cases} 1, & \text{if } c_j = c \\ 0, & \text{otherwise} \end{cases}$ (8)

$\delta(c_j, \bar{c}) = \begin{cases} 1, & \text{if } c_j \in \bar{c} \text{, namely } c_j \neq c \\ 0, & \text{otherwise} \end{cases}$ (9)
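As a concrete illustration of Equations (1), (4), and (6), the following minimal Python sketch trains and applies a plain MNB classifier with Laplace smoothing on term-count vectors; the function and variable names, as well as the toy data, are ours and not taken from the paper.

```python
import numpy as np

def train_mnb(X, y, n_classes):
    """X: (n_docs, m) term-count matrix; y: class labels in {0, ..., n_classes-1}."""
    n, m = X.shape
    prior = np.zeros(n_classes)
    cond = np.zeros((n_classes, m))
    for c in range(n_classes):
        in_c = (y == c)
        prior[c] = (in_c.sum() + 1.0) / (n + n_classes)   # Equation (4)
        counts = X[in_c].sum(axis=0)                      # word counts within class c
        cond[c] = (counts + 1.0) / (counts.sum() + m)     # Equation (6)
    return prior, cond

def classify_mnb(doc, prior, cond):
    """doc: length-m term-count vector; returns the arg max of Equation (1) in log space."""
    log_scores = np.log(prior) + doc @ np.log(cond).T
    return int(np.argmax(log_scores))

# Toy example: 4 documents, a vocabulary of 3 words, 2 classes.
X = np.array([[2, 0, 1], [3, 1, 0], [0, 2, 2], [0, 3, 1]])
y = np.array([0, 0, 1, 1])
prior, cond = train_mnb(X, y, n_classes=2)
print(classify_mnb(np.array([1, 0, 2]), prior, cond))
```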
Due to their simplicity, efficiency, and efficacy, MNB and its variants, including CNB and OVA, have been widely used for text classification. However, as in naive Bayes (NB), the assumption of conditional independence among attributes (i.e., features) that they rely on is usually violated, which reduces their classification accuracy. To alleviate this assumption, many approaches have been proposed. These approaches can be divided into five categories [10,11]: (1) feature weighting; (2) feature selection; (3) instance weighting; (4) instance selection; (5) structure extension.
Among these approaches, structure extension has attracted far less attention from researchers. To the best of our knowledge, only structure-extended multinomial naive Bayes (SEMNB) [12] has been proposed so far. SEMNB averages all weighted super-parent one-dependence multinomial estimators and, therefore, is an ensemble learning model. In this paper, we propose a single model called hidden multinomial naive Bayes (HMNB). HMNB creates a hidden parent for each feature, which synthesizes all the other qualified features' influences. To learn HMNB, we propose a simple but effective learning algorithm that does not incur a high-computational-complexity structure-learning process. Our improved idea can also be used to improve CNB and OVA, and the resulting models are simply denoted as HCNB and HOVA, respectively. The extensive experiments on eleven benchmark text classification datasets show that the proposed HMNB, HCNB, and HOVA significantly outperform their state-of-the-art competitors.
To sum up, the main contributions of our work include the following:
  • We conducted a comprehensive survey on MNB extensions. Based on the survey, existing work can be divided into five categories: feature weighting, feature selection, instance weighting, instance selection, and structure extension.
  • We found that structure extension has attracted much less attention from researchers, and only SEMNB has been proposed so far. However, SEMNB is an ensemble learning model.
  • We proposed a single model called hidden MNB (HMNB) by adapting the well-known hidden NB (HNB). HMNB creates a hidden parent for each feature, which synthesizes all of the other qualified features' influences. To learn HMNB, we proposed a simple but effective learning algorithm that does not incur a high-computational-complexity structure-learning process. At the same time, we proposed HCNB and HOVA.
  • The extensive experiments on eleven benchmark text classification datasets validate the effectiveness of HMNB, HCNB, and HOVA.
The remainder of this paper is organized as follows. Section 2 provides a compact survey of the five categories of existing approaches. Section 3 describes our proposed models in detail. Section 4 presents the experimental setup and results. Section 5 draws conclusions and outlines the main directions for future work.

2. Related Work

2.1. Feature Weighting

The feature weighting approach assigns different weights $W_i$ $(i = 1, 2, \ldots, m)$ to different features (i.e., attributes) when building MNB, CNB, and OVA. To learn $W_i$ $(i = 1, 2, \ldots, m)$, Ref. [13] proposed $\chi^2$ statistic-based feature weighting, which is denoted by $R_{w,c}$. When $R_{w,c}$ is used to improve MNB, CNB, and OVA, the resulting models are simply denoted by $R_{w,c}$MNB, $R_{w,c}$CNB, and $R_{w,c}$OVA, respectively. In addition, Ref. [14] proposed a deep feature weighting approach, simply denoted by DFW, which incorporates the learned weights $W_i$ $(i = 1, 2, \ldots, m)$ not only into the classification formula, but also into the conditional probability estimates. When DFW is applied to MNB, CNB, and OVA, the resulting models are simply denoted by DFWMNB, DFWCNB, and DFWOVA, respectively.
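One common way to plug such per-feature weights into the MNB decision rule of Equation (1) is to scale each word's log-likelihood contribution by its weight. The sketch below illustrates this general idea only; it is not the exact formulation of [13] or [14], and the argument names are ours.

```python
import numpy as np

def classify_weighted_mnb(doc, prior, cond, weights):
    """Feature-weighted MNB scoring: each word's contribution f_i * log P(w_i | c)
    is additionally scaled by its learned weight W_i.
    `prior` and `cond` are the estimates of Equations (4) and (6);
    `weights` is a length-m vector of feature weights (illustrative only)."""
    weighted_counts = doc * weights                        # f_i * W_i
    log_scores = np.log(prior) + weighted_counts @ np.log(cond).T
    return int(np.argmax(log_scores))
```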
Based on the idea of deep feature weighting, Ref. [15] adapted two other deep feature weighting approaches: gain-ratio-based feature weighting (GRW) and decision-tree-based feature weighting (DTW). GRW sets the weight of each feature to its gain ratio relative to the average gain ratio across all features. When GRW is applied to MNB, CNB, and OVA, the resulting models are denoted by GRWMNB, GRWCNB, and GRWOVA, respectively. DTW sets the weight of each feature to be inversely proportional to the minimum depth at which it is tested in the built tree. When DTW is applied to MNB, CNB, and OVA, the resulting models are denoted by DTWMNB, DTWCNB, and DTWOVA, respectively.

2.2. Feature Selection

The feature selection approach trains MNB, CNB, and OVA on only the selected features instead of all features. In the machine learning community, feature selection is not new. In this paper, we focus our attention on text classification problems, in which the very high dimensionality of the feature space is a major characteristic and difficulty. Even a moderate-sized text collection may contain a vocabulary that is too large for many machine learning algorithms to handle. Therefore, it is indeed desirable to reduce the dimensionality without harming the classification accuracy. To perform feature selection, many approaches have been proposed. Ref. [16] conducted a comparative study of five feature selection approaches. In addition, Ref. [17] proposed another feature selection approach based on a two-stage Markov blanket.
Generally, wrapper approaches have superior accuracy compared to filter approaches, but filter approaches always run faster than wrapper approaches. To integrate their advantages, Ref. [18] proposed gain-ratio-based feature selection (GRS). GRS takes advantage of base classifiers to evaluate the selected feature subsets like wrappers, but it does not need to repeatedly search feature subsets and train base classifiers. When GRS is applied to MNB, CNB, and OVA, the resulting models are simply denoted by GRSMNB, GRSCNB, and GRSOVA, respectively.

2.3. Instance Weighting

The instance weighting approach assigns different weights $W_j$ $(j = 1, 2, \ldots, n)$ to different instances (i.e., documents) when building MNB, CNB, and OVA. To learn $W_j$ $(j = 1, 2, \ldots, n)$, the simplest way may be boosting [19]. More specifically, the weights of the training instances misclassified by the base classifiers trained in the last iteration are increased, and then the base classifiers are trained on the re-weighted instances in the next iteration. This iteration process is stopped after a predefined number of rounds.
Different from boosting [19], Ref. [20] proposed a discriminative instance weighting approach, simply denoted by DW. In each iteration of DW, each different training instance is discriminatively assigned a different weight according to the computed conditional probability loss. This iteration process is repeated for predefined rounds. When DW is applied to MNB, CNB and OVA, the resulting models are simply denoted by DWMNB, DWCNB and DWOVA, respectively.

2.4. Instance Selection

The instance selection approach builds MNB, CNB, and OVA on the selected training instances rather than on all of the training instances. For conducting instance selection, the k-nearest neighbor algorithm (KNN) is the most widely accepted. KNN selects the training instances that fall into the neighborhood of a test instance, which helps to alleviate the assumption of features' conditional independence required by MNB, CNB, and OVA. Therefore, combining KNN with MNB, CNB, and OVA is quite natural. When an instance needs to be classified, a local MNB, CNB, or OVA is built on the k-nearest neighbors of the test instance and is then used to classify it. Based on this idea, Ref. [21] proposed locally weighted MNB, CNB, and OVA; the resulting models are simply denoted by LWWMNB, LWWCNB, and LWWOVA, respectively. A minimal sketch of this lazy, locally weighted scheme is given at the end of this subsection.
Instead of the k-nearest neighbor algorithm, Ref. [22] applied the decision tree learning algorithm to find test instances' nearest neighbors and then deployed MNB, CNB, or OVA on each leaf node of the built decision trees. The resulting models are simply denoted by MNBTree, CNBTree, and OVATree, respectively. MNBTree, CNBTree, and OVATree build binary trees, in which the split features' values are viewed as zero and nonzero. In addition, to reduce the time consumption, the information gain measure is used to build the decision trees. Differently from LWWMNB, LWWCNB, and LWWOVA, which are lazy learning models, MNBTree, CNBTree, and OVATree are all eager learning models.
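For concreteness, here is a hedged, hypothetical sketch of the lazy locally weighted scheme described above: for each test document, the k nearest training documents (here by cosine similarity on term counts) are retrieved and a local MNB is fitted on them only. It reuses the train_mnb and classify_mnb helpers sketched in the introduction and illustrates the general idea rather than the exact method of [21].

```python
import numpy as np

def classify_locally_weighted_mnb(doc, X_train, y_train, n_classes, k=50):
    """Lazy local MNB: fit MNB only on the k training documents nearest to `doc`."""
    # Cosine similarity between the test document and every training document.
    norms = np.linalg.norm(X_train, axis=1) * (np.linalg.norm(doc) + 1e-12) + 1e-12
    sims = (X_train @ doc) / norms
    neighbor_idx = np.argsort(-sims)[:k]          # indices of the k nearest neighbors
    prior, cond = train_mnb(X_train[neighbor_idx], y_train[neighbor_idx], n_classes)
    return classify_mnb(doc, prior, cond)
```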

2.5. Structure Extension

The structure extension approach uses directed arcs to explicitly represent the dependencies among features. That is to say, we need to find an optimal parent set $\Pi_{w_i}$ for each feature $w_i$. However, learning an optimal parent set $\Pi_{w_i}$ for each $w_i$ is an NP-hard problem [23]. In addition, when the training data are limited, the variance of a complex Bayesian network is high [24] and, therefore, its probability estimates are poor. Thus, a multinomial Bayesian network that can represent feature dependencies without structure learning is desirable.
Inspired by the weighted average of one-dependence estimators (WAODE) [25], Ref. [12] proposed structure-extended multinomial naive Bayes (SEMNB). SEMNB builds a one-dependence multinomial estimator for each present word, i.e., this word is all of the other present words’ parent. Then, SEMNB averages all weighted super-parent one-dependence multinomial estimators, and therefore, it is an ensemble learning model. If we apply the structure extension approach to CNB and OVA, we can easily obtain their structure-extended versions. For the sake of convenience, we denote them as SECNB and SEOVA, respectively.

3. The Proposed Models

Structure extension is not new to the Bayesian learning community, and especially not to the semi-naive Bayesian learning community [26,27]. Researchers have proposed many state-of-the-art structure-extended naive Bayes models, such as tree-augmented naive Bayes (TAN) [24] and its variants [28,29]. However, when the structure extension approach is applied to high-dimensional text classification data, a key issue that must be addressed is its high-computational-complexity structure-learning process. This is the reason why structure extension has attracted less attention from researchers. To the best of our knowledge, only structure-extended multinomial naive Bayes (SEMNB) [12] has been proposed so far. SEMNB averages all weighted super-parent one-dependence multinomial estimators and, thus, skillfully avoids a high-computational-complexity structure-learning process. Extensive experiments on a large number of text classification datasets validate its effectiveness. However, SEMNB is, by construction, an ensemble learning model. Therefore, a simple but effective single model that does not incur a high-computational-complexity structure-learning process is still desirable. This is the main motivation of our paper.
To maintain NB's simplicity and efficiency while alleviating its assumption of attributes' conditional independence, hidden naive Bayes (HNB) [30] has achieved remarkable classification performance. Inspired by the success of HNB, in this paper, we adapt it to text classification tasks. We call our adapted model hidden multinomial naive Bayes (HMNB). In HMNB, a hidden parent $w_{hp_i}$ is created for each present word $w_i$, which combines the influences from all of the other present qualified words $w_t$ $(t = 1, 2, \ldots, m;\ t \neq i)$. Now, given a test document d, HMNB classifies it by using Equation (10).
$c(d) = \arg\max_{c \in C} \left( P(c) \prod_{i=1,\, f_i > 0}^{m} P(w_i \mid w_{hp_i}, c)^{f_i} \right)$ (10)

where $P(w_i \mid w_{hp_i}, c)$ is computed by:

$P(w_i \mid w_{hp_i}, c) = \dfrac{\sum_{t=1,\, t \neq i,\, f_t > 0,\, W_t \geq aveGR}^{m} W_t\, P(w_i \mid w_t, c)}{\sum_{t=1,\, t \neq i,\, f_t > 0,\, W_t \geq aveGR}^{m} W_t}$ (11)
where $W_t$ $(t = 1, 2, \ldots, m;\ t \neq i)$ indicates the importance of each possible parent word $w_t$ in the hidden parent $w_{hp_i}$. For simplicity, we define it as the gain ratio $GainRatio(w_t)$ of the word $w_t$ with respect to splitting the training data D. At the same time, we only select the words $w_t$ whose gain ratios are above the average $aveGR$ of all words as potential parents. The detailed calculation formulas are:
$W_t = GainRatio(w_t) = \dfrac{\sum_{f_t \in \{0, \bar{0}\}} \sum_{c} P(f_t, c) \log \dfrac{P(f_t, c)}{P(f_t)\, P(c)}}{-\sum_{f_t \in \{0, \bar{0}\}} P(f_t) \log P(f_t)}$ (12)

$aveGR = \dfrac{1}{m} \sum_{t=1}^{m} GainRatio(w_t)$ (13)

where $f_t \in \{0, \bar{0}\}$: $f_t = 0$ indicates the absence of $w_t$, and $f_t = \bar{0}$ indicates the presence of $w_t$.
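The following sketch (our own illustration with hypothetical names) computes the presence/absence-based weights $W_t$ of Equation (12) and the threshold $aveGR$ of Equation (13) from a term-count matrix, assuming the standard gain-ratio form with the mutual information in the numerator and the split information in the denominator.

```python
import numpy as np

def gain_ratio_weights(X, y):
    """X: (n, m) term-count matrix; y: class labels.
    Returns W (length-m gain ratios) and their average aveGR."""
    n, m = X.shape
    classes, class_counts = np.unique(y, return_counts=True)
    p_c = class_counts / n
    present = (X > 0)                                    # f_t = "present" vs. "absent"
    W = np.zeros(m)
    for t in range(m):
        p_ft = np.array([np.mean(~present[:, t]), np.mean(present[:, t])])  # P(absent), P(present)
        mi = 0.0
        for a, mask in enumerate([~present[:, t], present[:, t]]):
            for k, c in enumerate(classes):
                p_joint = np.mean(mask & (y == c))
                if p_joint > 0:
                    mi += p_joint * np.log2(p_joint / (p_ft[a] * p_c[k]))    # numerator of Eq. (12)
        split_info = -np.sum([p * np.log2(p) for p in p_ft if p > 0])        # denominator of Eq. (12)
        W[t] = mi / split_info if split_info > 0 else 0.0
    aveGR = W.mean()                                     # Equation (13)
    return W, aveGR
```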
Now, the only thing left is the efficient calculation of $P(w_i \mid w_t, c)$, the conditional probability that $w_i$ appears given $w_t$ and c. It is well known that the space complexity of estimating $P(w_i \mid w_t, c)$ directly from D is $O(sm^2)$. For text classification tasks, m (the vocabulary size of the text collection) is often too large to store the tables of joint word-pair and class frequencies from which the conditional probability $P(w_i \mid w_t, c)$ is estimated. At the same time, text data usually take the form of a sparse matrix, and therefore, the number of different words present in a given document d, simply denoted by $|d|$, is much smaller than m. Therefore, as in SEMNB [12], we also transform a part of the training space consumption into classification time consumption. In more detail, we move the step of computing $P(w_i \mid w_t, c)$ from the training stage to the classification stage. At the classification stage, when a test document d is to be predicted, $P(w_i \mid w_t, c)$ is computed from D and d. More specifically, given a word $w_t$ in d, we only select the documents in which $w_t$ occurs to compute $P(w_i \mid w_t, c)$ by using Equation (14), which has a space complexity of only $O(s|d|)$.
$P(w_i \mid w_t, c) = \dfrac{\sum_{j=1,\, f_{jt} > 0}^{n} f_{ji}\,\delta(c_j, c) + 1}{\sum_{i=1}^{m} \sum_{j=1,\, f_{jt} > 0}^{n} f_{ji}\,\delta(c_j, c) + m}$ (14)
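A direct, unoptimized rendering of this lazy estimate in Python might look as follows (our own illustration; the argument names are hypothetical): only the rows of the training matrix in which $w_t$ occurs contribute to the counts.

```python
import numpy as np

def lazy_conditional(X, y, classes, i, t):
    """Estimate P(w_i | w_t, c) for every class c, as in Equation (14),
    using only the training documents that contain word w_t."""
    rows = X[:, t] > 0                                   # documents in which w_t occurs
    X_t, y_t = X[rows], y[rows]
    m = X.shape[1]
    probs = {}
    for c in classes:
        in_c = (y_t == c)
        num = X_t[in_c, i].sum() + 1.0                   # numerator of Equation (14)
        den = X_t[in_c].sum() + m                        # denominator of Equation (14)
        probs[c] = num / den
    return probs
```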
In summary, the whole algorithm for learning HMNB is partitioned into a training algorithm (HMNB-Training) and a classification algorithm (HMNB-Classification). They are described by Algorithms 1 and 2, respectively. Algorithm 1 takes $O(nm + sm)$ time, and Algorithm 2 takes $O(n|d|^2 + s|d|^2 + s|d|)$ time, where n is the number of training documents, m is the number of different words in the text collection, s is the number of classes, and $|d|$ is the number of different words present in a given document d.
Algorithm 1: HMNB-Training (D).
  • Input: D—training data
  • Output: $P(c)$ and $W_t$ $(t = 1, 2, \ldots, m)$
  •  1: for each class c do
  •  2:   Use Equation (4) to compute $P(c)$ from D;
  •  3: end for
  •  4: for each word $w_t$ $(t = 1, 2, \ldots, m)$ in D do
  •  5:   Compute $W_t$ using Equation (12);
  •  6: end for
  •  7: Compute the average gain ratio $aveGR$ of all words using Equation (13);
  •  8: if $GainRatio(w_t) \geq aveGR$ then
  •  9:   $W_t = GainRatio(w_t)$
  • 10: else
  • 11:   $W_t = 0$
  • 12: end if
  • 13: Return $P(c)$ and $W_t$ $(t = 1, 2, \ldots, m)$
Algorithm 2: HMNB-Classification (d, D, $P(c)$, $W_t$).
  • Input: d—a test document, D—training data, and the computed $P(c)$ and $W_t$
  • Output: $c(d)$
  •  1: for each word $w_i$ $(i = 1, 2, \ldots, |d|)$ in d do
  •  2:   for each word $w_t$ $(t = 1, 2, \ldots, |d|;\ t \neq i)$ in d do
  •  3:     Denote all training documents in which $w_t$ occurs as $D_{w_t}$;
  •  4:     for each class c do
  •  5:       Compute $P(w_i \mid w_t, c)$ from $D_{w_t}$ using Equation (14);
  •  6:     end for
  •  7:   end for
  •  8: end for
  •  9: Use $W_t$ and $P(w_i \mid w_t, c)$ to compute $P(w_i \mid w_{hp_i}, c)$ with Equation (11);
  • 10: Use $P(c)$ and $P(w_i \mid w_{hp_i}, c)$ to predict the class label of d with Equation (10);
  • 11: Return the predicted class label $c(d)$
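Putting the pieces together, the following compact Python sketch mirrors Algorithms 1 and 2. It is our own minimal re-implementation under the assumptions stated above (gain-ratio weights, average-gain-ratio filtering, and lazy estimation of $P(w_i \mid w_t, c)$), not the authors' original WEKA code, and it reuses the gain_ratio_weights and lazy_conditional helpers sketched earlier.

```python
import numpy as np

def hmnb_train(X, y, n_classes):
    """Algorithm 1: estimate class priors and gain-ratio weights, zeroing
    the weights that fall below the average gain ratio."""
    n = X.shape[0]
    prior = np.array([((y == c).sum() + 1.0) / (n + n_classes)   # Equation (4)
                      for c in range(n_classes)])
    W, aveGR = gain_ratio_weights(X, y)                          # Equations (12)-(13)
    W = np.where(W >= aveGR, W, 0.0)                             # keep only qualified parents
    return prior, W

def hmnb_classify(doc, X, y, prior, W, n_classes):
    """Algorithm 2: lazily estimate P(w_i | w_t, c) and score with Equation (10)."""
    present = np.flatnonzero(doc)                                # words occurring in d
    log_scores = np.log(prior)
    for i in present:
        for c in range(n_classes):
            num, den = 0.0, 0.0
            for t in present:
                if t == i or W[t] == 0.0:
                    continue
                p = lazy_conditional(X, y, [c], i, t)[c]         # Equation (14)
                num += W[t] * p                                  # numerator of Equation (11)
                den += W[t]
            if den > 0:
                log_scores[c] += doc[i] * np.log(num / den)      # f_i * log P(w_i | w_hp_i, c)
    return int(np.argmax(log_scores))
```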
Our improved idea can also be used to improve CNB and OVA. The resulting models are denoted as HCNB and HOVA, respectively. Given a test document d, HCNB and HOVA use Equations (15) and (16) to classify it.
$c(d) = \arg\max_{c \in C} \left( -P(\bar{c}) \prod_{i=1,\, f_i > 0}^{m} P(w_i \mid w_{hp_i}, \bar{c})^{f_i} \right)$ (15)

$c(d) = \arg\max_{c \in C} \left( P(c) \prod_{i=1,\, f_i > 0}^{m} P(w_i \mid w_{hp_i}, c)^{f_i} - P(\bar{c}) \prod_{i=1,\, f_i > 0}^{m} P(w_i \mid w_{hp_i}, \bar{c})^{f_i} \right)$ (16)

where $P(w_i \mid w_{hp_i}, \bar{c})$ is computed by:

$P(w_i \mid w_{hp_i}, \bar{c}) = \dfrac{\sum_{t=1,\, t \neq i,\, f_t > 0,\, W_t \geq aveGR}^{m} W_t\, P(w_i \mid w_t, \bar{c})}{\sum_{t=1,\, t \neq i,\, f_t > 0,\, W_t \geq aveGR}^{m} W_t}$ (17)

where $P(w_i \mid w_t, \bar{c})$ is computed by:

$P(w_i \mid w_t, \bar{c}) = \dfrac{\sum_{j=1,\, f_{jt} > 0}^{n} f_{ji}\,\delta(c_j, \bar{c}) + 1}{\sum_{i=1}^{m} \sum_{j=1,\, f_{jt} > 0}^{n} f_{ji}\,\delta(c_j, \bar{c}) + m}$ (18)
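The complement-based quantities reuse the machinery above almost unchanged. The sketch below (our illustration, hypothetical names) shows the lazy complement estimate of Equation (18), which differs from Equation (14) only in that the counts come from the training documents whose class label is not c.

```python
import numpy as np

def lazy_complement_conditional(X, y, classes, i, t):
    """Estimate P(w_i | w_t, c-bar) for every class c, as in Equation (18)."""
    rows = X[:, t] > 0                                   # documents in which w_t occurs
    X_t, y_t = X[rows], y[rows]
    m = X.shape[1]
    probs = {}
    for c in classes:
        not_c = (y_t != c)                               # delta(c_j, c-bar)
        num = X_t[not_c, i].sum() + 1.0                  # numerator of Equation (18)
        den = X_t[not_c].sum() + m                       # denominator of Equation (18)
        probs[c] = num / den
    return probs
```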
Similarly to HMNB, the algorithms for learning HCNB and HOVA are also partitioned into training algorithms (HCNB-Training and HOVA-Training) and classification algorithms (HCNB-Classification and HOVA-Classification). They are described by Algorithms 3–6, respectively. From Algorithms 3–6, we can see that the time complexities of HCNB and HOVA are almost the same as that of HMNB.
Algorithm 3: HCNB-Training (D).
  • Input: D—training data
  • Output: $P(\bar{c})$ and $W_t$ $(t = 1, 2, \ldots, m)$
  •  1: for each class c do
  •  2:   Use Equation (5) to compute $P(\bar{c})$ from D;
  •  3: end for
  •  4: for each word $w_t$ $(t = 1, 2, \ldots, m)$ in D do
  •  5:   Compute $W_t$ using Equation (12);
  •  6: end for
  •  7: Compute the average gain ratio $aveGR$ of all words using Equation (13);
  •  8: if $GainRatio(w_t) \geq aveGR$ then
  •  9:   $W_t = GainRatio(w_t)$
  • 10: else
  • 11:   $W_t = 0$
  • 12: end if
  • 13: Return $P(\bar{c})$ and $W_t$ $(t = 1, 2, \ldots, m)$
Algorithm 4: HCNB-Classification (d, D, $P(\bar{c})$, $W_t$).
  • Input: d—a test document, D—training data, and the computed $P(\bar{c})$ and $W_t$
  • Output: $c(d)$
  •  1: for each word $w_i$ $(i = 1, 2, \ldots, |d|)$ in d do
  •  2:   for each word $w_t$ $(t = 1, 2, \ldots, |d|;\ t \neq i)$ in d do
  •  3:     Denote all training documents in which $w_t$ occurs as $D_{w_t}$;
  •  4:     for each class c do
  •  5:       Compute $P(w_i \mid w_t, \bar{c})$ from $D_{w_t}$ using Equation (18);
  •  6:     end for
  •  7:   end for
  •  8: end for
  •  9: Use $W_t$ and $P(w_i \mid w_t, \bar{c})$ to compute $P(w_i \mid w_{hp_i}, \bar{c})$ with Equation (17);
  • 10: Use $P(\bar{c})$ and $P(w_i \mid w_{hp_i}, \bar{c})$ to predict the class label of d with Equation (15);
  • 11: Return the predicted class label $c(d)$
Algorithm 5: HOVA-Training (D).
  • Input: D—training data
  • Output: $P(c)$, $P(\bar{c})$, and $W_t$ $(t = 1, 2, \ldots, m)$
  •  1: for each class c do
  •  2:   Use Equation (4) to compute $P(c)$ from D;
  •  3:   Use Equation (5) to compute $P(\bar{c})$ from D;
  •  4: end for
  •  5: for each word $w_t$ $(t = 1, 2, \ldots, m)$ in D do
  •  6:   Compute $W_t$ using Equation (12);
  •  7: end for
  •  8: Compute the average gain ratio $aveGR$ of all words using Equation (13);
  •  9: if $GainRatio(w_t) \geq aveGR$ then
  • 10:   $W_t = GainRatio(w_t)$
  • 11: else
  • 12:   $W_t = 0$
  • 13: end if
  • 14: Return $P(c)$, $P(\bar{c})$, and $W_t$ $(t = 1, 2, \ldots, m)$
Algorithm 6: HOVA-Classification (d, D, $P(c)$, $P(\bar{c})$, $W_t$).
  • Input: d—a test document, D—training data, and the computed $P(c)$, $P(\bar{c})$, and $W_t$
  • Output: $c(d)$
  •  1: for each word $w_i$ $(i = 1, 2, \ldots, |d|)$ in d do
  •  2:   for each word $w_t$ $(t = 1, 2, \ldots, |d|;\ t \neq i)$ in d do
  •  3:     Denote all training documents in which $w_t$ occurs as $D_{w_t}$;
  •  4:     for each class c do
  •  5:       Compute $P(w_i \mid w_t, c)$ from $D_{w_t}$ using Equation (14);
  •  6:       Compute $P(w_i \mid w_t, \bar{c})$ from $D_{w_t}$ using Equation (18);
  •  7:     end for
  •  8:   end for
  •  9: end for
  • 10: Use $W_t$ and $P(w_i \mid w_t, c)$ to compute $P(w_i \mid w_{hp_i}, c)$ with Equation (11);
  • 11: Use $W_t$ and $P(w_i \mid w_t, \bar{c})$ to compute $P(w_i \mid w_{hp_i}, \bar{c})$ with Equation (17);
  • 12: Use $P(c)$, $P(\bar{c})$, $P(w_i \mid w_{hp_i}, c)$, and $P(w_i \mid w_{hp_i}, \bar{c})$ to predict the class label of d with Equation (16);
  • 13: Return the predicted class label $c(d)$

4. Experiments and Results

To validate the effectiveness of the proposed HMNB, HCNB, and HOVA, we designed and conducted three groups of experiments. The first group compared HMNB with MNB, $R_{w,c}$MNB, GRSMNB, DWMNB, MNBTree, and SEMNB. The second group compared HCNB with CNB, $R_{w,c}$CNB, GRSCNB, DWCNB, CNBTree, and SECNB. The third group compared HOVA with OVA, $R_{w,c}$OVA, GRSOVA, DWOVA, OVATree, and SEOVA. We used the existing implementations of MNB and CNB in the Waikato Environment for Knowledge Analysis (WEKA) platform [31] and implemented all of the other models on top of WEKA [31].
We conducted our three groups of experiments on eleven well-known text classification datasets published on the homepage of the WEKA platform [31], which cover a wide range of text classification characteristics. Table 1 lists the details of these eleven datasets. All eleven datasets were derived from OHSUMED-233445, Reuters-21578, TREC, and the WebACE project, and Ref. [32] originally converted them into term counts.
Table 2, Table 3 and Table 4 report the classification accuracy of each model on each dataset, averaged over ten runs of 10-fold cross-validation. We then used two-tailed t-tests at the 95% confidence level [33] to compare the proposed HMNB, HCNB, and HOVA with each of their competitors. In these tables, the symbols • and ∘ denote statistically significant improvement or degradation, respectively, with respect to the corresponding competitor. The average classification accuracies and the Win/Tie/Lose (W/T/L) values are summarized at the bottom of the tables. The average classification accuracy of each model across all datasets provides a gross indicator of its relative classification performance in addition to the other statistics. Each W/T/L value in these tables indicates that, compared to the corresponding competitor, HMNB, HCNB, or HOVA wins on W datasets, ties on T datasets, and loses on L datasets.
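For readers who wish to reproduce this kind of evaluation protocol, the snippet below sketches ten runs of 10-fold cross-validation with scikit-learn; it is a generic illustration using sklearn's MultinomialNB as a stand-in classifier, not the authors' WEKA-based experimental code.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

def mean_accuracy(X, y, model=None, runs=10, folds=10, seed=0):
    """Average accuracy over `runs` repetitions of `folds`-fold stratified CV."""
    model = model or MultinomialNB()
    cv = RepeatedStratifiedKFold(n_splits=folds, n_repeats=runs, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()
```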
Based on the accuracy comparisons presented in Table 2, Table 3 and Table 4, we then used the KEEL software [34] to perform the Wilcoxon signed-rank test [35,36] in order to thoroughly compare each pair of models. The Wilcoxon signed-rank test ranks the differences in the performance of two classification models on each dataset, ignoring the signs, and compares the ranks of the positive ($R^+$) and the negative ($R^-$) differences [35,36]. According to the table of exact critical values for the Wilcoxon test, for a significance level of $\alpha = 0.05$ and N = 11 datasets, two classification models are considered "significantly different" if the smaller of $R^+$ and $R^-$ is equal to or less than 11, and thus, we reject the null hypothesis. Table 5, Table 6 and Table 7 summarize the related comparison results. In these tables, ∘ denotes that the model in the column improves on the model in the corresponding row, and • denotes that the model in the row improves on the model in the corresponding column. In the lower diagonal, the significance level is $\alpha = 0.05$; in the upper diagonal, it is $\alpha = 0.1$. From all of the above comparison results, we can draw the following highlights:
  • The average accuracy of HMNB on the eleven datasets is 85.60%, which is notably higher than those of MNB (83.18%), $R_{w,c}$MNB (82.39%), GRSMNB (84.23%), DWMNB (83.72%), MNBTree (82.59%), and SEMNB (84.16%). HMNB substantially outperforms MNB (eight wins and zero losses), $R_{w,c}$MNB (eleven wins and zero losses), GRSMNB (six wins and one loss), DWMNB (five wins and one loss), MNBTree (eight wins and zero losses), and SEMNB (six wins and one loss).
  • The average accuracy of HCNB on the eleven datasets is 86.17%, which is notably higher than those of CNB (83.80%), $R_{w,c}$CNB (84.29%), GRSCNB (83.36%), DWCNB (85.39%), CNBTree (83.81%), and SECNB (84.28%). HCNB substantially outperforms CNB (eight wins and zero losses), $R_{w,c}$CNB (eight wins and zero losses), GRSCNB (eight wins and zero losses), DWCNB (three wins and one loss), CNBTree (six wins and zero losses), and SECNB (six wins and zero losses).
  • The average accuracy of HOVA on the eleven datasets is 85.44%, which is notably higher than those of OVA (84.51%), $R_{w,c}$OVA (83.56%), GRSOVA (84.70%), DWOVA (84.97%), OVATree (83.54%), and SEOVA (84.12%). HOVA substantially outperforms OVA (four wins and one loss), $R_{w,c}$OVA (six wins and zero losses), GRSOVA (four wins and zero losses), DWOVA (three wins and one loss), OVATree (five wins and zero losses), and SEOVA (seven wins and zero losses).
  • In addition, according to the results of the Wilcoxon test, HMNB significantly outperforms MNB, $R_{w,c}$MNB, GRSMNB, DWMNB, MNBTree, and SEMNB; HCNB significantly outperforms CNB, $R_{w,c}$CNB, GRSCNB, CNBTree, and SECNB; and HOVA significantly outperforms OVA, $R_{w,c}$OVA, OVATree, and SEOVA. All of these comparison results validate the effectiveness of the proposed HMNB, HCNB, and HOVA.
Finally, we conducted the Wilcoxon signed-rank test [35,36] to compare each pair of HMNB, HCNB, and HOVA. The detailed comparison results are shown in Table 8. From these, we can see that HMNB almost tied with HCNB and HOVA, and HCNB was notably better than HOVA. Considering the simplicity of the models, HMNB and HCNB could be appropriate choices.
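The same Wilcoxon signed-rank comparison can also be reproduced outside KEEL. As a hedged illustration, the snippet below applies scipy.stats.wilcoxon to the HMNB and MNB accuracy columns of Table 2; the choice of SciPy, and any small differences from KEEL's exact critical-value procedure, are ours.

```python
from scipy.stats import wilcoxon

# Per-dataset accuracies of HMNB and MNB taken from Table 2 (eleven datasets).
hmnb = [81.42, 89.20, 90.73, 91.70, 83.87, 86.51, 90.00, 79.88, 83.29, 84.60, 80.40]
mnb  = [77.11, 88.41, 89.88, 89.55, 80.60, 83.60, 86.63, 74.70, 80.02, 83.31, 81.22]

# Two-sided Wilcoxon signed-rank test on the paired differences.
stat, p_value = wilcoxon(hmnb, mnb)
print(f"W = {stat}, p = {p_value:.4f}")
```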

5. Conclusions and Future Study

To alleviate MNB's assumption of features' conditional independence, this paper proposed a single model called hidden MNB (HMNB) by adapting the well-known hidden NB (HNB). HMNB creates a hidden parent for each feature that synthesizes all of the other qualified features' influences. To learn HMNB, we proposed a simple but effective learning algorithm that does not incur a high-computational-complexity structure-learning process. Our improved idea was also used to improve CNB and OVA, and the resulting models are simply denoted as HCNB and HOVA, respectively. The extensive experiments show that the proposed HMNB, HCNB, and HOVA significantly outperform their state-of-the-art competitors.
In the proposed HMNB, HCNB, and HOVA, how the weight (importance) of each possible parent word is defined is crucial. Currently, we directly use the gain ratio of each possible parent word with respect to splitting the training data to define the weight, which is somewhat rough. We believe that using more sophisticated methods, such as the expectation-maximization (EM) algorithm, could further improve their classification performance and strengthen their superiority. This is a main topic for future study. In addition, to reduce the training space complexity, we transform a part of the training space consumption into classification time consumption, which leads to a relatively high classification time complexity. Therefore, improving the efficiency of the proposed models is another interesting topic for future study.

Author Contributions

Conceptualization, S.G. and L.J.; methodology, S.G., S.S. and L.J.; software, L.C. and L.Y.; validation, L.C., L.Y. and S.G.; formal analysis, S.G. and L.J.; investigation, S.G. and L.J.; resources, S.G. and L.J.; data curation, S.S.; writing—original draft preparation, S.G. and L.J.; writing—review and editing, S.G. and L.J.; visualization, S.G.; supervision, L.J.; project administration, S.G. and L.Y.; funding acquisition, S.G. and L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by KLIGIP-2018A05, 2019AEE020, X201900, Q20203003, and 20RC07.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in our paper:
NB: Naive Bayes
BNB: Bernoulli NB
MNB: Multinomial NB
CNB: Complement NB
OVA: One-versus-all-but-one model
HMNB: Hidden MNB
HCNB: Hidden CNB
HOVA: Hidden OVA
SMS: Short message service
$R_{w,c}$: $\chi^2$ statistic-based feature weighting
$R_{w,c}$MNB: MNB with $R_{w,c}$
$R_{w,c}$CNB: CNB with $R_{w,c}$
$R_{w,c}$OVA: OVA with $R_{w,c}$
DFW: Deep feature weighting
DFWMNB: MNB with DFW
DFWCNB: CNB with DFW
DFWOVA: OVA with DFW
GRW: Gain-ratio-based feature weighting
GRWMNB: MNB with GRW
GRWCNB: CNB with GRW
GRWOVA: OVA with GRW
DTW: Decision-tree-based feature weighting
DTWMNB: MNB with DTW
DTWCNB: CNB with DTW
DTWOVA: OVA with DTW
GRS: Gain-ratio-based hybrid feature selection
GRSMNB: MNB with GRS
GRSCNB: CNB with GRS
GRSOVA: OVA with GRS
DW: Discriminative instance weighting
DWMNB: MNB with DW
DWCNB: CNB with DW
DWOVA: OVA with DW
LWWMNB: Locally weighted MNB
LWWCNB: Locally weighted CNB
LWWOVA: Locally weighted OVA
MNBTree: MNB tree
CNBTree: CNB tree
OVATree: OVA tree
SEMNB: Structure-extended MNB
SECNB: Structure-extended CNB
SEOVA: Structure-extended OVA
TAN: Tree-augmented NB
WAODE: Weighted average of one-dependence estimators
HNB: Hidden NB
WEKA: Waikato environment for knowledge analysis
KEEL: Knowledge extraction based on evolutionary learning

References

  1. Chen, L.; Jiang, L.; Li, C. Modified DFS-based term weighting scheme for text classification. Expert Syst. Appl. 2021, 168, 114438. [Google Scholar] [CrossRef]
  2. Chen, L.; Jiang, L.; Li, C. Using modified term frequency to improve term weighting for text classification. Eng. Appl. Artif. Intell. 2021, 101, 104215. [Google Scholar] [CrossRef]
  3. Rusek, J. The Point Nuisance Method as a Decision-Support System Based on Bayesian Inference Approach. Arch. Min. Sci. 2020, 65, 117–127. [Google Scholar]
  4. Ali, S.A.; Parvin, F.; Pham, Q.B.; Vojtek, M.; Vojtekova, J.; Costache, R.; Linh, N.T.T.; Nguyen, H.Q.; Ahmad, A.; Ghorbani, M.A. GIS-based comparative assessment of flood susceptibility mapping using hybrid multi-criteria decision-making approach, naive Bayes tree, bivariate statistics and logistic regression: A case of Topla basin, Slovakia. Ecol. Indic. 2020, 117, 106620. [Google Scholar] [CrossRef]
  5. Wang, H.; Wang, H.; Wu, Z.; Zhou, Y. Using Multi-Factor Analysis to Predict Urban Flood Depth Based on Naive Bayes. Water 2021, 13, 432. [Google Scholar] [CrossRef]
  6. Deng, X.; Wang, Y.; Zhu, T.; Zhang, W.; Yin, Y.; Ye, L. Short Message Service (SMS) can Enhance Compliance and Reduce Cancellations in a Sedation Gastrointestinal Endoscopy Center: A Prospective Randomized Controlled Trial. J. Med. Syst. 2015, 39, 169. [Google Scholar] [CrossRef]
  7. Ponte, J.M.; Croft, W.B. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 24–28 August 1998; pp. 275–281. [Google Scholar]
  8. McCallum, A.; Nigam, K. A comparison of event models for naive Bayes text classification. In Working Notes of the 1998 AAAI/ICML Workshop on Learning for Text Categorization; AAAI Press: Palo Alto, CA, USA, 1998; pp. 41–48. [Google Scholar]
  9. Rennie, J.D.; Shih, L.; Teevan, J.; Karger, D.R. Tackling the poor assumptions of Naive Bayes Text Classifiers. In Proceedings of the Twentieth International Conference on Machine Learning, Washington, DC, USA, 21–24 August 2003; pp. 616–623. [Google Scholar]
  10. Zhang, H.; Jiang, L.; Yu, L. Class-specific attribute value weighting for Naive Bayes. Inf. Sci. 2020, 508, 260–274. [Google Scholar] [CrossRef]
  11. Zhang, H.; Jiang, L.; Yu, L. Attribute and instance weighted naive Bayes. Pattern Recognit. 2021, 111, 107674. [Google Scholar] [CrossRef]
  12. Jiang, L.; Wang, S.; Li, C.; Zhang, L. Structure extended multinomial naive Bayes. Inf. Sci. 2016, 329, 346–356. [Google Scholar] [CrossRef]
  13. Li, Y.; Luo, C.; Chung, S.M. Weighted naive Bayes for Text Classification Using positive Term-Class Dependency. Int. J. Artif. Intell. Tools 2012, 21, 1250008. [Google Scholar] [CrossRef]
  14. Jiang, L.; Li, C.; Wang, S.; Zhang, L. Deep feature weighting for naive Bayes and its application to text classification. Eng. Appl. Artif. Intell. 2016, 52, 26–39. [Google Scholar] [CrossRef]
  15. Zhang, L.; Jiang, L.; Li, C.; Kong, G. Two feature weighting approaches for naive Bayes text classifiers. Knowl. Based Syst. 2016, 100, 137–144. [Google Scholar] [CrossRef]
  16. Yang, Y.; Pedersen, J.O. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning, San Francisco, CA, USA, 8–12 July 1997; pp. 412–420. [Google Scholar]
  17. Javed, K.; Maruf, S.; Babri, H.A. A two-stage Markov blanket based feature selection algorithm for text classification. Neurocomputing 2015, 157, 91–104. [Google Scholar] [CrossRef]
  18. Zhang, L.; Jiang, L.; Li, C. A New Feature Selection Approach to Naive Bayes Text Classifiers. Int. J. Pattern Recognit. Artif. Intell. 2016, 30, 1650003:1–1650003:17. [Google Scholar] [CrossRef]
  19. Freund, Y.; Schapire, R.E. Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning, Bari, Italy, 3–6 July 1996; pp. 148–156. [Google Scholar]
  20. Jiang, L.; Wang, D.; Cai, Z. Discriminatively Weighted Naive Bayes and its Application in Text Classification. Int. J. Artif. Intell. Tools 2012, 21, 1250007. [Google Scholar] [CrossRef]
  21. Jiang, L.; Cai, Z.; Zhang, H.; Wang, D. Naive Bayes text classifiers: A locally weighted learning approach. J. Exp. Theor. Artif. Intell. 2013, 25, 273–286. [Google Scholar] [CrossRef]
  22. Wang, S.; Jiang, L.; Li, C. Adapting naive Bayes tree for text classification. Knowl. Inf. Syst. 2015, 44, 77–89. [Google Scholar] [CrossRef]
  23. Chickering, D.M. Learning Bayesian networks is NP-complete. In Learning from Data; Springer: New York, NY, USA, 1996; pp. 121–130. [Google Scholar]
  24. Friedman, N.; Geiger, D.; Goldszmidt, M. Bayesian network classifiers. Mach. Learn. 1997, 29, 131–163. [Google Scholar] [CrossRef]
  25. Jiang, L.; Zhang, H.; Cai, Z.; Wang, D. Weighted Average of One-Dependence Estimators. J. Exp. Theor. Artif. Intell. 2012, 24, 219–230. [Google Scholar] [CrossRef]
  26. Sahami, M. Learning limited dependence Bayesian classifiers. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 335–338. [Google Scholar]
  27. Abellán, J.; Cano, A.; Masegosa, A.R.; Moral, S. A memory efficient semi-Naive Bayes classifier with grouping of cases. Intell. Data Anal. 2011, 15, 299–318. [Google Scholar] [CrossRef]
  28. Keogh, E.; Pazzani, M. Learning augmented Bayesian classifiers: A comparison of distribution-based and classification-based approaches. In Proceedings of the Seventh International Workshop on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 3–6 January 1999; pp. 225–230. [Google Scholar]
  29. Qiu, C.; Jiang, L.; Li, C. Not always simple classification: Learning SuperParent for Class Probability Estimation. Expert Syst. Appl. 2015, 42, 5433–5440. [Google Scholar] [CrossRef]
  30. Jiang, L.; Zhang, H.; Cai, Z. A Novel Bayes Model: Hidden Naive Bayes. IEEE Trans. Knowl. Data Eng. 2009, 21, 1361–1371. [Google Scholar] [CrossRef]
  31. Witten, I.H.; Frank, E.; Hall, M.A. Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed.; Morgan Kaufmann: Burlington, MA, USA, 2011. [Google Scholar]
  32. Han, E.; Karypis, G. Centroid-Based Document Classification: Analysis and Experimental Results. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD 2000, Lyon, France, 13–16 September 2000; pp. 424–431. [Google Scholar]
  33. Nadeau, C.; Bengio, Y. Inference for the generalization error. Mach. Learn. 2003, 52, 239–281. [Google Scholar] [CrossRef]
  34. Alcalá-Fdez, J.; Fernandez, A.; Luengo, J.; Derrac, J.; García, S.; Sánchez, L.; Herrera, F. KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework. J. Mult.-Valued Log. Soft Comput. 2011, 17, 255–287. [Google Scholar]
  35. Garcia, S.; Herrera, F. An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons. J. Mach. Learn. Res. 2008, 9, 2677–2694. [Google Scholar]
  36. Wilcoxon, F. Individual comparisons by ranking methods. Biom. Bull. 1945, 1, 80–83. [Google Scholar] [CrossRef]
Table 1. Text classification datasets in our experiments.

Dataset | #Documents | #Words | #Classes | #Min Class | #Max Class | #Avg Class
fbis | 2463 | 2000 | 17 | 38 | 506 | 144.9
la1s | 3204 | 31,472 | 6 | 273 | 943 | 534.0
la2s | 3075 | 31,472 | 6 | 248 | 905 | 512.5
oh0 | 1003 | 3182 | 10 | 51 | 194 | 100.3
oh10 | 1050 | 3238 | 10 | 52 | 165 | 105.0
oh15 | 913 | 3100 | 10 | 53 | 157 | 91.3
oh5 | 918 | 3012 | 10 | 59 | 149 | 91.8
ohscal | 11,162 | 11,465 | 10 | 709 | 1621 | 1116.2
re0 | 1504 | 2886 | 13 | 11 | 608 | 115.7
re1 | 1657 | 3758 | 25 | 10 | 371 | 66.3
wap | 1560 | 8460 | 20 | 5 | 341 | 78.0
Table 2. Comparisons of the classification accuracy for HMNB versus MNB, $R_{w,c}$MNB, GRSMNB, DWMNB, MNBTree, and SEMNB.

Dataset | HMNB | MNB | $R_{w,c}$MNB | GRSMNB | DWMNB | MNBTree | SEMNB
fbis | 81.42 | 77.11 • | 79.87 • | 79.61 • | 80.39 | 79.06 • | 83.27 ∘
la1s | 89.20 | 88.41 | 87.88 • | 88.40 • | 88.85 | 87.22 • | 89.15
la2s | 90.73 | 89.88 • | 88.72 • | 89.33 • | 90.14 | 87.34 • | 91.01
oh0 | 91.70 | 89.55 • | 89.05 • | 90.18 | 89.64 • | 88.93 • | 88.87 •
oh10 | 83.87 | 80.60 • | 80.41 • | 81.10 • | 80.64 • | 83.25 | 80.66 •
oh15 | 86.51 | 83.60 • | 83.61 • | 84.38 | 83.29 • | 79.01 • | 83.36 •
oh5 | 90.00 | 86.63 • | 86.46 • | 89.72 | 86.87 • | 88.74 | 87.55 •
ohscal | 79.88 | 74.70 • | 74.18 • | 76.84 • | 74.30 • | 78.00 • | 76.40 •
re0 | 83.29 | 80.02 • | 77.07 • | 80.56 • | 81.81 | 77.30 • | 82.73
re1 | 84.60 | 83.31 | 82.72 • | 86.12 ∘ | 83.13 | 84.26 | 82.22 •
wap | 80.40 | 81.22 | 76.33 • | 80.34 | 81.83 ∘ | 75.42 • | 80.53
Average | 85.60 | 83.18 | 82.39 | 84.23 | 83.72 | 82.59 | 84.16
W/T/L | - | 8/3/0 | 11/0/0 | 6/4/1 | 5/5/1 | 8/3/0 | 6/4/1
Table 3. Comparisons of the classification accuracy for HCNB vs. CNB, $R_{w,c}$CNB, GRSCNB, DWCNB, CNBTree, and SECNB.

Dataset | HCNB | CNB | $R_{w,c}$CNB | GRSCNB | DWCNB | CNBTree | SECNB
fbis | 82.24 | 76.78 • | 78.27 • | 76.91 • | 83.74 ∘ | 79.32 • | 81.42
la1s | 88.12 | 86.30 • | 87.33 • | 85.99 • | 88.48 | 87.21 | 87.82
la2s | 89.86 | 88.26 • | 88.94 • | 87.69 • | 89.61 | 88.08 • | 89.47
oh0 | 92.73 | 92.31 | 92.49 | 91.41 | 92.36 | 90.76 | 89.82 •
oh10 | 84.88 | 81.76 • | 82.20 • | 80.13 • | 82.36 • | 85.16 | 81.24 •
oh15 | 88.19 | 84.38 • | 85.32 • | 85.36 • | 84.27 • | 81.74 • | 83.81 •
oh5 | 91.34 | 90.58 | 90.96 | 89.96 | 90.51 | 89.99 | 88.18 •
ohscal | 79.85 | 76.50 • | 76.69 • | 75.34 • | 76.39 • | 76.94 • | 76.61 •
re0 | 84.71 | 82.37 • | 80.74 • | 81.48 • | 85.35 | 79.62 • | 83.79
re1 | 86.18 | 84.99 | 86.16 | 86.38 | 86.88 | 86.43 | 84.76 •
wap | 79.74 | 77.53 • | 78.10 • | 76.31 • | 79.32 | 76.69 • | 80.13
Average | 86.17 | 83.80 | 84.29 | 83.36 | 85.39 | 83.81 | 84.28
W/T/L | - | 8/3/0 | 8/3/0 | 8/3/0 | 3/7/1 | 6/5/0 | 6/5/0
Table 4. Comparisons of the classification accuracy for HOVA versus OVA, $R_{w,c}$OVA, GRSOVA, DWOVA, OVATree, and SEOVA.

Dataset | HOVA | OVA | $R_{w,c}$OVA | GRSOVA | DWOVA | OVATree | SEOVA
fbis | 82.21 | 80.94 • | 80.80 • | 80.95 • | 82.68 | 81.72 | 80.80 •
la1s | 88.91 | 88.52 | 88.11 | 88.36 | 88.83 | 87.69 • | 86.94 •
la2s | 90.22 | 90.23 | 89.32 | 89.90 | 90.36 | 87.94 • | 88.56 •
oh0 | 91.70 | 91.49 | 90.12 | 91.09 | 91.53 | 90.05 | 91.51
oh10 | 84.25 | 81.86 • | 81.51 • | 81.39 • | 81.94 • | 84.20 | 84.04
oh15 | 86.86 | 84.39 • | 84.50 • | 85.51 | 84.07 • | 80.35 • | 85.95
oh5 | 89.65 | 89.44 | 88.31 | 90.16 | 89.75 | 89.46 | 90.03
ohscal | 78.29 | 75.81 • | 75.15 • | 76.91 • | 75.45 • | 78.27 | 77.02 •
re0 | 83.08 | 81.54 | 78.81 • | 81.18 • | 83.41 | 78.11 • | 81.35 •
re1 | 85.70 | 84.77 | 85.37 | 86.51 | 84.97 | 85.21 | 84.46 •
wap | 78.93 | 80.65 ∘ | 77.21 • | 79.72 | 81.64 ∘ | 75.90 • | 74.71 •
Average | 85.44 | 84.51 | 83.56 | 84.70 | 84.97 | 83.54 | 84.12
W/T/L | - | 4/6/1 | 6/5/0 | 4/7/0 | 3/7/1 | 5/6/0 | 7/4/0
Table 5. Results of the Wilcoxon test with regard to HMNB.
[Pairwise Wilcoxon signed-rank comparison (•/∘) matrix among HMNB, MNB, $R_{w,c}$MNB, GRSMNB, DWMNB, MNBTree, and SEMNB.]
Table 6. Results of the Wilcoxon test with regard to HCNB.
[Pairwise Wilcoxon signed-rank comparison (•/∘) matrix among HCNB, CNB, $R_{w,c}$CNB, GRSCNB, DWCNB, CNBTree, and SECNB.]
Table 7. Results of the Wilcoxon test with regard to HOVA.
[Pairwise Wilcoxon signed-rank comparison (•/∘) matrix among HOVA, OVA, $R_{w,c}$OVA, GRSOVA, DWOVA, OVATree, and SEOVA.]
Table 8. Results of the Wilcoxon test for HMNB, HCNB, and HOVA.
[Pairwise Wilcoxon signed-rank comparison (•/∘) matrix among HMNB, HCNB, and HOVA.]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
