Entropy 2019, 21(7), 665; https://doi.org/10.3390/e21070665

Article
Structure Learning of Bayesian Network Based on Adaptive Thresholding
1 College of Computer Science and Technology, Jilin University, Changchun 130012, China
2 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
* Author to whom correspondence should be addressed.
Received: 1 June 2019 / Accepted: 5 July 2019 / Published: 8 July 2019

Abstract

Direct dependencies and conditional dependencies are the two basic kinds of dependencies in restricted Bayesian network classifiers (BNCs). Traditional approaches, such as filter and wrapper, have proved beneficial for identifying non-significant dependencies one by one, but their high computational overhead makes them inefficient, especially for BNCs with high structural complexity. Studying the distributions of information-theoretic measures provides a feasible way to identify non-significant dependencies in batches, which may help increase structure reliability and avoid overfitting. In this paper, we investigate two extensions to the k-dependence Bayesian classifier: feature selection based on mutual information (MI) and dependence selection based on conditional mutual information (CMI). These two techniques apply a novel adaptive thresholding method to filter out redundancy and can work jointly. Experimental results on 30 datasets from the UCI machine learning repository demonstrate that adaptive thresholds can help distinguish between dependencies and independencies, and that the proposed algorithm achieves competitive classification performance compared to several state-of-the-art BNCs in terms of 0–1 loss, root mean squared error, bias, and variance.
Keywords:
Bayesian network classifiers; mutual information; conditional mutual information; thresholding

1. Introduction

Classification is one of the most important tasks in machine learning. The basic problem of supervised classification is the induction of a model with feature set $X = \{X_1, \ldots, X_n\}$ that classifies a testing instance (example) $x = \{x_1, \ldots, x_n\}$ into one of the class labels $\{c_1, \ldots, c_m\}$ of class variable $C$. Bayesian network classifiers (BNCs) have many desirable properties compared with other classification models, such as model interpretability, ease of implementation, the ability to deal with multi-class classification problems and comparable classification performance [1]. A BNC $B$ computes the posterior probability of each class label and assigns to $x$ the label with the maximum posterior probability, that is:
$$\arg\max_{C} P_B(c \mid x) = \arg\max_{C} \frac{P_B(x, c)}{P_B(x)} \propto \arg\max_{C} P_B(x, c), \tag{1}$$
where class label $c \in \{c_1, \ldots, c_m\}$.
Although unrestricted BNCs are the least biased, the search space needed to train such a model grows exponentially with the number of features [2]. The resulting complexity limits the study of unrestricted BNCs and has led to the study of restricted BNCs, from the 0-dependence naive Bayes (NB) [3,4,5] and the 1-dependence tree-augmented naive Bayes (TAN) [6] to the k-dependence Bayesian classifier (KDB) [7]. These classifiers take the class variable as the common parent of all predictive features and use different learning strategies to explore the conditional dependencies among features. KDB has several desirable characteristics in structure learning. For example, it achieves satisfactory classification accuracy on large quantities of data [2]. In addition, KDB uses a single parameter, $k$, to set the maximum number of parents for any feature and thus controls the structure complexity. KDB first determines the feature order by comparing the mutual information (MI) between each feature and the class. Suppose that the order is $\{X_1, \ldots, X_n\}$; then $X_i$ can select at most $k$, or more precisely $\min\{i-1, k\}$, features as parents from its candidates $\{X_1, \ldots, X_{i-1}\}$. These parents correspond to the $\min\{i-1, k\}$ largest conditional mutual information (CMI) values. Figure 1 shows two examples, i.e., K$_1$DB (KDB with $k = 1$) and K$_2$DB (KDB with $k = 2$). Suppose that $I(X_1; C) > I(X_2; C) > I(X_3; C) > I(X_4; C)$; then the feature order is $\{X_1, X_2, X_3, X_4\}$. If $I(X_3; X_4 \mid C) > I(X_1; X_4 \mid C) > I(X_2; X_4 \mid C)$, $X_4$ in K$_1$DB chooses $X_3$ as its only parent and $X_4$ in K$_2$DB chooses $\{X_1, X_3\}$ as its parents from candidates $\{X_1, X_2, X_3\}$.
There are two basic kinds of dependencies in restricted BNCs: (1) direct dependence between feature $X_i$ and $C$, which can be quantified by mutual information (MI) $I(X_i; C)$, and (2) conditional dependence between $X_i$ and $X_j$ given $C$, which can be measured by conditional mutual information (CMI) $I(X_i; X_j \mid C)$. Many researchers have exploited methods, such as filter and wrapper [8,9,10,11,12,13], to select direct dependencies by removing redundant features. The filter approach operates independently of any learning algorithm: it ranks the features by some criterion and omits all features that do not achieve a sufficient score [14,15,16]. The wrapper approach evaluates each candidate feature subset with the learning algorithm itself and may produce better results. For example, Backwards Sequential Elimination (BSE) [17] uses a simple heuristic wrapper approach that seeks a subset of the available features that minimizes 0–1 loss on the training set. Forward Sequential Selection (FSS) [18] searches in the opposite direction to BSE. Although the filter and wrapper approaches have proved beneficial in domains with highly correlated features, the learning procedure ends only when there is no further accuracy improvement; thus they are expensive to run and can break down with very large numbers of features [8,19,20]. Suppose that we need to select $m$ from $n$ features for classification; BSE or FSS will construct $P_n^m$ or $\frac{n!}{m!}$ candidate BNCs to judge whether there exist non-significant features or direct dependencies. It is even more difficult for BSE or FSS to select the conditional dependencies. For example, the network topology of KDB consists of $nk - \frac{k^2}{2} - \frac{k}{2}$ conditional dependencies [21]. If BSE or FSS evaluates them one by one to identify the relatively non-significant ones, the computational overhead is almost unbearable, and few approaches have been proposed to address this issue.
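As a quick sanity check on the KDB dependency count, the sketch below sums the number of parents per feature, $\min\{i-1, k\}$, and compares the result with the closed form $nk - k^2/2 - k/2$. The function names are illustrative only, and the closed form is assumed to apply when $n \ge k$.

```python
def count_kdb_dependencies(n: int, k: int) -> int:
    """Count feature-to-feature edges in a KDB structure with n features.

    The i-th feature in the order (1-based) takes min(i - 1, k) parents.
    """
    return sum(min(i - 1, k) for i in range(1, n + 1))


def closed_form(n: int, k: int) -> float:
    """Closed form n*k - k^2/2 - k/2 (assumes n >= k)."""
    return n * k - k * k / 2 - k / 2


# 15 features with k = 2 give 27 conditional dependencies.
print(count_kdb_dependencies(15, 2), closed_form(15, 2))
```

The two functions agree for every $n \ge k$, because the first $k$ features contribute $0 + 1 + \cdots + (k-1) = k(k-1)/2$ edges and each of the remaining $n - k$ features contributes $k$ edges.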
Efficiently identifying non-significant direct dependencies and non-significant conditional dependencies are therefore two key issues in learning a BNC. Strictly speaking, exact direct or conditional independence is rarely observed: MI and CMI are non-negative and their estimated values are seldom exactly zero. However, weak dependencies, if introduced into the network topology, will result in overfitting and classification bias. In KDB, every feature is indiscriminately made conditionally dependent on at most $k$ parent features, even if the conditional dependencies are very weak. Discarding redundant features or weak conditional dependencies can help increase structure reliability and avoid overfitting. Figure 2 presents the distributions of MI and CMI values for K$_2$DB (KDB with $k = 2$) on dataset Connect-4, which has 67,557 instances (or examples), 42 features and three classes. As shown in Figure 2a, some MI values differ only slightly, so the corresponding direct dependencies have almost the same significance and can be treated as a batch. From Figure 2b, the same applies to CMI values and the corresponding conditional dependencies.
Filter approaches are computationally efficient, whereas wrapper approaches may produce better results. The proposed algorithm combines the characteristics of filter and wrapper approaches to exploit their complementary strengths. In this paper, we propose to group the direct (or conditional) dependencies into different batches using adaptive thresholds. We assume that there exists no significant difference between the MI (or CMI) values in the same batch. Then, the basic idea of filter and wrapper is applied to select a batch, rather than a single dependence, at each iteration. This learning strategy can help achieve much higher efficiency than BSE (or FSS) while retaining competitive classification performance, and above all it provides a feasible solution for selecting conditional dependencies, the number of which increases exponentially as the number of features increases.
We investigate two extensions to KDB, MI-based feature selection and CMI-based dependence selection, both built on a novel adaptive thresholding method. The final BNC, Adaptive KDB (AKDB), evaluates the subsets of features and conditional dependencies using leave-one-out cross validation (LOOCV). In the remaining sections, we show that applying feature selection and dependence selection techniques to KDB can alleviate the potential redundancy problem. We present extensive experimental results, which indicate that AKDB significantly outperforms several other state-of-the-art BNCs in terms of 0–1 loss, root mean squared error (RMSE), bias and variance.

2. Restricted Bayesian Network Classifiers

For convenience, all acronyms used in this work, except for the algorithm names, are listed in Table 1. The structure of a BNC can be described as a directed acyclic graph [22]. Nodes in the structure represent the class variable $C$ or features; an edge $X_i \to X_j$ denotes a probabilistic dependency between these two features, with $X_i$ being one of the immediate parent nodes of $X_j$. In a restricted BNC $B$, the class variable $C$ is required to be the common parent of all features and does not have any parents, so the individual probability of $C$ is $P(c)$. We use $P_B(x_i \mid \pi_i)$ to denote the individual probability of feature $X_i$, where $\pi_i$ denotes the set of values of $X_i$’s parents. The joint probability distribution can be calculated as the product of $P(c)$ and $P_B(x_i \mid \pi_i)$ over all features, that is:
$$P_B(x, c) = P(c) \prod_{i=1}^{n} P_B(x_i \mid \pi_i). \tag{2}$$
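To make the factorization concrete, here is a minimal sketch that evaluates the joint probability as $P(c)$ times the product of the per-feature conditional probabilities and then picks the class with the largest joint value. The toy model (two binary features, two classes, and all probability values) is hypothetical and chosen only for illustration.

```python
from math import prod

# Hypothetical toy model: class C in {"a", "b"}, binary features X1 and X2,
# with X1 as the extra parent of X2 (so pi_1 = {C} and pi_2 = {X1, C}).
PRIOR = {"a": 0.6, "b": 0.4}
PARENTS = [(), (0,)]  # indices of the feature parents of X1 and X2
CPTS = [
    # P(x1 | c), keyed by (c, parent_values)
    {("a", ()): {0: 0.8, 1: 0.2}, ("b", ()): {0: 0.3, 1: 0.7}},
    # P(x2 | x1, c)
    {
        ("a", (0,)): {0: 0.9, 1: 0.1},
        ("a", (1,)): {0: 0.5, 1: 0.5},
        ("b", (0,)): {0: 0.4, 1: 0.6},
        ("b", (1,)): {0: 0.2, 1: 0.8},
    },
]


def joint(x, c):
    """P_B(x, c) = P(c) * prod_i P_B(x_i | pi_i)."""
    factors = []
    for i, cpt in enumerate(CPTS):
        parent_values = tuple(x[j] for j in PARENTS[i])
        factors.append(cpt[(c, parent_values)][x[i]])
    return PRIOR[c] * prod(factors)


def classify(x):
    """Return the class label that maximizes the joint P_B(x, c)."""
    return max(PRIOR, key=lambda c: joint(x, c))


print(classify((1, 1)), classify((0, 0)))
```

Because $P_B(x)$ does not depend on the class, maximizing the joint is enough to find the most probable label; no normalization is needed.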
Unfortunately, the inference of an unrestricted BNC has been proved to be an NP-hard problem [23,24] and learning a restricted or pre-fixed BNC is one approach to deal with the intractable complexity. For example, NB [25,26] is the simplest classifier among restricted BNCs that assumes each feature is conditionally independent given the class variable C.
Since real-world datasets usually do not satisfy this independence assumption, the classification performance of NB may deteriorate. KDB relaxes the independence assumption of NB by constructing classifiers in which each feature $X_i$ may have at most $k$ parent features. KDB first sets the feature order by comparing MI values, then uses CMI values as weights to measure the conditional dependence between features given $C$ and selects at most $k$ parent features for each feature. MI and CMI are defined as follows:
$$I(X_i; C) = \sum_{x_i \in X_i} \sum_{c \in C} P(x_i, c) \log_2 \frac{P(x_i, c)}{P(x_i) P(c)}, \qquad I(X_i; X_j \mid C) = \sum_{x_i \in X_i} \sum_{x_j \in X_j} \sum_{c \in C} P(x_i, x_j, c) \log_2 \frac{P(x_i, x_j \mid c)}{P(x_i \mid c) P(x_j \mid c)}. \tag{3}$$
For KDB, $I(X_i; C)$ measures the direct dependence between $X_i$ and $C$, and $I(X_i; X_j \mid C)$ measures the conditional dependence between $X_i$ and $X_j$ given $C$. For a given training set with $n$ features and the parameter $k$, KDB first calculates MI and CMI. Suppose that the feature order obtained by comparing MI is $\{X_1, \ldots, X_n\}$; then $X_i$ chooses the $\min(i-1, k)$ features with the highest CMI values from the first $i-1$ candidates. The structure learning procedure of KDB is depicted in Algorithm 1.
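The MI and CMI definitions can be estimated from discrete data by plugging in empirical frequencies. The sketch below does this with plain counts; it assumes unsmoothed frequency estimates, which may differ from the estimates used in the paper, and the function names are illustrative.

```python
from collections import Counter
from math import log2


def mutual_information(xs, cs):
    """Plug-in estimate of I(X; C) in bits from paired samples."""
    n = len(xs)
    n_xc = Counter(zip(xs, cs))
    n_x = Counter(xs)
    n_c = Counter(cs)
    total = 0.0
    for (x, c), cnt in n_xc.items():
        # P(x, c) / (P(x) P(c)) = cnt * n / (n_x * n_c)
        total += (cnt / n) * log2(cnt * n / (n_x[x] * n_c[c]))
    return total


def conditional_mutual_information(xs, ys, cs):
    """Plug-in estimate of I(X; Y | C) in bits from aligned samples."""
    n = len(xs)
    n_xyc = Counter(zip(xs, ys, cs))
    n_xc = Counter(zip(xs, cs))
    n_yc = Counter(zip(ys, cs))
    n_c = Counter(cs)
    total = 0.0
    for (x, y, c), cnt in n_xyc.items():
        # P(x, y | c) / (P(x | c) P(y | c)) = cnt * n_c / (n_xc * n_yc)
        ratio = cnt * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)])
        total += (cnt / n) * log2(ratio)
    return total


# A feature that copies the class carries one full bit about it.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))
```

Only value combinations that actually occur contribute to the sums, so no logarithm of zero is ever taken.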
There have been some refinements that may improve KDB’s performance. Rodríguez and Lozano [27] proposed to extend KDB to a multi-dimensional classifier, which learned a population of classifiers (nondominated solutions) by a multi-objective optimization technique and the objective functions for the multi-objective approach are the multi-dimensional k-fold cross-validation estimations of the errors. Louzada [28] proposed to generate multiple KDB networks via a naive bagging procedure by obtaining the predicted values from the adjusted models, and then combine them into a single predictive classification.
Algorithm 1: Structure learning procedure of KDB: LearnStructure($T$, $L$, $k$)
[Algorithm listing provided as an image in the original; not reproduced here.]
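The parent-selection rule described in the text (order features by MI, then give each feature its top-$k$ predecessors by CMI) can be sketched as follows. This is not a reproduction of Algorithm 1; the function name, the dictionary-based inputs and all numeric values are assumptions, with the values chosen to match the ordering of the Figure 1 example.

```python
def learn_kdb_structure(mi, cmi, k):
    """Return {feature: [parent features]} for a k-dependence structure.

    mi:  {feature: I(feature; C)}
    cmi: {frozenset({a, b}): I(a; b | C)}
    The class C is an implicit parent of every feature and is not listed.
    """
    order = sorted(mi, key=mi.get, reverse=True)
    parents = {}
    for i, feature in enumerate(order):
        candidates = order[:i]  # only features earlier in the order
        ranked = sorted(
            candidates,
            key=lambda p: cmi[frozenset((feature, p))],
            reverse=True,
        )
        parents[feature] = ranked[:k]  # at most min(i, k) parents (0-based i)
    return parents


# Hypothetical values consistent with the orderings in the Figure 1 example.
mi = {"X1": 0.4, "X2": 0.3, "X3": 0.2, "X4": 0.1}
cmi = {
    frozenset(("X1", "X2")): 0.05,
    frozenset(("X1", "X3")): 0.06,
    frozenset(("X2", "X3")): 0.07,
    frozenset(("X3", "X4")): 0.30,
    frozenset(("X1", "X4")): 0.20,
    frozenset(("X2", "X4")): 0.10,
}
print(learn_kdb_structure(mi, cmi, k=1)["X4"])
print(learn_kdb_structure(mi, cmi, k=2)["X4"])
```

With these values, $X_4$ takes $X_3$ as its only parent for $k = 1$ and takes $X_3$ and $X_1$ for $k = 2$, matching the example given in the Introduction.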

3. Adaptive KDB

MI and CMI, as defined in Equation (3), are non-negative. $I(X_i; C) = 0$ (or $I(X_i; X_j \mid C) = 0$) if $X_i$ and $C$ are independent (or $X_i$ and $X_j$ are conditionally independent given $C$). If $X_i$ and $C$ are regarded as independent, the edge connecting them will be removed. In practice, the estimated MI is compared to a small threshold in order to distinguish pairs of dependent features from pairs of independent features [29,30,31,32]. In the following, we mainly discuss how to choose the threshold of MI. The test for conditional independence using CMI is similar.
To refine the network structure, AKDB uses an adaptive threshold to filter out non-significant dependencies. If the threshold is high, too many dependencies will be identified as non-significant and removed, and the resulting sparse network may underfit the training data. In contrast, if the threshold is low, few dependencies will be identified as non-significant and the resulting dense network may overfit the training data. The thresholds therefore control AKDB’s bias-variance trade-off: the lowest error is achieved only with appropriate thresholds, which reflect a complex interplay between structure complexity and classification performance. Unfortunately, the appropriate thresholds may differ across training datasets and there are no formal methods to preselect them.
To guarantee satisfactory performance without exhaustive experimentation, we start from the feature order that KDB determines by comparing MI. If feature $X_i$ is assumed to be independent of $C$, i.e., $I(X_i; C) = 0$, it will be at the end of the order and the edge $C \to X_i$ will be removed. Furthermore, $X_i$ may depend on other features, whereas no feature depends on it. That is, $X_i$ is irrelevant to classification, directly or indirectly. The problem of choosing the MI threshold thus turns into the problem of choosing a feature subset. Many feature selection algorithms are based on forward selection or backwards elimination strategies [18,33]. They start with either an empty set of features or the full set of features, and then add only one feature to the BNC or remove only one feature from it at each iteration. Feature selection is a complex task because the number of candidate feature subsets for $n$ features is $O(2^n)$. Thus, it is impractical to search the whole space exhaustively unless $n$ is small. Our proposed algorithm, AKDB, extends KDB to adaptively select an MI threshold, and the threshold can help remove more than one feature at each step.
To clarify the basic idea, we take datasets Hypo and Waveform as a case study. Dataset Hypo has 3772 instances, 29 features and four classes. Dataset Waveform has 100,000 instances, 21 features and three classes. The corresponding MI values (see Table A1 and Table A2 in Appendix A) and CMI values (see Table A3 and Table A4 in Appendix A) for K$_2$DB are presented in Figure 3 and Figure 4, respectively.
As Figure 3 shows, the features can be divided into different parts according to the distribution of MI values. On dataset Hypo, the MI values of the first 26 features do not differ obviously, so these features can be grouped into one part. The 27th and 28th features can be grouped into another part, and the 29th feature forms the last part. On dataset Waveform, the features can also be divided into three parts. The distribution of CMI values is similar. As Figure 4 shows, the CMI values on datasets Hypo and Waveform are both divided into five groups. The differences among MI values within the same part should be non-significant; if these MI values are small, the corresponding features can be identified as redundant for classification and removed from the BNC. The test for redundant conditional dependencies is similar. From Figure 3 and Figure 4, we can see that the thresholds for identifying redundancy differ greatly across datasets. Thus, a threshold that maximizes a performance measure should be adapted to each dataset. $\widehat{\mathrm{MI}}$ and $\widehat{\mathrm{CMI}}$ are introduced as the adaptive thresholds for redundant features and redundant conditional dependencies, respectively. AMI and ACMI denote the average MI and the average CMI, respectively; they are introduced in this paper as benchmark thresholds to distinguish between strong and weak dependencies and are defined as follows:
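$$\mathrm{AMI} = \frac{1}{|X|} \sum_{X_i \in X} I(X_i; C), \qquad \mathrm{ACMI} = \frac{1}{\sum_{i=1}^{|F|} |\pi_i|} \sum_{X_i \in X} \sum_{X_j \in \pi_i} I(X_i; X_j \mid C), \tag{4}$$
where $|\pi_i|$ is the size of $\pi_i$, i.e., the number of parent features of $X_i$, and $|X|$ and $|F|$ denote the cardinalities of feature set $X$ and feature subset $F$, respectively. To guarantee satisfactory performance and avoid exhaustive experimentation, we require that $\mathrm{AMI} > \widehat{\mathrm{MI}}$ and $\mathrm{ACMI} > \widehat{\mathrm{CMI}}$ hold.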
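The two benchmark thresholds are plain averages: AMI over all features, and ACMI over the feature-to-feature edges present in the structure. A minimal sketch follows, assuming MI values in a dictionary and a structure given as a child-to-parents mapping; the names and numbers are hypothetical.

```python
def average_mi(mi):
    """AMI: mean of I(X_i; C) over all features."""
    return sum(mi.values()) / len(mi)


def average_cmi(parents, cmi):
    """ACMI: mean of I(X_i; X_j | C) over all edges X_j -> X_i in the structure."""
    edges = [(child, p) for child, ps in parents.items() for p in ps]
    if not edges:
        raise ValueError("the structure has no conditional dependencies")
    return sum(cmi[frozenset(edge)] for edge in edges) / len(edges)


# Hypothetical values for a 1-dependence structure over four features.
mi = {"X1": 0.4, "X2": 0.3, "X3": 0.2, "X4": 0.1}
parents = {"X1": [], "X2": ["X1"], "X3": ["X2"], "X4": ["X3"]}
cmi = {
    frozenset(("X1", "X2")): 0.05,
    frozenset(("X2", "X3")): 0.07,
    frozenset(("X3", "X4")): 0.30,
}
print(average_mi(mi), average_cmi(parents, cmi))
```

In this toy case, two of the three CMI values fall below the average, so those two edges are the ones a CMI threshold bounded by ACMI could remove.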
AKDB applies a greedy-search strategy to iteratively identify redundant and near-redundant features. For feature selection, we take advantage of the feature order determined by comparing MI. For simplicity, we adaptively provide an MI threshold that cuts off an entire region at the end of the order. Let $\delta$ be a user-specified parameter, $0\% < \delta \le 100\%$ (see details in Section 4.1). We suppose that the difference between features $X_i$ and $X_j$ is non-significant if $I(X_i; C) \le I(X_j; C) \le I(X_i; C)(1 + \delta)$. Correspondingly, we regard feature $X_j$ as near-redundant if $X_i$ is redundant and the difference between $X_i$ and $X_j$ is non-significant. Given the feature order, AKDB first selects the feature at the end of the order, e.g., $X_i$, and identifies it as a redundant feature. Then, it identifies the near-redundant features. Finally, LOOCV is used to evaluate the classification performance after removing the redundant and near-redundant features, as it can provide an unbiased, low-variance estimate of the out-of-sample error. The 0–1 loss is used as the loss function since it is an effective measure of a classifier’s classification performance.
The feature subset with the lowest 0–1 loss is then selected. In case of a draw, preference is given to the subset with the smallest number of features. If the MI values are densely distributed, all redundant and near-redundant features can be identified in only a few iterations. After that, the greedy-search strategy is applied to identify redundant and near-redundant conditional dependencies. In this paper, we propose to extend KDB with two information-threshold based techniques, FeatureSelection (FS) and DependenceSelection (DS), which identify redundant features and redundant conditional dependencies, respectively. Both techniques are based on backward elimination that begins with the full set of features or conditional dependencies.
The learning procedure of FS is shown in Algorithm 2. By applying BSE, FS aims to seek a subset of the available features that minimizes 0–1 loss on the training set. FS starts from the full set of features, whose MI values have been grouped into several batches. There should exist significant differences between the MI values in different batches. Suppose that, for successive batches $B_i$ and $B_{i+1}$, $I_m = \min\{I(X_j; C)\}$ for any $I(X_j; C) \in B_i$ and $I_{m+1} = \min\{I(X_k; C)\}$ for any $I(X_k; C) \in B_{i+1}$. In this paper, FS requires that, for batches $B_i$ and $B_{i+1}$, the criterion $I_m (1 + \delta) < I_{m+1}$ holds, or, for batch $B_i$, the criterion $I_m \le I(X_j; C) \le I_m (1 + \delta)$ holds. BSE operates by iteratively removing successive batches. The MI threshold $\widehat{\mathrm{MI}}$ changes from $I_m$ to $I_{m+1}$ if the removal helps reduce the 0–1 loss. The features in the batch, i.e., the corresponding direct dependencies, are removed from the network structure and the classification performance is evaluated iteratively using LOOCV. This procedure terminates when there is no 0–1 loss improvement or $I_m > \mathrm{AMI}$.
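The batching criterion can be illustrated with a short sketch: scores are sorted in ascending order, and a new batch is opened whenever a score exceeds $I_m(1 + \delta)$, where $I_m$ is the minimum of the current batch. This is one plausible reading of the criterion rather than the authors' implementation; the function name and the example scores are hypothetical.

```python
def group_into_batches(scores, delta):
    """Group {name: score} into batches in ascending order of score.

    Each batch starts at its minimum I_m and holds every item whose score
    lies in [I_m, I_m * (1 + delta)]; delta is a fraction (0.10 for 10%).
    """
    ordered = sorted(scores.items(), key=lambda item: item[1])
    batches = []
    for name, value in ordered:
        if batches and value <= batches[-1]["min"] * (1 + delta):
            batches[-1]["items"].append(name)
        else:
            batches.append({"min": value, "items": [name]})
    return batches


# Hypothetical MI values, grouped with delta = 10%.
scores = {"A": 0.100, "B": 0.105, "C": 0.109, "D": 0.200, "E": 0.215, "F": 0.500}
for batch in group_into_batches(scores, delta=0.10):
    print(batch["min"], batch["items"])
```

Here A, B and C fall within 10% of the smallest score and form the first batch, D and E form the second, and F stands alone, so at most three removal steps are needed instead of six.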
When the learning procedure of FS terminates, DS is applied to identify non-significant conditional dependencies. Its learning procedure is similar, except that CMI values rather than MI values are grouped into several batches, and batches of CMI values are removed iteratively to improve the 0–1 loss. The learning procedure of DS is shown in Algorithm 3.
The complete AKDB algorithm, which includes the FS and DS techniques, is described in Algorithm 4. Both FS and DS first apply the filter approach to rank features or conditional dependencies by the MI or CMI criterion, and then use the wrapper approach to evaluate each candidate feature subset or dependence subset in search of a lower 0–1 loss.
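Algorithm 2: FeatureSelection($T$, BN, $L$, AMI)
[Algorithm listing provided as an image in the original; not reproduced here.]
Algorithm 3: DependenceSelection($T$, $G$, ACMI)
[Algorithm listing provided as an image in the original; not reproduced here.]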
Algorithm 4: AKDB
Input: Training set $T$ with features $L = \{X_1, \ldots, X_n, C\}$ and $k$.
Output: AKDB model.
1 Calculate $I(X_i; C)$ $(1 \le i \le n)$ from $T$ for each feature and AMI;
2 Calculate $I(X_i; X_j \mid C)$ $(i \ne j)$ from $T$ for every pair of features and ACMI;
3 Let $L$ be a list which includes all $X_i$ in decreasing order of $I(X_i; C)$;
4 Initialize the network structure $G$ = LearnStructure($T$, $L$, $k$);  // Algorithm 1
5 $G$ = FeatureSelection($T$, $G$, $L$, AMI);  // Algorithm 2
6 $G$ = DependenceSelection($T$, $G$, ACMI);  // Algorithm 3
7 return $G$;
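The wrapper step shared by FS and DS (remove one batch at a time, keep the removal if the loss does not get worse, stop once the batch minimum exceeds the average) might look like the sketch below. It is an interpretation under stated assumptions, not the authors' Algorithms 2 and 3: `evaluate` is a hypothetical stand-in for the LOOCV 0–1 loss, ties are accepted so that smaller subsets are preferred, and the batches and loss values are made up.

```python
def greedy_batch_elimination(batches, evaluate, average):
    """Greedily remove low-score batches while the loss does not increase.

    batches:  list of (batch_min, [names]) in ascending order of batch_min.
    evaluate: callable(set of kept names) -> loss; stands in for LOOCV 0-1 loss.
    average:  AMI (or ACMI); a batch whose minimum exceeds it is never removed.
    """
    kept = {name for _, names in batches for name in names}
    best_loss = evaluate(kept)
    for batch_min, names in batches:
        if batch_min > average:
            break  # remaining batches are strong dependencies
        candidate = kept - set(names)
        if not candidate:
            break  # never remove everything
        loss = evaluate(candidate)
        if loss > best_loss:
            break  # no improvement: stop the search
        kept, best_loss = candidate, loss  # a tie favours the smaller subset
    return kept, best_loss


# Hypothetical batches and a toy loss that depends only on the subset size.
batches = [(0.001, ["X5", "X6"]), (0.010, ["X4"]), (0.200, ["X3"]), (0.500, ["X1", "X2"])]
toy_loss_by_size = {6: 0.20, 4: 0.18, 3: 0.18, 2: 0.25}
kept, loss = greedy_batch_elimination(
    batches, lambda subset: toy_loss_by_size[len(subset)], average=0.25
)
print(sorted(kept), loss)
```

In this toy run the first two batches are removed, the third removal would raise the loss, and the search stops with three features kept.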

4. Experiments

We conduct the experiments on 30 benchmark datasets from the UCI (University of California, Irvine) machine learning repository [34]. The detailed characteristics of these datasets are described in Table 2, which includes the numbers of instances, features and classes. The datasets are divided into two categories: small datasets with at most 3000 instances, and large datasets with more than 3000 instances. Numeric features, if present in a dataset, are discretized based on Minimum Description Length (MDL) [35]. Missing values are treated as a distinct value, and m-estimation with $m = 1$ [36] is employed to smooth the probability estimates.
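For readers unfamiliar with m-estimation, the sketch below shows the usual form of the smoothed estimate, $(n_{xc} + m p) / (n_c + m)$. It assumes a uniform prior $p = 1/v$ over the $v$ values of the feature, which is a common choice but is not confirmed by the text; the function name and counts are illustrative.

```python
def m_estimate(count_xc, count_c, num_values, m=1.0):
    """Smoothed estimate of P(x | c) with a uniform prior over num_values values.

    count_xc: number of training instances with feature value x and class c.
    count_c:  number of training instances with class c.
    """
    return (count_xc + m / num_values) / (count_c + m)


# A value never seen with this class still gets a small non-zero probability.
print(m_estimate(count_xc=0, count_c=9, num_values=3))
```

The smoothed estimates over all values of a feature still sum to one, since the $m/v$ terms add up to the $m$ in the denominator.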
The following algorithms are compared:
  • NB, standard Naive Bayes.
  • TAN, tree-augmented naive Bayes.
  • NB-FSS, selective Naive Bayes classifier with forward sequential selection.
  • K$_1$DB, standard k-dependence Bayesian classifier with k = 1.
  • K$_2$DB, standard k-dependence Bayesian classifier with k = 2.
  • AKDB, KDB with feature selection and conditional dependence selection based on adaptive thresholding.
The classification accuracy of the algorithms is compared in terms of 0–1 loss and RMSE, and the results are presented in Table A5 and Table A6, respectively. The bias and variance results are provided in Table A7 and Table A8, respectively, because the bias-variance decomposition can provide valuable insights into the components of the error of learned classifiers [37,38]. Note that only the 13 large datasets are used for the bias-variance comparison, for reasons of statistical significance.

4.1. Selection of the Value of Parameter for AKDB

Removing redundant features or conditional dependencies from a BNC may positively affect its classification performance if the parameter $\delta$ is selected appropriately. However, no prior work indicates how to choose it. We therefore perform an empirical study to select an appropriate $\delta$. The 0–1 loss results for all datasets with different $\delta$ values are presented in Table 3. AKDB achieves the lowest 0–1 loss more often when $\delta = 10\%$. Although AKDB with $\delta = 10\%$ performs relatively worse on some datasets, the difference between its 0–1 loss and the lowest 0–1 loss is not significant (less than 5%). For example, on dataset Splice-C4.5, AKDB achieves the lowest 0–1 loss (0.0468) when $\delta = 80\%$, whereas the 0–1 loss is 0.0469 when $\delta = 10\%$. From the experimental results, we argue that $\delta = 10\%$ is appropriate to help identify the threshold efficiently.

4.2. Effects of Feature Selection and Conditional Dependence Selection on KDB

FS and DS are two information-threshold based techniques used in the proposed algorithm AKDB. Applying them removes a portion of the features and conditional dependencies. To show that each technique also works on its own, we present two versions of KDB as follows:
  • KDB-FS, KDB with only feature selection,
  • KDB-DS, KDB with only conditional dependence selection.
In order to evaluate the difference between two classifiers, we define the relative ratio as follows:
$$R_M(A \mid B) = 1 - \frac{M_A}{M_B}. \tag{5}$$
Here $M$ stands for the measure being compared, and $R_M(A \mid B)$ expresses the relative difference, as a percentage, between two classifiers $A$ and $B$ with respect to $M$.
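As a worked instance of the relative ratio, the sketch below recomputes two of the feature-count ratios reported later in this section. The retained-feature counts (6 of 29 on Hypo, 19 of 21 on Waveform) are inferred here from the reported percentages and feature totals rather than quoted from the tables, and the function name is illustrative.

```python
def relative_ratio(m_a, m_b):
    """R_M(A | B) = 1 - M_A / M_B, the relative reduction of measure M from B to A."""
    return 1.0 - m_a / m_b


# Hypo: 23 of 29 features removed, so 6 remain.
print(f"{relative_ratio(6, 29):.2%}")
# Waveform: 2 of 21 features removed, so 19 remain.
print(f"{relative_ratio(19, 21):.2%}")
```

A positive ratio means that classifier $A$ has the smaller value of the measure, which for 0–1 loss corresponds to an improvement over $B$.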
In this paper, $N_F$ and $N_D$ denote the number of features and the number of conditional dependencies in a BNC, respectively. SMI and SCMI denote the sum of MI values and the sum of CMI values in a BNC, respectively. The results of $R_M(\text{KDB-FS} \mid \text{K}_2\text{DB})$ and $R_M(\text{KDB-DS} \mid \text{K}_2\text{DB})$ are shown in Figure 5a,b, respectively.
Figure 5a presents the relative ratios between KDB-FS and K$_2$DB in terms of $N_F$, SMI and 0–1 loss. The effectiveness of FS can be demonstrated by comparing the SMI values before and after removing redundant features. From Figure 5a, FS removes features on 27 out of 30 datasets. The larger the value of $R_{N_F}(\text{KDB-FS} \mid \text{K}_2\text{DB})$, the more features are identified as redundant and removed. The values of $R_{N_F}(\text{KDB-FS} \mid \text{K}_2\text{DB})$ on five datasets are greater than 50%. For example, on dataset Hypo (No. 20), $R_{N_F}(\text{KDB-FS} \mid \text{K}_2\text{DB}) = 79.31\%$, indicating that 79.31% of features are identified as redundant and removed. The AMI value on dataset Hypo with 29 features is 0.0251 and only three features have MI values greater than the AMI value. In addition, 23 of these 29 features have MI values lower than 0.007 and they are iteratively removed from KDB according to the greedy-search strategy. Thus, the significant difference in MI values contributes to this high value of $R_{N_F}(\text{KDB-FS} \mid \text{K}_2\text{DB})$. Furthermore, removing features with the FS technique does not cause strong direct dependencies to be removed. For example, the value of $R_{\mathrm{SMI}}(\text{KDB-FS} \mid \text{K}_2\text{DB})$ on dataset Hypo is 4.14%, although 79.31% of features are removed. On dataset Waveform (No. 30), the value of $R_{\mathrm{SMI}}(\text{KDB-FS} \mid \text{K}_2\text{DB})$ is close to 0% after removing 9.52% of features. These facts suggest that the removed features show weak direct dependencies. In addition, removing weak direct dependencies may help improve the classification performance. The values of $R_{0\text{-}1\,\mathrm{Loss}}(\text{KDB-FS} \mid \text{K}_2\text{DB})$ on datasets Hypo and Waveform are 21.05% and 24.61%, respectively. That is, the classification performance is improved after removing the weak direct dependencies. The significant improvement in 0–1 loss ($R_{0\text{-}1\,\mathrm{Loss}}(\cdot) > 5\%$) on 12 datasets indicates that the FS technique has a positive influence on classification performance.
Redundant conditional dependencies may also exist in KDB. Figure 5b presents the relative ratios between KDB-DS and K$_2$DB in terms of $N_D$, SCMI and 0–1 loss. Comparing the SCMI values before and after removing redundant conditional dependencies can demonstrate the effectiveness of DS. When DS is applied to KDB, conditional dependencies are removed on all 30 datasets. The value of $R_{N_D}(\text{KDB-DS} \mid \text{K}_2\text{DB})$ ranges from 8.77% to 86.72%. The larger the value of $R_{N_D}(\text{KDB-DS} \mid \text{K}_2\text{DB})$, the more conditional dependencies are identified as redundant and removed. For example, the value of $R_{N_D}(\text{KDB-DS} \mid \text{K}_2\text{DB})$ is 78.51% on dataset Credit-a, which indicates that on average only 5.8 of 27 conditional dependencies are retained. Furthermore, 24 of all 27 CMI values are lower than the ACMI value (0.1592), and the minimum CMI value is 0.0189. The difference in CMI values on dataset Credit-a is obvious; by applying the DS technique with the greedy-search strategy, weak conditional dependencies are iteratively removed. The value of $R_{\mathrm{SCMI}}(\text{KDB-DS} \mid \text{K}_2\text{DB})$ ranges from 1.54% to 45.83%. A high value of $R_{\mathrm{SCMI}}(\text{KDB-DS} \mid \text{K}_2\text{DB})$ does not indicate that strong conditional dependencies are removed. On dataset Hypo, $R_{\mathrm{SCMI}}(\text{KDB-DS} \mid \text{K}_2\text{DB}) = 36.40\%$. This high value arises because the SCMI value of the 48 removed conditional dependencies reaches 2.3355, even though the CMI value of each removed conditional dependence is lower than the ACMI value. As for 0–1 loss, the value of $R_{0\text{-}1\,\mathrm{Loss}}(\text{KDB-DS} \mid \text{K}_2\text{DB})$ on dataset Hypo is 14.04%, which indicates that deleting weak conditional dependencies may help improve classification accuracy. KDB-DS achieves almost the same classification accuracy as K$_2$DB with a simplified network structure on 14 datasets and achieves 0–1 loss improvement on 16 datasets. These results indicate that the DS technique is effective and can help reduce the structure complexity of KDB.
Both the FS and DS techniques combine the characteristics of filter and wrapper approaches. The redundant features or conditional dependencies are filtered out, and classification accuracy is then used to evaluate the feature subsets or the retained conditional dependencies, respectively. In addition, removing redundant features and conditional dependencies reduces the number of parameters needed for probability estimates and may improve the classification accuracy. From the above discussion, we can see that both the FS and DS techniques are efficient and can help improve the classification performance.

4.3. Comparison of AKDB vs. NB, NB-FSS, TAN, K$_1$DB and K$_2$DB

In this section, we compare the related algorithms in terms of 0–1 loss, RMSE and bias-variance decomposition. RMSE [2] is computed as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{t} \sum_{x \in T} \left(1 - \hat{p}(c_x \mid x)\right)^2}, \tag{6}$$
where $t$ is the number of training instances in training set $T$, $c_x$ is the true class label for the instance $x$, and $\hat{p}(c_x \mid x)$ is the estimated posterior probability of the true class given $x$.
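The RMSE definition translates directly into code once the estimated posterior probability of the true class is available for each instance. The sketch below assumes those probabilities have already been computed by some classifier; the input values are made up.

```python
from math import sqrt


def rmse(true_class_probs):
    """RMSE over instances, given the estimated posterior of each true class."""
    t = len(true_class_probs)
    return sqrt(sum((1.0 - p) ** 2 for p in true_class_probs) / t)


# Two confident correct predictions and two uncertain ones.
print(rmse([1.0, 0.5, 0.5, 1.0]))
```

Unlike 0–1 loss, this measure penalizes a correct but under-confident prediction, so it is sensitive to the quality of the probability estimates and not only to the predicted label.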
The win/draw/loss (W/D/L) records for 0–1 loss, RMSE and bias-variance decomposition are presented in Table 4, Table 5 and Table 6, respectively. Each cell $[i; j]$ of a table contains the W/D/L record for the comparison between the algorithm in row $i$ ($Al_i$) and the algorithm in column $j$ ($Al_j$) over all datasets. For 0–1 loss, for example, a win in cell $[i; j]$ denotes that $Al_i$ obtains a lower 0–1 loss than $Al_j$, a loss denotes that $Al_j$ obtains a lower 0–1 loss than $Al_i$, and a draw denotes that $Al_i$ and $Al_j$ perform comparably. We regard a difference as significant if the outcome of a one-tailed binomial sign test is less than 0.05 [39,40].
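A one-tailed binomial sign test on a W/D/L record can be sketched as follows. The sketch assumes that draws are discarded and that wins follow a Binomial(wins + losses, 0.5) distribution under the null hypothesis; this is the textbook form of the test and may differ in detail from the procedure used in the paper. The two records evaluated are the AKDB versus K$_2$DB results quoted in the discussion of Table 4.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """One-tailed p-value: P(X >= wins) for X ~ Binomial(wins + losses, 0.5)."""
    n = wins + losses
    return sum(comb(n, i) for i in range(wins, n + 1)) / 2 ** n


# 15 wins and 1 loss over all datasets (draws discarded).
print(sign_test_p_value(15, 1))
# 4 wins and 1 loss on the large datasets (8 draws discarded).
print(sign_test_p_value(4, 1))
```

Under these assumptions the first record falls well below the 0.05 level, while the second does not, which is consistent with the text describing the improvement on large datasets as not significant.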
From Table 4, we can see that NB-FSS performs better than NB in terms of 0–1 loss, which indicates that FSS is beneficial for NB. Surprisingly, K$_2$DB does not have an obvious advantage over the 1-dependence classifiers, and it even performs worse than TAN. However, on large datasets, as Table 7 shows, K$_2$DB performs better than both TAN and K$_1$DB. AKDB significantly outperforms all other algorithms. Most importantly, when compared to K$_2$DB, AKDB improves 0–1 loss with 15 wins and only one loss, which indicates that the two proposed information-threshold based techniques are effective. This advantage is even greater on small datasets. From Table A5, AKDB never loses on small datasets and obtains a significantly lower 0–1 loss on 11 out of 17 small datasets. On dataset Lymphography, the error is substantially reduced from 0.2365 to 0.1554. Compared to K$_2$DB on large datasets, AKDB achieves a W/D/L record of 4/8/1. Although this improvement is not significant, AKDB only loses on dataset Spambase. Based on these facts, we argue that AKDB is a more effective algorithm in terms of 0–1 loss.
Table 5 reveals a pattern similar to that in Table 4. NB and NB-FSS perform worse, which demonstrates the limitations of the independence assumption in NB. TAN and K1DB achieve better performance than NB and NB-FSS. In addition, AKDB still achieves lower RMSE more often than the other five algorithms. AKDB builds its network structure from, on average, 72.4% of the features and 59.6% of the conditional dependencies, although in some cases the resulting improvement in RMSE is not significant. Considering AKDB's advantage in both 0–1 loss and RMSE over the other algorithms, we argue that the FS technique in tandem with the DS technique used in the proposed algorithm is effective at improving classification accuracy.
The W/D/L records of the bias-variance decomposition are presented in Table 6. NB and NB-FSS achieve higher bias and lower variance significantly more often than the other algorithms because their structures are fixed in advance and do not take the true data distribution into account. TAN, K1DB and K2DB are all low-bias and high-variance learners because they are derived from higher-dimensional probability estimates. Thus, these classifiers are more sensitive to changes in the training data. AKDB also performs well in terms of bias. According to Table 6, AKDB obtains lower bias than K1DB more often than not (8/1/4) and is on par with K2DB (5/3/5), while jointly applying both FS and DS to KDB simplifies the network structure. Furthermore, AKDB shows an advantage over K2DB in variance. The average variances of K2DB and AKDB on the 13 large datasets are 0.045 and 0.025, respectively. Based on these facts, we argue that the proposed AKDB is more stable for classification.
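The two average variances quoted above can be reproduced directly from the K2DB and AKDB columns of Table A8. The short sketch below only repeats that arithmetic; the variable names are illustrative.

```python
from statistics import fmean

# Variance on the 13 large datasets, copied from Table A8 in row order
# (Splice-c4.5, Dis, Hypo, Spambase, Phoneme, Page-blocks, Optdigits,
#  Mushrooms, Magic, Adult, Shuttle, Connect-4, Waveform).
k2db_variance = [0.0813, 0.0025, 0.0059, 0.0238, 0.1490, 0.0185, 0.0322,
                 0.0005, 0.0818, 0.0717, 0.0021, 0.1044, 0.0102]
akdb_variance = [0.0572, 0.0012, 0.0069, 0.0178, 0.1064, 0.0161, 0.0227,
                 0.0002, 0.0453, 0.0196, 0.0004, 0.0294, 0.0023]

mean_k2db = fmean(k2db_variance)
mean_akdb = fmean(akdb_variance)

print(f"K2DB mean variance: {mean_k2db:.3f}")
print(f"AKDB mean variance: {mean_akdb:.3f}")
```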

4.4. Tests of Significant Differences

The Friedman test [41] compares multiple algorithms over multiple datasets. It first ranks the algorithms on each dataset separately, with the best-performing algorithm receiving rank 1, the second best rank 2, and so on, and then compares the average ranks of the algorithms over all datasets. The null hypothesis is that there is no significant difference in terms of average ranks. The Friedman statistic is non-parametric and is computed as follows:
$$\chi_F^2 = \frac{12N}{t(t+1)} \sum_{j=1}^{t} R_j^2 - 3N(t+1),$$
where $N$ and $t$ denote the number of datasets and the number of algorithms, respectively, and $R_j$ is the average rank of the $j$-th algorithm. With the 30 datasets ($N = 30$) and six algorithms ($t = 6$), the critical value of $\chi_\alpha^2$ for $\alpha = 0.05$ with $t - 1 = 5$ degrees of freedom is 11.07. The Friedman statistics $\chi_F^2$ for the experimental results in Table A5 and Table A6 are 36.56 and 22.90, respectively, both of which are larger than the critical value of 11.07. Hence, we reject both null hypotheses.
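A minimal sketch of this computation follows, assuming only the formula above. It is applied to the six rounded average ranks that this section reports for 0–1 loss; the order of the ranks does not affect the statistic. Because those published ranks are rounded to two decimals, the result is close to, but not identical with, the reported value of 36.56. The function name is hypothetical.

```python
def friedman_statistic(avg_ranks, n_datasets):
    """Friedman chi-square statistic computed from average ranks."""
    t = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12.0 * n_datasets / (t * (t + 1)) * sum_sq - 3.0 * n_datasets * (t + 1)


# Rounded average ranks for 0-1 loss as reported in this section.
ranks_zero_one = [4.32, 4.33, 3.07, 3.83, 3.53, 1.92]
CRITICAL_VALUE = 11.07  # chi-square, alpha = 0.05, 5 degrees of freedom (from the text)

chi2_f = friedman_statistic(ranks_zero_one, n_datasets=30)
print(f"chi2_F = {chi2_f:.2f}, reject H0 = {chi2_f > CRITICAL_VALUE}")
```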
Figure 6 presents the average ranks of the six algorithms in terms of 0–1 loss and RMSE. The average ranks based on 0–1 loss on all datasets are NB (4.32), NB-FSS (4.33), TAN (3.07), K1DB (3.83), K2DB (3.53), and AKDB (1.92). That is, AKDB ranks best, followed by TAN, K2DB, K1DB, NB, and NB-FSS. When performance is assessed using RMSE, AKDB again obtains the lowest average rank, 2.42.
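The ranking step of the Friedman test can be sketched as follows. The sketch assumes that exactly equal losses share the average of the ranks they span, which is the usual convention but is not stated in the text; the helper names are hypothetical. The two example rows are the Echocardiogram and Lymphography rows of Table A5, restricted to the six compared algorithms.

```python
def rank_with_ties(values):
    """Rank values ascending (1 = lowest loss); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def average_ranks(losses_by_dataset):
    """Average the per-dataset ranks of each algorithm (one column per algorithm)."""
    per_dataset = [rank_with_ties(row) for row in losses_by_dataset]
    n = len(per_dataset)
    return [sum(column) / n for column in zip(*per_dataset)]


algorithms = ["NB", "NB-FSS", "TAN", "K1DB", "K2DB", "AKDB"]
losses = [
    [0.3359, 0.3664, 0.3282, 0.3053, 0.3435, 0.3206],  # Echocardiogram
    [0.1486, 0.1689, 0.1757, 0.1757, 0.2365, 0.1554],  # Lymphography
]
for name, rank in zip(algorithms, average_ranks(losses)):
    print(f"{name}: {rank:.2f}")
```

Running the same procedure over all 30 rows would give the inputs to the Friedman statistic, subject to the tie-handling assumption above.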
To determine which algorithms differ significantly from one another, we further employ the Nemenyi test [42]. The pairwise comparisons of the six algorithms with the Nemenyi test on 0–1 loss and RMSE are shown in Figure 7. The figure also shows the critical difference (CD), which is calculated as follows:
$$CD = q_\alpha \sqrt{\frac{t(t+1)}{6N}},$$
where the critical value $q_\alpha$ for $\alpha = 0.05$ and $t = 6$ is 2.85. With the 30 datasets ($N = 30$) and six algorithms, $CD = 2.85 \times \sqrt{6 \times (6 + 1) / (6 \times 30)} = 1.377$. The algorithms are plotted along the top dotted line according to their average ranks, which are marked on the top solid line. Lower ranks lie further to the left, so an algorithm further to the left performs better. Algorithms whose differences are not significant are connected by a line.
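The critical difference can be checked with a few lines of code. The sketch below assumes only the CD formula and the constants quoted above, and uses two of the reported 0–1 loss ranks (AKDB 1.92 and NB 4.32) as an example of the pairwise decision rule; the function names are hypothetical.

```python
from math import sqrt


def critical_difference(q_alpha, n_algorithms, n_datasets):
    """Nemenyi critical difference for average ranks."""
    return q_alpha * sqrt(n_algorithms * (n_algorithms + 1) / (6.0 * n_datasets))


def differs_significantly(rank_a, rank_b, cd):
    """Two algorithms differ significantly if their average ranks differ by more than CD."""
    return abs(rank_a - rank_b) > cd


cd = critical_difference(q_alpha=2.85, n_algorithms=6, n_datasets=30)
print(f"CD = {cd:.3f}")

# Example with ranks reported for 0-1 loss: AKDB (1.92) vs. NB (4.32).
print(differs_significantly(1.92, 4.32, cd))
```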
As shown in Figure 7a, the algorithms clearly fall into two groups in terms of 0–1 loss: one group contains AKDB and TAN, and the other contains the remaining algorithms. AKDB ranks first, although its advantage over TAN is not significant. AKDB enjoys a significant 0–1 loss advantage relative to K2DB, K1DB, NB and NB-FSS, which supports the effectiveness of the proposed information-threshold based techniques. As shown in Figure 7b, when RMSE is compared, AKDB still achieves a lower mean rank than the other algorithms, although the differences between AKDB, K1DB and K2DB are not significant.

5. Discussion

KDB is a form of restricted BNC, and weak direct dependencies and conditional dependencies may exist in KDB and may be redundant. To alleviate this potential redundancy, we develop an extension to KDB, called AKDB, which applies feature selection and conditional dependence selection to remove redundant features and conditional dependencies. The two techniques presented in this paper, MI-based feature selection and CMI-based dependence selection, are based on adaptive thresholding. They are designed to iteratively identify relevant features and conditional dependencies in certain circumstances, and they combine the characteristics of filter and wrapper approaches. Both techniques are efficient and complementary. Through experiments on 30 UCI datasets and comparisons with other state-of-the-art BNCs, we show that adaptive thresholding can help select the most relevant features and conditional dependencies while improving classification performance. On average, 72.4% of the features and 59.6% of the conditional dependencies are selected to build the network structure of AKDB. Overall, AKDB achieves a significant advantage over KDB in terms of 0–1 loss, with an 8.54% reduction on average. The statistical significance of the experimental results is further confirmed by the Friedman test and the Nemenyi test.

6. Conclusions

To efficiently identify non-significant direct and conditional dependencies, we investigate two techniques for extending KDB: MI-based feature selection and CMI-based dependence selection, both based on adaptive thresholding. These techniques combine the characteristics of filter and wrapper approaches; when applied to KDB, each is efficient at filtering out redundancy and can help improve classification performance. The extensive experimental results show that the final classifier, AKDB, significantly outperforms several state-of-the-art BNCs, including NB, TAN and KDB.

Author Contributions

All authors have contributed to the study and preparation of the article. Y.Z. conceived the idea, derived equations and wrote the paper. L.W., Z.D. and M.S. did the analysis and finished the programming work. All authors have read and approved the final manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 61272209, 61872164), the Agreement of Science and Technology Development Project, Jilin Province (20150101014JC), and the Fundamental Research Funds for the Central Universities.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. The values of $I(X_i; C)$ for each feature in K2DB on Hypo dataset.

| No. | MI | No. | MI | No. | MI | No. | MI | No. | MI |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0000 | 7 | 0.0002 | 13 | 0.0009 | 19 | 0.0020 | 25 | 0.0123 |
| 2 | 0.0000 | 8 | 0.0004 | 14 | 0.0012 | 20 | 0.0030 | 26 | 0.0337 |
| 3 | 0.0000 | 9 | 0.0005 | 15 | 0.0017 | 21 | 0.0052 | 27 | 0.1425 |
| 4 | 0.0000 | 10 | 0.0006 | 16 | 0.0017 | 22 | 0.0062 | 28 | 0.1580 |
| 5 | 0.0001 | 11 | 0.0007 | 17 | 0.0018 | 23 | 0.0065 | 29 | 0.3528 |
| 6 | 0.0001 | 12 | 0.0007 | 18 | 0.0019 | 24 | 0.0105 | | |
Table A2. The values of $I(X_i; C)$ for each feature in K2DB on Waveform dataset.

| No. | MI | No. | MI | No. | MI | No. | MI | No. | MI |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0000 | 6 | 0.5847 | 11 | 0.6348 | 16 | 0.6588 | 21 | 0.7497 |
| 2 | 0.0000 | 7 | 0.6014 | 12 | 0.6379 | 17 | 0.7023 | | |
| 3 | 0.4891 | 8 | 0.6020 | 13 | 0.6439 | 18 | 0.7111 | | |
| 4 | 0.4960 | 9 | 0.6295 | 14 | 0.6446 | 19 | 0.7400 | | |
| 5 | 0.5801 | 10 | 0.6305 | 15 | 0.6550 | 20 | 0.7424 | | |
Table A3. The results of $I(X_i; X_j \mid C)$ for each feature pair in K2DB on Hypo dataset.

| No. | CMI | No. | CMI | No. | CMI | No. | CMI | No. | CMI |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0000 | 12 | 0.0035 | 23 | 0.0092 | 34 | 0.0355 | 45 | 0.2331 |
| 2 | 0.0000 | 13 | 0.0044 | 24 | 0.0092 | 35 | 0.0471 | 46 | 0.3031 |
| 3 | 0.0000 | 14 | 0.0048 | 25 | 0.0098 | 36 | 0.0864 | 47 | 0.3354 |
| 4 | 0.0000 | 15 | 0.0049 | 26 | 0.0121 | 37 | 0.1153 | 48 | 0.4571 |
| 5 | 0.0000 | 16 | 0.0058 | 27 | 0.0146 | 38 | 0.1189 | 49 | 0.4782 |
| 6 | 0.0000 | 17 | 0.0058 | 28 | 0.0154 | 39 | 0.1240 | 50 | 0.4786 |
| 7 | 0.0018 | 18 | 0.0064 | 29 | 0.0241 | 40 | 0.1313 | 51 | 0.4834 |
| 8 | 0.0019 | 19 | 0.0073 | 30 | 0.0262 | 41 | 0.1361 | 52 | 0.4852 |
| 9 | 0.0023 | 20 | 0.0073 | 31 | 0.0279 | 42 | 0.1390 | 53 | 0.4912 |
| 10 | 0.0024 | 21 | 0.0076 | 32 | 0.0279 | 43 | 0.1855 | 54 | 0.5099 |
| 11 | 0.0032 | 22 | 0.0090 | 33 | 0.0286 | 44 | 0.2007 | 55 | 0.7263 |
Table A4. The results of $I(X_i; X_j \mid C)$ for each feature pair in K2DB on Waveform dataset.

| No. | CMI | No. | CMI | No. | CMI | No. | CMI | No. | CMI |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0000 | 9 | 0.0029 | 17 | 0.1024 | 25 | 0.2741 | 33 | 0.4917 |
| 2 | 0.0000 | 10 | 0.0530 | 18 | 0.1408 | 26 | 0.3046 | 34 | 0.5233 |
| 3 | 0.0000 | 11 | 0.0580 | 19 | 0.1463 | 27 | 0.3077 | 35 | 0.5291 |
| 4 | 0.0000 | 12 | 0.0750 | 20 | 0.1510 | 28 | 0.3922 | 36 | 0.5318 |
| 5 | 0.0016 | 13 | 0.0872 | 21 | 0.1548 | 29 | 0.4092 | 37 | 0.5449 |
| 6 | 0.0016 | 14 | 0.0937 | 22 | 0.1611 | 30 | 0.4115 | 38 | 0.5748 |
| 7 | 0.0017 | 15 | 0.0969 | 23 | 0.1612 | 31 | 0.4564 | 39 | 0.5847 |
| 8 | 0.0022 | 16 | 0.0973 | 24 | 0.2475 | 32 | 0.4752 | | |
Table A5. Experimental results of 0–1 loss.

| Dataset | NB | NB-FSS | TAN | K1DB | K2DB | KDB-FS | KDB-DS | AKDB |
|---|---|---|---|---|---|---|---|---|
| Echocardiogram | 0.3359 | 0.3664 | 0.3282 | 0.3053 | 0.3435 | 0.3435 | 0.3206 | 0.3206 ∘ |
| Lymphography | 0.1486 | 0.1689 | 0.1757 | 0.1757 | 0.2365 | 0.1757 | 0.2095 | 0.1554 ∘ |
| Iris | 0.0867 | 0.0600 | 0.0800 | 0.0867 | 0.0867 | 0.0667 | 0.0867 | 0.0733 ∘ |
| Hepatitis | 0.1935 | 0.1677 | 0.1677 | 0.1548 | 0.1871 | 0.1806 | 0.1871 | 0.1419 ∘ |
| Autos | 0.3122 | 0.3561 | 0.2146 | 0.2146 | 0.2049 | 0.1951 | 0.2049 | 0.1951 |
| Glass-id | 0.2617 | 0.2430 | 0.2196 | 0.2243 | 0.2196 | 0.2009 | 0.2196 | 0.1963 ∘ |
| Heart | 0.1778 | 0.1741 | 0.1926 | 0.1963 | 0.2111 | 0.1926 | 0.1926 | 0.1630 ∘ |
| Primary-tumor | 0.5457 | 0.5398 | 0.5428 | 0.5693 | 0.5723 | 0.5693 | 0.5723 | 0.5428 ∘ |
| Ionosphere | 0.1054 | 0.0826 | 0.0684 | 0.0741 | 0.0741 | 0.0741 | 0.0712 | 0.0712 |
| Musk1 | 0.1660 | 0.1450 | 0.1134 | 0.1113 | 0.1155 | 0.1071 | 0.1034 | 0.1029 ∘ |
| Balance-scale | 0.2720 | 0.3648 | 0.2736 | 0.2816 | 0.2784 | 0.2720 | 0.2700 | 0.2800 |
| Soybean | 0.0893 | 0.0952 | 0.0469 | 0.0644 | 0.0556 | 0.0556 | 0.0556 | 0.0527 ∘ |
| Credit-a | 0.1406 | 0.1377 | 0.1507 | 0.1551 | 0.1464 | 0.1435 | 0.1464 | 0.1420 |
| Breast-cancer-w | 0.0258 | 0.0258 | 0.0415 | 0.0486 | 0.0744 | 0.0601 | 0.0715 | 0.0472 ∘ |
| Vehicle | 0.3924 | 0.4054 | 0.2943 | 0.3014 | 0.2943 | 0.2778 | 0.2943 | 0.3014 |
| German | 0.2530 | 0.2660 | 0.2730 | 0.2760 | 0.2890 | 0.2790 | 0.2810 | 0.2590 ∘ |
| Yeast | 0.4239 | 0.4239 | 0.4171 | 0.4218 | 0.4387 | 0.4387 | 0.4333 | 0.4218 |
| Splice-c4.5 | 0.0444 | 0.0381 | 0.0466 | 0.0482 | 0.0941 | 0.0469 | 0.0910 | 0.0469 ∘ |
| Dis | 0.0159 | 0.0154 | 0.0159 | 0.0146 | 0.0138 | 0.0130 | 0.0138 | 0.0130 ∘ |
| Hypo | 0.0138 | 0.0244 | 0.0141 | 0.0077 | 0.0114 | 0.0090 | 0.0098 | 0.0077 ∘ |
| Spambase | 0.1015 | 0.0765 | 0.0669 | 0.0765 | 0.0635 | 0.0635 | 0.0628 | 0.0752 • |
| Phoneme | 0.2615 | 0.2477 | 0.2733 | 0.2120 | 0.1984 | 0.1984 | 0.1984 | 0.1984 |
| Page-blocks | 0.0619 | 0.0442 | 0.0415 | 0.0433 | 0.0391 | 0.0378 | 0.0373 | 0.0391 |
| Optdigits | 0.0767 | 0.0788 | 0.0407 | 0.0416 | 0.0372 | 0.0352 | 0.0370 | 0.0358 |
| Mushrooms | 0.0196 | 0.0148 | 0.0001 | 0.0006 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Magic | 0.2239 | 0.2132 | 0.1675 | 0.1742 | 0.1637 | 0.1637 | 0.1636 | 0.1636 |
| Adult | 0.1592 | 0.1656 | 0.1380 | 0.1385 | 0.1383 | 0.1338 | 0.1383 | 0.1338 |
| Shuttle | 0.0039 | 0.0040 | 0.0015 | 0.0015 | 0.0009 | 0.0009 | 0.0009 | 0.0009 |
| Connect-4 | 0.2783 | 0.2999 | 0.2354 | 0.2406 | 0.2283 | 0.2282 | 0.2283 | 0.2283 |
| Waveform | 0.0220 | 0.0273 | 0.0202 | 0.0226 | 0.0256 | 0.0193 | 0.0194 | 0.0196 ∘ |

∘ and • denote significant improvement and degradation, respectively, of AKDB over K2DB.
Table A6. Experimental results of RMSE.

| Dataset | NB | NB-FSS | TAN | K1DB | K2DB | AKDB |
|---|---|---|---|---|---|---|
| Echocardiogram | 0.4896 | 0.4823 | 0.4886 | 0.4846 | 0.4889 | 0.4807 |
| Lymphography | 0.3465 | 0.3505 | 0.3813 | 0.3726 | 0.4362 | 0.4076 |
| Iris | 0.2545 | 0.2158 | 0.2441 | 0.2435 | 0.2447 | 0.2224 |
| Hepatitis | 0.3901 | 0.3770 | 0.3610 | 0.3559 | 0.3875 | 0.3823 |
| Autos | 0.5190 | 0.5330 | 0.4475 | 0.4460 | 0.4399 | 0.4380 |
| Glass-id | 0.4353 | 0.4325 | 0.4109 | 0.4223 | 0.4205 | 0.4105 |
| Heart | 0.3651 | 0.3579 | 0.3771 | 0.3752 | 0.3949 | 0.3773 |
| Primary-tumor | 0.7084 | 0.7159 | 0.7170 | 0.7190 | 0.7262 | 0.7092 |
| Ionosphere | 0.0856 | 0.0538 | 0.2615 | 0.0621 | 0.0499 | 0.0561 |
| Musk1 | 0.3972 | 0.3839 | 0.3022 | 0.3034 | 0.3058 | 0.3034 |
| Balance-scale | 0.4431 | 0.5448 | 0.4344 | 0.4384 | 0.4323 | 0.4605 |
| Soybean | 0.2945 | 0.3845 | 0.2014 | 0.2206 | 0.2063 | 0.2223 |
| Credit-a | 0.3342 | 0.3179 | 0.3411 | 0.3400 | 0.3525 | 0.3391 |
| Breast-cancer-w | 0.1570 | 0.1570 | 0.1928 | 0.1951 | 0.2497 | 0.2199 |
| Vehicle | 0.5736 | 0.5663 | 0.4593 | 0.4623 | 0.4591 | 0.4419 |
| German | 0.4945 | 0.4212 | 0.5000 | 0.4991 | 0.5053 | 0.4644 |
| Yeast | 0.5987 | 0.5987 | 0.5994 | 0.5997 | 0.6035 | 0.6035 |
| Splice-c4.5 | 0.1883 | 0.2030 | 0.1917 | 0.1944 | 0.2756 | 0.1848 |
| Dis | 0.1177 | 0.1104 | 0.1103 | 0.1072 | 0.1024 | 0.1024 |
| Hypo | 0.1105 | 0.1401 | 0.1050 | 0.0881 | 0.0955 | 0.0863 |
| Spambase | 0.2994 | 0.3939 | 0.2403 | 0.2480 | 0.2300 | 0.2300 |
| Phoneme | 0.4792 | 0.4632 | 0.5048 | 0.4385 | 0.4195 | 0.4195 |
| Page-blocks | 0.2331 | 0.1923 | 0.1894 | 0.1940 | 0.1811 | 0.1781 |
| Optdigits | 0.2637 | 0.2893 | 0.1906 | 0.1937 | 0.1806 | 0.1736 |
| Mushrooms | 0.1229 | 0.1083 | 0.0083 | 0.0188 | 0.0001 | 0.0001 |
| Magic | 0.3974 | 0.3802 | 0.3461 | 0.3509 | 0.3470 | 0.3470 |
| Adult | 0.3409 | 0.3345 | 0.3076 | 0.3071 | 0.3089 | 0.3047 |
| Shuttle | 0.0561 | 0.0674 | 0.0356 | 0.0367 | 0.0290 | 0.0279 |
| Connect-4 | 0.4787 | 0.5024 | 0.4435 | 0.4480 | 0.4336 | 0.4206 |
| Waveform | 0.1441 | 0.1499 | 0.1164 | 0.1285 | 0.1402 | 0.1253 |
Table A7. Experimental results of Bias.

| Dataset | NB | NB-FSS | TAN | K1DB | K2DB | AKDB |
|---|---|---|---|---|---|---|
| Splice-c4.5 | 0.0341 | 0.0355 | 0.0444 | 0.0358 | 0.0968 | 0.0353 |
| Dis | 0.0160 | 0.0191 | 0.0188 | 0.0174 | 0.0171 | 0.0190 |
| Hypo | 0.0098 | 0.0177 | 0.0101 | 0.0083 | 0.0072 | 0.0077 |
| Spambase | 0.0965 | 0.0735 | 0.0656 | 0.0665 | 0.0504 | 0.0589 |
| Phoneme | 0.2284 | 0.2004 | 0.2470 | 0.1740 | 0.1599 | 0.1572 |
| Page-blocks | 0.0409 | 0.0363 | 0.0331 | 0.0342 | 0.028 | 0.0286 |
| Optdigits | 0.0655 | 0.0685 | 0.0308 | 0.0313 | 0.0285 | 0.0235 |
| Mushrooms | 0.0399 | 0.0148 | 0.0002 | 0.0011 | 0.0002 | 0.0000 |
| Magic | 0.1987 | 0.1942 | 0.1357 | 0.1451 | 0.1321 | 0.1292 |
| Adult | 0.1485 | 0.1880 | 0.1125 | 0.1117 | 0.1135 | 0.1236 |
| Shuttle | 0.0066 | 0.0036 | 0.0023 | 0.0026 | 0.0028 | 0.0008 |
| Connect-4 | 0.2327 | 0.2959 | 0.1829 | 0.1882 | 0.1788 | 0.2069 |
| Waveform | 0.0314 | 0.0257 | 0.0138 | 0.0154 | 0.0180 | 0.0164 |
Table A8. Experimental results of Variance.

| Dataset | NB | NB-FSS | TAN | K1DB | K2DB | AKDB |
|---|---|---|---|---|---|---|
| Splice-c4.5 | 0.0095 | 0.0051 | 0.0296 | 0.0357 | 0.0813 | 0.0572 |
| Dis | 0.0091 | 0.0000 | 0.0009 | 0.0021 | 0.0025 | 0.0012 |
| Hypo | 0.0063 | 0.0033 | 0.0078 | 0.0066 | 0.0059 | 0.0069 |
| Spambase | 0.0104 | 0.0070 | 0.0171 | 0.0176 | 0.0238 | 0.0178 |
| Phoneme | 0.1831 | 0.0783 | 0.2496 | 0.1710 | 0.1490 | 0.1064 |
| Page-blocks | 0.0128 | 0.0110 | 0.0142 | 0.0171 | 0.0185 | 0.0161 |
| Optdigits | 0.0247 | 0.0156 | 0.0280 | 0.0290 | 0.0322 | 0.0227 |
| Mushrooms | 0.0081 | 0.0000 | 0.0006 | 0.0013 | 0.0005 | 0.0002 |
| Magic | 0.0409 | 0.0284 | 0.0792 | 0.0744 | 0.0818 | 0.0453 |
| Adult | 0.0355 | 0.0304 | 0.0640 | 0.0652 | 0.0717 | 0.0196 |
| Shuttle | 0.0038 | 0.0004 | 0.0008 | 0.0016 | 0.0021 | 0.0004 |
| Connect-4 | 0.0953 | 0.0037 | 0.0883 | 0.0956 | 0.1044 | 0.0294 |
| Waveform | 0.0044 | 0.0009 | 0.0119 | 0.0110 | 0.0102 | 0.0023 |

References

  1. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann: San Francisco, CA, USA, 1988; pp. 29–75.
  2. Martínez, A.M.; Webb, G.I.; Chen, S.; Zaidi, N.A. Scalable learning of Bayesian network classifiers. J. Mach. Learn. Res. 2016, 17, 1515–1549.
  3. Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification and Scene Analysis; Wiley-Interscience: New York, NY, USA, 1973; pp. 16–22.
  4. Minsky, M. Steps toward artificial intelligence. Proc. Inst. Radio Eng. 1961, 49, 8–30.
  5. Lewis, D.D. Naive Bayes at Forty: The Independence Assumption in Information Retrieval. In Proceedings of the 10th European Conference on Machine Learning (ECML), Chemnitz, Germany, 21–23 April 1998; pp. 4–15.
  6. Friedman, N.; Geiger, D.; Goldszmidt, M. Bayesian network classifiers. Mach. Learn. 1997, 29, 131–163.
  7. Sahami, M. Learning Limited Dependence Bayesian Classifiers. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD), Portland, OR, USA, 2–4 August 1996; pp. 335–338.
  8. Kohavi, R.; John, G.H. Wrappers for feature subset selection. Artif. Intell. 1997, 91, 273–324.
  9. Blum, A.L.; Langley, P. Selection of relevant features and examples in machine learning. Artif. Intell. 1997, 97, 245–271.
  10. Yu, L.; Liu, H. Efficient feature selection via analysis of relevance and redundancy. J. Mach. Learn. Res. 2004, 5, 1205–1224.
  11. Liu, H.; Yu, L. Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. Knowl. Data Eng. 2005, 4, 491–502.
  12. Bielza, C.; Larrañaga, P. Discrete Bayesian network classifiers: A survey. ACM Comput. Surv. 2014, 47, 5.
  13. Blanco, R.; Inza, I.; Merino, M.; Quiroga, J.; Larrañaga, P. Feature selection in Bayesian classifiers for the prognosis of survival of cirrhotic patients treated with TIPS. J. Biomed. Inform. 2005, 38, 376–388.
  14. Lee, S.; Park, Y.T.; d’Auriol, B.J. A novel feature selection method based on normalized mutual information. Appl. Intell. 2012, 37, 100–120.
  15. Aghdam, M.H.; Ghasem-Aghaee, N.; Basiri, M.E. Text feature selection using ant colony optimization. Expert Syst. Appl. 2009, 36, 6843–6853.
  16. Kabir, M.M.; Shahjahan, M.; Murase, K. A new local search based hybrid genetic algorithm for feature selection. Neurocomputing 2011, 74, 2914–2928.
  17. Kittler, J. Feature selection and extraction. In Handbook of Pattern Recognition and Image Processing; Academic Press: New York, NY, USA, 1986.
  18. Langley, P.; Sage, S. Induction of selective Bayesian classifiers. In Proceedings of the 10th Annual Conference on Uncertainty in Artificial Intelligence (UAI), Seattle, WA, USA, 29–31 July 1994; pp. 399–406.
  19. Kashef, S.; Nezamabadi-pour, H. An advanced ACO algorithm for feature subset selection. Neurocomputing 2015, 147, 271–279.
  20. Basiri, M.E.; Ghasem-Aghaee, N.; Aghdam, M.H. Using ant colony optimization-based selected features for predicting post-synaptic activity in proteins. In Proceedings of the 6th European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics (EvoBIO), Naples, Italy, 26–28 March 2008; pp. 12–23.
  21. Wang, L.; Chen, S.; Mammadov, M. Target Learning: A Novel Framework to Mine Significant Dependencies for Unlabeled Data. In Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Melbourne, VIC, Australia, 3–6 June 2018; pp. 106–117.
  22. Liu, H.; Zhou, S.; Lam, W.; Guan, J. A new hybrid method for learning Bayesian networks: Separation and reunion. Knowl.-Based Syst. 2017, 121, 185–197.
  23. Cooper, G.F. The computational complexity of probabilistic inference using Bayesian belief networks. Artif. Intell. 1990, 42, 393–405.
  24. Dagum, P.; Luby, M. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artif. Intell. 1993, 60, 141–153.
  25. Langley, P.; Iba, W.; Thompson, K. An analysis of Bayesian classifiers. AAAI 1992, 90, 223–228.
  26. Lee, L.H.; Isa, D. Automatically computed document dependent weighting factor facility for Naïve Bayes classification. Expert Syst. Appl. 2010, 37, 8471–8478.
  27. Rodríguez, J.D.; Lozano, J.A. Multi-objective learning of multi-dimensional Bayesian classifiers. In Proceedings of the 8th International Conference on Hybrid Intelligent Systems (HIS), Barcelona, Spain, 10–12 September 2008; pp. 501–506.
  28. Louzada, F.; Ara, A. Bagging k-dependence probabilistic networks: An alternative powerful fraud detection tool. Expert Syst. Appl. 2012, 39, 11583–11592.
  29. Aliferis, C.F.; Statnikov, A.; Tsamardinos, I.; Mani, S.; Koutsoukos, X.D. Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation. J. Mach. Learn. Res. 2010, 11, 171–234.
  30. Besson, P.; Richiardi, J.; Bourdin, C.; Bringoux, L.; Mestre, D.R.; Vercher, J.L. Bayesian networks and information theory for audio-visual perception modeling. Biol. Cybern. 2010, 103, 213–226.
  31. Cheng, J.; Greiner, R. Comparing Bayesian network classifiers. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI), Stockholm, Sweden, 30 July–1 August 1999; pp. 101–108.
  32. Cheng, J.; Greiner, R.; Kelly, J.; Bell, D.; Liu, W. Learning Bayesian Network from Data: An Information-Theory Based Approach. Artif. Intell. 2002, 137, 43–90.
  33. Koller, D.; Sahami, M. Toward optimal feature selection. In Proceedings of the 13th International Conference on Machine Learning (ICML), Bari, Italy, 3–6 July 1996; pp. 284–292.
  34. Bache, K.; Lichman, M. UCI Machine Learning Repository. Available online: http://archive.ics.uci.edu/ml/index.php (accessed on 8 July 2019).
  35. Fayyad, U.M.; Irani, K.B. Multi-interval Discretization of Continuous-Valued Attributes for Classification Learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI), Chambery, France, 28 August–3 September 1993; pp. 1022–1029.
  36. Cestnik, B. Estimating probabilities: A crucial task in machine learning. In Proceedings of the Ninth European Conference on Artificial Intelligence (ECAI), Stockholm, Sweden, 6–10 August 1990; pp. 147–149.
  37. Kohavi, R.; Wolpert, D. Bias plus variance decomposition for zero-one loss functions. In Proceedings of the 13th International Conference on Machine Learning (ICML), Bari, Italy, 3–6 July 1996; pp. 275–283.
  38. Webb, G.I. Multiboosting: A technique for combining boosting and wagging. Mach. Learn. 2000, 40, 159–196.
  39. Zaidi, N.A.; Cerquides, J.; Webb, G.I. Alleviating naive Bayes attribute independence assumption by attribute weighting. J. Mach. Learn. Res. 2013, 14, 1947–1988.
  40. Zheng, F.; Webb, G.I. Finding the right family: Parent and child selection for averaged one-dependence estimators. In Proceedings of the 8th European Conference on Machine Learning (ECML), Warsaw, Poland, 17–21 September 2007; pp. 490–501.
  41. Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 1937, 32, 675–701.
  42. Nemenyi, P. Distribution-Free Multiple Comparisons. Ph.D. Thesis, Princeton University, Princeton, NJ, USA, 1963.
Figure 1. Examples of network structures with four features for KDB.
Figure 2. The distributions of (a) MI; (b) CMI values on dataset Connect-4. Note that the MI and CMI values are sorted in descending order.
Figure 3. The MI values for K2DB on datasets Hypo and Waveform. Note that features are sorted in ascending order of $I(X_i; C)$.
Figure 4. The CMI values for K2DB on datasets Hypo and Waveform. Note that the conditional dependencies are sorted in ascending order of $I(X_i; X_j \mid C)$.
Figure 5. The comparison results of $R_M$(KDB-FS|K2DB) and $R_M$(KDB-DS|K2DB).
Figure 6. The results of ranking in terms of 0–1 loss and RMSE for alternative algorithms.
Figure 7. The results of Nemenyi tests in terms of 0–1 loss and RMSE for alternative algorithms.
Table 1. List of acronyms used.

| Notation | Description |
|---|---|
| MI | mutual information |
| CMI | conditional mutual information |
| BNCs | Bayesian network classifiers |
| BNC | Bayesian network classifier |
| B | a BNC |
| LOOCV | leave-one-out cross validation |
| RMSE | root mean squared error |
| AMI | the average MI |
| ACMI | the average CMI |
| FS | feature selection |
| DS | dependence selection |
| MDL | Minimum Description Length |
| $N_F$ | the number of features in BNC |
| $N_D$ | the number of conditional dependencies in BNC |
| SMI | the sum of MI |
| SCMI | the sum of CMI |
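Table 2. Description of the datasets used in the experiments.

| No. | Dataset | Instance | Feature | Class | No. | Dataset | Instance | Feature | Class |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Echocardiogram | 131 | 6 | 2 | 16 | German | 1000 | 20 | 2 |
| 2 | Lymphography | 148 | 18 | 4 | 17 | Yeast | 1484 | 8 | 10 |
| 3 | Iris | 150 | 4 | 3 | 18 | Splice-c4.5 | 3177 | 60 | 3 |
| 4 | Hepatitis | 155 | 19 | 2 | 19 | Dis | 3772 | 29 | 2 |
| 5 | Autos | 205 | 25 | 7 | 20 | Hypo | 3772 | 29 | 4 |
| 6 | Glass Identification | 214 | 9 | 3 | 21 | Spambase | 4601 | 57 | 2 |
| 7 | Heart | 270 | 13 | 2 | 22 | Phoneme | 5438 | 7 | 50 |
| 8 | Primary Tumor | 339 | 17 | 22 | 23 | Page-blocks | 5473 | 10 | 5 |
| 9 | Ionosphere | 351 | 34 | 2 | 24 | Optdigits | 5620 | 64 | 10 |
| 10 | Musk1 | 476 | 166 | 2 | 25 | Mushroom | 8124 | 22 | 2 |
| 11 | Balance-scale | 625 | 4 | 3 | 26 | Magic | 19,020 | 10 | 2 |
| 12 | Soybean | 683 | 35 | 19 | 27 | Adult | 48,842 | 14 | 2 |
| 13 | Credit-a | 690 | 15 | 2 | 28 | Shuttle | 58,000 | 9 | 7 |
| 14 | Breast-cancer-w | 699 | 9 | 2 | 29 | Connect-4 | 67,557 | 42 | 3 |
| 15 | Vehicle | 846 | 18 | 4 | 30 | Waveform | 100,000 | 21 | 3 |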
Table 3. The 0–1 loss results of AKDB for all datasets with different $\delta$ values.

| Dataset | δ = 5% | δ = 10% | δ = 20% | δ = 30% | δ = 40% | δ = 50% | δ = 60% | δ = 70% | δ = 80% | δ = 90% |
|---|---|---|---|---|---|---|---|---|---|---|
| Echocardiogram | 0.3740 | 0.3206 | 0.3359 | 0.3511 | 0.3588 | 0.3664 | 0.3740 | 0.3664 | 0.3664 | 0.3664 |
| Lymphography | 0.2365 | 0.1554 | 0.2432 | 0.2095 | 0.2162 | 0.2027 | 0.2568 | 0.2027 | 0.2162 | 0.2500 |
| Iris | 0.0867 | 0.0733 | 0.0867 | 0.0867 | 0.0800 | 0.0767 | 0.0767 | 0.0800 | 0.0733 | 0.0733 |
| Hepatitis | 0.1677 | 0.1419 | 0.2129 | 0.1935 | 0.2323 | 0.2129 | 0.2000 | 0.2323 | 0.2323 | 0.2323 |
| Autos | 0.2098 | 0.1951 | 0.2000 | 0.2098 | 0.2098 | 0.1951 | 0.2195 | 0.2000 | 0.2000 | 0.2000 |
| Glass-Id | 0.2103 | 0.1963 | 0.2103 | 0.2150 | 0.2196 | 0.2243 | 0.2150 | 0.2196 | 0.2103 | 0.2150 |
| Heart | 0.1963 | 0.1630 | 0.1963 | 0.1963 | 0.1963 | 0.1889 | 0.1852 | 0.1852 | 0.1815 | 0.1815 |
| Primary-Tumor | 0.5693 | 0.5428 | 0.5988 | 0.5988 | 0.5693 | 0.5664 | 0.5811 | 0.5841 | 0.5752 | 0.5782 |
| Ionosphere | 0.0912 | 0.0712 | 0.0883 | 0.0912 | 0.0855 | 0.0769 | 0.0912 | 0.0940 | 0.0826 | 0.0940 |
| Musk1 | 0.1176 | 0.1029 | 0.1134 | 0.1155 | 0.1155 | 0.1092 | 0.1197 | 0.1197 | 0.1218 | 0.1261 |
| Balance-Scale | 0.2784 | 0.2800 | 0.2784 | 0.2784 | 0.2784 | 0.2784 | 0.2784 | 0.2784 | 0.2784 | 0.2752 |
| Soybean | 0.0556 | 0.0527 | 0.0556 | 0.0600 | 0.0571 | 0.0630 | 0.0571 | 0.0732 | 0.0761 | 0.1098 |
| Credit-A | 0.1681 | 0.1420 | 0.1551 | 0.1609 | 0.1768 | 0.1638 | 0.1493 | 0.1623 | 0.1536 | 0.1522 |
| Breast-Cancer-W | 0.0744 | 0.0472 | 0.0544 | 0.0644 | 0.0730 | 0.0758 | 0.0758 | 0.0758 | 0.0758 | 0.0601 |
| Vehicle | 0.2983 | 0.3014 | 0.2996 | 0.2986 | 0.2990 | 0.3002 | 0.3168 | 0.3109 | 0.3322 | 0.3310 |
| German | 0.2920 | 0.2590 | 0.2700 | 0.2920 | 0.2880 | 0.2890 | 0.2880 | 0.2940 | 0.2950 | 0.2940 |
| Yeast | 0.4387 | 0.4218 | 0.4447 | 0.4468 | 0.4461 | 0.4461 | 0.4501 | 0.4569 | 0.4616 | 0.4778 |
| Splice-C4.5 | 0.0853 | 0.0469 | 0.0661 | 0.0585 | 0.0516 | 0.0529 | 0.0475 | 0.0475 | 0.0468 | 0.0468 |
| Dis | 0.0151 | 0.0130 | 0.0146 | 0.0154 | 0.0151 | 0.0146 | 0.0138 | 0.0143 | 0.0146 | 0.0151 |
| Hypo | 0.0130 | 0.0077 | 0.0130 | 0.0103 | 0.0170 | 0.0217 | 0.0225 | 0.0233 | 0.0225 | 0.0233 |
| Spambase | 0.0762 | 0.0752 | 0.0767 | 0.0796 | 0.0761 | 0.0776 | 0.0785 | 0.0795 | 0.0787 | 0.0813 |
| Phoneme | 0.1984 | 0.1984 | 0.1984 | 0.1984 | 0.2896 | 0.2602 | 0.2655 | 0.2519 | 0.2589 | 0.2758 |
| Page-Blocks | 0.0391 | 0.0391 | 0.0391 | 0.0391 | 0.0391 | 0.0376 | 0.0380 | 0.0389 | 0.0402 | 0.0386 |
| Optdigits | 0.0438 | 0.0358 | 0.0372 | 0.0368 | 0.0391 | 0.0388 | 0.0374 | 0.0368 | 0.0400 | 0.0418 |
| Mushrooms | 0.0004 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0004 | 0.0000 | 0.0000 | 0.0010 | 0.0011 |
| Magic | 0.1637 | 0.1636 | 0.1722 | 0.1900 | 0.1904 | 0.1914 | 0.1909 | 0.1906 | 0.1917 | 0.1897 |
| Adult | 0.1375 | 0.1338 | 0.1338 | 0.1337 | 0.1347 | 0.1347 | 0.1348 | 0.1409 | 0.1421 | 0.1413 |
| Shuttle | 0.0009 | 0.0009 | 0.0018 | 0.0018 | 0.0018 | 0.0021 | 0.0018 | 0.0018 | 0.0018 | 0.0019 |
| Connect-4 | 0.2294 | 0.2283 | 0.2442 | 0.2499 | 0.2535 | 0.2591 | 0.2575 | 0.2592 | 0.2652 | 0.2692 |
| Waveform | 0.0193 | 0.0196 | 0.0193 | 0.0195 | 0.0194 | 0.0194 | 0.0194 | 0.0194 | 0.0194 | 0.0234 |
The lowest 0–1 loss results for datasets are shown in bold.
Table 4. W/D/L records of 0–1 loss on all datasets.

| W/D/L | NB | NB-FSS | TAN | K1DB | K2DB |
|---|---|---|---|---|---|
| NB-FSS | 10/12/8 | | | | |
| TAN | 17/8/5 | 18/6/6 | | | |
| K1DB | 19/5/6 | 19/5/6 | 5/18/7 | | |
| K2DB | 17/7/6 | 20/1/9 | 8/11/11 | 11/11/8 | |
| AKDB | 21/7/2 | 22/5/3 | 16/11/3 | 19/11/0 | 15/14/1 |
Table 5. W/D/L records of RMSE on all datasets.

| W/D/L | NB | NB-FSS | TAN | K1DB | K2DB |
|---|---|---|---|---|---|
| NB-FSS | 6/18/6 | | | | |
| TAN | 17/10/3 | 14/9/7 | | | |
| K1DB | 18/10/2 | 15/10/5 | 2/25/3 | | |
| K2DB | 16/9/5 | 16/7/7 | 6/19/5 | 6/18/6 | |
| AKDB | 20/8/2 | 18/6/6 | 10/13/7 | 8/18/4 | 7/21/2 |
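Table 6. W/D/L records of bias and variance on large datasets.

| Measure | W/D/L | NB | NB-FSS | TAN | K1DB | K2DB |
|---|---|---|---|---|---|---|
| Bias | NB-FSS | 6/3/4 | | | | |
| Bias | TAN | 9/1/3 | 10/1/2 | | | |
| Bias | K1DB | 11/1/1 | 12/1/0 | 4/5/4 | | |
| Bias | K2DB | 11/0/2 | 12/0/1 | 6/4/3 | 7/3/3 | |
| Bias | AKDB | 11/1/1 | 11/2/0 | 8/2/3 | 8/1/4 | 5/3/5 |
| Variance | NB-FSS | 13/0/0 | | | | |
| Variance | TAN | 4/0/9 | 0/0/13 | | | |
| Variance | K1DB | 4/2/7 | 0/0/12 | 4/3/6 | | |
| Variance | K2DB | 5/0/8 | 0/0/13 | 4/1/8 | 4/0/9 | |
| Variance | AKDB | 8/0/5 | 1/1/11 | 9/1/3 | 10/2/1 | 12/0/1 |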
Table 7. W/D/L records of 0–1 loss on large datasets.

| W/D/L | NB | NB-FSS | TAN | K1DB | K2DB |
|---|---|---|---|---|---|
| NB-FSS | 5/5/3 | | | | |
| TAN | 9/4/0 | 10/1/2 | | | |
| K1DB | 11/1/1 | 10/2/1 | 3/7/3 | | |
| K2DB | 11/0/2 | 12/0/1 | 8/3/2 | 9/1/3 | |
| AKDB | 12/0/1 | 11/1/1 | 7/5/1 | 9/4/0 | 4/8/1 |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).