Exact Learning Augmented Naive Bayes Classifier

Earlier studies have shown that classification accuracies of Bayesian networks (BNs) obtained by maximizing the conditional log likelihood (CLL) of a class variable, given the feature variables, were higher than those obtained by maximizing the marginal likelihood (ML). However, the differences between the performances of the two scores in the earlier studies may be attributed to the fact that they used approximate learning algorithms, not exact ones. This paper compares the classification accuracies of BNs with approximate learning using CLL to those with exact learning using ML. The results demonstrate that the classification accuracies of BNs obtained by maximizing the ML are higher than those obtained by maximizing the CLL for large data. However, the results also demonstrate that the classification accuracies of exact learning BNs using the ML are much worse than those of the other methods when the sample size is small and the class variable has numerous parents. To resolve this problem, we propose an exact learning augmented naive Bayes classifier (ANB), which ensures that the class variable has no parents. The proposed method is guaranteed to asymptotically estimate the identical class posterior to that of the exactly learned BN. Comparison experiments demonstrated the superior performance of the proposed method.


Introduction
Classification contributes to solving real-world problems. The naive Bayes classifier, in which the feature variables are conditionally independent given a class variable, is a popular classifier [1]. Initially, the naive Bayes was not expected to provide highly accurate classification, because actual data were generated from more complex systems. Therefore, the general Bayesian network (GBN) with learning by marginal likelihood (ML) as a generative model was expected to outperform the naive Bayes, because the GBN is more expressive than the naive Bayes. However, Friedman et al. [2] demonstrated that the naive Bayes sometimes outperformed the GBN using a greedy search to find the smallest minimum description length (MDL) score, which was originally intended to approximate ML. They explained the inferior performance of the MDL by decomposing it into the log likelihood (LL) term, which reflects the model fitting to training data, and the penalty term, which reflects the model complexity. Moreover, they decomposed the LL term into a conditional log likelihood (CLL) of the class variable given the feature variables, which is directly related to the classification, and a joint LL of the feature variables, which is not directly related to the classification. Furthermore, they proposed conditional MDL (CMDL), a modified MDL replacing the LL with the CLL.
Subsequently, Grossman and Domingos [3] claimed that a Bayesian network (BN) minimizing CMDL as a discriminative model showed better accuracy than one maximizing ML. Unfortunately, the CLL has no closed-form equation for estimating the optimal parameters. This implies that optimizing the CLL requires a gradient descent algorithm (e.g., the extended logistic regression algorithm [4]). Moreover, the optimization algorithm must be rerun for each structure candidate, which renders the method computationally expensive. To solve this problem, Friedman et al. [2] proposed an augmented naive Bayes classifier (ANB) in which the class variable directly links to all feature variables, and links among feature variables are allowed. The ANB ensures that all feature variables can contribute to classification. Later, various types of restricted ANBs were proposed, such as the tree-augmented naive Bayes (TAN) [2] and the forest-augmented naive Bayes (FAN) [5].
Because maximization of CLL entails heavy computation, various approximation methods have been proposed to maximize it. Carvalho et al. [6] proposed the approximated CLL (aCLL), which is decomposable and computationally efficient. Grossman and Domingos [3] proposed the BNC2P, which is a greedy learning method with at most two parents per variable using the hill-climbing search by maximizing CLL while estimating parameters by maximizing LL. Mihaljević et al. [7] proposed MC-DAGGES, which reduces the space for the greedy search of BN Classifiers (BNCs) using the CLL score. These reports described that the BNC maximizing the approximated CLL performed better than that maximizing the approximated ML. Nevertheless, they did not explain why CLL outperformed ML. For large data, the classification accuracies presented by maximizing ML are expected to be comparable to those presented by maximizing CLL, because ML has asymptotic consistency. Differences between the performances of the two scores in these studies might depend on their respective learning algorithms; they were approximate learning algorithms, not exact ones.
This study compares the classification performances of the BNC with exact learning using ML as a generative model and those with approximate learning using CLL as a discriminative model. The results show that maximizing ML yields better classification accuracy than maximizing CLL for large data. However, the results also show that classification accuracies obtained by exact learning BNC using ML are much worse than those obtained by other methods when the sample size is small and the class variable has numerous parents in the exactly learned networks. When a class variable has numerous parents, estimation of the conditional probability parameters of the class variable becomes unstable because the number of parent configurations becomes large and the samples available for learning each parameter become sparse.
To solve this problem, this study proposes an exact learning ANB which maximizes ML and ensures that the class variable has no parents. In earlier studies, the ANB constraint was used to learn the BNC as a discriminative model. In contrast, we use the ANB constraint to learn the BNC as a generative model. The proposed method asymptotically learns the optimal ANB, which asymptotically represents the true probability distribution with the fewest parameters among all possible ANB structures. Moreover, the proposed ANB is guaranteed to asymptotically estimate the identical conditional probability of the class variable to that of the exactly learned GBN. Furthermore, learning ANBs has lower computational costs than learning GBNs. Although the main theorem assumes that all feature variables are included in the Markov blanket of the class variable, this assumption does not necessarily hold. To address this problem, we propose a feature selection method using Bayes factor for exact learning of the ANB so as to avoid increasing the computational costs. Comparison experiments show that our method outperforms the other methods.

Background
In this section, we introduce the notation and background material required for our discussion.

Bayesian Network
A BN is a graphical model that represents conditional independence among random variables as a directed acyclic graph (DAG). The BN provides a compact representation of the joint probability distribution because it decomposes the distribution exactly into a product of the conditional probability distributions of the individual variables.
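To make the decomposition concrete, the following minimal sketch (the network shape and all probability values are invented for illustration) evaluates the joint probability of a three-variable BN from its conditional probability tables and checks that the factorization normalizes:

```python
import itertools

# A tiny hypothetical BN over binary variables: X0 -> X1 and X0 -> X2.
# parents[i] lists the parent indices of X_i; cpt[i] maps a pair
# (parent_configuration, state k) to P(X_i = k | Pa_i = j).
parents = {0: (), 1: (0,), 2: (0,)}
cpt = {
    0: {((), 0): 0.6, ((), 1): 0.4},
    1: {((0,), 0): 0.9, ((0,), 1): 0.1, ((1,), 0): 0.2, ((1,), 1): 0.8},
    2: {((0,), 0): 0.7, ((0,), 1): 0.3, ((1,), 0): 0.5, ((1,), 1): 0.5},
}

def joint_probability(x):
    """P(X0 = x[0], ..., Xn = x[n]) = prod_i P(X_i = x[i] | Pa_i)."""
    p = 1.0
    for i, k in enumerate(x):
        j = tuple(x[pa] for pa in parents[i])  # parent configuration of X_i
        p *= cpt[i][(j, k)]
    return p

# The factorization defines a proper distribution: the probabilities of
# all 2^3 joint states sum to 1.
total = sum(joint_probability(x) for x in itertools.product([0, 1], repeat=3))
```

The same loop works for any DAG once `parents` and `cpt` are filled in; only the factorization itself is taken from the text.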
Let V = {X_0, X_1, ..., X_n} be a set of discrete variables, where each X_i (i = 0, ..., n) can take values in the set of states {1, ..., r_i}. One can say X_i = k when X_i takes the state k. According to the BN structure G, the joint probability distribution is represented as

P(X_0, X_1, ..., X_n | G) = ∏_{i=0}^{n} P(X_i | Pa^G_i, G),

where Pa^G_i is the parent variable set of X_i in G. When the structure G is obvious from the context, we use Pa_i to denote the parents. Let θ_ijk be a conditional probability parameter of X_i = k when the j-th instance of the parents of X_i is observed (we can say Pa_i = j). Then, θ_ijk = P(X_i = k | Pa_i = j, G). The BN structure represents conditional independence assertions in the probability distribution by d-separation. First, we define the collider, which is needed to define d-separation. Letting a path denote a sequence of adjacent variables, the collider is defined as follows.
Definition 1. Assuming we have a structure G = (V, E), a variable Z ∈ V on a path ρ is a collider if and only if there exist two distinct incoming edges into Z from non-adjacent variables.
We then define d-separation as explained below.

Definition 2.
Assuming we have a structure G = (V, E), X, Y ∈ V, and Z ⊆ V \ {X, Y}, the two variables X and Y are d-separated given Z in G if and only if every path ρ between X and Y satisfies at least one of the following two conditions: there is a non-collider on ρ that is included in Z; or there is a collider Z' on ρ such that Z includes neither Z' nor any descendant of Z'.
We denote the d-separation between X and Y given Z in the structure G as Dsep G (X, Y | Z). Two variables are d-connected if they are not d-separated.
If we have X, Y, Z ∈ V, and X and Y are not adjacent, then the following three possible types of connections characterize the d-separations: serial connections such as X → Z → Y, divergence connections such as X ← Z → Y, and convergence connections such as X → Z ← Y. The following theorem of d-separations for these connections holds.
Theorem 1 (Koller and Friedman [17]). Assume a structure G = (V, E) and X, Y, Z ∈ V such that X and Y are not adjacent. If G has a convergence connection X → Z ← Y, then the following two propositions hold: Dsep_G(X, Y | ∅) and ¬Dsep_G(X, Y | {Z}). If G has a serial connection X → Z → Y or a divergence connection X ← Z → Y, then the negations of the above two propositions hold: ¬Dsep_G(X, Y | ∅) and Dsep_G(X, Y | {Z}).
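The d-separation relations characterized above can be tested algorithmically. The sketch below uses the moralized-ancestral-graph criterion, which is equivalent to the path-based definition; the edge-set encoding and function names are ours, not from the paper:

```python
def ancestors(dag, nodes):
    """All nodes with a directed path into `nodes`, plus `nodes` themselves."""
    result, frontier = set(nodes), list(nodes)
    while frontier:
        v = frontier.pop()
        for u, w in dag:
            if w == v and u not in result:
                result.add(u)
                frontier.append(u)
    return result

def d_separated(dag, x, y, z):
    """Dsep_G(x, y | z): restrict to the ancestral graph of {x, y} ∪ z,
    moralize (marry parents sharing a child), drop directions, and check
    whether z blocks every path from x to y."""
    keep = ancestors(dag, {x, y} | set(z))
    edges = {(u, w) for u, w in dag if u in keep and w in keep}
    undirected = {frozenset(e) for e in edges}
    for u, w in edges:
        for u2, w2 in edges:
            if w == w2 and u != u2:
                undirected.add(frozenset((u, u2)))  # marry the parents
    blocked, reached, frontier = set(z), {x}, [x]
    while frontier:
        v = frontier.pop()
        if v == y:
            return False  # an unblocked path exists: d-connected
        for e in undirected:
            if v in e:
                (w,) = e - {v}
                if w not in reached and w not in blocked:
                    reached.add(w)
                    frontier.append(w)
    return True
```

Running it on the three connection types reproduces Theorem 1: a collider d-separates its endpoints given the empty set but d-connects them given the collider, and the serial and divergence connections behave in the opposite way.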
The two DAGs are Markov equivalent when they have the same d-separations.
Definition 3. Let G_1 = (V, E_1) and G_2 = (V, E_2) be two DAGs; then G_1 and G_2 are called Markov equivalent if the following holds: for ∀X, Y ∈ V and ∀Z ⊆ V \ {X, Y}, Dsep_{G_1}(X, Y | Z) ⇔ Dsep_{G_2}(X, Y | Z). Verma and Pearl [18] described the following theorem to identify Markov equivalence.
Theorem 2 (Verma and Pearl [18]). Two DAGs are Markov equivalent if and only if they have identical links (edges without direction) and identical convergence connections.
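Theorem 2 yields a direct procedure for testing Markov equivalence: compare the links and the convergence connections with non-adjacent endpoints of the two DAGs. A small sketch, with our own edge-set encoding:

```python
def skeleton(dag):
    """Links: edges with direction dropped."""
    return {frozenset(e) for e in dag}

def v_structures(dag):
    """Convergence connections X -> Z <- Y with X and Y non-adjacent."""
    links = skeleton(dag)
    result = set()
    for x, z in dag:
        for y, z2 in dag:
            if z == z2 and x != y and frozenset((x, y)) not in links:
                result.add((frozenset((x, y)), z))
    return result

def markov_equivalent(g1, g2):
    """Theorem 2: identical links and identical convergence connections."""
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)
```

For example, the serial connection X → Z → Y and the divergence connection X ← Z → Y are Markov equivalent, whereas neither is equivalent to the collider X → Z ← Y.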
Let I_{P*}(X, Y | Z) denote that X and Y are conditionally independent given Z in the true joint probability distribution P*. A BN structure G is an independence map (I-map) if all the d-separations in G are entailed by conditional independences in P*: Definition 4. Assuming the true joint probability distribution P* of the random variables in a set V and a structure G = (V, E), G is an I-map if the following proposition holds: for ∀X, Y ∈ V and ∀Z ⊆ V \ {X, Y}, Dsep_G(X, Y | Z) ⇒ I_{P*}(X, Y | Z). Probability distributions represented by an I-map converge to P* when the sample size becomes sufficiently large.
We introduce the following notations required for our discussion on learning BNs. Let D denote a dataset, and for a variable set Z ⊆ V, let q_Z = ∏_{X_l ∈ Z} r_l denote the number of instances of Z. For a variable set Z ⊆ V, we define N^Z_j as the number of samples of Z = j in the entire dataset D, and we define N^Z_ijk as the number of samples of X_i = k when Z = j in D. In addition, we define a joint frequency table JFT(Z) and a conditional frequency table CFT(X_i, Z), respectively, as a list of N^Z_j for j = 1, ..., q_Z and a list of N^Z_ijk for i = 0, ..., n, j = 1, ..., q_Z, and k = 1, ..., r_i. Writing N_ijk = N^{Pa_i}_ijk and N_ij = ∑_{k=1}^{r_i} N_ijk, the likelihood of BN B, given D, is represented as

P(D | Θ, G) = ∏_{i=0}^{n} ∏_{j=1}^{q_{Pa_i}} ∏_{k=1}^{r_i} θ_ijk^{N_ijk}.

The maximum likelihood estimators of θ_ijk are given as θ̂_ijk = N_ijk / N_ij. The most popular parameter estimator of BNs is the expected a posteriori (EAP) of Equation (1), which is the expectation of θ_ijk with respect to the density p(Θ_ij | D, G) of Equation (2), assuming the Dirichlet prior density p(Θ_ij | G) of Equation (3):

θ̃_ijk = (N'_ijk + N_ijk) / (N'_ij + N_ij), (1)

p(Θ_ij | D, G) ∝ ∏_{k=1}^{r_i} θ_ijk^{N'_ijk + N_ijk − 1}, (2)

p(Θ_ij | G) ∝ ∏_{k=1}^{r_i} θ_ijk^{N'_ijk − 1}. (3)

In Equations (1)-(3), N'_ijk denotes the hyperparameters of the Dirichlet prior distributions (N'_ijk is a pseudo-sample corresponding to N^{Pa_i}_ijk), with N'_ij = ∑_{k=1}^{r_i} N'_ijk. The BN structure must be estimated from observed data because it is generally unknown. To learn the I-map with the fewest parameters, we maximize a score with the asymptotic consistency defined as shown below.
Definition 5 (Chickering [19]). Let G 1 = (V, E 1 ) and G 2 = (V, E 2 ) be the structures. A scoring criterion Score has an asymptotic consistency if the following two properties hold when the sample size is sufficiently large.

•
If G 1 is an I-map and G 2 is not an I-map, then Score(G 1 ) > Score(G 2 ). • If G 1 and G 2 both are I-maps, and if G 1 has fewer parameters than G 2 , then Score(G 1 ) > Score(G 2 ).
The ML score P(D | G) is known to have asymptotic consistency [19]. When we assume the Dirichlet prior density of Equation (3), the ML is represented as

P(D | G) = ∏_{i=0}^{n} ∏_{j=1}^{q_{Pa_i}} [Γ(N'_ij) / Γ(N'_ij + N_ij)] ∏_{k=1}^{r_i} [Γ(N'_ijk + N_ijk) / Γ(N'_ijk)].

In particular, Heckerman et al. [20] presented the following constraint on the hyperparameters N'_ijk under which the ML satisfies the score-equivalence assumption, i.e., takes the same value for Markov equivalent structures:

N'_ijk = N P(X_i = k, Pa_i = j | G^h),

where N is the equivalent sample size (ESS) determined by users, and G^h is a hypothetical BN structure that reflects a user's prior knowledge. This metric was designated as the Bayesian Dirichlet equivalent (BDe) score metric. As Buntine [21] described, N'_ijk = N / (r_i q_{Pa_i}) is regarded as a special case of the BDe score. Heckerman et al. [20] called this special case the Bayesian Dirichlet equivalent uniform (BDeu). In addition, the minimum description length (MDL) score presented in Equation (4), which approximates the negative logarithm of the ML, is often used for learning BNs:

MDL(G) = (log N / 2) ∑_{i=0}^{n} (r_i − 1) q_{Pa_i} − LL(G | D), (4)

where N here denotes the sample size and LL(G | D) is the maximized log likelihood.
The first term of Equation (4) is the penalty term, which signifies the model complexity.
The second term, LL, is the fitting term that reflects the degree of model fitting to the training data. Both BDeu and MDL are decomposable, i.e., the scores can be expressed as a sum of local scores depending only on the conditional frequency table for one variable and its parents as follows.
For example, the local score of log-BDeu for CFT(X_i, Pa_i) is

Score_i(Pa_i) = ∑_{j=1}^{q_{Pa_i}} [ log (Γ(N'_ij) / Γ(N'_ij + N_ij)) + ∑_{k=1}^{r_i} log (Γ(N'_ijk + N_ijk) / Γ(N'_ijk)) ]. (5)

The decomposable score enables an extremely efficient search for structures [10,15].
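The local log-BDeu score for a conditional frequency table can be computed directly with log-gamma functions. The sketch below is ours (function name and table layout are assumptions); the hyperparameters follow the BDeu choice N'_ijk = N / (r_i q_{Pa_i}):

```python
from math import lgamma

def local_bdeu(cft, ess=1.0):
    """Local log-BDeu score for one variable X_i given its parent set.

    cft[j][k] holds the count N_ijk for parent configuration j and state k.
    BDeu sets the hyperparameters N'_ijk = ess / (r_i * q_pa), so that
    N'_ij = ess / q_pa, where ess is the equivalent sample size N.
    """
    q_pa, r_i = len(cft), len(cft[0])
    a_ij = ess / q_pa
    a_ijk = ess / (r_i * q_pa)
    score = 0.0
    for row in cft:
        n_ij = sum(row)
        score += lgamma(a_ij) - lgamma(a_ij + n_ij)
        for n_ijk in row:
            score += lgamma(a_ijk + n_ijk) - lgamma(a_ijk)
    return score
```

An empty table scores 0 (log of probability 1), and any observed data pushes the score below 0, since it is a log marginal likelihood.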

Bayesian Network Classifiers
A BNC can be interpreted as a BN in which X_0 is the class variable and X_1, ..., X_n are feature variables. Given an instance x = (x_1, ..., x_n) of the feature variables X_1, ..., X_n, the BNC B infers the class ĉ by maximizing the posterior probability of X_0 as

ĉ = argmax_{c ∈ {1, ..., r_0}} P(X_0 = c | x, G, Θ) = argmax_{c ∈ {1, ..., r_0}} θ_{0jc} ∏_{X_i ∈ C} ∏_{j=1}^{q_{Pa_i}} ∏_{k=1}^{r_i} θ_ijk^{1_ijk}, (6)

where 1_ijk = 1 if X_i = k and Pa_i = j in the case of x with X_0 = c, and 1_ijk = 0 otherwise; θ_{0jc} is the parameter of X_0 = c given the parent configuration j determined by x; and C is the set of children of the class variable X_0. From Equation (6), we can infer the class c given only the values of the parents of X_0, the children of X_0, and the parents of the children of X_0, which comprise the Markov blanket of X_0. However, Friedman et al. [2] reported that the BNC minimizing the MDL cannot optimize classification performance. They proposed the sole use of the following CLL of the class variable, given the feature variables, instead of the LL for learning BNC structures:

CLL(G | D) = ∑_{d=1}^{N} log P(x_0^d | x_1^d, ..., x_n^d, G) = ∑_{d=1}^{N} [ log P(x_0^d, x_1^d, ..., x_n^d | G) − log ∑_{c=1}^{r_0} P(X_0 = c, x_1^d, ..., x_n^d | G) ], (7)

where (x_0^d, x_1^d, ..., x_n^d) denotes the d-th sample in D and N denotes the sample size.
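A minimal numeric sketch of the posterior of Equation (6) and the CLL of Equation (7) for a naive Bayes shaped BNC (class X_0 with two binary features; all probability values are invented for illustration):

```python
from math import log

prior = [0.6, 0.4]                # P(X0 = c)
likelihood = [                    # likelihood[i][c][k] = P(X_i = k | X0 = c)
    [[0.9, 0.1], [0.2, 0.8]],     # feature X1
    [[0.7, 0.3], [0.5, 0.5]],     # feature X2
]

def class_posterior(x):
    """P(X0 = c | x): joint P(c, x) per class, then normalize over c."""
    joint = []
    for c, p in enumerate(prior):
        for i, k in enumerate(x):
            p *= likelihood[i][c][k]
        joint.append(p)
    total = sum(joint)
    return [p / total for p in joint]

def classify(x):
    """Equation (6): the class maximizing the posterior."""
    post = class_posterior(x)
    return max(range(len(post)), key=lambda c: post[c])

def cll(data):
    """Equation (7): sum over samples (c, x) of log P(X0 = c | x),
    i.e. log of the joint minus log of the sum over classes."""
    return sum(log(class_posterior(x)[c]) for c, x in data)
```

The normalizing sum over classes inside `class_posterior` is exactly the log-sum term that makes the CLL non-decomposable as a score, even though evaluating it for fixed parameters is easy.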
Furthermore, they proposed conditional MDL (CMDL), which is a modified MDL replacing the LL with the CLL:

CMDL(G) = (log N / 2) ∑_{i=0}^{n} (r_i − 1) q_{Pa_i} − CLL(G | D).
Consequently, they claimed that the BN minimizing CMDL as a discriminative model showed better accuracy than that maximizing ML as a generative model. Unfortunately, the CLL is not decomposable because we cannot describe the second term of Equation (7) as a sum of the log parameters in Θ. This implies that no closed-form equation exists for the maximum CLL estimator of Θ. Therefore, learning the network structure that minimizes the CMDL requires a search method such as gradient descent over the space of parameters for each structure candidate, and exact learning of network structures by minimizing CMDL is computationally infeasible.
As a simple means of resolving that difficulty, Friedman et al. [2] proposed the ANB, which ensures an edge from the class variable to each feature variable and allows edges among the feature variables. Furthermore, they proposed the TAN, in which the class variable has no parents and each feature variable has the class variable and at most one other feature variable as its parents.
Various approximate methods to maximize the CLL have been proposed. Carvalho et al. [6] proposed the aCLL score, which is decomposable and computationally efficient. Let G_ANB be an ANB structure. In addition, let N_ijck be the number of samples of X_i = k when X_0 = c and Pa_i \ {X_0} = j (i = 1, ..., n; j = 1, ..., q_{Pa_i \ {X_0}}; c = 1, ..., r_0; k = 1, ..., r_i), and let N > 0 represent the number of pseudocounts. Under several assumptions, the aCLL can be represented in a decomposable closed form in terms of these counts and a weight β [6]. The value of β is found by using the Monte Carlo method to approximate the CLL. When the value of β is optimal, the aCLL is a minimum-variance unbiased approximation of the CLL. Moreover, Grossman and Domingos [3] proposed a structure learning method using a greedy hill-climbing algorithm [20] that maximizes the CLL while estimating the parameters by maximizing the LL. Recently, Mihaljević et al. [7] identified the smallest subspace of DAGs that covers all possible class-posterior distributions when the data are complete.
All the DAGs in this space, which they call minimal class-focused DAGs (MC-DAGs), are such that every edge is directed toward a child of the class variable. In addition, they proposed a greedy search algorithm in the space of Markov equivalence classes of MC-DAGs using the CLL score. These reports described that the BNC maximizing the approximated CLL provides better performance than that maximizing the approximated ML. However, they did not explain why CLL outperformed ML. For large data, the classification accuracies obtained by maximizing ML are expected to be comparable to those obtained by maximizing CLL because ML has asymptotic consistency. Differences between the performances of the two scores in these earlier studies might depend on their respective learning algorithms; they were approximate learning algorithms, not exact ones.

Classification Accuracies of Exact Learning GBN
This section presents experiments comparing the classification accuracies of the GBN exactly learned by maximizing BDeu as a generative model with those of BNCs approximately learned by maximizing the CLL as a discriminative model. Although determining the hyperparameter N of BDeu is difficult [16,22-24], we use N = 1.0, which allows the data to be reflected in the estimated parameters to the greatest degree possible [25,26].
The experiment compared the respective classification accuracies of the seven methods in Table 1. All the methods were implemented in Java. The source code is available at http://www.ai.lab.uec.ac.jp/software/ (accessed on 29 November 2021). Throughout this paper, our experiments were conducted in the computational environment shown in Table 2. This experiment used 43 classification benchmark datasets from the UCI repository [27]. Continuous variables were discretized into two bins using the median value as the cutoff, as in [28]. In addition, data with missing values were removed from the datasets. We used EAP estimators as the conditional probability parameters of the respective classifiers. The hyperparameters N'_ijk of the EAP were set to 1/(r_i q_{Pa_i}). Throughout our experiments, we defined "small datasets" as datasets with fewer than 200 samples and "large datasets" as datasets with 10,000 or more samples.

GBN-CMDL [3]: Greedy learning GBN method using the hill-climbing search by minimizing CMDL while estimating parameters by maximizing LL.

BNC2P [3]: Greedy learning method with at most two parents per variable using the hill-climbing search by maximizing CLL while estimating parameters by maximizing LL.

MC-DAGGES [7]: Greedy learning method in the space of the Markov equivalent classes of MC-DAGs using the greedy equivalence search [19] by maximizing CLL while estimating parameters by maximizing LL.

Table 3 presents the classification accuracies of the respective classifiers. (The results of ANB-BDeu and fsANB-BDeu are discussed in a later section.) The values shown in bold in Table 3 represent the best classification accuracies for each dataset. Here, the classification accuracies represent the average percentage of correct classifications over ten-fold cross-validation. Moreover, to investigate the relation between the classification accuracies and the learned structures, Table 4 presents details of the structures obtained by GBN-BDeu. "Parents" in Table 4 represents the average number of the class variable's parents, and "Children" the average number of its children, in the structures learned by GBN-BDeu. "Sparse data" denotes the average number of parent configurations j of X_0 with no data, i.e., N^{Pa_0}_j = 0 (j = 1, ..., q_{Pa_0}).
From Table 3, GBN-BDeu shows the best classification accuracies among the methods for large datasets, such as dataset Nos. 22, 29, and 33. Because BDeu has asymptotic consistency, the joint probability distribution represented by GBN-BDeu approaches the true distribution as the sample size increases. However, it is worth noting that GBN-BDeu provides much worse accuracy than the other methods on datasets Nos. 3 and 9. In these datasets, the class variable in the structures learned by GBN-BDeu has no children and numerous parents, as shown in "Parents" and "Children" in Table 4. When a class variable has numerous parents, the estimation of its conditional probability parameters becomes unstable because the parent configurations become numerous and the sample size available for each configuration becomes small, as presented in "Sparse data" in Table 4. Therefore, numerous parents of the class variable might fail to reflect the feature data for classification when the sample is insufficiently large.

Table 4. Statistical summary of GBN-BDeu and fsANB-BDeu.

Exact Learning ANB Classifier
The preceding section suggested that forcing the class variable to have no parents in exact learning by maximizing BDeu might improve the accuracy of GBN-BDeu. In this section, we propose an exact learning ANB, which maximizes BDeu and ensures that the class variable has no parents. In earlier reports, the ANB constraint was used to learn the BNC as a discriminative model. In contrast, we use the ANB constraint to learn the BNC as a generative model. The space of all possible ANB structures includes at least one I-map because it includes the complete graph, which is an I-map. From the asymptotic consistency of BDeu (Definition 5), the proposed method is guaranteed to achieve the I-map with the fewest parameters among all possible ANB structures when the sample size becomes sufficiently large. Our empirical analysis in the preceding section suggests that the proposed method can improve the classification accuracy for small data. We adapt the dynamic programming (DP) algorithm for learning GBNs [10] to the exact learning of ANBs; the DP algorithm for exact learning of an ANB is almost twice as fast as that for a GBN. Moreover, we prove that the proposed ANB asymptotically estimates the identical conditional probability of the class variable to that of the exactly learned GBN.

Learning Procedure
The proposed method is intended to seek the optimal structure that maximizes the BDeu score among all possible ANB structures. The local score of the class variable in ANB structures is constant because the class variable has no parents in an ANB structure. Therefore, we can ascertain the optimal ANB structure by maximizing Score_ANB(G) = Score(G) − Score_0(∅), where Score_0(∅) denotes the local score of the parentless class variable.
Before we describe the procedure of our method, we introduce the following notations. Let G*(Z) denote the optimal ANB structure composed of a variable set Z (X_0 ∈ Z). When a variable has no child in a structure, we say it is a sink in the structure. We use X*_s(Z) to denote a sink in G*(Z). Additionally, letting Π(Z) denote the set of all subsets of Z that include X_0, we define the best parents of X_i in a candidate set Π(Z) as the parent set that maximizes the local score in Π(Z):

g*_i(Π(Z)) = argmax_{W ∈ Π(Z)} Score_i(W).
Our algorithm has four logical steps. The following procedure adapts the DP algorithm proposed by [10] to learn the optimal ANB structure.

1. For all possible pairs of a variable X_i ∈ V \ {X_0} and a variable set Z ⊆ V \ {X_i} (X_0 ∈ Z), calculate the local score Score_i(Z) (Equation (5)).

2. For all possible pairs of a variable X_i ∈ V \ {X_0} and a variable set Z ⊆ V \ {X_i} (X_0 ∈ Z), calculate the best parents g*_i(Π(Z)).

3. For all variable sets Z ⊆ V (X_0 ∈ Z), calculate the best sink X*_s(Z).

4. Calculate G*(V) using Steps 2 and 3.

Steps 3 and 4 of the algorithm are based on the observation that the best network G*(Z) necessarily has a sink X*_s(Z) with incoming edges from its best parents g*_s(Π(Z \ {X*_s(Z)})). The remaining variables and edges in G*(Z) necessarily construct the best network G*(Z \ {X*_s(Z)}). More formally,

Score_ANB(G*(Z)) = Score_s(g*_s(Π(Z \ {X*_s(Z)}))) + Score_ANB(G*(Z \ {X*_s(Z)})). (8)

From Equation (8), we can decompose G*(Z) into G*(Z \ {X*_s(Z)}) and X*_s(Z) with incoming edges from g*_s(Π(Z \ {X*_s(Z)})). Moreover, this decomposition can be performed recursively. At the end of the recursive decomposition, we obtain n pairs of a sink and its best parents, denoted by (X_{s_1}, g*_{s_1}), ..., (X_{s_i}, g*_{s_i}), ..., (X_{s_n}, g*_{s_n}). Finally, we obtain G*(V), in which the parent set of X_{s_i} is g*_{s_i}. The numbers of iterations needed to calculate all the local scores, best parents, and best sinks for our algorithm are (n − 1)2^{n−2}, (n − 1)2^{n−2}, and 2^{n−1}, respectively, whereas those for the GBN are n2^{n−1}, n2^{n−1}, and 2^n, respectively. Therefore, the DP algorithm for the ANB is almost twice as fast as that for the GBN. The details of the proposed algorithm are shown in Appendix A.
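The four steps above can be sketched as a dynamic program in the style of [10]. This is a minimal illustration of the recursion in Equation (8) with a made-up decomposable score, not the authors' Java implementation; a real run would plug in the log-BDeu local scores of Equation (5):

```python
from itertools import combinations

def exact_anb(n, local_score):
    """Exact ANB structure search by dynamic programming.

    n: number of features, indexed 1..n.  The class X0 is implicit: it has
    no parents and is assumed a parent of every feature inside local_score.
    local_score(i, pa): decomposable local score of feature X_i with
    feature-parent set pa (a frozenset).  Returns {i: best parent set}.
    """
    feats = list(range(1, n + 1))
    subsets = [frozenset(s) for size in range(n + 1)
               for s in combinations(feats, size)]  # sorted by size

    # Steps 1-2: best parents for every (X_i, candidate set Z), bottom-up.
    best_pa, best_pa_score = {}, {}
    for z in subsets:
        for i in feats:
            if i in z:
                continue
            options = [(local_score(i, z), z)]
            options += [(best_pa_score[(i, z - {x})], best_pa[(i, z - {x})])
                        for x in z]
            best_pa_score[(i, z)], best_pa[(i, z)] = max(options)

    # Step 3: best sink and best network score for every subset Z.
    best_net = {frozenset(): (0.0, None)}
    for z in subsets:
        if not z:
            continue
        best_net[z] = max((best_net[z - {s}][0] + best_pa_score[(s, z - {s})], s)
                          for s in z)

    # Step 4: peel off sinks to reconstruct G*(V).
    structure, z = {}, frozenset(feats)
    while z:
        s = best_net[z][1]
        z = z - {s}
        structure[s] = best_pa[(s, z)]
    return structure

# Toy decomposable score (hypothetical, for illustration only): rewards
# the edge X1 -> X2 and penalizes every extra parent.
def toy_score(i, pa):
    bonus = 2.0 if (i == 2 and pa == frozenset({1})) else 0.0
    return bonus - 0.5 * len(pa)

structure = exact_anb(2, toy_score)
```

Under `toy_score`, the optimal ANB keeps X_1 parentless (apart from the implicit X_0) and gives X_2 the single feature-parent X_1.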

Asymptotic Properties of the Proposed Method
Under some assumptions, the proposed ANB is proven to asymptotically estimate the identical conditional probability of the class variable, given the feature variables of the exactly learned GBN. When the sample size becomes sufficiently large, the structure learned by the proposed method and the exactly learned GBN are classification-equivalent defined as follows: Definition 6 (Acid et al. [29]). Let G be a set of all the BN structures. Furthermore, let D be any finite dataset. For ∀G 1 , G 2 ∈ G, we say that G 1 and G 2 are classification-equivalent if P(X 0 | x, G 1 , D) = P(X 0 | x, G 2 , D) for any feature variable's value x.
To derive the main theorem, we introduce five lemmas as below.
Lemma 1 (Mihaljević et al. [7]). Let G = (V, E) be a structure. Then, G is classification-equivalent to the structure G' obtained by modifying G with the following operations:

1. For ∀X, Y ∈ Pa^G_0, add an edge between X and Y in G.

2. For ∀X ∈ Pa^G_0, reverse the edge from X to X_0 in G.
Next, we use the following lemma from Chickering [19] to derive the main theorem: Lemma 2 (Chickering [19]). Let 𝒢^Imap be the set of all I-maps. When the sample size becomes sufficiently large, the following proposition holds: for ∀G_1, G_2 ∈ 𝒢^Imap, if ∀X, Y ∈ V, ∀Z ⊆ V \ {X, Y}, Dsep_{G_1}(X, Y | Z) ⇒ Dsep_{G_2}(X, Y | Z), then Score(G_1) ≤ Score(G_2).
Moreover, we provide Lemma 3 under the following assumption.
Assumption 1. The true joint probability distribution P* of the random variables in a set V is faithful to a DAG. Under Assumption 1, a true structure G* = (V, E*) exists that satisfies the following property: for ∀X, Y ∈ V and ∀Z ⊆ V \ {X, Y}, Dsep_{G*}(X, Y | Z) ⇔ I_{P*}(X, Y | Z).

Lemma 3. Let 𝒢^Imap_ANB be the set of all ANB structures that are I-maps. For ∀G^Imap_ANB ∈ 𝒢^Imap_ANB and ∀X, Y ∈ V, if G* has a convergence connection X → X_0 ← Y, then G^Imap_ANB has an edge between X and Y.

Proof. We prove Lemma 3 by contradiction. Assume that G^Imap_ANB has no edge between X and Y. Because G^Imap_ANB then has a divergence connection X ← X_0 → Y, Theorem 1 yields Dsep_{G^Imap_ANB}(X, Y | X_0), and because G^Imap_ANB is an I-map, Definition 4 gives

I_{P*}(X, Y | X_0). (9)

Because G* has a convergence connection X → X_0 ← Y, the following proposition holds from Theorem 1 and Assumption 1: ¬I_{P*}(X, Y | X_0). This result contradicts (9). Consequently, G^Imap_ANB has an edge between X and Y.
Furthermore, under Assumption 1 and the following assumptions, we derive Lemma 4.

Assumption 2. All feature variables are included in the Markov blanket M of the class variable in the true structure G*.

Assumption 3. For ∀X ∈ M, X and X_0 are adjacent in G*.

Lemma 4. Let G*_1 be the structure obtained by modifying G* with operation 1 in Lemma 1. In addition, let G*_12 be the structure obtained by modifying G*_1 with operation 2 in Lemma 1. Under Assumptions 1-3, G*_1 is Markov equivalent to G*_12.
Proof. From Theorem 2, we prove Lemma 4 by showing the following two propositions: (I) G*_1 and G*_12 have the same links (edges without direction), and (II) they have the same set of convergence connections. Proposition (I) holds immediately because the only difference between G*_1 and G*_12 is the direction of the edges between X_0 and the variables in Pa^{G*}_0. For the same reason, G*_1 and G*_12 have the same set of convergence connections with colliders in V \ (Pa^{G*}_0 ∪ {X_0}). Moreover, there are no convergence connections with colliders in Pa^{G*}_0 ∪ {X_0} in either G*_1 or G*_12 because all the variables in Pa^{G*}_0 ∪ {X_0} are adjacent in the two structures. Consequently, they have the same set of convergence connections, so Proposition (II) holds. This completes the proof.
Finally, under Assumptions 1-3, we derive the following lemma.

Lemma 5. Under Assumptions 1-3, G*_12 is an I-map.

Proof. The DAG G*_1 results from adding the edges between the variables in Pa^{G*}_0 to G*. Because adding edges does not create any new d-separation, G*_1 remains an I-map. Lemma 5 holds because G*_1 is Markov equivalent to G*_12 by Lemma 4.

Theorem 3. Under Assumptions 1-3, when the sample size becomes sufficiently large, the proposed method (learning an ANB using BDeu) achieves a classification-equivalent structure to G*.
Proof. Because G*_12 is classification-equivalent to G* from Lemma 1, we prove Theorem 3 by showing that the proposed method asymptotically learns a Markov equivalent structure to G*_12. To this end, we show that G*_12 asymptotically has the maximum BDeu score among all the ANB structures:

∀G_ANB ∈ 𝒢_ANB, Score(G_ANB) ≤ Score(G*_12), (11)

where 𝒢_ANB denotes the set of all ANB structures. From Definition 5, the BDeu scores of the I-maps are higher than those of any non-I-maps when the sample size becomes sufficiently large. Therefore, it is sufficient to show that the following proposition holds asymptotically to prove that Proposition (11) asymptotically holds:

∀G^Imap_ANB ∈ 𝒢^Imap_ANB, Score(G^Imap_ANB) ≤ Score(G*_12). (12)

From Lemma 5, G*_12 is an I-map. Therefore, from Lemma 2, a sufficient condition for satisfying (12) is as follows:

∀G^Imap_ANB ∈ 𝒢^Imap_ANB, ∀X, Y ∈ V, ∀Z ⊆ V \ {X, Y}, Dsep_{G^Imap_ANB}(X, Y | Z) ⇒ Dsep_{G*_12}(X, Y | Z). (13)

We prove (13) by dividing it into two cases: Case I, in which X, Y ∈ Pa^{G*}_0, and Case II, otherwise.

Case I: From Lemma 3, all variables in Pa^{G*}_0 are adjacent in G^Imap_ANB. Therefore, we obtain

¬Dsep_{G^Imap_ANB}(X, Y | Z). (14)

For two Boolean propositions p and q, the following holds:

¬p ⇒ (p ⇒ q). (15)

Applying (15) with p = Dsep_{G^Imap_ANB}(X, Y | Z) completes the proof of (13) in Case I.

Case II: From Definition 4 and Assumption 1, we obtain

Dsep_{G^Imap_ANB}(X, Y | Z) ⇒ I_{P*}(X, Y | Z) ⇔ Dsep_{G*}(X, Y | Z).

Thus, we can prove (13) by showing that the following proposition holds:

Dsep_{G*}(X, Y | Z) ⇒ Dsep_{G*_12}(X, Y | Z). (16)

For the remainder of the proof, we prove the sufficient condition (16) to satisfy (13) by dividing it into two cases: X_0 ∈ Z and X_0 ∉ Z.

Case i: X_0 ∈ Z. All pairs of non-adjacent variables in Pa^{G*}_0 in G* comprise a convergence connection with the collider X_0. From Theorem 1, these pairs are necessarily d-connected given X_0 in G*. Therefore, all the variables in Pa^{G*}_0 are d-connected given X_0 in G*. This means that G* and G*_1 represent identical d-separations given X_0. Because G*_1 is Markov equivalent to G*_12 from Lemma 4, G* and G*_12 represent identical d-separations given X_0; i.e., Proposition (16) holds.

Case ii: X_0 ∉ Z. We divide (16) into two cases: X = X_0 ∨ Y = X_0, and X ≠ X_0 ∧ Y ≠ X_0. In the latter case, if both G*_12 and G* have no edge between X and Y, they have a serial or divergence connection, X → X_0 → Y or X ← X_0 → Y. Because the serial and divergence connections represent d-connections between X and Y in this case (X_0 ∉ Z) from Theorem 1, we obtain ¬Dsep_{G*_12}(X, Y | Z) ∧ ¬Dsep_{G*}(X, Y | Z). From (15), Proposition (16) holds.

Thus, we complete the proof of (13) in Case II. Consequently, Proposition (13) is true, which completes the proof of Theorem 3.
We proved that the proposed ANB asymptotically estimates the identical conditional probability of the class variable to that of the exactly learned GBN.

Numerical Examples
This subsection presents numerical experiments conducted to demonstrate the asymptotic properties of the proposed method. To demonstrate that the proposed method asymptotically achieves the I-map with the fewest parameters among all possible ANB structures, we evaluated the structural Hamming distance (SHD) [30], which measures the distance between the structure learned by the proposed method and the I-map with the fewest parameters among all possible ANB structures. To demonstrate Theorem 3, we evaluated the Kullback-Leibler divergence (KLD) between the class-variable posterior learned by the proposed method and that learned by the true structure. This experiment used two benchmark datasets from bnlearn [31]: CANCER and ASIA, as depicted in Figures 1 and 2. We used the variables "Cancer" and "either" as the class variables in CANCER and ASIA, respectively. In that case, CANCER satisfied Assumptions 2 and 3, but ASIA did not. From the two networks, we randomly generated sample data for each sample size N = 100, 500, 1000, 5000, 10,000, 50,000, and 100,000. Based on the generated data, we learned the BNC structures using the proposed method and then evaluated the SHDs and KLDs. Table 5 presents the results. The results show that the SHD converged to 0 as the sample size increased in both CANCER and ASIA. Thus, the proposed method asymptotically learned the I-map with the fewest parameters among all possible ANB structures. Furthermore, in CANCER, the KLD between the class-variable posterior learned by the proposed method and that learned by the true structure became 0 when N ≥ 1000. The results demonstrate that the proposed method learns a classification-equivalent structure to the true one when the sample size becomes sufficiently large, as described in Theorem 3.
In ASIA, however, the KLD between the class variable posterior learned by the proposed method and that learned using the true structure did not reach 0 even when the sample size became large, because ASIA did not satisfy Assumptions 2 and 3.
Table 5. The SHD between the structure learned by the proposed method and the I-map with the fewest parameters among all the ANB structures, and the KLD between the class variable posterior learned by the proposed method and that learned using the true structure.
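As an illustration of the two evaluation metrics, the following sketch (not the authors' implementation; the function names are ours) computes the SHD between two DAG adjacency matrices and the KLD between two discrete class posteriors:

```python
import math

def shd(adj_a, adj_b):
    """Structural Hamming distance between two DAGs given as adjacency
    matrices (adj[i][j] == 1 iff there is an edge i -> j): counts the
    vertex pairs whose edge status (absent / i->j / j->i) differs."""
    n = len(adj_a)
    dist = 0
    for i in range(n):
        for j in range(i + 1, n):
            if (adj_a[i][j], adj_a[j][i]) != (adj_b[i][j], adj_b[j][i]):
                dist += 1
    return dist

def kld(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats between two
    discrete class-posterior distributions (lists of probabilities)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Identical structures give SHD = 0, a single reversed edge gives SHD = 1, and identical posteriors give KLD = 0, matching the convergence behavior reported in Table 5.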


Learning Markov Blanket
Theorem 3 assumes that all feature variables are included in the Markov blanket of the class variable. However, this assumption does not necessarily hold. To solve this problem, we must learn the Markov blanket of the class variable before learning the ANB. Under Assumption 3, the Markov blanket of the class variable is equivalent to the parent-child (PC) set of the class variable. However, the exact learning of the PC set of a variable is known to be computationally infeasible when the number of variables increases. To reduce the computational cost of learning a PC set, ref. [32] proposed a score-based local learning algorithm (SLL) with two learning steps. In step 1, the algorithm sequentially learns the PC set by repeatedly applying an exact structure learning algorithm to a set of variables containing the class variable, the current PC set, and one new query variable. In step 2, SLL enforces the symmetry constraint: if X_i is a child of X_j, then X_j must be a parent of X_i. This constraint removes extra variables from the PC set, and the SLL algorithm provably finds the correct PC set of the class variable when the sample size is sufficiently large. Moreover, ref. [33] proposed the S2TMB algorithm, which improves efficiency over SLL by removing the symmetry constraints in the PC search steps. However, S2TMB is computationally infeasible when the size of the PC set surpasses 30.
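The symmetry correction in step 2 can be sketched as follows; `enforce_symmetry` is a hypothetical helper (our name, not from [32]) operating on a mapping from each variable to its currently learned PC set:

```python
def enforce_symmetry(pc_sets):
    """SLL-style step-2 symmetry correction: a variable X stays in the
    PC set of T only if T also appears in the PC set of X. Asymmetric
    (extra) variables are removed."""
    return {t: {x for x in pc if t in pc_sets.get(x, set())}
            for t, pc in pc_sets.items()}
```

For example, if C was placed in the PC set of A but A is absent from the PC set of C, the correction drops C as an extra variable.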
As an alternative approach for learning large PC sets, previous studies proposed constraint-based PC search algorithms, such as MMPC [30], HITON-PC [34], and PCMB [35]. These methods produce an undirected graph structure using statistical hypothesis tests or information-theoretic tests. The G² and χ² tests were used as statistical hypothesis tests for these constraint-based methods. In these tests, the independence of two variables is set as the null hypothesis. The p-value signifies the probability of observing data at least as extreme as the actual data, assuming that the null hypothesis holds. If the p-value exceeds a user-determined significance level, the null hypothesis is accepted, and the edge is removed. However, [36] reported that statistical hypothesis tests have a significant shortcoming: the p-value sometimes becomes much smaller than the significance level as the sample size increases. Therefore, statistical hypothesis tests suffer from Type I errors (detecting dependence for a conditional independence relation in the true DAG). Conditional mutual information (CMI) is also often used as a CI test [37]. The CMI strongly depends on a hand-tuned threshold value; therefore, it is not guaranteed to estimate the true CI structure. Consequently, these CI tests have no asymptotic consistency.
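A minimal sketch of a CMI-based CI test illustrates its dependence on a hand-tuned threshold; the helper names and the default threshold are ours, not taken from the cited methods:

```python
import math
from collections import Counter

def conditional_mutual_information(xs, ys, zs):
    """Empirical CMI I(X; Y | Z) in nats from three parallel lists of
    discrete observations."""
    n = len(xs)
    n_xyz = Counter(zip(xs, ys, zs))
    n_xz = Counter(zip(xs, zs))
    n_yz = Counter(zip(ys, zs))
    n_z = Counter(zs)
    cmi = 0.0
    for (x, y, z), c in n_xyz.items():
        # p(x,y,z) * log( p(x,y|z) / (p(x|z) p(y|z)) ) in count form
        cmi += (c / n) * math.log(c * n_z[z] / (n_xz[(x, z)] * n_yz[(y, z)]))
    return cmi

def ci_test_cmi(xs, ys, zs, threshold=0.01):
    """Declare X independent of Y given Z when the empirical CMI falls
    below a hand-tuned threshold -- the weakness noted in the text:
    the result hinges on the threshold, so no asymptotic consistency."""
    return conditional_mutual_information(xs, ys, zs) < threshold
```

With a different threshold the same data can yield the opposite verdict, which is precisely why such tests are not guaranteed to recover the true CI structure.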
For a CI test with asymptotic consistency, [38] proposed a Bayes factor with BDeu (the "BF method", below), where the Bayes factor is the ratio of marginal likelihoods between two hypotheses [39]. For two variables X, Y ∈ V and a set of conditional variables Z ⊆ V \ {X, Y}, the BF method BF(X, Y | Z) is defined as

BF(X, Y | Z) = Score(CFT(X, Z)) / Score(CFT(X, Z ∪ {Y})),

where Score(CFT(X, Z)) and Score(CFT(X, Z ∪ {Y})) can be obtained using Equation (5). The BF method detects I_P*(X, Y | Z) if BF(X, Y | Z) is larger than the threshold δ, and detects ¬I_P*(X, Y | Z) otherwise. Natori et al. [40] and Natori et al. [41] applied the BF method to a constraint-based approach and showed that their method was more accurate than the other methods with traditional CI tests.
We propose a constraint-based PC search algorithm using the BF method. The proposed algorithm finds the PC set of the class variable by applying the BF method between the class variable and all feature variables, because the Bayes factor has asymptotic consistency for CI tests [41]. It is known that missing crucial variables degrades the accuracy [2]. Therefore, we redundantly learn the PC set of the class variable, tolerating extra variables in order to avoid missing variables, as follows.
• The proposed PC search algorithm conducts CI tests only at the zero order (given no conditional variables), which are more reliable than those at higher orders.
• We use a positive value as the Bayes factor's threshold δ.
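The zero-order BF test can be sketched as follows. This is an illustrative simplification, not the authors' code: the helper names are ours, the BDeu local score stands in for Equation (5), and the number of parent configurations is taken from the observed data rather than the full Cartesian product of parent state spaces:

```python
import math
from collections import Counter

def bdeu_local_score(data, x, parents, ess=1.0):
    """Log marginal likelihood (BDeu) of variable x's conditional
    frequency table given a parent set, for discrete data rows given
    as dicts mapping variable name -> state. Simplification: the
    number of parent configurations q is the number observed in data."""
    states_x = {row[x] for row in data}
    r_x = len(states_x)
    q = max(len({tuple(row[p] for p in parents) for row in data}), 1)
    a_j = ess / q           # pseudo-count per parent configuration
    a_jk = ess / (q * r_x)  # pseudo-count per (configuration, state) cell
    n_j = Counter(tuple(row[p] for p in parents) for row in data)
    n_jk = Counter((tuple(row[p] for p in parents), row[x]) for row in data)
    score = 0.0
    for j, nj in n_j.items():
        score += math.lgamma(a_j) - math.lgamma(a_j + nj)
        for x_val in states_x:
            njk = n_jk.get((j, x_val), 0)
            score += math.lgamma(a_jk + njk) - math.lgamma(a_jk)
    return score

def log_bayes_factor(data, x, y, ess=1.0):
    """log BF(X, Y | empty set): independence-model score minus
    dependence-model score; large values favor independence."""
    return bdeu_local_score(data, x, [], ess) - bdeu_local_score(data, x, [y], ess)

def pc_search(data, class_var, features, delta=3.0, ess=1.0):
    """Zero-order PC search: keep a feature unless the Bayes factor
    exceeds the threshold delta (i.e., independence is detected)."""
    return [f for f in features
            if log_bayes_factor(data, class_var, f, ess) <= math.log(delta)]
```

On data where one feature determines the class and another is independent of it, the sketch keeps the former and drops the latter, matching the intended behavior of the proposed zero-order test.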
Furthermore, we compared the accuracy of the proposed PC search method with those of MMPC, HITON-PC, PCMB, and S2TMB. Learning Bayesian networks is known to be highly sensitive to the chosen ESS value [22,25,26]. Therefore, we determined the ESS N ∈ {1.0, 2.0, 5.0} and the Bayes factor threshold δ ∈ {3, 20, 150} using two-fold cross-validation to obtain the highest classification accuracy. The three ESS values of N were determined according to Ueno [25,26]. The three values of δ were determined according to Heckerman et al. [20]. All the compared methods were implemented in Java (source code is available at http://www.ai.lab.uec.ac.jp/software/, accessed on 29 November 2021). This experiment used six benchmark datasets from bnlearn: ASIA, SACHS, CHILD, WATER, ALARM, and BARLEY. From each benchmark network, we randomly generated sample data of size N = 10,000. Based on the generated data, we learned the PC sets of all the variables using each method. Table 6 shows the average runtimes of each method. We counted missing variables, i.e., the number of removed variables that exist in the true PC set, and extra variables, i.e., the number of remaining variables that do not exist in the true PC set. Table 6 also shows the average numbers of missing and extra variables in the learned PC sets of all the variables. We compared the classification accuracies of the exact learning ANB with the BDeu score (designated as ANB-BDeu), using each PC search method as a feature selection method. Table 7 shows the average accuracies of each method over the 43 UCI repository datasets listed in Table 3.
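The two-fold cross-validation over the ESS and δ grids can be sketched generically; `fit` and `accuracy` are placeholder callables standing in for an actual ANB-BDeu learner and its evaluation, and the grids are those stated above:

```python
from itertools import product

def select_hyperparameters(data, fit, accuracy,
                           ess_grid=(1.0, 2.0, 5.0),
                           delta_grid=(3, 20, 150)):
    """Two-fold cross-validation over the ESS and Bayes-factor
    threshold grids. fit(train, ess, delta) returns a classifier;
    accuracy(clf, test) returns its accuracy on held-out rows."""
    half = len(data) // 2
    folds = [(data[:half], data[half:]), (data[half:], data[:half])]
    best = None
    for ess, delta in product(ess_grid, delta_grid):
        acc = sum(accuracy(fit(train, ess, delta), test)
                  for train, test in folds) / 2
        if best is None or acc > best[0]:
            best = (acc, ess, delta)
    return best[1], best[2]  # (selected ESS, selected delta)
```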
From Table 6, the results show that the runtimes of the proposed method were shorter than those of the other methods. Moreover, the number of missing variables of the proposed method was smaller than those of the other methods. On the other hand, Table 6 also shows that the number of extra variables of the proposed method was greater than those of the other methods in all datasets. From Table 7, the results show that ANB-BDeu using the proposed method provided a much higher average accuracy than the other methods. This is because missing variables degrade the classification accuracy more significantly than extra variables do [2].

Experiments
This section presents numerical experiments conducted to evaluate the effectiveness of the exact learning ANB. First, we compared the classification accuracies of ANB-BDeu with those of the other methods in Section 3, using the same experimental setup and evaluation method described there. The classification accuracies of ANB-BDeu are presented in Table 3. To confirm the significant differences between ANB-BDeu and the other methods, we applied Hommel's tests [42], which are used as a standard in machine learning studies [43]. The p-values are presented at the bottom of Table 3. In addition, "MB size" in Table 4 denotes the average size of the class variable's Markov blanket in the structures learned by GBN-BDeu.
The results show that ANB-BDeu outperformed Naive Bayes, GBN-CMDL, BNC2P, TAN-aCLL, gGBN-BDeu, and MC-DAGGES at the p < 0.1 significance level. Moreover, ANB-BDeu improved the accuracy of GBN-BDeu when the class variable had numerous parents, such as in the No. 3, No. 9, and No. 31 datasets, as shown in Table 4. Furthermore, ANB-BDeu provided higher accuracies than GBN-BDeu even for large data, such as datasets No. 13, 22, 29, and 33, although the difference between ANB-BDeu and GBN-BDeu was not statistically significant. These actual datasets did not necessarily satisfy Assumptions 1-3 in Theorem 3. These results imply that the accuracies of ANB-BDeu without satisfying Assumptions 1-3 might be comparable to those of GBN-BDeu for large data. It is worth noting that the accuracies of ANB-BDeu were much worse than those provided by GBN-BDeu for datasets No. 5 and No. 12. The "MB size" in these datasets was much smaller than the number of all feature variables, as shown in Table 4. These results suggest that feature selection by the Markov blanket is expected to improve the classification accuracies of the exact learning ANB, as described in Section 5.
We compared the classification accuracies of ANB-BDeu using the PC search method proposed in Section 5 (referred to as "fsANB-BDeu") with those of the other methods in Table 3. Table 3 shows the classification accuracies of fsANB-BDeu and the p-values of Hommel's tests for differences between fsANB-BDeu and the other methods. The results show that fsANB-BDeu outperformed all the compared methods at the p < 0.05 significance level.
"Max parents" in Table 4 presents the average maximum number of parents learned by fsANB-BDeu; its value represents the complexity of the learned structure. The results show that the accuracies of Naive Bayes were better than those of fsANB-BDeu when the sample size was small, such as in the No. 36 and No. 38 datasets. In these datasets, the values of "Max parents" are large. The estimation of a variable's parameters tends to become unstable when the variable has numerous parents, as described in Section 3. Naive Bayes avoids this phenomenon because the maximum number of parents in Naive Bayes is one. However, Naive Bayes cannot learn relationships between the feature variables. Therefore, for large samples, such as the No. 8 and No. 29 datasets, Naive Bayes showed much worse accuracies than the other methods.
Similar to Naive Bayes, BNC2P and TAN-aCLL showed better accuracies than fsANB-BDeu for small samples, such as the No. 38 dataset, because the upper bound of the maximum number of parents is two in these two methods. However, a small upper bound on the maximum number of parents tends to lead to poor representational power of the structure [44]. As a result, the accuracies of both methods tend to be worse than those of fsANB-BDeu when its "Max parents" value is greater than two, such as in the No. 29 dataset.
For large samples, such as datasets No. 29 and 33, GBN-CMDL, gGBN-BDeu, and MC-DAGGES showed worse accuracies than fsANB-BDeu because exact learning estimates the network structure more precisely than greedy learning does.
We compared fsANB-BDeu and ANB-BDeu. The difference between the two methods is whether the proposed PC search method is used. "Removed variables" in Table 4 represents the average number of variables removed from the class variable's Markov blanket by our proposed PC search method. The results demonstrated that the accuracies of fsANB-BDeu tended to be much higher than those of ANB-BDeu when the value of "Removed variables" was large, such as Nos. 5, 12, 16, 34, and 38. Consequently, discarding numerous irrelevant variables in the features improved the classification accuracy.
Finally, we compared the runtimes of fsANB-BDeu and GBN-BDeu to demonstrate the efficiency of the ANB constraint. Table 8 presents the runtimes of GBN-BDeu, fsANB-BDeu, and the proposed PC search method. The results show that the runtimes of fsANB-BDeu were shorter than those of GBN-BDeu in all the datasets, because the execution speed of the exact learning ANB was almost twice that of the exact learning GBN, as described in Section 4. Moreover, the runtimes of fsANB-BDeu were much shorter than those of GBN-BDeu when our PC search method removed many variables, such as in the No. 34 and No. 39 datasets. This is because the runtimes of GBN-BDeu decrease exponentially as variables are removed, whereas our PC search method itself has a negligibly small runtime compared with that of exact learning, as shown in Table 8. As a result, the proposed fsANB-BDeu provides the best classification performance among all the methods, with a lower computational cost than that of GBN-BDeu.

Conclusions
First, this study compared the classification performances of the BNs exactly learned by BDeu as a generative model and those learned approximately by CLL as a discriminative model. Surprisingly, the results demonstrated that the performance of BNs obtained by maximizing ML was better than that of BNs obtained by maximizing CLL for large data. However, the results also showed that the classification accuracies of the BNs learned exactly by BDeu were much worse than those learned by the other methods when the class variable had numerous parents. To solve this problem, this study proposed an exact learning ANB that maximizes BDeu as a generative model. The proposed method asymptotically learns the optimal ANB, which is an I-map with the fewest parameters among all possible ANB structures. In addition, the proposed ANB is guaranteed to asymptotically estimate the identical conditional probability of the class variable to that of the exactly learned GBN. Based on these properties, the proposed method is effective not only for classification but also for decision making, which requires highly accurate probability estimates of the class variable. Furthermore, learning an ANB has lower computational costs than learning a GBN does. The experimental results demonstrated that the proposed method significantly outperformed the structures learned approximately by maximizing CLL.
We plan on exploring the following in future work.

1.
It is known that neural networks are universal approximators, meaning that they can approximate any function to an arbitrarily small error. However, Choi et al. [45] showed that the functions induced by BN queries are polynomials. To make their queries universal approximators, they proposed a testing BN, which chooses a parameter value depending on a threshold instead of simply having a fixed parameter value. We will apply our proposed method to the testing BN.

2.
Recent studies have developed methods for compiling BNCs into Boolean circuits that have the same input-output behavior [46,47]. We can explain and verify any BNCs by operating on their compiled circuits [47][48][49]. We will apply the compiling method to our proposed method.

3.
Sugahara et al. [50] proposed the Bayesian network model averaging classifier with subbagging to improve the classification accuracy for small data. We will extend our proposed method to the model averaging classifier.
The above future works are expected to improve the classification accuracies and comprehensibility of our proposed method.