1. Introduction
Bayesian networks are acyclic graphical models that can effectively model complex random systems with probabilistic dependencies among random variables. This structured framework plays a crucial role in statistical classification tasks, where accurately assigning class labels to unclassified samples is highly desirable [1]. Bayesian classifiers assume $(n-1)$-dependency among the $n$ feature variables [2]; for instance, when computing the probabilities of various diseases as class labels, given $n$ symptoms as features, the network assumes each feature may possibly be influenced by all $n-1$ other features. The utility of Bayesian classifiers extends across various application domains, including health care [3], object detection [4], document classification [5], fraud detection [6], and spam filtering [7], showcasing their versatility. While Bayesian classifiers allow for modeling arbitrarily complex dependencies between features across different domains, their computational cost is notable [2]; in particular, to capture complex dependencies and higher-order interactions among feature variables, many applications of Bayesian classifiers rely on computing joint probability distributions [8]. Therefore, as the number of features increases, the computational complexity involved escalates due to the potentially full dependencies among feature variables [9,10]. It is widely acknowledged that optimal learning of Bayesian networks is computationally intractable [11,12].
To cope with the intractability of computing with Bayesian networks, one viable approach is to approximate the joint probability distribution. Technically, this strategy relaxes constraints on dependencies between feature variables when their joint probability distribution is considered. The simplest form of this approximation for Bayes classifiers assumes zero dependencies among features and is referred to as the Naïve Bayes classifier [13]. Such classifiers operate under the assumption that each feature is conditionally independent of the other features given the class label. The simplicity of the model makes it an appropriate choice for various applications, such as image classification [14], spam detection [15,16,17], and sentiment analysis [18,19,20]. However, while the independence assumption significantly simplifies the computation of the joint probability, it may not be realistic, particularly in applications involving complex attribute dependencies [21], where performance can be adversely limited [12]. Therefore, in many practices, probabilistic graph approximations often prioritize structures where feature variables exhibit dependencies, such as those found in tree structures. Specifically, within a tree topology, each feature variable is influenced by at most one other feature variable [22].
In particular, tree augmented Naïve Bayes (TAN) is a preferred approximation of Bayesian networks. It is based on a tree-dependency structure assumption for feature variables [23]. TAN classifiers mitigate the independence assumption among features in Naïve Bayes while still maintaining lower time complexity compared to full Bayesian networks [12]. TAN classifiers demonstrate much better performance than Naïve Bayes in various applications and are a popular choice for many classification tasks. For example, they have found widespread use across diverse domains, including cryptocurrency trend classification [24], anomaly-based intrusion detection [25], facial biotype classification [26], and other prediction tasks [27]. However, in applications heavily characterized by complex attribute dependencies, TAN classifiers may struggle to represent the probabilistic graphical models accurately. This is especially true when one feature is strongly influenced by more than one other feature. For such applications, the tree topology of TAN has been further improved by considering more dependencies for each feature variable. In particular, the k-dependency Bayes classifier (KDB) [2] and its variant [28] have been introduced, which can represent feature dependencies with up to $k$ feature parents for a chosen value of $k$. In other words, each feature is allowed to be influenced by a maximum of $k$ other variables. Unfortunately, in those studies, the construction of k-dependence tree classifiers relied on heuristic ranking methods to determine the $k$ parents of each variable, which do not ensure the optimality of the resulting dependence topology [29].
In this paper, we introduce a novel approach to ensure optimal construction of the topology for KDB. We coin k-tree augmented Naïve Bayes (k-TAN) for the Bayesian networks whose structure is of the k-tree topology. A k-tree is a tree-like graph where vertices (representing feature variables) can be defined in some desired order such that each vertex is introduced with respect to $k$ other (parent) vertices that already exist. In other words, a k-TAN is a KDB where the k-dependency of feature variables follows the k-tree definition. In this work, we prove that the proposed k-TAN enables an optimal approximation of the Bayesian network of k-tree topology. In particular, we show that, under the KL-divergence measurement [30], the k-tree topology approximation has the minimum information loss if the structure topology is taken from a maximum spanning k-tree. In this context, the edge weights of the graph are calculated as the mutual information between feature variables conditional upon the class label. Though finding a maximum spanning k-tree (MSkT) is intractable [31] even for fixed $k \geq 2$ (and thus not fixed-parameter tractable [32]), we demonstrate in this paper that the MSkT problem can be solved in time $O(n^{k+1})$ for every fixed $k$, so long as the maximum spanning k-tree retains a Hamiltonian path presented in the input graph. Consequently, a k-TAN as the approximation of Bayesian networks with minimum loss of information can be found efficiently for small to moderate values of $k$.
The remainder of this paper is organized as follows: Section 2 gives an introduction to Bayesian networks and reviews previous approaches to their approximation, including Naïve Bayes, tree augmented Naïve Bayes (TAN), and k-dependency Bayesian networks (KDBs). Section 3 introduces the k-tree augmented Naïve Bayes (k-TAN), and Section 4 presents a detailed proof that the optimal k-tree topology is ensured by the maximum spanning k-tree. We elaborate on our proposed polynomial-time algorithm to solve MSkT in Section 5.
2. Background
Consider a dataset $D$ that consists of $m$ data samples over the random variable set $X \cup \{Y\}$, where $X = \{X_1, X_2, \ldots, X_n\}$ is the set of $n$ feature variables and $Y$ is the class variable. Thus $D$ can be represented as an $m \times (n+1)$ matrix of numerical values on the $m$ data samples. For simplicity, we assume all variables have the binary domain of values $\{0, 1\}$. The classification problem is, given a dataset $D$ of samples, to construct a predictive model that can accurately classify any input sample of unknown label. Technically, a classifier assigns a class label $y$, either 0 or 1, to the input unclassified sample $\mathbf{x} = (x_1, \ldots, x_n)$, where $x_i$ is a value of feature variable $X_i$ in sample $\mathbf{x}$, $1 \leq i \leq n$, such that the posterior probability of $y$ given $\mathbf{x}$ is maximized.
Definition 1. A Bayesian network over random variables $X \cup \{Y\}$ is a probabilistic, directed acyclic graph $G = (V, E)$, where vertices in $V$ denote random variables in $X \cup \{Y\}$ and directed edges in $E$ between vertices denote causal dependencies between the corresponding random variables. Consequently, the absence of an edge between two particular variables signifies their causal independence.
In the Bayesian network, if $(X_j, X_i) \in E$, variable $X_j$ is called a parent of variable $X_i$. We denote with $\pi(i)$ the set of indexes of parents for $X_i$; that is, $\pi(i) = \{\, j : (X_j, X_i) \in E \,\}$. Due to the causal independence assumption between variables that do not share directed edges, the likelihood of variables in $X$ given class variable $Y$ can be decomposed and written as
$$P(X_1, \ldots, X_n \mid Y) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid X_{\pi(i)}, Y\bigr) \qquad (1)$$
where $X_{\pi(i)}$ denotes the set of parent variables of $X_i$.
Based on the well-known Bayes formula $P(Y \mid X) = P(X \mid Y)\,P(Y)/P(X)$ [8] and (1), the predicted class label $y^{*}$ on any given feature values $\mathbf{x} = (x_1, \ldots, x_n)$ can be computed as the one with the highest posterior probability. That is,
$$y^{*} \;=\; \arg\max_{y} \; \frac{P(y)\,\prod_{i=1}^{n} P\bigl(x_i \mid x_{\pi(i)}, y\bigr)}{P(\mathbf{x})} \qquad (2)$$
where the denominator $P(\mathbf{x})$ can be dropped. However, the above formula indicates that $\pi(i)$ may contain up to $n-1$ indexes of variables and the probability distribution may be of an order of $n$, making the computation of the likelihood challenging with the standard Bayesian network classifiers.
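As a concrete illustration of this growth (our own sketch, not part of the original analysis), the following Python snippet counts the conditional probability table entries needed for binary variables when each feature has a given number of parents:

```python
# A small illustration (ours) of why unrestricted dependencies are costly: the
# number of entries in the tables P(X_i | parents of X_i, Y) grows exponentially
# with the number of parents. All variables are assumed binary.
def cpt_parameters(n_features: int, parents_per_feature: int) -> int:
    """Total entries across all per-feature conditional probability tables."""
    # Each feature stores one entry per joint assignment of
    # (its own value, its parents' values, Y): 2 * 2^parents * 2 per feature.
    return n_features * 2 * (2 ** parents_per_feature) * 2

for n in (10, 20, 30):
    print(f"n={n}: full={cpt_parameters(n, n - 1):,}  "
          f"naive={cpt_parameters(n, 0)}  k=2={cpt_parameters(n, 2)}")
```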
2.1. Naïve Bayes
Naïve Bayes classifier [12] is a Bayesian classification model that makes the naïve assumption that each feature variable $X_i$ is conditionally independent of the other feature variables given the class label variable $Y$, implying that there are no causal dependencies between feature variables in the graphical model. This assumption greatly simplifies the computation with Bayesian networks because the likelihood of variables in $X$ given class variable $Y$ can be computed with
$$P(X_1, \ldots, X_n \mid Y) \;=\; \prod_{i=1}^{n} P(X_i \mid Y)$$
Therefore, under the Naïve Bayes modeling, maximizing the posterior probability in Equation (2) to predict the class label on a given collection of feature values $\mathbf{x}$ is computed as
$$y^{*} \;=\; \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
Note that the above optimization can be very efficient since the feature values $\mathbf{x}$ are given, and the maximization over the choices of $y$ does not need to consider the value of $P(\mathbf{x})$ for the given $\mathbf{x}$. The typical topology structure of the Naïve Bayes classifier is illustrated in Figure 1a.
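For concreteness, here is a minimal sketch (ours, under the binary-domain assumption above) of Naïve Bayes training and prediction from frequency counts; the add-one smoothing is our addition to avoid zero probabilities and is not part of the formulas above:

```python
# A minimal Naive Bayes sketch for binary features and labels, following the
# factorization P(y) * prod_i P(x_i | y). D is an m x (n+1) list of rows
# [x_1, ..., x_n, y]; add-one (Laplace) smoothing is our addition.
import math
from collections import Counter

def train_naive_bayes(D):
    n = len(D[0]) - 1
    class_counts = Counter(row[-1] for row in D)
    feat_counts = Counter()
    for row in D:
        for i in range(n):
            feat_counts[(i, row[i], row[-1])] += 1
    return n, class_counts, feat_counts, len(D)

def predict(model, x):
    n, class_counts, feat_counts, m = model
    best_y, best_score = None, -math.inf
    for y in (0, 1):
        # log P(y) + sum_i log P(x_i | y), with add-one smoothing
        score = math.log((class_counts[y] + 1) / (m + 2))
        for i in range(n):
            score += math.log((feat_counts[(i, x[i], y)] + 1)
                              / (class_counts[y] + 2))
        if score > best_score:
            best_y, best_score = y, score
    return best_y

D = [[1, 0, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0], [0, 1, 0, 0]]
print(predict(train_naive_bayes(D), [1, 0, 1]))  # expected: 1
```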
2.2. Tree Augmented Naïve Bayes
There have been some further significant developments on augmenting the Naïve Bayes classifier. These efforts aim to relax the independence assumption among features while maintaining lower time complexity compared to full Bayesian networks. Tree structures are preferred as a feasible probabilistic graph approximation, namely tree augmented Naïve Bayes (TAN) [33], which permits causal dependencies between feature variables as well as the dependence between every feature variable and the label variable $Y$. The overall structure topology of the causal dependencies is yet simple enough to form a tree. Therefore, every feature variable, except one, has exactly one other feature variable as its parent [23]. Figure 1b illustrates such a topology. For TAN classifiers, the likelihood of variables in $X$ given class variable $Y$ can be written as
$$P(X_1, \ldots, X_n \mid Y) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid X_{p(i)}, Y\bigr)$$
where, for each $i$, $X_{p(i)}$ is the parent feature variable of $X_i$, i.e., $(X_{p(i)}, X_i) \in E$, for which the parent relation is acyclic, and for exactly one $j$, $X_{p(j)}$ does not exist.
With the TAN model, maximizing the posterior probability in Equation (2) to predict the class label on a given collection of feature values $\mathbf{x}$ is expressed as
$$y^{*} \;=\; \arg\max_{y} \; P(y) \prod_{i=1}^{n} P\bigl(x_i \mid x_{p(i)}, y\bigr)$$
2.3. Optimality of TAN Topology
The structure topology of causal relationships among the feature variables in a TAN classifier is a tree. Unlike Naïve Bayes, where feature variables are assumed independent, TANs of different tree topologies approximate the joint distribution of the underlying (unknown) Bayesian network to different extents. This discrepancy between the unknown joint distribution $P$ and the approximated distribution $P_T$ is the information loss due to structure topology approximation by some TAN, which is measurable with the KL-divergence [30]. Indeed, the question here is how to represent the high-order relationships of $P$ with binary relationships characterized as a tree structure that has the minimum information loss. In addressing this issue for general graphical models, the seminal work of Chow and Liu [22] showed that the smallest KL-divergence is achieved by the topology of a maximum spanning tree where edge weights take the mutual information between feature variables. This idea was adopted to determine the optimal TAN network structure $T^{*}$ through computing a maximum spanning tree over the mutual information between feature variables conditional upon the label variable [23], where
$$T^{*} \;=\; \arg\max_{T} \; \sum_{i=1}^{n} I\bigl(X_i; X_{p(i)} \mid Y\bigr)$$
in which $p(i)$ is the parent of $i$ in tree $T$.
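The construction can be sketched in a few lines of Python (ours; probabilities are raw empirical frequencies, and the maximum spanning tree is computed with a simple Prim-style scan, whereas any maximum spanning tree algorithm would do):

```python
# A sketch of the TAN structure step described above: estimate the conditional
# mutual information I(X_i; X_j | Y) from data, then take a maximum spanning
# tree over these edge weights. No smoothing; dataset rows are [x_1..x_n, y].
import math
from collections import Counter

def cond_mutual_info(D, i, j):
    """Empirical I(X_i; X_j | Y) for columns i, j of dataset D (label last)."""
    m = len(D)
    c_y = Counter(row[-1] for row in D)
    c_iy = Counter((row[i], row[-1]) for row in D)
    c_jy = Counter((row[j], row[-1]) for row in D)
    c_ijy = Counter((row[i], row[j], row[-1]) for row in D)
    total = 0.0
    for (xi, xj, y), n_ijy in c_ijy.items():
        # P(xi, xj | y) / (P(xi | y) * P(xj | y)) from counts
        ratio = (n_ijy * c_y[y]) / (c_iy[(xi, y)] * c_jy[(xj, y)])
        total += (n_ijy / m) * math.log(ratio)
    return total

def tan_tree(D):
    """Parent map {child: parent} of a maximum spanning tree over conditional
    mutual information weights (Prim's algorithm, rooted at feature 0)."""
    n = len(D[0]) - 1
    w = {(i, j): cond_mutual_info(D, i, j)
         for i in range(n) for j in range(i + 1, n)}
    in_tree, parent = {0}, {}
    while len(in_tree) < n:
        best = max(((i, j) for i in in_tree
                    for j in range(n) if j not in in_tree),
                   key=lambda e: w[(min(e), max(e))])
        parent[best[1]] = best[0]
        in_tree.add(best[1])
    return parent

D = [[1, 1, 0, 1], [1, 1, 1, 1], [0, 0, 1, 0], [0, 0, 0, 0], [1, 0, 0, 1]]
print(tan_tree(D))  # e.g. {1: 0, 2: 0}: each feature gets one feature parent
```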
While the optimal approximation of Bayesian networks with a tree structure like TAN can be computed efficiently by maximum spanning tree algorithms, in such a model every feature variable can only be a causal effect of one other (single) feature variable. In many practical applications, classifiers capable of modeling causal dependence upon multiple feature variables become necessary. To cope with this situation, the k-dependency Bayesian classifier (KDB) was introduced in [2] to relax causal independence assumptions by allowing every feature variable to have up to $k$ other feature variables as parents, for a chosen value of $k$. In addition to the capability to handle multi-dependency among feature variables, the parameter $k$ in a KDB is adjustable and comes in handy for trading between model construction efficiency and classification performance. On the other hand, however, determining an optimal structure topology for KDB poses a great challenge. Technically, representing the high-order relationships with $k$-order relationships in KDB to minimize the divergence $D_{KL}(P \parallel P_G)$ proves a computationally difficult task.
Hence, the construction of a pertinent k-dependence structure of KDB for applications has resorted to heuristic methods. One typical practice is to compute the conditional mutual information $I(X_i; X_j \mid Y)$ for each pair of feature variables $X_i$ and $X_j$ given class label $Y$. For an existing variable $X_i$ of interest, the $k$ other variables $X_j$ with the highest mutual information with $X_i$ are identified as the feature parent variables for $X_i$. Yet another k-dependence classifier named Extended Tree Augmented Naïve Bayes (ETAN) [28] has also been introduced. Unlike KDB, ETAN employs higher-order conditional mutual information, e.g., $I(X_i; X_j, X_l \mid Y)$, to capture the conditional dependencies between $X_i$ and its parent attributes. A drawback of these strategies is that they may generate a large number of redundant dependencies between feature variables in the model and result in over-fitting. Other methods based on sorting of feature variables and on filtering are all heuristic and do not guarantee the optimality of the determined structure topology [29,34].
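A sketch of this heuristic practice (our reading of the ranking scheme, with hypothetical helper names) is given below; `cmi` is a pairwise conditional mutual information function, such as `cond_mutual_info` from the earlier TAN sketch:

```python
# A sketch of heuristic KDB parent selection as described above: rank features
# by relevance to the label, then give each feature up to k parents chosen by
# highest conditional mutual information among features ranked before it.
import math
from collections import Counter

def mutual_info_with_label(D, i):
    """Empirical I(X_i; Y) for column i of dataset D (label last)."""
    m = len(D)
    c_i, c_y = Counter(r[i] for r in D), Counter(r[-1] for r in D)
    c_iy = Counter((r[i], r[-1]) for r in D)
    return sum((n / m) * math.log(n * m / (c_i[xi] * c_y[y]))
               for (xi, y), n in c_iy.items())

def kdb_parents(D, k, cmi):
    """Heuristic k-parent assignment; cmi(D, i, j) scores candidate parents."""
    n = len(D[0]) - 1
    order = sorted(range(n), key=lambda i: -mutual_info_with_label(D, i))
    parents = {}
    for pos, i in enumerate(order):
        earlier = order[:pos]
        ranked = sorted(earlier, key=lambda j: -cmi(D, i, j))
        parents[i] = ranked[:k]   # up to k parents; fewer for early features
    return parents

# usage (with the cond_mutual_info helper from the TAN sketch above):
# parents = kdb_parents(D, k=2, cmi=cond_mutual_info)
```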
4. Optimal k-Tree Topology
In general, different graph topologies for k-TAN may result in different performances of the model. The theoretical performance of approximated graphical models can be effectively measured with information theoretics. Let $P$ be the true yet unknown probability distribution of random variables defined on some graphical model, and let $G$ be the structure topology used to approximate the model. Then the information loss due to the approximation can be measured with the KL-divergence
$$D_{KL}(P \parallel P_G) \;=\; \sum_{\mathbf{x}, y} P(\mathbf{x}, y) \log \frac{P(\mathbf{x}, y)}{P_G(\mathbf{x}, y)} \qquad (5)$$
where $P_G$ is the joint probability distribution of random variables pertaining to the approximated structure topology $G$. An optimal structure topology for the approximation is the one that minimizes the information loss $D_{KL}(P \parallel P_G)$. Chow and Liu initiated such approximation with tree topologies in their seminal work [22]. They proved that the optimal approximation can be achieved via a tree topology that yields the maximum sum of mutual information. The result has been applied to finding the optimal tree topology for the construction of tree augmented Bayesian networks, TANs [23]. Unfortunately, using the KL-divergence measurement to compute the information loss due to topology approximation with $G$, and especially to find an optimal topology approximation, has not been successful on non-tree structures, in particular for KDB and its variants.
An optimal structure topology of k-TAN for classification tasks can be identified by minimizing the information loss due to topology approximation with k-TAN. Specifically, the optimal structure $G^{*}$ is computed with
$$G^{*} \;=\; \arg\min_{G} \; D_{KL}(P \parallel P_G) \qquad (6)$$
where $P$ is the unknown probability distribution over the feature variables $X$ and the label variable $Y$, and $P_G$ is the distribution enabled by the structure topology $G$ over the same set of variables in the k-TAN.
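A tiny numeric illustration (ours) of the divergence in (5): replacing a correlated joint distribution by its independence approximation incurs a strictly positive KL-divergence:

```python
# A tiny numeric check (ours) of the KL-divergence in (5), for a pair of binary
# variables: P is correlated, P_G is the independence approximation.
import math

def kl(p, q):
    """D_KL(P || Q) in nats for distributions given as dicts over the same keys."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)

P   = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}      # correlated pair
P_G = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}  # independence
print(kl(P, P_G))  # ~0.193 nats > 0: the simpler topology loses information
```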
We now give a derivation of an optimal k-tree topology of k-TAN. By (5), the goal of Equation (6) can be achieved by the minimization computation of the relative entropy as
$$\min_{G} \; -\sum_{\mathbf{x}, y} P(\mathbf{x}, y) \log P_G(\mathbf{x}, y) \qquad (7)$$
where pair $(\mathbf{x}, y)$ is a data sample for variables $X$ and $Y$. By conditional entropy, the expression in (7) can be computed exactly as
$$-\sum_{\mathbf{x}, y} P(\mathbf{x}, y) \log P_G(\mathbf{x}, y) \;=\; -\sum_{\mathbf{x}, y} P(\mathbf{x}, y) \log P(y) \;-\; \sum_{\mathbf{x}, y} P(\mathbf{x}, y) \sum_{i=1}^{n} \log P\bigl(x_i \mid x_{\pi(i)}, y\bigr)$$
The second term of the right-hand side of the above equation can be turned into
$$-\sum_{i=1}^{n} \sum_{\mathbf{x}, y} P(\mathbf{x}, y) \log \left( \frac{P\bigl(x_i \mid x_{\pi(i)}, y\bigr)}{P(x_i \mid y)} \cdot P(x_i \mid y) \right)$$
The last term can be decomposed into:
$$\begin{aligned} & -\sum_{i=1}^{n} \sum_{x_i,\, x_{\pi(i)},\, y} P\bigl(x_i, x_{\pi(i)}, y\bigr) \log \frac{P\bigl(x_i, x_{\pi(i)} \mid y\bigr)}{P(x_i \mid y)\, P\bigl(x_{\pi(i)} \mid y\bigr)} \;-\; \sum_{i=1}^{n} \sum_{x_i,\, y} P(x_i, y) \log P(x_i \mid y) \\ &= -\sum_{i=1}^{n} I\bigl(X_i; X_{\pi(i)} \mid Y\bigr) \;+\; \sum_{i=1}^{n} H(X_i \mid Y) \end{aligned}$$
The second-to-last equality holds because the probability $P(\mathbf{x}, y)$ is projected onto the variable $X_i$, resulting in $P(x_i, y)$ and omitting other variables due to the summation of their probabilities. It is true also because $P(\mathbf{x}, y)$ is likewise projected onto the variable $X_i$ and the variables $X_j$ for all $j \in \pi(i)$, resulting in $P(x_i, x_{\pi(i)}, y)$. The last equality is due to the definitions of entropy and mutual information.
Since the terms of entropies are independent of the choices of $G$, the minimization of the left-hand side is realized by the maximization of the sum $\sum_{i=1}^{n} I(X_i; X_{\pi(i)} \mid Y)$. Note that $I(X_i; X_{\pi(i)} \mid Y)$ is the mutual information between variable $X_i$ and its variable parent set $X_{\pi(i)}$ conditional upon class variable $Y$, where $\pi$ is drawn from the acyclic k-tree topology $G$ that is imposed on the k-TAN. Therefore, we conclude:
Theorem 1. The optimal structure topology G for feature variables in k-TAN can be determined by maximizing the sum $\sum_{i=1}^{n} I(X_i; X_{\pi(i)} \mid Y)$, where π represents the k-tree structure among feature variables in G.
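As a sketch of how the Theorem 1 objective can be evaluated on data (our code; the entropy-based identity $I(X; S \mid Y) = H(X, Y) + H(S \cup \{Y\}) - H(\{X\} \cup S \cup \{Y\}) - H(Y)$ is standard), the following scores a candidate parent assignment:

```python
# A sketch (ours) of the Theorem 1 objective: score a candidate k-tree topology
# by sum_i I(X_i; X_pi(i) | Y), computing set-conditional mutual information
# from empirical joint entropies.
import math
from collections import Counter

def entropy(D, cols):
    """Empirical joint entropy (nats) of the given columns of dataset D."""
    m = len(D)
    counts = Counter(tuple(row[c] for c in cols) for row in D)
    return -sum((n / m) * math.log(n / m) for n in counts.values())

def topology_score(D, parents):
    """parents maps feature column i -> list of its parent feature columns."""
    y = len(D[0]) - 1             # label is the last column
    score = 0.0
    for i, pi in parents.items():
        # I(X_i; X_pi | Y) = H(X_i,Y) + H(X_pi,Y) - H(X_i,X_pi,Y) - H(Y)
        score += (entropy(D, [i, y]) + entropy(D, pi + [y])
                  - entropy(D, [i] + pi + [y]) - entropy(D, [y]))
    return score

D = [[1, 1, 0, 1], [1, 0, 1, 1], [0, 0, 1, 0], [0, 1, 0, 0]]
print(topology_score(D, {1: [0], 2: [0, 1]}))  # score of one 2-tree topology
```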
5. Finding Maximum Spanning k-Tree
We show in this section that the maximization problem in information theoretics given in Theorem 1 has an equivalent problem in graph theoretics.
5.1. Maximum Spanning k-Tree
Definition 6. Let $G = (V, E)$ be a directed acyclic graph. A real-valued function $f: 2^{V} \to \mathbb{R}$ is a neighborhood function if, for any subset $\Delta \subseteq V$, $f(\Delta) \neq 0$ if and only if Δ is an acyclic clique.
Definition 7. Let $k \geq 1$ be a fixed integer and f be a neighborhood function. The Maximum Spanning k-Tree problem associated with f, denoted as MSkT, is: given a directed acyclic graph $G = (V, E)$, find a spanning acyclic k-tree H that maximizes the sum
$$W_f(H) \;=\; \sum_{v \in V} f\bigl(\Delta_H(v)\bigr)$$
where $\Delta_H(v)$ denotes the acyclic clique formed by v along with its parent vertices in H.
Note that problem MSkT is a generic problem that becomes specific when the neighborhood function $f$ is specified. MSkT generalizes the traditional maximum spanning tree problem, where the neighborhood function $f(\Delta_H(v))$ is simply $w(u, v)$, the weight between $v$ and its parent $u$. A further generalization is to consider $\Delta_H(v)$ to be a hyperedge shared by $v$ and its parent vertices and the neighborhood function $f(\Delta_H(v))$ to be the weight on the hyperedge. In particular, we can cast the maximization problem in Theorem 1 as the problem MSkT for the choice of the neighborhood function $f$ as follows:
$$f\bigl(\Delta_H(v)\bigr) \;=\; I\bigl(X_v; X_{\pi(v)} \mid Y\bigr)$$
where vertex $v$ denotes the random feature variable $X_v$ in the topology of the k-tree associated with the k-TAN being constructed and $\pi(v)$ is the set of $v$'s parent vertices.
We will show that the MSkT problem is intimately related to the known decision problem Spanning k-Tree, which answers the following question: given a graph, does it possess a subgraph that is a spanning k-tree? The Spanning k-Tree problem appears to be difficult to solve, as it has been proved NP-hard [31] even for fixed $k \geq 2$. We shall connect this intractable problem to MSkT for some specific functions $f$, and the connection will, unfortunately, carry the intractability over to the latter.
Definition 8. Let $k \geq 1$ be a fixed integer and f be a neighborhood function. The problem D-MSkT is a decision problem: given a directed acyclic graph $G$ and a real number threshold w, determine whether G contains a spanning k-tree H such that the value $W_f(H) \geq w$.
It is not difficult to see that, if we choose the neighborhood function $f$ to be the count of directed edges from the parent vertices of $v$ to $v$ in the k-tree $H$, then $W_f(H)$ counts all edges of $H$: the term for the initial k-clique is the total count of its edges, and each term $f(\Delta_H(v))$ is the count of the $k$ new edges due to the vertex $v$ introduced to an existing k-clique. Clearly the count is $\binom{k}{2} + (n-k)k$. Therefore, the input directed acyclic graph $G$ contains a spanning k-tree $H$ if and only if the value $W_f(H) = \binom{k}{2} + (n-k)k$. The above analysis yields a polynomial-time transformation from problem Spanning k-Tree to D-MSkT and thus a justification for the intractability of the latter. That is, there are (simple) neighborhood functions $f$ for which problem D-MSkT is NP-hard. This leads to the intractability of the MSkT problem as well.
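As a quick arithmetic check of this count (under our reconstruction that the k-tree grows from an initial k-clique, each of the $n-k$ subsequently introduced vertices contributing $k$ edges):

```latex
% Worked check of the edge count of a k-tree on n vertices:
\binom{k}{2} + (n-k)\,k \;=\; kn - \frac{k(k+1)}{2},
\qquad\text{e.g., } k = 2,\ n = 6:\quad 1 + 4 \cdot 2 \;=\; 9 \;=\; 2 \cdot 6 - 3 .
```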
Theorem 2. For each fixed value of $k \geq 2$, there are neighborhood weight functions f for which problem MSkT is NP-hard.
The above analyses and those from the previous section suggest that the problem MSkT (and thus the problem of finding an optimal structure topology for k-TAN as well) is computationally intractable even for fixed values of $k$. Nevertheless, we will show in the next section that, with a meaningful restriction on the topology structure of the k-tree, polynomial-time algorithms can be obtained for MSkT.
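To make the objective $W_f(H)$ concrete before restricting the problem, here is an exponential brute-force reference (ours) for $k = 2$ on an undirected view of the input; it enumerates spanning 2-trees by their construction orders, which is only feasible on tiny graphs and underscores why the restriction of the next subsection matters:

```python
# A brute-force reference (ours) for the MSkT objective with k = 2 on an
# undirected graph: f(v, a, b) scores introducing vertex v onto the existing
# 2-clique (edge) {a, b}. Exponential time; for validation on tiny inputs only.
from itertools import combinations

def max_spanning_2tree(vertices, edges, f):
    """Return the best total score over all spanning 2-trees of the graph,
    or None if no spanning 2-tree exists."""
    E = {frozenset(e) for e in edges}
    best = None

    def grow(placed, tree_edges, score):
        nonlocal best
        if len(placed) == len(vertices):
            best = score if best is None else max(best, score)
            return
        for v in vertices:
            if v in placed:
                continue
            for (a, b) in tree_edges:
                # v is introduced with respect to the existing 2-clique {a, b};
                # both connecting edges must be present in the input graph.
                if frozenset((v, a)) in E and frozenset((v, b)) in E:
                    grow(placed | {v},
                         tree_edges | {(v, a), (v, b)},
                         score + f(v, a, b))

    # The construction starts from any 2-clique (edge) of the input graph.
    for a, b in combinations(vertices, 2):
        if frozenset((a, b)) in E:
            grow({a, b}, {(a, b)}, 0.0)
    return best

# Sanity check: with f counting the 2 new edges per introduced vertex (plus the
# 1 initial edge), every spanning 2-tree of K4 has 2*4 - 3 = 5 edges.
V = [0, 1, 2, 3]
K4 = [(u, v) for u, v in combinations(V, 2)]
print(1 + max_spanning_2tree(V, K4, lambda v, a, b: 2))  # -> 5
```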
5.2. Backbone k-Trees
In this section, we focus on a specific type of k-tree whose topology imposes a linear order on the involved vertices. For such k-trees, we derive efficient algorithms for MSkT.
Definition 9. Let $G = (V, E)$ be a graph, where $V = \{v_1, v_2, \ldots, v_n\}$. If the edge set E contains a Hamiltonian path $v_1 v_2 \cdots v_n$, then G is called a backbone graph. We also call this Hamiltonian path the backbone, and each edge $(v_t, v_{t+1})$, $1 \leq t < n$, a backbone edge. If G is a k-tree, it is called a backbone k-tree.
Graphs with a built-in Hamiltonian path can be ideal as the structural topology of graphical models arising from meaningful scientific applications. The linearity underlies an important aspect of the relationships among random variables, upon which higher-order, more sophisticated relations are established. For example, the backbone graph topology is most suitable for modeling time-series events and processes, known significant pathways in gene networks, and biomolecular 3D structure, where the backbone of the molecule is coupled in a linear fashion.
We now introduce and prove a few properties associated with backbone k-trees. For this purpose, our discussion will be based on the tree-representation for k-trees in Definition 4. Since backbone edges are of special interest, a more elaborate definition of tree-representation for the backbone k-tree is needed.
Proposition 3. Let G be a backbone k-tree of size n and let interval $[1, n]$ denote the ordered consecutive positions of all vertices on the backbone. Let Δ be a $(k+1)$-clique in G with $p_1 < p_2 < \cdots < p_{k+1}$ being the positions of the vertices in Δ on the backbone in the assumed order. Then these positions partition the consecutive interval $[1, n]$ into at most $k+2$ non-empty, consecutive intervals as $[1, p_1], [p_1, p_2], \ldots, [p_{k+1}, n]$. Let $v \notin \Delta$. We denote with $B_{\Delta}(v)$ the subset of backbone edges formed by vertices whose positions on the backbone are in the same interval containing the position of v.
We now consider a new characterization for backbone k-trees. Specifically, we define the collection of partial backbone k-trees (abbreviated with pbkts) recursively as follows:
Definition 10. Let H be a graph of size $\geq k+1$, Δ be a $(k+1)$-clique in H, and α be a subset of backbone edges. Tuple $\langle H, \Delta, \alpha \rangle$ is a pbkt rooted at $(k+1)$-clique Δ, which retains backbone edges in α, if
- (1)
either $H = \Delta$ and $\alpha \subseteq E(\Delta)$;
- (2)
or $H = H_1 \cup H_2$, where $\langle H_1, \Delta, \beta \rangle$ and $\langle H_2, \Delta[u \to v], \gamma \rangle$ are pbkts rooted at $(k+1)$-cliques Δ and $\Delta[u \to v]$ and retain backbone edges in β and γ, respectively, such that for some vertices $u \in \Delta$ and v that occurs in α but not in Δ, $\alpha = \beta \cup \gamma$ and $\beta \cap \gamma = \emptyset$.
$\Delta[u \to v]$ denotes the $(k+1)$-clique derived from Δ, such that u is removed and replaced by v along with new edges added on.
Note that the term partial k-tree has been used for subgraphs of k-trees [35]. Since a subgraph of a backbone k-tree does not necessarily contain all the backbone edges, a pbkt is defined with an associated subset of backbone edges. The use of a $(k+1)$-clique as the root is technical; it makes it feasible to view backbone k-trees from the perspective of a relation among $(k+1)$-cliques.
The atomic case of a pbkt is simply a $(k+1)$-clique that only retains backbone edges that are already on the clique. Furthermore, it is not difficult to see that, if G is a (full) backbone k-tree, there is a $(k+1)$-clique Δ in G to serve as the root such that tuple $\langle G, \Delta, \alpha \rangle$ is a pbkt with α being the complete set of backbone edges.
We now examine two properties of backbone k-trees important to developing efficient algorithms that find the maximum backbone k-tree.
Lemma 1. Let $k \geq 1$ and $G = (V, E)$ be a backbone k-tree. Then, any $(k+1)$-clique Δ in G separates G into at most $k+2$ connected components.
Proof. By Proposition 3, the positions of the vertices in Δ partition the consecutive interval $[1, n]$ into at most $k+2$ consecutive intervals. We claim that any set of vertices whose positions belong to the same non-empty interval, say $[p_i, p_{i+1}]$, form at most one connected component separated by Δ. Suppose otherwise that there are two connected components $C_1$ and $C_2$ separated by Δ and there exists an integer $t$, where $p_i \leq t < t+1 \leq p_{i+1}$, such that $v_t \in C_1$ and $v_{t+1} \in C_2$. Since backbone edge $(v_t, v_{t+1})$ belongs to neither $C_1$ nor $C_2$, there would have to be another component that contains both $v_t$ and $v_{t+1}$, which cannot exist, leading to a contradiction. Based on the above analysis, the number of connected components in the backbone k-tree $G$ separated by the $(k+1)$-clique is at most $k+2$. (See Figure 3 for an illustration of the above argument). □
The notation $\Delta[u \to v]$ used in Definition 10 underlies how one $(k+1)$-clique comes to exist based on another. We give a formal term to this relationship between two $(k+1)$-cliques.
Definition 11. Let Δ be a $(k+1)$-clique and $\Delta' = \Delta[u \to v]$ for some $u \in \Delta$ and $v \notin \Delta$. $\Delta'$ is said to be a neighbor of Δ.
Lemma 2. Let Δ be a $(k+1)$-clique in a backbone k-tree, which has two different neighbors $\Delta_1 = \Delta[u_1 \to v_1]$ and $\Delta_2 = \Delta[u_2 \to v_2]$. If $v_1 \neq v_2$ and $v_1, v_2 \notin \Delta$, then the positions of $v_1$ and $v_2$ on the backbone belong to two different intervals separated by the positions of the vertices in Δ.
Proof. According to Definition 4, $\Delta_1$ and $\Delta_2$ should be the roots of two different pbkts that lie in two different connected components separated by Δ, to which $v_1$ and $v_2$ belong, respectively. By the proof of Lemma 1, the positions of $v_1$ and $v_2$ on the backbone should belong to two different consecutive intervals separated by the positions of the vertices in Δ. □
5.3. A Polynomial-Time Algorithm
We now present an efficient algorithm for solving the MSkT problem where the desired maximum spanning k-tree should contain a certain backbone (i.e., Hamiltonian path) designated in the input graph. We only consider neighborhood functions $f$ that can be computed in time $O(g(k))$ on any $(k+1)$-clique, for some computable function $g$. The properties demonstrated earlier will facilitate the following discussion of the algorithm.
The problem MSBkT, tailored from MSkT for backbone k-trees, is defined as follows: given a directed acyclic graph $G = (V, E)$ with a designated backbone $B$, find a spanning directed acyclic k-tree $H$ in graph $G$ such that $H$ retains $B$, and the following sum is maximized:
$$W_f(H) \;=\; \sum_{v \in V} f\bigl(\Delta_H(v)\bigr)$$
By default, we assume the designated backbone $B$ in the input graph can always be denoted as $v_1 v_2 \cdots v_n$ by a simple relabeling of the vertices of the graph.
To facilitate the discussion, we denote with $E(\Delta)$ the set of edges formed by vertices in clique Δ. Also recall the notation $B_{\Delta}(v)$ introduced earlier, which represents the subset of backbone edges formed by vertices whose positions on the backbone are in the same interval containing the position of $v$.
Theorem 3. Problem MSBkT can be solved in time $O\bigl(2^{k+2}(k+1)\,g(k)\,n^{k+2}\bigr)$ on input directed, acyclic graphs of n vertices.
Proof. The algorithm finds an optimal spanning backbone k-tree based on the tree-representation of k-trees in Definition 4, which offers a recursive construction of pbkts in the input graph $G$. We define function $S(\Delta, \alpha)$ to be the maximum sum of neighborhood function $f$ values over all $(k+1)$-cliques in a pbkt that is rooted at Δ and retains all backbone edges in α. Then
$$S(\Delta, \alpha) \;=\; \begin{cases} f(\Delta), & \text{if } \alpha \subseteq E(\Delta); \\[4pt] \max\limits_{u \in \Delta,\; v \notin \Delta} \bigl\{ S(\Delta, \beta) + S(\Delta[u \to v], \gamma) \bigr\}, & \text{where } \gamma = \alpha \cap B_{\Delta}(v) \text{ and } \beta = \alpha \setminus \gamma \end{cases}$$
which strictly follows the rules for pbkt given in Definition 10.
Function $S(\Delta, \alpha)$ maximizes its value over all pbkts rooted at Δ that retain the backbone edges in set α. Then a desired spanning backbone k-tree is one of the pbkts rooted at some $(k+1)$-clique Δ, which maximizes the function value $S(\Delta, \alpha)$ with α being the complete set of backbone edges.
The recurrence relation for function $S$ makes it computable with a dynamic programming process. For every pbkt that is simply a $(k+1)$-clique Δ, the value $S(\Delta, \alpha)$ is the neighborhood function value $f(\Delta)$. For every more general case of pbkt, the computation maximizes over all possible vertex pairs $u$ and $v$ that are used to split the pbkt into two subcases of pbkts, one rooted at Δ and another at $\Delta[u \to v]$. The corresponding subset of backbone edges that needs to be retained is also split into two. One subset is $\gamma = \alpha \cap B_{\Delta}(v)$, containing those backbone edges whose positions are in the same interval as the position of vertex $v$. The other one is the rest of the backbone edges, $\beta = \alpha \setminus \gamma$. This is because, by the proof of Lemma 1, the backbone edges whose positions are in the same interval as the position of vertex $v$ belong to the same connected component as $v$.
For the maximization operation, there are $k+1$ choices of vertices in Δ for $u$ and at most $n$ options for $v$. A table of $O(n^{k+1} \cdot m)$ entries can be used for storing the values of function $S(\Delta, \alpha)$, where $m$ is the number of backbone edge subsets that may arise. By Lemma 2, the positions of all backbone edges are partitioned into at most $k+2$ consecutive intervals, and those with positions belonging to the same interval should belong to the same connected pbkt. Consequently, α, β, and γ can each be encoded with a bitmap of length $k+2$, so $m \leq 2^{k+2}$. The overall time used by the algorithm is thus $O\bigl(2^{k+2}(k+1)\,g(k)\,n^{k+2}\bigr)$. □
Corollary 1. For every fixed $k \geq 1$, problem MSBkT can be solved in time $O(n^{k+2})$ on input directed, acyclic graphs of n vertices.
To connect the problem MSBkT to the problem of finding an optimal structure topology for k-TAN that retains a designated Hamiltonian path, the neighborhood function is chosen as $f(\Delta_H(v)) = I(X_v; X_{\pi(v)} \mid Y)$, the mutual information between variable $X_v$ and its parent set variables in the acyclic directed $(k+1)$-clique, where $v$ is the vertex introduced to the clique formed by its parent vertices $\pi(v)$. The function $f$ can be computed in time bounded by a function of $k$.
In addition, an earlier investigation [36] showed that the dynamic programming for computing functions around the backbone k-tree can be approached in a slightly different way, with the argument Δ being a k-clique instead of a $(k+1)$-clique, yielding a time complexity of $O(n^{k+1})$ instead of $O(n^{k+2})$ for every fixed $k$.
Theorem 4. For every fixed $k \geq 1$, the optimal structure topology for feature variables in k-TAN can be determined in time $O(n^{k+1})$, provided that the structure retains a predefined Hamiltonian path in the input graph of n vertices.
Corollary 2. For every fixed $k \geq 1$, there are polynomial-time algorithms for the minimum loss of information approximation of Bayesian networks with a conditional k-tree topology that retains a designated Hamiltonian path in the input network.
6. Conclusions
Bayesian networks offer a structured framework for modeling dependencies among features and class labels, making them invaluable in classification tasks. Generally, there are up to $(n-1)$-order dependencies among features, where $n$ is the total number of features, which need to be considered. Thus, the complexity of acquiring a complete Bayesian network may become intractable with an increasing number of features. Naïve Bayes assumes no dependencies among features given the class label, which is often unrealistic in real-world scenarios. To address this issue, tree augmented Naïve Bayes (TAN) was introduced to capture the most important dependencies between features by constructing a simple topology, a tree. Therefore, TAN is notably preferred for probabilistic graph approximations. However, more complex topologies such as KDB are required for situations where features strongly influence each other. Previous research on k-dependency Bayesian classifiers has largely depended on heuristic algorithms that do not guarantee an optimal topology for the approximation.
We introduced an augmented Naïve Bayes classifier, k-TAN, of a k-dependency topology where the structure among feature variables can be represented as a k-tree graph. We demonstrated that when the k-tree is a maximum spanning k-tree, it provides the optimal approximation of the Bayesian network and ensures minimum information loss under the KL-divergence metric. The edge weights represent the mutual information between feature variables conditioned on the class label. Although finding the maximum spanning k-tree is an intractable task even for fixed $k \geq 2$, this paper presents a polynomial-time approach to solve a constrained case of the problem. That is, if the maximum spanning k-tree is required to retain a designated Hamiltonian path in the graph topology, the maximum spanning k-tree problem can be solved in time $O(n^{k+1})$ for every fixed $k$. Therefore, employing this algorithm ensures the efficient construction of k-TAN classifiers with minimal information loss. The topology constraint of retaining a Hamiltonian path has a wide spectrum of applications with Bayesian networks.
Our final note is that the proposed k-TAN method may be incorporated with some recently developed extensions to Naïve Bayes classifiers, which are orthogonal to the structural topology extension of Naïve Bayes. This includes, for example, a method in which the integration of a latent component into the Naïve Bayes classifier can account for hidden dependencies among attributes in domains such as health care [9], where data complexity and inter-feature correlations are common. Another work investigates relationships between computing the conditional log likelihood and the marginal likelihood under exact and approximate learning settings [37]. Moreover, rather than assuming all features are conditionally independent, the Comonotonic-Inference-Based classifier (CIBer) [38] identifies an optimal partition of features, grouping those that exhibit strong dependencies.