1. Introduction
Bayesian networks are acyclic graphical models that can effectively model complex random systems with probabilistic dependencies among random variables. This structured framework plays a crucial role in statistical classification tasks, where accurately assigning class labels to unclassified samples is highly desirable [1]. Bayesian classifiers assume $(n-1)$-dependency among the $n$ feature variables [2]; for instance, when computing the probabilities of various diseases as class labels, given $n$ symptoms as features, the network assumes each feature may possibly be influenced by all $n-1$ other features. The utility of Bayesian classifiers extends across various application domains, including health care [3], object detection [4], document classification [5], fraud detection [6], and spam filtering [7], showcasing their versatility. While Bayesian classifiers allow for modeling arbitrarily complex dependencies between features across different domains, their computational cost is notable [2]; in particular, to capture complex dependencies and higher-order interactions among feature variables, many applications of Bayesian classifiers rely on computing joint probability distributions [8]. Therefore, as the number of features increases, the computational complexity involved escalates due to the potentially full dependencies among feature variables [9,10]. It is widely acknowledged that optimal learning of Bayesian networks is computationally intractable [11,12].
To cope with the intractability of computing with Bayesian networks, one viable approach is to approximate the joint probability distribution. Technically, this strategy relaxes constraints on dependencies between feature variables when their joint probability distribution is considered. The simplest form of this approximation for Bayes classifiers assumes zero dependencies among features and is referred to as the Naïve Bayes classifier [13]. Such classifiers operate under the assumption that each feature is conditionally independent of the other features given the class label. The simplicity of the model makes it an appropriate choice for various applications, such as image classification [14], spam detection [15,16,17], and sentiment analysis [18,19,20]. However, while the independence assumption significantly simplifies the computation of the joint probability, it may not be realistic, particularly in applications involving complex attribute dependencies [21], where performance can be adversely limited [12]. Therefore, in many practices, probabilistic graph approximations often prioritize structures where feature variables exhibit dependencies, such as those found in tree structures. Specifically, within a tree topology, each feature variable is influenced by at most one other feature variable [22].
In particular, tree augmented Naïve Bayes (TAN) is a preferred approximation of Bayesian networks. It is based on a tree-dependency structure assumption for feature variables [23]. TAN classifiers mitigate the independence assumption among features in Naïve Bayes while still maintaining lower time complexity compared to full Bayesian networks [12]. TAN classifiers demonstrate much better performance than Naïve Bayes in various applications and are a popular choice for many classification tasks. For example, they have found widespread use across diverse domains, including cryptocurrency trend classification [24], anomaly-based intrusion detection [25], facial biotype classification [26], and other prediction tasks [27]. However, in applications heavily characterized by complex attribute dependencies, TAN classifiers may struggle to represent the probabilistic graphical models accurately. This is especially true when one feature is strongly influenced by more than one other feature. For such applications, the tree topology of TAN has been further improved by considering more dependencies for each feature variable. In particular, the k-dependency Bayes classifier (KDB) [2] and its variant [28] have been introduced, which can represent feature dependencies with up to $k$ feature parents for a chosen value of $k$. In other words, each feature is allowed to be influenced by a maximum of $k$ other variables. Unfortunately, in those studies, the construction of k-dependence tree classifiers relied on heuristic ranking methods to determine the $k$ parents of each variable, which do not ensure the optimality of the resulting dependence topology [29].
In this paper, we introduce a novel approach to ensure optimal construction of the topology for KDB. We coin k-tree augmented Naïve Bayes (k-TAN) for the Bayesian networks whose structure is of the k-tree topology. A k-tree is a tree-like graph where vertices (representing feature variables) can be defined in some desired order such that each vertex is introduced with respect to $k$ other (parent) vertices that already exist. In other words, a k-TAN is a KDB where the k-dependency of feature variables follows the k-tree definition. In this work, we prove that the proposed k-TAN enables an optimal approximation of the Bayesian network of k-tree topology. In particular, we show that, under the KL-divergence measurement [30], the k-tree topology approximation has the minimum information loss if the structure topology is taken from a maximum spanning k-tree. In this context, the edge weights of the graph are calculated as the mutual information between feature variables conditional upon the class label. Though finding a maximum spanning k-tree (MSkT) is intractable [31] even for fixed $k \geq 2$ (and thus not fixed-parameter tractable [32]), we demonstrate in this paper that the MSkT problem can be solved in time $O(n^{k+1})$ for every fixed $k$, so long as the maximum spanning k-tree retains a Hamiltonian path presented in the input graph. Consequently, a k-TAN as the approximation of Bayesian networks with minimum loss of information can be found efficiently for small to moderate values of $k$.
The remainder of this paper is organized as follows: Section 2 gives an introduction to Bayesian networks and reviews previous approaches to their approximation, including Naïve Bayes, tree augmented Naïve Bayes (TAN), and k-dependency Bayesian networks (KDBs). Section 3 introduces the k-tree augmented Naïve Bayes (k-TAN), and Section 4 presents a detailed proof that the optimal k-tree topology is ensured by the maximum spanning k-tree. We elaborate on our proposed polynomial-time algorithm to solve MSkT in Section 5.
2. Background
Consider a dataset $D$ that consists of $m$ data samples over the random variable set $X \cup \{Y\}$, where $X = \{X_1, X_2, \ldots, X_n\}$ is the set of $n$ feature variables and $Y$ is the class variable. Thus $D$ can be represented as an $m \times (n+1)$ matrix of numerical values on the $m$ data samples. For simplicity, we assume all variables have the binary domain of values $\{0, 1\}$. The classification problem is, given a dataset $D$ of samples, to construct a predictive model that can accurately classify any input sample of unknown label. Technically, a classifier assigns a class label $y$, either 0 or 1, to the input unclassified sample $\mathbf{x} = (x_1, \ldots, x_n)$, where $x_i$ is a value of feature variable $X_i$ in sample $\mathbf{x}$, $1 \leq i \leq n$, such that the posterior probability of $y$ given $\mathbf{x}$ is maximized.
Definition 1. A Bayesian network over random variables $X \cup \{Y\}$ is a probabilistic, directed acyclic graph $G = (V, E)$, where vertices in $V$ denote random variables in $X \cup \{Y\}$ and directed edges in $E$ between vertices denote causal dependencies between the corresponding random variables. Consequently, the absence of an edge between two particular variables signifies their causal independence.
In the Bayesian network, if $(X_j, X_i) \in E$, variable $X_j$ is called a parent of variable $X_i$. We denote with $\pi(i)$ the set of indexes of parents for $X_i$; that is, $\pi(i) = \{\, j : (X_j, X_i) \in E \,\}$. Due to the causal independence assumption between variables that do not share directed edges, the likelihood of variables in $X$ given class variable $Y$ can be decomposed and written as
$$P(X_1, \ldots, X_n \mid Y) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid X_{\pi(i)}, Y\bigr) \qquad (1)$$
where $X_{\pi(i)}$ denotes the set of parent variables of $X_i$.
Based on the well-known Bayes formula $P(Y \mid X) = P(X \mid Y)\,P(Y)/P(X)$ [8] and (1), the predicted class label $y^{*}$ on any given feature values $\mathbf{x} = (x_1, \ldots, x_n)$ can be computed as the one with the highest posterior probability. That is,
$$y^{*} \;=\; \arg\max_{y} \; \frac{P(y)\,\prod_{i=1}^{n} P\bigl(x_i \mid x_{\pi(i)}, y\bigr)}{P(\mathbf{x})} \qquad (2)$$
where the denominator $P(\mathbf{x})$ can be dropped. However, the above formula indicates that $\pi(i)$ may contain up to $n-1$ indexes of variables and the probability distribution may be of an order of $n$, making the computation of the likelihood challenging with the standard Bayesian network classifiers.
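As a concrete illustration of this growth (our own sketch, not part of the original analysis), the following Python snippet counts the conditional probability table entries needed for binary variables when each feature has a given number of parents:

```python
# A small illustration (ours) of why unrestricted dependencies are costly: the
# number of entries in the tables P(X_i | parents of X_i, Y) grows exponentially
# with the number of parents. All variables are assumed binary.
def cpt_parameters(n_features: int, parents_per_feature: int) -> int:
    """Total entries across all per-feature conditional probability tables."""
    # Each feature stores one entry per joint assignment of
    # (its own value, its parents' values, Y): 2 * 2^parents * 2 per feature.
    return n_features * 2 * (2 ** parents_per_feature) * 2

for n in (10, 20, 30):
    print(f"n={n}: full={cpt_parameters(n, n - 1):,}  "
          f"naive={cpt_parameters(n, 0)}  k=2={cpt_parameters(n, 2)}")
```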
2.1. Naïve Bayes
Naïve Bayes classifier [12] is a Bayesian classification model that makes the naïve assumption that each feature variable $X_i$ is conditionally independent of the other feature variables given the class label variable $Y$, implying that there are no causal dependencies between feature variables in the graphical model. This assumption greatly simplifies the computation with Bayesian networks because the likelihood of variables in $X$ given class variable $Y$ can be computed with
$$P(X_1, \ldots, X_n \mid Y) \;=\; \prod_{i=1}^{n} P(X_i \mid Y)$$
Therefore, under the Naïve Bayes modeling, maximizing the posterior probability in Equation (2) to predict the class label on a given collection of feature values $\mathbf{x}$ is computed as
$$y^{*} \;=\; \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
Note that the above optimization can be very efficient since the feature values $\mathbf{x}$ are given, and the maximization over the choices of $y$ does not need to consider the value of $P(\mathbf{x})$ for the given $\mathbf{x}$. The typical topology structure of the Naïve Bayes classifier is illustrated in Figure 1a.
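For concreteness, here is a minimal sketch (ours, under the binary-domain assumption above) of Naïve Bayes training and prediction from frequency counts; the add-one smoothing is our addition to avoid zero probabilities and is not part of the formulas above:

```python
# A minimal Naive Bayes sketch for binary features and labels, following the
# factorization P(y) * prod_i P(x_i | y). D is an m x (n+1) list of rows
# [x_1, ..., x_n, y]; add-one (Laplace) smoothing is our addition.
import math
from collections import Counter

def train_naive_bayes(D):
    n = len(D[0]) - 1
    class_counts = Counter(row[-1] for row in D)
    feat_counts = Counter()
    for row in D:
        for i in range(n):
            feat_counts[(i, row[i], row[-1])] += 1
    return n, class_counts, feat_counts, len(D)

def predict(model, x):
    n, class_counts, feat_counts, m = model
    best_y, best_score = None, -math.inf
    for y in (0, 1):
        # log P(y) + sum_i log P(x_i | y), with add-one smoothing
        score = math.log((class_counts[y] + 1) / (m + 2))
        for i in range(n):
            score += math.log((feat_counts[(i, x[i], y)] + 1)
                              / (class_counts[y] + 2))
        if score > best_score:
            best_y, best_score = y, score
    return best_y

D = [[1, 0, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0], [0, 1, 0, 0]]
print(predict(train_naive_bayes(D), [1, 0, 1]))  # expected: 1
```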
2.2. Tree Augmented Naïve Bayes
There have been some further significant developments on augmenting the Naïve Bayes classifier. These efforts aim to relax the independence assumption among features while maintaining lower time complexity compared to full Bayesian networks. Tree structures are preferred as a feasible probabilistic graph approximation, namely tree augmented Naïve Bayes (TAN) [33], which permits causal dependencies between feature variables as well as the dependence between every feature variable and the label variable $Y$. The overall structure topology of the causal dependencies is yet simple enough to form a tree. Therefore, every feature variable, except one, has exactly one other feature variable as its parent [23]. Figure 1b illustrates such a topology. For TAN classifiers, the likelihood of variables in $X$ given class variable $Y$ can be written as
$$P(X_1, \ldots, X_n \mid Y) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid X_{p(i)}, Y\bigr)$$
where, for each $i$, $X_{p(i)}$ is the parent feature variable of $X_i$, i.e., $(X_{p(i)}, X_i) \in E$, for which the parent relation is acyclic, and for exactly one $j$, $X_{p(j)}$ does not exist.
With the TAN model, maximizing the posterior probability in Equation (2) to predict the class label on a given collection of feature values $\mathbf{x}$ is expressed as
$$y^{*} \;=\; \arg\max_{y} \; P(y) \prod_{i=1}^{n} P\bigl(x_i \mid x_{p(i)}, y\bigr)$$
2.3. Optimality of TAN Topology
The structure topology of causal relationships among the feature variables in a TAN classifier is a tree. Unlike Naïve Bayes, where feature variables are assumed independent, TANs of different tree topologies approximate the joint distribution of the underlying (unknown) Bayesian network to different extents. This discrepancy between the unknown joint distribution $P$ and the approximated distribution $P_T$ is the information loss due to structure topology approximation by some TAN, which is measurable with the KL-divergence [30]. Indeed, the question here is how to represent the high-order relationships of $P$ with binary relationships characterized as a tree structure that has the minimum information loss. In addressing this issue for general graphical models, the seminal work of Chow and Liu [22] showed that the smallest KL-divergence is achieved by the topology of a maximum spanning tree where edge weights take the mutual information between feature variables. This idea was adopted to determine the optimal TAN network structure $T^{*}$ through computing a maximum spanning tree over the mutual information between feature variables conditional upon the label variable [23], where
$$T^{*} \;=\; \arg\max_{T} \; \sum_{i=1}^{n} I\bigl(X_i; X_{p(i)} \mid Y\bigr)$$
in which $p(i)$ is the parent of $i$ in tree $T$.
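The construction can be sketched in a few lines of Python (ours; probabilities are raw empirical frequencies, and the maximum spanning tree is computed with a simple Prim-style scan, whereas any maximum spanning tree algorithm would do):

```python
# A sketch of the TAN structure step described above: estimate the conditional
# mutual information I(X_i; X_j | Y) from data, then take a maximum spanning
# tree over these edge weights. No smoothing; dataset rows are [x_1..x_n, y].
import math
from collections import Counter

def cond_mutual_info(D, i, j):
    """Empirical I(X_i; X_j | Y) for columns i, j of dataset D (label last)."""
    m = len(D)
    c_y = Counter(row[-1] for row in D)
    c_iy = Counter((row[i], row[-1]) for row in D)
    c_jy = Counter((row[j], row[-1]) for row in D)
    c_ijy = Counter((row[i], row[j], row[-1]) for row in D)
    total = 0.0
    for (xi, xj, y), n_ijy in c_ijy.items():
        # P(xi, xj | y) / (P(xi | y) * P(xj | y)) from counts
        ratio = (n_ijy * c_y[y]) / (c_iy[(xi, y)] * c_jy[(xj, y)])
        total += (n_ijy / m) * math.log(ratio)
    return total

def tan_tree(D):
    """Parent map {child: parent} of a maximum spanning tree over conditional
    mutual information weights (Prim's algorithm, rooted at feature 0)."""
    n = len(D[0]) - 1
    w = {(i, j): cond_mutual_info(D, i, j)
         for i in range(n) for j in range(i + 1, n)}
    in_tree, parent = {0}, {}
    while len(in_tree) < n:
        best = max(((i, j) for i in in_tree
                    for j in range(n) if j not in in_tree),
                   key=lambda e: w[(min(e), max(e))])
        parent[best[1]] = best[0]
        in_tree.add(best[1])
    return parent

D = [[1, 1, 0, 1], [1, 1, 1, 1], [0, 0, 1, 0], [0, 0, 0, 0], [1, 0, 0, 1]]
print(tan_tree(D))  # e.g. {1: 0, 2: 0}: each feature gets one feature parent
```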
While the optimal approximation of Bayesian networks with a tree structure like TAN can be computed efficiently by maximum spanning tree algorithms, in such a model every feature variable can only be a causal effect of one other (single) feature variable. In many practical applications, classifiers capable of modeling causal dependence upon multiple feature variables become necessary. To cope with this situation, the k-dependency Bayesian classifier (KDB) was introduced in [2] to relax causal independence assumptions by allowing every feature variable to have up to $k$ other feature variables as parents, for a chosen value of $k$. In addition to the capability to handle multi-dependency among feature variables, the parameter $k$ in a KDB is adjustable and comes in handy for trading between model construction efficiency and classification performance. On the other hand, however, determining an optimal structure topology for KDB poses a great challenge. Technically, representing the high-order relationships with $k$-order relationships in KDB to minimize the divergence $D_{KL}(P \parallel P_G)$ proves a computationally difficult task.
Hence, the construction of a pertinent k-dependence structure of KDB for applications has resorted to heuristic methods. One typical practice is to compute the conditional mutual information $I(X_i; X_j \mid Y)$ for each pair of feature variables $X_i$ and $X_j$ given class label $Y$. For an existing variable $X_i$ of interest, the $k$ other variables $X_j$ with the highest mutual information with $X_i$ are identified as the feature parent variables for $X_i$. Yet another k-dependence classifier named Extended Tree Augmented Naïve Bayes (ETAN) [28] has also been introduced. Unlike KDB, ETAN employs higher-order conditional mutual information, e.g., $I(X_i; X_j, X_l \mid Y)$, to capture the conditional dependencies between $X_i$ and its parent attributes. A drawback of these strategies is that they may generate a large number of redundant dependencies between feature variables in the model and result in over-fitting. Other methods based on sorting of feature variables and on filtering are all heuristic and do not guarantee the optimality of the determined structure topology [29,34].
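A sketch of this heuristic practice (our reading of the ranking scheme, with hypothetical helper names) is given below; `cmi` is a pairwise conditional mutual information function, such as `cond_mutual_info` from the earlier TAN sketch:

```python
# A sketch of heuristic KDB parent selection as described above: rank features
# by relevance to the label, then give each feature up to k parents chosen by
# highest conditional mutual information among features ranked before it.
import math
from collections import Counter

def mutual_info_with_label(D, i):
    """Empirical I(X_i; Y) for column i of dataset D (label last)."""
    m = len(D)
    c_i, c_y = Counter(r[i] for r in D), Counter(r[-1] for r in D)
    c_iy = Counter((r[i], r[-1]) for r in D)
    return sum((n / m) * math.log(n * m / (c_i[xi] * c_y[y]))
               for (xi, y), n in c_iy.items())

def kdb_parents(D, k, cmi):
    """Heuristic k-parent assignment; cmi(D, i, j) scores candidate parents."""
    n = len(D[0]) - 1
    order = sorted(range(n), key=lambda i: -mutual_info_with_label(D, i))
    parents = {}
    for pos, i in enumerate(order):
        earlier = order[:pos]
        ranked = sorted(earlier, key=lambda j: -cmi(D, i, j))
        parents[i] = ranked[:k]   # up to k parents; fewer for early features
    return parents

# usage (with the cond_mutual_info helper from the TAN sketch above):
# parents = kdb_parents(D, k=2, cmi=cond_mutual_info)
```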
4. Optimal k-Tree Topology
In general, different graph topologies for k-TAN may result in different performances of the model. The theoretical performance of approximated graphical models can be effectively measured with information theoretics. Let $P$ be the true yet unknown probability distribution of random variables defined on some graphical model, and let $G$ be the structure topology used to approximate the model. Then the information loss due to the approximation can be measured with the KL-divergence
$$D_{KL}(P \parallel P_G) \;=\; \sum_{\mathbf{x}, y} P(\mathbf{x}, y) \log \frac{P(\mathbf{x}, y)}{P_G(\mathbf{x}, y)} \qquad (5)$$
where $P_G$ is the joint probability distribution of random variables pertaining to the approximated structure topology $G$. An optimal structure topology for the approximation is the one that minimizes the information loss $D_{KL}(P \parallel P_G)$. Chow and Liu initiated such approximation with tree topologies in their seminal work [22]. They proved that the optimal approximation can be achieved via a tree topology that yields the maximum sum of mutual information. The result has been applied to finding the optimal tree topology for the construction of tree augmented Bayesian networks, TANs [23]. Unfortunately, using the KL-divergence measurement to compute the information loss due to topology approximation with $G$, and especially to find an optimal topology approximation, has not been successful on non-tree structures, in particular for KDB and its variants.
An optimal structure topology of k-TAN for classification tasks can be identified by minimizing the information loss due to topology approximation with k-TAN. Specifically, the optimal structure $G^{*}$ is computed with
$$G^{*} \;=\; \arg\min_{G} \; D_{KL}(P \parallel P_G) \qquad (6)$$
where $P$ is the unknown probability distribution over the feature variables $X$ and the label variable $Y$, and $P_G$ is the distribution enabled by the structure topology $G$ over the same set of variables in the k-TAN.
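A tiny numeric illustration (ours) of the divergence in (5): replacing a correlated joint distribution by its independence approximation incurs a strictly positive KL-divergence:

```python
# A tiny numeric check (ours) of the KL-divergence in (5), for a pair of binary
# variables: P is correlated, P_G is the independence approximation.
import math

def kl(p, q):
    """D_KL(P || Q) in nats for distributions given as dicts over the same keys."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)

P   = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}      # correlated pair
P_G = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}  # independence
print(kl(P, P_G))  # ~0.193 nats > 0: the simpler topology loses information
```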
We now give a derivation of an optimal k-tree topology of k-TAN. By (5), the goal of Equation (6) can be achieved by the minimization computation of the relative entropy as
$$\min_{G} \; -\sum_{\mathbf{x}, y} P(\mathbf{x}, y) \log P_G(\mathbf{x}, y) \qquad (7)$$
where pair $(\mathbf{x}, y)$ is a data sample for variables $X$ and $Y$. By conditional entropy, the expression in (7) can be computed exactly as
$$-\sum_{\mathbf{x}, y} P(\mathbf{x}, y) \log P_G(\mathbf{x}, y) \;=\; -\sum_{\mathbf{x}, y} P(\mathbf{x}, y) \log P(y) \;-\; \sum_{\mathbf{x}, y} P(\mathbf{x}, y) \sum_{i=1}^{n} \log P\bigl(x_i \mid x_{\pi(i)}, y\bigr)$$
The second term of the right-hand side of the above equation can be turned into
$$-\sum_{i=1}^{n} \sum_{\mathbf{x}, y} P(\mathbf{x}, y) \log \left( \frac{P\bigl(x_i \mid x_{\pi(i)}, y\bigr)}{P(x_i \mid y)} \cdot P(x_i \mid y) \right)$$
The last term can be decomposed into:
$$\begin{aligned} & -\sum_{i=1}^{n} \sum_{x_i,\, x_{\pi(i)},\, y} P\bigl(x_i, x_{\pi(i)}, y\bigr) \log \frac{P\bigl(x_i, x_{\pi(i)} \mid y\bigr)}{P(x_i \mid y)\, P\bigl(x_{\pi(i)} \mid y\bigr)} \;-\; \sum_{i=1}^{n} \sum_{x_i,\, y} P(x_i, y) \log P(x_i \mid y) \\ &= -\sum_{i=1}^{n} I\bigl(X_i; X_{\pi(i)} \mid Y\bigr) \;+\; \sum_{i=1}^{n} H(X_i \mid Y) \end{aligned}$$
The second-to-last equality holds because the probability $P(\mathbf{x}, y)$ is projected onto the variable $X_i$, resulting in $P(x_i, y)$ and omitting other variables due to the summation of their probabilities. It is true also because $P(\mathbf{x}, y)$ is likewise projected onto the variable $X_i$ and the variables $X_j$ for all $j \in \pi(i)$, resulting in $P(x_i, x_{\pi(i)}, y)$. The last equality is due to the definitions of entropy and mutual information.
Since the terms of entropies are independent of the choices of $G$, the minimization of the left-hand side is realized by the maximization of the sum $\sum_{i=1}^{n} I(X_i; X_{\pi(i)} \mid Y)$. Note that $I(X_i; X_{\pi(i)} \mid Y)$ is the mutual information between variable $X_i$ and its variable parent set $X_{\pi(i)}$ conditional upon class variable $Y$, where $\pi$ is drawn from the acyclic k-tree topology $G$ that is imposed on the k-TAN. Therefore, we conclude:
Theorem 1. The optimal structure topology G for feature variables in k-TAN can be determined by maximizing the sum $\sum_{i=1}^{n} I(X_i; X_{\pi(i)} \mid Y)$, where π represents the k-tree structure among feature variables in G.
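As a sketch of how the Theorem 1 objective can be evaluated on data (our code; the entropy-based identity $I(X; S \mid Y) = H(X, Y) + H(S \cup \{Y\}) - H(\{X\} \cup S \cup \{Y\}) - H(Y)$ is standard), the following scores a candidate parent assignment:

```python
# A sketch (ours) of the Theorem 1 objective: score a candidate k-tree topology
# by sum_i I(X_i; X_pi(i) | Y), computing set-conditional mutual information
# from empirical joint entropies.
import math
from collections import Counter

def entropy(D, cols):
    """Empirical joint entropy (nats) of the given columns of dataset D."""
    m = len(D)
    counts = Counter(tuple(row[c] for c in cols) for row in D)
    return -sum((n / m) * math.log(n / m) for n in counts.values())

def topology_score(D, parents):
    """parents maps feature column i -> list of its parent feature columns."""
    y = len(D[0]) - 1             # label is the last column
    score = 0.0
    for i, pi in parents.items():
        # I(X_i; X_pi | Y) = H(X_i,Y) + H(X_pi,Y) - H(X_i,X_pi,Y) - H(Y)
        score += (entropy(D, [i, y]) + entropy(D, pi + [y])
                  - entropy(D, [i] + pi + [y]) - entropy(D, [y]))
    return score

D = [[1, 1, 0, 1], [1, 0, 1, 1], [0, 0, 1, 0], [0, 1, 0, 0]]
print(topology_score(D, {1: [0], 2: [0, 1]}))  # score of one 2-tree topology
```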
5. Finding Maximum Spanning k-Tree
We show in this section that the maximization problem in information theoretics given in Theorem 1 has an equivalent problem in graph theoretics.
5.1. Maximum Spanning k-Tree
Definition 6. Let $G = (V, E)$ be a directed acyclic graph. A real-valued function $f: 2^{V} \to \mathbb{R}$ is a neighborhood function if, for any subset $\Delta \subseteq V$, $f(\Delta) \neq 0$ if and only if Δ is an acyclic clique.
Definition 7. Let $k \geq 1$ be a fixed integer and f be a neighborhood function. The Maximum Spanning k-Tree problem associated with f, denoted as MSkT, is: given a directed acyclic graph $G = (V, E)$, find a spanning acyclic k-tree H that maximizes the sum
$$W_f(H) \;=\; \sum_{v \in V} f\bigl(\Delta_H(v)\bigr)$$
where $\Delta_H(v)$ denotes the acyclic clique formed by v along with its parent vertices in H.
Note that problem MSkT is a generic problem that becomes specific when the neighborhood function $f$ is specified. MSkT generalizes the traditional maximum spanning tree problem, where the neighborhood function $f(\Delta_H(v))$ is simply $w(u, v)$, the weight between $v$ and its parent $u$. A further generalization is to consider $\Delta_H(v)$ to be a hyperedge shared by $v$ and its parent vertices and the neighborhood function $f(\Delta_H(v))$ to be the weight on the hyperedge. In particular, we can cast the maximization problem in Theorem 1 as the problem MSkT for the choice of the neighborhood function $f$ as follows:
$$f\bigl(\Delta_H(v)\bigr) \;=\; I\bigl(X_v; X_{\pi(v)} \mid Y\bigr)$$
where vertex $v$ denotes the random feature variable $X_v$ in the topology of the k-tree associated with the k-TAN being constructed and $\pi(v)$ is the set of $v$'s parent vertices.
We will show that the MSkT problem is intimately related to the known decision problem Spanning k-Tree, which answers the following question: given a graph, does it possess a subgraph that is a spanning k-tree? The Spanning k-Tree problem appears to be difficult to solve, as it has been proved NP-hard [31] even for fixed $k \geq 2$. We shall connect this intractable problem to MSkT for some specific functions $f$, and the connection will, unfortunately, carry the intractability over to the latter.
Definition 8. Let $k \geq 1$ be a fixed integer and f be a neighborhood function. The problem D-MSkT is a decision problem: given a directed acyclic graph $G$ and a real number threshold w, determine whether G contains a spanning k-tree H such that the value $W_f(H) \geq w$.
It is not difficult to see that, if we choose the neighborhood function $f$ to be the count of directed edges from the parent vertices of $v$ to $v$ in the k-tree $H$, then $W_f(H)$ counts all edges of $H$: the term for the initial k-clique is the total count of its edges, and each term $f(\Delta_H(v))$ is the count of the $k$ new edges due to the vertex $v$ introduced to an existing k-clique. Clearly the count is $\binom{k}{2} + (n-k)k$. Therefore, the input directed acyclic graph $G$ contains a spanning k-tree $H$ if and only if the value $W_f(H) = \binom{k}{2} + (n-k)k$. The above analysis yields a polynomial-time transformation from problem Spanning k-Tree to D-MSkT and thus a justification for the intractability of the latter. That is, there are (simple) neighborhood functions $f$ for which problem D-MSkT is NP-hard. This leads to the intractability of the MSkT problem as well.
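As a quick arithmetic check of this count (under our reconstruction that the k-tree grows from an initial k-clique, each of the $n-k$ subsequently introduced vertices contributing $k$ edges):

```latex
% Worked check of the edge count of a k-tree on n vertices:
\binom{k}{2} + (n-k)\,k \;=\; kn - \frac{k(k+1)}{2},
\qquad\text{e.g., } k = 2,\ n = 6:\quad 1 + 4 \cdot 2 \;=\; 9 \;=\; 2 \cdot 6 - 3 .
```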
Theorem 2. For each fixed value of $k \geq 2$, there are neighborhood weight functions f for which problem MSkT is NP-hard.
The above analyses and those from the previous section suggest that the problem MSkT (and thus the problem of finding an optimal structure topology for k-TAN as well) is computationally intractable even for fixed values of $k$. Nevertheless, we will show in the next section that, with a meaningful restriction on the topology structure of the k-tree, polynomial-time algorithms can be obtained for MSkT.
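To make the objective $W_f(H)$ concrete before restricting the problem, here is an exponential brute-force reference (ours) for $k = 2$ on an undirected view of the input; it enumerates spanning 2-trees by their construction orders, which is only feasible on tiny graphs and underscores why the restriction of the next subsection matters:

```python
# A brute-force reference (ours) for the MSkT objective with k = 2 on an
# undirected graph: f(v, a, b) scores introducing vertex v onto the existing
# 2-clique (edge) {a, b}. Exponential time; for validation on tiny inputs only.
from itertools import combinations

def max_spanning_2tree(vertices, edges, f):
    """Return the best total score over all spanning 2-trees of the graph,
    or None if no spanning 2-tree exists."""
    E = {frozenset(e) for e in edges}
    best = None

    def grow(placed, tree_edges, score):
        nonlocal best
        if len(placed) == len(vertices):
            best = score if best is None else max(best, score)
            return
        for v in vertices:
            if v in placed:
                continue
            for (a, b) in tree_edges:
                # v is introduced with respect to the existing 2-clique {a, b};
                # both connecting edges must be present in the input graph.
                if frozenset((v, a)) in E and frozenset((v, b)) in E:
                    grow(placed | {v},
                         tree_edges | {(v, a), (v, b)},
                         score + f(v, a, b))

    # The construction starts from any 2-clique (edge) of the input graph.
    for a, b in combinations(vertices, 2):
        if frozenset((a, b)) in E:
            grow({a, b}, {(a, b)}, 0.0)
    return best

# Sanity check: with f counting the 2 new edges per introduced vertex (plus the
# 1 initial edge), every spanning 2-tree of K4 has 2*4 - 3 = 5 edges.
V = [0, 1, 2, 3]
K4 = [(u, v) for u, v in combinations(V, 2)]
print(1 + max_spanning_2tree(V, K4, lambda v, a, b: 2))  # -> 5
```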
5.2. Backbone k-Trees
In this section, we focus on a specific type of k-tree whose topology imposes a linear order on the involved vertices. For such k-trees, we derive efficient algorithms for MSkT.
Definition 9. Let $G = (V, E)$ be a graph, where $V = \{v_1, v_2, \ldots, v_n\}$. If the edge set E contains a Hamiltonian path $v_1 v_2 \cdots v_n$, then G is called a backbone graph. We also call this Hamiltonian path the backbone, and each edge $(v_t, v_{t+1})$, $1 \leq t < n$, a backbone edge. If G is a k-tree, it is called a backbone k-tree.
Graphs with a built-in Hamiltonian path can be ideal as the structural topology of graphical models arising from meaningful scientific applications. The linearity underlies an important aspect of the relationships among random variables, upon which higher-order, more sophisticated relations are established. For example, the backbone graph topology is most suitable for modeling time-series events and processes, known significant pathways in gene networks, and biomolecular 3D structure, where the backbone of the molecule is coupled in a linear fashion.
We now introduce and prove a few properties associated with backbone k-trees. For this purpose, our discussion will be based on the tree-representation for k-trees in Definition 4. Since backbone edges are of special interest, a more elaborate definition of tree-representation for the backbone k-tree is needed.
Proposition 3. Let G be a backbone k-tree of size n and let interval $[1, n]$ denote the ordered consecutive positions of all vertices on the backbone. Let Δ be a $(k+1)$-clique in G with $p_1 < p_2 < \cdots < p_{k+1}$ being the positions of the vertices in Δ on the backbone in the assumed order. Then these positions partition the consecutive interval $[1, n]$ into at most $k+2$ non-empty, consecutive intervals as $[1, p_1], [p_1, p_2], \ldots, [p_{k+1}, n]$. Let $v \notin \Delta$. We denote with $B_{\Delta}(v)$ the subset of backbone edges formed by vertices whose positions on the backbone are in the same interval containing the position of v.
We now consider a new characterization for backbone k-trees. Specifically, we define the collection of partial backbone k-trees (abbreviated with pbkts) recursively as follows:
Definition 10. Let H be a graph of size $\geq k+1$, Δ be a $(k+1)$-clique in H, and α be a subset of backbone edges. Tuple $\langle H, \Delta, \alpha \rangle$ is a pbkt rooted at $(k+1)$-clique Δ, which retains backbone edges in α, if
- (1)
either $H = \Delta$ and $\alpha \subseteq E(\Delta)$;
- (2)
or $H = H_1 \cup H_2$, where $\langle H_1, \Delta, \beta \rangle$ and $\langle H_2, \Delta[u \to v], \gamma \rangle$ are pbkts rooted at $(k+1)$-cliques Δ and $\Delta[u \to v]$ and retain backbone edges in β and γ, respectively, such that for some vertices $u \in \Delta$ and v that occurs in α but not in Δ, $\alpha = \beta \cup \gamma$ and $\beta \cap \gamma = \emptyset$.
$\Delta[u \to v]$ denotes the $(k+1)$-clique derived from Δ, such that u is removed and replaced by v along with new edges added on.
Note that the term partial k-tree has been used for subgraphs of k-trees [35]. Since a subgraph of a backbone k-tree does not necessarily contain all the backbone edges, a pbkt is defined with an associated subset of backbone edges. The use of a $(k+1)$-clique as the root is technical; it makes it feasible to view backbone k-trees from the perspective of a relation among $(k+1)$-cliques.
The atomic case of a pbkt is simply a $(k+1)$-clique that only retains backbone edges that are already on the clique. Furthermore, it is not difficult to see that, if G is a (full) backbone k-tree, there is a $(k+1)$-clique Δ in G to serve as the root such that tuple $\langle G, \Delta, \alpha \rangle$ is a pbkt with α being the complete set of backbone edges.
We now examine two properties of backbone k-trees important to developing efficient algorithms that find the maximum backbone k-tree.
Lemma 1. Let $k \geq 1$ and $G = (V, E)$ be a backbone k-tree. Then, any $(k+1)$-clique Δ in G separates G into at most $k+2$ connected components.
Proof. By Proposition 3, the positions of the vertices in Δ partition the consecutive interval $[1, n]$ into at most $k+2$ consecutive intervals. We claim that any set of vertices whose positions belong to the same non-empty interval, say $[p_i, p_{i+1}]$, form at most one connected component separated by Δ. Suppose otherwise that there are two connected components $C_1$ and $C_2$ separated by Δ and there exists an integer $t$, where $p_i \leq t < t+1 \leq p_{i+1}$, such that $v_t \in C_1$ and $v_{t+1} \in C_2$. Since backbone edge $(v_t, v_{t+1})$ belongs to neither $C_1$ nor $C_2$, there would have to be another component that contains both $v_t$ and $v_{t+1}$, which cannot exist, leading to a contradiction. Based on the above analysis, the number of connected components in the backbone k-tree $G$ separated by the $(k+1)$-clique is at most $k+2$. (See Figure 3 for an illustration of the above argument). □
The notation $\Delta[u \to v]$ used in Definition 10 underlies how one $(k+1)$-clique comes to exist based on another. We give a formal term to this relationship between two $(k+1)$-cliques.
Definition 11. Let Δ be a $(k+1)$-clique and $\Delta' = \Delta[u \to v]$ for some $u \in \Delta$ and $v \notin \Delta$. $\Delta'$ is said to be a neighbor of Δ.
Lemma 2. Let Δ be a $(k+1)$-clique in a backbone k-tree, which has two different neighbors $\Delta_1 = \Delta[u_1 \to v_1]$ and $\Delta_2 = \Delta[u_2 \to v_2]$. If $v_1 \neq v_2$ and $v_1, v_2 \notin \Delta$, then the positions of $v_1$ and $v_2$ on the backbone belong to two different intervals separated by the positions of the vertices in Δ.
Proof. According to Definition 4, $\Delta_1$ and $\Delta_2$ should be the roots of two different pbkts that lie in two different connected components separated by Δ, to which $v_1$ and $v_2$ belong, respectively. By the proof of Lemma 1, the positions of $v_1$ and $v_2$ on the backbone should belong to two different consecutive intervals separated by the positions of the vertices in Δ. □
5.3. A Polynomial-Time Algorithm
We now present an efficient algorithm for solving the MSkT problem where the desired maximum spanning k-tree should contain a certain backbone (i.e., Hamiltonian path) designated in the input graph. We only consider neighborhood functions $f$ that can be computed in time $O(g(k))$ on any $(k+1)$-clique, for some computable function $g$. The properties demonstrated earlier will facilitate the following discussion of the algorithm.
The problem MSBkT, tailored from MSkT for backbone k-trees, is defined as follows: given a directed acyclic graph $G = (V, E)$ with a designated backbone $B$, find a spanning directed acyclic k-tree $H$ in graph $G$ such that $H$ retains $B$, and the following sum is maximized:
$$W_f(H) \;=\; \sum_{v \in V} f\bigl(\Delta_H(v)\bigr)$$
By default, we assume the designated backbone $B$ in the input graph can always be denoted as $v_1 v_2 \cdots v_n$ by a simple relabeling of the vertices of the graph.
To facilitate the discussion, we denote with $E(\Delta)$ the set of edges formed by vertices in clique Δ. Also recall the notation $B_{\Delta}(v)$ introduced earlier, which represents the subset of backbone edges formed by vertices whose positions on the backbone are in the same interval containing the position of $v$.
Theorem 3. Problem MSBkT can be solved in time $O\bigl(2^{k+2}(k+1)\,g(k)\,n^{k+2}\bigr)$ on input directed, acyclic graphs of n vertices.
Proof. The algorithm finds an optimal spanning backbone k-tree based on the tree-representation of k-trees in Definition 4, which offers a recursive construction of pbkts in the input graph $G$. We define function $S(\Delta, \alpha)$ to be the maximum sum of neighborhood function $f$ values over all $(k+1)$-cliques in a pbkt that is rooted at Δ and retains all backbone edges in α. Then
$$S(\Delta, \alpha) \;=\; \begin{cases} f(\Delta), & \text{if } \alpha \subseteq E(\Delta); \\[4pt] \max\limits_{u \in \Delta,\; v \notin \Delta} \bigl\{ S(\Delta, \beta) + S(\Delta[u \to v], \gamma) \bigr\}, & \text{where } \gamma = \alpha \cap B_{\Delta}(v) \text{ and } \beta = \alpha \setminus \gamma \end{cases}$$
which strictly follows the rules for pbkt given in Definition 10.
Function $S(\Delta, \alpha)$ maximizes its value over all pbkts rooted at Δ that retain the backbone edges in set α. Then a desired spanning backbone k-tree is one of the pbkts rooted at some $(k+1)$-clique Δ, which maximizes the function value $S(\Delta, \alpha)$ with α being the complete set of backbone edges.
The recurrence relation for function $S$ makes it computable with a dynamic programming process. For every pbkt that is simply a $(k+1)$-clique Δ, the value $S(\Delta, \alpha)$ is the neighborhood function value $f(\Delta)$. For every more general case of pbkt, the computation maximizes over all possible vertex pairs $u$ and $v$ that are used to split the pbkt into two subcases of pbkts, one rooted at Δ and another at $\Delta[u \to v]$. The corresponding subset of backbone edges that needs to be retained is also split into two. One subset is $\gamma = \alpha \cap B_{\Delta}(v)$, containing those backbone edges whose positions are in the same interval as the position of vertex $v$. The other one is the rest of the backbone edges, $\beta = \alpha \setminus \gamma$. This is because, by the proof of Lemma 1, the backbone edges whose positions are in the same interval as the position of vertex $v$ belong to the same connected component as $v$.
For the maximization operation, there are $k+1$ choices of vertices in Δ for $u$ and at most $n$ options for $v$. A table of $O(n^{k+1} \cdot m)$ entries can be used for storing the values of function $S(\Delta, \alpha)$, where $m$ is the number of backbone edge subsets that may arise. By Lemma 2, the positions of all backbone edges are partitioned into at most $k+2$ consecutive intervals, and those with positions belonging to the same interval should belong to the same connected pbkt. Consequently, α, β, and γ can each be encoded with a bitmap of length $k+2$, so $m \leq 2^{k+2}$. The overall time used by the algorithm is thus $O\bigl(2^{k+2}(k+1)\,g(k)\,n^{k+2}\bigr)$. □
Corollary 1. For every fixed $k \geq 1$, problem MSBkT can be solved in time $O(n^{k+2})$ on input directed, acyclic graphs of n vertices.
To connect the problem MSBkT to the problem of finding an optimal structure topology for k-TAN that retains a designated Hamiltonian path, the neighborhood function is chosen as $f(\Delta_H(v)) = I(X_v; X_{\pi(v)} \mid Y)$, the mutual information between variable $X_v$ and its parent set variables in the acyclic directed $(k+1)$-clique, where $v$ is the vertex introduced to the clique formed by its parent vertices $\pi(v)$. The function $f$ can be computed in time bounded by a function of $k$.
In addition, an earlier investigation [36] showed that the dynamic programming for computing functions around the backbone k-tree can be approached in a slightly different way, with the argument Δ being a k-clique instead of a $(k+1)$-clique, yielding a time complexity of $O(n^{k+1})$ instead of $O(n^{k+2})$ for every fixed $k$.
Theorem 4. For every fixed $k \geq 1$, the optimal structure topology for feature variables in k-TAN can be determined in time $O(n^{k+1})$, provided that the structure retains a predefined Hamiltonian path in the input graph of n vertices.
Corollary 2. For every fixed $k \geq 1$, there are polynomial-time algorithms for the minimum loss of information approximation of Bayesian networks with a conditional k-tree topology that retains a designated Hamiltonian path in the input network.
6. Conclusions
Bayesian networks offer a structured framework for modeling dependencies among features and class labels, making them invaluable in classification tasks. Generally, there are up to $(n-1)$-order dependencies among features, where $n$ is the total number of features, which need to be considered. Thus, the complexity of acquiring a complete Bayesian network may become intractable with an increasing number of features. Naïve Bayes assumes no dependencies among features given the class label, which is often unrealistic in real-world scenarios. To address this issue, tree augmented Naïve Bayes (TAN) was introduced to capture the most important dependencies between features by constructing a simple topology, a tree. Therefore, TAN is notably preferred for probabilistic graph approximations. However, more complex topologies such as KDB are required for situations where features strongly influence each other. Previous research on k-dependency Bayesian classifiers has largely depended on heuristic algorithms that do not guarantee an optimal topology for the approximation.
We introduced an augmented Naïve Bayes classifier, k-TAN, of a k-dependency topology where the structure among feature variables can be represented as a k-tree graph. We demonstrated that when the k-tree is a maximum spanning k-tree, it provides the optimal approximation of the Bayesian network and ensures minimum information loss under the KL-divergence metric. The edge weights represent the mutual information between feature variables conditioned on the class label. Although finding the maximum spanning k-tree is an intractable task even for fixed $k \geq 2$, this paper presents a polynomial-time approach to solve a constrained case of the problem. That is, if the maximum spanning k-tree is required to retain a designated Hamiltonian path in the graph topology, the maximum spanning k-tree problem can be solved in time $O(n^{k+1})$ for every fixed $k$. Therefore, employing this algorithm ensures the efficient construction of k-TAN classifiers with minimal information loss. The topology constraint of retaining a Hamiltonian path has a wide spectrum of applications with Bayesian networks.
Our final note is that the proposed k-TAN method may be incorporated with some recently developed extensions to Naïve Bayes classifiers, which are orthogonal to the structural topology extension of Naïve Bayes. This includes, for example, a method in which the integration of a latent component into the Naïve Bayes classifier can account for hidden dependencies among attributes in domains such as health care [9], where data complexity and inter-feature correlations are common. Another work investigates relationships between computing the conditional log likelihood and the marginal likelihood under exact and approximate learning settings [37]. Moreover, rather than assuming all features are conditionally independent, the Comonotonic-Inference-Based classifier (CIBer) [38] identifies an optimal partition of features, grouping those that exhibit strong dependencies.