Article

TAN-FGBMLE: Tree-Augmented Naive Bayes Structure Learning Based on Fast Generative Bootstrap Maximum Likelihood Estimation for Continuous-Variable Classification

1 School of Computer Science, Hubei University of Technology, Wuhan 430068, China
2 Hubei Provincial Key Laboratory of Green Intelligent Computing Power Network, Hubei University of Technology, Wuhan 430068, China
* Author to whom correspondence should be addressed.
Entropy 2025, 27(12), 1216; https://doi.org/10.3390/e27121216
Submission received: 31 August 2025 / Revised: 25 November 2025 / Accepted: 26 November 2025 / Published: 28 November 2025

Abstract

Tree-Augmented Naive Bayes (TAN) is an interpretable graphical classification model. However, its structure learning for continuous attributes depends on class-conditional mutual information, which is sensitive to the quality of one- and two-dimensional density estimation. Accurate estimation is challenging under complex distributions such as multimodal, long-tailed, and heteroscedastic cases. To address this issue, we propose a structure learning method for TAN based on Fast Generative Bootstrap Maximum Likelihood Estimation (TAN-FGBMLE). FGBMLE consists of two stages. In the first stage, resampling weights and random noise are fed into a neural network generator to rapidly produce candidate parameters, efficiently covering the latent density space without repeated independent optimization. In the second stage, optimal mixture weights are estimated by maximum likelihood, assigning an appropriate contribution to each candidate component. This design enables fast and accurate estimation of complex single-attribute and joint densities, providing a reliable basis for computing class-conditional mutual information. The TAN structure is then constructed using Prim's maximum spanning tree algorithm. Experiments show that our estimator attains higher fitting accuracy and lower runtime than traditional nonparametric estimators. On open-source datasets, TAN-FGBMLE achieves superior accuracy and recall compared with classic methods, demonstrating good robustness and interpretability. On publicly available real air quality data, it achieves high classification accuracy and produces graph structures that more faithfully capture dependencies among continuous attributes.

1. Introduction

Bayesian approaches are crucial in statistics and machine learning, offering a cohesive theoretical structure for modeling and inferring uncertainty [1]. They have been widely applied in fields such as engineering monitoring [2], control theory [3], and medical diagnosis [4]. Their core concept is to update beliefs about underlying parameters or structures by combining prior knowledge with data evidence, thereby enabling systematic reasoning from data. The emergence of probabilistic graphical models has provided a powerful tool for discovering causal or dependent relationships in complex systems. By representing conditional independence among variables with graph structures, probabilistic graphical models allow complex high-dimensional distributions to be modeled and inferred efficiently.
The Naive Bayes (NB) model has achieved great success in practice due to its simplicity and efficiency [5]. However, NB relies on a strict conditional independence assumption, namely that features are independent of each other given the class label. While this assumption greatly simplifies computation, it rarely holds in real-world datasets, where dependencies between features are common. Ignoring these dependencies can lead to classifier bias, insufficient expressiveness, and poor predictive performance [6]. To address this limitation, researchers proposed the TAN model [7]. While preserving NB's simplicity, TAN encodes attribute interdependencies in a tree structure, thereby balancing computational feasibility with model accuracy. As a theoretical extension of NB, TAN exhibits notable benefits in areas such as text classification, genomics, and medical diagnosis.
The core of TAN lies in learning dependency structures, and the key step is estimating class-conditional mutual information (CMI) [8]. Mutual information quantifies the strength of the statistical dependencies between feature pairs and directly determines the selection of edges in the TAN structure. Large biases in CMI estimation can lead to incorrect representation of dependencies, compromising classification accuracy and interpretability. Therefore, achieving stable and accurate CMI estimation is a primary challenge, especially when attributes are continuous [9].
The calculation of mutual information for continuous variables relies on one- and two-dimensional class-conditional density estimation. However, real-world data distributions are often far from simple unimodal, symmetric cases; they may exhibit multimodal structure, long tails, heteroskedasticity, and severe skewness. These complexities present substantial obstacles to traditional density estimation methods [10]. Kernel density estimation (KDE) is nonparametric and consistent in low dimensions [11], but suffers from the curse of dimensionality and local overfitting when the data distribution is complex. Finite Gaussian Mixture Models (GMMs) can approximate complex distributions to some extent, but they are sensitive to the choice of the number of mixture components and adapt poorly to non-Gaussian distributions [12].
To overcome these shortcomings, hybrid approaches combining bootstrapping with generative modeling have gained increasing attention. Bootstrapping, a classic resampling technique, effectively captures sample uncertainty and improves estimation robustness [13]. Generative models excel at capturing the complex structure of the underlying distribution. Combining these two approaches could potentially achieve a balance between computational efficiency and estimation accuracy, overcoming the limitations of traditional methods. However, such approaches come with very high computational cost. Every resampling step requires a complete optimization of the likelihood function. In high-dimensional or large-sample problems, this repeated optimization becomes prohibitively expensive, making traditional bootstrap methods inefficient for practical applications.
To alleviate this computational burden, we propose a computational strategy based on the neural network generative process, called the Fast Generative Bootstrap Maximum Likelihood Estimator (FGBMLE). This method significantly reduces the computational cost and theoretically retains the expressive power and flexibility of bootstrap resampling. Our main contributions are as follows:
  • This method constructs an optimization framework based on a neural network generator, thereby avoiding the high computational cost of repeatedly optimizing weight combinations in the traditional bootstrap. Instead of relying solely on resampling weights, the generator also incorporates an additional source of randomness. The combination of resampling information and stochastic perturbations allows the model to capture the essential representation of the optimization problem. By leveraging the expressive capacity of neural networks, this strategy enables the generator to flexibly adapt to complex distributional characteristics, substantially improving both the efficiency and accuracy of density estimation.
  • Unlike traditional bootstrap methods, which require a complete optimization of the likelihood function at each resampling step, our approach condenses this repetitive and costly procedure into a single, efficient computation. This paper proposes a novel two-stage algorithm for the FGBMLE estimation process. In the first stage, the neural generator rapidly produces a set of candidate parameters that cover the potential distribution space by leveraging resampling information and additional randomness. In the second stage, maximum likelihood estimation is performed on this limited set to obtain the optimal configuration of mixture weights.
  • The proposed FGBMLE is applied to TAN structure learning. Compared with traditional KDE and GMMs, FGBMLE has greater adaptability and stability in univariate and bivariate density estimation, enabling more reliable calculation of class-conditional mutual information, and thus the optimized TAN obtains more reasonable dependency structures. It significantly improves the stability of mutual information estimation and the reliability of structure learning.
In this work, the FGBMLE is proposed as a general two-stage framework for flexible density estimation and latent mixture modeling. The TAN structure learning is then chosen as a representative and practically important application of this framework, because it requires accurate class-conditional marginals and bivariate densities as core building blocks. Throughout the paper, we first develop and validate FGBMLE as a general estimator on synthetic density estimation tasks, and then instantiate it within TAN to obtain the proposed TAN-FGBMLE classifier. In this sense, TAN is not the sole goal of the method, but rather a concrete probabilistic graphical model that allows us to demonstrate how FGBMLE can be embedded into structure learning and classification.
The paper is organized as follows. Section 2 reviews the necessary preliminaries and related work. Section 3 describes the complete structure learning process of the TAN-FGBMLE method. Section 4 presents a range of experiments evaluating our method. Finally, Section 5 concludes the paper by summarizing our work and outlining potential future directions.

2. Preliminaries

2.1. Latent Mixture Models and Semiparametric Estimation

Mixture models have long been a fundamental tool in statistics and machine learning for capturing heterogeneous or multimodal distributions. They provide a flexible framework for representing samples generated by latent variables and have been extensively studied in both parametric and nonparametric settings.
To illustrate this framework, we use a latent mixture model to describe the distribution of the observed data. Suppose the observations $Y = (y_1, y_2, \ldots, y_n) \in \mathbb{R}^n$ come from a latent mixture model: the generation of each $y_i$ is governed by an unobserved latent variable $\theta_i$, and these latent variables follow an unknown mixing density $\pi(\theta)$.
The latent variable $\theta_i$ is independently sampled from the mixing density $\pi(\theta)$. Then, given $\theta_i$, the observation $y_i$ is generated according to the known observation model $f(y_i \mid \theta_i)$. The generative process can be expressed as

$$\theta_i \sim \pi(\theta), \qquad y_i \mid \theta_i \sim f(y_i \mid \theta_i), \qquad i \in \{1, 2, \ldots, n\}. \tag{1}$$

Here, the observation model $f(y_i \mid \theta_i)$ is a known conditional probability distribution [14], such as a normal, Poisson, or Gamma distribution, and $\pi(\theta)$ is the target function we want to estimate. Integrating out the latent variable $\theta_i$ yields the marginal distribution of $y_i$:

$$y_i \sim \int f(y_i \mid \theta_i)\, d\pi(\theta_i), \qquad i \in \{1, 2, \ldots, n\}. \tag{2}$$

This formula describes a typical mixture model structure. Each observation $y_i$ is generated not by a fixed parameter, but by a family of candidate models $f(y_i \mid \theta_i)$ mixed according to $\pi(\theta)$. Thus $\pi(\theta)$ describes the diversity of the latent structure; it is the core object of our subsequent statistical inference and modeling.
To estimate the mixture density π ( θ ) , we use semiparametric maximum likelihood estimation. Unlike classical mixture models, which assume a parameterized form with a finite number of components, we do not make a priori assumptions about the specific structure of π . Instead, we directly search for the optimal solution among the set of all probability density functions. We estimate π by maximizing the log-marginal likelihood function of the observed samples.
$$\hat{\pi} = \arg\max_{\pi \in \Pi} \log p(y_1, y_2, \ldots, y_n; \pi) = \arg\max_{\pi \in \Pi} \sum_{i=1}^{n} \log \int_{\Theta} f(y_i \mid \theta_i)\, d\pi(\theta_i). \tag{3}$$
The symbol $\Pi$ denotes the set of all probability measures defined on the parameter space $\Theta$. This optimization does not rely on any specific parameterization and therefore provides a high degree of model flexibility. Under reasonable constraints, the solution $\hat{\pi}$ of (3) exists and is unique, whether it is discrete or continuous. Moreover, under finite sample conditions, maximum likelihood estimation tends to construct a solution with finite support that maximizes the log-likelihood. Although the true mixing density $\pi$ may be continuous, the solution $\hat{\pi}$ is almost always a discrete distribution supported on at most n points, which adapts well to the log-likelihood optimization problem for finite samples. Many numerical optimization algorithms can be used to solve this maximization problem, such as the EM algorithm [15,16] and variational inference [17,18].
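To make this concrete, the following minimal sketch computes the grid-constrained NPMLE of Equation (3) with EM fixed-point updates, assuming a Gaussian observation model; the grid, bandwidth, and data below are illustrative assumptions rather than the paper's setup:

```python
import numpy as np
from scipy.stats import norm

def npmle_em(y, grid, sigma=1.0, n_iter=200):
    # Discrete NPMLE of pi(theta) on a fixed support grid, assuming
    # f(y | theta) = N(theta, sigma^2). Illustrative sketch only.
    L = norm.pdf(y[:, None], loc=grid[None, :], scale=sigma)  # L[i,k] = f(y_i | theta_k)
    pi = np.full(len(grid), 1.0 / len(grid))    # uniform initialization
    for _ in range(n_iter):
        mix = L @ pi                            # marginal density at each y_i
        resp = (L * pi) / mix[:, None]          # posterior responsibilities
        pi = resp.mean(axis=0)                  # EM update of the mixture weights
    return pi                                   # typically sparse: at most n atoms

y = np.concatenate([np.random.normal(-3, 1, 500), np.random.normal(3, 1, 500)])
pi_hat = npmle_em(y, np.linspace(-8, 8, 200))
```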
When the underlying distribution is reasonably believed to be continuous, the discrete estimator is still consistent and enjoys good theoretical optimality, but its discrete nature can substantially degrade practical estimation quality. Because the discrete solution places all probability mass on a finite number of support points, it cannot capture the smooth transitions and local detail inherent in a continuous distribution.
Researchers have proposed related smoothing variants [19], which attempt to bring the estimate closer to the true continuous distribution by introducing additional smoothing mechanisms. Techniques such as bandwidth adjustment [20], roughness penalties [21], and spline fitting [22] are widely used. However, these smoothing techniques usually require careful parameter tuning, and the quality of parameter selection directly affects the accuracy and stability of the estimate. A bandwidth that is too small cannot remove the spiky, discrete characteristics, while a bandwidth that is too large causes oversmoothing. In addition, the parameter dependence and computational complexity of different methods pose challenges on large-scale datasets.
In statistical inference and model estimation, bootstrapping has proven to be an effective tool for simulating continuous densities from prior and posterior distributions [23]. The bootstrap repeatedly resamples the data to generate multiple subsets and performs statistical inference on each subset, introducing perturbations to the observed data. Repeated estimates of $\pi$ based on these datasets eventually converge to a more stable and smoother approximation of $\pi$. A natural alternative is to replace resampling with random weights $w = (w_1, w_2, \ldots, w_n) \in \mathbb{R}^n$, forming a weighted bootstrap in which the importance of each sample point $y_i$ is determined by the weight $w_i$. The corresponding bootstrap version of the log-likelihood maximization is
$$\hat{\pi}_w = \arg\max_{\pi \in \Pi} \sum_{i=1}^{n} w_i \cdot \log \int_{\Theta} f(y_i \mid \theta_i)\, d\pi(\theta_i). \tag{4}$$
By repeatedly sampling different weight vectors $w^{(1)}, \ldots, w^{(B)}$ and estimating, we obtain mixture density estimators $\hat{\pi}_{w^{(1)}}, \ldots, \hat{\pi}_{w^{(B)}}$, which can be smoothly integrated into an approximation of a continuous mixing density:
$$\bar{\pi}(\theta) = \frac{1}{B} \sum_{b=1}^{B} \hat{\pi}_{w^{(b)}}(\theta). \tag{5}$$
With bootstrap weights introduced, the choice of weight distribution has a significant impact on estimation performance and computational characteristics. There are two classic choices. One is the multinomial distribution, $w \sim \mathrm{Multinomial}\big(n, (\tfrac{1}{n}, \ldots, \tfrac{1}{n})\big)$, which corresponds to the traditional nonparametric bootstrap. The other is the Dirichlet distribution, $w \sim n \times \mathrm{Dirichlet}(I_n)$, which corresponds to the weighted likelihood bootstrap.
Under the multinomial scheme, the weight $w_i$ is a non-negative integer counting how many times the i-th observation is drawn in resampling with replacement, and the total weight is exactly n. This approach is intuitive and fully consistent with the classic bootstrap. However, because the weights are discrete integers, some observations in a given sample easily receive zero weight, and the discrete weights make the subsequent objective function less smooth, affecting the efficiency and stability of gradient-based optimization.
Under the Dirichlet scheme, each $w_i$ is a continuous positive real number with $\sum_{i=1}^{n} w_i = n$. Every sample receives a distinct weight, which improves the diversity and consistency of bootstrap samples, and the continuous weights make optimization of the weighted likelihood smoother, facilitating stable training of deep learning components such as neural network generators. We therefore generate weights from the Dirichlet distribution. This choice not only enhances training stability but also yields more efficient and scalable mixture estimation in practice; a minimal sketch contrasting the two weighting schemes is given below.
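The contrast between the two weighting schemes can be seen in a few lines of NumPy (an illustrative sketch; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Multinomial weights (classical nonparametric bootstrap): non-negative
# integers summing to n; a large fraction of entries is exactly zero.
w_multi = rng.multinomial(n, np.full(n, 1.0 / n))

# Dirichlet weights (weighted likelihood bootstrap): continuous positive
# reals rescaled so that sum(w) = n, as used throughout this paper.
w_dir = n * rng.dirichlet(np.ones(n))

print((w_multi == 0).mean())      # roughly exp(-1) ~ 0.37 zero weights
print(w_dir.min(), w_dir.sum())   # strictly positive, total exactly n
```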
A cost issue always exists in the computation [24]. The optimization process often requires high-dimensional integrals that cannot be solved analytically and must be approximated numerically. If the data must be resampled many times and the sample size n is large, the computational and memory costs grow rapidly, making the bootstrap difficult to apply to large-scale datasets. Moreover, bootstrapping introduces weight randomness to approximate a smooth distribution, so the results may still not be smooth enough [25]. By the Kiefer–Wolfowitz theorem [26], each bootstrap estimate is a distribution with at most n support points, so even after multiple resampling rounds the final estimate remains a combination of discrete distributions [27]. Although this combination can approximate the true distribution in some cases, it usually cannot provide sufficient smoothness. For high-dimensional data or complex distributions, the discrete estimate often exhibits local 'spike' artifacts and lacks the smooth transitions of the true distribution, causing the model estimate to deviate from the continuity of the truth.
Latent mixture models and their semiparametric maximum likelihood estimators provide a flexible foundation for density estimation. Bootstrap-based variants further introduce smoothing effects and enhance robustness, yet they inevitably suffer from high computational costs and a lack of smoothness due to their discrete nature. These limitations highlight the need for an efficient and expressive approach. In this work, we address these challenges by introducing a fast generative bootstrap framework that retains the flexibility of MLE while overcoming the computational and smoothness issues.

2.2. Bayesian Networks and TAN Model

Bayesian networks (BNs) are probabilistic graphical models that combine probability theory with graph theory [28]. Their structure is represented by a directed acyclic graph (DAG), where nodes denote random variables and edges capture conditional dependencies. The key idea of BNs is to decompose the joint probability distribution according to the graphical structure, enabling efficient modeling and inference in high-dimensional scenarios. If the network consists of d nodes $X_1, \ldots, X_d$, the joint distribution factorizes as
$$p(y_1, \ldots, y_d) = \prod_{i=1}^{d} p\big(y_i \mid \mathrm{Pa}(X_i)\big), \tag{6}$$
where $\mathrm{Pa}(X_i)$ denotes the parent set of node $X_i$. This factorization relies on conditional independence assumptions, which allow a complex global distribution to be expressed through local conditional distributions. Learning a Bayesian network typically involves two tasks: structure learning and parameter learning. Structure learning aims to determine the network structure that reflects the dependency relations among variables, while parameter learning estimates the conditional probability distributions $p(y_i \mid \mathrm{Pa}(X_i))$ given the structure [29]. The search space grows super-exponentially with the number of variables. To tackle this, researchers have proposed a range of methods, including score-based search, constraint-based independence tests, and hybrid approaches [30]. These methods balance expressive power and computational feasibility, making BNs widely applicable in practice.
NB assumes conditional independence among features, which ensures high efficiency but may degrade performance because the assumption is unrealistic. To overcome this, the TAN model was proposed. It relaxes the independence assumption by allowing each feature to depend not only on the class label c but also on at most one other feature, forming a tree-structured dependency graph. The class-conditional factorization is
$$p(y_1, \ldots, y_d \mid c) = \prod_{i=1}^{d} p\big(y_i \mid \mathrm{Pa}(X_i), c\big). \tag{7}$$
The critical challenge in TAN learning lies in selecting an appropriate tree structure [31]. A common strategy is to use CMI to measure the dependency between features.
$$I(X_i; X_j \mid C) = \sum_{c} p(c) \iint p(y_i, y_j \mid c) \log \frac{p(y_i, y_j \mid c)}{p(y_i \mid c)\, p(y_j \mid c)}\, dy_i\, dy_j. \tag{8}$$
The above equation defines the CMI. The larger the CMI, the stronger the conditional dependency between two features, and thus the higher the priority for establishing an edge in the TAN tree. A maximum spanning tree (MST) algorithm is applied to construct the dependency graph, which is then combined with the class node to form the complete TAN model [32]. This procedure enables TAN to retain the computational simplicity of NB while capturing essential feature dependencies, improving predictive performance and interpretability. Nevertheless, the accuracy of TAN heavily relies on the reliability of CMI estimation, especially for continuous variables, where poor density estimation can degrade the learned structure. To address this issue, the proposed FGBMLE framework provides high-quality density estimates, ensuring robust and accurate TAN structure learning.

3. Structural Learning Process Based on the TAN-FGBMLE

3.1. FGBMLE Two-Stage Algorithm

Generative learning has made significant progress with models such as variational autoencoders (VAEs) [33] and generative adversarial networks (GANs) [34], which have demonstrated powerful capabilities in unsupervised learning, image generation, and high-dimensional data modeling. As shown in Figure 1, VAEs learn the data generation mechanism in a latent variable space z by jointly training an encoder $q(z \mid x)$ and a decoder $p(x \mid z)$. The optimization goal is to maximize the log-marginal likelihood $\log p(x)$ of the observed data by optimizing the evidence lower bound:
$$\log p(x) \ge \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] - \mathrm{KL}\big(q(z \mid x)\, \|\, p(z)\big). \tag{9}$$
Here, $p(z)$ is a preset prior distribution of the latent variable, typically a standard normal. The encoder learns to encode the input sample x into a latent representation z; the decoder reconstructs a sample $\hat{x}$ from z, and the reconstruction error is jointly optimized with the Kullback–Leibler (KL) divergence to capture the underlying structure of the data. Through this latent-variable learning and sampling mechanism, VAEs can effectively approximate complex distributions of observed data in the latent space.
This variational lower bound (ELBO) serves as a conceptual foundation for our proposed FGBMLE framework. While we do not directly optimize this bound, it inspires the design of our generator-based likelihood estimation: instead of explicitly computing the variational posterior $q(z \mid x)$, the generator $G(w, z)$ implicitly learns to produce latent representations $\theta$ that maximize the weighted likelihood under bootstrap perturbations. Hence, the ELBO highlights the connection between traditional variational inference and our implicit generative estimation approach.
Inspired by the general philosophy of deep generative models such as VAEs, we rethink the estimation of the latent mixture distribution $\pi(\theta)$. The central idea is to construct a neural mapping from an easy-to-sample distribution to a complex target distribution, without requiring an explicit parametric form for $\pi(\theta)$. This implicit generative formulation provides high flexibility and strong expressive power in high-dimensional latent spaces. A generator can learn structural patterns of the latent distribution directly from data-driven signals, thus enabling mixture modeling without explicit density assumptions.
In the traditional MLE framework, the latent distribution $\pi(\theta)$ is typically estimated by maximizing the log-likelihood, which relies on density estimation of the sample data, especially when multiple rounds of bootstrap sampling are involved. We introduce a parameterized neural network generator $G(w, z)$, where w represents the bootstrap weights and z is random noise sampled from a base distribution; the network outputs a latent variable $\theta = G(w, z)$, implementing a nonlinear generative map from bootstrap weights to the space of the mixing distribution. The method essentially mimics the sample-generation behavior of the bootstrap under different weight configurations, but instead of solving a separate optimization problem each time, a unified estimation function is obtained through end-to-end training. For a given weight w, we want the latent variables output by the generator to achieve as high a weighted marginal likelihood as possible on the observed data. Because the generator output is random, the noise z must be averaged out [35] to obtain the expected likelihood of the observations under the generated latent variables, and an expectation over the weights w captures the overall performance under different weight perturbations. The training objective of the generator is therefore
$$\hat{G} = \arg\max_{G}\; \mathbb{E}_{w}\left[\sum_{i=1}^{n} w_i \cdot \log \mathbb{E}_{z}\, f\big(y_i \mid G(w, z)\big)\right]. \tag{10}$$
The outer expectation is taken over the bootstrap weights w, while the inner expectation represents the likelihood of the generator's output samples under the observation model $f(y_i \mid \theta_i)$ for given weights. In this objective, the inner expectation $\mathbb{E}_z$ integrates out the uncertainty in latent variable generation caused by the noise, ensuring that the generator's output has good average likelihood on the observed data across noise draws. The outer expectation $\mathbb{E}_w$ enforces robustness over different bootstrap weightings of the dataset. By jointly optimizing these two layers of expectations, the generator G learns to quickly produce latent parameter samples, avoiding the computational overhead of re-solving the problem for each bootstrap draw as in traditional methods.
Although the initial training phase of the generator introduces a higher computational load due to the end-to-end optimization process, this cost is incurred only once. Once trained, the generator can directly produce latent samples θ = G ( w , z ) without solving separate optimization problems for each bootstrap weight configuration, resulting in a substantial reduction in inference time. In practice, the initialization stage typically takes about 1.2–1.5 times the cost of a single bootstrap optimization run, but subsequent inference is accelerated by more than an order of magnitude. Thus, the overall computational trade-off favors the FGBMLE approach, as the amortized inference efficiency significantly outweighs the one-time training overhead.
With this strategy, we transform the repeated-optimization bootstrap process into a one-time generator training, significantly reducing the overall computational burden. We use a feedforward neural network (FNN) to construct G [36]. FNNs have universal approximation properties for a broad class of functions [37], and each layer can automatically learn important features and patterns in the data. Used as a generator, the network itself imposes no distributional assumptions and can approximate the true distribution of the data. To ensure that the generator receives different noise input z at each iteration, we sample $z \sim \mathrm{Unif}(0, 1)$ from a uniform distribution. The bootstrap weight w encodes the importance assigned to each sample during training. Because sample distributions may be unbalanced in some datasets, we sample $w \sim n \times \mathrm{Dirichlet}(I_n)$, where $I_n$ is the n-dimensional all-ones vector, producing normalized non-negative weights $w_1, \ldots, w_n$. Appropriate weighting mitigates this imbalance and prevents the model from focusing only on certain samples during training [38]. A minimal sketch of such a generator network is shown below.
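The following PyTorch sketch shows one way such a generator could be structured; the interface, the hidden sizes (256–128, as reported later in this subsection), and the candidate-matrix output introduced below in Equation (11) are our reading of the text, not the exact implementation:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of the FNN generator G(w, z): bootstrap weights w and
    uniform noise z in, a d x l candidate parameter matrix out
    (Equation (11)). A hedged illustration, not the authors' code."""

    def __init__(self, n, d, l, hidden=(256, 128)):
        super().__init__()
        self.d, self.l = d, l
        dims = [n + 1, *hidden, d * l]   # input: weights w (n) + noise z (1)
        layers = []
        for h_in, h_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(h_in, h_out), nn.ReLU()]
        self.net = nn.Sequential(*layers[:-1])   # no activation on the output

    def forward(self, w, z):
        x = torch.cat([w, z], dim=-1)            # concatenate the two inputs
        return self.net(x).view(-1, self.d, self.l)
```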
Although the generator alone can already approximate the target density, the subsequent candidate generation and two-stage optimization further refine both the component weights and the mixture structure. Empirically, we observed that using only the generator (without Stages I and II) results in a relatively small improvement in the average negative log-likelihood (NLL), typically less than 1–2% after convergence. In contrast, when the candidate generation and bootstrap refinement stages are included, the improvement in NLL usually reaches 10–15% for one-dimensional data and around 8–12% for two-dimensional cases. Furthermore, the final model produces noticeably smoother estimated densities and exhibits better generalization on unseen samples. Therefore, although the initial generator provides a strong baseline, the two-stage refinement significantly enhances both the accuracy and stability of the overall density estimation process.
The latent variables $\theta$ in such models often live in spaces with tens or even hundreds of dimensions ($d \gg 1$). While this high-dimensional representation can capture complex data structures such as multimodal distributions and nonlinear dependencies, it also imposes a significant computational burden and complicates statistical inference [39].
Traditional maximum likelihood estimation methods encounter significant bottlenecks in this area. As the dimension increases, the computational complexity of the optimization process increases exponentially, making standard algorithms impractical for d > 10 . When using bootstrap methods to quantify uncertainty, each resampling requires resolving the high-dimensional optimization problem, quickly becoming computationally prohibitive. This “curse of dimensionality” severely restricts the practical application of high-dimensional latent variable models.
To improve the model's expressiveness and training stability for high-dimensional nonparametric mixture densities, we expand the output of the generator G(w, z) from a single parameter vector $\theta \in \mathbb{R}^d$ to a set of candidate parameters $\Theta = \{\theta^{(j)}\}_{j=1}^{l}$, where each $\theta^{(j)} \in \mathbb{R}^d$ is a selectable mixture parameter vector. This removes the restriction of the generator output to a single point estimate and instead creates a high-dimensional support set, allowing it to approximate more complex density structures. The primary purpose of this design is to reduce the variance of the Monte Carlo expectation estimate by introducing multiple candidate samples, thereby stabilizing gradient estimates in stochastic gradient descent [40].
Statistical correlations often exist among the generated samples, and directly averaging them may introduce estimation bias. We therefore introduce a mixture probability vector $\tau = (\tau_1, \ldots, \tau_l)$ to perform weighted sampling over the candidates $\theta^{(j)}$, which yields nearly independent bootstrap samples while maintaining expressiveness. At each draw, an index j is randomly selected from $\{1, \ldots, l\}$ according to the distribution $\tau$, and the corresponding candidate $\theta^{(j)}$ is used; this process is repeated many times. To realize this process, we design the following two-stage training framework, illustrated in Figure 2.
Given a dataset $Y = \{y_i\}_{i=1}^{n}$, Stage-I training feeds two types of random variables to the generator: the bootstrap weight vector $w \sim n \times \mathrm{Dirichlet}(I_n)$ sampled from the Dirichlet distribution, and the latent noise $z \sim \mathrm{Unif}(0, 1)$ drawn from a uniform distribution. Given these two inputs, the generator G(w, z) outputs a candidate parameter matrix:
$$\Theta = G(w, z) = \big[\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(l)}\big] \in \mathbb{R}^{d \times l}. \tag{11}$$
Each column $\theta^{(j)} \in \mathbb{R}^d$ is a candidate density parameter, and l is the number of candidates output by the generator. The generator thus simultaneously provides multiple candidate solutions in different regions of the latent distribution, offering a richer structure for subsequent modeling. Initially, $\tau$ is set to the uniform distribution $\tau_0 = (1/l, \ldots, 1/l)$. During training, an index variable $r \sim \mathrm{Multinomial}(1, \tau_0)$ indicates the selected candidate parameter in a one-hot fashion. At each iteration, r randomly selects a parameter $\theta^{(r)}$ from the candidate set $\{\theta^{(1)}, \ldots, \theta^{(l)}\}$ for use in the likelihood calculation. This random sampling strategy lets the model traverse all candidates in expectation, while each single update depends on only one of them, avoiding the dilution effect caused by averaging. The Stage-I optimization objective is therefore
$$L_1(G) = \mathbb{E}_{w, z, r}\left[\sum_{i=1}^{n} w_i \cdot \log f\big(y_i \mid \theta^{(r)}\big)\right], \qquad r \sim \mathrm{Multinomial}(1, \tau_0). \tag{12}$$
Here, $f(y_i \mid \theta^{(r)})$ denotes the probability density of sample $y_i$ under the parameter $\theta^{(r)}$. This objective combines bootstrap weights and candidate sampling to control the estimation variance while ensuring that the generator covers a wide range of latent distribution structures. Theoretically, it satisfies the inequality
$$\mathrm{Var}_z\big[f(y_i \mid \theta_i)\big] \ge \mathrm{Var}_z\big[\mathbb{E}_r\, f(y_i \mid \theta^{(r)})\big]. \tag{13}$$
By introducing candidate structures and random sampling, the variance of the training process can be effectively suppressed, thereby improving the stability of the estimation.
During optimization, we use stochastic gradient descent (SGD) to update the generator parameters and continuously adjust the expressive power of the candidate structure. The result of Stage-I can be expressed as
$$G^* = \arg\max_{G} L_1(G). \tag{14}$$
Here, $G^*$ is the generator that is optimal under the objective $L_1(G)$. This stage, summarized in Algorithm 1, lays the foundation for the more refined modeling that follows.
Algorithm 1 FGBMLE Stage-I
Input: Dataset $Y = \{y_i\}_{i=1}^{n}$; epochs T; candidate number l; initial generator $G(\cdot\,; \theta_0)$; uniform prior $\tau_0 = (1/l, \ldots, 1/l)$; learning rate $\eta$.
Output: Generator $G^*$ producing the candidate set $\{\theta^{(1)}, \ldots, \theta^{(l)}\}$.
  1: for t = 1 to T do
  2:       Sample bootstrap weights: $w \sim n \times \mathrm{Dirichlet}(I_n)$
  3:       Sample latent noise: $z \sim \mathrm{Unif}(0, 1)$
  4:       Generate candidate parameters: $\Theta = G(w, z) = [\theta^{(1)}, \ldots, \theta^{(l)}]$
  5:       Sample index variable: $r \sim \mathrm{Multinomial}(1, \tau_0)$
  6:       Compute objective $L_1(G)$ using Equation (12)
  7:       Update generator parameters via SGD with learning rate $\eta$
  8: end for
  9: return $G^*$
For all experiments, the proposed FGBMLE model was implemented using a fully connected generator network with three layers (dimensions 256–128–d) and ReLU activations. The model was optimized with the Adam optimizer at a fixed learning rate of 1 × 10 3 across all tasks, ensuring comparable convergence behavior. Training was performed with a batch size of 128 and Xavier initialization for all weights. Early stopping was applied when the validation negative log-likelihood (NLL) improvement dropped below 10 4 for ten consecutive epochs, effectively preventing overfitting. The number of training epochs T was set to 500 for one-dimensional data, 800 for two-dimensional data, and 1000 for the TAN experiments. The number of candidate components l was fixed at 50 in most cases, and adjusted within the range of 20–80 depending on data complexity to balance expressiveness and computational cost.
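As a hedged illustration of Algorithm 1, the sketch below trains the Generator class from the earlier listing on a one-dimensional Gaussian observation model $f(y \mid \theta) = \mathcal{N}(\theta, 1)$; the unit variance, the single noise draw per step, and the use of Adam (as reported above) are our simplifying assumptions:

```python
import torch

def train_stage1(G, y, n_epochs, l, lr=1e-3):
    # Stage-I sketch for f(y | theta) = N(theta, 1); `G` is the Generator
    # sketch above with d = 1. Hypothetical simplification of Algorithm 1.
    n = y.shape[0]
    opt = torch.optim.Adam(G.parameters(), lr=lr)   # Adam, lr = 1e-3 (see above)
    alpha = torch.ones(n)
    log_2pi = torch.log(torch.tensor(2 * torch.pi))
    for _ in range(n_epochs):
        w = torch.distributions.Dirichlet(alpha).sample() * n  # bootstrap weights
        z = torch.rand(1)                                      # latent noise
        theta = G(w.unsqueeze(0), z.unsqueeze(0)).squeeze(0)   # (1, l) candidates
        r = torch.randint(l, (1,)).item()     # index r ~ Multinomial(1, tau_0)
        log_f = -0.5 * (y - theta[0, r]) ** 2 - 0.5 * log_2pi  # log N(y_i | theta^(r), 1)
        loss = -(w * log_f).sum()             # negative weighted log-likelihood
        opt.zero_grad()
        loss.backward()
        opt.step()
    return G
```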
In Stage-II, we refine the density estimate by optimizing the mixture weights associated with the candidate parameters generated in the first stage. We introduce a mixture weight vector $\tau = (\tau_1, \ldots, \tau_l)$ on the l-dimensional probability simplex, i.e., $\tau_j \ge 0$ and $\sum_{j=1}^{l} \tau_j = 1$, which combines the candidate parameters into a mixture model. At this stage, the estimated density of a data sample $y_i$ is
$$\hat{p}(y_i) = \sum_{j=1}^{l} \tau_j \cdot \mathbb{E}_{w,z}\big[f(y_i \mid \theta^{(j)})\big]. \tag{15}$$
Here, $G^*$ is the generator trained in Stage-I, $\theta^{(j)}$ is the jth candidate parameter generated by $G^*(w, z)$, and $\mathbb{E}_{w,z}[\cdot]$ denotes the expectation over the bootstrap weights $w \sim n \times \mathrm{Dirichlet}(I_n)$ and the latent noise $z \sim \mathrm{Unif}(0, 1)$. This formulation allows the model to aggregate information from multiple candidate parameters while retaining representational flexibility.
To determine the optimal mixture weights, we minimize the negative log-likelihood.
$$L_2(\tau) = -\sum_{i=1}^{n} \log \sum_{j=1}^{l} \tau_j \cdot \mathbb{E}_{w,z}\big[f(y_i \mid \theta^{(j)})\big]. \tag{16}$$
This makes the mixture weights consistent with the empirical distribution of the dataset. The optimization of $\tau$ is performed with the Monte Carlo Expectation Maximization (MCEM) algorithm [41]. In each iteration t, the jth component is updated as
$$\tau_j^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \frac{\tau_j^{(t)} \cdot \mathbb{E}_{w,z}\big[f(y_i \mid \theta^{(j)})\big]}{\sum_{j'=1}^{l} \tau_{j'}^{(t)} \cdot \mathbb{E}_{w,z}\big[f(y_i \mid \theta^{(j')})\big]}. \tag{17}$$
The denominator ensures that the updated weights remain normalized on the probability simplex. The iteration continues until convergence, determined by the tolerance criterion $\|\tau^{\mathrm{new}} - \tau^{\mathrm{old}}\| \le tol$.
This two-stage approach ensures that the generator provides diverse candidate structures in high-dimensional latent spaces, while the mixture weights refine the density estimate through adaptive reweighting. Together, these two steps enable the model to capture complex high-dimensional distributions with improved accuracy and stability, as summarized in Algorithm 2.
Algorithm 2 FGBMLE Stage-II
Input: Trained generator $G^*$; dataset $Y = \{y_i\}_{i=1}^{n}$; tolerance $tol$; number of candidates l.
Output: Optimized mixture weights $\tau^{\mathrm{new}} = (\tau_1, \ldots, \tau_l)$.
  1: Initialize: $\tau^{\mathrm{new}} = (1/l, \ldots, 1/l)$, $\tau^{\mathrm{old}} = (0, \ldots, 0)$
  2: while $\min_j |\tau_j^{\mathrm{old}} - \tau_j^{\mathrm{new}}| \ge tol$ do
  3:      Set $\tau^{\mathrm{old}} = \tau^{\mathrm{new}}$
  4:      for j = 1 to l do
  5:            Sample bootstrap weights: $w \sim n \times \mathrm{Dirichlet}(I_n)$
  6:            Sample latent noise: $z \sim \mathrm{Unif}(0, 1)$
  7:            Update $\tau_j^{\mathrm{new}}$ using Equation (17)
  8:      end for
  9: end while
10: return $\tau^{\mathrm{new}}$
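A compact NumPy sketch of the Stage-II update in Equation (17) might look as follows; here F is an n × l matrix of Monte Carlo estimates of $\mathbb{E}_{w,z}[f(y_i \mid \theta^{(j)})]$, and the max-norm stopping rule is our own simplification of the tolerance criterion:

```python
import numpy as np

def stage2_mcem(F, tol=1e-6, max_iter=500):
    # F[i, j] approximates E_{w,z}[f(y_i | theta^(j))], e.g., averaged
    # over generator draws. Returns mixture weights tau (Equation (17)).
    n, l = F.shape
    tau = np.full(l, 1.0 / l)                   # uniform initialization
    for _ in range(max_iter):
        weighted = F * tau                      # tau_j * E[f(y_i | theta^(j))]
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        tau_new = resp.mean(axis=0)             # MCEM fixed-point update
        if np.abs(tau_new - tau).max() < tol:   # convergence on the simplex
            return tau_new
        tau = tau_new
    return tau
```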

3.2. TAN-FGBMLE Framework

It is worth noting that the proposed FGBMLE framework is not limited to TAN models. The two-stage mechanism described above is a general estimation scheme that can be embedded into various probabilistic models, mixture estimators, and graph structures as long as the likelihood function $f(y \mid \theta)$ is available. In this section, we instantiate FGBMLE using the TAN model as an example. TAN was chosen because its structure learning heavily relies on accurate estimates of class-conditional marginal densities and joint densities, making it an ideal testbed to demonstrate how FGBMLE improves density estimation and conditional dependency inference [42].
This section applies the two-stage estimation method to TAN structure learning. The core idea is to leverage the rich candidate parameter space provided by the generator, achieve robust distribution estimates through flexible weighting, and then embed these estimates into the measure of conditional dependencies between features used for TAN structure learning.
The marginal and joint distributions of the features must be estimated for each class c. Random variables are denoted by capital letters $X_i$; accordingly, a data sample $y = (y_1, \ldots, y_d)$ is regarded as a realization of the random vector $X = (X_1, \ldots, X_d)$.
Directly adopting empirical frequencies or kernel density methods usually introduces bias at small sample sizes, resulting in unstable mutual information estimates. We extend the two-stage FGBMLE estimator to the class-conditional setting and obtain the following distribution estimate:
$$\hat{p}(y_i \mid c) = \sum_{j=1}^{l} \tau_{i,c,j}\, \mathbb{E}_{w,z}\big[f(y_i \mid \theta_{i,c}^{(j)})\big]. \tag{18}$$
Here, $\theta_{i,c}^{(j)}$ denotes candidate parameters generated by the Stage-I generator $G^*(w, z)$, and $\tau_{i,c,j}$ the mixture weights optimized in Stage-II; the subscript (i, c, j) indicates the weight of the jth of the l candidate parameters for the ith feature and class c. $\mathbb{E}_{w,z}[\cdot]$ denotes the expectation with respect to the bootstrap weights $w \sim n \times \mathrm{Dirichlet}(I_n)$ and the latent noise $z \sim \mathrm{Unif}(0, 1)$. This design ensures that the estimated probability distributions retain smoothness and generalization ability even with finite samples.
After obtaining the class-conditional probability estimates, we can further define the conditional mutual information, which measures the dependency strength between features $X_i$ and $X_j$ given the class c:
$$I(X_i; X_j \mid C) = \sum_{c} \hat{p}(c) \iint \hat{p}(y_i, y_j \mid c) \log \frac{\hat{p}(y_i, y_j \mid c)}{\hat{p}(y_i \mid c)\, \hat{p}(y_j \mid c)}\, dy_i\, dy_j. \tag{19}$$
In this equation, $\hat{p}(c)$ denotes the empirical prior probability of class c, the denominator $\hat{p}(y_i \mid c)\, \hat{p}(y_j \mid c)$ is the product of the class-conditional marginals, and the integral is taken over the joint feature space of $(y_i, y_j)$. The integral can be approximated by sampling from the estimated distribution and applying the Monte Carlo method, as sketched below. Compared with traditional mutual information computation, the FGBMLE smoothing terms in Equations (18) and (19) effectively reduce the bias caused by noisy samples, leading to more robust edge weights and more reliable structure learning.
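For instance, the Monte Carlo approximation of Equation (19) can be sketched as follows, assuming vectorized density evaluators for the fitted FGBMLE models and class-conditional samples drawn from the estimated joint; all function names here are hypothetical:

```python
import numpy as np

def cmi_monte_carlo(samples_by_class, p_joint, p_i, p_j, p_c):
    # samples_by_class[c]: (m x 2) draws of (y_i, y_j) from p_hat(y_i, y_j | c);
    # p_i, p_j: estimated class-conditional marginals; p_c: class priors.
    cmi = 0.0
    for c, s in samples_by_class.items():
        joint = p_joint(s[:, 0], s[:, 1], c)        # p_hat(y_i, y_j | c)
        marg = p_i(s[:, 0], c) * p_j(s[:, 1], c)    # p_hat(y_i | c) p_hat(y_j | c)
        # Sampling from the joint turns the double integral into a sample mean
        cmi += p_c[c] * np.mean(np.log(joint / marg))
    return cmi
```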
Based on the values of conditional mutual information, a weighted complete graph can be constructed, and the maximum spanning tree algorithm can be used to extract the main dependencies among features. By orienting the spanning tree and adding the class node c, the complete TAN structure can be obtained. The conditional probability of each feature node given its parent node and class can be expressed as the following equation.
$$\hat{p}(y_i \mid y_{\mathrm{pa}}, c) = \frac{\hat{p}(y_i, y_{\mathrm{pa}} \mid c)}{\hat{p}(y_{\mathrm{pa}} \mid c)}, \qquad \hat{p}(y \mid c) = \prod_{i=1}^{d} \hat{p}\big(y_i \mid \mathrm{Pa}(X_i), c\big). \tag{20}$$
Here, $y_{\mathrm{pa}}$ denotes the observed value of the parent of $X_i$ in the TAN structure, and $\mathrm{Pa}(X_i)$ denotes the parent set of feature $X_i$. Combining with the class prior $\hat{p}(c)$, the classification rule is
$$\hat{c}(y) = \arg\max_{c}\; \hat{p}(c)\, \hat{p}(y \mid c). \tag{21}$$
This modeling approach remains consistent with the original definition of TAN, but incorporates the two-stage mechanism of FGBMLE into the estimation process, significantly enhancing the stability of mutual information calculation and the accuracy of edge structure learning [43].
Since the TAN structure relies on precise class-conditional mutual information, we estimate the class-conditional mutual information matrix between continuous attributes with the FGBMLE method and use it as the edge weights of the graph. The class-conditional mutual information is calculated according to Equation (8); in the air quality application, $y_i$ and $y_j$ represent pollutants or environmental factors, and C denotes the air quality level. This metric quantifies the dependency strength between two variables under the given classification conditions.
Prim's algorithm is then applied to construct a maximum spanning tree, which yields the optimal TAN dependency structure. The algorithm starts from an arbitrary node and iteratively adds the maximum-weight edge connecting a new node to the existing tree until all nodes are included. Mathematically,
$$T^* = \arg\max_{T \subseteq G} \sum_{(Y_i, Y_j) \in T} I_c(Y_i; Y_j), \tag{22}$$
where $I_c(Y_i; Y_j)$ denotes the class-conditional mutual information and $T^*$ is the resulting optimal dependency tree. A minimal sketch of this step is given below.
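The sketch uses a naive O(d³) variant of Prim's algorithm, adequate for the small attribute counts considered here:

```python
import numpy as np

def prim_max_spanning_tree(W):
    # W: symmetric d x d matrix of class-conditional mutual information.
    # Greedily grows the tree from node 0 by the heaviest crossing edge,
    # realizing the maximization in Equation (22).
    d = W.shape[0]
    in_tree = {0}
    edges = []
    while len(in_tree) < d:
        best = (-np.inf, None, None)
        for u in in_tree:
            for v in range(d):
                if v not in in_tree and W[u, v] > best[0]:
                    best = (W[u, v], u, v)
        _, u, v = best
        edges.append((u, v))        # edge oriented away from the tree root
        in_tree.add(v)
    return edges
```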
Overall, this integration shows how the general FGBMLE estimator can serve as a plug-and-play density estimation engine within the TAN framework in Algorithm 3. While the present work focuses on TAN as a demonstrative application, the same mechanism can be naturally extended to more general Bayesian network structures, mixture-based models, or continuous graphical learning tasks where accurate conditional density estimation is required. This highlights that FGBMLE is a broadly applicable framework, with TAN structure learning being only one of its practical instantiations.
Algorithm 3 TAN-FGBMLE
Input: Dataset $Y = \{y_i\}_{i=1}^{n}$; trained generator $G^*$ from Algorithm 1; optimized mixture weights $\tau$ from Algorithm 2; number of features d; number of candidate parameters l.
Output: TAN structure and classification rule $\hat{c}(y)$.
  1: for each class c do
  2:       Estimate marginal distributions $\hat{p}(y_i \mid c)$ for all $i = 1, \ldots, d$ using Equation (18)
  3:       Compute conditional mutual information $I(X_i; X_j \mid C)$ using Equation (19)
  4: end for
  5: Construct a weighted complete graph with edge weights $I(X_i; X_j \mid C)$
  6: Extract the maximum spanning tree to determine feature dependencies
  7: Orient the spanning tree and add the class node c to obtain the TAN structure
  8: for each feature $X_i$ do
  9:       Compute the conditional probability $\hat{p}(y_i \mid y_{\mathrm{pa}}, c)$ using Equation (20)
10: end for
11: Define the classification rule $\hat{c}(y) = \arg\max_c \hat{p}(c)\, \hat{p}(y \mid c)$
12: return TAN structure and $\hat{c}(y)$

4. Experiment Results

4.1. Simulation Experiments

In order to systematically evaluate the performance of TAN-FGBMLE, we designed a series of simulation experiments covering typical latent distribution characteristics. They aim to test the ability of our method to capture complex distribution structures, handle different sample sizes, and compute efficiently. Specifically, we define two simulation scenarios, summarized in Table 1, featuring bimodal and constrained unimodal distributions.
In the GMM scenario, the latent distribution has a bimodal structure, suitable for testing multimodal fitting. In the GaMM scenario, the latent variables are restricted to the hypercube $[0, 1]^d$; this spatially constrained latent distribution evaluates the adaptability of our method to constrained unimodal distributions. We set the data size to n = 1000 and further expanded it to n = 10,000 and n = 100,000 to evaluate computational efficiency. The FGBMLE generator uses a two-layer fully connected neural network with 600 neurons per layer, ReLU activations, and Xavier initialization, a commonly used weight initialization scheme designed to avoid vanishing or exploding gradients. Training uses the Adam optimizer with an initial learning rate of 0.001, and bootstrap weights w are generated from a Dirichlet distribution. All experiments are based on the PyTorch 2.2.2 framework and run on an NVIDIA GeForce RTX 4060 Ti GPU to ensure efficient computation.
Figure 3 demonstrates the ability of FGBMLE and the bootstrap to fit the underlying distribution $\pi(\theta)$ in the two simulation scenarios, GMM and GaMM. In the GMM scenario, the true distribution exhibits two distinct peaks, located near $\theta = -3$ and $\theta = 3$. FGBMLE almost perfectly reproduces the height and position of both peaks, demonstrating its strong ability to capture multimodal distributions. The bootstrap also performs well, but its fit to the peak heights is slightly insufficient. In the GaMM scenario, the true distribution is unimodal and supported on the interval (0, 1). FGBMLE accurately captures this unimodal shape, and its smooth curve closely matches the true distribution, especially in peak position and overall shape, whereas the bootstrap shows slight deviations, especially near the boundaries.
The results in Figure 4 clearly show that FGBMLE exhibits excellent fitting capabilities in multimodal and unimodal restricted scenarios. The smooth distribution it generates can not only accurately capture the detailed characteristics of the real distribution, but also avoid the shortcomings of traditional methods caused by oversmoothing, verifying its superiority in complex distribution modeling.
Table 2 compares FGBMLE, bootstrap, and KDE using four complementary metrics. The 1-Wasserstein distance, $W_1(\pi, \hat{\pi}) = \int |F_{\pi}(x) - F_{\hat{\pi}}(x)|\, dx$, where F denotes the cumulative distribution function, reflects global distributional discrepancies. The integrated squared error, $ISE = \int_{\mathbb{R}^d} \{\hat{\pi}(x) - \pi(x)\}^2\, dx$, measures local density fidelity. In addition, we report the mean squared error (MSE) between the estimated and true density values, as well as the Kullback–Leibler (KL) divergence [44], which evaluates the probabilistic discrepancy between $\pi$ and $\hat{\pi}$. Table 2 summarizes averages over 50 independent simulation runs for both scenarios. The results show that FGBMLE performs comparably to or better than traditional bootstrap methods across all metrics, particularly in the GaMM scenario, while the KDE baseline serves as a nonparametric reference; a grid-based sketch of these metrics follows.
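For reference, the four metrics can be approximated on a uniform evaluation grid as in the following one-dimensional sketch; the discretization and the numerical guard for the KL term are our own choices:

```python
import numpy as np

def density_metrics(p_true, p_est, x):
    # p_true, p_est: density values evaluated on the uniform grid x.
    dx = x[1] - x[0]
    F_true = np.cumsum(p_true) * dx              # CDFs for the W1 distance
    F_est = np.cumsum(p_est) * dx
    w1 = np.sum(np.abs(F_true - F_est)) * dx     # 1-Wasserstein distance
    ise = np.sum((p_est - p_true) ** 2) * dx     # integrated squared error
    mse = np.mean((p_est - p_true) ** 2)         # pointwise mean squared error
    eps = 1e-12                                  # guard against log(0)
    kl = np.sum(p_true * np.log((p_true + eps) / (p_est + eps))) * dx
    return w1, ise, mse, kl
```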
We also compared the average computation time of FGBMLE and bootstrap in simulations with sample sizes $n \in \{1000;\ 10{,}000;\ 100{,}000\}$. The results are plotted on a logarithmic scale in Figure 5, which shows that FGBMLE is more scalable: at n = 100,000, FGBMLE completes in only about 5 min, while the bootstrap takes nearly three times as long.
To validate the applicability of the proposed TAN-FGBMLE method in complex distribution scenarios, we designed a probability density estimation experiment on artificial data. The experimental data consist of three Gaussian components, $\mathcal{N}(-3, 0.5^2)$, $\mathcal{N}(0, 1^2)$, and $\mathcal{N}(3, 0.7^2)$, forming a typical trimodal, long-tailed, and heteroskedastic distribution. We compared the proposed method with traditional KDE and GMM, using the true distribution as a benchmark.
To keep the evaluation framework consistent with Table 2, we also assess KDE, GMM, and TAN-FGBMLE under the trimodal long-tailed distribution using the same four quantitative metrics: W1, ISE, MSE, and KL divergence.
Table 3 reports the quantitative density estimation results for the trimodal distribution experiment using four evaluation metrics. As shown in the table, TAN-FGBMLE achieves the lowest values across all four metrics, indicating that its estimated density is consistently closer to the true distribution than those produced by KDE and GMM. In particular, the improvements in W1 and KL suggest that TAN-FGBMLE captures both the global distributional structure and the probabilistic discrepancies more effectively. The reduction in ISE and MSE further demonstrates its accuracy in recovering the local shape of the trimodal density. Overall, these results highlight the advantage of TAN-FGBMLE in modeling complex multimodal distributions.
Figure 6 compares the density estimation results of the proposed FGBMLE with the true GMM and the bootstrap NPMLE under a 2D trimodal distribution. FGBMLE reconstructs the three modal regions accurately and maintains sharp, well-aligned contour structures. In contrast, the bootstrap estimator produces overly dispersed contours and fails to preserve the multimodal shape. These results show that FGBMLE achieves clearly superior performance in capturing multimodal and heteroskedastic density structures in higher-dimensional settings.

4.2. Structure Recovery Experiment

To further evaluate the accuracy of the learned structures, a simulation-based experiment was conducted using data generated from a known continuous Tree-Augmented Naive Bayes model with one binary class variable $C \in \{0, 1\}$ and eight continuous attributes $X_1, X_2, \ldots, X_8$. The underlying TAN structure over the attributes was predefined as a fixed chain, $X_1 \to X_2 \to \cdots \to X_8$, and remained constant throughout all trials.
For each class c, the joint distribution $P(X_1, \ldots, X_8 \mid C = c)$ followed a linear Gaussian model, where the root node $X_1$ was generated according to

$$X_1 \mid C = c \sim \mathcal{N}\big(\mu_1^{(c)}, \sigma_1^2\big), \tag{23}$$

and each non-root node $X_j$ was generated from its attribute parent $\mathrm{pa}(j)$ as

$$X_j \mid X_{\mathrm{pa}(j)}, C = c \sim \mathcal{N}\big(a_j^{(c)} X_{\mathrm{pa}(j)} + b_j^{(c)},\; \sigma_j^2\big), \tag{24}$$

where all parameters were randomly initialized and then fixed across experiments to ensure reproducibility. The class prior was set to $P(C = 1) = 0.5$. A hedged sketch of this synthetic generator is given below.
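The sketch below uses illustrative placeholder parameter ranges in place of the fixed values used in the experiments:

```python
import numpy as np

def sample_chain_tan(n, d=8, seed=0):
    # Chain X1 -> X2 -> ... -> Xd with class-dependent linear-Gaussian
    # CPDs, Eqs. (23)-(24). Parameter ranges below are placeholders.
    rng = np.random.default_rng(seed)
    mu = np.array([-1.0, 1.0])                  # class-dependent root means
    a = rng.uniform(0.5, 1.5, size=(2, d))      # slopes a_j^(c), fixed per run
    b = rng.uniform(-1.0, 1.0, size=(2, d))     # intercepts b_j^(c)
    C = rng.integers(0, 2, size=n)              # class prior P(C = 1) = 0.5
    X = np.zeros((n, d))
    X[:, 0] = rng.normal(mu[C], 1.0)            # root node X1 | C = c
    for j in range(1, d):                       # X_j | X_pa(j), C = c
        X[:, j] = a[C, j] * X[:, j - 1] + b[C, j] + rng.normal(0.0, 1.0, n)
    return X, C
```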
Two synthetic datasets were generated with sample sizes n = 1000 and n = 2000 , and each configuration was repeated for twenty independent trials. For each dataset, three methods, TAN-KDE, TAN-GMM, and the proposed TAN-FGBMLE were used to estimate the TAN structure by calculating conditional mutual information between attributes given the class variable and constructing a maximum-weight spanning tree. In all methods, the structure learning process was identical except for the underlying conditional density estimator used for CMI computation.
To measure the accuracy of the recovered structures, we adopted the Structural Hamming Distance (SHD) [45] between the estimated and the true attribute tree. Both trees were represented as undirected adjacency matrices $A^{\mathrm{true}}$ and $A^{\mathrm{est}}$ of dimension $d \times d$, and the SHD was defined as

$$\mathrm{SHD}\big(A^{\mathrm{true}}, A^{\mathrm{est}}\big) = \sum_{1 \le i < j \le d} \mathbb{I}\big[A_{ij}^{\mathrm{true}} \ne A_{ij}^{\mathrm{est}}\big], \tag{25}$$
where $\mathbb{I}[\cdot]$ denotes the indicator function. SHD quantifies the number of edge additions or deletions required to transform the estimated tree into the true one. Only attribute–attribute edges were considered, excluding those connected to the class node C. For each method and sample size, we report the mean and standard deviation of SHD over twenty repetitions; a minimal implementation sketch follows.
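Computing the SHD between two undirected adjacency matrices is straightforward:

```python
import numpy as np

def shd_undirected(A_true, A_est):
    # Counts mismatched attribute-attribute edges once each (Equation (25)),
    # using only the upper triangle so that i < j.
    diff = A_true != A_est
    return int(np.triu(diff, k=1).sum())
```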
As shown in Table 4, SHD values decrease as the sample size increases, indicating that all methods benefit from larger datasets in structure learning. Among the three methods, TAN-FGBMLE consistently achieves competitive or superior SHD performance compared with TAN-KDE and TAN-GMM. In particular, at n = 2000, TAN-FGBMLE obtains the smallest SHD (0.2 ± 0.2), suggesting that the proposed method more accurately recovers the underlying TAN structure given sufficient data.
Although the SHD of TAN-FGBMLE is only marginally lower than that of TAN-GMM, it demonstrates higher stability (lower variance) across repeated trials. This result implies that the FGBMLE-based conditional density estimation provides a smoother and more robust approximation of the joint distribution, which effectively improves the reliability of structure learning. Consequently, the proposed TAN-FGBMLE achieves a better balance between structural accuracy and distributional modeling capability.

4.3. Comparative Experiments with Extended Naive Bayes and Discriminative Models

To further evaluate the effectiveness of the proposed TAN-FGBMLE model, we conducted comparative experiments against several state-of-the-art Naive Bayes extensions that explicitly model attribute dependencies, as well as strong discriminative classifiers. The compared Naive Bayes-type models include TAN, Averaged One-Dependence Estimators (AODEs) [46], Weighted AODE (WAODE) [47], Hidden Naive Bayes (HNB) [48], and Correlation-based Feature-Weighted Naive Bayes (CFWNB) [49]. These models were selected because they relax the independence assumption of traditional NB by incorporating single or multiple parent dependencies among features. All baseline implementations are available in the WEKA platform, ensuring fair and reproducible comparisons.
In addition, three discriminative models, Logistic Regression, Random Forest (RF), and RBF-SVM, were included to assess whether the generative optimization of TAN-FGBMLE generalizes beyond the Naive Bayes family. All models were trained under identical data partitions and preprocessing procedures. Two primary metrics were used for evaluation: classification accuracy (%) and average log-likelihood (LL). Each result represents the mean and standard deviation over five independent runs.
As shown in Table 5, TAN-FGBMLE consistently achieves superior results in both classification accuracy and log-likelihood compared with all competing models. Relative to the standard TAN, TAN-FGBMLE improves accuracy by 4.3% and significantly increases the log-likelihood, indicating that the proposed generative Bayesian optimization effectively captures the conditional dependencies among features while maintaining a coherent probabilistic structure.
To assess the robustness of these improvements, we performed a Wilcoxon signed-rank test [50] across all datasets for pairwise comparisons between TAN-FGBMLE and each baseline model. The test statistic is defined as
$$W = \min(W^{+}, W^{-}), \qquad W^{+} = \sum_{d_i > 0} R_i, \qquad W^{-} = \sum_{d_i < 0} R_i,$$
where $d_i = x_i - y_i$ denotes the performance difference on the $i$-th dataset and $R_i$ is the rank of $|d_i|$. Under the null hypothesis of no systematic difference between models, TAN-FGBMLE was found to be significantly better than all other NB-type baselines at the 5% significance level ($p < 0.05$); see Table 5.
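As an illustration of the test (not the authors' script), the sketch below compares the per-dataset accuracies of TAN-FGBMLE and TAN-KDE from the first eight rows of Table 7; SciPy's two-sided statistic is $\min(W^{+}, W^{-})$, matching the definition above.

```python
# Sketch: Wilcoxon signed-rank test on paired per-dataset accuracies.
from scipy.stats import wilcoxon

acc_fgbmle = [0.612, 0.889, 0.934, 0.856, 0.951, 0.862, 0.712, 0.764]
acc_kde    = [0.512, 0.861, 0.902, 0.823, 0.931, 0.825, 0.673, 0.738]

stat, p = wilcoxon(acc_fgbmle, acc_kde)   # statistic = min(W+, W-)
print(f"W = {stat:.1f}, p = {p:.4f}, significant at 5%: {p < 0.05}")
```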
Although discriminative models such as Random Forest and RBF-SVM also achieve competitive accuracy, TAN-FGBMLE slightly outperforms them in both metrics. This demonstrates that a well-trained generative structure can rival, and even surpass, discriminative models when handling complex or partially observed distributions, while also maintaining the interpretability advantages inherent to probabilistic graphical models.

4.4. Classification Performance on UCI Benchmark Datasets

To further evaluate the effectiveness of the proposed TAN-FGBMLE method on real-world classification tasks, we conducted experiments on 20 widely used UCI benchmark datasets. These datasets cover a wide range of scales, feature dimensions, and class complexities, providing a comprehensive testbed beyond controlled simulation environments. Missing values were imputed with the mean of the corresponding feature. Descriptive information on the datasets is given in Table 6. We compared classification performance against multiple classifiers, including Tree-Augmented Kernel Density Bayes (TAN-KDE), the Naive Bayes classifier (NBC), the Flexible Bayes Classifier (FBC), k-Nearest Neighbors (KNN), the C4.5 decision tree, a Neural Network (NN), and a Support Vector Machine (SVM), using ten-fold cross-validation. The average classification accuracies are shown in Table 7.
The baseline classifiers were configured as follows. The SVM uses an RBF kernel whose parameters are selected by grid search over $c \in [-8, 8]$ and $g \in [-8, 8]$ with a step size of 0.1. The decision tree uses the C4.5 model with post-pruning. The neural network is set with hidden_layer_sizes = 100, activation = relu, learning rate $\alpha$ = 0.001, random_state = 42, and max_iter = 1000. The nearest-neighbor size is 5, and the Gaussian smoothing parameter is 0.001. From Table 7, we observe clear differences in classification performance across classifiers. For most datasets, TAN-FGBMLE achieves superior results compared to the baselines, demonstrating its ability to capture complex dependency structures more effectively.
Ten-fold cross-validation yields the mean and standard deviation of the classification accuracy reported in Table 7. When the sample size is small or the feature dimension is limited, traditional classifiers such as NBC or SVM can still maintain reasonable accuracy due to their simple assumptions. However, as the number of samples and attributes increases, the advantage of TAN-FGBMLE becomes more evident, as it stabilizes mutual information estimation and reduces noise in high-dimensional spaces. Overall, TAN-FGBMLE improves the average classification accuracy over TAN-KDE, NBC, FBC, KNN, C4.5, NN, and SVM by 3.1%, 2.4%, 3.6%, 2.7%, 4.2%, 1.5%, and 1.2%, respectively. These results confirm that TAN-FGBMLE not only outperforms existing methods but also adapts well to datasets of varying scales and complexities, making it a robust and generalizable classifier.
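A sketch of this evaluation protocol with the stated baseline settings is shown below, using scikit-learn equivalents. It is illustrative only: scikit-learn has no exact C4.5 implementation, so an entropy-criterion tree stands in, the SVM hyperparameters shown are placeholders for the grid-searched values, and the Iris data is used just to make the snippet self-contained.

```python
# Sketch: ten-fold cross-validation of the baseline classifiers.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # placeholder dataset

models = {
    "KNN":  KNeighborsClassifier(n_neighbors=5),
    "C4.5": DecisionTreeClassifier(criterion="entropy"),   # C4.5 stand-in
    "NN":   MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                          learning_rate_init=0.001, max_iter=1000,
                          random_state=42),
    "SVM":  SVC(kernel="rbf", C=2.0**3, gamma=2.0**-2),    # grid-searched in practice
}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```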

4.5. TAN-FGBMLE for Graph Structure Learning on Air Quality Data

After validating TAN-FGBMLE on benchmark classification datasets, we further explore its applicability to domain-specific real-world data through a graph structure learning experiment on an open-source air quality dataset. This dataset contains daily records of multiple continuous environmental variables such as temperature, humidity, and pollutant concentrations including NO2, SO2, CO, PM2.5, and PM10, providing a representative test case for evaluating TAN structure learning.
The dependency tree is shown in Figure 7. The figure shows a direct connection between PM2.5 and PM10, indicating a high correlation between the two. The edge relationship between temperature and NO2 reveals the influence of meteorological conditions on nitrogen oxide concentrations. The connection between industrial proximity and SO2 and CO demonstrates the dominant role of emission sources. The relationship between CO and population density reflects the contribution of human activities to air quality. These findings are consistent with the generation mechanism of atmospheric pollutants and demonstrate that the Prim algorithm can extract a reasonable and interpretable dependency structure.
After constructing the TAN structure, we further compared edge stability under different density estimation methods. We used bootstrap sampling to repeatedly estimate the mutual information and counted how often each key dependency edge appeared across the repeated experiments, denoting this as the Bootstrap Edge Consistency (BEC) metric [50]. It is defined as
$$\mathrm{BEC}(e_{i,j}) = \frac{1}{B} \sum_{b=1}^{B} \delta_{i,j}^{(b)},$$
where $\delta_{i,j}^{(b)} = 1$ if edge $e_{i,j}$ appears in the $b$-th bootstrap result and 0 otherwise. A higher BEC value indicates that the edge is stably identified under different sampling conditions, implying a more robust and interpretable dependency structure.
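A minimal sketch of this procedure is given below; `learn_tree` is a hypothetical placeholder for any of the CMI-based structure learners compared above.

```python
# Sketch: Bootstrap Edge Consistency. Re-learn the attribute tree on B
# bootstrap resamples and record how often each undirected edge reappears.
import numpy as np

def bootstrap_edge_consistency(X, y, learn_tree, B=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    counts = np.zeros((d, d))
    for _ in range(B):
        idx = rng.integers(0, n, n)            # resample with replacement
        counts += learn_tree(X[idx], y[idx])   # undirected 0/1 adjacency
    return counts / B                          # BEC(e_ij) for every edge
```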
Table 8 shows the BEC comparison results for several key edges. The traditional KDE is unstable when estimating mutual information on multimodal and heteroskedastic data, resulting in low edge frequencies. Our method, TAN-FGBMLE, exhibits higher BEC values for all key dependency edges. For example, the PM2.5–PM10 relationship improves from 0.71 to 0.92, and the temperature–NO2 relationship improves from 0.64 to 0.88. These results demonstrate that TAN-FGBMLE not only improves the stability of structure learning on real-world air quality data but also provides interpretable insights into environmental and pollution-related dependencies, highlighting its potential for practical applications beyond benchmark datasets.

4.6. Summary of Strengths and Limitations

The experimental analyses above show that the proposed FGBMLE framework performs particularly well in scenarios involving complex, multimodal, or high-dimensional data distributions. By decoupling the candidate generation and mixture reweighting processes, FGBMLE effectively captures intricate latent structures while maintaining stability under limited-sample conditions. Compared with classical EM-based or kernel density methods, it consistently achieves smoother likelihood surfaces, lower estimation bias, and improved robustness in conditional mutual information computation and TAN structure learning.
Nevertheless, the method may underperform when the underlying data distribution is simple and unimodal (such as Gaussian-like datasets), where the flexibility of the generator offers limited additional benefit while increasing computational cost. Furthermore, in extremely high-dimensional cases with very small sample sizes, the second-stage reweighting process may exhibit instability due to insufficient density support. In such cases, conventional parametric models or regularized variants of FGBMLE may yield more stable performance. Future work will explore adaptive regularization and model selection strategies to mitigate these limitations and extend the framework's scalability to ultra-high-dimensional domains.

5. Conclusions

This paper addresses the difficulty of accurately estimating class-conditional mutual information in the TAN model with continuous attributes. We propose an improved method, TAN-FGBMLE, based on the FGBMLE framework. The method efficiently models complex mixture densities by introducing a generative network and a two-stage optimization framework, significantly improving computational efficiency while maintaining estimation accuracy. Experimental results show that TAN-FGBMLE outperforms traditional nonparametric methods in classification on multiple public datasets and real-world air quality data. The learned graph structure reflects the dependencies between continuous attributes, demonstrating strong interpretability. Current applications focus primarily on classification tasks, and the method's applicability to other fields requires further verification. Future research could focus on optimizing the network structure, enhancing its ability to characterize complex distributions, and extending its application to scenarios such as time series analysis and anomaly detection.

Author Contributions

Conceptualization, C.W. and C.L.; Methodology, C.L.; Software, T.Z.; Validation, T.Z.; Investigation, P.W.; Writing—original draft, C.W.; Writing—review & editing, C.W.; Supervision, C.W. and C.L.; Project administration, Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (General Project 62376089) and supported by the Yellow Crane Talents Program and research funding from Hubei University of Technology (HBUT 4301/00550).

Data Availability Statement

The data that support the findings of this study are available in the UCI dataset https://archive.ics.uci.edu/datasets (accessed on 24 October 2025). The code of the paper is in the following link: https://github.com/ZTY0516/TAN-FGBMLE (accessed on 24 October 2025).

Acknowledgments

The authors would like to thank the editors and anonymous reviewers who carefully read the paper and provided valuable suggestions that considerably improved the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gelman, A.; Carlin, J.B.; Stern, H.S.; Rubin, D.B. Bayesian Data Analysis; Chapman and Hall: New York, NY, USA, 1995. [Google Scholar] [CrossRef]
  2. Zhang, J.; Shi, M.H.; Lang, X.S.; You, Q.J.; Jing, Y.L.; Huang, D.Y.; Dai, H.Y.; Kang, J. Dynamic risk evaluation of hydrogen station leakage based on fuzzy dynamic Bayesian network. Int. J. Hydrogen Energy 2024, 50, 1131–1145. [Google Scholar] [CrossRef]
  3. Zhang, J.F.; Jin, M.; Wan, C.P.; Dong, Z.J.; Wu, X.H. A Bayesian network-based model for risk modeling and scenario deduction of collision accidents of inland intelligent ships. Reliab. Eng. Syst. Saf. 2024, 243, 109816. [Google Scholar] [CrossRef]
  4. Muñoz-Valencia, C.S.; Quesada, J.A.; Orozco, D.; Barber, X. Employing Bayesian Networks for the Diagnosis and Prognosis of Diseases: A Comprehensive Review. arXiv 2023, arXiv:2304.06400. [Google Scholar] [CrossRef]
  5. Lewis, D.D. Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval. In Proceedings of the 10th European Conference on Machine Learning (ECML), Chemnitz, Germany, 21–23 April 1998. [Google Scholar] [CrossRef]
  6. Domingos, P.; Pazzani, M. On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Mach. Learn. 1997, 29, 103–130. [Google Scholar] [CrossRef]
  7. Friedman, N.; Geiger, D.; Goldszmidt, M. Bayesian Network Classifiers. Mach. Learn. 1997, 29, 131–163. [Google Scholar] [CrossRef]
  8. Bielza, C.; Larranaga, P. Discrete Bayesian Network Classifiers: A Survey. ACM Comput. Surv. 2014, 47, 1–43. [Google Scholar] [CrossRef]
  9. Kraskov, A.; Stogbauer, H.; Grassberger, P. Estimating Mutual Information. Phys. Rev. E 2004, 69, 066138. [Google Scholar] [CrossRef]
  10. Silverman, B.W. Density Estimation for Statistics and Data Analysis; Routledge: New York, NY, USA, 2018. [Google Scholar] [CrossRef]
  11. Wei, C.H.; Peng, B.; Li, C.; Liu, Y.Y.; Ye, Z.W.; Zuo, Z.Q. A Two-Stage Optimized Robust Kernel Density Estimation for Bayesian Classification with Outliers. Int. J. Mach. Learn. Cyber. 2025, 1–25. [Google Scholar] [CrossRef]
  12. Peel, D.; McLachlan, G.J. Robust mixture modelling using the t distribution. Stat. Comput. 2000, 10, 339–348. [Google Scholar] [CrossRef]
  13. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; Chapman and Hall: New York, NY, USA, 1993. [Google Scholar] [CrossRef]
  14. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational Inference: A Review for Statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877. [Google Scholar] [CrossRef]
  15. Laird, N.M. Nonparametric Maximum Likelihood Estimation of a Mixing Distribution. J. Am. Stat. Assoc. 1978, 73, 805–811. [Google Scholar] [CrossRef]
  16. Zhang, C.H. Compound Decision Theory and Empirical Bayes Methods. Ann. Stat. 2003, 31, 379–390. [Google Scholar] [CrossRef]
  17. Koenker, R.; Mizera, I. Convex Optimization, Shape Constraints, Compound Decisions, and Empirical Bayes Rules. J. Am. Stat. Assoc. 2014, 109, 674–685. [Google Scholar] [CrossRef]
  18. Feng, L.; Dicker, L.H. Approximate Nonparametric Maximum Likelihood for Mixture Models: A Convex Optimization Approach to Fitting Arbitrary Multivariate Mixing Distributions. Comput. Stat. Data Anal. 2018, 122, 80–91. [Google Scholar] [CrossRef]
  19. Ronn, B.B.; Skovgaard, I.M. Nonparametric Maximum Likelihood Estimation of Randomly Time-Transformed Curves. Ann. Stat. 2009, 37, 1–17. [Google Scholar] [CrossRef]
  20. Li, Y.; Ye, Z. Boosting in Univariate Nonparametric Maximum Likelihood Estimation. IEEE Signal Process. Lett. 2021, 28, 623–627. [Google Scholar] [CrossRef]
  21. Efron, B. Empirical Bayes Deconvolution Estimates. Biometrika 2016, 103, 1–20. [Google Scholar] [CrossRef]
  22. Shao, H.J.; Yao, S.C.; Sun, D.C.; Zhang, A.; Liu, S.Z.; Liu, D.X.; Wang, J.; Abdelzaher, T. ControlVAE: Controllable Variational Autoencoder. In Proceedings of the 37th International Conference on Machine Learning (PMLR), Virtual, 13–18 July 2020; Volume 119, pp. 8655–8664. Available online: https://dl.acm.org/doi/10.5555/3524938.3525741 (accessed on 24 October 2025).
  23. Rubin, D.B. The Bayesian Bootstrap. Ann. Stat. 1981, 9, 130–134. [Google Scholar] [CrossRef]
  24. Lam, H.; Liu, Z.Y. Bootstrap in High Dimension with Low Computation. In Proceedings of the 40th International Conference on Machine Learning (PMLR), Honolulu, HI, USA, 23–29 July 2023; Volume 202, pp. 18419–18453. Available online: https://proceedings.mlr.press/v202/lam23a.html (accessed on 24 October 2025).
  25. Kagerer, K. A Hat Matrix for Monotonicity Constrained B-Spline and P-Spline Regression; Technical Report; University of Regensburg: Regensburg, Germany, 2015; Available online: https://epub.uni-regensburg.de/31450/ (accessed on 24 October 2025).
  26. Nadaraya, E.A.; Kotz, S. Nonparametric Estimation of Probability Densities and Regression Curves; Springer: Dordrecht, The Netherlands, 1989; Available online: https://link.springer.com/book/10.1007/978-94-009-2583-0 (accessed on 24 October 2025).
  27. Efron, B. Bootstrap Methods: Another Look at the Jackknife. In Breakthroughs in Statistics: Methodology and Distribution; Springer: New York, NY, USA, 1992. [Google Scholar] [CrossRef]
  28. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann: San Francisco, CA, USA, 2014. [Google Scholar] [CrossRef]
  29. Wei, C.; Li, C.; Liu, Y.; Chen, S.; Zuo, Z.; Wang, P.; Ye, Z. Causal Discovery and Reasoning for Continuous Variables with an Improved Bayesian Network Constructed by Locality Sensitive Hashing and Kernel Density Estimation. Entropy 2025, 27, 123. [Google Scholar] [CrossRef]
  30. Spirtes, P.; Glymour, C.N.; Scheines, R. Causation, Prediction, and Search; MIT Press: Cambridge, MA, USA, 2000. [Google Scholar]
  31. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley and Sons: New York, NY, USA, 1999. [Google Scholar] [CrossRef]
  32. Chow, C.K.; Liu, C. Approximating Discrete Probability Distributions with Dependence Trees. IEEE Trans. Inf. Theory 1968, 14, 462–467. [Google Scholar] [CrossRef]
  33. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar] [CrossRef]
  34. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014; Available online: https://dl.acm.org/doi/proceedings/10.5555/2969033 (accessed on 24 October 2025).
  35. Noh, H.; You, T.; Mun, J.; Han, B. Regularizing Deep Neural Networks by Noise: Its Interpretation and Optimization. In Proceedings of the 31st International Conference on Neural Information Processing System (NIPS), Long Beach, CA, USA, 4–9 December 2017; Available online: https://dl.acm.org/doi/10.5555/3295222.3295264 (accessed on 24 October 2025).
  36. Hornik, K.; Stinchcombe, M.; White, H. Multilayer Feedforward Networks Are Universal Approximators. Neural Netw. 1989, 2, 359–366. [Google Scholar] [CrossRef]
  37. Hornik, K. Approximation Capabilities of Multilayer Feedforward Networks. Neural Netw. 1991, 4, 251–257. [Google Scholar] [CrossRef]
  38. Wei, H.; Xie, R.; Yang, L.; Xu, Z.; Li, Z. MetaInfoNet: Learning Task-Guided Information for Sample Reweighting. arXiv 2020, arXiv:2012.05273. [Google Scholar] [CrossRef]
  39. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, NY, USA, 2009. [Google Scholar] [CrossRef]
  40. Bottou, L.; Curtis, F.E.; Nocedal, J. Optimization Methods for Large-Scale Machine Learning. SIAM Rev. 2018, 60, 223–311. [Google Scholar] [CrossRef]
  41. Wei, G.C.; Tanner, M.A. A Monte Carlo Implementation of the EM Algorithm for the Mixture of Experts Model. J. Am. Stat. Assoc. 1990, 85, 699–704. [Google Scholar] [CrossRef]
  42. Mohamed, S.; Lakshminarayanan, B. Learning in Implicit Generative Models. arXiv 2016, arXiv:1610.03483. [Google Scholar] [CrossRef]
  43. Kalisch, M.; Buhlmann, P. Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm. J. Mach. Learn. Res. 2007, 8, 613–636. Available online: https://dl.acm.org/doi/10.5555/1314498.1314520 (accessed on 24 October 2025).
  44. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; Available online: https://dl.acm.org/doi/abs/10.5555/3305381.3305404 (accessed on 24 October 2025).
  45. Tsamardinos, I.; Brown, L.E.; Aliferis, C.F. The max-min hill-climbing Bayesian network structure learning algorithm. Mach Learn 2006, 65, 31–78. [Google Scholar] [CrossRef]
  46. Zhang, C.; Chen, S.; Ke, H. Research on Model Selection-Based Weighted Averaged One-Dependence Estimators. Mathematics 2024, 12, 2306. [Google Scholar] [CrossRef]
  47. Wang, L.; Zhang, S.; Mammadov, M.; Li, K.; Zhang, X. Semi-supervised Weighting for Averaged One-Dependence Estimators. Appl. Intell. 2022, 52, 4057–4073. [Google Scholar] [CrossRef]
  48. Yu, L.; Gan, S.; Chen, Y.; Luo, D. A Novel Hybrid Approach: Instance Weighted Hidden Naive Bayes. Mathematics 2021, 9, 2982. [Google Scholar] [CrossRef]
  49. Jiang, L.; Zhang, L.; Li, C.; Wu, J. A Correlation-Based Feature Weighting Filter for Naive Bayes. IEEE Trans. Knowl. Data Eng. 2019, 31, 201–213. [Google Scholar] [CrossRef]
  50. Friedman, N.; Goldszmidt, M.; Wyner, A. Data Analysis with Bayesian Networks: A Bootstrap Approach. arXiv 2013, arXiv:1301.6695. [Google Scholar] [CrossRef]
Figure 1. Graph network using autoencoder.
Figure 2. Two-stage algorithm framework of FGBMLE.
Figure 3. Estimating the probability density using two-dimensional data.
Figure 4. Sampling performance comparison using GMM.
Figure 5. Average computation time for the different sample sizes.
Figure 6. Density estimation results of the proposed TAN-FGBMLE method under a 2D trimodal distribution.
Figure 7. Dependency structure diagram of the AirQuality dataset based on Prim’s algorithm.
Table 1. Simulation scenarios of the FGBMLE method.

| Distribution | Gaussian Mixture Model (GMM) | Gamma Mixture Model (GaMM) |
|---|---|---|
| $d$-dimension | $\pi(\theta) = 0.5\,\mathcal{N}(\mu_1, \Sigma_1) + 0.5\,\mathcal{N}(\mu_2, \Sigma_2)$ | $\pi(\theta) = \prod_{j=1}^{d} \mathrm{Beta}(10, 5)$ |
| | $x \mid \theta \sim \mathcal{N}(\theta, I_d)$ | $x \mid \theta \sim \prod_{j=1}^{d} \mathrm{Gamma}(10, \theta_j)$ |
| | $\mu_1 = (3, 3, \ldots, 3) \in \mathbb{R}^d$, $\mu_2 = (-3, -3, \ldots, -3) \in \mathbb{R}^d$ | |
| | $\Sigma_1 = 2 I_d$, $\Sigma_2 = I_d$ | |
Table 2. Performance comparison of different methods under two-dimensional simulation scenarios.

| Model | Method | W1 | ISE | MSE | KL |
|---|---|---|---|---|---|
| GMM | FGBMLE | 0.335 | 0.009 | 0.006 | 0.045 |
| GMM | Bootstrap | 0.310 | 0.010 | 0.008 | 0.067 |
| GMM | KDE | 0.482 | 0.026 | 0.014 | 0.112 |
| GaMM | FGBMLE | 0.035 | 0.270 | 0.011 | 0.083 |
| GaMM | Bootstrap | 0.038 | 0.510 | 0.014 | 0.094 |
| GaMM | KDE | 0.072 | 0.693 | 0.023 | 0.141 |
Table 3. Density estimation results of different methods under a trimodal distribution.

| Method | W1 | ISE | MSE | KL |
|---|---|---|---|---|
| KDE | 0.154 | 0.0314 | 0.0125 | 0.0847 |
| GMM ($K = 2$) | 0.102 | 0.0246 | 0.0097 | 0.0632 |
| GMM ($K = 3$) | 0.068 | 0.0179 | 0.0064 | 0.0415 |
| TAN-FGBMLE | 0.041 | 0.0103 | 0.0048 | 0.0286 |
Table 4. Average SHD values of different methods under two sample sizes.

| Method | SHD (n = 1000) | SHD (n = 2000) |
|---|---|---|
| TAN-KDE | 0.4 ± 0.5 | 0.2 ± 0.4 |
| TAN-GMM | 0.5 ± 0.4 | 0.3 ± 0.6 |
| TAN-FGBMLE | 0.3 ± 0.4 | 0.2 ± 0.2 |
Table 5. Performance comparison of TAN-FGBMLE and competing models on the simulated dataset ($n = 2000$). Statistical significance is based on the Wilcoxon signed-rank test at $\alpha = 0.05$.

| Model | Accuracy (%) | Log-Likelihood |
|---|---|---|
| NB | 83.5 ± 1.2 | −2.37 ± 0.08 |
| TAN | 86.8 ± 0.9 | −2.13 ± 0.07 |
| AODE | 87.4 ± 0.8 | −2.09 ± 0.05 |
| WAODE | 87.9 ± 0.8 | −2.07 ± 0.05 |
| HNB | 88.1 ± 0.7 | −2.02 ± 0.05 |
| CFWNB | 88.5 ± 0.6 | −1.98 ± 0.04 |
| KDB-2 | 88.6 ± 0.6 | −1.97 ± 0.04 |
| Logistic Regression | 88.9 ± 0.6 | −1.95 ± 0.04 |
| Random Forest | 90.2 ± 0.5 | −1.89 ± 0.03 |
| RBF-SVM | 89.8 ± 0.5 | −1.90 ± 0.04 |
| TAN-FGBMLE | 91.1 ± 0.4 | −1.82 ± 0.03 |
Table 6. Description of the datasets used in experiments.

| No. | Dataset | Instances | Attributes | Classes |
|---|---|---|---|---|
| 1 | Abalone | 4177 | 8 | 3 |
| 2 | Breast Cancer | 569 | 30 | 2 |
| 3 | Car Evaluation | 1728 | 6 | 4 |
| 4 | Credit Approval | 690 | 15 | 2 |
| 5 | Dermatology | 366 | 34 | 6 |
| 6 | E. coli | 336 | 7 | 8 |
| 7 | Glass | 214 | 9 | 6 |
| 8 | Haberman | 306 | 3 | 2 |
| 9 | Heart Disease | 303 | 13 | 2 |
| 10 | ILPD | 583 | 9 | 2 |
| 11 | Ionosphere | 351 | 34 | 2 |
| 12 | Iris | 150 | 4 | 3 |
| 13 | Landsat Satellite | 2000 | 36 | 6 |
| 14 | Parkinsons | 195 | 22 | 2 |
| 15 | Pima Indians Diabetes | 768 | 8 | 2 |
| 16 | Student Performance | 649 | 33 | 2 |
| 17 | Vehicle | 846 | 18 | 4 |
| 18 | Wine | 178 | 13 | 3 |
| 19 | Wine Quality | 1599 | 11 | 10 |
| 20 | Yeast | 1484 | 8 | 10 |
Table 7. Classification accuracy comparison of the proposed method against existing classifiers.

| Dataset Name | TAN-KDE | NBC | FBC | KNN | C4.5 | NN | SVM | TAN-FGBMLE |
|---|---|---|---|---|---|---|---|---|
| Abalone | 0.512 ± 0.067 | 0.498 ± 0.058 | 0.505 ± 0.061 | 0.528 ± 0.064 | 0.490 ± 0.055 | 0.545 ± 0.060 | 0.551 ± 0.066 | 0.612 ± 0.059 |
| Breast Cancer | 0.861 ± 0.042 | 0.845 ± 0.046 | 0.850 ± 0.039 | 0.872 ± 0.040 | 0.858 ± 0.044 | 0.875 ± 0.036 | 0.868 ± 0.042 | 0.889 ± 0.038 |
| Car Evaluation | 0.902 ± 0.035 | 0.890 ± 0.033 | 0.896 ± 0.034 | 0.918 ± 0.037 | 0.910 ± 0.038 | 0.922 ± 0.032 | 0.916 ± 0.036 | 0.934 ± 0.031 |
| Credit Approval | 0.823 ± 0.047 | 0.812 ± 0.042 | 0.818 ± 0.043 | 0.834 ± 0.041 | 0.826 ± 0.048 | 0.842 ± 0.039 | 0.838 ± 0.044 | 0.856 ± 0.037 |
| Dermatology | 0.931 ± 0.028 | 0.920 ± 0.030 | 0.925 ± 0.027 | 0.936 ± 0.031 | 0.929 ± 0.029 | 0.942 ± 0.025 | 0.938 ± 0.028 | 0.951 ± 0.023 |
| E. coli | 0.825 ± 0.055 | 0.810 ± 0.052 | 0.814 ± 0.050 | 0.838 ± 0.048 | 0.822 ± 0.053 | 0.846 ± 0.047 | 0.840 ± 0.051 | 0.862 ± 0.046 |
| Glass | 0.673 ± 0.071 | 0.652 ± 0.074 | 0.660 ± 0.070 | 0.684 ± 0.072 | 0.676 ± 0.068 | 0.691 ± 0.066 | 0.687 ± 0.071 | 0.712 ± 0.065 |
| Haberman | 0.738 ± 0.058 | 0.720 ± 0.061 | 0.728 ± 0.057 | 0.741 ± 0.060 | 0.732 ± 0.062 | 0.752 ± 0.054 | 0.746 ± 0.059 | 0.764 ± 0.053 |
| Heart Disease | 0.836 ± 0.046 | 0.820 ± 0.048 | 0.828 ± 0.044 | 0.844 ± 0.047 | 0.832 ± 0.045 | 0.850 ± 0.041 | 0.846 ± 0.046 | 0.868 ± 0.040 |
| ILPD | 0.752 ± 0.062 | 0.740 ± 0.059 | 0.746 ± 0.061 | 0.758 ± 0.057 | 0.749 ± 0.060 | 0.763 ± 0.055 | 0.760 ± 0.058 | 0.778 ± 0.053 |
| Ionosphere | 0.887 ± 0.036 | 0.872 ± 0.039 | 0.880 ± 0.035 | 0.894 ± 0.038 | 0.885 ± 0.037 | 0.901 ± 0.033 | 0.896 ± 0.036 | 0.912 ± 0.032 |
| Iris | 0.955 ± 0.028 | 0.940 ± 0.032 | 0.948 ± 0.029 | 0.958 ± 0.027 | 0.950 ± 0.030 | 0.962 ± 0.026 | 0.959 ± 0.028 | 0.970 ± 0.025 |
| Landsat Satellite | 0.704 ± 0.065 | 0.688 ± 0.068 | 0.694 ± 0.064 | 0.716 ± 0.066 | 0.707 ± 0.067 | 0.724 ± 0.061 | 0.719 ± 0.065 | 0.738 ± 0.060 |
| Parkinsons | 0.823 ± 0.051 | 0.810 ± 0.054 | 0.816 ± 0.050 | 0.829 ± 0.052 | 0.820 ± 0.053 | 0.834 ± 0.049 | 0.828 ± 0.052 | 0.846 ± 0.047 |
| Pima Indians Diabetes | 0.775 ± 0.059 | 0.762 ± 0.061 | 0.770 ± 0.058 | 0.782 ± 0.056 | 0.774 ± 0.060 | 0.788 ± 0.055 | 0.784 ± 0.058 | 0.802 ± 0.054 |
| Student Performance | 0.741 ± 0.064 | 0.728 ± 0.067 | 0.734 ± 0.062 | 0.748 ± 0.066 | 0.739 ± 0.065 | 0.755 ± 0.060 | 0.750 ± 0.063 | 0.768 ± 0.058 |
| Vehicle | 0.746 ± 0.057 | 0.730 ± 0.060 | 0.738 ± 0.056 | 0.751 ± 0.059 | 0.742 ± 0.058 | 0.756 ± 0.054 | 0.752 ± 0.057 | 0.770 ± 0.052 |
| Wine | 0.944 ± 0.030 | 0.930 ± 0.033 | 0.938 ± 0.029 | 0.948 ± 0.031 | 0.940 ± 0.032 | 0.952 ± 0.027 | 0.949 ± 0.030 | 0.960 ± 0.026 |
| Wine Quality | 0.706 ± 0.062 | 0.691 ± 0.065 | 0.698 ± 0.061 | 0.711 ± 0.064 | 0.703 ± 0.063 | 0.718 ± 0.059 | 0.714 ± 0.062 | 0.732 ± 0.058 |
| Yeast | 0.602 ± 0.071 | 0.586 ± 0.074 | 0.592 ± 0.069 | 0.608 ± 0.072 | 0.598 ± 0.070 | 0.614 ± 0.066 | 0.610 ± 0.071 | 0.628 ± 0.065 |
| Average | 0.782 ± 0.054 | 0.768 ± 0.057 | 0.774 ± 0.053 | 0.789 ± 0.056 | 0.780 ± 0.055 | 0.795 ± 0.051 | 0.791 ± 0.054 | 0.812 ± 0.049 |
Table 8. Comparison of BEC of critical edges.

| Edge (Undirected) | TAN-KDE | TAN-FGBMLE |
|---|---|---|
| PM2.5–PM10 | 0.71 | 0.92 |
| temperature–NO2 | 0.64 | 0.88 |
| Proximity to industrial areas–SO2 | 0.62 | 0.85 |
| Proximity to industrial areas–CO | 0.58 | 0.81 |
| humidity–PM2.5 | 0.49 | 0.74 |
| CO–population density | 0.56 | 0.79 |