Universal Target Learning: An Efficient and Effective Technique for Semi-Naive Bayesian Learning

To mitigate the classification bias caused by overfitting, semi-naive Bayesian techniques seek to mine the implicit dependency relationships in unlabeled testing instances. By redefining some criteria from information theory, Target Learning (TL) builds, for each unlabeled testing instance P, a Bayesian Network Classifier BNC_P that is independent of and complementary to the classifier BNC_T learned from training data T. In this paper, we extend TL to Universal Target Learning (UTL), which identifies redundant correlations between attribute values and maximizes the bits encoded in the Bayesian network in terms of log likelihood. We take the k-dependence Bayesian classifier as an example to investigate the effect of UTL on BNC_P and BNC_T. Extensive experimental results on 40 UCI datasets show that UTL helps BNCs improve their generalization performance.


Introduction
Supervised learning is a machine learning paradigm that has been successfully applied in many classification tasks [1,2], with widespread deployment in applications including medical diagnosis [3][4][5], email filtering [6,7], and recommender systems [8][9][10]. The mission of supervised classification is to learn a classifier, such as a neural network or a decision tree, from a labeled training set T and then use it to assign a class label c to some testing instance x = {x_1, · · · , x_n}, where x_i and c respectively denote the value of attribute X_i and of class variable C. Bayesian Network Classifiers (BNCs) [11] are tools for representing probabilistic dependency relationships graphically and for inference under uncertainty. They supply a framework to compute the joint probability, which can be factorized into the individual conditional probabilities of attributes given their parents, that is:

P_B(x, c) = P_B(c|π_c) ∏_{i=1}^{n} P_B(x_i|π_i),  (1)

where π_i and π_c respectively denote the parents of attribute X_i and of class variable C. Learning unrestricted BNCs is often time-consuming and quickly becomes intractable as the number of attributes in a domain grows. Moreover, inference in such unrestricted models has been shown to be NP-hard [12]. The success of the zero-dependence Naive Bayes (NB) [13] has led to learning restricted BNCs, or BNC_T, from labeled training data T, e.g., the one-dependence Tree-Augmented Bayesian classifier (TAN) [14] and the k-Dependence Bayesian classifier (KDB) [12]. Among them, KDB generalizes from one-dependence to an arbitrary k-dependence network structure and has received great attention from researchers in different domains. These BNCs attempt to extract the significant dependencies implicated in labeled training data, whereas overfitting may result in classification bias.
For example, patients with similar symptoms may have quite different diseases; VM (viral myocarditis) [15], for instance, is often misdiagnosed as influenza due to its low incidence rate.
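The factorized joint probability introduced above can be made concrete with a small sketch. The following Python example, with hypothetical CPT values and a hypothetical two-attribute network, evaluates P_B(x, c) as the product of per-node conditional probabilities given each node's parents.

```python
# A minimal sketch of the factorized joint probability of a BNC.
# The network and CPT values below are hypothetical, not from the paper.

def joint_probability(c, x, prior, cpts, parents):
    """P_B(x, c) = P(c) * prod_i P(x_i | attribute parents of X_i, c)."""
    p = prior[c]
    for i, xi in enumerate(x):
        pa = tuple(x[j] for j in parents[i])  # values of X_i's attribute parents
        p *= cpts[i][(xi, pa, c)]
    return p

# Toy one-dependence network over two binary attributes:
# X_0 depends on C only; X_1 depends on C and X_0.
prior = {0: 0.6, 1: 0.4}
parents = {0: (), 1: (0,)}
cpts = {
    0: {(0, (), 0): 0.7, (1, (), 0): 0.3,
        (0, (), 1): 0.2, (1, (), 1): 0.8},
    1: {(0, (0,), 0): 0.9, (1, (0,), 0): 0.1,
        (0, (1,), 0): 0.5, (1, (1,), 0): 0.5,
        (0, (0,), 1): 0.3, (1, (0,), 1): 0.7,
        (0, (1,), 1): 0.4, (1, (1,), 1): 0.6},
}

print(joint_probability(0, (0, 0), prior, cpts, parents))  # 0.6 * 0.7 * 0.9
```

Because the CPTs are proper conditional distributions, summing the function over all (c, x) combinations returns 1, as a well-formed joint distribution must.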
Semi-supervised learning methods generally apply unlabeled data to either reprioritize or modify hypotheses learned from labeled data alone [16][17][18]. These methods efficiently combine the explicit classification information in the labeled data with the information concealed in the unlabeled data [19]. The general assumption of this class of algorithms is that data points in high-density regions are likely to belong to the same class, while the decision boundary lies in low-density regions [20]. However, the information carried by a single unlabeled instance may be overwhelmed by the mass of training data, and a wrongly-assigned class label may result in "noise propagation". To address this problem, we presented the Target Learning (TL) framework [21], in which an independent Bayesian model BNC_P learned from testing instance P works jointly with BNC_T and effectively improves BNC_T's generalization performance with minimal additional computation. In this paper, we present an extension of TL, Universal Target Learning (UTL), which dynamically adjusts the dependency relationships implicated in a single testing instance at classification time to explore the most appropriate network topology. Conditional entropy is introduced as the loss function to measure the bits encoded in the BNC in terms of log likelihood.
The remainder of the paper is organized as follows: Section 2 reviews state-of-the-art BNCs. Section 3 presents the theoretical justification of the UTL framework and describes the learning procedure of KDB within UTL. Extensive experimental studies on 40 datasets are reported in Section 4. Finally, the last section presents conclusions and future work.

Preliminaries
A Bayesian Network (BN) can be formalized as a pair <G, Θ>. G represents the structure, a directed acyclic graph containing nodes and arcs. Nodes symbolize the class or attribute variables, and arcs correspond to the dependency relationships between child nodes and parent nodes. Θ represents the parameter set, which includes the conditional probability distribution of each node in G, namely P_B(c|π_c) or P_B(x_i|π_i), where π_i and π_c respectively denote the parents of attribute X_i and of class variable C in structure G. It has been proven that learning an optimal BN is NP-hard [22]. To deal with this complexity, the learning of restricted network structures has been investigated [23]. The joint probability distribution is then defined as:

P_B(x, c) = P_B(c|π_c) ∏_{i=1}^{n} P_B(x_i|π_i).  (2)

Taking advantage of the underlying network topology of B and Equation (2), a BNC computes P_B(c|x) by:

P_B(c|x) = P_B(x, c) / Σ_{c'∈C} P_B(x, c').  (3)

Among the numerous restricted BNCs, NB is an extremely simple and remarkably effective approach with a zero-dependence structure (see Figure 1a) for classification [24,25]. It makes the simplifying assumption that, given the class label, the attributes are independent of each other [26,27], i.e.,

P_B(x, c) = P(c) ∏_{i=1}^{n} P(x_i|c).  (4)

However, in the real world, NB's attribute independence assumption is often violated, which can hurt its classification performance. There has been substantial prior work exploring methods to improve NB's classification performance. Information theory, proposed by Shannon, established a mathematical basis for the rapid development of BNs. Mutual Information (MI) I(X_i; C) is the most commonly-used criterion to rank attributes for attribute sorting or filtering [28,29], and Conditional Mutual Information (CMI) I(X_i; X_j|C) is used to measure the conditional dependence between an attribute pair X_i and X_j for identifying possible dependencies.
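NB's independence assumption can be sketched directly in code. The following example, with hypothetical class priors and per-attribute conditionals, scores each class as the prior times the product of P(x_i|c) and then normalizes, exactly as the conditional-independence factorization prescribes.

```python
def nb_posterior(x, prior, cond):
    """Posterior under NB's independence assumption:
    P(c|x) is proportional to P(c) * prod_i P(x_i|c), then normalized."""
    scores = {}
    for c, pc in prior.items():
        p = pc
        for i, xi in enumerate(x):
            p *= cond[(i, xi, c)]  # P(X_i = x_i | c)
        scores[c] = p
    z = sum(scores.values())  # normalize over classes
    return {c: s / z for c, s in scores.items()}

# Hypothetical one-attribute example with two classes.
prior = {0: 0.5, 1: 0.5}
cond = {(0, 0, 0): 0.8, (0, 1, 0): 0.2,
        (0, 0, 1): 0.4, (0, 1, 1): 0.6}
print(nb_posterior((0,), prior, cond))  # class 0 is twice as likely as class 1
```

In practice, the prior and conditional tables are estimated from frequency counts over the training set, usually with smoothing to avoid zero probabilities.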
I(X_i; C) and I(X_i; X_j|C) are defined as follows:

I(X_i; C) = Σ_{x_i, c} P(x_i, c) log [P(x_i, c) / (P(x_i)P(c))],
I(X_i; X_j|C) = Σ_{x_i, x_j, c} P(x_i, x_j, c) log [P(x_i, x_j|c) / (P(x_i|c)P(x_j|c))].  (5)

The independence assumption may not hold for all attribute pairs, but may hold for some. Two categories of learning strategies based on NB have proven effective. The first category aims at identifying independency relationships so as to better satisfy NB's independence assumption. Langley and Sage [27] proposed the wrapper-based Selective Bayes (SB) classifier, which carries out a greedy search through the space of attributes to exclude redundant ones from the prediction process. Some methods relieve violations of the attribute independence assumption by deleting strongly related attributes (such as Backwards Sequential Elimination (BSE) [30] and Forward Sequential Selection (FSS) [31]). Some attribute-weighting methods also achieve competitive performance. The earliest weighted naive Bayes method was proposed by Hilden and Bjerregaard [32], which used a single weight; Ferreira [33] later improved this by weighting each attribute value rather than each attribute. Hall [34] assigned to each attribute a weight in inverse ratio to the minimum depth at which it is first tested in an unpruned decision tree. The other group introduced various structures into NB. Kwoh and Gillies [35] proposed a method that introduces a hidden variable into NB's model as a child of the class label and a parent of all predictor attributes. Kohavi [36] described a hybrid approach that attempts to utilize the advantages of both decision trees and naive Bayes. Yang [37] proposed to fit NB's conditional independence assumption by discretization.
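Both criteria in Equation (5) can be estimated from frequency counts. The sketch below computes empirical MI and CMI from paired samples; two degenerate cases (a perfectly predictive attribute and an independent one) illustrate the extremes.

```python
import math
from collections import Counter

def mutual_information(xs, cs):
    """Empirical I(X; C) from paired samples, per Equation (5)."""
    n = len(xs)
    pxc, px, pc = Counter(zip(xs, cs)), Counter(xs), Counter(cs)
    return sum((k / n) * math.log((k / n) / ((px[x] / n) * (pc[c] / n)))
               for (x, c), k in pxc.items())

def conditional_mutual_information(xs, ys, cs):
    """Empirical I(X; Y | C): sum over observed (x, y, c) of
    P(x, y, c) * log [P(c) P(x, y, c) / (P(x, c) P(y, c))]."""
    n = len(xs)
    pxyc = Counter(zip(xs, ys, cs))
    pxc, pyc, pc = Counter(zip(xs, cs)), Counter(zip(ys, cs)), Counter(cs)
    return sum((k / n) * math.log((pc[c] / n) * (k / n)
               / ((pxc[(x, c)] / n) * (pyc[(y, c)] / n)))
               for (x, y, c), k in pxyc.items())

# X perfectly predicts C -> I(X; C) = log 2 nats; X independent of C -> 0.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # log 2 ~= 0.6931
print(mutual_information([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0
```

Zero-count cells are simply absent from the counters, so the sums skip them, matching the convention 0·log(·) = 0.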
The second category aims at relaxing the independence assumption by introducing significant dependency relationships. TAN relaxes the independence assumption and extends NB from a zero-dependence structure to a one-dependence maximum weighted spanning tree [14] (see Figure 1b). Based on this, Keogh and Pazzani [38] proposed to construct TAN by choosing augmenting arcs that maximize the improvement in classification accuracy. ATAN [39] predicts by averaging the class-membership probabilities estimated by each built TAN. Weighted Averaged Tree-Augmented Naive Bayes (WATAN) [39] sets the aggregation weight by the mutual information between the class variable and the root attribute. To represent more dependency relationships, an ensemble of one-dependence BNCs or a high-dependence BNC is a feasible solution. RTAN [40] generates TANs that describe the dependency relationships within certain attribute subspaces; BaggingMultiTAN [40] trains these RTANs as component classifiers and makes the final prediction by majority vote. Averaged One-Dependence Estimators (AODE) [41] assumes that every attribute depends on the class and a shared super-parent attribute and only uses one-dependence estimators. To handle continuous variables, HAODE [42] considers a discrete version of the super-parent attribute in every model, so that the remaining relationships can be estimated by univariate Gaussian distributions. As shown in Figure 1c (KDB with four attributes when k = 2), KDB can represent an arbitrary degree of dependency relationships while achieving computational efficiency similar to NB [21]. Bouckaert proposed to average all possible network structures for a fixed value of k (including lower orders) [43]. Rubio and Gámez presented a variant of KDB that uses a hill-climbing algorithm to build a KDB incrementally [44].
To avoid the high variance and classification bias caused by overfitting, how to mine the information existing in testing instance P is an interesting issue that has attracted increasing attention recently. Some algorithms try to combine P with training data T to help refine the network structure of classifier BNC_T, which is learned from T only. The recursive Bayesian classifier [31] captures each predicted label provided by NB and, if misclassified, induces a new NB from the cases that have the predicted label. A random oracle classifier [45] splits the labeled training data into two subsets using the random oracle and trains a sub-classifier on each. The testing instance then uses the random oracle to select one sub-classifier for classification. Other algorithms, though few, seek to explore the dependency relationships implicated in P only. Subsumption Resolution (SR) [46] identifies pairs of attribute values in P and, if one is a generalization of the other, deletes the generalization. Target learning [21] extends P to a pseudo training set and then builds an independent BNC_P for it, which is complementary to BNC_T in nature.

Target Learning (TL)
Relaxing the independence assumption by adding augmented edges to NB is a feasible approach to refining NB and increasing the confidence of the joint probability estimate P(x, c). However, from Equation (5), we can see that the (conditional) probability distributions needed to compute MI or CMI are learned from the labeled training dataset T only. Thus, as the structure complexity increases, the corresponding BNC may overfit the training data and underfit the unlabeled testing instance, which may lead to classification bias and high variance. To address this issue, we proposed the TL framework, which builds a specific BNC_P for any testing instance P at classification time to explore conditional dependencies that exist in P only. BNC_P applies the same learning strategy as BNC_T learned from T. Thus, BNC_P and BNC_T are complementary to each other and can work jointly.
We take KDB as an example to illustrate the basic idea of TL. Given training dataset T, the learning procedure of KDB_T is shown in Algorithm 1.
From the viewpoint of information theory, MI or I(X_i; C) measures the mutual dependence between C and X_i. From Equation (5), we can see that I(X_i; C) is the expected value of the pointwise mutual information over all possible values of C and X_i. Thus, although the dependency relationships between attributes may vary from instance to instance [21], the structure of traditional KDB cannot automatically fit diverse instances. To address this issue, for an unlabeled testing instance {x_1, · · · , x_n}, Local Mutual Information (LMI) and Conditional Local Mutual Information (CLMI) are introduced as follows to measure the dependency relationships between attribute values [21]:

Î(X_i; C) = Σ_{c} P(x_i, c) log [P(x_i, c) / (P(x_i)P(c))],
Î(X_i; X_j|C) = Σ_{c} P(x_i, x_j, c) log [P(x_i, x_j|c) / (P(x_i|c)P(x_j|c))],  (6)

where x_i and x_j are the attribute values appearing in the testing instance. Given training set T, KDB_T sorts attributes by comparing I(X_i; C) and chooses conditional dependency relationships by comparing I(X_i; X_j|C). In contrast, given testing instance P = {x_1, x_2, · · · , x_n}, KDB_P sorts attributes by comparing Î(X_i; C) and chooses conditional dependency relationships by comparing Î(X_i; X_j|C). The learning procedure of KDB_P is shown in Algorithm 2 as follows.
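The key difference between MI and LMI is the scope of the summation: in Equation (6) the attribute value is fixed to the one observed in the testing instance, and only the class is summed over. A small sketch of LMI estimated from training counts (CLMI is analogous, fixing a pair of attribute values):

```python
import math
from collections import Counter

def local_mi(xi, xs, cs):
    """Î(X_i; C) of Equation (6): the attribute value x_i is fixed to the
    value observed in the testing instance; only the class is summed over."""
    n = len(xs)
    pc = Counter(cs)
    pxc = Counter(zip(xs, cs))
    px = sum(1 for x in xs if x == xi)
    total = 0.0
    for c in pc:
        joint = pxc[(xi, c)]
        if joint:  # skip zero-count cells, 0 * log(.) = 0
            total += (joint / n) * math.log((joint / n)
                                            / ((px / n) * (pc[c] / n)))
    return total

# Hypothetical training sample: attribute value 0 always co-occurs with class 0,
# so only the c = 0 term contributes.
xs, cs = [0, 0, 1, 1], [0, 0, 1, 1]
print(local_mi(0, xs, cs))  # 0.5 * log 2
```

Summing local_mi over all values of X_i recovers the ordinary I(X_i; C), which is why LMI can be read as the per-value contribution of the testing instance to the expected MI.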

Algorithm 2: The learning procedure of KDB_P.
1 Calculate Î(X_i; C) by Equation (6) for each attribute value x_i ∈ x, where C is the class.
2 Calculate Î(X_i; X_j|C) by Equation (6) for each pair of attribute values, where i ≠ j and C is the class.
3 Let the list of used attribute values, S, be empty.
4 Let the Bayesian network to be constructed, KDB_P, start from a single class node, C.
5 while S does not contain all attribute values do
6   Select the attribute value x_max not in S that has the highest value Î(x_max; C).
7   Add a node to KDB_P standing for x_max.
8   Add an arc from C to x_max.
9   Add min(|S|, k) arcs from distinct attribute values x_j ∈ S with the highest Î(x_max; x_j|C).
10  Add x_max to S.
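The greedy loop of Algorithm 2 can be sketched compactly once the LMI and CLMI scores are available. The function below takes precomputed scores (hypothetical values in the example) and returns the attribute-value order and the parent assignment:

```python
def build_kdb_p_structure(lmi, clmi, k):
    """Sketch of Algorithm 2: greedy structure learning for KDB_P.
    lmi[i] holds Î(x_i; C); clmi[(i, j)] holds Î(x_i; x_j | C)."""
    order = sorted(range(len(lmi)), key=lambda i: lmi[i], reverse=True)
    parents, used = {}, []
    for i in order:
        # each new node takes the class plus up to k already-added values
        # with the highest conditional local mutual information
        ranked = sorted(used, key=lambda j: clmi[(i, j)], reverse=True)
        parents[i] = ranked[:min(len(used), k)]
        used.append(i)
    return order, parents

# Hypothetical scores for three attribute values, k = 1.
lmi = [0.1, 0.5, 0.3]
clmi = {(2, 1): 0.2, (0, 1): 0.05, (0, 2): 0.4}
print(build_kdb_p_structure(lmi, clmi, k=1))
```

With these scores, the order is [1, 2, 0]: value 1 gets the class as its only parent, value 2 attaches to value 1, and value 0 attaches to value 2 (its highest-CLMI predecessor).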

Universal Target Learning
Generally speaking, the aim of BNC learning is to find a network structure that facilitates the shortest description of the original data. The length of this description accounts for both the description of the BNC itself and that of the data given the BNC [38]. Such a BNC represents a probability distribution P_B(x) over the instances x appearing in the training data T.
Given training data T with N instances, T = {d_1, · · · , d_N}, the log likelihood of classifier B given T is defined as:

LL(B|T) = Σ_{i=1}^{N} log P_B(d_i),  (7)

which represents how many bits are required to describe T on account of the probability distribution P_B. The log likelihood also has a statistical interpretation: the higher the log likelihood, the closer classifier B is to modeling the probability distribution in T. The label of testing instance U = {x_1, · · · , x_n} may take any one of the |C| possible values of class variable C. Thus, TL assumes that U is equivalent to a pseudo training set P that consists of |C| instances as follows:

P = {(x_1, · · · , x_n, c_1), (x_1, · · · , x_n, c_2), · · · , (x_1, · · · , x_n, c_{|C|})}.  (8)

Similar to the definition of LL(B|T), the log likelihood of classifier B given P is defined as:

LL(B|P) = Σ_{c∈C} log P_B(x, c).  (9)

By applying the different CMI criteria shown in Equations (5) and (6), BNC_T and BNC_P provide two network structures to describe the possible dependency relationships implicated in testing instances. These two CMI criteria cannot directly measure the bits needed to describe P based on P_B, whereas LL(B|P) can. From Equation (2),

LL(B|P) = Σ_{c∈C} log P(c) + Σ_{j=1}^{n} Ĥ(X_j|C, Π_j), where Ĥ(X_j|C, Π_j) = Σ_{c∈C} log P(x_j|c, Π_j).  (10)

If strong correlations exist between the values of parent attributes, we may choose to replace these correlations with more meaningful dependency relationships. For example, let Gender and Pregnant be two attributes. If Pregnant = "yes", it follows that Gender = "female". Thus, Gender = "female" is a generalization of Pregnant = "yes" [46], and P(Gender = "female", Pregnant = "yes") = P(Pregnant = "yes"). Given some other attribute values x̄ = {x_1, · · · , x_m}, we also have P(Gender = "female", Pregnant = "yes", x̄) = P(Pregnant = "yes", x̄). Correspondingly,

P(x_{m+1}|Gender = "female", Pregnant = "yes", x̄) = P(x_{m+1}|Pregnant = "yes", x̄).

Obviously, for specific instances in which such correlations hold, the parent attribute Gender cannot provide any extra information about X_{m+1} and should be removed.
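The pseudo training set construction and the resulting log likelihood can be sketched in a few lines. Here `joint_prob` is an assumed black-box callable standing in for the classifier's joint probability estimate P_B(x, c):

```python
import math

def pseudo_training_set(x, classes):
    """TL's expansion of an unlabeled instance: one pseudo instance
    per candidate class label."""
    return [(x, c) for c in classes]

def log_likelihood(pseudo, joint_prob):
    """LL(B|P): sum of log P_B(x, c) over the pseudo instances.
    joint_prob(x, c) is assumed to return P_B(x, c)."""
    return sum(math.log(joint_prob(x, c)) for x, c in pseudo)

# Hypothetical two-class example with a constant joint probability.
pseudo = pseudo_training_set((0, 1, 1), classes=[0, 1])
print(len(pseudo))                                # 2
print(log_likelihood(pseudo, lambda x, c: 0.25))  # 2 * log 0.25
```

Structures that assign higher joint probability to the observed attribute values, whatever the class, yield a higher LL(B|P), which is the quantity UTL maximizes.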
To maximize LL(B|P), X_{m+1} may select another attribute, e.g., X_p, as its parent in place of Gender; thus, the dependency relationship between X_p and X_{m+1} that was previously neglected can be added to the network structure. Many algorithms only try to improve performance by removing redundant dependency relationships from the network structure, without searching for more meaningful ones. Because of the constraint of computational complexity, which is closely related to structure complexity, each node in a BNC can only take a limited number of attributes as parents. For example, KDB demands that at most k parents be chosen for each node. The proposed algorithm also follows this rule.
The second term in Equation (10), i.e., Ĥ(X_j|C, Π_j), is the log likelihood contribution of the conditional dependency relationships in B given P. To find the proper dependency relationships implicated in each testing instance and maximize the estimate of LL(B|P), we need to maximize Ĥ(X_j|C, Π_j) for each attribute X_j in turn. We argue that LL(B|P) provides a more intuitive and scalable measure for a proper evaluation. Based on the discussion presented above, in this paper, we propose to refine the network structures of BNC_P and BNC_T based on Universal Target Learning (UTL). In the following discussion, we take KDB as an example and apply UTL to KDB_T and KDB_P in similar ways, yielding UKDB_T and UKDB_P, respectively. For testing instance P, UKDB_T or UKDB_P will recursively check all possible combinations of candidate parent attributes and attempt to find the Π_j that corresponds to the maximum of Ĥ(X_j|C, Π_j), that is:

Π_j = arg max_{Π_j} Ĥ(X_j|C, Π_j),  (11)

where Π_j may contain fewer than min{j − 1, k} attributes. By maximizing Ĥ(X_j|C, Π_j) for each attribute X_j, UKDB_T and UKDB_P are able to seek more proper dependency relationships implicated in the specific testing instance P, which may help to maximize the estimate of LL(B|P). For example, suppose that the attribute order of KDB_T is {X_0, X_1, X_2, X_3} and k = 2; then, for attribute X_2, its candidate parents are {X_0, X_1}. Given testing instance P, we will compare and find Π_2 where Ĥ(X_2|C, Π_2) = max{Ĥ(X_2|C, X_0), Ĥ(X_2|C, X_1), Ĥ(X_2|C, X_0, X_1)} and Π_2 ∈ {{X_0}, {X_1}, {X_0, X_1}}. Thus, UKDB_T dynamically adjusts dependency relationships for different testing instances at classification time. Similarly, UKDB_P applies the same learning strategy to refine the network structure of KDB_P.
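The exhaustive search over candidate parent sets can be sketched as follows. The score function is treated as an assumed black box here, since the excerpt defines Ĥ only through its role in LL(B|P); the candidate names and score values in the example are hypothetical.

```python
from itertools import combinations

def best_parent_set(candidates, k, score):
    """Among all non-empty subsets of the candidate parents with at most k
    members, return the one maximizing the caller-supplied score
    Ĥ(X_j | C, Π_j), together with that score."""
    best, best_val = None, float("-inf")
    for r in range(1, min(k, len(candidates)) + 1):
        for pi in combinations(candidates, r):
            v = score(pi)
            if v > best_val:
                best, best_val = pi, v
    return best, best_val

# The X_2 example from the text: candidates {X_0, X_1}, k = 2,
# with hypothetical scores (log likelihood terms are negative).
scores = {("X0",): -1.2, ("X1",): -0.4, ("X0", "X1"): -0.7}
print(best_parent_set(["X0", "X1"], 2, scores.__getitem__))  # (('X1',), -0.4)
```

The number of candidate subsets is bounded by Σ_{r=1..k} C(k, k_candidates), which stays small because KDB caps the candidate pool at k parents per node; this is what keeps the per-instance search tractable at classification time.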
Given n attributes, there are n! possible attribute orders; among them, the orders determined by I(X_i; C) and Î(X_i; C), respectively, have been proven feasible and effective. Thus, for attribute X_i, its parents can be selected from two sets of candidates. The final classifier is an ensemble of UKDB_T and UKDB_P. UTL retains the characteristic of target learning, that is, UKDB_T and UKDB_P are complementary and can work jointly to make the final prediction. The learning procedures of UKDB_T and UKDB_P, shown in Algorithms 3 and 4 respectively, are almost the same, except for the pre-determined attribute orders.
In contrast to TL, UTL helps BNC_P and BNC_T encode the most probable dependency relationships implicated in a single testing instance. The linear combiner is appropriate for models that output real-valued numbers, so it is applicable to BNCs. For testing instance x, the ensemble probability estimate of UKDB_T and UKDB_P is:

P(c|x) = α · P_{UKDB_T}(c|x) + β · P_{UKDB_P}(c|x).  (12)

For different instances, the optimal weights α and β may differ greatly, and there is no effective way to address this issue. Thus, in practice, we simply use the uniformly-weighted rather than non-uniformly-weighted average of the probability estimates; that is, we set α = β = 0.5 in Equation (12).
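The linear combiner of Equation (12) with uniform weights reduces to a simple average over the two posteriors. A minimal sketch, with hypothetical posteriors from the two sub-models:

```python
def ensemble_posterior(p_t, p_p, alpha=0.5, beta=0.5):
    """Linear combiner of Equation (12) with the uniform weights used in
    the paper (alpha = beta = 0.5); the result is renormalized over classes."""
    mix = {c: alpha * p_t[c] + beta * p_p[c] for c in p_t}
    z = sum(mix.values())
    return {c: v / z for c, v in mix.items()}

# Hypothetical posteriors from UKDB_T and UKDB_P for a two-class instance:
# averaging 0.8/0.2 and 0.4/0.6 gives roughly 0.6/0.4 after renormalization.
p_t = {0: 0.8, 1: 0.2}
p_p = {0: 0.4, 1: 0.6}
print(ensemble_posterior(p_t, p_p))
```

Uniform weighting is the safe default when no per-instance estimate of sub-model reliability is available, which is exactly the situation the text describes.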
Algorithm 3: The learning procedure of UKDB_T.
1 Let S be a list of attributes in descending order of I(X_i; C), and suppose S = {X_1, · · · , X_n}.
2 Let the Bayesian network to be constructed, UKDB_T, start from the class node and n attributes.
3 for each attribute X_i in S do
4   Find the parent set Π of at most min{i − 1, k} attributes preceding X_i in S that maximizes Ĥ(X_i|C, Π).
5   Add an arc from C to X_i.
6   Add |Π| arcs from the |Π| distinct attributes in Π to X_i.
7 end
8 return UKDB_T

Algorithm 4: The learning procedure of UKDB_P.
1 Let S be a list of attributes in descending order of Î(X_i; C), and suppose S = {X_1, · · · , X_n}.
2 Let the Bayesian network to be constructed, UKDB_P, start from the class node and n attributes.
3 for each attribute X_i in S do
4   Find the parent set Π of at most min{i − 1, k} attributes preceding X_i in S that maximizes Ĥ(X_i|C, Π).
5   Add an arc from C to X_i.
6   Add |Π| arcs from the |Π| distinct attributes in Π to X_i.
7 end
8 return UKDB_P

Results and Discussion
All algorithms in the experimental study were implemented in C++ (GCC 5.4.0). For KDB and its variations, as k increases, the time complexity and the structure complexity grow exponentially. Larger values of k may improve classification accuracy compared to smaller values, but currently available hardware resources place practical limits on k. When k = 3, UKDB could not be tested on some large-scale datasets due to the available CPU resources. Thus, we only selected k = 1 and k = 2 for the following experimental study. We randomly selected 40 datasets from the UCI machine learning repository [47]. The datasets were divided into three categories: large datasets with more than 5000 instances, medium datasets with between 1000 and 5000 instances, and small datasets with fewer than 1000 instances. These datasets are described in detail in Table 1, including the number of instances, attributes, and classes, and are ordered in ascending order of dataset size. The number of attributes ranges widely from 4 to 56, which is convenient for evaluating the effectiveness of the UTL framework at mining dependency relationships between attributes. Meanwhile, we can examine classification performance across dataset sizes ranging from 24 instances to 5,749,132 instances. Missing values were replaced with distinct values. We used Minimum Description Length (MDL) discretization [48] to discretize the numeric attributes. To validate the effectiveness of UTL, the proposed UKDB is contrasted with three single-structure BNCs (NB, TAN, and KDB), as well as three ensemble BNCs (AODE, WATAN, and TAN_e), in terms of zero-one loss, RMSE, and F_1-score in Section 4.1.
Then, we introduce two criteria, goal difference and relative zero-one loss ratio, to measure the classification performance of UKDB when dealing with different quantities of training data and different numbers of attributes in Sections 4.2 and 4.3, respectively. In Section 4.4, we compare the time costs of training and classification. Finally, we conduct a global comparison in Section 4.5.

4.1. Comparison of Zero-One Loss, RMSE, and F_1-Score

Zero-One Loss
The experiments used 10 rounds of 10-fold cross-validation. We use Win/Draw/Loss (W/D/L) records to summarize the experimental results. To compare classification accuracy, Table A1 in Appendix A reports the average zero-one loss of each algorithm on the different datasets. The corresponding W/D/L records are summarized in Table 2. As shown in Table 2, among the single-structure classifiers, UK_1DB performed significantly better than NB and TAN. Most importantly, UK_1DB achieved a significant advantage over K_1DB in terms of zero-one loss, with 21 wins and only seven losses, providing convincing evidence for the validity of the proposed algorithm. For large datasets, the advantage was even stronger. Likewise, UK_2DB achieved a significant advantage over K_2DB, with a W/D/L of 28/8/4. That is, K_2DB achieved better zero-one loss than UK_2DB on only four datasets (contact-lenses, lung-cancer, sign, nursery); thus, UK_2DB seldom performed worse than KDB. In contrast, UK_2DB performed better than K_2DB on many datasets, such as car, poker-hand, primary-tumor, and waveform-5000. When compared with the ensemble algorithms, UK_1DB and UK_2DB still enjoyed an advantage over AODE, WATAN, and TAN_e. Moreover, the comparisons of UK_2DB with AODE and WATAN were almost uniformly favorable (24 wins and only three losses, and 24 wins and only two losses, respectively). Based on the discussion above, we argue that UTL is an effective approach to refining BNCs.

RMSE
The Root Mean Squared Error (RMSE) measures the deviation between the predicted and true values [49]. Table A2 in Appendix A reports the RMSE of each algorithm on the different datasets. The corresponding W/D/L records are summarized in Table 3. The scatter plot comparing UK_2DB and K_2DB in terms of RMSE is shown in Figure 2; the X-axis shows the RMSE of K_2DB, and the Y-axis shows that of UK_2DB. We can observe that many datasets lie below the diagonal line, such as labor-negotiations, lymphography, and poker-hand, which shows that UK_2DB has an advantage over K_2DB. At the same time, except for credit-a and nursery, the remaining datasets lie close to the diagonal line, which means that UK_2DB rarely performed worse than K_2DB. For many datasets, UTL substantially helped reduce the classification error of K_2DB, for example, from 0.4362 to 0.3571 on the lymphography dataset. As shown in Table 3, among the single-structure classifiers, UK_1DB performed significantly better than NB and TAN. Moreover, UK_1DB achieved a significant advantage over K_1DB with 10 wins and four losses, and UK_2DB achieved a similar advantage over K_2DB with 14 wins, which provides convincing evidence for the validity of the proposed framework. When compared with the ensemble group, UK_1DB and UK_2DB still had a clear advantage, with W/D/L records of 10/24/6 and 24/13/3, respectively, against AODE. UK_2DB also achieved relatively significant advantages over WATAN and TAN_e (14 wins and only two losses, and 15 wins and only three losses, respectively), and reduced RMSE more substantially. UKDB not only performed better than the single-structure classifiers, but also proved an effective ensemble model when compared with AODE in terms of RMSE.
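One common RMSE formulation for probabilistic classifiers (an assumption here, since the exact formula from [49] is not quoted in this excerpt) scores the squared gap between each estimated class-membership probability and its 0/1 truth indicator:

```python
import math

def rmse(prob_estimates, true_labels):
    """RMSE over class-membership probabilities: squared error between each
    estimated P(c|x) and the 0/1 indicator of the true class, averaged over
    all (instance, class) pairs. One common formulation; an assumption here."""
    se, n = 0.0, 0
    for probs, t in zip(prob_estimates, true_labels):
        for c, pc in probs.items():
            se += (pc - (1.0 if c == t else 0.0)) ** 2
            n += 1
    return math.sqrt(se / n)

# A perfectly confident correct prediction scores 0; a 50/50 guess scores 0.5.
print(rmse([{0: 1.0, 1: 0.0}], [0]))  # 0.0
print(rmse([{0: 0.5, 1: 0.5}], [0]))  # 0.5
```

Unlike zero-one loss, this measure rewards well-calibrated probabilities, which is why it separates classifiers that achieve similar accuracy.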

F 1 -Score
Generally speaking, zero-one loss can roughly measure the classification performance of a BNC, but it cannot evaluate whether the BNC works consistently across different parts of imbalanced data. In contrast, precision gives the ratio of correct classifications among all test instances predicted to belong to a class, and recall gives the ratio of correct classifications among all test instances actually belonging to that class [50]. Precision and recall are sometimes in tension; therefore, we employed the F_1-score, the harmonic mean of precision and recall, to measure the performance of our algorithm. To handle the multiclass classification problem, we computed the F_1-score from the confusion matrix. Suppose that the dataset to be classified has classes {C_1, C_2, · · · , C_m}, and let N_ij denote the entry of the confusion matrix in row i and column j. Each diagonal entry N_ii gives the number of instances whose true class is C_i and that are actually assigned to C_i (where 1 ≤ i ≤ m). Each off-diagonal entry N_ij gives the number of instances whose true class is C_i but that are assigned to C_j (where i ≠ j and 1 ≤ i, j ≤ m). Given the confusion matrix, the precision, recall, and F_1-score of class C_i are computed as follows:

Precision_i = N_ii / Σ_{j=1}^{m} N_ji,
Recall_i = N_ii / Σ_{j=1}^{m} N_ij,
F_1 = 2 · Precision · Recall / (Precision + Recall).

Table A3 in Appendix A reports the F_1-score of each algorithm on the different datasets, and Table 4 summarizes the corresponding W/D/L records. Several points in this table are worth discussing. Among the single-structure classifiers, UK_1DB performed significantly better than NB and TAN. When compared with the ensembles, UK_1DB and UK_2DB still had a slight advantage over AODE and achieved significant advantages over WATAN and TAN_e. Most importantly, UK_1DB performed better than K_1DB and UK_2DB better than K_2DB; although the advantage was not significant, this provides further evidence for the effectiveness of UTL.
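The per-class precision, recall, and F_1-score described above follow directly from the confusion matrix row and column sums. A short sketch with a hypothetical binary confusion matrix:

```python
def per_class_scores(cm):
    """Precision, recall, and F_1 for each class from a confusion matrix,
    where cm[i][j] counts instances of true class C_i assigned to C_j."""
    m = len(cm)
    scores = []
    for i in range(m):
        tp = cm[i][i]
        predicted_i = sum(cm[j][i] for j in range(m))  # column sum
        actual_i = sum(cm[i])                          # row sum
        p = tp / predicted_i if predicted_i else 0.0
        r = tp / actual_i if actual_i else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores.append((p, r, f1))
    return scores

# Hypothetical binary confusion matrix: 5 + 2 correct, 1 + 2 confused.
cm = [[5, 1],
      [2, 2]]
p, r, f1 = per_class_scores(cm)[0]
print(round(f1, 4))  # 10/13, i.e. 0.7692
```

Averaging the per-class F_1 values (macro-averaging) gives a single score that weights every class equally, which is what makes the F_1-score informative on imbalanced data.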

Goal Difference
To further compare the performance of UKDB with the other algorithms in terms of data size, the Goal Difference (GD) [51,52] is introduced. For two classifiers A and B, GD is computed as:

GD(A; B|T) = |win| − |loss|,  (17)

where T represents the collection of datasets for comparison, and |win| and |loss| are respectively the numbers of datasets on which the classification performance of A is better or worse than that of B. Figures 3 and 4 respectively show the fitted curves of GD(UK_1DB; K_1DB|S_t) and GD(UK_2DB; K_2DB|S_t) in terms of zero-one loss. The X-axis represents the indexes of the datasets described in Table 1 (referred to as t), and the Y-axis represents the values of GD(UK_1DB; K_1DB|S_t) and GD(UK_2DB; K_2DB|S_t), where S_t denotes the collection of datasets S_t = {D_m | m ≤ t} and D_m is the dataset with index m.
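Equation (17) amounts to counting pairwise wins and losses over a dataset collection. A minimal sketch, with hypothetical zero-one losses:

```python
def goal_difference(losses_a, losses_b):
    """GD(A; B|T) = |win| - |loss|: count the datasets in T on which A's
    zero-one loss is lower (a win) or higher (a loss) than B's; ties
    contribute to neither count."""
    wins = sum(a < b for a, b in zip(losses_a, losses_b))
    losses = sum(a > b for a, b in zip(losses_a, losses_b))
    return wins - losses

# Hypothetical zero-one losses of two classifiers on four datasets.
a = [0.10, 0.20, 0.05, 0.30]
b = [0.15, 0.20, 0.10, 0.25]
print(goal_difference(a, b))  # 2 wins - 1 loss = 1
```

Plotting GD over the datasets ordered by size, as in Figures 3 and 4, shows how the cumulative advantage evolves as larger datasets are added to the collection.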
From Figure 3, we can see that UK_1DB achieved a significant advantage over K_1DB; only on a few large datasets (nursery, seer-mdl, adult) was the advantage not obvious. Similarly, from Figure 4, we can see an obvious positive correlation between the values of GD(UK_2DB; K_2DB|S_t) and the dataset size. The advantage of UK_2DB over K_2DB was much more obvious than that of UK_1DB over K_1DB on small and medium datasets. This superior performance is owed to the ensemble learning mechanism of UTL, which plays a very important role in discovering proper dependency relationships that exist in testing instances. Since UTL replaces redundant dependency relationships with more meaningful ones, we can infer that UKDB retains the advantages of KDB, i.e., the ability to represent an arbitrary degree of dependence and to fit the training data. This demonstrates the feasibility of applying UTL to search for proper dependency relationships. When dealing with large datasets, overfitting may lead to high variance and classification bias; thus, the advantage of UKDB over KDB was not obvious for k = 1 or k = 2.

For imbalanced datasets, the numbers of instances with different class labels vary greatly, which may bias the estimates of the conditional probabilities. In this paper, the entropy of the class variable C, i.e., H(C), is introduced to measure the extent to which a dataset is imbalanced. UTL refines the network structures of BNC_T and BNC_P according to the attribute values rather than the class label of testing instance U, so the negative effect caused by an imbalanced distribution of C is mitigated to a certain extent. From Figures 5 and 6, we can see that the advantage of UKDB over KDB becomes increasingly significant when H(C) > 0.8. Thus, the datasets with H(C) > 0.8 are regarded as relatively imbalanced and are highlighted in Tables A1-A3. Table 5 reports the H(C) values of all 40 datasets.

Relative Zero-One Loss Ratio
The relative zero-one loss ratio measures the extent to which classifier A_1 performs relatively better or worse than A_2 on different datasets. For instance, suppose that on dataset D_1 the zero-one losses of classifiers A_1 and A_2 are 55% and 50%, respectively, whereas on dataset D_2 they are 0% and 5%. Although the zero-one loss difference is 5% in both cases, A_1 performs relatively better on D_2 than on D_1. Given two classifiers A and B, the relative zero-one loss ratio, referred to as R_Z(·), is defined as follows:

R_Z(A|B) = (Z_B − Z_A) / Z_B,  (18)

where Z_A (or Z_B) denotes the zero-one loss of classifier A (or B) on a specific dataset. The higher the value of R_Z(A|B), the better the performance of classifier A relative to classifier B. Figure 7 presents the comparisons of R_Z(·) between UK_2DB and K_2DB and between UK_1DB and K_1DB. The X-axis represents the index of the dataset, and the Y-axis shows the value of R_Z(·). As we can observe, on most datasets, the values of R_Z(UK_2DB|K_2DB) and R_Z(UK_1DB|K_1DB) were positive, which demonstrates that UKDB achieved significant advantages over KDB for both k = 1 and k = 2. In many cases, the difference between R_Z(UK_2DB|K_2DB) and R_Z(UK_1DB|K_1DB) was not obvious; thus, the working mechanism of UTL makes it insensitive to the structure complexity. For the first 10 datasets, the effectiveness of UTL was less significant: UK_1DB beat K_1DB on six datasets and lost on four, and UK_2DB performed similarly. From Table 1, among the datasets on which UTL performed poorly, contact-lenses (No. 1), echocardiogram (No. 5), and iris (No. 7) have small numbers of attributes, i.e., 4, 6, and 4 attributes, respectively. A small dataset may lead to low-confidence estimates of the probability distributions and hence of Ĥ(X_j|C, Π_j).
A small number of attributes makes it more difficult for UTL to adjust the dependency relationships dynamically. However, as the size of the datasets increased, UKDB generally achieved more significant advantages over KDB. For the last 30 datasets, UTL performed more poorly only on a few, e.g., hypothyroid (No. 25), and among these datasets, UK 2 DB worked much better than UK 1 DB. From the above discussion, we can conclude that the UTL framework is effective at identifying the significant conditional dependencies implicated in a testing instance, although enough data to assure high-confidence probability estimates is a necessary prerequisite.
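The worked example for D 1 and D 2 can be checked with a short sketch. It assumes the ratio takes the form R Z (A|B) = (Z B − Z A )/Z B , which is consistent with the example values above, but the exact normalization used to generate Figure 7 is an assumption on our part:

```python
def relative_zero_one_loss_ratio(z_a, z_b):
    """Relative zero-one loss ratio R_Z(A|B): positive values mean
    classifier A performs relatively better than classifier B."""
    return (z_b - z_a) / z_b

# Dataset D1: A loses 55%, B loses 50% -> A relatively worse.
print(relative_zero_one_loss_ratio(0.55, 0.50))  # ≈ -0.1
# Dataset D2: A loses 0%, B loses 5% -> A relatively much better.
print(relative_zero_one_loss_ratio(0.00, 0.05))  # ≈ 1.0
```

The same 5% absolute gap thus yields very different relative ratios, which is exactly the distinction the criterion is meant to capture.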


Training and Classification Time
The training and classification times are displayed in Figures 8 and 9, respectively. Each bar shows the total time over the 40 datasets. From Figure 8, we can observe that our proposed algorithms UK 1 DB and UK 2 DB needed substantially more training time than the other classifiers considered, i.e., NB, TAN, K 1 DB, K 2 DB, AODE, WATAN, and TAN e . UK 2 DB spent slightly more training time than UK 1 DB on account of the larger number of dependency relationships in UK 2 DB. On the other hand, as shown in Figure 9, NB, TAN, AODE, K 1 DB, and K 2 DB consumed less classification time than UKDB with k = 1 or k = 2, owing to the ensemble learning strategy of UTL: during the learning process, UTL recursively tries to find stronger dependency relationships for each testing instance based on log likelihood. UK 1 DB and UK 2 DB had similar classification time costs. Although UKDB generally required more training and classification time than the other BNCs, it achieved higher classification accuracy. Compared to KDB, UKDB delivered markedly lower zero-one loss, but at the cost of considerably more average computation. The advantage of UTL in improving classification accuracy thus comes at a cost in training and classification time.

Global Comparison
We compared our algorithms against the others with the Nemenyi test proposed by Demšar [53], shown in Figure 10. If two classifiers' average ranks differ by at least the Critical Difference (CD), their performance differs significantly. The value of CD is calculated as follows:

CD = q α √(t(t + 1) / (6N)),

where q α is the critical value, t is the number of algorithms, and N is the number of datasets; for α = 0.05 and t = 9, q α = 3.102 [53]. Given nine algorithms and 40 datasets, CD = 3.102 × √(9 × (9 + 1)/(6 × 40)) ≈ 1.8996. We plot the algorithms on the left line according to their average ranks, which are indicated on the parallel right line; the CD is also presented in the graphs. The lower an algorithm's position, the lower its rank and hence the better its performance. Algorithms are connected by a line if their differences are not significant. As shown in Figure 10, UK 2 DB achieved the lowest mean zero-one loss rank, followed by UK 1 DB. The average ranks of UK 2 DB and UK 1 DB were significantly better than those of NB, TAN, K 1 DB, and K 2 DB, demonstrating the effectiveness of the proposed universal target learning framework. Compared with the ensemble models AODE, WATAN, and TAN e , UK 2 DB and UK 1 DB also achieved lower ranks, though not significantly so.
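The CD computation for our setting (nine algorithms, 40 datasets) can be reproduced directly from Demšar's formula; the sketch below simply evaluates CD = q_α √(t(t+1)/(6N)) with the values stated above:

```python
import math

def nemenyi_cd(q_alpha, t, n):
    """Critical Difference for the Nemenyi post-hoc test:
    CD = q_alpha * sqrt(t * (t + 1) / (6 * N))."""
    return q_alpha * math.sqrt(t * (t + 1) / (6 * n))

# Nine algorithms, 40 datasets, q_0.05 = 3.102 from Demšar's table.
print(round(nemenyi_cd(3.102, 9, 40), 4))  # 1.8996
```

Any pair of algorithms whose average ranks differ by more than this value is judged significantly different at α = 0.05.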

Conclusions and Future Work
BNCs can graphically represent the dependency relationships implicit in training data, and they have previously been demonstrated to be effective and efficient. On the basis of analyzing and summarizing state-of-the-art BNCs in terms of log likelihood, this paper proposed UTL, a novel framework for BNC learning. Our experiments showed its advantages in the comparison results for zero-one loss, RMSE, F 1 -score, etc. UTL can help refine the network structure by fully mining the significant conditional dependencies among attribute values in a specific instance. The application of UTL is time-consuming, and we will seek methods to make it more efficient. Research on extending TL will be very promising.
Author Contributions: All authors contributed to the study and preparation of the article. S.G. and L.W. conceived of the idea, derived the equations, and wrote the paper. Y.L., H.L., and T.F. did the analysis and finished the programming work. All authors read and approved the final manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.