Belief Entropy Tree and Random Forest: Learning from Data with Continuous Attributes and Evidential Labels

Decision trees are well-known machine learning methods that are widely applied in classification and recognition. In this paper, with the uncertainty of labels handled by belief functions, a new decision tree method based on belief entropy is proposed and then extended to a random forest. With the Gaussian mixture model, this tree method is able to deal with continuous attribute values directly, without discretization as a pretreatment. Specifically, the tree method adopts belief entropy, an uncertainty measure based on the basic belief assignment, as a new attribute selection tool. To improve the classification performance, we construct a random forest based on the basic trees and discuss different prediction combination strategies. Numerical experiments on UCI machine learning data sets indicate the good classification accuracy of the proposed method in different situations, especially on data with high uncertainty.


Introduction
Decision trees have been widely used for their good learning capabilities and ease of understanding. In some real-world problems, instances may be ill-known because of factors such as randomness, data incompleteness and even experts' indefinite subjective opinions; however, traditional decision trees can only handle certain samples with precise data. Incompletely observed instances are usually ignored or replaced by precise ones, despite the fact that they may contain useful information [1], which may cause a loss of accuracy.
There have been many attempts to build trees from incomplete data in the past several decades. Probability trees [2,3] were suggested based on probability theory, which is usually the first intuitive tool for modeling uncertainty in practice; however, it has been shown that probability is not always adequate for representing data uncertainty [4,5] (often termed epistemic uncertainty). To overcome this drawback, various approaches have been proposed, including fuzzy decision trees [6,7], possibilistic decision trees [8] and uncertain decision trees [9,10]. Besides the aforementioned methods, a more general framework, called belief function theory [11,12] (also evidential theory or Dempster-Shafer theory), has been shown to be able to model all kinds of knowledge. Embedding belief functions within decision tree techniques has been extensively investigated [13][14][15][16][17][18][19][20][21][22][23][24][25] in recent years. In particular, several of these trees [17][18][19] estimate parameters by maximizing the evidential likelihood function using the E²M algorithm [26,27], which is also the basis of part of the trees proposed in this paper.
However, the existing methods for incomplete data do not take continuous attributes into full consideration. These proposals deal with uncertain data modeled by belief functions and build trees by extending traditional decision tree methods. Because they imitate and transform existing methods, they handle continuous attribute values by discretization, which loses detail of the training data. For example, the information gain ratio, the attribute selection measure in C4.5, was adapted to the evidential labels of the training set in the Belief C4.5 trees [19], in which each continuous-valued attribute is divided into four intervals of equal width before learning. This issue leads to the purpose of this paper: to learn from uncertain data with continuous attribute values without pretreatment.
To realize this purpose, we first fit, for each attribute, the training data to a Gaussian mixture model (GMM), which consists of normal distributions in one-to-one correspondence with the class labels, by adopting the E²M algorithm. This step, which significantly differs from other decision trees, provides the ability to deal with ill-known labels and original attribute values (either discrete or continuous). On the basis of these GMMs, we generate the basic belief assignment (BBA) and calculate belief entropy [28]. The attribute with minimal average entropy, which best distinguishes the classes from each other, is selected as the splitting attribute. The remaining decision tree induction steps are designed accordingly. To our knowledge, this paper is the first to introduce GMMs and belief entropy to decision trees with evidential data.
Another part of our proposal is adopting an ensemble method for our belief entropy trees. Inspired by the idea of building bagging trees based on random sampling [29], we further choose a more efficient and popular technique, random forest [30]. Under the belief function framework, the basic trees output either precise or mass (modeled by BBA) label predictions, while the traditional random forest can only combine precise labels. Thus, a new method to summarize the basic tree predictions is proposed, which combines mass labels directly instead of voting on precise labels. This combined mass keeps as much of the uncertain information in the data as possible, which helps to generate a more reasonable prediction. The new combination method is discussed and compared to the traditional majority voting method later.
We note that our early work was presented in a shorter conference paper [31]. Compared with that initial conference paper, we have revised the attribute selection and splitting strategy of a single tree and introduced ensemble learning to our tree method in this paper. Section 2 recalls some basic knowledge about decision trees, belief function theory, the E²M algorithm and belief entropy. Section 3 details the induction procedure of belief entropy trees and proposes three different instance prediction techniques. In Section 4, we introduce how to extend the single belief entropy tree to random forests and discuss the different prediction combination strategies. In Section 5, we detail experiments that were carried out on some classical UCI machine learning data sets to compare the classification accuracies of the proposed trees and random forests. Finally, conclusions are summarized in Section 6.

Settings and Basic Definitions
The purpose of a classification method is to build a model that maps an attribute vector $X = (x^1, \dots, x^D) \in A_1 \times A_2 \times \dots \times A_D$, which contains $D$ attributes, to an output class $y \in C = \{C_1, \dots, C_K\}$ taking its value among $K$ classes. Each attribute either takes finitely many discrete values or takes continuous values within an interval. The learning of classification is based on a complete training set of precise data which contains $N$ instances, denoted as
$$T = \{(x_1, y_1), \dots, (x_N, y_N)\}.$$
However, imperfect knowledge about the inputs (feature vector) and the outputs (classification labels) exists widely in practical applications. Traditionally, imperfect knowledge is modeled by probability theory, which is considered questionable in a variety of scenarios. Hence, we model uncertainty by belief functions in this paper. Typically, we consider that attribute values are precise and can be either continuous or discrete, while only the output labels are uncertain.

Decision Trees
Decision trees [32] are regarded as one of the most effective and efficient machine learning methods and are widely adopted for solving classification and regression problems in practice. Their success, to a great extent, relies on an easily understandable structure, for both humans and computers. Generally, a decision tree is induced top-down from a training set T by recursively repeating the steps below:
• Select an attribute, through a designed selection method, to generate a partition of the training set;
• Split the current training set into several subsets and put them into child nodes;
• Generate a leaf node and determine the prediction label for a child node when a stopping criterion is satisfied.
Differing in the attribute selection methods, several decision tree algorithms have been proposed, such as ID3 [32], C4.5 [33] and CART [34]. Among these trees, the ID3 and C4.5 choose entropy as an information measure to compute and evaluate the quality of a node split by a given attribute.
The core of ID3 is information gain. Given training data $T$ and an attribute $A$ with $K_A$ modalities, the information gain is
$$\mathrm{Gain}(T, A) = \mathrm{Info}(T) - \mathrm{Info}_A(T),$$
where
$$\mathrm{Info}(T) = -\sum_{i=1}^{K} \theta_i \log_2 \theta_i \tag{2}$$
and
$$\mathrm{Info}_A(T) = \sum_{i=1}^{K_A} \frac{|T_i|}{|T|}\, \mathrm{Info}(T_i),$$
where $\theta_i$ is the proportion of instances in $T$ that are of class $C_i$, and $|T|$ and $|T_i|$ are the cardinalities of the instance sets belonging to the parent node and to the child node $i$. The limitation of information gain is that attributes with many distinct values are favored [33], which leads to the GainRatio in the C4.5 algorithm. It is given as
$$\mathrm{GainRatio}(T, A) = \frac{\mathrm{Gain}(T, A)}{\mathrm{SplitInfo}(T, A)},$$
where
$$\mathrm{SplitInfo}(T, A) = -\sum_{i=1}^{K_A} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}.$$
The attribute with the largest gain ratio is selected for splitting. Equation (2) is in fact the Shannon entropy. In this paper, however, because evidential data are described within the framework of belief functions, the attribute selection method is newly designed based on belief entropy [28] instead of Shannon entropy.
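The information gain and gain ratio described above can be sketched in a few lines of Python. This is a minimal illustration for a categorical attribute with precise labels; the function names and the toy data are assumptions made for the example and do not come from ID3 or C4.5 implementations.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Info(T): Shannon entropy (base 2) of the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_and_ratio(values, labels):
    """Return (Gain, GainRatio) of a categorical attribute with precise labels."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    # Info_A(T): entropy of each child node weighted by its relative size.
    info_a = sum(len(s) / n * shannon_entropy(s) for s in subsets.values())
    gain = shannon_entropy(labels) - info_a
    split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in subsets.values())
    # Guard against a single-valued attribute (SplitInfo = 0).
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


# Toy data (made up for illustration): the attribute separates the classes perfectly.
values = ["a", "a", "b", "b"]
labels = ["C1", "C1", "C2", "C2"]
gain, ratio = gain_and_ratio(values, labels)
print(gain, ratio)
```

An attribute that takes a single value on all instances yields zero gain, and the sketch returns a ratio of zero in that case rather than dividing by zero.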

Random Forest
To improve the classification accuracy and generalization ability of machine learning, ensemble methods are introduced into the learning procedure. One important branch of ensemble methods is bagging, which concurrently builds multiple basic models that learn from different training sets generated from the original data by bootstrap sampling. On the basis of bagging decision trees, random forest (RF) [30] not only chooses the training instances randomly but also introduces randomness into attribute selection. To be specific, traditional decision trees select the best splitting attribute among all $D$ attributes; random forest generates a random attribute subset and then chooses the best attribute within this subset to split the tree node. The size $D'$ of this subset is adjustable and generally set as $D' = \log_2 D$.
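The two sources of randomness in a random forest can be illustrated as follows. This is a minimal sketch using only the Python standard library; the helper names are hypothetical, and rounding $\log_2 D$ to the nearest integer (with a minimum of one attribute) is an assumption, since the text does not specify how non-integer sizes are handled.

```python
import math
import random


def bootstrap_indices(n, rng):
    """Draw n instance indices with replacement (bootstrap sampling)."""
    return [rng.randrange(n) for _ in range(n)]


def random_attribute_subset(num_attributes, rng, subset_size=None):
    """Pick the candidate attributes for one node split, without replacement."""
    if subset_size is None:
        # Assumed rounding of the log2(D) default mentioned in the text.
        subset_size = max(1, round(math.log2(num_attributes)))
    return sorted(rng.sample(range(num_attributes), subset_size))


rng = random.Random(0)
rows = bootstrap_indices(150, rng)       # one resampled training set, e.g. for Iris
attrs = random_attribute_subset(4, rng)  # candidate attributes for one node
print(len(set(rows)), attrs)
```

A new attribute subset would be drawn at every node to be split, while the bootstrap sample is drawn once per tree.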
A detailed description of the mathematical formulation of the RF model can be found in [30]. The RF model consists of a union of multiple basic trees, where each tree learns from bootstrap samples and selects attributes from a small subset of all attributes. RF has several advantages: (a) better prediction performance, (b) resistance to overfitting, (c) low correlation of individual trees, (d) low bias and low variance and (e) small computational overhead.
Some existing works have explored the ensemble method on belief decision trees, such as bagging [29]. In this paper, we apply the random forest technique to the proposed belief entropy trees and discuss the different prediction determining strategies.

Belief Function Theory
Let the finite set $\Omega$ denote the frame of discernment containing $k$ possible exclusive values that a variable can take. When considering the output $y$, the imperfect knowledge about the value of $y$ can be modeled by a mass function $m_y : 2^{\Omega} \to [0, 1]$, such that $m_y(\emptyset) = 0$ and
$$\sum_{A \subseteq \Omega} m_y(A) = 1,$$
which is also called a basic belief assignment (BBA). A subset $A$ with $m_y(A) > 0$ is called a focal set, and $m_y(A)$ can be interpreted as the degree of support that the evidence gives to the case that the true value is in the set $A$.
Some typical mass functions deserve attention:
• Vacuous mass: a mass function such that $m_y(\Omega) = 1$, which means total ignorance;
• Bayesian mass: every focal set $A$ has cardinality $|A| = 1$. In this case, the mass degenerates to a probability distribution;
• Logical (categorical) mass: $m_y(A) = 1$ for some $A$. In this case, the mass is equivalent to the set $A$.
In one-to-one correspondence with the mass function $m_y$, the belief function and plausibility function are defined as
$$Bel_y(B) = \sum_{A \subseteq B} m_y(A), \qquad Pl_y(B) = \sum_{A \cap B \neq \emptyset} m_y(A),$$
which, respectively, indicate the minimum and maximum degree of belief of the evidence towards the set $B$. Typically, the function $pl_y : \Omega \to [0, 1]$ such that $pl_y(\omega) = Pl_y(\{\omega\})$ for all $\omega \in \Omega$ is called the contour function associated with $m_y$.
Two mass functions $m_1$ and $m_2$ induced by independent pieces of evidence can be combined by Dempster's rule [12] $\oplus$, defined as
$$(m_1 \oplus m_2)(A) = \frac{1}{1 - \kappa} \sum_{B \cap C = A} m_1(B)\, m_2(C)$$
for all $A \subseteq \Omega$, $A \neq \emptyset$, and $(m_1 \oplus m_2)(\emptyset) = 0$, where
$$\kappa = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C)$$
is called the degree of conflict between $m_1$ and $m_2$. By definition, Dempster's rule is commutative and associative.
In a decision-making situation, we need to determine the most reasonable hypothesis from a mass function. Different decision-making strategies with belief functions [35,36] have been studied. Among these methods, in the transferable belief model (TBM), the pignistic probability [37] was proposed to make a decision from a BBA:
$$BetP(\omega) = \sum_{A \subseteq \Omega,\ \omega \in A} \frac{m(A)}{|A|}, \tag{11}$$
where $|A|$ is the cardinality of $A$.
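Dempster's rule and the pignistic transformation are easy to check numerically. The following sketch represents a mass function as a Python dictionary mapping focal sets (as `frozenset`) to masses; this representation, the function names and the example masses are assumptions made for illustration.

```python
from itertools import product


def dempster(m1, m2):
    """Combine two mass functions with Dempster's rule; return (mass, conflict)."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass that would fall on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}, conflict


def pignistic(m):
    """Pignistic probability: share each mass equally among its elements."""
    betp = {}
    for s, w in m.items():
        for omega in s:
            betp[omega] = betp.get(omega, 0.0) + w / len(s)
    return betp


omega = frozenset({"a", "b", "c"})
m1 = {frozenset({"a"}): 0.6, omega: 0.4}
m2 = {frozenset({"b"}): 0.5, frozenset({"a", "b"}): 0.5}
m12, kappa = dempster(m1, m2)
betp = pignistic(m12)
print(kappa, betp)
```

Here the only conflicting pair is $\{a\}$ with $\{b\}$, so the degree of conflict is $0.6 \times 0.5 = 0.3$ and the remaining products are renormalized by $1 - 0.3$.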
When we model uncertain labels of evidential data with mass functions, the training set becomes
$$T = \{(x_1, m_{y_1}), \dots, (x_N, m_{y_N})\}.$$

Evidential Likelihood
Consider a discrete random vector $Y$ taking values in $\Omega$ with a probability mass function $p_Y(y; \theta)$ assumed to be associated with a parameter $\theta \in \Theta$. After a realization $y$ of $Y$ has been perfectly observed, the likelihood function of the complete data is defined as
$$L_c(\theta; y) = p_Y(y; \theta).$$
When the observations are uncertain, the parameter $\theta$ can no longer be estimated from this complete-data likelihood. For this situation, a new statistical tool [38] called evidential likelihood was proposed to perform parameter estimation. Assume that $y$ is not precisely observed, but it is known for sure that $y \in A$ for some $A \subseteq \Omega$. Given such imprecise data, the likelihood function is extended to
$$L(\theta; A) = \sum_{\omega \in A} p_Y(\omega; \theta).$$
Furthermore, the observation of instance $y$ can be not only imprecise but also uncertain, which is modeled by a mass function $m_y$. Thus the evidential likelihood function [27] can be defined as
$$L(\theta; m_y) = \sum_{A \subseteq \Omega} L(\theta; A)\, m_y(A) = \sum_{\omega \in \Omega} p_Y(\omega; \theta)\, pl(\omega), \tag{14}$$
where $pl$ is the contour function related to $m_y$, so that $L(\theta; m_y)$ can be written as $L(\theta; pl)$. According to Denoeux [27], the value $1 - L(\theta; pl)$ equals the conflict between the parametric model $p_Y(y; \theta)$ and the uncertain observation $pl$, which means that maximizing $L(\theta; pl)$ amounts to estimating the parameter $\theta$ that fits the parametric model to the observation as closely as possible. Equation (14) also indicates that $L(\theta; pl)$ can be written as the expectation of $pl$:
$$L(\theta; pl) = \mathbb{E}_{\theta}[pl(Y)]. \tag{15}$$
Assume that $Y = (y_1, \dots, y_N)$ is a sample containing $N$ cognitively independent [12] and i.i.d. uncertain observations, in which $y_i$ is modeled by $m_{y_i}$ with contour function $pl_i$. In this situation, Equation (15) is written as a product of $N$ terms:
$$L(\theta; pl) = \prod_{i=1}^{N} \mathbb{E}_{\theta}[pl_i(Y_i)]. \tag{16}$$
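As a small numerical illustration of the product form of the evidential likelihood, the sketch below assumes a categorical model whose parameter is simply the vector of class probabilities; the function name and the toy contour functions are made up for the example.

```python
import math


def evidential_likelihood(p, pls):
    """L(theta; pl) = prod_i sum_k p[k] * pl_i[k] for a categorical model p."""
    return math.prod(sum(pk * plk for pk, plk in zip(p, pl)) for pl in pls)


p = [0.5, 0.3, 0.2]        # assumed class probabilities (the parameter theta)
pls = [
    [1.0, 0.0, 0.0],       # precise observation: first class
    [1.0, 1.0, 1.0],       # vacuous observation: contributes a factor of 1
    [0.0, 1.0, 1.0],       # imprecise observation: second or third class
]
likelihood = evidential_likelihood(p, pls)
print(likelihood)
```

A vacuous observation contributes a factor of one and therefore carries no information about the parameter, which matches the interpretation of $1 - L(\theta; pl)$ as a degree of conflict.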

E 2 M Algorithm
Although the evidential likelihood extends the likelihood function, its maximum cannot be computed directly by the widely used EM algorithm [39]. The E²M algorithm [27] introduced by Denoeux allows us to maximize the evidential likelihood iteratively. It is composed of two steps (similar to the EM algorithm):
1. The E-step first requires a probability mass function $p_Y(\cdot \mid pl; \theta^{(q)}) = p_Y(\cdot; \theta^{(q)}) \oplus pl$, in which the former part is the probability mass function of $Y$ under the parameter $\theta^{(q)}$ estimated in the last iteration and the latter part is the contour function $pl$. The expression is
$$p_Y(y \mid pl; \theta^{(q)}) = \frac{p_Y(y; \theta^{(q)})\, pl(y)}{L(\theta^{(q)}; pl)}.$$
Then calculate the expectation of the complete-data log likelihood $\log L_c(\theta; y) = \log p_Y(y; \theta)$ with respect to $p_Y(\cdot \mid pl; \theta^{(q)})$:
$$Q(\theta, \theta^{(q)}) = \frac{\sum_{y \in \Omega} \log p_Y(y; \theta)\, p_Y(y; \theta^{(q)})\, pl(y)}{L(\theta^{(q)}; pl)}.$$
2. The M-step maximizes $Q(\theta, \theta^{(q)})$ with respect to $\theta$, obtaining a new estimate
$$\theta^{(q+1)} = \arg\max_{\theta} Q(\theta, \theta^{(q)}).$$
The two steps are repeated until $L(\theta^{(q+1)}; pl) - L(\theta^{(q)}; pl) < \epsilon$, where $\epsilon$ is a set threshold.

Belief Entropy
Inspired by Shannon entropy [40], which measures the uncertainty contained in a probability distribution, a type of belief entropy called Deng entropy was proposed by Deng [28] to handle situations where traditional probability theory is limited. When uncertain information is described by a basic belief assignment instead of a probability distribution, Shannon entropy cannot be applied. Deng entropy is defined within the belief function framework, which makes it able to measure uncertain information described by a BBA efficiently.
Let $A$ be a focal set of the belief function, and $|A|$ be the cardinality of $A$. Deng entropy $E$ is defined as
$$E(m) = -\sum_{A \subseteq \Omega} m(A) \log_2 \frac{m(A)}{2^{|A|} - 1}.$$
It follows from the definition that if the mass function is Bayesian, which means $|A| = 1$ for all focal sets $A$, Deng entropy degenerates to Shannon entropy:
$$E(m) = -\sum_{A \subseteq \Omega} m(A) \log_2 m(A).$$
The greater the cardinality of a focal set, the larger the corresponding Deng entropy, because the evidence refers imprecisely to more single elements. Thus, a large Deng entropy indicates high uncertainty. Based on this feature, we calculate the average Deng entropy of BBAs to select the attribute leading to the least uncertainty. The details are shown in the next section.
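The behavior of Deng entropy on Bayesian and non-Bayesian masses can be verified with a short sketch. Mass functions are again represented as dictionaries from `frozenset` focal sets to masses, which is an assumption of this example rather than a convention of the paper.

```python
import math


def deng_entropy(m):
    """Deng entropy of a BBA given as {frozenset focal set: mass}."""
    return -sum(
        w * math.log2(w / (2 ** len(s) - 1)) for s, w in m.items() if w > 0
    )


bayesian = {frozenset({"a"}): 0.5, frozenset({"b"}): 0.5}
vacuous = {frozenset({"a", "b"}): 1.0}

e_bayes = deng_entropy(bayesian)   # coincides with Shannon entropy here
e_vacuous = deng_entropy(vacuous)  # equals log2(2**2 - 1) = log2(3)
print(e_bayes, e_vacuous)
```

On a two-element frame the vacuous mass has a larger Deng entropy than the uniform Bayesian mass, reflecting the additional imprecision of a non-singleton focal set.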

Design of Belief Entropy Trees
Up to now, various decision tree methods have been proposed to deal with evidential data, but many of them consider categorical attributes and transform continuous attribute values into discrete categories. Some recent works fit the continuous attribute values with the same class label to normal distributions [41] and generate BBAs from the normal distributions to select the best splitting attribute by calculating belief entropy [42]; however, this method divides samples into one set per class, so it can only handle precise class labels. Our goal is to develop a belief decision tree method that learns directly and efficiently from data sets with continuous and precise attribute values but incomplete class labels.
This section explains our method in detail, specifically focusing on the procedure of attribute selection. Corresponding splitting strategy, stopping criterion and the leaf structure are also well-designed to accomplish the whole belief entropy decision tree.

The Novel Method to Select Attribute
The learning procedure of decision trees generally consists of deciding, at each node, the splitting attribute and how to split on this attribute; our method also proceeds in this manner. The most characteristic and core part of our method is the attribute selection, which includes three steps: first, for each attribute, fit the values to normal distributions corresponding to each class label, in other words, fit the attribute values to $K \times D$ normal distribution models, where $K$ is the number of classes and $D$ is the number of attributes; second, for every instance, generate $D$ BBAs, one from each attribute, according to the normal distribution-based models; finally, calculate the belief entropy from the BBAs for each attribute. The attribute with minimum belief entropy is selected for splitting.

Parameter Estimation on Data with Continuous Attributes
Building on the idea of extracting BBAs from normal distribution-modeled attribute values [41], we try to operate similarly on data with ill-known class labels. In the situation where each instance belongs to exactly one class, the set of values of the $d$-th attribute $\{x^d_1, \dots, x^d_N\}$ is divided into $K$ subsets $\{x^d_n \mid y_n = C_k\}$, $k = 1, \dots, K$, corresponding to each class. It is easy to fit each subset to a normal distribution by calculating means and standard deviations.

Example 1.
Consider the Iris data set [43], a classical machine learning data set, which contains 150 training instances of three classes, 'Setosa', 'Versicolor' and 'Virginica', with four attributes: sepal length (SL), sepal width (SW), petal length (PL) and petal width (PW). For the values of attribute SL in the class Setosa, we can directly calculate the mean value as µ = 5.0133 and the standard deviation as σ = 0.3267. Similarly, we can obtain the normal distribution parameters of the classes Versicolor and Virginica. Figure 1 shows the normal distribution model of the Iris data set for the SL attribute in the three classes.
However, when the labels of the training set are ill-known, some instances cannot be allocated to a certain class with certainty. The evidential likelihood and the E²M algorithm introduced in Section 2 make it possible to estimate the model parameters.
Because the E²M algorithm uses only contour functions, the label of the $n$-th instance will be represented by the plausibility $pl_n = \{pl_{nk}\}$, $k = 1, \dots, K$, instead of the mass function $m_n$.
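For the purpose of comparing attributes, we split the whole training data into $D$ attribute-label pairs and handle $D$ parameter estimation problems. Consider the $d$-th attribute $A_d$, $d \in \{1, \dots, D\}$; we assume that the conditional distribution of $X^d$ given $y = C_k$ is normal with mean $\mu_k$ and standard deviation $\sigma_k$:
$$X^d \mid (y = C_k) \sim \mathcal{N}(\mu_k, \sigma_k^2), \quad k = 1, \dots, K.$$
This assumption amounts to building a one-dimensional Gaussian mixture model (GMM) [44]. Similar to the application of the E²M algorithm in linear discriminant analysis [45], the following discussion adopts the E²M algorithm to estimate the parameters of the GMM. Let $\pi_k$ be the marginal probability of $y = C_k$, and $\theta = (\mu_1, \dots, \mu_K, \sigma_1, \dots, \sigma_K, \pi_1, \dots, \pi_K)$ the parameter vector to be estimated. The complete-data likelihood is
$$L_c(\theta) = \prod_{n=1}^{N} \prod_{k=1}^{K} \left[\pi_k\, \phi(x^d_n; \mu_k, \sigma_k)\right]^{y_{nk}},$$
where $\phi$ is the normal probability density function and $y_{nk}$ is a binary indicator variable such that $y_{nk} = 1$ if $y_n = C_k$ and $y_{nk} = 0$ if $y_n \neq C_k$. When extended to evidential data, where we use contour functions to describe the labels, the evidential likelihood follows from Equation (16):
$$L(\theta) = \prod_{n=1}^{N} \sum_{k=1}^{K} pl_{nk}\, \pi_k\, \phi(x^d_n; \mu_k, \sigma_k).$$
According to the E²M algorithm, we compute the expectation of the complete-data log likelihood with respect to the combined probability mass function $p(\cdot \mid pl; \theta^{(q)})$. To simplify the equation, we denote
$$t^{(q)}_{nk} = \frac{pl_{nk}\, \pi^{(q)}_k\, \phi(x^d_n; \mu^{(q)}_k, \sigma^{(q)}_k)}{\sum_{l=1}^{K} pl_{nl}\, \pi^{(q)}_l\, \phi(x^d_n; \mu^{(q)}_l, \sigma^{(q)}_l)}.$$
Finally, we obtain the function to be maximized in the E-step:
$$Q(\theta, \theta^{(q)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} t^{(q)}_{nk} \left[\log \pi_k + \log \phi(x^d_n; \mu_k, \sigma_k)\right].$$
The form of $Q(\theta, \theta^{(q)})$ is similar to the function computed in the EM algorithm for the GMM [44]. Because of this similarity, the optimal parameters maximizing $Q(\theta, \theta^{(q)})$ can be computed iteratively by
$$\pi^{(q+1)}_k = \frac{1}{N} \sum_{n=1}^{N} t^{(q)}_{nk}, \qquad \mu^{(q+1)}_k = \frac{\sum_{n=1}^{N} t^{(q)}_{nk}\, x^d_n}{\sum_{n=1}^{N} t^{(q)}_{nk}}, \qquad \sigma^{(q+1)}_k = \sqrt{\frac{\sum_{n=1}^{N} t^{(q)}_{nk} \left(x^d_n - \mu^{(q+1)}_k\right)^2}{\sum_{n=1}^{N} t^{(q)}_{nk}}}.$$
If $L(\theta^{(q+1)}) - L(\theta^{(q)}) < \epsilon$ is satisfied for some threshold $\epsilon$, stop the iteration and output $\hat{\theta} = \theta^{(q+1)}$, which is the estimate of the parameters of the GMM extracted from the $d$-th attribute. Repeating this procedure for every attribute of the training set, $K \times D$ normal distribution models will be generated. Algorithm 1 shows the procedure of parameter estimation, and Example 2 helps to understand it.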
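The iteration described above can be sketched for a single attribute as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function names, the initial values, the toy data and the small floor on the standard deviation are assumptions, and the stopping test is applied to the log of the evidential likelihood for numerical stability rather than to the likelihood itself.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def e2m_gmm(xs, pls, mus, sigmas, eps=1e-6, max_iter=500):
    """Fit a 1-D GMM (one component per class) to values xs with contour labels pls."""
    num_classes, n = len(mus), len(xs)
    pis = [1.0 / num_classes] * num_classes
    mus, sigmas = list(mus), list(sigmas)
    prev = -math.inf
    for _ in range(max_iter):
        # E-step: weights t_nk proportional to pl_nk * pi_k * phi(x_n; mu_k, sigma_k).
        t, loglik = [], 0.0
        for x, pl in zip(xs, pls):
            w = [pl[k] * pis[k] * normal_pdf(x, mus[k], sigmas[k])
                 for k in range(num_classes)]
            s = sum(w)
            loglik += math.log(s)
            t.append([wk / s for wk in w])
        if loglik - prev < eps:  # assumed stopping test on the log-likelihood
            break
        prev = loglik
        # M-step: weighted proportions, means and standard deviations.
        for k in range(num_classes):
            tk = sum(row[k] for row in t)
            pis[k] = tk / n
            mus[k] = sum(row[k] * x for row, x in zip(t, xs)) / tk
            var = sum(row[k] * (x - mus[k]) ** 2 for row, x in zip(t, xs)) / tk
            sigmas[k] = max(math.sqrt(var), 1e-6)  # assumed floor to avoid sigma = 0
    return pis, mus, sigmas


# Toy data (made up): two well-separated classes, one vacuous label in each group.
xs = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 4.9]
pls = [[1, 0], [1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 1], [1, 1]]
pis, mus, sigmas = e2m_gmm(xs, pls, mus=[0.0, 6.0], sigmas=[1.0, 1.0])
print(pis, mus, sigmas)
```

In this toy run the vacuous instances are attributed almost entirely to the nearer component, so each estimated mean ends up close to the average of the four values in its group.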

Algorithm 1 Parameter estimation of GMMs.
Input: evidential training set $T_{pl} = (x, pl_y)$, iteration stop threshold $\epsilon$
... (28) and (29).

Example 2. To simulate the situation in which the labels of the training set are not completely observed, we manually introduce uncertainty into the Iris data. In this example, each instance has an equal chance (25%) of being vacuous, imprecise, uncertain or completely observed (the details of the transformation are discussed in Section 5). Table 1 shows the attribute values and the labels, described by plausibility $pl$, of some instances in the evidential Iris data. Table 2 shows the mean and standard deviation pairs $(\mu, \sigma)$ calculated by the E²M algorithm. Figure 2 shows the curves of these models.

Choose an instance $I_n$ with attribute vector $x_n = (x^1_n, \dots, x^D_n)$ from the data set and calculate the intersection of $x^d_n$ ($d = 1, \dots, D$) and the $K$ normal distribution functions $\phi^d_k = \phi(x^d; \mu_k, \sigma_k)$, $k = 1, \dots, K$; that is, we obtain $K$ normal probability density function (PDF) values for the attribute $A_d$ and instance $I_n$, denoted as $\phi^d_{nk}$, $k = 1, \dots, K$. Because the probability of a value $x$ being sampled from a normal distribution is proportional to the PDF $\phi(x)$, we can infer that, for the attribute $d$, the probability that instance $x_n$ belongs to each class is proportional to $\phi^d_{nk} = \phi(x^d_n; \mu_k, \sigma_k)$, $k = 1, \dots, K$. From this statistical viewpoint, a rule assigning the normal PDF values to class sets was proposed to build BBAs.
First, normalize the $\phi^d_{nk}$ over the classes $k$ such that
$$f_k = \frac{\phi^d_{nk}}{\sum_{l=1}^{K} \phi^d_{nl}}.$$
Then rank the $f_k$ in decreasing order as $f_r$ ($r = 1, \dots, K$), whose corresponding class is denoted as $C_r$ ($r = 1, \dots, K$). Assign $f_r$ to class sets by the following rule: $m(\{C_1, \dots, C_r\}) = f_r$, and if $f_i = f_{i+1} = \dots = f_j$, then $m(\{C_1, \dots, C_j\}) = \sum_{p=i}^{j} f_p$. By this rule, we obtain a nested BBA of $x_n$ under the selected attribute $A_d$, which we denote as $m^d_n$.
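The construction of a nested BBA from normal PDF values can be sketched as below. The code follows one reading of the rule above (the $r$-th largest normalized density is assigned to the set of the $r$ most likely classes, with ties merged). The Setosa parameters are those of Example 1, while the Versicolor and Virginica parameters are hypothetical values chosen only for illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def nested_bba(x, params):
    """params: {class: (mu, sigma)}; returns {frozenset of classes: mass}."""
    dens = {c: normal_pdf(x, mu, s) for c, (mu, s) in params.items()}
    total = sum(dens.values())
    ranked = sorted(((d / total, c) for c, d in dens.items()), reverse=True)
    bba, nested, i = {}, [], 0
    while i < len(ranked):
        j = i
        # Group tied values f_i = ... = f_j into a single focal set.
        while j + 1 < len(ranked) and math.isclose(ranked[j + 1][0], ranked[i][0]):
            j += 1
        nested.extend(c for _, c in ranked[i:j + 1])
        bba[frozenset(nested)] = sum(f for f, _ in ranked[i:j + 1])
        i = j + 1
    return bba


params = {
    "Setosa": (5.0133, 0.3267),   # from Example 1
    "Versicolor": (5.9, 0.5),     # hypothetical
    "Virginica": (6.6, 0.6),      # hypothetical
}
bba = nested_bba(5.1, params)
for focal_set, mass in bba.items():
    print(sorted(focal_set), round(mass, 4))
```

Because each focal set contains the previous one, the resulting BBA is nested (consonant), and the largest mass falls on the single class whose density is highest at the observed value.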
Example 3. Consider the first instance of the evidential Iris data set shown in Table 1. For attribute SL, the intersections of $x^{SL} = 5.1$ with the three normal distributions are shown in Figure 3. The reader can see in the figure that this instance is closest to class 'Setosa', then to 'Versicolor' and 'Virginica'. Thus, we generate the BBA from the intersection values, which is intuitive. Similarly, we build BBAs for the rest of the attributes, shown in Table 3.

Table 3. Generated BBAs of selected instance (columns: Attributes, BBAs).

The last step in determining the splitting attribute is to calculate the average Deng entropy $E_d = \frac{1}{N} \sum_{n=1}^{N} E(m^d_n)$ of all instances for each attribute. As mentioned in Section 2.6, Deng entropy measures the degree of uncertainty contained in a BBA, which means that the smaller $E_d$ is, the more certainty the BBAs contain and the more clearly the classes are separated. Consequently, we choose the attribute $A^*$ that minimizes the average Deng entropy,
$$A^* = \arg\min_{A_d} E_d,$$
as the best splitting attribute with which to proceed with the tree building.
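Attribute selection by average Deng entropy can be illustrated with a short sketch. The BBAs below are made-up values for two instances and two attributes (named after the Iris attributes PW and SW only for readability); they are not the values of Table 3, and the function names are assumptions.

```python
import math


def deng_entropy(m):
    return -sum(
        w * math.log2(w / (2 ** len(s) - 1)) for s, w in m.items() if w > 0
    )


def select_attribute(bbas_by_attr):
    """Return the attribute with the smallest average Deng entropy and all averages."""
    avg = {
        attr: sum(deng_entropy(m) for m in bbas) / len(bbas)
        for attr, bbas in bbas_by_attr.items()
    }
    return min(avg, key=avg.get), avg


A, B, AB = frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})
bbas_by_attr = {
    "PW": [{A: 0.9, AB: 0.1}, {B: 0.8, AB: 0.2}],  # confident BBAs (made up)
    "SW": [{A: 0.5, AB: 0.5}, {B: 0.6, AB: 0.4}],  # diffuse BBAs (made up)
}
best, averages = select_attribute(bbas_by_attr)
print(best, averages)
```

The attribute whose BBAs put more mass on singletons receives the lower average entropy and is selected, which mirrors the intuition that well-separated class distributions produce confident BBAs.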

Example 4.
Continuing Examples 2 and 3, we calculate the Deng entropy of the BBAs of the selected instance shown in Table 3. Proceeding with the same calculation for all instances, we obtain the average Deng entropy of each attribute. According to this result, attribute PW will be chosen to generate child nodes.
Comparing the Deng entropy with the curves in Figure 2, we can see intuitively that PW has the most distinctive curves for each class, whereas the curves of SW overlap each other considerably. This is consistent with the ordering of the average Deng entropy values above, where PW is the lowest and SW is the highest.
In fact, Examples 1-4 in this section can be read in order as one complete worked example, which shows the procedure of the proposed attribute selection method.

Splitting Strategy
The splitting strategy is redesigned according to the selected attribute $A^*$ to fit the proposed attribute selection method. Branches are associated with the classes; that is, each node to be split has $K$ branches. For an instance $I_n$, considering the generated BBA, the class corresponding to the maximum mass value determines the branch into which this instance is put. To put it simply, when splitting the tree under attribute $A^*$, the instance $I_n$ is assigned to the $k_n$-th child node, where $k_n$ is the index of the class with the maximum mass value in the BBA generated for $I_n$ under $A^*$. Algorithm 2 summarizes the procedure of selecting the attribute and splitting. It should be mentioned that, although the child nodes are associated with the classes, this splitting strategy does not directly and arbitrarily determine the class membership of instances at this step.

Stopping Criterion and Prediction Decision
After designing the attribute selection and partitioning strategy, we split each decision node into several child nodes. This procedure repeats recursively until one of the stopping criteria is met:
• No more attributes are available for selection;
• The number of instances in the node falls below a set threshold;
• The labels of the instances are all precise and fall into the same class.
When the tree building stops at a leaf node L, a class label should be determined to predict the instances that fall into this node. We design two different prediction methods:
• The first one is to generate the prediction label from the original training labels of the instances contained in this node, which is similar to the treatment in traditional decision trees such as the C4.5 tree method. Denoting the instances in the leaf node by $I_1, \dots, I_P$ and the corresponding evidential training labels by $pl_p$, $p = 1, \dots, P$, the leaf node will be labeled by $\hat{C}$, where
$$\hat{C} = \arg\max_{C_k \in C} \sum_{p=1}^{P} pl_p(C_k),$$
which means that the class label with the maximal plausibility summation represents this node. This tree predicts from the original labels of the training set and is called the Origin-prediction belief entropy tree (OBE tree) for short in this paper.
• The first method described above in fact abandons the BBAs generated during the tree building procedure; these are used to generate the predicted instance label in the second method. First, the splitting attributes that lead instance $I$ from the root down to the leaf node are denoted by $A^*_1, \dots, A^*_Q$, and the BBAs generated accordingly are denoted by $m^*_1, \dots, m^*_Q$. Then these BBAs are combined by Dempster's rule, such that $\hat{m} = m^*_1 \oplus \dots \oplus m^*_Q$, to predict the training instance. On this basis, we continue to combine the generated BBAs of all instances in a leaf node such that $\hat{m}_{leaf} = \hat{m}_1 \oplus \dots \oplus \hat{m}_P$, where the combined BBA $\hat{m}_{leaf}$ is the mass prediction label for the whole leaf node. To obtain a precise label as an alternative, the last step is to make a decision on the BBA by choosing the class label with maximal pignistic probability computed by Equation (11). We call this tree a Leaf-prediction belief entropy tree (LBE tree) in this paper.
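The leaf labeling of the first (OBE) method reduces to an arg max over summed plausibilities. A minimal sketch follows; the function name and the plausibility values are made up for illustration, and ties are broken in favor of the first class, which is an assumption.

```python
def obe_leaf_label(pls, classes):
    """Label a leaf with the class whose summed plausibility is maximal."""
    totals = [sum(pl[k] for pl in pls) for k in range(len(classes))]
    best = max(range(len(classes)), key=totals.__getitem__)
    return classes[best], totals


classes = ["Setosa", "Versicolor", "Virginica"]
leaf_pls = [
    [1.0, 0.0, 0.0],  # precise
    [1.0, 1.0, 1.0],  # vacuous
    [0.0, 1.0, 1.0],  # imprecise
    [1.0, 0.3, 0.2],  # uncertain
]
label, totals = obe_leaf_label(leaf_pls, classes)
print(label, totals)
```

The second (LBE) method would instead combine the instances' BBAs with Dempster's rule and then apply the pignistic transformation to the combined mass.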

Algorithm 3 Induction of belief entropy trees (BE-tree).
Input: evidential training set $T_{pl}$, classifier type TYPE
Output: belief entropy tree Tree
1: construct a root node containing all instances of $T_{pl}$;
2: if stopping criterion is met then
3:   if TYPE = OBE then
4:     output the precise prediction generated from the original plausibility labels for the whole node;
5:   else if TYPE = LBE then
6:     combine the BBAs generated during each splitting, $\hat{m} = m^*_1 \oplus \dots \oplus m^*_Q$, for each instance;
7:     combine the BBAs of all instances in the node generated in step 6, $\hat{m}_{leaf} = \hat{m}_1 \oplus \dots \oplus \hat{m}_P$;
8:     output $\hat{m}_{leaf}$ as a mass prediction for the whole leaf node;
9:     output $\hat{C} = \mathrm{Pignistic}(\hat{m}_{leaf})$ as a precise prediction for the whole leaf node;
10:   end if
11:   return Tree = root node;
12: else
13:   apply Algorithm 2 to select the splitting attribute $A^*$;
14:   induce each subset $T^{child}_{pl}$ based on $A^*$;
15:   for all $T^{child}_{pl}$ do
16:     $Tree_{child}$ = BE-tree($T^{child}_{pl}$); {Recursively build the tree on the new child node}
17:     attach $Tree_{child}$ to the corresponding Tree;
18:   end for
19: end if

An Alternative Method for Predicting New Instance
Two types of belief entropy trees, the OBE tree and the LBE tree, have been described in detail in the last section. Similar to traditional decision trees, a new instance is classified in a top-down way: starting at the root node and following branches according to its generated BBA under each splitting attribute until a leaf node is reached. The prediction of the leaf node is then given to this new instance.
However, instead of collecting the numerous 'opinions' of training instances, another method to predict a new instance can be considered after a tree has been built. In Section 3.1.2, we introduced how to generate each training instance's BBA corresponding to the attributes. In the same way, we can generate $m^*_1, \dots, m^*_Q$ corresponding to the list of attributes $A^*_1, \dots, A^*_Q$ that successively split on and lead the new instance to a leaf node. Then, we combine these BBAs such that $\hat{m} = m^*_1 \oplus \dots \oplus m^*_Q$ to predict the new testing instance. This method performs the same operations as the first part of label prediction in LBE trees, yet stops when a mass prediction is obtained from the testing instance's own attribute values instead of from the leaf node it belongs to, which also means that testing instances in the same leaf node normally have different mass predictions under this design. For convenience, a tree predicting in this way is called an Instance-prediction belief entropy tree (IBE tree) in this paper. Figures 4 and 5 show the procedure of making a prediction at a leaf node, where Figure 4 shows the generation of the mass prediction $\hat{m}$ for each instance, in either the training set or the testing set, and Figure 5 details the different ways of making predictions in the three proposed belief entropy trees.

Belief Entropy Random Forest
We have introduced the induction of belief entropy trees in the previous section, which is regarded as the basic classifier of random forest ensemble method in the following discussion.
The generalization ability of random forest draws from not only the perturbation of sampling, but also the perturbation of attributes selecting. Specific to the proposed belief entropy random forest, for each basic tree, we firstly performs bootstrap sampling on the original training set, which means randomly sampling with replacement for N times on the set T where |T| = N. Secondly, when training on this resampling set, for each to-be-split node, the best splitting attribute will be chosen from a subset A i , i = 1, . . . , D of the set of all available attributes A j , j = 1, . . . , D, where 1 < D < D. If D = D, the basic tree splits totally, the same as the belief entropy tree; while D = 1 means randomly selecting an attribute to split all the time.
Repeating the first and second steps above S times constructs a 'forest' containing diverse basic trees, where the number of repetitions S is called the forest size. When making a prediction for a new instance with this forest, S primary predictions are generated independently by the S basic trees and finally aggregated into one result. Note that the OBE tree directly outputs precise labels for testing instances, while the LBE and IBE trees can provide either mass labels described by BBAs or precise labels. This feature inspires two different strategies for the final aggregation step: majority voting on precise labels and belief combination of mass labels.
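The two aggregation strategies can be sketched as follows. The sketch rests on several assumptions that are not taken from the text: a BBA is stored as a dictionary from focal sets (`frozenset`) to masses, belief combination is Dempster's rule, and the final label is read off the combined BBA by maximum plausibility. All function names and the three example BBAs are hypothetical.

```python
from collections import Counter
from functools import reduce


def dempster(m1, m2):
    """Combine two BBAs (dict: frozenset -> mass) with Dempster's rule."""
    combined, conflict = {}, 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb          # mass falling on the empty set
    if 1.0 - conflict <= 1e-12:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


def majority_vote(labels):
    """Strategy 1: the precise label predicted by most basic trees."""
    return Counter(labels).most_common(1)[0][0]


def combine_masses(bbas):
    """Strategy 2: fuse the mass labels of all basic trees."""
    return reduce(dempster, bbas)


def max_plausibility_label(bba):
    """Assumed decision rule: the class whose singleton plausibility is largest."""
    classes = set().union(*bba)
    pl = {c: sum(w for s, w in bba.items() if c in s) for c in classes}
    return max(sorted(pl), key=pl.get)


# Hypothetical mass labels from three basic trees on the frame {a, b, c}.
omega = frozenset("abc")
tree_outputs = [
    {frozenset("a"): 0.6, omega: 0.4},
    {frozenset("ab"): 0.7, omega: 0.3},
    {frozenset("b"): 0.5, omega: 0.5},
]
fused = combine_masses(tree_outputs)
label = max_plausibility_label(fused)
```

With mass combination, a tree that is unsure contributes a BBA close to vacuous and therefore influences the result less than a confident tree, whereas majority voting gives every tree the same weight.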
Algorithm 4 shows the procedure of building the complete evidential random forests based on belief entropy trees. Choosing among the ensemble prediction strategies and base tree types yields five random forests. Figure 6 shows the procedure of constructing the forests: Figure 6a shows the generation of basic trees in a random forest, and Figure 6b shows the different procedures for combining the final prediction in the five forests, which lead to different classification performance. We will evaluate them in the next section.

Experiments
In this section, we detail experiments to evaluate the performance of the proposed decision tree method. The experiment settings and results are detailed below.

Experiment Settings
As there are no widely accepted evidential data sets to measure the proposed method, it is necessary to generate a data set with ill-known labels from machine learning databases taken from the UCI repository [46]. We selected several data sets, including: Iris, Wine, Balance scale, Breast cancer, Sonar and Ionosphere.
Denote the true label of an instance by $C^*_i$ and its uncertain observation by $m^y_i$. Owing to the properties of belief functions, we can simulate several situations from precise data:
• a precise observation is such that $pl^y_i(C^*_i) = 1$ and $pl^y_i(C_j) = 0$, $\forall C_j \neq C^*_i$;
• a vacuous observation is such that $pl^y_i(C_j) = 1$, $\forall C_j \in C$;
• an imprecise observation is such that $pl^y_i(C_j) = 1$ if $C_j = C^*_i$ or $C_j \in C_{rm}$, and $pl^y_i(C_j) = 0$ otherwise, where $C_{rm}$ is a set of randomly selected labels;
• an uncertain observation is such that $pl^y_i(C^*_i) = 1$ and $pl^y_i(C_j) = r_j$, $\forall C_j \neq C^*_i$, where the $r_j$ are sampled independently from the uniform distribution $U([0, 1])$.
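The four kinds of simulated observation can be written down as plausibility assignments over the class labels. The sketch below is illustrative only: the function name `contour` is hypothetical, and the way the random label set $C_{rm}$ is drawn (a non-empty random subset of the other labels) is an assumption, since the text only says that it is randomly selected.

```python
import random


def contour(true_label, classes, kind, rng):
    """Return {class: plausibility} for one simulated observation."""
    others = [c for c in classes if c != true_label]
    if kind == "precise":
        pl = {c: 0.0 for c in others}
    elif kind == "vacuous":
        pl = {c: 1.0 for c in others}
    elif kind == "imprecise":
        # C_rm: assumed here to be a non-empty random subset of the other labels
        size = rng.randint(1, len(others))
        c_rm = set(rng.sample(others, size))
        pl = {c: (1.0 if c in c_rm else 0.0) for c in others}
    elif kind == "uncertain":
        # r_j drawn independently and uniformly from [0, 1)
        pl = {c: rng.random() for c in others}
    else:
        raise ValueError(f"unknown observation kind: {kind}")
    pl[true_label] = 1.0   # the true label is always fully plausible
    return pl
```

In every case the true label keeps plausibility 1, so the simulated observation never contradicts the ground truth; only the amount of information about the other labels changes.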
To observe the performance on evidential training data sets with different types and degrees of ill-known labels, we introduce three variables, the vacuousness level V ∈ [0, 1], the imprecision level I ∈ [0, 1] and the uncertainty level U ∈ [0, 1], to control the generation procedure, where V + I + U ≤ 1.
Example 2 shows the transformed Iris data set, part of whose instances are listed in Table 1. In this example, the labels of instances no. 53 and no. 54 are vacuous; the labels of instances no. 1 and no. 2 are imprecise; and the labels of instances no. 4 and no. 52 are uncertain.
To improve reliability and reduce stochasticity, we performed 5-fold cross-validation on each data set and repeated it ten times to compute an average classification accuracy for all experiments. The following tree induction techniques are compared:
• the traditional C4.5 tree, which only uses precise data in the training set during tree induction;
• the belief entropy trees described in Section 3.3: OBE tree, LBE tree and IBE tree;
• the belief entropy random forests described in Section 4.
We set the maximal size of a leaf node to |T|/20 to avoid overfitting in the belief entropy trees. In the random forests, the forest size was set to 50, and the size of the attribute subset was set to $D' = \log_2 D$.
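The evaluation protocol can be sketched in a few lines. This is an assumption-laden illustration, not the authors' code: the splits are plain shuffled folds (whether the original folds were stratified is not stated), `fit` and `predict` are placeholder callables standing in for any of the compared learners, and rounding $\log_2 D$ to the nearest integer is an assumption because the text does not say how a non-integer value is handled.

```python
import math
import random


def repeated_kfold(n, k=5, repeats=10, seed=0):
    """Yield (train_idx, test_idx) for `repeats` independent shuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]
        for i in range(k):
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            yield train, folds[i]


def mean_accuracy(fit, predict, X, y, **cv):
    """Average test accuracy over all folds of all repetitions."""
    scores = []
    for train, test in repeated_kfold(len(y), **cv):
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)


def attribute_subset_size(num_attributes):
    """D' = log2(D), rounded to the nearest integer (rounding is an assumption)."""
    return max(1, round(math.log2(num_attributes)))
```

With the default arguments, each data set is evaluated on 5 × 10 = 50 train/test splits, and the reported figure is the mean of the 50 test accuracies.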

Experiments on Vacuous Data
Assuming that part of the instances in the training set are totally unobserved while the others are completely observed, we performed experiments with different vacuousness levels V ∈ [0, 1] while I = U = 0. The results of generating such training sets and learning on them are shown in Figure 7.
We first consider the figure as a whole. Whatever the tree induction method, it is impossible to learn from data sets whose instances are all vacuous. Thus, the accuracy of all trees decreases gradually as V increases and drops sharply as V approaches 1. Nevertheless, almost all curves stay steady or decrease only slightly until the vacuousness level reaches 80%, except for those of the OBE trees. Table 4 shows the accuracy results when V equals 90%.
Considering the basic belief entropy trees first, the LBE and IBE trees perform at least as well as the traditional C4.5 decision trees most of the time, and better in some cases, especially at high vacuousness levels V. The OBE tree, however, performs inconsistently across data sets: it has the lowest classification accuracy on the Iris, Wine and Ionosphere data sets, yet achieves better results on the Balance scale. A possible explanation is that, if all samples in a leaf node are vacuous, the direct combination of all the training labels stays vacuous, which accounts for the weakness of the OBE tree.
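The remark about vacuous leaves can be checked directly, assuming that the 'direct combination' is Dempster's rule ⊕ and writing $m_\Omega$ for the vacuous BBA on the frame $\Omega$ and $\kappa$ for the degree of conflict (both symbols are notation introduced here for illustration):

```latex
m_\Omega(\Omega) = 1, \qquad
\kappa = \sum_{B \cap C = \emptyset} m_\Omega(B)\, m_\Omega(C) = 0, \qquad
(m_\Omega \oplus m_\Omega)(\Omega)
  = \frac{m_\Omega(\Omega)\, m_\Omega(\Omega)}{1 - \kappa} = 1 .
```

By induction, combining any number of vacuous labels again yields the vacuous BBA, so a leaf that contains only vacuous training labels carries no class information.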
The belief entropy random forests perform well overall: they improve classification accuracy compared to the corresponding basic trees, and their accuracy declines more slowly as V increases. Among these forests, those based on IBE trees that make predictions by mass combination perform better than the others on nearly all data sets except the Balance scale.

Experiments on Imprecise Data
The second situation is that some data are imprecisely observed, i.e., the observation is set-valued and the true value lies in this set (called superset labels [47] in some works). As mentioned before, the imprecision level I controls the percentage of imprecise observations.
For an instance chosen to be imprecise, we randomly generate a number $z_k \in [0, 1]$ for each class $C_k$ except the true one. The plausibility of each label with $z_k < I$ is set to 1. When I = 1, the training set becomes totally imprecise, which is, in practice, the same situation as total vacuousness. When I < 1, instances are in an intermediate state between precise and vacuous; the imprecise training set thus resembles the vacuous one, although an imprecise sample contains more information than a totally vacuous one. As a result, the accuracy curves for varying I in Figure 8 are similar to those for varying vacuousness in Figure 7, yet smoother and fuller.
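The thresholding rule and its two limiting cases can be sketched as follows. The function name `imprecise_contour` is hypothetical, and drawing $z_k$ with `random.Random.random`, which returns values in [0, 1), is an implementation assumption.

```python
import random


def imprecise_contour(true_label, classes, level, rng):
    """Plausibility of each class for one imprecise observation at level I."""
    pl = {}
    for c in classes:
        if c == true_label:
            pl[c] = 1.0                      # the true label is always plausible
        else:
            z = rng.random()                 # z_k for class C_k
            pl[c] = 1.0 if z < level else 0.0
    return pl
```

At I = 1 every $z_k$ falls below the threshold and the observation is vacuous; at I = 0 none does and the observation is precise. Intermediate levels rule out some, but not all, of the wrong labels.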
According to Table 5, the proposed methods maintain good classification results under high levels of imprecise observations. The OBE tree still shows its weakness on almost all data sets, while the LBE and IBE trees achieve similar performance. M-IBE RF keeps its advantage in most situations, especially on the Iris and Breast cancer data, where the classification accuracy is almost equal to that obtained on the totally precise training set. The Balance scale is a particular case that is discussed later.

Experiments on Uncertain Data
Another type of ill-known label is the uncertain one, which is described by a plausibility distribution in which the true label has the highest plausibility among all class labels. To evaluate the performance of the proposed trees and forests in a more general situation with uncertainty, we set U ∈ [0, 1] and V = I = 0. For an instance to be transformed into an uncertain one, we assign the value 1 to the plausibility of the true label and random values sampled uniformly from [0, U] to the other labels.
Despite their inability to handle total vacuousness and imprecision, the belief entropy trees are able to learn from totally uncertain training data sets. The nearly horizontal curves in Figure 9 indicate that all methods proposed in this paper maintain stable performance as the uncertainty level U changes. On the whole, the figure shows that, as single trees, LBE and IBE perform equally well and better than OBE on most data sets, except the Balance scale.

Figure 9. Classification accuracy on UCI data sets with different uncertainty levels.
Considering the forests, owing to the good normality of the attributes in the Iris, Wine and Breast cancer data, the five forests achieve similar classification accuracies on these data sets according to Table 6, leading to a heavy overlap of the curves in the figure. Among the trees, the OBE trees gain the most significant improvement from building a random forest; this improvement helps OBE-RF surpass the other forests on the Ionosphere, Sonar and Balance scale data sets. In particular, on the Balance scale, the accuracy of OBE-RF even increases slightly as U decreases, which can be partially explained by the fact that uncertain instances are more informative than absolutely precise instances.

Summary
By carrying out experiments on training sets with different types and degrees of incomplete observation, we can conclude that the LBE trees and IBE trees, along with the four types of random forests based on them, generally possess excellent learning ability on data with ill-known labels. Among the RFs, the ensembles of IBE trees, L-IBE-RF and M-IBE RF, achieve the highest classification accuracy in most situations, except on samples with high uncertainty levels, especially on the Balance scale data set. We think there are two reasons: (a) compared to vacuous and imprecise samples, the training labels of uncertain samples are richer in information, and the OBE tree uses these training labels to predict directly; (b) the attribute values of the Ionosphere, Balance scale and Sonar data sets are less normally distributed than the others (those of the Balance scale are not normal at all). We can conclude that the OBE RF ensemble requires less normality of the data set.
The experimental results indicate that applying belief functions to the prediction of trees and the combination of forests is effective and reasonable; yet there are also some drawbacks. Firstly, the introduction of belief functions and mass combination clearly increases the time cost of learning. Secondly, the sensitivity to the normality of the data means that no single structure among the proposed trees and RFs handles all situations best.

Conclusions
In this paper, a new classification tree method based on belief entropy is proposed to cope with uncertain data. This method directly models the continuous attribute values of training data by the E²M algorithm and selects a splitting attribute via a new tool, belief entropy. Differing from traditional decision trees, we redesign the splitting and prediction steps so that they fit uncertain labels described by belief functions. Finally, random forests with different combination strategies were constructed on the basis of the proposed tree method to seek higher accuracy and stronger generalization ability.
As the experimental results show, the proposed belief entropy trees are robust to different sorts of uncertainty. They perform comparably to traditional decision trees on precise data and maintain good results on data with ill-known labels. Meanwhile, the belief entropy random forests, which improve significantly on the basic belief entropy trees, achieve excellent and stable performance even under high levels of uncertainty. These results suggest that the proposed trees and random forests have a potentially broad field of application. In future research, some further improvements will be investigated, such as more reasonable BBA combination methods to address the inability of Dempster's rule to handle highly conflicting masses, and a boosting ensemble method based on the belief entropy trees.

Funding: The work described in this paper was supported by the National Natural Science Foundation of China (61973291).
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.