Attribute Selection Based on Constraint Gain and Depth Optimal for a Decision Tree

Uncertainty evaluation based on statistical, probabilistic information entropy is a commonly used mechanism for constructing heuristic methods in decision tree learning. The entropy kernel potentially links its deviation to decision tree classification performance. This paper presents a decision tree learning algorithm based on constrained gain and depth induction optimization. Firstly, the uncertainty distributions of information entropy for single- and multi-value events are calculated and analysed, which reveals an enhanced property of the single-value event entropy kernel, the peaks of multi-value event entropy, and a reciprocal relationship between peak location and the number of possible events. Secondly, this study proposes an estimation method for information entropy in which the entropy kernel is replaced with a peak-shift sine function, and establishes a decision tree learning algorithm (CGDT) on the basis of constraint gain. Finally, by combining branch convergence and fan-out indices under an inductive depth of a decision tree, we build a constraint gained and depth inductive improved decision tree (CGDIDT) learning algorithm. Results show the benefits of the CGDT and CGDIDT algorithms.


Introduction
Decision trees are used extensively in data modelling of a system and rapid real-time prediction for real complex environments [1][2][3][4][5]. Given a dataset acquired by field sampling, a decision attribute is determined through a heuristic method [6,7] for training a decision tree. Considering that the heuristic method is the core of induction to a decision tree, many researchers have contributed substantially to studying an inductive attribute evaluation [8][9][10]. Currently, the heuristic method of attribute selection remains an interesting topic in improving learning deviation.
Attribute selection in constructing a decision tree is mostly based on uncertainty heuristics, which can be divided into the following categories: information entropy methods based on statistical probability [11][12][13][14], methods based on rough sets and their information entropy [15][16][17], and approximate uncertainty calculation methods [18,19]. Shannon information entropy [20], which is based on statistical probability, has been used previously to evaluate the uncertainty of the sample set division in decision tree training [21], as in the heuristics of the well-known ID3 and C4.5 decision tree algorithms [22,23]; these methods search for a gain-optimized splitting feature that divides the samples into subsets for inductive classification to achieve rapid convergence. A rough set has the advantage of expressing inaccuracy naturally; through the dependency evaluation of condition and decision attributes, the kernel attribute with strong condition quality is selected as the split attribute to form the decision tree algorithm, with improved classification performance [15][16][17][24]. The approximate uncertainty calculations focus on the deviation that exists in the evaluation functions estimated by most learning methods based on information theory [18]. These computations further improve the stability of the algorithm by improving the uncertainty estimation of entropy [25][26][27][28]. The deviation in entropy arises not only from entropy itself, but also from the properties of the data and samples.
This study proposes an improved learning algorithm based on constraint gain and depth induction for a decision tree. To suppress the deviation of entropy itself, we firstly embed a peak-shift sine factor in the information entropy to create a constraint gain heuristic, GCE, in accordance with the law that the entropy peak moves towards low probability and increases in intensity as the number of possible events increases. The uncertainty is represented moderately so that the estimation deviation can be avoided. This approach realizes an uncertainty estimation that considers the otherness of the data properties while allowing for the deviation of entropy itself. Moreover, evaluation indicators of branch inductive convergence and fan-out are used to assist the GCE heuristic in selecting, from among the primary attributes, the attribute that is minimally affected by data samples and noise. This study obtains an improved learning algorithm through an uncertainty-optimized estimation for the attributes of a decision tree. The experimental results validate the effectiveness of the proposed method.
The rest of this paper is organized as follows. Section 2 introduces some related works on the attribute selection of heuristic measures in a decision tree. Section 3 discusses the evaluation of uncertainty. Section 4 proposes a learning algorithm based on the constraint gain and optimal depth for a decision tree. Section 5 introduces the experimental setup and results. Section 6 concludes the paper.

Related Work
Decision tree learning aims to reduce the distribution uncertainty of a dataset, which is partitioned by selected split attributes, and enables a classified model of induction to be simple and reasonable. Notably, a heuristic model based on the uncertainty of entropy evaluation has become a common pattern for decision tree learning given its improved uncertainty interpretation.
The uncertainty evaluation based on information entropy was designed by Quinlan in the ID3 algorithm [22], in which the uncertainty entropy of a class distribution, H(C), is reduced by the class distribution uncertainty, E(A), of the attribute domain in the dataset, namely H(C) − E(A); thus, a heuristic method of information gain with an intuitive interpretation is obtained. Given that H(C) is a constant of the corresponding dataset, maximizing the information gain amounts to minimizing E(A), the core uncertainty term of Gain. In the attribute domain of E(A), a small distribution of the classification uncertainty is easily obtained using multi-valued attributes, which leads to an evident selection bias towards multi-valued attributes and to instability of the split attribute selection in the Gain heuristic. The C4.5 algorithm [23] uses an H(A) entropy to normalise Gain in an attempt to suppress the bias of the split attribute selection; furthermore, this algorithm improves the classification performance of a decision tree through a pruning operation. Some authors [14] used a class constraint entropy to calculate the uncertainty of the attribute convergence and achieved an improved attribute selection bias and performance.
Nowozin [18] considered information entropy to be biased and proposed the use of discrete and differential entropy to replace the uncertainty estimation operator of traditional information entropy. The author found that the improved predictive performance originates from enhancing the information gain.
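The multi-valued bias described above can be illustrated with a minimal sketch of the Gain heuristic, H(C) − E(A). The toy labels and the two attribute columns (`binary`, `row_id`) are invented for illustration and are not taken from the datasets used in this paper.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def expected_entropy(values, labels):
    """E(A): class entropy inside each attribute value, weighted by value frequency."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def gain(values, labels):
    """Information gain H(C) - E(A)."""
    return entropy(labels) - expected_entropy(values, labels)

labels = ["y", "y", "n", "n", "y", "n"]
binary = ["a", "a", "a", "b", "b", "b"]          # two values, imperfect split
row_id = ["r1", "r2", "r3", "r4", "r5", "r6"]    # one distinct value per sample

print(gain(binary, labels))
print(gain(row_id, labels))
```

Because every value of `row_id` covers a single sample, E(A) collapses to zero and the gain equals H(C), although such an attribute carries no predictive structure.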
Wang et al. [29] introduced the embedding of interest factors in the Gain heuristic, E(A), in which the process is simplified as a division operation of a product, and the sum of the category sample accounts for each attribute to form an improved PVF-ID3 algorithm. According to Nurpratami et al. [30], space entropy denotes that the ratio of the class inner distance to the outer distance is embedded in information entropy. Then, space entropy is utilised, rather than information entropy, to constitute the information gain estimation between the target attribute and support information to achieve hot and non-hot spot heuristic predictions.
Sivakumar et al. [31] proposed using Renyi entropy to replace the information entropy in the Gain heuristic. Moreover, the normalisation factor, V(k), was used to improve the Gain instability, thereby improving the performance of the decision tree. Wang et al. [29] suggested replacing the entropy of information gain with unified Tsallis entropy and determined the optimal heuristic method through q parameter selection. Some authors [32] proposed that the deviation of the Shannon entropy is improved by the sine function-restraining entropy peak, but the impact of the property distribution and data sampling imbalance should be ignored.
In addition, Qiu C et al. [33] introduced a randomly selected decision tree, which aims to keep the high classification accuracy while also reducing the total test cost.

Analysis of Uncertainty Measure of Entropy
The concept of entropy has been previously utilized to measure the degree of disorder in a thermodynamic system. Shannon [20] introduced this thermodynamic entropy into information theory to define the information entropy.
We assume that a random variable, X_j, which has v possible values, can be obtained from the space of things. If we aim to measure the heterogeneity of the status of things through X_j, then its information entropy is described as follows:
$$H(X_j) = \sum_{k=1}^{v} P(X_{jk})\, I(X_{jk}) = -\sum_{k=1}^{v} P(X_{jk}) \log_2 P(X_{jk}) \tag{1}$$
where X_j ∈ A is a random variable derived from an attribute of things, A is the attribute set, and X_jk is an arbitrary value of this attribute. P(X_jk) is the occurrence probability of the event represented by this attribute value, I(X_jk) = −log_2 P(X_jk) is the self-information of X_jk, and H(X_j) is a physical quantity that measures the amount of information provided by the random variable, X_j.
We assume that the number of possible values, v, ranges from 1 to 4 for a random variable, X_j. Firstly, the occurrence probability of the event is generated by different parameter settings, and then the distribution of the information entropy is calculated. The results of these calculations are illustrated in Figures 1 and 2. The distribution of the information entropy of the single-value event is an asymmetric convex peak, which changes steeply in the small-probability region and slowly in the large-probability region. The position of the peak top is at P ≈ 0.36788, and the maximum value is H_max(X_j) ≈ 0.53073, as depicted in Figure 1a. The distribution of the information entropy of a double-value event is a symmetric convex peak centred at P = 0.5 whose maximum value is H_max(X_j) = 1, as demonstrated in Figure 1b. The distributions of the information entropy of three-value and four-value events are convex peaks that are steep on the left and gentle on the right, as exhibited in Figure 2a,b. The peak top of the former is at P = 1/3, and its maximum value is H_max(X_j) ≈ 1.58496. The peak top of the latter is at P = 0.25, and its maximum value is H_max(X_j) = 2.
When the number of values for a random variable is greater than 2, the peak of the information entropy of the random variable not only moves regularly with the change in the number of values, but also increases its peak intensity to greater than 1. Therefore, the deviation in the uncertainty estimation is accentuated by the change in the number of values of the random variable, whereas the peak shift of the information entropy presents the uncertainty distribution.
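The peak positions and maxima quoted above can be checked numerically with a short sketch. It uses only the standard Shannon term, −P log_2 P; the helper names `single_term` and `entropy` are illustrative.

```python
import math

def single_term(p):
    """Contribution -p*log2(p) of one event with probability p (0 at p = 0)."""
    return 0.0 if p == 0 else -p * math.log2(p)

def entropy(probs):
    """Shannon entropy (bits) of a full probability distribution."""
    return sum(single_term(p) for p in probs)

# The single-value term peaks where d/dp(-p*log2 p) = 0, i.e. at p = 1/e.
p_star = 1 / math.e
peak = single_term(p_star)

# Equiprobable v-value events reach log2(v) at P = 1/v.
uniform_max = {v: entropy([1 / v] * v) for v in (2, 3, 4)}

print(p_star, peak, uniform_max)
```

The single-value peak lies at P = 1/e with a height of about 0.5307, and the equiprobable maxima grow as log_2 v, which is the growth in peak intensity that the constraint entropy in the next section is designed to remove.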

Figure 2. (a) Entropy of three-value event; (b) Entropy of four-value event. Note: (1) If the three possible probabilities of the event are P(X_j1), P(X_j2), and P(X_j3), let a parameter k exist such that P(X_j3) = k P(X_j2); then, P(X_j2) = [1 − P(X_j1)]/(1 + k), in which k > 0. (2) If the four possible probabilities of the event are P(X_j1), P(X_j2), P(X_j3), and P(X_j4), let parameters k_1 and k_2 exist such that P(X_j3) = k_1 P(X_j2) and P(X_j4) = k_2 P(X_j2); then, P(X_j2) = [1 − P(X_j1)]/(1 + k_1 + k_2), in which k_1 > 0 and k_2 > 0. In labels such as "1,1" in panel (b), the first number is k_1 and the second is k_2.

Definition of Constraint Entropy Estimation Based on Peak-Shift
For a random variable, X_j ∈ {X_1, X_2, ..., X_m}, which represents an attribute of the event that occurred in the space of things, if a reasonable uncertainty evaluation of X_j over the whole attribute set is required, then the effect of the number of possible values of the random variable on the intensity of the information entropy must be eliminated. This condition aims to create a normalized measurement of uncertainty. We define the entropy estimation to measure the uncertainty of this random variable as follows:
$$H_{sc}(X_j) = \sum_{k=1}^{v} P(X_{jk})\, I_{\sin}(X_{jk}, v) \tag{2}$$
where H_sc is the average of the uncertainty of a random variable, X_j. We aim for this quantity to be no more than 0.5 when X_j is a single-value variable. The kernel, I_sin, of the entropy is the key part of the deviation improvement; that is, the sine function is used to replace the entropy kernel, log_2 x^{-1}, partly in accordance with this requirement. The peak-shift entropy kernel definition based on the peak intensity constraint is expressed as follows:
$$I_{\sin}(X_{jk}, v) = \sin\bigl(\omega(X_{jk}, v)\, P(X_{jk}) + \Psi(X_{jk}, v)\bigr) \tag{3}$$
where ω(X_jk, v) is a periodic parameter, and Ψ(X_jk, v) is the initial phase parameter.
If v = 1 and P(X_jk) ∈ [0,1], then I_sin(X_jk, v) is the first half of a single-cycle sine function with an initial phase of 0 over the probability domain [0,1]. Its form is expressed in Formula (4), and the distribution of I_sin(X_jk, 1) is displayed in Figure 3b:
$$I_{\sin}(X_{jk}, 1) = \sin\bigl(\pi P(X_{jk})\bigr) \tag{4}$$
If v > 1, then I_sin(X_jk, v) is the entropy kernel whose peak top (summit position, SP) is transferred to P = 1/v; that is, the kernel is composed by connecting two sine segments with different periods and initial phases, each spanning 1/2 π radians, over the probability domain [0,1].
When P(X_jk) ∈ [0, 1/v], I_sin(X_jk, v) is the monotonically increasing distribution of the first sine segment, as expressed in Formula (5):
$$I_{\sin}(X_{jk}, v) = \sin\Bigl(\frac{v\pi}{2}\, P(X_{jk})\Bigr) \tag{5}$$
When P(X_jk) ∈ (1/v, 1], I_sin(X_jk, v) is the monotonically decreasing distribution of the second sine segment, as defined in Formula (6):
$$I_{\sin}(X_{jk}, v) = \sin\Bigl(\frac{v\pi}{2(v-1)}\, P(X_{jk}) + \frac{(v-2)\pi}{2(v-1)}\Bigr) \tag{6}$$
The connection position of the two parameterised sinusoids that constitute I_sin(X_jk, v) is also where the entropy kernel amplitude has its maximum value.
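A minimal sketch of the peak-shift kernel and the constraint entropy follows. It assumes that the two sine segments meet with amplitude 1 at P = 1/v and that the second segment falls to 0 at P = 1, which is one parameterisation consistent with the description above; the function names `i_sin` and `h_sc` are illustrative.

```python
import math

def i_sin(p, v):
    """Assumed peak-shift sine kernel: rises to 1 at p = 1/v, falls to 0 at p = 1."""
    if not 0.0 <= p <= 1.0 or v < 1:
        raise ValueError("need 0 <= p <= 1 and v >= 1")
    if v == 1:
        return math.sin(math.pi * p)                 # half sine cycle, zero phase
    if p <= 1.0 / v:
        return math.sin(0.5 * math.pi * v * p)       # increasing segment
    omega = math.pi * v / (2.0 * (v - 1))            # periodic parameter
    psi = math.pi * (v - 2) / (2.0 * (v - 1))        # initial phase parameter
    return math.sin(omega * p + psi)                 # decreasing segment

def h_sc(probs):
    """Constraint entropy: probability-weighted sum of the kernel over all values."""
    v = len(probs)
    return sum(p * i_sin(p, v) for p in probs)

print(h_sc([0.5, 0.5]), h_sc([0.25] * 4), h_sc([0.7, 0.2, 0.1]))
```

Under these assumptions the equiprobable case always evaluates to 1 regardless of v, whereas unequal proportions give smaller values, so the peak intensity no longer grows with the number of values.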
When the number of random variable values, v, increases gradually from 2, the summit of the entropy kernel, I_sin, is transferred gradually from the probability of 0.5 to smaller probabilities, as presented in Figure 3a. This design aims to enhance the uncertainty expression of the random variable, that is, a resonance at the probability 1/v, because the uncertainty is largest when all possibilities of the random variable occur with equal probability.
When v = 1, with the constraint entropy kernel, I_sin, the impurity is harvested directly, with minima at the extreme probabilities and a maximum at P = 0.5; by contrast, with Weibull's log_2 x^{-1} kernel, which decreases monotonically from its maximum to its minimum, the impurity is expressed only through the product with the probability. The impurity of the constrained entropy kernel is therefore strongly natural.
When v = 2, the distribution of the constraint entropy, H_sc(X_j), is thinner than that of the conventional entropy, H(X_j); that is, H_sc(X_j) < H(X_j) on the left (0, 0.5) and the right (0.5, 1.0) of the peak, with symmetry about the probability 0.5. The details are plotted in Figure 3b. The difference, H(X_j) − H_sc(X_j), is evident in the small-probability range (0, 0.4) and the large-probability range (0.6, 1.0), but small in the peak-top range (0.4, 0.6). Therefore, the constraint entropy, H_sc(X_j), is not only the strongest during an equiprobable occurrence of the possibilities of the event, but its amplitude is also suppressed on both sides of the peak, where the probability decreases or increases, so that the influence strength on the uncertainty is reduced.
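As a worked check of the v = 2 case, assume that the kernel on both sides of the peak is sin(πP), that is, each segment has angular frequency π and zero initial phase (an assumption inferred from the description of the kernel, not a formula quoted from the source). Writing P for P(X_j1), the constraint entropy of a two-value event then collapses to a single sine, and the gap to the conventional entropy can be evaluated at sample probabilities:

```latex
\begin{aligned}
H_{sc}(X_j) &= P\sin(\pi P) + (1-P)\sin\bigl(\pi(1-P)\bigr) \\
            &= P\sin(\pi P) + (1-P)\sin(\pi P) = \sin(\pi P), \\
\bigl[H(X_j) - H_{sc}(X_j)\bigr]_{P=0.2}  &\approx 0.7219 - 0.5878 = 0.1341, \\
\bigl[H(X_j) - H_{sc}(X_j)\bigr]_{P=0.45} &\approx 0.9928 - 0.9877 = 0.0051 .
\end{aligned}
```

Under this assumption the gap is about 0.13 at P = 0.2 but only about 0.005 at P = 0.45, which matches the qualitative statement that the difference is evident away from the peak and small near it.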
When v = 3, the random variable has three possible events, one of which takes a probability in the range [0,1], whereas the others may be distributed reversely or randomly. If the three possibilities change from an unequal proportion to an equal proportion, that is, the parameter k decreases gradually from 7 to 1, as depicted in Figure 4a, then the H_sc(X_j) peak moves from the probability 0.5 to the probability 1/3. The right side of the peak changes from steep to gentle, whereas the left side changes from gentle to steep. The peak intensity then increases gradually until the peak reaches its maximum strength.
When v = 4, as the possibilities of the event change from an unequal proportion to an equal proportion, that is, when k_1 and k_2 change from large to small, the peak top of H_sc(X_j) transfers gradually from near P = 0.5 towards smaller probabilities until it reaches P = 0.25. The summit value also increases until it reaches the maximum H_sc(0.25) = 1 at P = 0.25. The distribution of H_sc(X_j) is demonstrated in Figure 4b, which is similar to Figure 5a. The right side of the peak shows a slow decrease, whereas the left side gradually increases.
In comparison with the traditional information entropy, the constraint entropy estimation, H_sc(X_j), firstly retains the shape and extremal distribution of the peak for an equal proportion of the possible occurrences of a two-value or a multi-value event. Moreover, the constraint entropy estimation restricts the peak intensity of a multi-value event to no more than 1, while enhancing the uncertainty of the event in the direction of an equal proportional occurrence and weakening it in the direction of an unequal proportional occurrence. Although it seems to discard the intuitive expression of information, the constrained entropy estimation can discover uncertainty more sensitively and in a reasonable, unexaggerated manner.

In the evaluation indicators, n_f is the number of samples that tested unsuccessfully in the test set, and precision is the degree of accuracy on the test set excluding testing failures.

Influence of the Entropy Peak Shift to Decision Tree Learning
In this study, we initially conducted an experiment on the influence of the peak shift of the constraint entropy on decision tree learning. The experiment used the training and test sets of Balance, Tic-Tac-Toe, and Dermatology. The experimental results are presented in Figure 5.
For the Balance training set, the training of a decision tree is implemented by moving the peak of the constraint entropy as the probability ranges from low to high for the heuristic, namely, SP∈
Figure 6 illustrates the training case of the Tic-Tac-Toe dataset. The general form of the decision tree, Acc, exhibits an initially high and then low distribution; that is, it presents a relatively high

Evaluation of the Attribute Selection Based on Constraint Gain and Depth Induction
Considering a node of the decision tree, its corresponding training dataset is S = {Y_1, Y_2, ..., Y_n}, in which the attribute variable set of the dataset is {X_1, X_2, ..., X_m}, the class tag of each sample is T_i ∈ C, and C is the class set of the things. Thus, each sample, Y_i, consists of the attributes and a class tag, T_i. According to the Gain principle [22], we aim to find the attribute with a strong gain in the training dataset, S, whilst the impact of attribute otherness is reduced. Therefore, the entropy estimation of the gain uncertainty, measured on the basis of the peak shift, is defined in Formula (7), where GCE is the gain constraint entropy estimation, which is measured by the key changed part of the Gain formula; in it, the uncertainty measure of the category distribution in the attribute space is H_sc, and H_sc is optimized by the constraint entropy based on the peak shift. Its specific calculation is expressed in Formula (8), where P(C_k|X_ji) is the probability of the class, C_k, in the X_ji value domain of the attribute, and I_sin(P(C_k|X_ji), v) is the entropy kernel calculated in accordance with either Formula (5) or (6). Given the attribute set of the training dataset, S, the GCE measure is performed in accordance with Formula (7). From this condition, we aim to find the attribute variable with the smallest uncertainty of the class distribution in the attribute space, as defined in Formula (9), where A* is a set of candidate attributes that provide a partitioning node, in which the number of values for each attribute is greater than 1.
Whilst evaluating the selection of attributes on the basis of the gain uncertainty formed by the constraint entropy, we must consider the inductive convergence of the branches generated by a selected split attribute to reduce the effects of samples and noise. We assume that the attribute, X_j, is selected as the splitting attribute of the node for the decision tree induction. Given that the value distribution of the attribute, X_j, is {X_j1, X_j2, ..., X_jv}, the dataset, S, is divided into v subsets and v downward branches. Correspondingly, when an attribute, X_k (X_k ≠ X_j), is selected as a further split attribute in the subset of a branch, the convergence branching number under the depth generated by the current tree node attribute, X_j, is measured by Formula (10), where V is the set of branch sequence numbers that can be converged as leaves by the attribute, X_j, and U is the set of branch sequence numbers that can be divided further into nodes by the attribute, X_j. X_jl and X_kq (q∈[0, kv], where kv is the number of X_k values) are the attribute values of the current tree node and the sub-branch node, correspondingly. F_l is the function that determines whether the branch of an attribute value of the current tree node is a leaf, and F_q is the function that determines whether the branch of an attribute value of the sub-branch node is a leaf. If P(C_b|X_jl) = 1 and P(C_b|X_kq) = 1, where b∈[1, |C|], then F_l and F_q are 1; otherwise, they are 0. Thus, B_conv(X_j) is the strength index that measures the convergence of the branches at the two inductive depths generated by the split attribute, X_j, of the current tree node. Similarly, when we select an attribute, X_j, as the split attribute of the current node of the decision tree, we can estimate the divergence of the branches produced in depth by X_j. Therefore, the split attribute of a branch node generated by the division on attribute X_j is assumed to be X_k (X_k ≠ X_j).
Then, the number of fan-outs under the depth generated by the current tree node attribute, X_j, is measured by Formula (11), where N_j is the number of branches generated by the current tree node, and N_k is the number of branches generated by the subordinate node of a branch of the current tree node. Thus, B_diver(X_j) is the auxiliary index that measures the divergence of the branches at the two inductive depths produced by the splitting attribute, X_j, of the current tree node.
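The attribute evaluation described above can be sketched as follows. The sketch assumes that GCE weights the constraint entropy of the class distribution in each value domain by the frequency of that value (mirroring E(A) in ID3), and that the kernel parameter v defaults to the number of classes; both choices, as well as the names `gce` and `select_attribute`, are assumptions made for illustration rather than the source's Formulas (7)–(9).

```python
import math
from collections import Counter, defaultdict

def i_sin(p, v):
    """Assumed peak-shift sine kernel with its summit at p = 1/v."""
    if v <= 1:
        return math.sin(math.pi * p)
    if p <= 1.0 / v:
        return math.sin(0.5 * math.pi * v * p)
    return math.sin(math.pi * (v * p + v - 2) / (2.0 * (v - 1)))

def gce(values, labels, v=None):
    """Assumed GCE: sum_i P(X_ji) * sum_k P(C_k|X_ji) * I_sin(P(C_k|X_ji), v)."""
    if v is None:
        v = len(set(labels))      # assumption: summit at the uniform class distribution
    groups = defaultdict(list)
    for x, y in zip(values, labels):
        groups[x].append(y)
    n = len(labels)
    total = 0.0
    for members in groups.values():
        h = sum((c / len(members)) * i_sin(c / len(members), v)
                for c in Counter(members).values())
        total += len(members) / n * h
    return total

def select_attribute(columns, labels):
    """Return the name of the attribute with the smallest GCE among those with > 1 value."""
    candidates = {a: col for a, col in columns.items() if len(set(col)) > 1}
    return min(candidates, key=lambda a: gce(candidates[a], labels))

labels = ["y", "y", "n", "n"]
columns = {
    "clean": ["a", "a", "b", "b"],   # separates the classes perfectly
    "noisy": ["a", "b", "a", "b"],   # every value domain is split 50/50
    "const": ["c", "c", "c", "c"],   # single value, excluded from the candidates
}
print(select_attribute(columns, labels))
```

In this toy case the perfectly separating attribute has a GCE of essentially zero and is selected, whereas the attribute whose value domains are evenly mixed reaches the maximum of 1.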
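The two depth indices can be sketched under explicit assumptions: here B_conv counts the pure branches directly below X_j plus the pure sub-branches below each non-pure branch, and B_diver adds the fan-out N_j of the node to the fan-outs N_k of its non-pure branches. These counting rules are one reading of the prose, not the source's Formulas (10) and (11), and `choose_child` is a hypothetical callback that supplies the sub-branch attribute X_k.

```python
from collections import defaultdict

def split(rows, attr):
    """Group rows (dicts with a 'class' key) by the value of attr."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[attr]].append(row)
    return parts

def is_pure(rows):
    """True when every row in the subset carries the same class tag."""
    return len({row["class"] for row in rows}) == 1

def depth_indices(rows, xj, choose_child):
    """Return (b_conv, b_diver) over the two levels below a split on xj.

    choose_child(subset, xj) is a hypothetical callback returning the
    attribute X_k != X_j used to split a non-pure branch.
    """
    branches = split(rows, xj)
    b_conv, b_diver = 0, len(branches)            # N_j branches at depth 1
    for subset in branches.values():
        if is_pure(subset):                       # F_l = 1: the branch is a leaf
            b_conv += 1
            continue
        xk = choose_child(subset, xj)
        children = split(subset, xk)
        b_diver += len(children)                  # N_k branches at depth 2
        b_conv += sum(is_pure(c) for c in children.values())   # F_q = 1
    return b_conv, b_diver

rows = [
    {"a": 0, "b": 0, "class": "n"},
    {"a": 0, "b": 1, "class": "n"},
    {"a": 1, "b": 0, "class": "n"},
    {"a": 1, "b": 1, "class": "y"},
]
other = lambda subset, xj: "b" if xj == "a" else "a"
print(depth_indices(rows, "a", other))
```

For this toy set, splitting on `a` yields one pure branch and one branch that `b` resolves into two pure sub-branches, so three branches converge out of four generated.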

Learning Algorithm Based on Constraint Gain and Depth Induction for a Decision Tree
According to the Hunt principle and the above-mentioned definitions, this study proposes an inductive system whose heuristic framework is an optimal measure of category convergence in the attribute space. In this attribute space, the minimal uncertainty distribution is searched for on the basis of the constraint mechanism of strength and summit, which constitutes the constrained gain heuristic of the decision tree learning algorithm (CGDT). Moreover, whilst GCE is used as the main measurement index, the branch convergence, B_conv, and branch fan-out, B_diver, are applied as auxiliary indices among attributes with similar GCE. We aim to select split attributes with strong deep convergence and weak divergence, and thereby form the constraint gained and depth inductive improved decision tree learning algorithm (CGDIDT). The learning algorithms based on the constraint entropy for the decision tree are defined specifically in Algorithms 1 and 2. In Algorithm 1, Leaftype(S) is the function of leaf class judgment, and Effective(S) is the processing function that obtains the valid attribute set of the dataset, S. The complexity of the entire CGDT(S, R) is the same as that of the ID3 algorithm. The core of the algorithm is the attribute selection heuristic based on GCE.
Pruning is turned off in both algorithms. In Algorithm 2, the branch convergence and fan-out indices under the depth are introduced to optimize the learning process of the decision tree on the basis of Algorithm 1.

Algorithm 1. The learning algorithm of the constraint gained decision tree, CGDT (S, R).
Input: Training dataset, S, which has been filtered and labelled.
Output: Decision tree classifier.
Pre-processing: For any sample in the dataset, {Y_1, Y_2, ..., Y_n}: Y_i = {X, T_i}, T_i ∈ C, to obtain the discrete training set.
Initialization: The training set, S, is used as the initial dataset of the decision tree to establish the root node, R, which corresponds to the tree.
1. If Leaftype(S) = C_k, where C_k ∈ C and k∈[0, |C|], then label the corresponding node, R, of the sample set, S, as a leaf of the C_k category, and return.
2. Obtain the valid attribute set of the corresponding dataset, S, of the node: X_e = Effective(S). If X_e is an empty set, then the most frequent class is taken from the set, S, and the node is marked as a leaf and returned. If X_e contains only a single attribute, then this attribute is returned directly as the split attribute of the node.
3. For every attribute, X_i (i∈[0, |X_e|]), in the set, X_e, calculate GCE. The attribute with the minimum uncertainty is selected as the split attribute, A*, of the current node, R.
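The steps of Algorithm 1 can be sketched as a small recursive learner. The kernel and the frequency-weighted form of GCE are assumptions carried over from the descriptions in the text, the recursion into each branch after step 3 is an assumed continuation that the listing does not spell out, and all function names are illustrative.

```python
import math
from collections import Counter, defaultdict

def i_sin(p, v):
    """Assumed peak-shift sine kernel with its summit at p = 1/v."""
    if v <= 1:
        return math.sin(math.pi * p)
    if p <= 1.0 / v:
        return math.sin(0.5 * math.pi * v * p)
    return math.sin(math.pi * (v * p + v - 2) / (2.0 * (v - 1)))

def partition(rows, attr):
    """Split (attrs, label) pairs by the value of one attribute."""
    parts = defaultdict(list)
    for attrs, label in rows:
        parts[attrs[attr]].append((attrs, label))
    return parts

def gce(rows, attr, v):
    """Assumed GCE: value-frequency-weighted constraint entropy of the classes."""
    total = 0.0
    for part in partition(rows, attr).values():
        counts = Counter(label for _, label in part).values()
        h = sum(c / len(part) * i_sin(c / len(part), v) for c in counts)
        total += len(part) / len(rows) * h
    return total

def effective(rows, attrs):
    """Effective(S): attributes that still take more than one value in S."""
    return [a for a in attrs if len({x[a] for x, _ in rows}) > 1]

def cgdt(rows, attrs, n_classes):
    labels = [label for _, label in rows]
    if len(set(labels)) == 1:                        # step 1: pure node becomes a leaf
        return labels[0]
    x_e = effective(rows, attrs)                     # step 2: valid attribute set
    if not x_e:
        return Counter(labels).most_common(1)[0][0]  # most frequent class
    best = min(x_e, key=lambda a: gce(rows, a, n_classes))   # step 3: minimum GCE
    rest = [a for a in attrs if a != best]
    # Assumed continuation: recurse into each branch of the split attribute.
    return {best: {val: cgdt(part, rest, n_classes)
                   for val, part in partition(rows, best).items()}}

def classify(tree, attrs):
    """Follow the branches of a learnt tree down to a leaf label."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[attrs[attr]]
    return tree

rows = [({"a": 0, "b": 0}, "n"), ({"a": 0, "b": 1}, "n"),
        ({"a": 1, "b": 0}, "n"), ({"a": 1, "b": 1}, "y")]
tree = cgdt(rows, ["a", "b"], n_classes=2)
print(tree)
```

No pruning is applied, in line with the setting described for the algorithms, and `classify` raises a `KeyError` for attribute values that were not seen during training.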

Algorithm 2. Constraint gained and depth inductive decision tree algorithm, CGDIDT (S, R).
Input: Training dataset, S, which has been filtered and labelled.
Output: Decision tree classifier.
1. Judge whether Leaftype(S) = C_k (C_k and S are defined as in Algorithm 1); when it is true, label the corresponding node, R, of the sample set, S, as a leaf of the C_k category, and return.
2. Obtain the valid attribute set of the corresponding dataset, S, of the node: X_e = Effective(S). If X_e is an empty set, then the most frequent class is taken from the set, S, and the node is marked as a leaf and returned. If X_e contains only a single attribute, then return this attribute directly as the split attribute of the node, R.
3. Establish an empty set, H, for the candidate split attributes. Firstly, obtain the smallest constraint gain, f, over the attributes of the set, S, that is, f = Min{GCE(X_i), i∈[0, |X_e|]}. Secondly, determine the candidate attributes whose GCE is the same as or similar to the minimum value, such as GCE ≤ (1 + r)f, where r∈[0, 0.5]; these candidate attributes are placed in the set, H.
4. For the candidate attribute set, H, of the current node, calculate the depth branch convergence number, B_conv, and the depth branch fan-out number, B_diver, of each attribute. If the attribute with the optimal B_conv is not the same as the GCE minimal attribute in the set, H, then select the attribute with the larger B_conv and smaller B_diver as the improved attribute. If the attribute with the optimal B_conv and the GCE minimal attribute are the same attribute in the set, H, then the GCE minimum evaluation is used as the preferred selection criterion for the split attribute, A*, of the current node and even of the subsequent branch nodes.
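Steps 3 and 4 can be sketched as a small selection routine. The threshold GCE ≤ (1 + r)f follows the listing; the ordering used to break ties in `pick_split` (larger B_conv first, then smaller B_diver, then smaller GCE) is an assumption, and the index values in the example are invented.

```python
def candidate_set(gce_by_attr, r=0.1):
    """Attributes whose GCE is within a factor (1 + r) of the minimum value f."""
    if not 0.0 <= r <= 0.5:
        raise ValueError("r is expected in [0, 0.5]")
    f = min(gce_by_attr.values())
    return {a for a, g in gce_by_attr.items() if g <= (1.0 + r) * f}

def pick_split(gce_by_attr, b_conv, b_diver, r=0.1):
    """Among the candidates, prefer larger B_conv, then smaller B_diver, then smaller GCE."""
    h = candidate_set(gce_by_attr, r)
    return min(h, key=lambda a: (-b_conv[a], b_diver[a], gce_by_attr[a]))

gce_by_attr = {"a": 0.40, "b": 0.42, "c": 0.60}
b_conv = {"a": 2, "b": 3, "c": 5}
b_diver = {"a": 4, "b": 4, "c": 6}

print(candidate_set(gce_by_attr, r=0.1))
print(pick_split(gce_by_attr, b_conv, b_diver, r=0.1))
```

Attribute `c` converges best but is excluded because its GCE is far from the minimum; within the candidate set, `b` is preferred over the GCE-minimal `a` because it converges more branches at the same fan-out. With r = 0 the routine reduces to selecting among attributes tied at the minimum GCE.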

Experimental Setup
In this section, we use 11 discretized and complete datasets from the UCI machine learning repository as the original sample sets to verify the performance of the CGDT and CGDIDT algorithms. The details of the datasets are provided in Table 1. First, the representative Balance, Tic-Tac-Toe and Dermatology datasets were selected for the peak shift experiment to observe the effects of the peak movement of the entropy core on decision tree learning.
Then, the CGDT and CGDIDT algorithm experiments were performed separately on the 11 datasets using the experimental system designed in this study, whereas the ID3 and C4.5 (J48) decision tree algorithms were run as references in the Weka system. When the same dataset was used on the two systems, the same training and test sets were used for all algorithms. Before the experiment, each dataset was sampled uniformly and without replacement in accordance with a fixed proportion, α; the extracted samples constituted the training set and the remaining samples constituted the test set. In this study, a sampling proportion of α = 70% was used first for the training set. Even for the Monks datasets, which come with predefined training sets, the experiment used only an α-proportion sample of the provided training set for learning, to verify the adaptability of the learning algorithm, whereas all the remaining data were used for testing.
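The sampling protocol can be illustrated with a short sketch. The function name, the fixed seed, and the rounding of the training set size are assumptions made for the example; the text only specifies uniform sampling without replacement at proportion α.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_without_replacement(
    samples: Sequence[T], alpha: float = 0.7, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Draw round(alpha * n) samples uniformly without replacement for training.

    The remaining samples form the test set. The seed is only there to make
    the split reproducible across the compared algorithms (an assumption).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    rng = random.Random(seed)
    n_train = round(alpha * len(samples))
    train_idx = set(rng.sample(range(len(samples)), n_train))
    train = [s for i, s in enumerate(samples) if i in train_idx]
    test = [s for i, s in enumerate(samples) if i not in train_idx]
    return train, test


if __name__ == "__main__":
    train, test = split_without_replacement(list(range(100)), alpha=0.7, seed=1)
    print(len(train), len(test))
```

Reusing one seed for every algorithm is one way to guarantee that they all see the same training and test sets.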
The classifier scale (Size) of the decision tree learnt on the training set, the accuracy (Acc) on the test set, the F-measure, and the test coverage (Cov) were the indicators used to compare and evaluate the algorithms in the experiment. The indicators are defined in Equations (12) and (13), where n_c is the number of samples that have been validated in the test set, n_t is the total number of tested samples, Ns is the number of nodes of the decision tree learnt on the training set, and Ls is the number of leaves of the decision tree learnt on the training set.
$$\text{F-measure} = \frac{2 \cdot \mathrm{Acc} \cdot \mathrm{precision}}{\mathrm{Acc} + \mathrm{precision}}, \quad \mathrm{Cov} = \frac{n_t - n_f}{n_t}, \quad \mathrm{precision} = \frac{n_c}{n_t - n_f} \tag{13}$$
where n_f is the number of samples for which testing failed in the test set, and precision is the accuracy on the test set excluding the testing failures.
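As a worked illustration of these indicators, the sketch below computes them from raw counts. The function name and the example counts are hypothetical, and Acc is taken to be n_c / n_t, which is an assumption based on the definitions of n_c and n_t above. Under that assumption, Acc equals precision multiplied by Cov.

```python
from typing import Dict


def evaluate(n_c: int, n_f: int, n_t: int) -> Dict[str, float]:
    """Compute Acc, Cov, precision and F-measure from test-set counts.

    n_c: validated samples, n_f: samples for which testing failed,
    n_t: total tested samples. Acc = n_c / n_t is an assumption here.
    """
    if n_t <= 0 or not 0 <= n_f < n_t or not 0 <= n_c <= n_t - n_f:
        raise ValueError("inconsistent counts")
    acc = n_c / n_t                      # assumed form of Acc
    cov = (n_t - n_f) / n_t              # Equation (13)
    precision = n_c / (n_t - n_f)        # Equation (13)
    # Guard against 0 / 0 when nothing was validated.
    f_measure = 2 * acc * precision / (acc + precision) if n_c else 0.0
    return {"Acc": acc, "Cov": cov, "precision": precision, "F-measure": f_measure}


if __name__ == "__main__":
    # Hypothetical counts: 100 tested samples, 10 failures, 72 validated.
    for name, value in evaluate(n_c=72, n_f=10, n_t=100).items():
        print(f"{name}: {value:.4f}")
```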

Influence of the Entropy Peak Shift to Decision Tree Learning
In this study, we initially conducted an experiment on the influence of the peak shift of the constraint entropy on decision tree learning. The experiment used the training and test sets of Balance, Tic-Tac-Toe, and Dermatology. The experiment result is presented in Figures 5-7.
Figure 6 illustrates the training case of the Tic-tac-toe dataset. The general form of the decision tree, Acc, exhibits an initially high and then low distribution; that is, it presents a relatively high distribution at section [0.1, 0.7], in which it maintains a period of high value at section [0.25, 0.35], and then demonstrates a low distribution at section [0.8, 0.9], with a steep decline at section > 0.8. However, the numbers of nodes and leaves of the decision tree (Size) exhibit a stable low-value distribution at section [0.1, 0.7], which indicates an improved reverse distribution with Acc.
Similarly, for the Dermatology training set, the general form of the decision tree, Acc, also displays an initially high and then low distribution; that is, it first achieves a relatively high distribution at section [0.1, 0.3], and then exhibits a low distribution and an apparent hollow bucket shape at section [0.4, 0.9], in which section [0.1, 0.3] is the stable high section of Acc, whereas section [0.45, 0.55] is the lowest section. Simultaneously, the numbers of nodes and leaves of a decision tree (Size) display a considerably reverse distribution with Acc, which is low at section [0.1, 0.3], but high at section [0.4, 0.9]. However, a stronger volatility is achieved, which is lowest at a low SP section and highest at a high SP section.
The preceding analysis implies that although the Balance and Tic-tac-toe training sets with the same number of attribute values exhibit a better stability distribution of classification performance and size than the Dermatology training set, which has a different number of attribute values, they all have the same volatility and regularity is evident. That is, the Balance set at section [0.1, 0.3], Tic-tac-toe [...]
(1) For the Tic-tac-toe set with the same number of attribute values and a similar proportion of sample categories, the accuracy rate and size (numbers of nodes and leaves) exhibit the best expression for the decision tree when the peak of the entropy core, SP, is constrained at P = 1/3.
(2) For the Balance set with the same number of attribute values and a slightly different proportion of sample categories, the corresponding numbers of nodes and leaves present the most stable expression when SP is constrained at P = 1/5, although the accuracy rate of the decision tree does not reach the maximum value.
(3) For the Dermatology set with different attribute values and a similar difference in the proportion of sample categories, the corresponding numbers of nodes and leaves are small when SP is constrained at P ≤ 1/4, whereas the accuracy rate of the decision tree is high.
The experiment on the three representative datasets shows that measuring uncertainty by the constraint entropy, in which the peak of the entropy core is shifted dynamically in accordance with SP = 1/v, enhances the rationality of split attribute selection and is effective for decision tree induction.
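The idea of moving the kernel peak to SP = 1/v can be made concrete with a small numerical sketch. The Shannon kernel −p log2 p always peaks at p = 1/e, whatever the number of values v. The piecewise sine kernel below is a purely illustrative stand-in whose peak can be placed at any SP; it is an assumption introduced for this example and is not the peak-shift sine function defined earlier in the paper. Reading v as the number of values of an attribute is likewise an assumption based on items (1) to (3).

```python
import math
from typing import Callable


def shannon_kernel(p: float) -> float:
    """Standard entropy kernel -p * log2(p); its maximum is at p = 1/e."""
    return -p * math.log2(p) if p > 0.0 else 0.0


def peak_shift_sine_kernel(p: float, sp: float) -> float:
    """Illustrative kernel (hypothetical): 0 at p = 0 and p = 1, peak 1 at p = sp."""
    if not 0.0 < sp < 1.0 or not 0.0 <= p <= 1.0:
        raise ValueError("expected 0 < sp < 1 and 0 <= p <= 1")
    if p <= sp:
        return math.sin(math.pi * p / (2.0 * sp))            # rising quarter wave
    return math.cos(math.pi * (p - sp) / (2.0 * (1.0 - sp)))  # falling quarter wave


def argmax_on_grid(kernel: Callable[[float], float], steps: int = 1000) -> float:
    """Locate the kernel maximum on a uniform grid over [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=kernel)


if __name__ == "__main__":
    print("Shannon kernel peak:", argmax_on_grid(shannon_kernel))
    for v in (3, 5):  # e.g. attributes with three or five values
        sp = 1.0 / v
        peak = argmax_on_grid(lambda p: peak_shift_sine_kernel(p, sp))
        print(f"v = {v}: shifted peak near {peak:.3f}")
```

The stand-in kernel keeps the two properties the discussion relies on: it vanishes for certain events (p = 0 or p = 1) and its maximum sits at the chosen SP.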

Effects of Decision Tree Learning Based on Constraint Gain
In accordance with the rules obtained from the preceding experiment on the peak shift of the entropy core, the constraint entropy with improved dynamic peak localization is determined. It forms the basis of the GCE heuristic, which in turn yields CGDT, the decision tree algorithm based on constraint gain heuristic learning. In this section, 11 datasets were used to compare GCE with the Gain and Gainratio heuristics (pruning off). The experimental results are presented in Table 2. [...] datasets are the same or close to one another. The Gainratio heuristic has two datasets with smaller decision tree classifiers than those of the Gain heuristic. The other datasets have decision tree classifiers that are considerably larger than those of the Gain heuristic.
For the larger Mushroom dataset, a learning experiment with the sampling proportion reduced to 50% was also conducted. The classification accuracy, Acc, of the GCE heuristic is better than those of the Gain and Gainratio heuristics. The size of its decision tree classifier is larger than that of the Gain heuristic and smaller than that of the Gainratio heuristic.
Averaged over all the datasets, the numbers of branch nodes (30.6364) and leaves (70) of the GCE heuristic are extremely close to those of the Gain heuristic (29.9091, 69) and considerably less than those of the Gainratio heuristic. The average size of the GCE heuristic's decision tree classifier is therefore close to that of the Gain heuristic. The average accuracy of the GCE heuristic (82.6265) is better than those of the Gain and Gainratio heuristics (80.8023 and 78.3007, respectively).
On average, the GCE heuristic based on the constraint entropy for a decision tree achieves better classification accuracy than the Gain and Gainratio heuristics.

Effect of Optimized Learning of Combining Depth Induction
In the preceding section, the classifier for a decision tree is established through the heuristic learning of GCE in the CGDT algorithm. Its size characteristics show that the split attribute of tree nodes should still be optimized in inductive convergence. CGDIDT is a learning algorithm with depth induction optimization that is based on the GCE selection for a decision tree. It is compared with ID3 and J48 of the Weka system. The experimental results are presented in Table 3.
The classification results of the CGDIDT decision tree demonstrate that its accuracy is greater than that of the ID3 algorithm on 10 datasets, with differences ranging from 1.4085 to 14.6104, and equal on one dataset; the average difference is 4.7312. Its size is smaller than that of ID3 on nine datasets, with differences ranging from −1 to −104. Its F-measure is greater than that of ID3 on 10 datasets, and its coverage is greater on eight datasets.
CGDIDT is also compared with the J48 algorithm. Its accuracy is better on five datasets (with an average difference of 8.3968), equal on one dataset and weaker on five datasets (with an average difference of −5.2556); the overall average difference is 1.4278. Meanwhile, its size is larger than that of J48 on 10 datasets. Its F-measure is better on six datasets, and its coverage is smaller on seven datasets and equal on three.
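The overall figure for the comparison with J48 can be reproduced from the group averages, reading 8.3968 and −5.2556 as the mean accuracy differences over the five better and the five weaker datasets, respectively. The short check below is only an arithmetic illustration; the function name is hypothetical.

```python
from typing import List, Tuple


def overall_mean_difference(groups: List[Tuple[int, float]]) -> float:
    """Weighted mean of per-group mean differences; groups are (count, mean)."""
    total = sum(count for count, _ in groups)
    return sum(count * mean for count, mean in groups) / total


# CGDIDT vs. J48 accuracy: five better, one equal, five weaker datasets.
cgdidt_vs_j48 = [(5, 8.3968), (1, 0.0), (5, -5.2556)]

if __name__ == "__main__":
    print(round(overall_mean_difference(cgdidt_vs_j48), 4))  # prints 1.4278
```

The weighted sum is 5 × 8.3968 − 5 × 5.2556 = 15.706, and dividing by the 11 datasets gives 1.4278, in agreement with the reported overall average difference.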
Finally, the J48 algorithm is compared with ID3. Its accuracy is better on seven datasets, equal on one dataset and weaker on five datasets. The average difference is 3.2854. Moreover, its size is smaller on 11 datasets. Its F-measure is better on seven datasets, and its coverage is smaller on eight datasets.
For the larger Mushroom dataset, the experimental results of reducing the sampling proportion to 50% are as follows: The classification performance (Acc and F-measure) of the CGDIDT algorithm is better than that of ID3 and the same as that of J48. Meanwhile, the size of CGDIDT is smaller than those of ID3 and J48.
In conclusion, the classification accuracy and F-measure of the CGDIDT algorithm are better on average than those of the ID3 and J48 algorithms of the Weka system. The average classification performance is further improved compared with that of CGDT. The average size and coverage of the classifier are considerably improved compared with those of ID3. However, the classifier scale is weaker than that of the J48 algorithm, because the built-in pruning of J48 plays an evident role in the Weka system.
Note: The sign '**' denotes the sampling proportion α = 50%, and its results are not considered in the average calculation. The sign '*' has the same meaning as in Table 1.

Conclusions
This study proposed an optimal learning algorithm based on the constraint gain for a decision tree. This study first analyzed the uncertainty distributions of single-event and multi-event entropies in accordance with the composition of information entropy. It found that the peak entropy value is enhanced as the number of events grows and that the peak position is related to the reciprocal of the number of events. Hence, by replacing the information entropy kernel with the peak-shift sine function to obtain an uncertainty-estimating entropy with enhanced constraint, we proposed an attribute selection heuristic based on constraint gain and obtained the learning algorithm, CGDT. Then, we built an optimal learning algorithm, CGDIDT, using the branch convergence and fan-out indices within the inductive depth of a decision tree to assist in optimizing the selection of the split attribute. The comparison of the experimental results showed that the classification accuracy of a decision tree based on the GCE heuristic is on average superior to those of Gain and Gainratio. The size of the tree built with the GCE heuristic is close to that of Gain and smaller than that of Gainratio. Finally, the average Acc and F-measure of the proposed CGDIDT algorithm are superior to those of ID3 and J48, whereas its size is generally smaller than that of ID3, but larger than that of J48.
In terms of the classifier size of the decision tree, CGDIDT was weaker than the pruned algorithm, although it improved on ID3. This remains a topic for future research and improvement.