A Two-Parameter Fractional Tsallis Decision Tree

Decision trees are decision support data mining tools that create, as the name suggests, a tree-like model. The classical C4.5 decision tree, based on Shannon entropy, is a simple algorithm that computes the gain ratio and splits the attributes according to this entropy measure. Tsallis and Renyi entropies (instead of Shannon) can be employed to generate a decision tree with better results. In practice, the entropic index parameter of these entropies is tuned to outperform the classical decision trees. However, this tuning is carried out by testing a range of values for a given database, which is time-consuming and unfeasible for massive data. This paper introduces a decision tree based on a two-parameter fractional Tsallis entropy. We propose a constructionist approach that represents databases as complex networks, which enables an efficient computation of the parameters of this entropy using the box-covering algorithm and renormalization of the complex network. The experimental results support the conclusion that the two-parameter fractional Tsallis entropy is a more sensitive measure than its parametric Renyi, Tsallis, and Gini index precedents for a decision tree classifier.


Introduction
Entropy is a measure of the unpredictability of the state of a physical system; it quantifies the degree of disorder of its full micro-structure. Claude Elwood Shannon [1] defined an entropy measure to quantify the amount of information in a digital system in the context of communication theory; it has since been applied in a variety of fields such as information theory, complex networks, and data mining techniques.
The most widely used form of the Shannon entropy is given by

S = -\sum_{i=1}^{N} p_i \ln p_i, (1)

where N is the number of possibilities p_i and \sum_{k=1}^{N} p_k = 1. Two celebrated generalizations of the Shannon entropy are the Renyi [2] and Tsallis [3] entropies. Alfred Renyi proposed a universal formula defining a family of entropy measures [2],

R_q = \frac{1}{1-q} \ln \sum_{i=1}^{N} p_i^q, (2)

where q denotes the order of the moments. Constantino Tsallis proposed the q-logarithm, defined by

\ln_q x = \frac{x^{1-q} - 1}{1-q}, (3)

to introduce a physical entropy given by [3]

S_q = \frac{1}{q-1}\left(1 - \sum_{i=1}^{N} p_i^q\right). (4)

The Tsallis entropy can be rewritten [4][5][6] as

S_q = D_q^t \sum_{i=1}^{N} p_i^{-t} \Big|_{t=-1},

where D_q^t stands for the Jackson [7] q-derivative of a function f,

D_q^t f(t) = \frac{f(qt) - f(t)}{(q-1)t}, \quad t \neq 0,

which reflects that it is an extension of the Shannon entropy. The Renyi and Tsallis entropy measures depend on the parameter q, which describes their deviation from the standard Shannon entropy. Both entropies converge to the Shannon entropy in the limit q → 1. For complex network applications [8] and data mining techniques [9][10][11][12][13][14][15][16][17], the parameter q varies over a range of values. On the other hand, the computation of the entropic index q of the Tsallis entropy was implemented for physics applications in [18][19][20][21][22][23][24][25].
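For concreteness, the three measures can be sketched in a few lines of Python (a minimal illustration with our own function names; the natural logarithm is used throughout):

```python
import math

def shannon(p):
    """Shannon entropy: S = -sum_i p_i ln p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def renyi(p, q):
    """Renyi entropy of order q != 1: R_q = ln(sum_i p_i^q) / (1 - q)."""
    return math.log(sum(pi ** q for pi in p)) / (1.0 - q)

def tsallis(p, q):
    """Tsallis entropy for q != 1: S_q = (1 - sum_i p_i^q) / (q - 1)."""
    return (1.0 - sum(pi ** q for pi in p)) / (q - 1.0)

p = [0.5, 0.25, 0.25]
# Both generalizations approach the Shannon entropy as q -> 1.
for q in (1.5, 1.1, 1.01):
    print(q, renyi(p, q), tsallis(p, q))
print(shannon(p))
```

Running the loop shows the two parametric entropies closing in on the Shannon value as q approaches 1.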
The Shannon and Tsallis entropies can be obtained by applying the standard derivative or the q-derivative, respectively, to the same generating function \sum_{i=1}^{N} p_i^{-t} with respect to the variable t and then letting t → −1. This approach can be used to derive different entropy measures based on the action of appropriate fractional-order differentiation operators [26][27][28][29][30][31][32].
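This derivation can be checked numerically: applying the ordinary derivative or the Jackson q-derivative to g(t) = \sum_i p_i^{-t} at t = −1 recovers the Shannon and Tsallis entropies, respectively (a sketch with our own helper names):

```python
import math

def g(p, t):
    """Generating function g(t) = sum_i p_i^(-t)."""
    return sum(pi ** (-t) for pi in p)

def jackson_derivative(f, q, t):
    """Jackson q-derivative: D_q f(t) = (f(q t) - f(t)) / ((q - 1) t), t != 0."""
    return (f(q * t) - f(t)) / ((q - 1.0) * t)

p = [0.5, 0.25, 0.25]

# Ordinary derivative of g at t = -1 (central difference) -> Shannon entropy.
h = 1e-6
shannon_from_g = (g(p, -1 + h) - g(p, -1 - h)) / (2 * h)
shannon_direct = -sum(pi * math.log(pi) for pi in p)

# Jackson q-derivative of g at t = -1 -> Tsallis entropy (exact, no limit needed).
q = 1.7
tsallis_from_g = jackson_derivative(lambda t: g(p, t), q, -1.0)
tsallis_direct = (1.0 - sum(pi ** q for pi in p)) / (q - 1.0)

print(abs(shannon_from_g - shannon_direct), abs(tsallis_from_g - tsallis_direct))
```

Both differences are negligible, confirming that the two entropies arise from the same generating function under different differentiation operators.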
The major goal of this work is to introduce a new decision tree based on a two-parameter fractional Tsallis entropy. This new kind of tree is tested on twelve databases for a classification task. The structure of the paper is as follows. Section 2 focuses attention on the notion of two-parameter fractional Tsallis entropy. In Section 3, two-parameter fractional Tsallis decision trees and a constructionist approach to the representation of databases as complex networks are introduced. The basic facts on the box-covering algorithm of a complex network are reviewed. Finally, we compute an approximation set of the parameters q, α, and β of the two-parameter fractional Tsallis entropy. Section 4 is concerned with the testing of two-parameter fractional Tsallis decision trees on twelve databases. Next, the approximations of the q values are tested on the Renyi and Tsallis entropies. Discussion of the findings of this study and concluding remarks are offered in Section 5.

Parametric Decision Trees
A decision tree is a supervised data mining technique that creates a tree-like structure, where each non-leaf node tests a given attribute [45]. The outcome of the test gives the path to reach a leaf node, where the classification label is found. For example, let (x = 3, y = 1) be a tuple to be classified by the decision tree of Figure 1. If we test x = 1, we must follow the left path to reach y = 1 and finally arrive at the leaf node with the classification label "a".
In general, the cornerstone of the construction process of decision trees is the evaluation of all attributes to find the best node and the best split condition on this node so as to classify the tuples with the lowest error rate. This evaluation is carried out by the information gain of each attribute a [45]:

Gain(a, c) = I(D) - I_c(D), (14)

where I(D) is the entropy of the database D and I_c(D) is the entropy induced after D is partitioned by the condition c on the attribute a. The construction of the tree evaluates several partition conditions c on all attributes of the database and then chooses the attribute-condition pair with the highest gain. Once a pair is chosen, the process recursively evaluates the partitioned database using a different attribute-condition pair. The reader is referred to [45] for details on decision tree construction and the computation of (14).
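A minimal sketch of this evaluation in Python (Shannon entropy and a single binary condition; function names and the toy data are ours):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr, condition):
    """Gain(a, c) = I(D) - I_c(D): the entropy of D minus the weighted
    entropy of the two partitions induced by condition c on attribute a."""
    left = [y for x, y in zip(rows, labels) if condition(x[attr])]
    right = [y for x, y in zip(rows, labels) if not condition(x[attr])]
    n = len(labels)
    weighted = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - weighted

rows = [{"x": 1}, {"x": 2}, {"x": 3}, {"x": 4}]
labels = ["a", "a", "b", "b"]
# The condition x <= 2 separates the classes perfectly, so the gain is maximal.
print(information_gain(rows, labels, "x", lambda v: v <= 2))  # -> 1.0
```

A tree builder would loop this computation over all candidate attribute-condition pairs and split on the best one.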

Two-Parameter Fractional Tsallis Decision Tree
In a similar fashion, a two-parameter fractional decision tree can be induced by the information gain obtained by rewriting (14) using (13). An alternative informativeness measure for constructing decision trees is the Gini index, or Gini coefficient, calculated by

Gini(D) = 1 - \sum_{i=1}^{N} p_i^2.

The Gini index can be deduced from the Tsallis entropy (4) by setting q = 2 [14]. On the other hand, the two-parameter fractional Tsallis entropy with q = 2, α = 1, β = 1 also reduces to the Gini index. Hence, Gini decision trees are a particular case of both Tsallis and two-parameter fractional Tsallis trees.
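The reduction to the Gini index at q = 2 is easy to verify numerically (a minimal check with our own function names):

```python
def tsallis(p, q):
    """Tsallis entropy: (1 - sum_i p_i^q) / (q - 1)."""
    return (1.0 - sum(pi ** q for pi in p)) / (q - 1.0)

def gini(p):
    """Gini index: 1 - sum_i p_i^2."""
    return 1.0 - sum(pi ** 2 for pi in p)

p = [0.5, 0.3, 0.2]
# At q = 2 the denominator q - 1 equals 1, so the two measures coincide.
print(tsallis(p, 2.0), gini(p))
```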
The main issue with Renyi and Tsallis decision trees is the estimation of the q-value needed to obtain a better classification than the one produced by the classical decision trees. Trial and error is the accepted approach for this purpose: several values in a given interval, usually [−10, 10], are tested and the classification rates are compared. This approach becomes unfeasible for two-parameter fractional Tsallis decision trees, since q, α, and β must all be tuned. A representation of the database as a complex network is introduced to face this issue. This representation lets us compute α and β following the approach in [22], which is the basis for determining the fractional decision tree parameters.

Network's Construction
A network is a powerful tool to model the relationships among the entities or parts of a system. When those relationships are complex, i.e., exhibit properties that cannot be found by examining single components, what emerges is called a complex network. Thus, networks, as the skeleton of complex systems [46], have attracted considerable attention in different areas of science [47][48][49][50][51]. Following this approach, a representation of the relationships among the attributes (system entities) of a database (system) as a network is obtained.
The attribute's name is concatenated before the value of a given row to distinguish the same value that might appear in different attributes. Consider the first record of the database shown at the top of Figure 2. The first node will be NAME.BruceDickinson, the second node will be PHONE.54-76-90, and the third node will be ZIP.08510. These nodes belong to the same record, so they must be connected; see the dotted lines of the network in the middle of Figure 2. We next consider the second record; the nodes NAME.MichaelKiske and PHONE.87-34-67 will be added to the network. Note that the node ZIP.08510 was added in the previous step. We may now add the links between these three nodes. This procedure is repeated for each record in the database. The outcome is a complex network that exhibits non-trivial topological features [52], which cannot be predicted by analyzing single nodes, as occurs in random graphs or lattices [53].
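The construction just described can be sketched as follows (plain Python with our own helper name; nodes are ATTRIBUTE.value strings and every record forms a clique):

```python
from itertools import combinations

def database_to_network(records):
    """Build an undirected graph whose nodes are ATTRIBUTE.value strings;
    all nodes coming from the same record are pairwise connected."""
    nodes, edges = set(), set()
    for record in records:
        labels = [f"{attr.upper()}.{value}" for attr, value in record.items()]
        nodes.update(labels)
        for u, v in combinations(labels, 2):
            edges.add(tuple(sorted((u, v))))  # undirected: store sorted pairs
    return nodes, edges

records = [
    {"name": "BruceDickinson", "phone": "54-76-90", "zip": "08510"},
    {"name": "MichaelKiske", "phone": "87-34-67", "zip": "08510"},
]
nodes, edges = database_to_network(records)
# ZIP.08510 is shared, so the two record cliques are connected through it.
print(len(nodes), len(edges))  # -> 5 6
```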

Computation of Two-Parameter Fractional Tsallis Decision Tree Parameters
Using the technique introduced in [22], the parameters α and β of the two-parameter fractional Tsallis decision tree are computed on the network representation of the database. Two values of α are defined (Equations (19) and (20)), where |G_i| is the number of nodes in the box G_i obtained by the box-covering algorithm [54], n is the number of nodes of the network, and innerdeg(G_i) is the average degree of the nodes of the box G_i. Similarly, two values of β are computed (Equations (21) and (22)) [22], where l ≥ 2 is the diameter of the box G_i, ∆ is the diameter of the network, and outerdeg(G_i) is the number of links between G_i and the other boxes. The computation of innerdeg and outerdeg will be explained later. Inspired by the right-hand terms of (19) and (20) (named α′), together with the fact that δ is a normalized measure of the number of boxes needed to cover the network [20], an approximation of the q-value for the two-parameter fractional decision tree is introduced (Equation (23)). Similarly, from the right-hand terms of (21) and (22) (named β′), a second approximation of the q-value is given (Equation (24)), where N_b(l) is the minimum number of boxes of diameter l needed to cover the network, and n and ∆ are as above. The process to compute the minimum number of boxes N_b of diameter l to cover the network G is shown in Figure 3. A dual network G′ is created with only the nodes of the original network; see Figure 3b. Then, the links in G′ are added following the rule: two nodes i, j in the dual network are connected if the distance l_ij between them is greater than or equal to l. In our example, l = 3, and node one is selected to start. Node one will be connected in G′ with nodes five and six, since their distances are four and three, respectively. The procedure is repeated with the remaining nodes to obtain the dual network shown in Figure 3b. Next, the nodes are colored as follows: two directly connected nodes in G′ must not have the same color. Finally, the colored nodes of G′ are mapped back to G; see Figure 3c. The minimum number of boxes to cover the network for a given l equals the number of colors in G′.
In addition, nodes with the same color belong to the same box. In practice, l ∈ [1, ∆]; the values of N_b(l) for the example are shown in Table 1. For details of the box-covering algorithm, the reader is referred to [54]. Now we are ready to compute innerdeg. Two boxes were found, following the previous example, for l = 3; see Figure 4a. Then innerdeg(G_1) = 2 is the average number of links per node among the nodes of this box; for this reason, the link between nodes four and six is omitted in this computation. Similarly, innerdeg(G_2) = 1. The outerdeg is the degree of each node of the renormalized network; see the network of Figure 4b.
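The dual-network procedure just described can be sketched as follows (pure Python; greedy colouring only approximates the NP-hard minimum, and the six-node path graph below is our own toy example, not the network of Figure 3):

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def box_covering(adj, l):
    """Box covering via the dual network: two nodes are linked in the dual
    when their distance is >= l; a greedy colouring of the dual yields the
    boxes (one per colour), so every box has diameter < l."""
    nodes = list(adj)
    dist = {u: bfs_distances(adj, u) for u in nodes}
    dual = {u: {v for v in nodes
                if v != u and dist[u].get(v, float("inf")) >= l}
            for u in nodes}
    colour = {}
    for u in nodes:
        used = {colour[v] for v in dual[u] if v in colour}
        colour[u] = next(c for c in range(len(nodes)) if c not in used)
    boxes = {}
    for u, c in colour.items():
        boxes.setdefault(c, set()).add(u)
    return list(boxes.values())

# Toy example: a path 1-2-3-4-5-6 (diameter 5), covered with l = 3.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5}}
print(len(box_covering(adj, 3)))  # -> 2 boxes: {1, 2, 3} and {4, 5, 6}
```

As expected, N_b(1) equals the number of nodes, since every node at distance one or more must receive its own colour.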
In our example, outerdeg(G_1) = outerdeg(G_2) = 1. The renormalization converts each box into a super-node, preserving the connections between boxes. On the other hand, it is known that N_b(1) = n and N_b(∆ + 1) = 1; in the first case, each box contains a single node, and in the second, a single box covering the network contains all nodes. For this reason, innerdeg and outerdeg are not defined for l = 1 and l = ∆ + 1, respectively. This forces l ∈ [2, ∆], as stated in (19)-(24). Additionally, note that the right-hand terms of (19) and (20) (α′) and of (21) and (22) (β′) are "pseudo-matrices", where each row has N_b(l) values; see Table 1. Consequently, q_α and q_β are also "pseudo-matrices". The network represents the relationships between the attribute values (nodes) of each record and the relationships between different database records. For example, the dotted lines in Figure 2 show the relationships between the first record's attribute values. The links of the node ZIP.08510 are the relationships between the three records, and the links of PHONE.54-76-90 are the relationships between the first and third ones. The box-covering algorithm groups these relationships into boxes (network records). The network in the middle of Figure 2 shows that the three boxes (in orange, green, and blue) coincide with the number of records in the database. However, the attribute values of each box do not coincide entirely with the records in the database, since box-covering finds the minimum number of boxes with the maximum number of attributes such that the boxes are mutually exclusive.
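The renormalization step, together with one plausible reading of innerdeg and outerdeg, can be sketched as follows (our own toy path graph and box assignment; the exact definitions are those of [22]):

```python
def renormalize(adj, boxes):
    """Collapse every box into a super-node; two super-nodes are linked
    when any pair of their members is linked in the original network."""
    box_of = {u: i for i, box in enumerate(boxes) for u in box}
    super_adj = {i: set() for i in range(len(boxes))}
    for u in adj:
        for v in adj[u]:
            if box_of[u] != box_of[v]:
                super_adj[box_of[u]].add(box_of[v])
    return super_adj

def inner_degree(adj, box):
    """Average internal degree of a box: links with both endpoints inside
    the box, counted per node (links leaving the box are omitted, as in
    the example of Figure 4a; this reading of innerdeg is an assumption)."""
    ends = sum(1 for u in box for v in adj[u] if v in box)  # each link counted twice
    return ends / len(box)

# Path 1-2-3-4-5-6 split into the two boxes found for l = 3.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5}}
boxes = [{1, 2, 3}, {4, 5, 6}]
super_adj = renormalize(adj, boxes)
outer_degrees = {i: len(nbrs) for i, nbrs in super_adj.items()}
print(inner_degree(adj, boxes[0]), outer_degrees)
```

In this toy network the two super-nodes are joined by the single 3-4 link, so each box has outerdeg 1, mirroring the outerdeg(G_1) = outerdeg(G_2) = 1 of the paper's example.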
The nodes in each box (network record) are enough to differentiate the records in the database. For example, the first network record consists of name, phone, and zip values (nodes in orange). The second record in the database can be differentiated from the first by its name and phone (the values of those attributes form the second network record, in green). The third one can be distinguished from the other two by its name (the third network record, in blue). The cost of differentiating the first network record (measured by innerdeg) is the highest, while the lowest is for the third. Thus, α′ measures the local differentiation cost for the network records.
On the other hand, β′ measures the global differentiation cost (by outerdeg). For example, the global cost for the first network record is two, and one for the second and third; see the renormalized network (at the bottom of Figure 2). This means that the first network record needs to be differentiated from two network records, whereas the second and third only need to be distinguished from the first. Note that α′ and β′ for a given l rely on the network topology, which captures the relationships of the records and their values. Finally, q_α is the ratio between the network records (normalized number of boxes δ) and the local differentiation cost, while q_β is the ratio between the network records and the global differentiation cost.
The network can be obtained from a raw database or after the database has been discretized. Since the classification (measured by the area under the receiver operating characteristic curve (AUROC) and the Matthews correlation coefficient (MCC)) was better using the approximations computed on the networks from discretized databases, only these approximations are reported. The attribute discretization of a database can be found in [56]. The discretization technique is unsupervised and uses equal-frequency binning. The discretized databases were only used to obtain the networks, so the classification task was carried out on the original databases. The networks obtained from the discretized and non-discretized databases turned out to be different; see Figure 5. The classification task was performed by the classical, Renyi, Tsallis, Gini, and two-parameter fractional Tsallis decision trees on each database. We used a 10-fold cross-validation repeated ten times to calculate the AUROC and MCC. The best AUROC and MCC values, produced by one of the four sets of parameters (used to approximate q, α, and β) of the fractional Tsallis decision trees, were chosen and compared with those of the classical and Gini decision trees. In the same way, q_α or q_β was chosen for the q parameter of the Renyi and Tsallis trees, and their AUROCs and MCCs were compared with those of the classical trees. It is known that decision trees can produce non-normally distributed AUROC and MCC measures [57]. Hence, normality was verified by the Kolmogorov-Smirnov test, and the measures were compared using a t-test or a Mann-Whitney U test, according to their normality [10][57][58][59].
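Equal-frequency binning itself is straightforward; a minimal unsupervised sketch (our own helper, not necessarily the exact procedure of [56]) is:

```python
def equal_frequency_bins(values, n_bins):
    """Unsupervised equal-frequency discretization: sort the values and
    cut them into n_bins groups of (roughly) equal size. Tied values can
    make the resulting bins uneven."""
    ordered = sorted(values)
    n = len(ordered)
    # Cut points taken at the 1/n_bins, 2/n_bins, ... quantile positions.
    edges = [ordered[i * n // n_bins] for i in range(1, n_bins)]
    bins = [sum(v >= e for e in edges) for v in values]
    return bins, edges

values = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
bins, edges = equal_frequency_bins(values, 3)
print(edges, bins)  # edges -> [3, 5]
```

Each attribute column would be discretized this way before building the network, while the classifier itself still sees the raw values.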

Applications
The approximations of the q, α, and β parameters computed on the discretized databases are shown in Table 3. Table 4 shows the AUROC and MCC of the classical and two-parameter fractional Tsallis decision trees and the results of the statistical comparison. In addition, the values of the parameters of the fractional Tsallis decision trees are reported.
The two-parameter fractional Tsallis decision tree outperforms the AUROC and MCC of the classical trees for eight databases. The statistical results of the two measures disagree for Car and Haberman: the AUROC of the two-parameter fractional Tsallis tree was equal to that of the classical trees for Car, Image, Vehicle, and Yeast, while for Haberman, Image, Vehicle, and Yeast, the MCC of both trees showed no difference.
Table 3. The parameters of the fractional Tsallis decision tree were obtained using the networks from discretized databases.
Tsallis entropy is a non-extensive measure [60], as is the two-parameter fractional Tsallis entropy [22]. On the contrary, Shannon entropy is extensive. The super-extensive property is given by q < 1 and the sub-extensive property by q > 1. Note that the approximations of the q parameter for all the databases (see Table 3) are less than 1, except for Yeast. Thus, they can be considered candidates for being named super-extensive databases. We say that a database is super-extensive if q < 1 and its value produces a better classification (AUROC, MCC, or another measure) than the classical trees (based on Shannon entropy). Similarly, a database is sub-extensive if q > 1 and its value produces a better classification. Otherwise, the database is extensive, since in this case the Shannon entropy (the cornerstone of classical trees) is a less complex measure than the two-parameter fractional Tsallis entropy and must therefore be preferred. The two-parameter fractional Tsallis trees produce classifications equal to or better than those of the classical trees. Following those conditions, based on MCC, Breast Cancer, Car, Cmc, Glass, Hayes, Letter, Scale, and Wine are super-extensive, while Haberman, Image, Vehicle, and Yeast can be classified as extensive.
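The labelling rule above can be stated compactly (our own helper name; the convention follows the definitions in the text):

```python
def extensivity(q, better_than_classical):
    """Label a database by the non-extensivity of its fitted q:
    super-extensive if q < 1 and the parametric tree beats the classical
    one, sub-extensive if q > 1 and it wins, extensive otherwise."""
    if not better_than_classical:
        return "extensive"
    if q < 1:
        return "super-extensive"
    if q > 1:
        return "sub-extensive"
    return "extensive"

print(extensivity(0.8, True))   # -> super-extensive
print(extensivity(1.2, False))  # -> extensive
```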
The AUROC and MCC of the Renyi and Tsallis decision trees are compared with the baseline of the classical ones. The values q_α and q_β were tested as the entropic index of both parametric decision trees. The parameters of Renyi (q_r) and Tsallis (q_t) that produce the best AUROCs and MCCs are reported in Table 5. The results show that the AUROC of the Renyi trees was better for Breast Cancer, Glass, Letter, and Yeast and worse for Cmc and Haberman than that of the classical trees. The results are quite similar for MCC, where Car's classification outperforms the classical tree classification; on the contrary, the MCC of the Vehicle database was statistically lower than that of the classical tree. The Tsallis AUROCs were better for Cmc, Glass, Haberman, Hayes, and Wine and worse for Yeast than those of the classical trees. Additionally, the MCCs of Car, Cmc, Glass, and Scale were higher, and that of Yeast lower, than the classical trees' MCCs. Based on MCC, Car, Cmc, Glass, and Scale are super-extensive, which is a subset of the classification obtained by the two-parameter fractional Tsallis trees. Finally, the Gini and two-parameter fractional Tsallis decision trees are compared using AUROC and MCC. The results, shown in Table 6, indicate that two-parameter fractional Tsallis trees outperform the AUROC of Gini trees in six databases and the MCC in ten. This underpins that Gini trees are a particular case of two-parameter fractional Tsallis trees with q = 2. In summary, two-parameter fractional Tsallis trees produce better classifications than the classical and Gini trees.
Table 6. AUROC and MCC of Gini decision trees (GT) and two-parameter fractional Tsallis decision trees (TFTT). + means that the AUROC or MCC of TFTT is statistically greater than that of GT.

Conclusions
This paper introduces two-parameter fractional Tsallis decision trees underpinned by fractional-order entropies. The three parameters of this new decision tree need to be tuned to produce better classifications than the classical ones. The trial and error approach is the standard method to adjust the entropic index for Renyi and Tsallis decision trees. However, it is unfeasible for two-parameter fractional Tsallis trees. From a database representation as a complex network, it was possible to determine a set of values for parameters q, α, and β based on this network. The experimental results on twelve databases show that the proposed values yield better classifications (AUROC, MCC) for eight of them, and for the four remaining, the classification was equal to that produced by classical trees.
Moreover, two values (q_α, q_β) were tested in the Renyi and Tsallis decision trees. The results show that Renyi outperforms the classical trees in four (AUROC) and five (MCC) out of twelve databases. Similarly, the Tsallis decision trees produced better classifications for five (AUROC) and four (MCC) databases. The classification was worse in three databases for Renyi and one for Tsallis. The overall results suggest that both parametric decision trees outperform the classical trees in seven databases. All of the above is less favorable than the eight databases improved by the two-parameter fractional Tsallis decision trees. In addition, the databases with a better classification using Tsallis decision trees are a subset of those for which the two-parameter fractional Tsallis trees produced a better classification. This supports the conjecture that the two-parameter fractional Tsallis entropy is a finer measure than parametric entropies such as Renyi and Tsallis.
The approximation technique for the tree parameters introduced here is a valuable alternative for practitioners. Furthermore, the database classification based on the non-extensive properties of the Tsallis and two-parameter fractional Tsallis entropies reveals that the relationships between the records and their attribute values (modeled by a network) are complex. Such complex relationships are better measured by the two-parameter fractional Tsallis entropy, the cornerstone of the proposed decision tree.
The results pave the way for using the two-parameter Tsallis fractional entropy in other data mining techniques such as K-means, generic MST, Kruskal MST, and algorithms for dimension reduction in the future. Our research has the limitation that the databases used in the experiments are not large enough to reveal the reduction in time compared with the trial-and-error approach to set the tree parameters. However, we may conjecture that our method works in large databases, which will be the scope of future research.