Article

Applications of Depth Minimization of Decision Trees Containing Hypotheses for Multiple-Value Decision Tables

Mohammad Azad and Mikhail Moshkov
1 College of Computer and Information Sciences, Jouf University, Sakaka 72441, Saudi Arabia
2 Computer, Electrical and Mathematical Sciences & Engineering Division and Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
* Author to whom correspondence should be addressed.
Entropy 2023, 25(4), 547; https://doi.org/10.3390/e25040547
Submission received: 12 December 2022 / Revised: 28 February 2023 / Accepted: 7 March 2023 / Published: 23 March 2023
(This article belongs to the Special Issue Decision Optimization in Information Theory and Game Theory)

Abstract

In this research, we consider decision trees that incorporate standard queries with one feature per query as well as hypotheses consisting of the values of all features. These decision trees are used to represent knowledge and are comparable to those investigated in exact learning, in which membership queries and equivalence queries are used. As an application, we look into the issue of creating decision trees for two cases: the sorting of a sequence that contains equal elements, and multiple-value decision tables derived from data sets in the UCI Machine Learning Repository. We contrast the efficiency, with respect to the parameter depth, of several forms of optimal decision trees with hypotheses for the aforementioned applications. We also investigate the efficiency of decision trees built by dynamic programming and by an entropy-based greedy method. We discovered that the greedy algorithm produces results very similar to those of the dynamic programming algorithms. Therefore, since the dynamic programming algorithms take a long time, we may readily apply the greedy algorithms.

1. Background and Related Study

In traditional decision tables, each object (row) is assigned a single decision (or label). However, in some cases, such as semantic annotation of images [1], music categorization into emotions [2], functional genomics [3], text categorization [4], and so on, a set of decisions is assigned to every row. In the literature, we frequently find works on multi-label learning [5] and multi-instance learning [6] that discuss prediction problems for multiple-value decision tables. It is also important to mention ambiguous learning [7], partial learning [8], multiple label learning [9], and semi-supervised learning [10]. Additionally, multiple-value decision tables (multi-label decision tables) appear in research on decision trees regarded as algorithms, including (i) the diagnosis of circuit faults, (ii) computational geometry, and (iii) combinatorial optimization [11].
In decision tables with empirical observations, we frequently encounter clusters of equal rows with, most likely, different decisions. We can keep a single row in place of such a cluster and label it with the set of decisions associated with the rows of the cluster [11].
In contrast to exact learning [12,13,14,15,16], which employs both membership and equivalence queries in its algorithms, conventional decision trees [17,18] use only features, which are similar to membership queries. In [19,20,21], we investigated decision trees that use, in addition to features, hypotheses about the values of all features. Queries based on hypotheses are comparable to the equivalence queries of exact learning. We looked at five forms of decision trees based on different combinations of features, hypotheses, and proper hypotheses (the latter being analogues of the proper equivalence queries of exact learning).
In general, decision trees utilizing hypotheses can be more effective than those merely using features. As an illustration, consider the task of computing the conjunction x_1 ∧ ⋯ ∧ x_n. Any decision tree that solves this problem using only the features x_1, …, x_n must have depth at least n. However, the minimum depth of a decision tree solving this problem with proper hypotheses is 1: it suffices to ask the proper hypothesis x_1 = 1, …, x_n = 1. The conjunction is equal to 1 if this hypothesis is true and to 0 otherwise.
To find the minimum complexity of decision trees of the five forms, we proposed dynamic programming algorithms in [19]. We also studied the length and coverage of the decision rules derived from the optimal decision trees. A number of experimental results show that decision trees with hypotheses can be less complex than conventional decision trees, and that the rules derived from the former can be shorter and have greater coverage than those derived from the latter. These results imply that knowledge representation could benefit from decision trees with hypotheses [19].
In this paper, we consider the problem of minimizing the depth of the decision tree. It should be noted that reducing the depth often results in reducing the length of the rules derived from the tree.
Furthermore, we look at two applications to demonstrate the use of the tools created in [19]: the task of sorting n (n = 3, …, 5) elements taken from a linearly ordered set, where equal elements are allowed, and the task of creating decision trees for tables that are modified from the UCI ML Repository [22] to form multiple-value decision tables. We investigate the depth of the five forms of optimal decision trees for each of these applications. The same parameter is investigated for the five forms of decision trees created by an entropy-based greedy method.
The comparison of decision trees containing hypotheses with traditional decision trees is the main goal of this research. The experimental results demonstrate that, for both applications, we can almost always find decision trees with hypotheses that outperform traditional decision trees. Furthermore, the greedy method can be applied easily since its results are comparable to those of dynamic programming.
The remainder of the manuscript is structured as follows: basic definitions and notation are covered in Section 2. Section 3 discusses methods to minimize the depth of a decision tree, and Section 4 presents two applications of these methods. Section 5 contains a short conclusion and future directions.

2. Preliminary Notation

The concepts of a multiple-value decision table and a decision tree for such a table are discussed in this section. Previously, such notions were described in the context of decision tables with one decision per row [23] and decision tables with many-valued decisions [11]. Finally, we describe five different forms of decision trees and one parameter of a decision tree, the depth, which will be used later to compare the different forms of trees.

2.1. Multiple-Value Decision Table

A multiple-value decision table is a table T filled with nonnegative integers. The columns of this table are labeled with features f_1, …, f_n. The rows of the table are pairwise different, and each row is labeled with a nonempty finite set of natural numbers called its set of decisions. D(T) denotes the union of the sets of decisions of all rows. Figure 1 illustrates a multiple-value decision table T_0.
We now discuss the basic notions regarding a multiple-value decision table T. If T contains no rows, it is said to be empty. T is referred to as degenerate when it is empty or when its rows have a common decision (a decision that belongs to the set of decisions of each row). Let us consider a feature f_i ∈ {f_1, …, f_n}. Then X(T, f_i) is the set of values of f_i in the table T, and X(T) is the set of features with two or more values, i.e., with |X(T, f_i)| ≥ 2.
Assume T is a nonempty decision table. A table made up of some rows of T is a subtable of T. Let f_{i_1}, …, f_{i_m} be features from the set {f_1, …, f_n} and b_1, …, b_m be some of their values. We consider a special subtable of the form T(f_{i_1}, b_1)⋯(f_{i_m}, b_m): it contains exactly those rows of T that have the values b_1, …, b_m at the intersections with the columns f_{i_1}, …, f_{i_m}. Such subtables are referred to as separable subtables of T. In particular, T is a separable subtable of T. The set of separable subtables of the table T is denoted by SEP(T). When we regard T as analogous to an optimization problem, the separable subtables of T are analogous to the subproblems of that optimization problem. Figure 2 illustrates a separable subtable T_0(f_1, 0) of the table T_0.
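To make the notion of a separable subtable concrete, here is a minimal Python sketch. It assumes a representation of a multiple-value decision table as a list of (row, set-of-decisions) pairs; the function name and the toy data are ours, not from the paper (the actual contents of T_0 are given in Figure 1).

```python
from typing import List, Set, Tuple

Row = Tuple[int, ...]
Table = List[Tuple[Row, Set[int]]]   # (feature values, set of decisions)

def separable_subtable(table: Table, conditions: List[Tuple[int, int]]) -> Table:
    """Return T(f_i1, b_1)...(f_im, b_m): the rows whose value in column i
    equals b for every pair (i, b) in `conditions` (0-based column indexes)."""
    return [(row, dec) for row, dec in table
            if all(row[i] == b for i, b in conditions)]

# Illustrative data only (not the table T0 from Figure 1).
T = [((0, 0, 1), {1}), ((0, 1, 0), {1, 2}), ((1, 0, 0), {2}), ((1, 1, 1), {3})]
print(separable_subtable(T, [(0, 0)]))   # analogue of T0(f1, 0)
```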

2.2. Decision Tree

Assume T is a nonempty decision table with features f_1, …, f_n. Decision trees with two kinds of queries are taken into account. First, we can ask for the value of a feature f_i ∈ F(T) = {f_1, …, f_n}. The possible responses to this query are the expressions {f_i = b}, where b ∈ X(T, f_i). Second, we can ask about a hypothesis over T. A hypothesis is written as H = {f_1 = b_1, …, f_n = b_n}, where b_1 ∈ X(T, f_1), …, b_n ∈ X(T, f_n). The possible responses to this query are the hypothesis itself and counterexamples of the form {f_1 = c_1}, c_1 ∈ X(T, f_1) \ {b_1}, …, {f_n = c_n}, c_n ∈ X(T, f_n) \ {b_n}. The response consisting of the hypothesis itself means that the hypothesis is true. If (b_1, …, b_n) coincides with a row of T, then H is called a proper hypothesis for T.
A decision tree over T is a rooted tree in which each terminal vertex is labeled with a number from the set D(T) ∪ {0}. Every non-terminal vertex (referred to as a "functioning vertex" in this context) is labeled either with a hypothesis over T or with a feature from the set F(T). If a functioning vertex is labeled with a hypothesis H = {f_1 = b_1, …, f_n = b_n} over T, then precisely one edge leaving this vertex is labeled with each response from C(H) = {H} ∪ {{f_i = c} : f_i ∈ F(T), c ∈ X(T, f_i) \ {b_i}}. If it is labeled with a feature f_i ∈ F(T), then precisely one edge leaving this vertex is labeled with each response from C(f_i) = {{f_i = b} : b ∈ X(T, f_i)}. In both cases, these labeled edges are the only edges leaving the vertex.
Assume u is a vertex of a decision tree Γ over T. We now define an equation system S(Γ, u) over T associated with the vertex u. Let ξ be the directed path from the root of Γ to u. If ξ contains no functioning vertices, then S(Γ, u) is the empty system. Otherwise, S(Γ, u) is the union of the equation systems attached to the edges of ξ.
The tree Γ is said to be a decision tree for T if it satisfies the following conditions. For each vertex u of Γ, the subtable TS(Γ, u) is degenerate if and only if the vertex u is terminal. In addition, if u is a terminal vertex and TS(Γ, u) is empty, then u is labeled with 0; if u is a terminal vertex and TS(Γ, u) is nonempty, then u is labeled with a common decision for TS(Γ, u).
An example of a decision tree for T 0 is given in Figure 3.

2.3. Different Tree Forms and Corresponding Depth

In Γ, a full path is any directed path from the root to a terminal vertex. The depth of a decision tree is the maximum length of a full path in the tree and is regarded as its time complexity. The depth of Γ is denoted h(Γ).
The following notation will be used for the lowest depth of a decision tree for T that:
  • solely employs features from F(T): h^(1)(T);
  • solely employs hypotheses over T: h^(2)(T);
  • employs features from F(T) as well as hypotheses over T: h^(3)(T);
  • solely employs proper hypotheses over T: h^(4)(T);
  • employs features from F(T) as well as proper hypotheses over T: h^(5)(T).

3. Procedure to Minimize Depth

In this section, we first describe how to minimize the depth of a decision tree using dynamic programming, and then we describe a greedy algorithm for the same purpose.

3.1. Extension of Dynamic Programming

3.1.1. Creation of DAG

Assume T is a nonempty multiple-value decision table with features f_1, …, f_n. We now describe the algorithm A_dag (Algorithm 1), which constructs a directed acyclic graph (DAG) Δ(T). The vertices of this graph are separable subtables of the table T. We deal with one vertex per iteration. We begin with a graph that has a single untreated vertex, T, and we end when all of the graph's vertices have been treated. At each iteration we examine an untreated vertex (subtable) β: if it is degenerate, we do not split it; otherwise, we split the vertex and carry on with the process. Similar procedures were described previously in [11,23].
Algorithm 1: A_dag.
Input: A nonempty multiple-value decision table T with features f_1, …, f_n.
Output: The DAG Δ(T).
  • Construct a graph with only the vertex T that is not designated as treated.
  • Once every vertex in the graph has been treated, the algorithm's work is complete and the graph is returned as Δ(T). Otherwise, pick an untreated vertex (table) β.
  • If (the vertex β is degenerate) {
       designate it as treated and proceed to step 2.
    }
    else {
       For each f_i ∈ X(β), create a bundle of edges leaving the vertex β. Let X(β, f_i) = {b_1, …, b_e}. Create e arcs from β and label them with the pairs (f_i, b_1), …, (f_i, b_e); these arcs enter the vertices β(f_i, b_1), …, β(f_i, b_e), respectively. Add the vertices β(f_i, b_1), …, β(f_i, b_e) to the graph if they are not already present. Mark the vertex β as treated and return to step 2.
    }
In general, the time complexity of A_dag is exponential in the size of the table. Note that we discussed several classes of decision tables in the book [11] for which the number of separable subtables is bounded from above by a polynomial on the number of columns. For each of these classes, the time complexity of A_dag is polynomial in the size of the input decision tables.
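The following Python sketch illustrates the idea of A_dag under the same list-of-pairs representation used above; it is an illustration of the construction, not the authors' implementation, and the helper names (is_degenerate, feature_values, build_dag) are ours.

```python
def is_degenerate(table):
    """A table is degenerate if it is empty or its rows have a common decision."""
    if not table:
        return True
    return len(set.intersection(*(dec for _, dec in table))) > 0

def feature_values(table, i):
    """X(table, f_i): the set of values of feature i in the table."""
    return {row[i] for row, _ in table}

def build_dag(table):
    """Rough analogue of A_dag: vertices are separable subtables (keyed by their
    set of rows), arcs are labeled with pairs (feature index, value)."""
    key = lambda t: frozenset(row for row, _ in t)
    dag = {key(table): {"table": table, "arcs": {}}}
    untreated = [table]
    while untreated:
        beta = untreated.pop()
        if is_degenerate(beta):
            continue                        # degenerate vertices are not split
        n = len(beta[0][0])
        for i in range(n):
            values = feature_values(beta, i)
            if len(values) < 2:             # only features from X(beta) are used
                continue
            for b in values:
                child = [(r, d) for r, d in beta if r[i] == b]
                dag[key(beta)]["arcs"][(i, b)] = key(child)
                if key(child) not in dag:
                    dag[key(child)] = {"table": child, "arcs": {}}
                    untreated.append(child)
    return dag
```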

3.1.2. Method to Minimize Depth

Assume T is a nonempty decision table with features f_1, …, f_n. The values h^(1)(T), …, h^(5)(T) can be computed using the DAG Δ(T). Let t ∈ {1, …, 5}. To determine h^(t)(T), we compute the value h^(t)(β) for each vertex β of Δ(T). It is convenient to take into account not only the subtables corresponding to vertices of Δ(T), but also the empty subtable Λ of T as well as the subtables T_r that contain a single row r of T and are not vertices of Δ(T). The work of our method starts with the terminal vertices of Δ(T) (vertices without leaving arcs) and these special subtables, and then moves step by step up to the table T.
Assume β is equal to T_r for some row r of T or is a terminal vertex of Δ(T). An optimal decision tree for β consists of a single vertex labeled with a common decision for β, so h^(t)(β) = 0. If β = Λ, the decision tree for Λ consists of a single vertex labeled with 0, and in this case h^(t)(β) = 0 as well.
Assume β is a nonterminal vertex of Δ(T) such that h^(t)(β′) is already known for each child β′ of β. With the help of this information, we can determine the minimum depth of a decision tree for β. Regarding the root of this tree, we can have the following scenarios:
  • the root is labeled with a feature from F(T); the minimum depth of such a decision tree is denoted h_a^(t)(β);
  • the root is labeled with a hypothesis over T; the minimum depth of such a decision tree is denoted h_h^(t)(β);
  • the root is labeled with a proper hypothesis over T; the minimum depth of such a decision tree is denoted h_p^(t)(β).
The set X(β) is not empty because β is nondegenerate. Below we describe three procedures for calculating the values h_a^(t)(β), h_h^(t)(β), and h_p^(t)(β), respectively.
Let Γ(f_i) be a decision tree for β whose root is labeled with a feature f_i ∈ X(β). For each b ∈ X(T, f_i), an edge labeled with the equation system {f_i = b} leaves the root and enters a vertex u(b). The vertex u(b) is the root of a decision tree of form t for β{f_i = b}, whose depth is h^(t)(β{f_i = b}). Clearly,
h(Γ(f_i)) = 1 + max{h^(t)(β{f_i = b}) : b ∈ X(T, f_i)}.
Because h^(t)(β{f_i = b}) = h^(t)(Λ) = 0 for any b ∈ X(T, f_i) \ X(β, f_i),
h(Γ(f_i)) = 1 + max{h^(t)(β{f_i = b}) : b ∈ X(β, f_i)}.   (1)
For each b ∈ X(β, f_i), the subtable β{f_i = b} is a child of β in Δ(T); therefore, the value h^(t)(β{f_i = b}) is already known.
It can be shown that h(Γ(f_i)) is the minimum depth of a decision tree for β whose root is labeled with the feature f_i and in which decision trees of form t are used for the subtables corresponding to the root's children.
Features f_i ∈ F(T) \ X(β) should not be considered: for each such feature there is b ∈ X(T, f_i) with β{f_i = b} = β, so using it we cannot build an optimal decision tree for β. Thus,
h_a^(t)(β) = min{h(Γ(f_i)) : f_i ∈ X(β)}.   (2)
The procedure for computing h_a^(t)(β). Construct the set of features X(β). For each feature f_i ∈ X(β), calculate the value h(Γ(f_i)) using (1). Determine h_a^(t)(β) using (2).
Note that the time complexity of this procedure is polynomial in the size of the table T.
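A small sketch of this procedure, under the assumptions of the earlier sketches (subtables as lists of (row, decision-set) pairs) and with the already computed depths of the children supplied through a callback; the names are ours.

```python
def h_a(beta, x_beta, child_depth):
    """h_a^{(t)}(beta) via formulas (1) and (2).

    beta        -- the subtable, a list of (row, decision-set) pairs
    x_beta      -- X(beta): indexes of features with at least two values in beta
    child_depth -- child_depth(beta, i, b) returns the known h^{(t)}(beta{f_i = b})
    """
    best = None
    for i in x_beta:
        values = {row[i] for row, _ in beta}                      # X(beta, f_i)
        depth = 1 + max(child_depth(beta, i, b) for b in values)  # formula (1)
        best = depth if best is None else min(best, depth)        # formula (2)
    return best
```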
We consider only admissible hypotheses over T. Assume H = {f_1 = b_1, …, f_n = b_n} is such a hypothesis. It is admissible for β and a feature f_i ∈ F(T) = {f_1, …, f_n} if β{f_i = c} ≠ β for any c ∈ X(T, f_i) \ {b_i}. It is not admissible for β and a feature f_i ∈ F(T) if and only if |X(β, f_i)| = 1 and b_i ∉ X(β, f_i). The hypothesis H is referred to as admissible for β if it is admissible for β and every feature f_i ∈ F(T).
Let Γ(H) be a decision tree for β whose root is labeled with a hypothesis H = {f_1 = b_1, …, f_n = b_n} that is admissible for β. The set of responses to the query associated with H is C(H) = {H} ∪ {{f_i = c} : f_i ∈ F(T), c ∈ X(T, f_i) \ {b_i}}. For each S ∈ C(H), an edge labeled with the equation system S leaves the root of Γ(H) and enters a vertex u(S). The vertex u(S) is the root of a decision tree of form t for βS, whose depth is h^(t)(βS). Clearly,
h(Γ(H)) = 1 + max{h^(t)(βS) : S ∈ C(H)}.
Note that h^(t)(βH) = 0, since βH = Λ or βH = T_r for some row r of T. Because H is admissible for β, the set X(β, f_i) \ {b_i} is empty whenever f_i ∈ F(T) \ X(β). Moreover, β{f_i = c} = Λ for any feature f_i ∈ X(β) and any c ∈ X(T, f_i) \ {b_i} with c ∉ X(β, f_i), so h^(t)(β{f_i = c}) = 0 in this case. Hence,
h(Γ(H)) = 1 + max{h^(t)(β{f_i = c}) : f_i ∈ X(β), c ∈ X(β, f_i) \ {b_i}}.   (3)
Obviously, for any f_i ∈ X(β) and any c ∈ X(β, f_i) \ {b_i}, the subtable β{f_i = c} is a child of β in Δ(T); therefore, h^(t)(β{f_i = c}) is already known.
It can be shown that h(Γ(H)) is the minimum depth of a decision tree for β whose root is labeled with the hypothesis H and in which decision trees of form t are used for the subtables corresponding to the root's children.
Hypotheses that are not admissible for β should not be taken into consideration.
The procedure for computing h_h^(t)(β). Construct a hypothesis H_β = {f_1 = b_1, …, f_n = b_n} for β as follows. If f_i ∈ F(T) \ X(β), then b_i is the only number in the set X(β, f_i). If f_i ∈ X(β), then b_i is the lowest number from X(β, f_i) for which h^(t)(β{f_i = b_i}) = max{h^(t)(β{f_i = c}) : c ∈ X(β, f_i)}. The hypothesis H_β is clearly admissible for β. Calculate the value h(Γ(H_β)) using (3). It is easy to see from (3) that h(Γ(H_β)) = h_h^(t)(β).
Note that the time complexity of this procedure is polynomial in the size of the table T.
The procedure for computing h_p^(t)(β). For every row r = (b_1, …, b_n) of T, construct the corresponding proper hypothesis H_r = {f_1 = b_1, …, f_n = b_n} and check whether it is admissible for β. For each admissible H_r, calculate the value h(Γ(H_r)) using (3). Then h_p^(t)(β) is equal to the minimum of the obtained numbers.
Note that the time complexity of this procedure is polynomial in the size of the table T.
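The two procedures admit a similarly compact sketch under the same assumptions (callback child_depth, list-of-pairs subtables); admissible_for implements the admissibility criterion stated above, and all names are ours.

```python
def h_from_hypothesis(beta, x_beta, H, child_depth):
    """Depth of a tree for beta whose root asks the hypothesis H = (b_1, ..., b_n),
    formula (3): only counterexamples {f_i = c} with f_i in X(beta) and
    c in X(beta, f_i) minus {b_i} contribute."""
    depths = [child_depth(beta, i, c)
              for i in x_beta
              for c in {row[i] for row, _ in beta} - {H[i]}]
    return 1 + max(depths, default=0)

def admissible_for(beta, H):
    """H is admissible for beta iff no feature takes exactly one value in beta
    that differs from the corresponding component of H."""
    for i in range(len(H)):
        vals = {row[i] for row, _ in beta}
        if len(vals) == 1 and H[i] not in vals:
            return False
    return True

def h_h(beta, x_beta, child_depth):
    """Build the hypothesis H_beta described above and apply formula (3)."""
    n = len(beta[0][0])
    H = []
    for i in range(n):
        values = sorted({row[i] for row, _ in beta})   # X(beta, f_i)
        if i not in x_beta:
            H.append(values[0])                        # the only value of f_i in beta
        else:                                          # deepest child; ties -> lowest value
            H.append(max(values, key=lambda c: (child_depth(beta, i, c), -c)))
    return h_from_hypothesis(beta, x_beta, tuple(H), child_depth)

def h_p(beta, x_beta, rows_of_T, child_depth):
    """Minimum of formula (3) over proper hypotheses H_r admissible for beta."""
    return min(h_from_hypothesis(beta, x_beta, r, child_depth)
               for r in rows_of_T if admissible_for(beta, r))
```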
We now describe an algorithm A_t^h (Algorithm 2) that determines h^(t)(T), the minimum depth of a decision tree of form t (t = 1, …, 5) for the table T. As the algorithm executes, we compute h^(t)(β) for each vertex β of Δ(T).
Algorithm 2: A_t^h (computation of h^(t)(T)).
Input: The DAG Δ(T) for a nonempty table T.
Output: The value h^(t)(T).
  • If every vertex of the DAG Δ(T) has a number assigned to it, the algorithm ends and returns the number h^(t)(T) assigned to the vertex T. Otherwise, choose a vertex β without an assigned number; such a vertex is either a terminal vertex of Δ(T) or a nonterminal vertex of Δ(T) all of whose children already have assigned numbers.
  • If (β is a terminal vertex) {
       assign the value h^(t)(β) = 0 to β and move on to step 1.
    }
    else {
       perform the following based on the value t:
       If (t = 1) {
       calculate the value h_a^(1)(β) and set h^(1)(β) = h_a^(1)(β).
       }
       else if (t = 2) {
       calculate the value h_h^(2)(β) and set h^(2)(β) = h_h^(2)(β).
       }
       else if (t = 3) {
       calculate the values h_a^(3)(β) and h_h^(3)(β) and set h^(3)(β) = min{h_a^(3)(β), h_h^(3)(β)}.
       }
       else if (t = 4) {
       calculate the value h_p^(4)(β) and set h^(4)(β) = h_p^(4)(β).
       }
       else if (t = 5) {
       calculate the values h_a^(5)(β) and h_p^(5)(β) and set h^(5)(β) = min{h_a^(5)(β), h_p^(5)(β)}.
       }
    }
    Move on to step 1.
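Putting the pieces together, here is a sketch of the algorithm for the case t = 3 (features and hypotheses); it reuses build_dag, is_degenerate, h_a, and h_h from the sketches above and uses memoized recursion over the DAG instead of the explicit bottom-up pass, which computes the same values.

```python
def min_depth_form3(table):
    """Sketch of h^{(3)}(T): at every nondegenerate vertex of the DAG take
    min(h_a, h_h); degenerate subtables have depth 0."""
    dag = build_dag(table)
    memo = {}

    def x_of(beta):
        n = len(beta[0][0])
        return [i for i in range(n) if len({r[i] for r, _ in beta}) >= 2]

    def child_depth(beta, i, b):
        child = [(r, d) for r, d in beta if r[i] == b]
        if is_degenerate(child):
            return 0
        return solve(frozenset(r for r, _ in child))

    def solve(key):
        if key not in memo:
            beta = dag[key]["table"]
            if is_degenerate(beta):
                memo[key] = 0
            else:
                x_beta = x_of(beta)
                memo[key] = min(h_a(beta, x_beta, child_depth),
                                h_h(beta, x_beta, child_depth))
        return memo[key]

    return solve(frozenset(r for r, _ in table))
```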

3.2. Greedy Algorithms

In this section, we provide greedy techniques for building decision trees with hypotheses for multiple-value decision tables based on the entropy uncertainty measure.
Assume T is a nonempty multiple-value decision table with n features f_1, …, f_n, and let β be a subtable of the table T. The entropy of β, denoted ent_ML(β), is defined as follows:
  • If β is empty or degenerate, then ent_ML(β) = 0.
  • Assume β is nonempty and nondegenerate. Let N(β) be the number of rows in β and, for a decision s, let N_s(β) be the number of rows in β whose set of decisions contains s; set p_s = N_s(β)/N(β). Then ent_ML(β) = −∑_{s ∈ D(T)} (p_s log_2 p_s + (1 − p_s) log_2 (1 − p_s)), with the convention 0 log_2 0 = 0 [24].
We now define the impurity of a query for the table β and the uncertainty measure ent_ML. The impurity of the query based on a feature f_i ∈ F(T) is I_entML(f_i, β) = max{ent_ML(βS) : S ∈ C(f_i)}. The impurity of the query based on a hypothesis H is I_entML(H, β) = max{ent_ML(βS) : S ∈ C(H)}.
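A sketch of ent_ML and of the impurity of a query, reusing is_degenerate from the DAG sketch; decisions_T stands for D(T), and the subtables β_S corresponding to the answers of a query are expected to be precomputed by the caller.

```python
from math import log2

def ent_ml(beta, decisions_T):
    """Multi-label entropy ent_ML(beta) as defined above (after [24]),
    with the convention 0 * log2(0) = 0."""
    if is_degenerate(beta):
        return 0.0
    n_rows = len(beta)
    total = 0.0
    for s in decisions_T:                 # decisions absent from beta contribute 0
        p = sum(1 for _, dec in beta if s in dec) / n_rows
        for q in (p, 1.0 - p):
            if 0.0 < q < 1.0:
                total -= q * log2(q)
    return total

def impurity(subtables, decisions_T):
    """I_entML(query, beta): the maximum of ent_ML over the subtables beta_S,
    one for each answer S in C(query)."""
    return max(ent_ml(sub, decisions_T) for sub in subtables)
```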
A greedy algorithm A_entML (Algorithm 3) is described below. For a nonempty multiple-value decision table T, it generates a decision tree Γ_entML^(t) of form t (where t ∈ {1, …, 5}).
Algorithm 3: A_entML (construction of the tree Γ_entML^(t)).
Input: A nonempty multiple-value decision table T and a number t ∈ {1, …, 5}.
Output: A decision tree of form t for T.
  • Create a tree G with a sole vertex that is tagged by T.
  • If no vertex of the tree G is labeled with a subtable of the table T, the algorithm terminates and returns the tree G.
  • Pick a vertex of the tree G that is labeled with a subtable β of the table T; call this vertex u.
  • If ( β is degenerate) {
       u is labeled with 0 if β is empty and with a common decision of β otherwise.
    }
    else {
       we select an M query (either a feature or a hypothesis) admissible for β based on t according to the below rules:
       If (t = 1) {
          a feature M ∈ F(T) is chosen that is admissible for β and has the lowest impurity I_entML(M, β).
       }
       else if (t = 2) {
          a hypothesis M over T is chosen that is admissible for β and has the lowest impurity I_entML(M, β).
       }
       else if (t = 3) {
          a feature Y ∈ F(T) is chosen that is admissible for β and has the lowest impurity I_entML(Y, β), and a hypothesis Z over T is chosen that is admissible for β and has the lowest impurity I_entML(Z, β). We select as M the query among Y and Z with the minimum impurity I_entML(M, β).
       }
       else if (t = 4) {
          a proper hypothesis M over T is chosen that is admissible for β and has the lowest impurity I_entML(M, β).
       }
       else if (t = 5) {
          a feature Y ∈ F(T) is chosen that is admissible for β and has the lowest impurity I_entML(Y, β), and a proper hypothesis Z over T is chosen that is admissible for β and has the lowest impurity I_entML(Z, β). We select as M the query among Y and Z with the minimum impurity I_entML(M, β).
       }
    }
  • The vertex u is relabeled with the query M instead of the subtable β. For each answer S ∈ C(M), we add a vertex u(S) and an arc e(S) connecting u and u(S) to the tree G; the arc e(S) is labeled with the answer S, and the vertex u(S) is labeled with the subtable βS. Move on to step 2.
The algorithm A_entML generates a decision tree Γ_entML^(t) of form t (where t ∈ {1, …, 5}) for the table T. Denote h_entML^(t)(T) = h(Γ_entML^(t)).
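As an illustration of the greedy step, here is a sketch for the simplest case t = 1 (features only); the other forms differ only in the set of candidate queries (hypotheses, proper hypotheses, or mixtures). It reuses is_degenerate and impurity from the earlier sketches, treats a feature as admissible for β when it has at least two values in β, and returns a nested (feature index, {value: subtree}) structure; it is not the authors' implementation.

```python
def greedy_tree_features(table, decisions_T):
    """Sketch of A_entML for t = 1: label each nondegenerate subtable with the
    admissible feature of lowest impurity and recurse on the answer subtables."""
    if is_degenerate(table):
        if not table:
            return 0                                          # empty subtable -> leaf 0
        return min(set.intersection(*(d for _, d in table)))  # a common decision
    n = len(table[0][0])
    best = None                                               # (impurity, feature index)
    for i in range(n):
        values = {r[i] for r, _ in table}
        if len(values) < 2:                                   # feature not admissible for beta
            continue
        subs = [[(r, d) for r, d in table if r[i] == b] for b in values]
        imp = impurity(subs, decisions_T)
        if best is None or imp < best[0]:
            best = (imp, i)
    i = best[1]
    return (i, {b: greedy_tree_features([(r, d) for r, d in table if r[i] == b],
                                        decisions_T)
                for b in {r[i] for r, _ in table}})

# Example use: D_T = set().union(*(d for _, d in T)); greedy_tree_features(T, D_T)
```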

4. Application

In this section, we look at two applications of the aforementioned algorithms. The first is the sorting problem; the second is an experimental analysis of modified decision tables from the UCI Machine Learning Repository [22].

4.1. Sorting

The problem of sorting is usually defined in theoretical studies (see [25,26]) as sorting a sequence of n pairwise distinct elements y_1, …, y_n from a linearly ordered set. In this case, there is exactly one permutation (s_1, …, s_n) of the set {1, …, n} such that y_{s_1} < ⋯ < y_{s_n}, and any comparison y_i : y_j of two elements has only two possible results: y_i < y_j and y_i > y_j. The goal is to find, for a given sequence y_1, …, y_n, the permutation (s_1, …, s_n) such that y_{s_1} < ⋯ < y_{s_n}.
Now consider the case where the sequence y_1, …, y_n may contain equal elements. The set {1, …, n} can then have more than one permutation (s_1, …, s_n) such that y_{s_1} ≤ ⋯ ≤ y_{s_n}, and any comparison y_i : y_j has three possible results: y_i < y_j, y_i = y_j, and y_i > y_j. The goal is to find, for a given sequence y_1, …, y_n, a permutation (s_1, …, s_n) such that y_{s_1} ≤ ⋯ ≤ y_{s_n}.
For a given n, we construct the multiple-value decision table T_sort3(n). In this table, the columns are labeled with the features y_i : y_j, 1 ≤ i < j ≤ n, and the rows represent all feasible tuples of values of these features for a sequence y_1, …, y_n that may contain equal elements (the table T_sort3(3) is shown in Table 1). The set of corresponding permutations serves as the set of decisions for each row. The number 3 in the notation T_sort3(n) indicates that three-valued features are considered.
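The table T_sort3(n) can be generated mechanically: every feasible tuple of comparison results is realized by some assignment of values from {1, …, n} to y_1, …, y_n, so it suffices to enumerate all such assignments. A short sketch (names are ours):

```python
from itertools import product, permutations

def sign(a, b):
    return '<' if a < b else ('=' if a == b else '>')

def build_t_sort3(n):
    """Construct T_sort3(n): rows are the feasible tuples of results of the
    comparisons y_i : y_j (i < j); each row is labeled with the set of
    permutations (s_1, ..., s_n) such that y_{s_1} <= ... <= y_{s_n}."""
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    table = {}
    for values in product(range(1, n + 1), repeat=n):
        row = tuple(sign(values[i], values[j]) for i, j in pairs)
        sorts = {p for p in permutations(range(1, n + 1))
                 if all(values[p[k] - 1] <= values[p[k + 1] - 1] for k in range(n - 1))}
        table.setdefault(row, set()).update(sorts)
    return pairs, table

pairs, t3 = build_t_sort3(3)
print(len(t3))   # 13 rows, as in Table 1
```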
Table 2 contains the lowest depth of a decision tree for T_sort3(n), n = 3, …, 5, obtained by the dynamic programming algorithms. Compared to decision trees of form 1, decision trees of forms 2, 3, 4, and 5 have a smaller minimum depth.
Table 3 contains the depth of the decision trees for T_sort3(n), n = 3, …, 5, built by the greedy algorithm. Again, forms 2, 3, 4, and 5 produce a smaller depth than form 1.
If we now compare the results obtained by the greedy algorithms with those of dynamic programming, we can see that they are quite close. For n = 3, the results are the same. Although the dynamic programming results are better as n grows, the dynamic programming algorithms require an exponential amount of time. Therefore, to reduce the running time, we can readily apply the greedy algorithms.

4.2. Modified Tables from UCI ML Repository

We extracted decision tables from the UCI ML Repository [22] and removed one or more features from each. As a result, some tables contain groups of rows with identical feature values but different decisions; each such group is merged into a single row labeled with the set of decisions of the rows in the group. Before the experiments, some preprocessing steps are applied: a feature is removed if it has a different value in every row, and missing values are filled with the most frequent value of the corresponding feature.
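The preprocessing can be reproduced with a few lines of pandas; the sketch below uses placeholder column names, and the exact order of the steps in the authors' pipeline may differ slightly.

```python
import pandas as pd

def to_multivalue_table(df, target, drop_columns):
    """Sketch of the preprocessing described above:
    1. drop the selected feature columns;
    2. fill missing feature values with the most frequent value of the column;
    3. drop features that take a different value in every row;
    4. merge rows with identical feature values into one row labeled with
       the set of decisions of the merged rows."""
    df = df.drop(columns=drop_columns)
    features = [c for c in df.columns if c != target]
    for c in features:
        df[c] = df[c].fillna(df[c].mode().iloc[0])
    features = [c for c in features if df[c].nunique() < len(df)]
    merged = (df.groupby(features)[target]
                .apply(lambda s: frozenset(s))
                .reset_index())
    return merged
```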
In Table 4, the column 'Name of table' gives the name of the new table (obtained after deleting features from a UCI ML Repository data set); the column 'Initial table (removed columns)' gives the name of the initial table together with the indexes of the features deleted from it; the columns 'Rows' and 'Attr' give the numbers of rows and features; the column 'Number of decisions' gives the number of decisions in the new table; and the column 'Distribution' gives the sequence @1, @2, @3, …, where @i is the number of rows of T labeled with sets containing exactly i decisions.
The lowest depth of a decision tree for the above-mentioned decision tables, computed by the dynamic programming algorithms, is shown in Table 5. The table's name is in the first column, and the remaining columns contain the values h^(1)(T), …, h^(5)(T). On average, forms 3 and 5 give a smaller minimum depth than form 1. Decision trees of the smallest depth are obtained for 8 decision tables when using features (form 1), 4 tables when using hypotheses (form 2), all tables when using features and hypotheses (form 3), 2 tables when using proper hypotheses (form 4), and 10 tables when using features and proper hypotheses (form 5). The last two rows give the mean and standard deviation for each form.
The depth of the decision trees for the above-mentioned decision tables built by the greedy algorithm is shown in Table 6. The table's name is in the first column, and the remaining columns contain the values h^(1)(T), …, h^(5)(T). On average, forms 3 and 5 give a smaller depth than form 1. Decision trees of the smallest depth are obtained for four decision tables when using features (form 1), four tables when using hypotheses (form 2), eleven tables when using features and hypotheses (form 3), two tables when using proper hypotheses (form 4), and all tables when using features and proper hypotheses (form 5). The last two rows give the mean and standard deviation for each form.

4.3. Discussion

We can compare the results in two ways. First, we can compare the five forms: in both applications, the smallest depth is achieved when forms 3 and 5 are used. Second, we can compare the results of the greedy algorithms with those of dynamic programming. The results are quite close for the tables "CARS1", "NURSERY1", "ZOO1", "ZOO2", and "ZOO3"; for the other tables, dynamic programming gives better results. However, the dynamic programming algorithms require an exponential amount of time; therefore, to reduce the running time, we can readily apply the greedy algorithms.

5. Conclusions

In this article, we examined modified decision trees that employ both queries based on one feature per query and queries based on a hypothesis about the values of all features. To minimize the depth of these decision trees, we created dynamic programming and greedy algorithms, and we considered two applications. The first application is the sorting problem, whereas the second deals with decision tables from the UCI Machine Learning Repository modified into multiple-value decision tables. The experimental results show that, in both applications, the minimum depth is achieved when both features and hypotheses are used. Furthermore, the optimal results produced by dynamic programming are not far from the results of the greedy algorithms based on the entropy uncertainty measure.
In the future, we plan to study additional greedy uncertainty measures to investigate which of them come closest to the optimal dynamic programming results. We also plan to study the minimization of the number of vertices of decision trees with hypotheses for multiple-value decision tables, using both dynamic programming and greedy algorithms.

Author Contributions

Project administration, M.A.; writing—original draft preparation, M.A.; software, M.A.; investigation, M.A.; methodology, M.A. and M.M.; validation, M.A.; data curation, M.A.; Conceptualization, M.A. and M.M.; visualization, M.A.; supervision, M.M.; formal analysis, M.A. and M.M.; writing—review and editing, M.A. and M.M.; resources, M.A. and M.M.; funding acquisition, M.M. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by King Abdullah University of Science & Technology.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data is publicly accessible via http://archive.ics.uci.edu/ml accessed on 6 March 2023.

Acknowledgments

The authors acknowledge the help and support from King Abdullah University of Science and Technology (KAUST) and Jouf University (JU) for this research.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Boutell, M.; Luo, J.; Shen, X.; Brown, C. Learning Multi-label Scene Classification. Pattern Recognit. 2004, 37, 1757–1771.
  2. Wieczorkowska, A.; Synak, P.; Lewis, R.; Ras, Z. Extracting Emotions from Music Data. In Lecture Notes in Computer Science, Proceedings of the 15th International Conference on Foundations of Intelligent Systems, ISMIS 2005, Saratoga Springs, NY, USA, 25–28 May 2005; Hacid, M., Murray, N., Ras, Z., Tsumoto, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; Volume 3488, pp. 456–465.
  3. Vens, C.; Struyf, J.; Schietgat, L.; Džeroski, S.; Blockeel, H. Decision Trees for Hierarchical Multi-label Classification. Mach. Learn. 2008, 73, 185–214.
  4. Zhou, Z.H.; Jiang, K.; Li, M. Multi-Instance Learning Based Web Mining. Appl. Intell. 2005, 22, 135–147.
  5. Tsoumakas, G.; Katakis, I. Multi-Label Classification: An Overview. IJDWM 2007, 3, 1–13.
  6. Zhou, Z.H.; Zhang, M.L.; Huang, S.J.; Li, Y.F. Multi-instance Multi-label Learning. Artif. Intell. 2012, 176, 2291–2320.
  7. Hüllermeier, E.; Beringer, J. Learning from Ambiguously Labeled Examples. In Lecture Notes in Computer Science, Proceedings of the Advances in Intelligent Data Analysis VI, 6th International Symposium on Intelligent Data Analysis, IDA 2005, Madrid, Spain, 8–10 September 2005; Famili, A.F., Kok, J.N., Sánchez, J.M.P., Siebes, A., Feelders, A.J., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; Volume 3646, pp. 168–179.
  8. Cour, T.; Sapp, B.; Jordan, C.; Taskar, B. Learning from Ambiguously Labeled Images. In Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2009, Miami, FL, USA, 20–25 June 2009; IEEE Computer Society: Washington, DC, USA, 2009; pp. 919–926.
  9. Jin, R.; Ghahramani, Z. Learning with Multiple Labels. In Proceedings of the Advances in Neural Information Processing Systems 15, NIPS 2002, Vancouver, BC, Canada, 9–14 December 2002; Becker, S., Thrun, S., Obermayer, K., Eds.; MIT Press: Cambridge, MA, USA, 2002; pp. 897–904.
  10. Zhu, X.; Goldberg, A.B. Introduction to Semi-Supervised Learning; Synthesis Lectures on Artificial Intelligence and Machine Learning; Morgan & Claypool Publishers: San Rafael, CA, USA, 2009.
  11. Alsolami, F.; Azad, M.; Chikalov, I.; Moshkov, M. Decision and Inhibitory Trees and Rules for Decision Tables with Many-Valued Decisions; Intelligent Systems Reference Library; Springer: Berlin/Heidelberg, Germany, 2020; Volume 156.
  12. Angluin, D. Queries and Concept Learning. Mach. Learn. 1988, 2, 319–342.
  13. Angluin, D. Queries revisited. Theor. Comput. Sci. 2004, 313, 175–194.
  14. Angluin, D. Negative results for equivalence queries. Mach. Learn. 1990, 5, 121–150.
  15. Balcázar, J.L.; Castro, J.; Guijarro, D. A general dimension for exact learning. In Proceedings of the Computational Learning Theory: 14th Annual Conference on Computational Learning Theory, COLT 2001 and 5th European Conference on Computational Learning Theory, EuroCOLT 2001, Amsterdam, The Netherlands, 16–19 July 2001; Springer: Berlin/Heidelberg, Germany, 2001; Volume 14, pp. 354–367.
  16. Balcázar, J.L.; Castro, J.; Guijarro, D. A New Abstract Combinatorial Dimension for Exact Learning via Queries. J. Comput. Syst. Sci. 2002, 64, 2–21.
  17. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Chapman and Hall/CRC: Boca Raton, FL, USA, 1984.
  18. Rokach, L.; Maimon, O. Data Mining with Decision Trees—Theory and Applications; Series in Machine Perception and Artificial Intelligence; World Scientific: Singapore, 2007; Volume 69.
  19. Azad, M.; Chikalov, I.; Hussain, S.; Moshkov, M.; Zielosko, B. Decision Trees with Hypotheses; Synthesis Lectures on Intelligent Technologies Series; Springer: Berlin/Heidelberg, Germany, 2022.
  20. Azad, M.; Chikalov, I.; Hussain, S.; Moshkov, M. Entropy-based Greedy Algorithm for Decision Trees Using Hypotheses. Entropy 2021, 23, 808.
  21. Azad, M.; Chikalov, I.; Hussain, S.; Moshkov, M. Optimization of Decision Trees with Hypotheses for Knowledge Representation. Electronics 2021, 10, 1580.
  22. Dua, D.; Graff, C. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. 2017. Available online: http://archive.ics.uci.edu/ml (accessed on 6 March 2023).
  23. AbouEisha, H.; Amin, T.; Chikalov, I.; Hussain, S.; Moshkov, M. Extensions of Dynamic Programming for Combinatorial Optimization and Data Mining; Intelligent Systems Reference Library; Springer: Berlin/Heidelberg, Germany, 2019; Volume 146.
  24. Clare, A.; King, R.D. Knowledge Discovery in Multi-label Phenotype Data. In Lecture Notes in Computer Science, Proceedings of the Principles of Data Mining and Knowledge Discovery, 5th European Conference, PKDD 2001, Freiburg, Germany, 3–5 September 2001; Raedt, L.D., Siebes, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2001; Volume 2168, pp. 42–53.
  25. Kollár, L. Optimal Sorting of Seven Element Sets. In Lecture Notes in Computer Science, Proceedings of the Mathematical Foundations of Computer Science 1986, Bratislava, Czechoslovakia, 25–29 August 1986; Gruska, J., Rovan, B., Wiedermann, J., Eds.; Springer: Berlin/Heidelberg, Germany, 1986; Volume 233, pp. 449–457.
  26. Peczarski, M. New Results in Minimum-Comparison Sorting. Algorithmica 2004, 40, 133–145.
Figure 1. A multiple-value decision table T_0.
Figure 2. A separable subtable T_0(f_1, 0) of T_0.
Figure 3. A decision tree for the multiple-value decision table T_0.
Table 1. Multiple-value decision table T_sort3(3).

y1:y2   y1:y3   y2:y3   Set of decisions
  <       <       <     {123}
  <       <       =     {123, 132}
  <       <       >     {132}
  <       =       >     {132, 312}
  <       >       >     {312}
  =       <       <     {123, 213}
  =       =       =     {123, 132, 213, 231, 312, 321}
  =       >       >     {312, 321}
  >       <       <     {213}
  >       =       <     {213, 231}
  >       >       <     {231}
  >       >       =     {231, 321}
  >       >       >     {321}
Table 2. Lowest depth of decision trees for the decision table T_sort3(n), n = 3, …, 5, using DP.

n   h^(1)(T_sort3(n))   h^(2)(T_sort3(n))   h^(3)(T_sort3(n))   h^(4)(T_sort3(n))   h^(5)(T_sort3(n))
3           3                   2                   2                   2                   2
4           5                   4                   4                   4                   4
5           7                   6                   6                   6                   6
Table 3. Lowest depth of decision trees for the decision table T_sort3(n), n = 3, …, 5, using the greedy method.

n   h^(1)(T_sort3(n))   h^(2)(T_sort3(n))   h^(3)(T_sort3(n))   h^(4)(T_sort3(n))   h^(5)(T_sort3(n))
3           3                   2                   2                   2                   2
4           6                   5                   5                   5                   5
5          10                   9                   9                   9                   9
Table 4. Characteristics of multiple-value decision tables.

Name of table   Initial table (removed columns)    Rows   Attr   Number of decisions   Distribution @1, @2, @3, …
cars1           cars (1)                            432     5            4             258, 161, 13
flags2          flags (1, 2, 3, 19)                 176    22            6             168, 8
flags1          flags (1, 2, 3, 5, 15)              177    21            6             166, 10, 1
flags3          flags (1, 2, 3)                     184    23            6             178, 6
flags4          flags (1)                           190    25            6             188, 2
lymph1          lymphography (1, 13, 14, 15, 18)    122    13            4             113, 9
lymph2          lymphography (13, 14, 15, 18)       136    14            4             132, 4
nursery1        nursery (1, 5, 6, 7)                240     4            5             97, 96, 47
nursery2        nursery (1)                        4320     7            5             2858, 1460, 2
zoo1            zoo-data (2, 6, 8, 9, 13)            43    11            7             40, 1, 2
zoo2            zoo-data (2, 9, 13, 14)              44    12            7             40, 4
zoo3            zoo-data (6, 13)                     46    14            7             44, 2
Table 5. Minimum depth of decision trees using DP.

T          h^(1)(T)   h^(2)(T)   h^(3)(T)   h^(4)(T)   h^(5)(T)
cars1         2          5          2          5          2
flags2        6          6          5          9          6
flags1        5          6          5          8          5
flags3        5          6          5          9          5
flags4        5          6          4          9          5
lymph1        5          5          4          5          4
lymph2        5          5          5          6          5
nursery1      1          3          1          3          1
nursery2      4          7          4          7          4
zoo1          4          4          4          4          4
zoo2          5          4          4          4          4
zoo3          4          4          4          5          4
mean        4.25       5.08       3.92       6.17       4.08
std         1.36       1.11       1.19       2.07       1.32
Table 6. Minimum depth of decision trees using the greedy method.

T          h^(1)(T)   h^(2)(T)   h^(3)(T)   h^(4)(T)   h^(5)(T)
cars1         5          5          5          5          5
flags2       12         15         12         17         11
flags1       12         12         10         15         10
flags3       10         14          9         18          9
flags4       13         14          7         18          7
lymph1       12         10          8         10          8
lymph2       11         11         10         11         10
nursery1      2          4          2          4          2
nursery2      7          7          7          7          7
zoo1          5          4          4          6          4
zoo2          8          5          5          6          5
zoo3          4          6          4          6          4
mean        8.42       8.92       6.92      10.25       6.83
std         3.59       4.03       2.87       5.17       2.73