Applications of Depth Minimization of Decision Trees Containing Hypotheses for Multiple-Value Decision Tables

In this research, we consider decision trees that incorporate standard queries, each about the value of a single feature, as well as queries about hypotheses that assign values to all features. These decision trees are used to represent knowledge and are comparable to those investigated in exact learning, where membership and equivalence queries are used. As an application, we study the construction of decision trees for two cases: sorting a sequence that may contain equal elements, and multiple-value decision tables derived from the UCI Machine Learning Repository. We compare the efficiency of several forms of decision trees with hypotheses that are optimal with respect to depth for the aforementioned applications. We also investigate the efficiency of decision trees built by dynamic programming and by an entropy-based greedy method. We found that the greedy algorithm produces results very similar to those of the dynamic programming algorithms. Therefore, since the dynamic programming algorithms take a long time, we may readily apply the greedy algorithms instead.


Background and Related Study
In traditional decision tables, each object (row) is assigned a single decision (or label). However, in some cases, such as semantic annotation of images [1], music categorization into emotions [2], functional genomics [3], text categorization [4], and so on, a set of decisions is given to every row. In the literature, we frequently find works on multi-label learning [5] and multi-instance learning [6] that discuss prediction problems for multiple-value decision tables. It is also important to mention ambiguous learning [7], partial learning [8], multiple label learning [9], and semi-supervised learning [10]. Additionally, multiple-value decision tables (multi-label decision tables) appear in research where decision trees are regarded as algorithms, including (i) the diagnosis of circuit faults, (ii) computational geometry, and (iii) combinatorial optimization [11].
In decision tables built from empirical observations, we frequently encounter clusters of equal rows that, most likely, carry different decisions. We can replace such a cluster with a single row labeled by the set of decisions attached to the rows of the cluster [11].
In contrast to exact learning [12-16], which employs both membership and equivalence queries in its algorithms, conventional decision trees [17,18] use only features, which are similar to membership queries. In [19-21], we investigated decision trees that use, in addition to features, hypotheses about the values of all features. Queries based on hypotheses are comparable to the equivalence queries of exact learning. We studied five forms of decision trees based on different combinations of features, hypotheses, and proper hypotheses (the analogs of proper equivalence queries in exact learning).
In general, decision trees utilizing hypotheses can be more efficient than those using only features. As an illustration, consider the task of computing the conjunction x_1 ∧ · · · ∧ x_n. A decision tree that solves this problem using only the features x_1, . . . , x_n must have depth at least n. However, the minimum depth of a decision tree solving this problem with proper hypotheses is 1: it suffices to ask about the proper hypothesis x_1 = 1, . . . , x_n = 1. The conjunction equals 1 if this hypothesis is true, and 0 otherwise.
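To make the depth-1 argument concrete, here is a minimal Python sketch (an illustration, not part of the paper's framework): an oracle holding a hidden Boolean tuple answers a proper-hypothesis query either by confirming the hypothesis or by returning a counterexample, and a single such query determines the conjunction.

```python
# Hedged sketch: computing x1 ∧ ... ∧ xn with one proper-hypothesis query.
# All names here are illustrative.

def ask_hypothesis(hidden, hypothesis):
    """Return None if the hypothesis matches the hidden tuple,
    otherwise a counterexample (index, actual_value)."""
    for i, (h, a) in enumerate(zip(hypothesis, hidden)):
        if h != a:
            return (i, a)
    return None

def conjunction_via_proper_hypothesis(hidden):
    n = len(hidden)
    answer = ask_hypothesis(hidden, (1,) * n)  # one query: depth 1
    return 1 if answer is None else 0

print(conjunction_via_proper_hypothesis((1, 1, 1)))  # 1
print(conjunction_via_proper_hypothesis((1, 0, 1)))  # 0
```

A tree asking feature queries only would need to inspect all n variables in the worst case, while the hypothesis query resolves the problem in one step.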
To find the minimum complexity of decision trees of the five forms, we proposed dynamic programming algorithms in [19]. We also studied the length and coverage of the decision rules derived from the optimal decision trees. A number of experimental findings showed that decision trees with hypotheses can be less complex than conventional decision trees. Furthermore, the decision rules derived from the former trees can be shorter and have greater coverage than those derived from the latter. These results imply that knowledge representation could benefit from decision trees with hypotheses [19].
In this paper, we consider the problem of minimizing the depth of a decision tree. Note that reducing the depth often also reduces the length of the rules derived from the tree. Furthermore, we look at two applications to demonstrate the use of the tools created in [19]: sorting n elements (n = 3, . . . , 5) taken from a linearly ordered set, where equal elements are allowed, and creating decision trees for tables modified from the UCI ML Repository [22] to form multiple-value decision tables. For each of these applications, we investigate the depth of optimal decision trees of the five forms. The same parameter is investigated for decision trees of the five forms created by an entropy-based greedy method.
The main goal of this research is the comparison of decision trees containing hypotheses with traditional decision trees. The experimental results demonstrate that, for both applications, we can almost always find decision trees with hypotheses that outperform traditional decision trees. Furthermore, the greedy method can be applied easily, since its results are comparable to those of dynamic programming.
The remainder of the manuscript is structured as follows: Section 2 covers basic definitions and notation. Section 3 discusses methods to minimize the depth of a decision tree, followed by Section 4, which presents the two applications of the created methods. Section 5 contains a short conclusion and future directions.

Preliminary Notation
The concepts of a multiple-value decision table and a decision tree for such a table are discussed in this section. Previously, such notions were described in the context of decision tables with a single decision [23] and decision tables with many-valued decisions [11]. Finally, we describe five different forms of decision trees and one parameter of a decision tree, its depth. This parameter will be used later to compare the different forms of trees.

Multiple-Value Decision Table
A multiple-value decision table is a table T filled with nonnegative integers. The columns of this table are tagged by features. The rows of the table are pairwise different, and each row is tagged by a nonempty finite set of natural numbers, called its set of decisions. D(T) denotes the union of the sets of decisions over all rows. Figure 1 illustrates a multiple-value decision table T_0.
We now discuss basic notions related to a multiple-value decision table T. If T contains no rows, it is said to be empty. T is referred to as degenerate if it is empty or if its rows have a common decision (a decision that belongs to the set of decisions of every row). Consider a feature f_i ∈ {f_1, . . . , f_n}. Then X(T, f_i) is the set of values of f_i in the table T, and X(T) is the set of features with two or more values, i.e., those with |X(T, f_i)| ≥ 2.
Assume T is a nonempty decision table. A table made up of some rows of T is a subtable of T. Let f_{i_1}, . . . , f_{i_m} be features from the set {f_1, . . . , f_n} and b_1, . . . , b_m be some of their values. We consider a special subtable of the form T(f_{i_1}, b_1) . . . (f_{i_m}, b_m): it contains exactly the rows of T that have the values b_1, . . . , b_m at the intersections with the columns f_{i_1}, . . . , f_{i_m}. Such subtables are referred to as separable subtables of T; in particular, T itself is a separable subtable of T. The set of separable subtables of T is denoted by SEP(T). If we view T as an optimization problem, then the separable subtables of T are the analogs of its subproblems. Figure 2 illustrates the separable subtable T_0(f_1, 0) of the table T_0.
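The separable-subtable operation can be sketched in a few lines of Python; the representation of a table as a list of row tuples plus a parallel list of decision sets is an illustrative assumption, not the paper's format.

```python
# Sketch of T(f_i1, b1)...(f_im, bm): keep the rows whose values in the
# queried columns equal the queried values. Representation is illustrative.

def separable_subtable(rows, decision_sets, constraints):
    """constraints: dict {column_index: value}."""
    kept = [(r, d) for r, d in zip(rows, decision_sets)
            if all(r[i] == b for i, b in constraints.items())]
    sub_rows = [r for r, _ in kept]
    sub_decs = [d for _, d in kept]
    return sub_rows, sub_decs

rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
decs = [{1}, {1, 2}, {2}, {3}]
sub_rows, sub_decs = separable_subtable(rows, decs, {0: 0})  # like T0(f1, 0)
print(sub_rows)  # [(0, 0), (0, 1)]
```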

Decision Tree
Assume T is a nonempty decision table with features f_1, . . . , f_n. Decision trees with two kinds of queries are considered. First, we can ask about the value of a feature f_i ∈ F(T) = {f_1, . . . , f_n}. The possible responses to this query are the expressions {f_i = b}, where b ∈ X(T, f_i). Second, we can ask about a hypothesis over T. A hypothesis has the form H = {f_1 = b_1, . . . , f_n = b_n}, where b_1 ∈ X(T, f_1), . . . , b_n ∈ X(T, f_n). The possible responses to this query are the hypothesis itself and counterexamples of the form {f_1 = c_1}, c_1 ∈ X(T, f_1) \ {b_1}, . . . , {f_n = c_n}, c_n ∈ X(T, f_n) \ {b_n}. If the response is the hypothesis itself, the hypothesis is declared true. If (b_1, . . . , b_n) coincides with a row of T, then H is called a proper hypothesis for T.
A decision tree over T is a rooted tree in which each terminal vertex is labeled with a number from the set D(T) ∪ {0}. Every non-terminal vertex (referred to as a "functioning vertex" in this context) is labeled either by a hypothesis over T or by a feature from the set F(T). If a functioning vertex is labeled by a hypothesis H = {f_1 = b_1, . . . , f_n = b_n} over T, precisely one edge leaving this vertex is labeled with each response from C(H) = {H} ∪ {{f_i = c} : i = 1, . . . , n, c ∈ X(T, f_i) \ {b_i}}. If the vertex is labeled by a feature f_i from F(T), precisely one edge leaving it is labeled with each response from C(f_i) = {{f_i = b} : b ∈ X(T, f_i)}. In both cases, only these labeled edges leave the vertex.
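The two response sets that label edges, C(f_i) and C(H), can be sketched as follows; the tuple-based encoding of responses and the list X of value sets are assumptions of this illustration.

```python
# Illustrative sketch of the response sets C(f_i) and C(H).
# X[i] stands for X(T, f_i).

def feature_responses(X, i):
    """C(f_i): one response {f_i = b} per value b in X(T, f_i)."""
    return [("feature", i, b) for b in sorted(X[i])]

def hypothesis_responses(X, H):
    """C(H): the hypothesis itself plus every counterexample {f_i = c}."""
    responses = [("hypothesis", H)]
    for i, b in enumerate(H):
        responses += [("counterexample", i, c) for c in sorted(X[i] - {b})]
    return responses

X = [{0, 1}, {0, 1, 2}]
print(feature_responses(X, 1))           # 3 responses, one per value
print(hypothesis_responses(X, (1, 0)))   # the hypothesis + 3 counterexamples
```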
Assume u is a vertex of a decision tree Γ over T. We now define an equation system S(Γ, u) over T associated with the vertex u. Let ξ be the directed path from the root of Γ to u. If ξ contains no functioning vertices, S(Γ, u) is the empty system. Otherwise, S(Γ, u) is the union of the equation systems attached to the edges of ξ.
The tree Γ is said to be a decision tree for T when it satisfies the following rules. For each vertex u of Γ, the subtable TS(Γ, u) is degenerate if and only if the vertex u is terminal. In addition, a terminal vertex u is labeled with 0 if TS(Γ, u) is empty; otherwise, a terminal vertex u is labeled with a common decision for TS(Γ, u).
An example of a decision tree for T_0 is given in Figure 3.

Different Tree Forms and Corresponding Depth
A full path in Γ is any directed path from the root to a terminal vertex. The depth of a decision tree is the maximum length of a full path in the tree; it is regarded as the tree's time complexity. The depth of Γ is denoted h(Γ).
The following notation will be used for the minimum depth of a decision tree for T that:
employs only features from F(T) -> h^(1)(T);
employs only hypotheses over T -> h^(2)(T);
employs features from F(T) as well as hypotheses over T -> h^(3)(T);
employs only proper hypotheses over T -> h^(4)(T);
employs features from F(T) as well as proper hypotheses over T -> h^(5)(T).

Procedure to Minimize Depth
In this section, we first describe how to minimize the depth of a decision tree using dynamic programming, and then we describe a similar procedure based on a greedy algorithm.

Creation of DAG
Assume T is a nonempty multiple-value decision table with features f_1, . . . , f_n. We describe the algorithm A_dag (Algorithm 1) that constructs the directed acyclic graph (DAG) ∆(T). The vertices of this graph are separable subtables of the table T. We deal with one vertex per iteration. We begin with a graph containing the single untreated vertex T and finish when all vertices of the graph have been treated. At each iteration, we examine an untreated vertex (subtable) β: if it is degenerate, we do not split it; otherwise, we split the vertex and continue the process. Similar procedures were described previously in [11,23].

Algorithm 1: A_dag
Input: A nonempty multiple-value decision table T with features f_1, . . . , f_n.
Output: A DAG ∆(T).

1.
Construct a graph with the single vertex T, which is not marked as treated.

2.
If every vertex in the graph has been treated, the algorithm finishes and returns the graph as ∆(T). Otherwise, pick an untreated vertex (table) β.

3.
If (the vertex β is degenerate) { mark it as treated and proceed to step 2. } else { for each f_i ∈ X(β), create a bundle of edges leaving the vertex β: if X(β, f_i) = {b_1, . . . , b_e}, draw e arcs from β and label them with the pairs (f_i, b_1), . . . , (f_i, b_e). These arcs enter the vertices β(f_i, b_1), . . . , β(f_i, b_e), which are added to the graph if they are not already present. Mark the vertex β as treated and proceed to step 2. }

In general, the time complexity of A_dag is exponential in the size of the table. Note that in the book [11] we discussed several classes of decision tables for which the number of separable subtables is bounded from above by a polynomial in the number of columns. For each of these classes, the time complexity of A_dag is polynomial in the size of the input decision tables.
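The steps of Algorithm 1 can be sketched as follows; the encoding of a table as a frozenset of (row, decision set) pairs is an illustrative choice, not the paper's format.

```python
# Sketch of A_dag: build the DAG of separable subtables of a table.
# A table is a frozenset of (row tuple, frozenset of decisions) pairs.

def is_degenerate(table):
    if not table:
        return True
    return bool(set.intersection(*[set(d) for _, d in table]))

def build_dag(table):
    dag = {}                       # vertex -> list of (feature, value, child)
    untreated = [table]
    while untreated:
        beta = untreated.pop()
        if beta in dag:            # already treated
            continue
        edges = []
        if not is_degenerate(beta):
            n = len(next(iter(beta))[0])
            for i in range(n):
                values = {r[i] for r, _ in beta}
                if len(values) < 2:          # f_i not in X(beta)
                    continue
                for b in values:             # one arc per pair (f_i, b)
                    child = frozenset(p for p in beta if p[0][i] == b)
                    edges.append((i, b, child))
                    untreated.append(child)
        dag[beta] = edges
    return dag

T0 = frozenset([((0, 0), frozenset({1})), ((0, 1), frozenset({1, 2})),
                ((1, 0), frozenset({2})), ((1, 1), frozenset({3}))])
dag = build_dag(T0)
print(len(dag))  # number of separable subtables reached, including T0 itself
```

Because subtables are hashable sets, a subtable reachable along two different paths is stored only once, which is exactly why ∆(T) is a DAG rather than a tree.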

Method to Minimize Depth
Assume T is a nonempty decision table with features f_1, . . . , f_n. The values h^(1)(T), . . . , h^(5)(T) can be computed using the DAG ∆(T). Let t ∈ {1, . . . , 5}. To determine h^(t)(T), we compute the value h^(t)(β) for each vertex β of ∆(T). It is convenient to consider not only the subtables corresponding to vertices of ∆(T), but also Λ (the empty subtable of T) and the subtables T_r that contain a single row r of T and are not vertices of ∆(T). Our method starts with the terminal vertices of ∆(T) (vertices without outgoing arcs) and these special subtables, and then moves up toward the table T one step at a time.
Assume β is a terminal vertex of ∆(T) or β = T_r for a row r of T. Then the decision tree for β consists of a single vertex labeled with a common decision for β, so h^(t)(β) = 0. If β = Λ, the decision tree for Λ consists of a single vertex labeled with 0, and in this case h^(t)(Λ) = 0.

Now assume β is a nonterminal vertex of ∆(T) such that the value h^(t) is already known for all of its children. With this information, we can determine the minimum depth of a decision tree for β. Regarding the root of this tree, we can have the following scenarios. Consider a decision tree Γ(f_i) for β whose root is labeled with a feature f_i ∈ X(β). For each b ∈ X(T, f_i), an edge labeled with the equation system {f_i = b} leaves the root and enters a vertex u(b). The vertex u(b) is the root of a decision tree of form t for β{f_i = b}, and its depth is h^(t)(β{f_i = b}). Hence

h(Γ(f_i)) = 1 + max{h^(t)(β{f_i = b}) : b ∈ X(T, f_i)}. (1)

With the root labeled by the feature f_i and decision trees of form t used for the subtables corresponding to the root's children, it can be shown that h(Γ(f_i)) is the minimum depth of such a decision tree for β.
Features f_i for which there exists b ∈ X(T, f_i) with β{f_i = b} = β cannot be used to build an optimal decision tree for β, so we do not consider them; these are exactly the features outside X(β). Thus, the minimum depth of a decision tree for β whose root is labeled by a feature is

h_a^(t)(β) = min{h(Γ(f_i)) : f_i ∈ X(β)}. (2)

The procedure of h_a^(t)(β) computation: construct the set of features X(β); calculate the value h(Γ(f_i)) for each feature f_i ∈ X(β) using (1); determine h_a^(t)(β) according to (2).
Note that this procedure's time complexity is polynomial in the size of the table T.
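The computation of h_a^(t)(β) by equations (1) and (2) can be sketched as follows, with a caller-supplied `depth_of` standing in for the already-computed h^(t) values of the children (an assumption of this sketch); `beta` is a list of row tuples and `X_T[i]` is the value set X(T, f_i).

```python
# Sketch of equations (1) and (2): minimum depth when the root asks a
# feature query. `depth_of` is an illustrative stand-in for h^(t).

def h_a(beta, X_T, depth_of):
    best = None
    for i in range(len(beta[0])):
        if len({r[i] for r in beta}) < 2:       # f_i not in X(beta): skip
            continue
        # equation (1): h(Gamma(f_i)) = 1 + max_b h^(t)(beta{f_i = b})
        candidate = 1 + max(depth_of(tuple(r for r in beta if r[i] == b))
                            for b in X_T[i])
        # equation (2): minimize over f_i in X(beta)
        best = candidate if best is None else min(best, candidate)
    return best

beta = [(0, 0), (0, 1), (1, 0), (1, 1)]
X_T = [{0, 1}, {0, 1}]
# here, children with at most two rows are assumed degenerate (depth 0)
print(h_a(beta, X_T, lambda sub: 0 if len(sub) <= 2 else 1))  # 1
```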
We consider only admissible hypotheses over T. Let H = {f_1 = b_1, . . . , f_n = b_n} be such a hypothesis. It is admissible for β and a feature f_i ∈ F(T) = {f_1, . . . , f_n} if β{f_i = c} ≠ β for any c ∈ X(T, f_i) \ {b_i}. Equivalently, H is not admissible for β and a feature f_i ∈ F(T) if and only if |X(β, f_i)| = 1 and b_i ∉ X(β, f_i). The hypothesis H is referred to as admissible for β if it is admissible for β and every feature f_i ∈ F(T).
Assume the root of a decision tree Γ(H) for β is labeled by a hypothesis H = {f_1 = b_1, . . . , f_n = b_n} over T that is admissible for β. For each S ∈ C(H), an edge labeled with the equation system S leaves the root of Γ(H) and enters a vertex u(S). The vertex u(S) is the root of a decision tree of form t for βS, and its depth is h^(t)(βS). Hence

h(Γ(H)) = 1 + max{h^(t)(βS) : S ∈ C(H)}. (3)

Obviously, for any f_i ∈ X(β) and any c ∈ X(β, f_i) \ {b_i}, β{f_i = c} is a child of β in ∆(T), so h^(t)(β{f_i = c}) is already known.
With the root labeled by the hypothesis H and decision trees of form t used for the subtables corresponding to the root's children, it can be shown that h(Γ(H)) is the minimum depth of a decision tree for β.
Hypotheses that are not admissible for β should not be taken into consideration. We construct a special hypothesis H_β = {f_1 = b_1, . . . , f_n = b_n} for β as follows. For each f_i ∈ F(T) \ X(β), b_i is the sole number in the set X(β, f_i). For each f_i ∈ X(β), b_i is the lowest number from X(β, f_i) such that h^(t)(β{f_i = b_i}) = max{h^(t)(β{f_i = c}) : c ∈ X(β, f_i)}. There is no doubt that H_β is admissible for β, and it is easy to see that no admissible hypothesis for β gives a smaller value of h(Γ(H)).

The procedure of h_h^(t)(β) computation: construct the hypothesis H_β and calculate the value h(Γ(H_β)) using (3); then h_h^(t)(β) = h(Γ(H_β)). A similar procedure computes the value h_p^(t)(β), the minimum depth when the root is labeled by a proper hypothesis for T. Note that each of these procedures has time complexity polynomial in the size of the table T.
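The construction of the hypothesis H_β can be sketched as follows: for each feature outside X(β), take its sole value; for each feature in X(β), take the lowest value whose child subtable has the maximum already-computed depth, so the worst counterexample disappears from the response set. The `depth_of` argument is an illustrative stand-in for h^(t).

```python
# Sketch of building H_beta from the already-computed child depths.

def build_H_beta(beta, X_T, depth_of):
    H = []
    for i in range(len(beta[0])):
        values = sorted({r[i] for r in beta})    # X(beta, f_i)
        if len(values) == 1:                     # f_i not in X(beta)
            H.append(values[0])
        else:
            child = lambda c: tuple(r for r in beta if r[i] == c)
            # max depth first; among ties, the lowest value wins (-c)
            H.append(max(values, key=lambda c: (depth_of(child(c)), -c)))
    return tuple(H)

beta = [(0, 0), (0, 1), (1, 0), (1, 1)]
X_T = [{0, 1}, {0, 1}]
print(build_H_beta(beta, X_T, lambda sub: max(r[0] for r in sub)))  # (1, 0)
```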

Now we describe an algorithm A_h^t (Algorithm 2) that determines h^(t)(T), the minimum depth of a decision tree of the given form t (t = 1, . . . , 5) for the table T. During the execution of the algorithm, we compute h^(t)(β) for each vertex β of ∆(T).

Algorithm 2: A_h^t

Input: The DAG ∆(T) for a nonempty table T.
Output: The value of the depth h (t) (T).

1.
If all vertices of the DAG ∆(T) have numbers assigned to them, the algorithm finishes and returns the number h^(t)(T) assigned to the vertex T. Otherwise, we select a vertex β without an assigned number; such a vertex is either a terminal vertex of ∆(T) or a nonterminal vertex of ∆(T) all of whose children already have assigned numbers.

2.
If (β is a terminal vertex) { assign the value h^(t)(β) = 0 to β and proceed to step 1. } else { assign to β, depending on the value t: h^(t)(β) = h_a^(t)(β) for t = 1; h^(t)(β) = h_h^(t)(β) for t = 2; h^(t)(β) = min{h_a^(t)(β), h_h^(t)(β)} for t = 3; h^(t)(β) = h_p^(t)(β) for t = 4; h^(t)(β) = min{h_a^(t)(β), h_p^(t)(β)} for t = 5, where h_p^(t)(β) is defined like h_h^(t)(β) but with proper hypotheses. Then proceed to step 1. }

Greedy Algorithms
In this section, we present greedy techniques, based on the entropy uncertainty measure, for building decision trees with hypotheses for multiple-value decision tables.
Assume T is a nonempty multiple-value decision table with n features f_1, . . . , f_n, and let β be a subtable of T. The entropy of β (denoted entML(β)) is defined as follows:

1.
Assume β is empty or degenerate. Then entML(β) = 0.

2.
Assume β is nonempty and nondegenerate. Let N(β) be the number of rows in β, N_s(β) be the number of rows in β whose set of decisions contains the decision s (s ∈ D(β)), and p_s = N_s(β)/N(β). Then entML(β) = −∑_{s∈D(β)} (p_s log_2 p_s + (1 − p_s) log_2(1 − p_s)) [24].
We now define the impurity of a query for the subtable β under the uncertainty measure entML. The impurity of a query based on a feature f_i ∈ F(T) is I_entML(f_i, β) = max{entML(βS) : S ∈ C(f_i)}. The impurity of a query based on a hypothesis H is I_entML(H, β) = max{entML(βS) : S ∈ C(H)}.
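The entML measure and the impurity of a query can be sketched directly from the definitions above; the list-of-decision-sets representation of a subtable is an illustrative assumption.

```python
import math

# Sketch of entML and of the impurity of a query (the maximum entML over
# the subtables produced by the query's responses).

def entML(decision_sets):
    # case 1: empty or degenerate subtable
    if not decision_sets or set.intersection(*map(set, decision_sets)):
        return 0.0
    # case 2: -sum_s (p_s log2 p_s + (1 - p_s) log2 (1 - p_s))
    n = len(decision_sets)
    total = 0.0
    for s in set().union(*decision_sets):
        p = sum(1 for d in decision_sets if s in d) / n
        for q in (p, 1 - p):
            if 0 < q < 1:
                total -= q * math.log2(q)
    return total

def impurity(response_subtables):
    """response_subtables: decision sets of the subtables beta S, S in C(query)."""
    return max(entML(d) for d in response_subtables)

print(entML([{1}, {2}]))       # 2.0 (p_1 = p_2 = 1/2)
print(entML([{1}, {1, 2}]))    # 0.0 (decision 1 is common: degenerate)
print(impurity([[{1}, {2}], [{1}]]))  # 2.0
```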
A greedy algorithm A_entML (Algorithm 3) is described below. It generates a decision tree of form t for the table T.

Algorithm 3: A_entML
Input: A nonempty multiple-value decision table T and a number t ∈ {1, . . . , 5}.
Output: A decision tree of form t for T.

1.
Create a tree G with a sole vertex that is tagged by T.

2.
If none of the vertices of the tree G is labeled by a subtable of T, the algorithm terminates and returns the tree G.

3.
Pick a vertex u in the tree G that is labeled by a subtable β of the table T.

4.
If (β is degenerate) { u is labeled with 0 if β is empty and with a common decision of β otherwise. } else { we select a query M (either a feature or a hypothesis) admissible for β, depending on t, according to the following rules:
if (t = 1) { a feature M ∈ F(T) is chosen that is admissible for β and has the lowest impurity I_entML(M, β). }
else if (t = 2) { a hypothesis M over T is chosen that is admissible for β and has the lowest impurity I_entML(M, β). }
else if (t = 3) { a feature Y ∈ F(T) is chosen that is admissible for β and has the lowest impurity I_entML(Y, β), and a hypothesis Z over T is found that is admissible for β and has the lowest impurity I_entML(Z, β); we select the query M out of Y and Z with the minimum impurity I_entML(M, β). }
else if (t = 4) { a proper hypothesis M over T is chosen that is admissible for β and has the lowest impurity I_entML(M, β). }
else if (t = 5) { a feature Y ∈ F(T) is chosen that is admissible for β and has the lowest impurity I_entML(Y, β), and a proper hypothesis Z over T is found that is admissible for β and has the lowest impurity I_entML(Z, β); we select the query M out of Y and Z with the minimum impurity I_entML(M, β). } }

5.
Instead of β, the query M is used to label the vertex u. For each response S ∈ C(M), we add a vertex u(S) and an edge e(S) connecting u and u(S) to the tree G. The edge e(S) is labeled with the response S, and the vertex u(S) is labeled with the subtable βS. Proceed to step 2.
The algorithm A_entML generates a decision tree of form t for the table T.
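For form t = 1 (features only), the greedy construction can be sketched as follows; the nested-tuple tree encoding and the row/decision-set representation are illustrative assumptions of this sketch.

```python
import math

def entML(decs):
    if not decs or set.intersection(*map(set, decs)):
        return 0.0                                 # empty or degenerate
    n = len(decs)
    total = 0.0
    for s in set().union(*decs):
        p = sum(1 for d in decs if s in d) / n
        for q in (p, 1 - p):
            if 0 < q < 1:
                total -= q * math.log2(q)
    return total

def greedy_tree(rows, decs):
    """Greedy construction for form t = 1: split on the lowest-impurity feature."""
    if not rows:
        return 0                                   # empty subtable: leaf 0
    common = set.intersection(*map(set, decs))
    if common:
        return min(common)                         # degenerate: a common decision
    best_i, best_imp = None, None
    for i in range(len(rows[0])):
        values = {r[i] for r in rows}
        if len(values) < 2:                        # f_i not in X(beta)
            continue
        imp = max(entML([d for r, d in zip(rows, decs) if r[i] == b])
                  for b in values)                  # I_entML(f_i, beta)
        if best_imp is None or imp < best_imp:
            best_i, best_imp = i, imp
    children = {}
    for b in sorted({r[best_i] for r in rows}):
        sub = [(r, d) for r, d in zip(rows, decs) if r[best_i] == b]
        children[b] = greedy_tree([r for r, _ in sub], [d for _, d in sub])
    return (best_i, children)

def depth(tree):
    return 0 if not isinstance(tree, tuple) else 1 + max(map(depth, tree[1].values()))

rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
decs = [{1}, {1, 2}, {2}, {3}]
tree = greedy_tree(rows, decs)
print(tree)         # nested (feature, {value: subtree-or-decision}) encoding
print(depth(tree))  # 2
```

Forms 2-5 follow the same skeleton, with the candidate set at each step extended to (proper) hypotheses and their impurity computed over C(H).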

Applications
In this section, we look at two applications of the aforementioned algorithms. The first application is the sorting problem, and the second is the experimental analysis of modified decision tables from the UCI Machine Learning Repository [22].

Sorting
In theoretical studies (see [25,26]), the sorting problem is usually defined as sorting a sequence of n pairwise distinct elements y_1, . . . , y_n from a linearly ordered set. There is exactly one permutation (s_1, . . . , s_n) of the set {1, . . . , n} such that y_{s_1} < . . . < y_{s_n}, and any comparison of two elements y_i and y_j has only two possible results: y_i < y_j and y_i > y_j. The goal is to find, for a given sequence y_1, . . . , y_n, the permutation (s_1, . . . , s_n) such that y_{s_1} < . . . < y_{s_n}. Now consider the case where the sequence y_1, . . . , y_n may contain equal elements. Then there can be more than one permutation (s_1, . . . , s_n) of {1, . . . , n} with y_{s_1} ≤ . . . ≤ y_{s_n}, and any comparison of two elements y_i and y_j has three possible results: y_i < y_j, y_i = y_j, and y_i > y_j. The goal is to find, for a given sequence y_1, . . . , y_n, a permutation (s_1, . . . , s_n) such that y_{s_1} ≤ . . . ≤ y_{s_n}.
For a given n, we can construct the multiple-value decision table T^3_sort(n). In this table, the columns are labeled with the features y_i : y_j, 1 ≤ i < j ≤ n, and the rows represent all feasible tuples of values of these features for sequences y_1, . . . , y_n that may contain equal elements (the table T^3_sort(3) is illustrated in Table 1). The set of corresponding permutations serves as the set of decisions for each row. The number 3 in the notation T^3_sort(n) indicates that three-valued features are considered.

Table 2 contains the minimum depth of a decision tree for T^3_sort(n), n = 3, . . . , 5, computed by the dynamic programming algorithm. Compared to decision trees of form 1, the decision trees of forms 2, 3, 4, and 5 have smaller minimum depth.

Table 2. Lowest depth of decision trees for decision table T^3_sort(n), n = 3, . . . , 5, using DP.

Table 3 contains the minimum depth of a decision tree for T^3_sort(n), n = 3, . . . , 5, obtained by the greedy algorithm. It is evident that forms 2, 3, 4, and 5 produce a smaller depth (highlighted in bold) than form 1.

Table 3. Lowest depth of decision trees for decision table T^3_sort(n), n = 3, . . . , 5, using the greedy method.

If we compare the results obtained by the greedy algorithms and by dynamic programming, we can see that they are quite close. For n = 3, the results coincide. Although the results of dynamic programming are better as n increases, the dynamic programming algorithms take an exponential amount of time. Therefore, to reduce the time, we can apply the greedy algorithms.
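The table T^3_sort(n) described above can be generated programmatically for small n; the sketch below enumerates sequences over {0, . . . , n − 1} (which realize every feasible comparison tuple) and collects, for each row, its set of sorting permutations. The 0/1/2 encoding of <, =, > is an assumption of this illustration.

```python
from itertools import permutations, product

# Sketch generating T^3_sort(n): rows are feasible tuples of the three-valued
# comparisons y_i : y_j, each labeled by the set of permutations that sort
# the sequence.

def t3sort(n):
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    table = {}
    # sequences over {0, ..., n-1} realize every feasible comparison tuple
    for seq in product(range(n), repeat=n):
        row = tuple(2 * (seq[i] > seq[j]) + (seq[i] == seq[j])
                    for i, j in pairs)
        decisions = {p for p in permutations(range(n))
                     if all(seq[p[k]] <= seq[p[k + 1]] for k in range(n - 1))}
        table.setdefault(row, set()).update(decisions)
    return table

T = t3sort(3)
print(len(T))             # 13 feasible rows (weak orderings of 3 elements)
print(len(T[(1, 1, 1)]))  # the all-equal row is labeled by all 6 permutations
```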

Modified Tables from UCI ML Repository
We extracted decision tables from the UCI ML Repository [22] and removed one or more features from each. As a result, some tables contain multiple rows with the same feature values but different decisions; these rows are merged into a single row labeled with the set of decisions of the merged rows. Some preparation steps are completed before the experiments begin: if a feature has a different value in every row, it is removed, and missing values are filled in with the most frequent value of the corresponding feature.
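The preparation steps described above can be sketched in pure Python; the row representation, the use of None as the missing-value marker, and the order of the steps are assumptions of this illustration.

```python
from collections import Counter

# Sketch of the table preparation: fill missing values with the feature's
# most frequent value, drop features that differ on every row, and merge
# equal rows into one row labeled by a set of decisions.

def preprocess(rows, labels, missing=None):
    n = len(rows[0])
    # 1. fill each missing value with the feature's most frequent value
    for i in range(n):
        present = [r[i] for r in rows if r[i] is not missing]
        mode = Counter(present).most_common(1)[0][0]
        rows = [tuple(mode if (k == i and v is missing) else v
                      for k, v in enumerate(r)) for r in rows]
    # 2. remove features that take a different value in every row
    keep = [i for i in range(n) if len({r[i] for r in rows}) < len(rows)]
    rows = [tuple(r[i] for i in keep) for r in rows]
    # 3. merge equal rows into one row labeled by a set of decisions
    merged = {}
    for r, y in zip(rows, labels):
        merged.setdefault(r, set()).add(y)
    return merged

rows = [(1, 'a', None), (2, 'a', 'x'), (3, 'b', 'x'), (4, 'b', 'x')]
labels = [1, 2, 2, 3]
table = preprocess(rows, labels)
print(table)  # two merged rows with decision sets {1, 2} and {2, 3}
```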
In Table 4, the column 'Name of table' gives the name of the new table (obtained after deleting features from the UCI ML Repository data set); the column 'The initial table removed columns' gives the name of the initial table along with the indexes of the features deleted from it; the columns 'Rows' and 'Attr' give the numbers of rows and features; the column 'Number of decisions' gives the number of decisions in the new table; and the column 'Distribution' gives a sequence of numbers @1, . . . , @i, . . ., where @i denotes the number of rows of T labeled with sets of decisions containing i decisions.

The minimum depth of a decision tree for the above-mentioned decision tables, computed by the dynamic programming algorithm, is given in Table 5. The table's name is in the first column, and the remaining columns contain the values h^(1)(T), . . . , h^(5)(T) (the lowest value for each table is in bold). It is clear that forms 2, 3, 4, and 5 produce a smaller optimum depth than form 1. Decision trees of the smallest depth are obtained for 8 decision tables when using features (form 1), 4 tables when using hypotheses (form 2), all tables when using features and hypotheses (form 3), 2 tables when using proper hypotheses (form 4), and 10 tables when using features and proper hypotheses (form 5). At the bottom, the mean and standard deviation are given for each form.

Table 5. Minimum depth of decision tree using DP.

The minimum depth of a decision tree for the above-mentioned decision tables, obtained by the greedy algorithm, is given in Table 6. The table's name is in the first column, and the remaining columns contain the values h^(1)(T), . . . , h^(5)(T) (the lowest value for each table is in bold). It is clear that forms 2, 3, 4, and 5 produce a smaller depth than form 1.
Decision trees of the smallest depth are obtained for four decision tables when using features (form 1), four tables when using hypotheses (form 2), eleven tables when using features and hypotheses (form 3), two tables when using proper hypotheses (form 4), and all tables when using features and proper hypotheses (form 5). At the bottom, the mean and standard deviation are given for each form.

Table 6. Minimum depth of decision tree using greedy method.

Discussion
We can compare the results in two ways. First, we can compare the five forms: for both applications, the minimum depth of a decision tree is optimal when forms 3 and 5 are used. Second, we can compare the results of the greedy algorithms with those of dynamic programming. The results are quite close for the tables "CARS1", "NURSERY1", "ZOO1", "ZOO2", and "ZOO3"; for the other tables, the results of dynamic programming are better. However, the dynamic programming algorithms take an exponential amount of time; therefore, to reduce the time, we can apply the greedy algorithms.

Conclusions
This article examines modified decision trees that employ both queries based on a hypothesis about the values of all features and queries based on one feature each. To minimize the depth of these decision trees, we created dynamic programming and greedy algorithms and considered two applications. The first application is the sorting problem; the second concerns decision tables from the UCI Machine Learning Repository modified into multiple-value decision tables. The experimental results show that the minimum depth obtained using features together with hypotheses is optimal in both applications. Furthermore, dynamic programming produces the optimal output, which is not far from the results of the greedy algorithms based on the entropy uncertainty measure.
In the future, we plan to deploy more greedy methods to investigate which uncertainty measures come closest to the optimal dynamic programming results. We also plan to study the minimization of the number of vertices of such modified decision trees for multiple-value decision tables, by both dynamic programming and greedy algorithms.