Fast Identification of High Utility Itemsets from Candidates

High utility itemsets (HUIs) are sets of items with high utility, like profit, in a database. Efficient mining of high utility itemsets is an important problem in the data mining area. Many mining algorithms adopt a two-phase framework. They first generate a set of candidate itemsets by roughly overestimating the utilities of all itemsets in a database, and subsequently compute the exact utility of each candidate to identify HUIs. Therefore, the major costs in these algorithms come from candidate generation and utility computation. Previous works mainly focus on how to reduce the number of candidates, without dedicating much attention to utility computation, to the best of our knowledge. However, we find that, for a mining task, the time of utility computation in two-phase algorithms dominates the whole running time of these algorithms. Therefore, it is important to optimize utility computation. In this paper, we first give a basic algorithm for HUI identification, the core of which is a utility computation procedure. Subsequently, a novel candidate tree structure is proposed for storing candidate itemsets, and a candidate tree-based algorithm is developed for fast HUI identification, in which there is an efficient utility computation procedure. Extensive experimental results show that the candidate tree-based algorithm outperforms the basic algorithm and the performance of two-phase algorithms, integrating the candidate tree algorithm as their second step, can be significantly improved.


Introduction
In recent years, high utility itemset (HUI) mining [1] has become one of the most significant problems in the area of data mining. The problem derives from the frequent itemset mining problem [2], but whereas frequent itemset mining only takes the frequencies of itemsets into account, HUI mining also considers the values of itemsets, such as profits. Efficient mining of high utility itemsets plays an important role in many real-life applications such as market analysis [3][4][5][6][7].
Many algorithms for high utility itemset mining adopt a two-phase framework [8][9][10][11], as shown in Figure 1. These algorithms first generate candidate itemsets, from which they subsequently identify high utility itemsets. Previous works pay much attention to reducing the number of candidate itemsets, which can result in performance improvement. However, these works neglect the identification process, although a carefully designed identification process also plays an important role in performance improvement. This work focuses on the fast identification of high utility itemsets from candidates.

Problem Definition
Let I = {i_1, i_2, i_3, ..., i_n} be a set of items and DB be a transaction database. DB is composed of two tables: a utility table and a transaction table. Each item in I has a utility value in the utility table. Each transaction in the transaction table is labeled with a Tid and is a subset of I, in which each item is associated with a count value. Tables 1 and 2 show a sample database.
Definition 1. The external utility of item i, denoted as eu(i), is the utility value of i in the utility table.
Definition 2. The internal utility of item i in transaction T, denoted as iu(i, T), is the count value of i in T in the transaction table.
Definition 3. The utility of item i in transaction T, denoted as u(i, T), is the product of eu(i) and iu(i, T), namely u(i, T) = eu(i) × iu(i, T).
For example, for the sample database in Tables 1 and 2, eu(a) = 2, iu(a, T4) = 3, and u(a, T4) = eu(a) × iu(a, T4) = 2 × 3 = 6. An itemset is a subset of I and is called a k-itemset if it contains k items.
Definition 4. The utility of itemset X in transaction T containing X, denoted as u(X, T), is the sum of the utilities of all items in X in T, namely u(X, T) = ∑_{i∈X ∧ X⊆T} u(i, T).
Definition 5. The utility of itemset X, denoted as u(X), is the sum of the utilities of X in all transactions containing X in DB, namely u(X) = ∑_{T∈DB ∧ X⊆T} u(X, T).
Definition 6. An itemset is called a high utility itemset if its utility exceeds a user-specified minimum utility threshold denoted as "minutil".
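Definitions 1 to 6 can be illustrated with a minimal Python sketch. The database below is hypothetical (the contents of Tables 1 and 2 are not reproduced here); only eu(a) = 2 and iu(a, T4) = 3 are taken from the example above. The function names are illustrative, and returning 0 for a transaction that does not contain the itemset is a convenience, not part of Definition 4.

```python
# Hypothetical utility table: item -> external utility eu(item).
EU = {"a": 2, "b": 1, "c": 3}

# Hypothetical transaction table: Tid -> {item: internal utility iu(item, T)}.
DB = {
    "T1": {"a": 1, "b": 2},
    "T2": {"b": 1, "c": 2},
    "T3": {"a": 2, "b": 1, "c": 1},
    "T4": {"a": 3, "c": 1},
}


def item_utility(eu, transaction, item):
    """Definition 3: u(i, T) = eu(i) * iu(i, T)."""
    return eu[item] * transaction[item]


def itemset_utility_in_transaction(eu, transaction, itemset):
    """Definition 4: u(X, T); returns 0 when T does not contain X."""
    if not all(item in transaction for item in itemset):
        return 0
    return sum(item_utility(eu, transaction, item) for item in itemset)


def itemset_utility(eu, db, itemset):
    """Definition 5: u(X), summed over all transactions containing X."""
    return sum(
        itemset_utility_in_transaction(eu, t, itemset) for t in db.values()
    )


def is_high_utility(eu, db, itemset, minutil):
    """Definition 6: X is a high utility itemset if u(X) exceeds minutil."""
    return itemset_utility(eu, db, itemset) > minutil
```

With this hypothetical data and an absolute threshold of 10, {a, c} qualifies as a high utility itemset (utility 7 in T3 plus 9 in T4), whereas {a, b} does not.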

Previous Solutions
After the formal introduction of the problem in [1] as above, a number of algorithms for high utility itemset mining have been proposed, such as TP [8], FSH [12], DCG [13], FUM [14], DCG+ [14], IHUPTWU [9], UP-Growth [10], and UP-Growth+ [11]. These algorithms employ a uniform two-phase framework as follows. First, they generate a set of candidate itemsets from the database by roughly overestimating the utilities of all itemsets; this set is a superset of the set of all high utility itemsets. Second, the exact utilities of all candidate itemsets are computed by a database scan, and thereby high utility itemsets are identified.
The two major costs in these algorithms are candidate generation and utility computation. Clearly, the fewer candidate itemsets an algorithm generates, the lower its candidate generation and utility computation costs. Therefore, previous works put much effort into reducing the number of candidate itemsets. Recent two-phase algorithms such as UP-Growth+ are able to reduce the number of candidates efficiently. Table 3 shows the numbers of candidate itemsets generated by TP, FUM, UP-Growth, and UP-Growth+ for the database chain with minutil set to 0.06%. The numbers of candidates for TP and FUM were taken from [14]. We implemented the last two algorithms and obtained their numbers of candidates. The database chain will be introduced in Section 5.
The running time of these algorithms mainly consists of the time for candidate generation in phase I and the time for exact utility computation in phase II. Although the candidate generation time can be reduced significantly by decreasing the number of generated candidate itemsets, the exact utility computation time remains very large for a mining task. For example, Figure 2 depicts the running times of the two phases of the UP-Growth+ algorithm for the database chain when the minutil is 0.004%, 0.005%, and 0.006%, respectively. The utility computation time clearly dominates the whole running time of the algorithm. However, previous works make little effort to improve the performance of utility computation. To the best of our knowledge, a formal algorithm for exact utility computation is not even given in previous literature, although such an algorithm should be simple.

Contributions
In this study, we focus on the fast identification of high utility itemsets from candidates, the core of which is the efficient utility computation for candidates. The main contributions of the paper are as follows.

• A basic algorithm for high utility itemset identification is formally presented.

• A novel structure called the candidate tree is proposed for storing candidate itemsets.

• A candidate tree-based algorithm is developed for the fast identification of high utility itemsets.

• Extensive experimental results that show the performance difference between the basic algorithm and the candidate tree-based algorithm are reported.

As shown in Figure 2, the running time of phase II dominates the total running time of a two-phase algorithm. The proposed structure and algorithms aim to decrease the time of phase II and can thereby improve overall performance. They are general-purpose and can be integrated into any previous two-phase algorithm as its second step.
The rest of this paper is organized as follows. After the basic algorithm is introduced in Section 2, the candidate tree and the related algorithm are proposed in Section 3 and analyzed in Section 4. Experimental results are reported in Section 5, and the paper ends with the conclusion in Section 6.

Basic Identification Algorithm
In this section, we show a basic identification algorithm (BIA) and discuss its core procedure.

Pseudo-Code of the BIA
Algorithm 1 shows the pseudo-code of the BIA.

Algorithm 1: Basic Identification Algorithm
Input: C is a set of candidate itemsets; DB is a transaction database; minutil is a minimum utility threshold.
Output: all high utility itemsets.

First, a vector utility indexed by the names of candidates is initialized, where utility[c] stores the utility of candidate c. Subsequently, for each transaction, the algorithm accumulates the utility of each candidate in the vector. Finally, the algorithm outputs those candidates whose utilities exceed the minimum utility threshold.
When both a set of candidates C and a database DB can be stored in memory, or when C can be stored in memory but DB cannot, the BIA works well. If DB can be stored in memory but C cannot, it is better to exchange the two loops in lines 4 and 5 to reduce the I/O cost.
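The description above can be sketched in Python as follows. This is a minimal illustration, not the listing of Algorithm 1: the names `basic_identification` and `utility_in_transaction` are assumptions, each transaction is assumed to be a mapping from item to its precomputed utility u(item, t), and containment is tested with dictionary lookups rather than with the ordered comparison discussed in the next subsection.

```python
def utility_in_transaction(candidate, transaction):
    """Return u(c, t), or 0 when the transaction does not contain c.

    `transaction` maps each item to its utility in that transaction.
    """
    total = 0
    for item in candidate:
        if item not in transaction:
            return 0  # t does not contain c
        total += transaction[item]
    return total


def basic_identification(candidates, db, minutil):
    """Accumulate the utility of every candidate, then filter by minutil."""
    utility = {c: 0 for c in candidates}  # utility[c] stores u(c)
    for t in db:  # outer loop: transactions
        for c in candidates:  # inner loop: candidates
            utility[c] += utility_in_transaction(c, t)
    return {c: u for c, u in utility.items() if u > minutil}


# Hypothetical data: each transaction maps item -> u(item, t).
SAMPLE_DB = [
    {"a": 2, "b": 2},
    {"b": 1, "c": 6},
    {"a": 4, "b": 1, "c": 3},
    {"a": 6, "c": 3},
]
SAMPLE_CANDIDATES = [("a", "b"), ("a", "c"), ("b", "c")]
SAMPLE_HUIS = basic_identification(SAMPLE_CANDIDATES, SAMPLE_DB, 10)
```

Swapping the two `for` loops gives the variant suggested for the case where the candidates do not fit in memory but the database does.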

Utility Computation
In the BIA, the core procedure is the computation of the utility of itemset c in transaction t in line 6, which is listed in Procedure 2.

Procedure 2: u(c, t)
Input: c is a candidate itemset; t is a transaction.
Output: the utility of c in t.

In the procedure, length(c) and length(t) are the numbers of items in c and t, c[i] denotes the ith item in c, and t[j] denotes the jth item in t. When an item of c is found in t, the utility of the item in t, namely u(t[j], t), is added to the variable util, which stores the accumulated utility for c. If the condition in line 16 is met, which means that t contains c, util is returned. The procedure is essentially a two-way comparison procedure, in which the atomic operations are item comparisons (lines 5 and 8) and utility accumulations (line 11).
The procedure is based on the assumption that the items in both c and t are ordered. The items in a candidate itemset can be sorted before it is stored. The items in a transaction are generally ordered; otherwise, they can be sorted after the transaction is loaded into memory. The value u(t[j], t) in line 11 of the procedure can be computed once and reused many times. For example, the sample database can be transformed into the view in Table 4.
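The two-way comparison idea can be sketched in Python as below. This is an illustration under stated assumptions rather than a transcription of Procedure 2: the name `utility_two_way` is hypothetical, the candidate is a sorted list of items, and the transaction is a sorted list of (item, utility) pairs with u(t[j], t) already precomputed, in the spirit of the view in Table 4.

```python
def utility_two_way(c, t):
    """Return u(c, t) for sorted inputs, or 0 if t does not contain c.

    c: sorted list of items.
    t: sorted list of (item, utility) pairs.
    """
    i = j = 0
    util = 0
    while i < len(c) and j < len(t):
        item, item_util = t[j]
        if c[i] == item:
            util += item_util  # utility accumulation
            i += 1
            j += 1
        elif c[i] > item:
            j += 1  # t[j] is not in c; skip it
        else:
            return 0  # c[i] < t[j]: c[i] cannot occur later in t
    # All items of c were matched only if i reached the end of c.
    return util if i == len(c) else 0
```

Because both sequences are ordered, each item of c and t is visited at most once, and the scan stops as soon as an item of c is known to be absent from t.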

Candidate-Tree
To speed up utility computation, repeated comparisons and accumulations should be avoided. To this end, we store all candidate itemsets in a candidate tree. A candidate tree is a modified prefix-tree [15], in which itemsets containing the same prefix share a common path. For example, the candidate tree in Figure 3 represents itemsets {ab}, {abc}, {abd}, and {abcd}. Besides the pointers for maintaining the tree structure, each node in a candidate tree contains an item and a util. A node represents the itemset composed of the items on the path from the root to the node. The util of a node stores the utility of the itemset represented by the node. For example, the node numbered 5 in Figure 3 represents itemset {abd}. In a candidate tree, not all nodes represent candidate itemsets.
Definition 7. In a candidate tree, a node is called a count node if it represents a candidate itemset.
For the candidate tree in Figure 3, the nodes numbered 2, 3, 4, and 5 are count nodes, and the node numbered 1 is not.
The method of constructing a candidate tree is similar to the method of constructing a prefix-tree [15]. In the implementation of a candidate tree, the util of a count node is initialized to 0, and that of a node that is not a count node is initialized to −1. In this way, all count nodes of a candidate tree are marked during the candidate tree construction.
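A minimal Python sketch of such a construction is given below. The class and function names are illustrative assumptions, and children are kept in a dictionary instead of explicit node pointers; the 0 and −1 initial values of util follow the convention described above.

```python
class Node:
    """A candidate tree node holding an item, a util, and its children."""

    def __init__(self, item):
        self.item = item
        self.util = -1  # -1 marks a node that is not a count node
        self.children = {}  # item -> child Node


def build_candidate_tree(candidates):
    """Insert every candidate along a shared prefix path; return the root."""
    root = Node(None)
    for candidate in candidates:
        node = root
        for item in sorted(candidate):
            if item not in node.children:
                node.children[item] = Node(item)
            node = node.children[item]
        node.util = 0  # the last node on the path is a count node
    return root


def count_nodes(node):
    """Return (number of nodes, number of count nodes) below `node`."""
    total = counted = 0
    for child in node.children.values():
        sub_total, sub_counted = count_nodes(child)
        total += 1 + sub_total
        counted += (1 if child.util >= 0 else 0) + sub_counted
    return total, counted


# The four candidates of Figure 3: {ab}, {abc}, {abd}, {abcd}.
FIGURE3_TREE = build_candidate_tree(["ab", "abc", "abd", "abcd"])
```

For these four candidates the sketch yields five nodes, four of which are count nodes, which agrees with the description of Figure 3.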

Fast HUI Identification
After a candidate tree is constructed, a fast identification algorithm (FIA) can efficiently compute the utilities of all candidates stored in the tree and subsequently identify high utility itemsets. The FIA is shown in Algorithm 3, the core procedure of which is shown in Procedure 4.

Algorithm 3: Fast Identification Algorithm
Input: root is the root node of a candidate tree; DB is a transaction database; minutil is a minimum utility threshold.
Output: all high utility itemsets.

Like Procedure 2, Procedure 4 assumes that the items in transactions and itemsets are ordered. In the procedure, n.item and n.util denote the item and util contained in n. Suppose node n represents itemset X; then parameter utility stores u(X − {n.item}, t). First, the procedure searches t from position k (k ≤ length(t)) for n.item. If t does not contain n.item, the subtree rooted at n is no longer checked (line 5). Otherwise, utility is updated and is added to n.util if n is a count node (line 9), and subsequently all child nodes of n are recursively processed. Parameter k keeps track of a position in t such that every item of t before it has already been matched against an ancestor of n or skipped, and therefore is no longer compared with n.item. After the subtree rooted at n is recursively processed, the utils of the count nodes in the subtree that represent itemsets contained in t are updated. To facilitate the understanding of Procedure 4, Figure 4 demonstrates the procedure when T2 in Table 4 and the candidate tree in Figure 3 are processed.
After all transactions in DB are processed, all high utility itemsets can be identified by a candidate tree traversal as shown in Procedure 5.
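The overall flow can be sketched in Python as follows. This is a hedged illustration rather than the listings of Algorithm 3 and Procedures 4 and 5: the names, the dictionary-based children, and the representation of a transaction as a sorted list of (item, utility) pairs are all assumptions, and the sample data are hypothetical.

```python
class Node:
    def __init__(self, item):
        self.item = item
        self.util = -1  # -1: not a count node; >= 0: accumulated utility
        self.children = {}


def build_candidate_tree(candidates):
    root = Node(None)
    for candidate in candidates:
        node = root
        for item in sorted(candidate):
            node = node.children.setdefault(item, Node(item))
        node.util = 0
    return root


def compute_utility(t, k, node, utility):
    """Update the utils in the subtree rooted at `node` for transaction t.

    `utility` is the utility in t of the items held by node's ancestors.
    """
    while k < len(t) and t[k][0] < node.item:
        k += 1  # skip items of t that precede node.item
    if k == len(t) or t[k][0] != node.item:
        return  # t lacks node.item: prune the whole subtree
    utility += t[k][1]
    if node.util >= 0:  # count node
        node.util += utility
    for child in node.children.values():
        compute_utility(t, k + 1, child, utility)


def collect(node, prefix, minutil, out):
    """Traverse the tree and record count nodes whose util exceeds minutil."""
    for child in node.children.values():
        itemset = prefix + (child.item,)
        if child.util > minutil:
            out[itemset] = child.util
        collect(child, itemset, minutil, out)
    return out


def fast_identification(root, db, minutil):
    for t in db:
        for child in root.children.values():
            compute_utility(t, 0, child, 0)
    return collect(root, (), minutil, {})


# Hypothetical data: sorted (item, u(item, t)) pairs per transaction.
SAMPLE_DB = [
    [("a", 2), ("b", 2)],
    [("b", 1), ("c", 6)],
    [("a", 4), ("b", 1), ("c", 3)],
    [("a", 6), ("c", 3)],
]
SAMPLE_ROOT = build_candidate_tree(["ab", "ac", "bc", "abc"])
SAMPLE_HUIS = fast_identification(SAMPLE_ROOT, SAMPLE_DB, 10)
```

The sketch assumes a non-negative minutil, so that nodes marked with −1 can never pass the threshold test during the final traversal.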

Complexity Analysis
The main operations in the BIA and FIA are item comparisons and utility accumulations. Since the items in a k-itemset X and in a transaction T containing m items are ordered, for computing u(X, T), the number of comparisons, denoted as CN, satisfies Properties 1 and 2, and the number of accumulations, denoted as AN, satisfies Properties 3 and 4. Suppose there is a transaction with m items and there are n candidates that contain s_1, s_2, s_3, ..., s_n items, respectively. The candidates share the same prefix itemset with s items (s ≤ s_i, 1 ≤ i ≤ n). To compute the utilities of the candidates in the transaction, the numbers of comparisons and accumulations performed in the BIA and FIA, depending on whether or not all the candidates are contained in the transaction, are listed in Table 5.

For example, when the utility of the candidate with s_i items is computed, for the BIA, the number of comparisons is at least s_i and at most m according to Property 1, if the transaction contains the candidate. Then, the total number of comparisons for all the candidates is at least (s_1 + s_2 + s_3 + ... + s_n) and at most (m + m + ... + m = m × n), if the transaction contains these candidates. When these candidates are stored in a candidate tree, the n candidates can be considered as (n + 1) candidates that contain s, (s_1 − s), (s_2 − s), ..., (s_n − s) items, respectively. Therefore, for the FIA, the total number of comparisons for the (n + 1) candidates is s + (s …
In the worst case, the complexities of the BIA and FIA are both O(m × n) with respect to comparisons, but compared with the BIA, the number of comparisons in the FIA actually decreases by n × s, which is a large amount, especially for a large s. It is also observed that the number of accumulations in the FIA decreases by about n × s in the worst case, compared with that in the BIA.
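As a hedged illustration of where the saving comes from, the best-case comparison counts can be worked out under two assumptions that follow the reasoning above: the shared prefix of s items is compared once in the FIA, and each remaining suffix of (s_i − s) items needs at least as many comparisons as it has items.

```latex
\begin{align*}
\text{BIA (least)} &= \sum_{i=1}^{n} s_i, \\
\text{FIA (least)} &= s + \sum_{i=1}^{n} (s_i - s)
                    = \sum_{i=1}^{n} s_i - (n - 1)\,s, \\
\text{saving}      &= (n - 1)\,s .
\end{align*}
```

For instance, with n = 4 candidates of sizes 2, 3, 3, and 4 that share a prefix of s = 2 items, the sums are 12 and 2 + 0 + 1 + 1 + 2 = 6, a saving of (4 − 1) × 2 = 6 comparisons.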

Experiments
In this section, the BIA is compared with the FIA. We first implemented the well-known UP-Growth+ algorithm [11] in C++. UP-Growth+ is a standard two-phase algorithm: it first generates a set of candidate itemsets and subsequently computes the exact utilities of the candidates to identify high utility itemsets. However, the utility computation of UP-Growth+ is not discussed in detail in [11]. Therefore, we integrated the BIA and the FIA, respectively, into UP-Growth+ as its second step. In the following, BIA-UP-Growth+ denotes the combination of UP-Growth+ with the BIA, and FIA-UP-Growth+ denotes the combination of UP-Growth+ with the FIA.
Eight databases were used in our experiments. The database chain was downloaded from NU-MineBench 2.0 [16], and the other databases were downloaded from the FIMI Repository [17]. The databases accidents, chess, kosarak, mushroom, and retail are derived from real-world data, and the synthetic databases T10I4D100K and T40I10D100K were generated by the IBM Quest Synthetic Data Generation Code. Except for chain, the databases do not provide the external utility and internal utility for each item, and thus we generated the utility and count values of each item following the settings in previous works [9][10][11]. The statistical information about these databases is shown in Table 6, including the size on disk, the number of transactions, the number of distinct items, the average number of items in a transaction, and the number of items in the longest transaction(s). The experiments were performed on a machine with a 2.8 GHz Intel Core i5 CPU, 4 GB of physical memory, and a 32-bit Linux operating system.

Running Time for Phase II
Each experimental database was transformed into a physical view in memory as in Table 4, so that u(t[j], t) in both Procedures 2 and 4 was directly available. After a set of candidate itemsets or a candidate tree was generated in memory, the utility computation time of the two algorithms was recorded, as depicted in Figure 5. We varied the minimum utility in the experiments. The lower the minimum utility is, the more high utility itemsets an algorithm generates, and thus the greater the running time is. It can be observed that FIA-UP-Growth+ always outperforms BIA-UP-Growth+. For the databases accidents, chain, and chess, in Figure 5a-c, FIA-UP-Growth+ is several times faster than BIA-UP-Growth+. For the databases kosarak, mushroom, and retail, in Figure 5d-f, FIA-UP-Growth+ is about an order of magnitude faster than BIA-UP-Growth+. For the databases T10I4D100K and T40I10D100K, in Figure 5g,h, FIA-UP-Growth+ is two orders of magnitude faster than BIA-UP-Growth+.

Running Time for Phase I
In phase I, the difference between BIA-UP-Growth+ and FIA-UP-Growth+ is that the former directly stores each generated candidate itemset in a memory pool, while the latter inserts each candidate itemset into a candidate tree immediately after generating it. Therefore, in theory, BIA-UP-Growth+ is faster than FIA-UP-Growth+ in this phase. The third column in Table 7 lists the first phase time of the two algorithms running on the eight databases for the lowest minutils in our experiments; in this case, the algorithms generate the largest numbers of candidate itemsets and high utility itemsets. Even though there is a very large number of candidates, the time for constructing a candidate tree is small. For example, when the minutil is 18% for database chess, the first phase runtime of FIA-UP-Growth+ is 18.21 s and that of BIA-UP-Growth+ is 13.32 s, so the time for constructing the candidate tree can be taken as 4.89 s (= 18.21 − 13.32). Interestingly, the first phase runtime of FIA-UP-Growth+ is even shorter than that of BIA-UP-Growth+ for database T40I10D100K when the minutil is 0.1%. We believe the reason is that, for this mining task, the time for constructing the candidate tree is relatively short, while FIA-UP-Growth+ has better data locality than BIA-UP-Growth+ due to its smaller memory consumption.

Memory Consumption
FIA-UP-Growth+ generates candidate itemsets as BIA-UP-Growth+ does [11], and thus they consume the same amount of memory for candidate generation. On the other hand, there is no considerable memory consumption in their second phases, namely in the FIA and BIA. Therefore, we paid attention to the memory consumption of the two algorithms for storing candidate itemsets, as shown in the fifth column in Table 7.
Since a candidate tree is a compact data structure [15], a candidate tree storing candidate itemsets is smaller than a memory pool storing them, provided that the number of candidate itemsets is large enough for many paths to be shared. For example, for the databases chess, mushroom, and T40I10D100K, in Table 7, FIA-UP-Growth+ consumes only half the memory that BIA-UP-Growth+ does. However, a candidate tree also stores the tree structure information, namely pointers for linking nodes, and thus FIA-UP-Growth+ consumes more memory than BIA-UP-Growth+ if there is a small number of candidate itemsets.
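The effect of shared paths can be made concrete with a small Python sketch that counts stored item slots rather than bytes. The function names are illustrative, the memory pool is assumed to store every item of every candidate, and the tree is assumed to need one node per distinct non-empty prefix; pointer overhead is deliberately left out of the count.

```python
def pool_item_slots(candidates):
    """Item slots needed when each candidate is stored separately."""
    return sum(len(c) for c in candidates)


def tree_node_count(candidates):
    """Nodes in a prefix tree: one per distinct non-empty sorted prefix."""
    prefixes = set()
    for candidate in candidates:
        items = tuple(sorted(candidate))
        for end in range(1, len(items) + 1):
            prefixes.add(items[:end])
    return len(prefixes)


# Candidates of Figure 3, which share the prefix {ab}.
SHARED = ["ab", "abc", "abd", "abcd"]
# Hypothetical candidates with no common prefix.
DISJOINT = ["ab", "cd"]
```

For the candidates of Figure 3 the pool holds 12 item slots while the tree has 5 nodes. For the two disjoint candidates both counts are 4, so the extra pointers per node would make the tree the larger structure, which matches the caveat above.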

Discussion
FIA-UP-Growth+ significantly outperforms BIA-UP-Growth+ in our experiments. The reasons are as follows.
Firstly, a high utility itemset mining algorithm generally generates a very large number of candidate itemsets, as shown in the sixth column in Table 7, and therefore there are numerous comparisons and accumulations when computing their utilities. The numbers of comparisons and accumulations can be reduced efficiently if utility computation is performed on a candidate tree.
Secondly, using a candidate tree, the utility computation for the candidates sharing the same prefix but not contained in a transaction can be terminated once and for all. For example, for the candidate tree in Figure 3, when T1 in Table 4 is processed, the utility computation for the four candidates can be terminated immediately after two comparisons according to the FIA. If these candidates are stored in a memory pool, there are eight comparisons according to the BIA. Actually, for many mining tasks, the number of high utility itemsets is far less than the number of candidate itemsets, as shown in the last column in Table 7. Therefore, for a transaction, there should be a considerable number of candidates that are not contained in it.
Thirdly, if the number of candidate itemsets is so large that there are many shared paths, a candidate tree storing them occupies less memory than a memory pool storing them, and thereby the FIA can gain better data locality than the BIA.
Fourthly, although the first phase runtime of the algorithm integrating FIA is increased due to the candidate tree construction, the increase in the first phase runtime of the algorithm can be balanced by the decrease in the second phase runtime of the algorithm.

Conclusions
In this paper, we addressed the problem of identifying high utility itemsets from candidates. High utility itemset identification is an indispensable part of most mining algorithms, but it is not discussed in detail in these algorithms. As a supplement to previous works, we first gave a basic identification algorithm, i.e., the BIA. Subsequently, we proposed a novel data structure called the candidate tree for storing candidate itemsets and developed a candidate tree-based algorithm, i.e., the FIA, for the fast identification of high utility itemsets. The main operations in the BIA and FIA are comparisons and accumulations. For an identification task, the FIA performs fewer comparisons and fewer accumulations than the BIA. Extensive experimental results show that (1) the time for high utility itemset identification dominates the whole running time of a mining algorithm; and (2) the FIA significantly outperforms the BIA on various databases.
It should be noted that the FIA works well if a candidate tree can be held completely in memory. This study does not consider the case in which a tree is too large to be stored completely in memory. We plan to study the fast identification of high utility itemsets from candidates on disk in a future study.

Figure 1. A two-phase frame for high utility itemset mining.

Figure 2. Running times of the two phases of UP-Growth+ on database chain (x-axis: minimum utility (%); y-axis: running time (sec.); series: Phase I of UP-Growth+ and Phase II of UP-Growth+).

Figure 3. Candidate itemsets and a candidate tree.

Property 2. If T does not contain X, then 1 ≤ CN ≤ max(k, m), where max(k, m) denotes the larger of k and m.
Property 3. If T contains X, then AN = k.
Property 4. If T does not contain X, then 0 ≤ AN ≤ (k − 1).

Table 1. The utility table of a sample database.

Table 2. The transaction table of a sample database.

Table 4. A database view.

14 end 15 return k+1; 16 end
Procedure 4: ComputeUtility(t, k, n, utility)
Input: t is a transaction; k indicates the position of an item in t; n is a node in the candidate tree; utility stores the sum of the utilities in t of the items contained in all of n's ancestor nodes.

Table 5. Numbers of comparisons and accumulations. FIA: fast identification algorithm; BIA: basic identification algorithm.

Table 6. Statistical information about databases.