Improving CLOPE's Profit Value and Stability with an Optimized Agglomerative Approach

CLOPE (Clustering with sLOPE) is a simple and fast histogram-based clustering algorithm for categorical data. However, given the same dataset and the same input parameter, its clustering results may differ when the transactions are input in a different order. In this paper, a hierarchical clustering framework is proposed as an extension of CLOPE to generate stable and satisfactory clustering results based on an optimized agglomerative merge process. A new clustering profit is defined as the merge criterion, and a cluster graph structure is proposed to optimize the merge iteration process. Experiments conducted on two datasets both demonstrate that the agglomerative approach achieves stable clustering results with a better profit value, but costs much more time due to its higher time complexity.


Introduction
Clustering is an important data mining technique that groups data into sets so as to maximize the intra-cluster similarity and minimize the inter-cluster similarity [1]. Most clustering algorithms focus on numerical data, where distances between data points are calculated as the clustering criterion. However, many databases handle transactions with categorical attributes, whose clustering process is different from and more complicated than that of numerical data. With the rapid growth of categorical data volume, research on clustering methods for categorical data becomes increasingly important.
Among previous research works on clustering categorical data [2][3][4][5][6][7], CLOPE [6] is a representative one, which achieves relatively good clustering results with high efficiency based on cluster histogram calculation. Moreover, it has the following advantages. Firstly, the algorithm is simple and requires only one input parameter, which makes it easy to extend. Secondly, neither fuzzy theory nor probability calculation is used, so the clustering result is accurate and direct. Finally, the algorithm stops automatically when no further iteration can be done, and the number of output clusters is determined entirely by the input data and the single parameter, instead of being specified by the user as in k-means and ROCK [3]. Some improved methods have been proposed based on the CLOPE algorithm. For example, Ong proposed the SCLOPE [8] and σ-SCLOPE [9] algorithms for clustering categorical data streams [10] based on an FP-Growth tree structure [11], and Li proposed the fuzzy-CLOPE algorithm [12]. However, one crucial problem has been ignored in CLOPE and its extensions: the clustering results are unstable and greatly influenced by the transaction order in the dataset. Given exactly the same dataset and input parameter, if the transactions are read in another order, the clustering result may be different, and thus the profit value might not be optimal.
We analyze the clustering process of the CLOPE algorithm in depth and find that, since CLOPE moves only one transaction at a time during each round of iteration, it is difficult to find the best combinations of transactions to form an "optimal" clustering result and to achieve stable, satisfactory clustering results. To deal with this problem, an optimized agglomerative approach is proposed for categorical data. In the proposed method, each transaction in the dataset is initially treated as a single cluster. Then a new clustering profit, extended from CLOPE, is defined as the criterion to merge the initial clusters. This process is iterated until no clusters are merged, at which point the clustering stops automatically, which preserves this feature of CLOPE. Finally, an optimized method based on the cluster graph is proposed to reduce the merge iteration process. On this basis, the approach can effectively handle the instability of CLOPE. Experiments conducted on the mushroom dataset and the splice-junction gene sequences dataset both demonstrate that the agglomerative approach achieves stable clustering results with a better profit value, but costs much more time due to its higher time complexity.

The CLOPE Algorithm
CLOPE is a histogram-based clustering algorithm for categorical data. It uses the concept of "profit" as a criterion function that tries to increase the intra-cluster overlap of categorical attributes according to the height-to-width ratio of each cluster. If the sum of all the height-to-width ratio values is maximal, the clustering result is considered to be optimal. The following example describes a simple case of categorical transactions using CLOPE.

Terminologies
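$D = \{t_1, t_2, \ldots, t_n\}$ denotes a database containing n transactions, and $I = \{i_1, i_2, \ldots, i_m\}$ denotes all the attributes in D. On this basis, we have the following definitions.

In addition, we have the following properties for histogram H:

(1) The size is the total number of attributes in cluster c, defined as:

$$S(c) = \sum_{t \in c} |t| \tag{1}$$

(2) The width is the total number of different attributes in cluster c, defined as:

$$W(c) = \left| \{\, i \in I : i \in t \text{ for some } t \in c \,\} \right| \tag{2}$$

(3) The height is the ratio between the size and the width, defined as:

$$\mathrm{Height}(c) = \frac{S(c)}{W(c)} \tag{3}$$

(4) The number of transactions in cluster c, denoted as $|c|$.

Definition 4 (Clustering). A clustering C is a set of all the clusters, i.e., $C = \{c_1, c_2, \ldots, c_p\}$.

Definition 5 (Profit). Given the repulsion factor r, the profit of a clustering C is defined as:

$$\mathrm{Profit}_r(C) = \frac{\sum_{i=1}^{p} \dfrac{S(c_i)}{W(c_i)^r} \times |c_i|}{\sum_{i=1}^{p} |c_i|} \tag{4}$$

According to the CLOPE paper, the theoretical purpose of the CLOPE algorithm is to find a clustering C that maximizes Profit(C) to produce the best histogram, given a database D and the repulsion factor r. In addition, the author pointed out that the profit value of clustering C is only affected by the distribution of transactions among the clusters, which is the numerator part of Equation (4), since the denominator always equals the total number of transactions n.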
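The histogram properties and the profit function can be illustrated with a minimal Python sketch. It assumes that each transaction is a string whose characters are distinct attributes; the function names and the toy cluster are illustrative only and do not come from the CLOPE paper.

```python
from collections import Counter

def histogram(cluster):
    """Occurrence count of every attribute in a cluster (a list of transactions)."""
    occ = Counter()
    for t in cluster:
        occ.update(t)
    return occ

def size(cluster):
    """S(c): total number of attribute occurrences in the cluster."""
    return sum(histogram(cluster).values())

def width(cluster):
    """W(c): number of distinct attributes in the cluster."""
    return len(histogram(cluster))

def height(cluster):
    """Height of the histogram: size divided by width."""
    return size(cluster) / width(cluster)

def profit(clustering, r):
    """Profit of a clustering for repulsion r, following Equation (4)."""
    gain = sum(size(c) * len(c) / width(c) ** r for c in clustering)
    total = sum(len(c) for c in clustering)
    return gain / total

# Toy cluster (hypothetical data): attributes a, b, c, d
cluster = ["abc", "ab", "bd"]
print(size(cluster), width(cluster), height(cluster))  # 7 4 1.75
print(profit([cluster], 2.0))                          # 0.4375
```

A larger r penalizes wide clusters more strongly, which is why the repulsion factor controls how readily transactions with few common attributes are kept apart.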
The CLOPE algorithm contains two phases. In the initialization phase, each transaction is read sequentially and placed into the cluster that maximizes the profit value of the current clustering. The iteration phase is a loop that reads the transactions from head to tail and tries to move each of them into the cluster that maximizes the profit value of the clustering. The loop ends only when no transaction is moved into a new cluster. Unfortunately, the CLOPE algorithm is not guaranteed to find a clustering C that maximizes Profit(C), as it claims. Details of this core problem are discussed in the next subsection.
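The two phases can be sketched in Python as follows. This is a simplified illustration rather than the reference implementation: transactions are assumed to be strings of distinct single-character attributes, ties are resolved in favor of opening a new cluster, and the helper names (`gain_add`, `gain_remove`, `best_target`, `place`) as well as the toy inputs are hypothetical.

```python
from collections import Counter

def profit(occ, n, r):
    """S * n / W^r for a histogram `occ` built from n transactions."""
    if n == 0:
        return 0.0
    return sum(occ.values()) * n / len(occ) ** r

def gain_add(occ, n, t, r):
    return profit(occ + Counter(t), n + 1, r) - profit(occ, n, r)

def gain_remove(occ, n, t, r):
    return profit(occ - Counter(t), n - 1, r) - profit(occ, n, r)

def clope(transactions, r, eps=1e-12):
    hists, sizes, assign = [], [], []

    def best_target(t, skip=None, base=0.0):
        # Baseline option: open a new cluster holding only t.
        best, best_gain = None, base + profit(Counter(t), 1, r)
        for k in range(len(hists)):
            if k == skip or sizes[k] == 0:
                continue
            g = base + gain_add(hists[k], sizes[k], t, r)
            if g > best_gain + eps:
                best, best_gain = k, g
        return best, best_gain

    def place(t, k):
        if k is None:                      # create a fresh cluster
            hists.append(Counter())
            sizes.append(0)
            k = len(hists) - 1
        hists[k] = hists[k] + Counter(t)
        sizes[k] += 1
        return k

    # Phase 1 (initialization): put each transaction where profit grows most.
    for t in transactions:
        k, _ = best_target(t)
        assign.append(place(t, k))

    # Phase 2 (iteration): move single transactions until nothing moves.
    moved = True
    while moved:
        moved = False
        for i, t in enumerate(transactions):
            cur = assign[i]
            loss = gain_remove(hists[cur], sizes[cur], t, r)
            k, total = best_target(t, skip=cur, base=loss)
            if total > eps:                # strictly improves the numerator
                hists[cur] = hists[cur] - Counter(t)
                sizes[cur] -= 1
                assign[i] = place(t, k)
                moved = True

    groups = {}
    for t, k in zip(transactions, assign):
        groups.setdefault(k, []).append(t)
    return list(groups.values())

# The same six toy transactions in two input orders, with r = 2.0
print(clope(["ab", "ac", "bc", "abd", "acd", "bcd"], 2.0))
print(clope(["bcd", "acd", "abd", "bc", "ac", "ab"], 2.0))
```

Because the denominator of the profit is constant, the sketch only tracks changes in the numerator. With these inputs the first order ends in three clusters and the second in one, which shows how strongly the result can depend on the reading order.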

Problem Discovery
The hidden problem of CLOPE can be exposed by a simple example. Example 1. Given r = 2.0 and a small database with two different input sequences of transactions, ① {ab, ac, bc, abd, acd, bcd} and ② {bcd, acd, abd, bc, ac, ab}, CLOPE produces different clustering results, C1: {{ab, abd}, {ac, acd}, {bc, bcd}} and C2: {{bcd, acd, abd, bc, ac, ab}}, respectively. This example reveals two major deficiencies of CLOPE. First, CLOPE is unstable, as a different input sequence might produce a different clustering result. Second, although no further movement of a single transaction would occur in C1, merging cluster {ab, abd} with either {bc, bcd} or {ac, acd} would further enhance the profit value. As a result, a merge operation is required on all the proper clusters, not only to eliminate the effect of the input sequence but also to achieve a much better profit value at the end of each round of iteration.
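The profit values behind Example 1 can be checked by hand. The worked calculation below assumes the standard CLOPE profit of Equation (4), i.e., the sum of $S(c)\,|c| / W(c)^r$ over all clusters divided by the total number of transactions, with r = 2.

```latex
\begin{align*}
\mathrm{Profit}_2(C_1) &= \frac{3 \times \dfrac{5 \times 2}{3^2}}{6}
                        = \frac{10/3}{6} = \frac{5}{9} \approx 0.556, \\[4pt]
\mathrm{Profit}_2(C_2) &= \frac{\dfrac{15 \times 6}{4^2}}{6}
                        = \frac{15}{16} \approx 0.938, \\[4pt]
\mathrm{Profit}_2\bigl(\{\{ab, abd, bc, bcd\}, \{ac, acd\}\}\bigr)
                       &= \frac{\dfrac{10 \times 4}{4^2} + \dfrac{5 \times 2}{3^2}}{6}
                        = \frac{65}{108} \approx 0.602 .
\end{align*}
```

Each cluster of C1 has size 5 and width 3, while the single cluster of C2 has size 15 and width 4. The third line shows that merging two clusters of C1 already raises the profit from about 0.556 to about 0.602, although no single-transaction move can improve C1.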

The Optimized Agglomerative Approach
We propose an optimized agglomerative clustering algorithm, Agg-CLOPE, to deal with the problem illustrated in the previous section, which is caused by moving only one transaction at a time. Firstly, the terminology of the original CLOPE algorithm is extended to support the cluster merge operation. Then a cluster graph structure is applied to optimize the traditional bottom-up clustering approach, in order to achieve a stable clustering result with a much better profit value.

Extension from CLOPE
As the CLOPE algorithm moves only one transaction from one cluster to another, its definitions and the corresponding data structures must be extended to support the cluster merge operation.
Definition 6 (Extension of Cluster). A cluster c is a quadruple $\langle id, T, H, L \rangle$, where id is the unique identifier of c, L is a list of ids of other clusters related to c, and the rest are the same as those in Definition 1.
Definition 7 (Extension of Profit). The profit value of a cluster c, with size $S(c)$, width $W(c)$ and $|c|$ transactions, is defined as follows:

$$\mathrm{Profit}_r(c) = \frac{S(c) \times |c|}{W(c)^r}$$

Then the profit of a clustering C would be:

$$\mathrm{Profit}_r(C) = \frac{\sum_{i=1}^{p} \mathrm{Profit}_r(c_i)}{\sum_{i=1}^{p} |c_i|}$$

Definition 8 (Delta). Suppose there are two clusters $c_i$ and $c_j$, and $c_{i/j}$ is the merge of these two clusters. Then the Delta value of $c_i$ and $c_j$ is defined as follows:

$$\mathrm{Delta}(c_i, c_j) = \mathrm{Profit}_r(c_{i/j}) - \mathrm{Profit}_r(c_i) - \mathrm{Profit}_r(c_j)$$

We use the symbol $D_{i,j}$ as shorthand for $\mathrm{Delta}(c_i, c_j)$. The Delta function is symmetric, so $D_{i,j} = D_{j,i}$. We also define $D_{i,i} = 0$ for the relationship of a cluster with itself. To maximize the profit value in each merge operation, the Delta value for a given cluster should be maximal. Suppose there are p clusters in total; the maximum Delta value for cluster $c_i$ is denoted as:

$$M(c_i) = \max_{1 \le j \le p,\; j \ne i} D_{i,j}$$

If $c_j$ maximizes the Delta value of $c_i$, then $c_j$ is the merge candidate of $c_i$, denoted as $c_i \rightarrow c_j$. A cluster might have more than one merge candidate, and the unique ids of all those merge candidates are stored in L according to Definition 6 for further processing.
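The Delta value and the merge candidates of a cluster can be computed as in the following sketch. It assumes that a cluster is a plain list of transactions, that each transaction is a string of distinct single-character attributes, and that Delta is the profit of the merged cluster minus the profits of the two parts; the names `cluster_profit`, `delta` and `merge_candidates` and the tolerance `tol` are illustrative choices, not part of the original definitions.

```python
from collections import Counter

def cluster_profit(cluster, r):
    """Profit of a single cluster: S(c) * |c| / W(c)^r."""
    occ = Counter()
    for t in cluster:
        occ.update(t)
    return sum(occ.values()) * len(cluster) / len(occ) ** r

def delta(ci, cj, r):
    """Gain in (unnormalized) profit obtained by merging ci and cj."""
    return cluster_profit(ci + cj, r) - cluster_profit(ci, r) - cluster_profit(cj, r)

def merge_candidates(clusters, i, r, tol=1e-9):
    """Return M(c_i) and the indices j that reach it (the list L of c_i).

    Requires at least two clusters; `tol` absorbs floating-point noise.
    """
    deltas = {j: delta(clusters[i], clusters[j], r)
              for j in range(len(clusters)) if j != i}
    m = max(deltas.values())
    return m, [j for j, d in deltas.items() if abs(d - m) <= tol]

# Six singleton clusters, r = 2.0
clusters = [["ab"], ["ac"], ["bc"], ["abd"], ["acd"], ["bcd"]]
m, cands = merge_candidates(clusters, 0, 2.0)
print(round(m, 4), cands)   # 0.2778 [3]
```

For the singleton {ab}, the only merge candidate is {abd}: merging them raises the profit by 10/9 − 1/2 − 1/3 = 5/18, whereas merging {ab} with {ac} would lower it.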

Optimization on Traditional Agglomerative Approach
The traditional agglomerative approach starts with each transaction in its own cluster, and then merges pairs of clusters to move up the hierarchy [13]. However, a cluster merge operation cannot be undone, namely transactions cannot be taken away from a cluster afterwards. Merge operations should therefore be performed on clusters that are as close to optimal as possible: for a single cluster, the best candidates should be picked out to be merged with. The following example shows a situation of unsuitable candidates. In Example 2, $c_j$ is the merge candidate of $c_i$, and $c_h$ is the merge candidate of $c_j$. Although $c_i$ and $c_j$ come before $c_h$, $c_j$ should be merged with $c_h$ rather than $c_i$, as the former would produce a larger profit value for the current clustering. In the other case, if $M(c_i)$ and $M(c_j)$ are both maximal among all the Delta values, $c_i$, $c_j$ and $c_h$ can be merged together to achieve the optimal profit value. To deal with this phenomenon, we define a Global Maximum (GM) Delta value over all the Delta values.
$$GM = \max_{1 \le i \le p} M(c_i)$$

If $M(c_i) = M(c_j) = GM$, it is certain that merging $c_i$ with $c_j$ is best, denoted as $c_i \Leftrightarrow c_j$, as is merging $c_j$ with $c_h$. On this basis, we can improve on the traditional agglomerative approach.
The traditional agglomerative approach can only merge two clusters at a time, and does not stop until all the transactions are in one cluster or a specified number of result clusters is reached. In this optimized approach, however, all the clusters linked by the GM value can be merged at once, which saves a lot of time. Agg-CLOPE terminates automatically when all Delta values are no greater than zero and no further merge operations can be performed. We define the cluster graph to support this feature.
Definition 9 (Cluster Graph). A cluster graph is an undirected graph $G = (V, E)$, where V is a set of vertices, each of which represents a cluster, and E contains the edges that connect two vertices (clusters) whose Delta value equals GM.

Figure 2 illustrates the clustering process on the transactions {ab, ac, bc, abd, acd, bcd} using Agg-CLOPE with r = 2.0. At the very beginning, each transaction is a cluster, represented as a vertex in graph G (see Figure 2a). After the Delta values of every pair of clusters are calculated, the pairs whose Delta value equals GM are linked and marked to be merged (see Figure 2b). So at the end of the first round and the beginning of the second round, three new clusters remain (see Figure 2c). The second round does the same as the first round. As all Delta values are equal to GM, the clusters are all linked with each other for a further merge operation (see Figure 2d). After that, there are no more clusters to be merged, and the algorithm ends with a final cluster including all the transactions (see Figure 2e).

The implementation of Agg-CLOPE is listed in Algorithm 1. Similar to CLOPE, this algorithm also contains two phases. In the initialization phase, each transaction is read and placed into its own cluster to form the vertices of G (Lines 1-4). In the iteration phase, the Delta values are calculated between every pair of clusters. Each calculation is compared with the current GM value (Line 9 and Line 13). If the current Delta value exceeds GM, GM is assigned this value and the contents of E and of the clusters' L lists are cleared and renewed (Lines 10-12); if it equals GM, the current edge is appended to E and each of the two vertices is added to the other's L (Line 14). If E is not empty, all the connected sub-graphs of G are merged (Lines 18-23). Note that the main part of the merge operation is done by LinkMerge, a recursive function shown in Algorithm 2. Agg-CLOPE terminates automatically when there are no clusters to be merged, i.e., E is empty.
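The description above can be turned into a short Python sketch. It is not a transcription of Algorithm 1: clusters are assumed to be plain lists of string transactions, the edge set E is represented only through the per-cluster link lists L, equality with GM is tested up to a tolerance `tol`, and the names `agg_clope`, `link_merge` and `history` are hypothetical.

```python
from collections import Counter

def cluster_profit(cluster, r):
    occ = Counter()
    for t in cluster:
        occ.update(t)
    return sum(occ.values()) * len(cluster) / len(occ) ** r

def delta(ci, cj, r):
    return cluster_profit(ci + cj, r) - cluster_profit(ci, r) - cluster_profit(cj, r)

def link_merge(k, links, merged, out):
    """Recursively collect every cluster reachable from k through GM edges.

    Depth-first search; very large components may need an iterative version
    because of Python's recursion limit.
    """
    if merged[k]:
        return
    merged[k] = True
    out.append(k)
    for nxt in links[k]:
        link_merge(nxt, links, merged, out)

def agg_clope(transactions, r, tol=1e-9):
    clusters = [[t] for t in transactions]      # one vertex per transaction
    history = [len(clusters)]                   # cluster count after each round
    while True:
        gm, has_edge = 0.0, False               # only positive Deltas may merge
        links = [[] for _ in clusters]          # the L list of every cluster
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = delta(clusters[i], clusters[j], r)
                if d > gm + tol:                # new global maximum: reset edges
                    gm, has_edge = d, True
                    links = [[] for _ in clusters]
                    links[i].append(j)
                    links[j].append(i)
                elif d > tol and abs(d - gm) <= tol:   # ties with GM
                    links[i].append(j)
                    links[j].append(i)
        if not has_edge:                        # E is empty: stop
            return clusters, history
        merged = [False] * len(clusters)
        new_clusters = []
        for k in range(len(clusters)):          # merge each connected sub-graph
            if not merged[k]:
                comp = []
                link_merge(k, links, merged, comp)
                new_clusters.append([t for idx in comp for t in clusters[idx]])
        clusters = new_clusters
        history.append(len(clusters))

final, history = agg_clope(["ab", "ac", "bc", "abd", "acd", "bcd"], 2.0)
print(history)   # [6, 3, 1]
```

On the six toy transactions with r = 2.0, the sketch goes from six clusters to three and then to one, following the rounds described for Figure 2, and the reversed input order yields the same final cluster.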
Suppose the total number of transactions is N, the current number of clusters is K and the average length of a transaction is A. The initialization phase takes O(N) time. In the iteration phase, the calculation of the Delta values requires $O(K^2 \times A)$ time in each round.

Experimental Results
In this section, we evaluate the experimental results on two datasets. We apply three algorithms (CLOPE, SCLOPE and Agg-CLOPE) to both datasets and compare the clustering results of each algorithm on five metrics. The experiments are executed on a single Lenovo machine (Lenovo, Shanghai, China) with an Intel quad-core CPU, 8 GB RAM, and CentOS 6.4.

The Mushroom Dataset
The mushroom dataset is retrieved from the UCI machine learning repository (http://archive.ics.uci.edu/ml/datasets/Mushroom, accessed on 24 March 2015) and has been used in various research works. It contains 8124 transactions in two classes, 4208 edible mushrooms and 3916 poisonous mushrooms. Each transaction has 22 categorical attributes with 116 different values in total. Firstly, we compare the profit values of the clusterings produced by the three algorithms from r = 1.0 to r = 4.0, as shown in Table 1. According to Equation (4), the profit value becomes smaller as r grows. With the same repulsion, the clustering profit value produced by Agg-CLOPE is the best, and that of CLOPE is the worst in most cases. This indicates that the way Agg-CLOPE groups transactions into clusters is capable of finding a more proper clustering with a much better profit value.
Next, we shuffle the mushroom transactions into ten random sequences, and fix r = 2 to run the three algorithms. The result in Figure 3 shows that the profit values produced by Agg-CLOPE are exactly the same regardless of the input sequence, while the clustering results of CLOPE and SCLOPE are both unstable.
Then we compare the execution time of the three algorithms from r = 1.0 to r = 4.0. The result in Figure 4 shows that CLOPE is the fastest, taking no more than 5 s, while Agg-CLOPE and SCLOPE take much longer to produce the clustering results. SCLOPE is the slowest, as the attributes in each transaction have to be sorted during the creation of its FP-Tree structure [8]. Agg-CLOPE is slow because of the time complexity analyzed in Section 3.2. Finally, we evaluate the quality of the clustering results from r = 1.0 to r = 4.0 using two metrics, purity and cluster number [6]. The purity metric is calculated by summing up, over all clusters, the larger of the number of edible mushrooms and the number of poisonous mushrooms in each cluster. It has a maximum of 8124, the total number of transactions. The number of clusters should be as few as possible, as a clustering in which each cluster contains only one transaction would surely achieve the maximum purity.
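The purity metric can be written in a few lines of Python. The sketch assumes that each cluster is given as the list of class labels of its transactions; the labels "e" and "p" and the toy clusters are illustrative only.

```python
from collections import Counter

def purity(clusters):
    """Sum, over all clusters, of the size of the majority class in each cluster."""
    return sum(max(Counter(labels).values()) for labels in clusters if labels)

# Three hypothetical clusters of edible ("e") and poisonous ("p") mushrooms
clusters = [["e", "e", "p"], ["p", "p"], ["e"]]
print(purity(clusters))   # 5
```

Here the first cluster contributes 2, the second 2 and the third 1. A clustering of singletons would always reach the maximum, which is why purity is read together with the number of clusters.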
As shown in Figure 5 and Figure 6, both CLOPE and Agg-CLOPE reach the maximum purity of 8124 and the maximum cluster number of 23 at r = 2.8, while SCLOPE achieves this at r = 3.1. Besides, Agg-CLOPE is slightly better than CLOPE on both metrics, while SCLOPE appears to be worse.

The Splice-junction Gene Sequences Dataset
We pick one of the splice-junction gene sequences datasets from the GenBank website (http://www.ncbi.nlm.nih.gov/genbank/, accessed on 4 April 2015). The selected dataset contains 3190 DNA transactions with three classes of boundaries: exon/intron (abbreviated as EI), intron/exon (abbreviated as IE) and neither (abbreviated as N). Each transaction is a sequence of 60 fields, most of which are filled with one of {A, G, T, C}. Combined with the position information, there are 287 different categorical attributes in total. The experiments on this dataset are similar to those in the previous subsection. Table 2 shows that Agg-CLOPE has the best profit value, and Figure 7, obtained with r fixed at 3.0, shows that Agg-CLOPE is stable compared with CLOPE and SCLOPE. However, although the number of transactions in this dataset is smaller than that of the mushroom dataset, Agg-CLOPE takes much more time, while SCLOPE becomes the fastest when r ≥ 2.0, as illustrated in Figure 8. The reason can be seen in Figure 9: more than 2700 clusters are produced by all three algorithms under larger repulsion values. According to the time complexity analysis in the last paragraph of Section 3.2, the number of clusters K in each iterative round is a key factor. Let Σ denote the final number of clusters produced by the corresponding algorithm; then K is close to Σ in CLOPE and is always bigger than Σ in Agg-CLOPE. On the other hand, the creation time of the FP-Tree in SCLOPE is only affected by N and A, so SCLOPE runs faster than CLOPE and Agg-CLOPE here, and faster than it does on the mushroom dataset. Moreover, there are no obvious differences in the purity metric among the three algorithms. As shown in Figure 10, the maximum value of 3190 is easily achieved when r > 2.0, so the quality of the clustering results is almost the same on this dataset.
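One plausible way to combine a field value with its position, so that each DNA sequence becomes a transaction of categorical attributes, is sketched below. The exact encoding used in the experiments is not stated in the text, so the `position:symbol` format, the function names and the toy sequences are assumptions made for illustration.

```python
def encode(sequence):
    """Turn a sequence into positional attributes such as '1:A', '2:G', ..."""
    return [f"{pos}:{symbol}" for pos, symbol in enumerate(sequence, start=1)]

def attribute_universe(sequences):
    """Set of all distinct positional attributes occurring in a dataset."""
    return {item for seq in sequences for item in encode(seq)}

# Hypothetical 4-field sequences; the real dataset has 60 fields per sequence
seqs = ["AGTC", "AGGC", "NGTC"]
print(encode("AGT"))                   # ['1:A', '2:G', '3:T']
print(len(attribute_universe(seqs)))   # 6
```

Under such an encoding, 60 positions with four symbols give at most 240 attributes, so the remaining attributes in the reported total of 287 would presumably come from the few fields that hold other symbols.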

Conclusion
In this paper, we propose the Agg-CLOPE algorithm as an extension of the original CLOPE algorithm using an optimized agglomerative approach. It uses cluster merge operations instead of moving a single transaction in each iterative round to find the optimal combination of transactions. Experiments on two datasets both demonstrate that Agg-CLOPE achieves a better profit value and more stable clustering results than CLOPE and SCLOPE. However, its slow execution speed becomes an obstacle for larger and more complicated datasets. To deal with this problem, we will seek a tradeoff among time, profit and stability to reduce the running time with the following approaches in the future:

Figure 1. Histograms of the two clusterings.

The CLOPE algorithm might not find a proper clustering with the best histogram most of the time.

Figure 2. (a) Initial state of G with each transaction in a cluster; (b) Clusters whose Delta value equals the Global Maximum (GM) value are linked; (c) Linked clusters are merged; (d) The remaining clusters are linked; (e) The final result contains only one cluster.

With the Delta calculation costing $O(K^2 \times A)$ time in each round, the recursive function is O(K), the same as a depth-first search on graph G. As K = N in the first round, it requires at least $O(N^2 \times A)$ time, and the K value gradually shrinks in the next few rounds. The worst case would take N − 1 rounds, in each of which only two clusters are merged, resulting in approximately $O(N^3 \times A)$. The best case takes only two rounds, where all the clusters are merged in the first round and no further operations are performed in the second round. However, compared with the time complexity of CLOPE, O(N × K × A), Agg-CLOPE is generally much slower, which is its only and fatal defect.

Algorithm 2 LinkMerge
Parameter: c is a cluster (vertex) in cluster graph G
if (c is not merged)
  for each k in c.L

Figure 3. Comparison of stability with different input sequences produced by three algorithms on the mushroom dataset.

Figure 4. Comparison of execution time with different repulsion produced by three algorithms on the mushroom dataset.

Figure 5. Comparison of purity with different repulsion produced by three algorithms on the mushroom dataset.

Figure 6. Comparison of cluster number with different repulsion produced by three algorithms on the mushroom dataset.

Figure 7. Comparison of stability with different input sequences produced by three algorithms on the splice-junction gene sequence dataset.

Figure 8. Comparison of execution time with different repulsion produced by three algorithms on the splice-junction gene sequence dataset.

Table 1. Comparison of clustering profit values with different repulsion produced by three algorithms on the mushroom dataset.

Table 2. Comparison of clustering profit values with different repulsion produced by three algorithms on the splice-junction gene sequence dataset.