Hierarchical Clustering Using One-Class Support Vector Machines

This paper presents a novel hierarchical clustering method using support vector machines. A common approach to hierarchical clustering is to merge clusters according to the distances between them. However, different choices for computing inter-cluster distances often lead to fairly distinct clustering outcomes, causing interpretation difficulties in practice. In this paper, we propose to use a one-class support vector machine (OC-SVM) to directly find high-density regions of data. Our algorithm generates nested set estimates using the OC-SVM and exploits the hierarchical structure of the estimated sets. We demonstrate the proposed algorithm on synthetic datasets. The cluster hierarchy is visualized with dendrograms and spanning trees.


Introduction
The goal of cluster analysis is to assign data points to groups called clusters, so that the points in the same cluster are more closely related to each other than to the points in different clusters [1]. In many applications, however, clusters have subclusters, which, in turn, have sub-subclusters. Hierarchical clustering aims to find this hierarchical arrangement of the clusters. Because it requires neither the number of clusters nor an initial cluster assignment to be specified, hierarchical clustering is widely used for exploratory data analysis.
Typical hierarchical clustering procedures initially assign every data point to its own singleton cluster and successively merge the closest clusters until only one cluster, containing all of the data points, is left [2]. In hierarchical clustering, cluster dissimilarities are commonly computed from the pairwise data point distances. However, different ways of computing inter-cluster distances can lead to fairly distinct clustering outcomes. For example, the result of linking the closest pair (single linkage) and the result of linking the furthest pair (complete linkage) are often different enough to cause difficulties in interpreting the cluster analysis in practice.
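To make the difference between the two linkage rules concrete, the following minimal sketch computes both inter-cluster distances for the same pair of clusters. The toy coordinates and the function names are illustrative assumptions and are not taken from the experiments in this paper.

```python
import math
from itertools import product


def single_linkage(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    """Distance between the furthest pair of points, one from each cluster."""
    return max(math.dist(p, q) for p, q in product(a, b))


# Hypothetical toy clusters: one elongated, one compact.
elongated = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
compact = [(4.0, 0.0), (4.0, 1.0)]

d_single = single_linkage(elongated, compact)      # closest pair: (3, 0) and (4, 0)
d_complete = complete_linkage(elongated, compact)  # furthest pair: (0, 0) and (4, 1)
```

For these toy clusters, the two rules disagree by roughly a factor of four, so they can order the same candidate merges differently.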
In this paper, we present a nonparametric hierarchical clustering algorithm based on the support vector method [3–9]. While typical hierarchical clustering algorithms use the pairwise data point distances, we propose to use a one-class support vector machine (OC-SVM) [10,11]. This is motivated by the recent work of Vert and Vert [12], who showed that the OC-SVM with the Gaussian kernel produces a consistent estimate of a density level set, that is, a high-density region, where the free parameter λ affects the level of the estimated set.
Hence, we use the OC-SVM decision sets to directly estimate the high-density regions. In the data space, the OC-SVM decision boundary forms a set of contours enclosing the data points. We interpret the separated regions enclosed by the contours as clusters [13] and merge two clusters based on their connectivity in high-density regions. Since density level sets are hierarchical [14], we argue that a collection of the level sets from the OC-SVM naturally induces hierarchical clustering. Our algorithm finds the hierarchy of clusters from the family of OC-SVM level set estimates, which we obtain by varying the OC-SVM parameter λ in a continuum from zero to infinity.
Recently, de Morsier et al. [15] developed a hierarchical extension of support vector clustering. Rather than using the whole family of OC-SVM solutions, they select a level λ and consider the clusters in the corresponding OC-SVM solution. Then, they proceed to merge the clusters to find the cluster hierarchy. Two clusters are merged based on the average minimal distance between the cluster outliers. Thus, their cluster dissimilarities are also derived from the point distances, similarly to the conventional hierarchical clustering algorithms.
With the proposed hierarchical clustering method, we envision the following applications.
Anomaly detection: Anomaly detection aims to identify deviations from the nominal data when a mixture of nominal and anomalous observations is given. By detecting observations that do not cohere with existing clusters, one can develop early alarm systems for anomalous activities [16].
Image segmentation: Image segmentation is essential in computer vision for partitioning a digital image into a set of segments, such that pixels in the same segment are similar to each other. A hierarchical structure found on the image segments can facilitate computer vision tasks [17,18].
We briefly review the OC-SVM in Section 2. Our hierarchical clustering algorithm based on the OC-SVM is presented in Section 3 and evaluated in Section 4. Concluding remarks follow in Section 5.

One-Class Support Vector Machines
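Suppose a random sample $\{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$, is given. One-class support vector machines (OC-SVM) were proposed in [10,11] to estimate a set encompassing most of the data points in the space. The OC-SVM first maps each $x_i$ to a high (possibly infinite) dimensional space $\mathcal{H}$ via a function $\Phi : \mathbb{R}^d \to \mathcal{H}$. A kernel function $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$ corresponds to an inner product in $\mathcal{H}$. The OC-SVM finds a hyperplane in $\mathcal{H}$ that maximally separates the data points from the origin. The distance between the hyperplane and the origin is called the margin. To maximize the margin, the OC-SVM allows some data points inside the margin by introducing non-negative penalties $\xi_i$. More formally, the OC-SVM solves the following quadratic program:
$$\min_{w,\,\xi} \ \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \langle w, \Phi(x_i) \rangle \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n,$$
where $w \in \mathcal{H}$ is the normal vector of the hyperplane and λ is the control parameter of the margin violations. In practice, the primal optimization problem is solved via its dual:
$$\min_{\alpha} \ \frac{1}{2\lambda}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j K_{ij} - \sum_{i=1}^{n} \alpha_i \quad \text{s.t.} \quad 0 \le \alpha_i \le 1, \quad i = 1, \ldots, n,$$
where $K = (K_{ij})$ with $K_{ij} = k(x_i, x_j)$ is the kernel matrix and $\alpha_i$ is a Lagrange multiplier associated with $x_i$. Once the optimal solution $\alpha$ is found, the decision function:
$$f(x) = \frac{1}{\lambda}\sum_{i=1}^{n} \alpha_i k(x_i, x) \tag{1}$$
defines the boundary $\{x : f(x) = 1\}$, which forms a set of contours enclosing the data points in $\mathbb{R}^d$. A more detailed discussion on the SVMs is available in [3].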
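As an illustration of how the decision set is evaluated, the sketch below assumes a decision function of the form f(x) = (1/λ) Σ_i α_i k(x_i, x) with a Gaussian kernel. The multipliers are set to 1 purely for illustration (they are not the result of solving the dual), and the sample points, λ and σ are hypothetical values.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def decision_function(x, points, alpha, lam, sigma):
    """Assumed form: f(x) = (1 / lam) * sum_i alpha_i * k(x_i, x)."""
    total = sum(a * gaussian_kernel(p, x, sigma) for p, a in zip(points, alpha))
    return total / lam


# Hypothetical sample; alpha is illustrative, not an optimized dual solution.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
alpha = [1.0, 1.0, 1.0]
lam, sigma = 1.2, 1.0

# A query point belongs to the decision set when f(x) >= 1.
near_pair = decision_function((0.5, 0.0), points, alpha, lam, sigma) >= 1.0
far_away = decision_function((3.0, 3.0), points, alpha, lam, sigma) >= 1.0
```

The query point between the two nearby samples falls inside the decision set, whereas the point in the empty region does not.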

Hierarchical Clustering Based on OC-SVM
In this section, we present our hierarchical clustering algorithm using a family of OC-SVM decision sets.

Nested OC-SVM Decision Sets
As discussed above, the OC-SVM decision set can be interpreted as a density level set estimator [12]:
$$\hat{L}_\lambda = \{x : f_\lambda(x) \ge 1\},$$
where $f_\lambda$ is the decision function obtained at λ, and λ is related to the level of the estimated set. Thus, we can generate a family of level set estimates by varying λ from infinity to zero. However, although density level sets are nested, the set estimates are not necessarily nested, as illustrated in Section 4. We enforce the decision sets to be nested by:
$$L_\lambda = \bigcup_{\lambda' \ge \lambda} \hat{L}_{\lambda'}.$$
Then, these sets are clearly nested, so that an estimate at a higher value of λ is a subset of an estimate at a lower value of λ. We use these nested level set estimates $L_\lambda$ for hierarchical clustering.
We note that training the OC-SVM over the entire range of λ can be facilitated by the solution path algorithm by Lee and Scott [19]. The path algorithm finds the entire set of solutions as λ decreases from a large value toward zero. For sufficiently large λ, every data point falls between the hyperplane and the origin, so that f(x) < 1. As λ decreases, the margin width decreases, and data points cross the hyperplane (f(x) = 1) to move outside the margin (f(x) > 1). Throughout this process, $\alpha_i$ changes piecewise linearly in λ. We provide the derivation of the OC-SVM path algorithm in the Appendix.
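The nesting step can be sketched as a membership test: a point belongs to the nested set at level λ if it belongs to the raw decision set of any solution with level at least λ. The one-dimensional decision functions below are hypothetical stand-ins chosen so that the raw sets are not nested, and the function names are assumptions.

```python
def in_raw_set(x, f):
    """Raw decision set membership: f(x) >= 1."""
    return f(x) >= 1.0


def in_nested_set(x, lam, family):
    """Union of the raw decision sets over all levels >= lam.

    family: list of (level, decision_function) pairs.
    """
    return any(in_raw_set(x, f) for level, f in family if level >= lam)


# Hypothetical 1-D example: the raw set is [0, 2] at level 2.0 and [1, 4]
# at level 1.0, so the higher-level set is not contained in the lower one.
family = [
    (2.0, lambda x: 1.0 if 0.0 <= x <= 2.0 else 0.0),
    (1.0, lambda x: 1.0 if 1.0 <= x <= 4.0 else 0.0),
]

raw_low = in_raw_set(0.5, family[1][1])        # 0.5 is outside [1, 4]
nested_low = in_nested_set(0.5, 1.0, family)   # but inside [0, 2] at level 2.0
nested_high = in_nested_set(3.0, 2.0, family)  # only level 2.0 qualifies
```

With the union, the point 0.5 that was dropped by the raw set at the lower level is retained, so the set at level 1.0 contains the set at level 2.0.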

Hierarchical Clustering Using OC-SVM Decision Sets
Our approach is agglomerative and proceeds as λ decreases from infinity to zero. For sufficiently large λ, none of the data points are inside the OC-SVM decision boundary. As λ decreases, data points cross the boundary and move into the decision set $L_\lambda$ until every data point is inside or on the boundary. A decision set may consist of one or more separated regions or clusters, and our algorithm finds the hierarchical arrangement of clusters during the process.
Algorithm 1 describes our hierarchical clustering approach. From the OC-SVM solution path algorithm, we obtain a set of breakpoints $(\lambda_k, \alpha_k)$. Each pair $(\lambda_k, \alpha_k)$ yields a decision set $L_{\lambda_k}$. The cluster collection D keeps track of the clusters in $L_{\lambda_k}$.
Algorithm 1. Hierarchical clustering based on one-class support vector machine (OC-SVM).
    ...
    for each $x_j \in L^{\text{new}}_{\lambda_k}$ do
        ...
    end for
14: end for
At step k, we first locate the data points newly included in the set $L_{\lambda_k}$ (Line 5). Each of these points is tested in Line 7 to determine whether it is connected to any cluster in D. Our approach to testing connectivity is geometric. A newly included data point $x_j$ is determined to be connected to a cluster C if a path exists between $x_j$ and any data point in C. If this is not the case, the path crosses the boundary and contains a segment of points y, such that f(y) < 1. To check the line segment, a number of sampled points can be used. We sample 20 points in our implementation.
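The geometric connectivity test can be sketched as follows, assuming that the path being checked is the straight line segment between two points and that the samples are evenly spaced along it. The function names and the toy decision function (two disjoint unit disks) are hypothetical; the default of 20 samples follows the text.

```python
def segment_inside(x, y, f, n_samples=20):
    """True if f >= 1 at n_samples evenly spaced interior points of segment xy."""
    for s in range(1, n_samples + 1):
        t = s / (n_samples + 1)
        point = tuple(a + t * (b - a) for a, b in zip(x, y))
        if f(point) < 1.0:
            return False  # the segment leaves the decision set
    return True


def connected_to_cluster(x, cluster, f, n_samples=20):
    """True if x is linked to at least one member of the cluster."""
    return any(segment_inside(x, y, f, n_samples) for y in cluster)


def toy_f(p):
    """Hypothetical decision function: 1 inside two unit disks, 0 elsewhere."""
    d_left = p[0] ** 2 + p[1] ** 2
    d_right = (p[0] - 4.0) ** 2 + p[1] ** 2
    return 1.0 if min(d_left, d_right) <= 1.0 else 0.0


same_region = connected_to_cluster((0.5, 0.0), [(-0.5, 0.0)], toy_f)
other_region = connected_to_cluster((0.5, 0.0), [(4.0, 0.0)], toy_f)
```

A finite sample can miss a narrow gap between two regions, so the number of samples trades accuracy against the cost of evaluating f.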
Then, three cases arise from the connectivity test:
1. if $x_j$ is connected to none of the clusters, then the singleton cluster $\{x_j\}$ is added to the cluster collection D;
2. if $x_j$ is connected to exactly one cluster, then the singleton cluster $\{x_j\}$ is merged into the cluster;
3. if $x_j$ is connected to more than one cluster, then all of these clusters and the singleton cluster $\{x_j\}$ are merged.
The cartoons in Figure 1 illustrate the three cases. Because of Case 3, a node in a dendrogram may have more than two child nodes. The merging process continues to the last breakpoint $(\lambda_k, \alpha_k)$. If D finishes with more than one cluster, then all of the remaining clusters are merged to form a single cluster containing all of the data points. Note that a spanning tree can also be derived from our algorithm by creating a tree edge connecting $x_j$ with one of the cluster elements in Cases 2 and 3.
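The merging logic of the three cases can be sketched as below. This is a simplified illustration, not the released Matlab implementation: the connectivity test is passed in as a hypothetical predicate `connected(j, cluster, k)`, the points newly included at each breakpoint are given directly as lists of indices, and the toy predicate ignores the level k.

```python
def agglomerate(steps, connected):
    """Merge clusters as points enter the decision set, step by step.

    steps: list (one entry per breakpoint) of lists of new point indices.
    connected: hypothetical predicate connected(j, cluster, k) -> bool.
    """
    clusters = []  # the cluster collection D, kept as disjoint sets
    events = []    # (step k, point j, number of clusters j is connected to)
    for k, new_points in enumerate(steps):
        for j in new_points:
            hits = [c for c in clusters if connected(j, c, k)]
            merged = {j}.union(*hits)  # Case 1: no hits, Case 2: one, Case 3: several
            clusters = [c for c in clusters if c not in hits]
            clusters.append(merged)
            events.append((k, j, len(hits)))
    if len(clusters) > 1:  # final merge of whatever remains
        clusters = [set().union(*clusters)]
    return clusters, events


# Hypothetical 1-D positions; "connected" here means within distance 1.
positions = {0: 0.0, 1: 2.0, 2: 1.0, 3: 2.5}


def near(j, cluster, k):
    return any(abs(positions[j] - positions[i]) <= 1.0 for i in cluster)


clusters, events = agglomerate([[0], [1], [2], [3]], near)
```

In this toy run, points 0 and 1 start their own clusters (Case 1), point 2 bridges both of them (Case 3), and point 3 joins the resulting cluster (Case 2).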

Experiments
We evaluate the proposed hierarchical clustering algorithm on two different datasets. The first dataset, multi, is drawn from a three-component Gaussian mixture, and the second dataset, banana, is a benchmark dataset [20]. Our Matlab implementation of the OC-SVM path algorithm and the OC-SVM-based hierarchical clustering algorithm is available from the author's website [21].
In our experiments, the Gaussian kernel $k(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$ is used. Since this kernel maps all of the data points onto a hypersphere in the same orthant, the OC-SVM principle of separating data points from the origin is justified.
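The two properties behind this argument can be checked numerically: k(x, x) = 1 means every mapped point has unit norm (it lies on a hypersphere), and k(x, x') > 0 means all pairwise inner products are positive. The sketch below uses hypothetical sample points and σ = 1.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Hypothetical sample points.
points = [(0.0, 0.0), (1.0, 2.0), (-3.0, 0.5)]
K = [[gaussian_kernel(p, q) for q in points] for p in points]

# Unit norm in feature space: ||Phi(x)||^2 = k(x, x) = 1.
unit_norm = all(abs(K[i][i] - 1.0) < 1e-12 for i in range(len(points)))
# Positive inner products: no two mapped points are more than 90 degrees apart.
positive = all(value > 0.0 for row in K for value in row)
```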

Gaussian Mixture Data
The multi dataset is a set of 200 data points randomly drawn from a three-component Gaussian mixture distribution with uneven weights. Figure 2 shows the OC-SVM decision sets at two different values of λ. Each small circle represents a data point. As can be seen in Figure 2a, however, the decision set at the higher value of λ, indicated by the shaded region, is not completely contained inside the solid contour delineating the decision set at the lower value of λ. Thus, we modify the set estimate at a level λ to be the union of the sets at higher levels, as explained in Section 3.1. Then, the boundaries do not cross each other, and the sets become properly nested (Figure 2b).
The five nested OC-SVM decision sets on multi data are illustrated in Figure 3a. The decision set at a certain level λ consists of one or more separated regions or contours. These contours are interpreted as cluster boundaries [13]. Then, the hierarchical structure of these clusters is clearly visible. The inner contours at higher levels are contained within the outer contours at lower levels. Thus, any pair of data points in the same cluster at a certain level remains together at lower levels. Note that the figure shows the three components of multi data.
Our hierarchical clustering based on the OC-SVM is visualized with a dendrogram in Figure 4. Below each dendrogram, the ground truth cluster memberships are shown. The data points from the same Gaussian component are indicated by the dots of the same color and height. To compare with the conventional hierarchical clustering, the figure also displays the dendrograms from the single linkage, the complete linkage and the group average. While there are three true clusters in multi, this is not apparent in the classical agglomerative clustering results.
However, three components are clearly identifiable in Figure 4d obtained from the proposed OC-SVM hierarchical clustering. In the dendrogram, each of the three clusters grows until the larger one merges with the smaller one, and finally, all of the three clusters combine to form a single large cluster. Then, the cluster steadily grows until it encloses the rest of the outlying data points in the low-density regions. This observation is consistent with the nested decision boundaries in Figure 3.

Benchmark Data
We repeat similar experiments on the banana benchmark data. The banana data were originally made for classification. In the experiments, only negative-class examples are used. As illustrated in Figure 3b, banana data have two distinct components: one is elongated and banana-shaped, and the other is elliptical. The figure also displays the nested contours illustrating the nested OC-SVM decision sets.
The hierarchical clustering results are shown in Figure 5. Among the three classical agglomerative algorithms, only the single linkage seems to be able to identify the two data components. This result is expected, because the single linkage is well suited to locating elongated clusters, while the complete linkage, which tends to produce compact spherical clusters, is not. On the other hand, our method is more flexible and has successfully found the two data components, as can be seen in Figure 5d. The valley in Figure 5d also indicates the possible subdivision of the banana-shaped cluster into two subclusters.
We further compare the single linkage and our approach using the spanning trees. Figure 6 shows the minimum spanning tree from the single linkage hierarchical clustering [22] and the spanning tree from the proposed OC-SVM hierarchical clustering. The two spanning trees look similar, but differ in some respects. A major difference is in the longest paths in the data components. The thicker line segments in the figure indicate the longest paths. In the OC-SVM spanning tree, the nodes on the longest path are located relatively close to the "center", and the path connecting the nodes shows the ridge of the data distribution.

Computational Costs
The SVM solution path algorithm has O(n) breakpoints and complexity $O(m^2 n + n^2 m)$, where m is the maximum number of points on the margin along the path [23]. At each breakpoint on the OC-SVM solution path, Algorithm 1 checks the connectivity of the newly included data points to the points in the existing clusters. If we assume that the newly included data points are on the margin, then the computational time of the proposed hierarchical clustering algorithm is $O(n^2 m)$. The computational costs on the multi and banana datasets are presented in Table 1.
Table 1. The computational costs for multi and banana datasets on an Intel i7-4790 3.60 GHz.

Conclusions
In this paper, we have presented a hierarchical clustering algorithm employing the OC-SVM. Rather than using distance to indirectly find the high-density regions, we use the OC-SVM to estimate level sets and to directly locate the high-density regions. Our algorithm builds a family of nested OC-SVM decision sets over the entire range of control parameter λ. As each decision set induces a set of clusters, our algorithm finds the hierarchical structure of clusters from the family of decision sets. Dendrograms and spanning trees are used to visualize the results. Compared to the classical agglomerative methods, the proposed algorithm successfully identified the cluster components in the datasets. Future work may include comparing to other set estimators that yield nested decision sets as in [24] or kernel density estimation followed by thresholding.
... that the Lagrange multipliers are piecewise linear in λ and developed an algorithm that finds the solution path. Lee and Scott [19] extended the method to the OC-SVM to derive the optimal solution path algorithm for all values of λ.
The path algorithm finds the whole set of solutions by decreasing λ from a large value toward zero. For sufficiently large λ, all the data points fall between the hyperplane and the origin so that f(x) < 1. As λ decreases, the margin width decreases, and data points cross the hyperplane (f(x) = 1) to move outside the margin (f(x) > 1). Throughout this process, the OC-SVM solution path algorithm monitors the changes of the following subsets:
$$E = \{i : f(x_i) = 1,\ 0 \le \alpha_i \le 1\}, \qquad L = \{i : f(x_i) < 1,\ \alpha_i = 1\}, \qquad R = \{i : f(x_i) > 1,\ \alpha_i = 0\}.$$
We first establish the initial state of the sets defined above. For sufficiently large λ, every data point $x_i$ falls inside the margin; that is, $f(x_i) < 1$. Then the KKT conditions imply $\alpha_i = 1$ for all i, and we obtain $\lambda \ge \sum_j K_{ij}$ for all i from Equation (1). Thus, if
$$\lambda_0 = \max_i \sum_j K_{ij}$$
denotes the maximum row sum of the kernel matrix, then for any $\lambda \ge \lambda_0$, the optimal solution of the OC-SVM becomes $\alpha_i = 1$ for all i. Therefore, the path algorithm sets the initial value of λ to $\lambda_0$. Then all the data points are in the subset L and the corresponding Lagrange multipliers are $\alpha_i = 1$.
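The initialization can be sketched numerically: the starting value of λ is the maximum row sum of the kernel matrix, and with every multiplier equal to 1 no training point has a decision value above 1 at that λ. The sample points, σ = 1 and the function names below are hypothetical, and the decision function is assumed to be f(x_i) = (1/λ) Σ_j α_j K_ij.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def initial_lambda(points, sigma=1.0):
    """Return the maximum row sum of the kernel matrix and all row sums."""
    row_sums = [sum(gaussian_kernel(p, q, sigma) for q in points) for p in points]
    return max(row_sums), row_sums


# Hypothetical 1-D sample.
points = [(0.0,), (1.0,), (3.0,)]
lam0, row_sums = initial_lambda(points)

# With all alpha_i = 1, f(x_i) = (row sum i) / lam0, which is at most 1.
f_at_lam0 = [s / lam0 for s in row_sums]
```

The point with the largest row sum sits exactly on the boundary at λ_0, which is where the first event of the path occurs.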

A.2. Tracing the Path
As λ decreases, either of the following events can occur:
1. A point enters E from L or R.
2. A point leaves E to enter L or R.
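Let $\alpha_j^l$ and $\lambda_l$ denote the parameter values right after the l-th event and $f_l(x)$ the decision function at this point. Define $E_l$ similarly and suppose $|E_l| = m$. Recall that
$$f(x) = \frac{1}{\lambda}\sum_{j=1}^{n} \alpha_j k(x_j, x).$$
Then, for $\lambda_l > \lambda > \lambda_{l+1}$, we can write
$$f(x) = \left[f(x) - \frac{\lambda_l}{\lambda} f_l(x)\right] + \frac{\lambda_l}{\lambda} f_l(x) = \frac{1}{\lambda}\left[\sum_{j \in E_l} (\alpha_j - \alpha_j^l)\, k(x_j, x) + \lambda_l f_l(x)\right].$$
The second equality holds because for this range of λ only points in $E_l$ change their $\alpha_j$. In contrast, all other points in $R_l$ or $L_l$ have their $\alpha_j$ fixed at 0 or 1, respectively. Since $f(x_i) = 1$ for all $i \in E_l$, we have
$$\sum_{j \in E_l} (\alpha_j - \alpha_j^l) K_{ij} = \lambda - \lambda_l \quad \text{for all } i \in E_l.$$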
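One way to see why the multipliers are piecewise linear in λ is to write the margin condition f(x_i) = 1, i ∈ E_l, in matrix form. The sketch below is a worked step under two assumptions that are not stated in the text: the m × m kernel submatrix restricted to E_l, written here as K_l, is invertible, and the vector b is shorthand introduced only for this illustration.

```latex
% Notation introduced for this sketch: \delta_j = \alpha_j - \alpha_j^l for j \in E_l,
% K_l = (K_{ij})_{i,j \in E_l}, and \mathbf{1} is the all-ones vector of length m.
K_l\,\delta = (\lambda - \lambda_l)\,\mathbf{1}
\quad\Longrightarrow\quad
\delta = (\lambda - \lambda_l)\,K_l^{-1}\mathbf{1}
\quad\Longrightarrow\quad
\alpha_j = \alpha_j^l - (\lambda_l - \lambda)\,b_j ,
\qquad b = K_l^{-1}\mathbf{1}.
```

Between two consecutive events, each α_j with j ∈ E_l is therefore an affine function of λ, which is consistent with the piecewise linear behavior of the solution path.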

Figure 1. A new data point (red dot) is included in the decision set (solid contours). Based on the connectivity of the point to the existing (dark shaded) clusters, three cases arise: (a) the point is connected to none of the clusters, and a new cluster is formed; (b) the point is connected to exactly one cluster, and the cluster grows; or (c) the point is connected to more than one cluster, and the clusters are merged.

Figure 2. The OC-SVM decision sets on multi data at λ = 22.60 (shaded) and 1.26 (contour). Small circles represent data points. (a) Two sets from the original OC-SVM are not nested. (b) With the modification in Section 3.1, the shaded region is completely contained inside the solid contour; thus, the sets are nested.

Figure 4. Dendrograms from the classical hierarchical clustering of multi data: (a) single linkage, (b) complete linkage and (c) group average linkage. (d) Dendrogram from our hierarchical clustering algorithm based on the OC-SVM. Three clusters are clearly identifiable.

Figure 5. Dendrograms from hierarchical clustering of banana data: (a) single linkage, (b) complete linkage, (c) group average linkage, and (d) OC-SVM. The two clusters are found from the single linkage and the OC-SVM hierarchical clustering.