Article

Accelerating Density Peak Clustering Algorithm

Jun-Lin Lin 1,2
1 Department of Information Management, Yuan Ze University, Taoyuan 32003, Taiwan
2 Innovation Center for Big Data and Digital Convergence, Yuan Ze University, Taoyuan 32003, Taiwan
Symmetry 2019, 11(7), 859; https://doi.org/10.3390/sym11070859
Submission received: 28 May 2019 / Revised: 23 June 2019 / Accepted: 26 June 2019 / Published: 2 July 2019

Abstract

The Density Peak Clustering (DPC) algorithm is a new density-based clustering method. It spends most of its execution time on calculating the local density and the separation distance for each data point in a dataset. The purpose of this study is to accelerate its computation. On average, the DPC algorithm scans half of the dataset to calculate the separation distance of each data point. We propose an approach to calculate the separation distance of a data point by scanning only the neighbors of the data point. Additionally, the purpose of the separation distance is to assist in choosing the density peaks, which are the data points with both high local density and high separation distance. We propose an approach to identify non-peak data points at an early stage to avoid calculating their separation distances. Our experimental results show that most of the data points in a dataset can benefit from the proposed approaches to accelerate the DPC algorithm.

1. Introduction

Clustering is the process of categorizing objects into groups (called clusters) of similar objects and is a widely-used data mining technique in both academic and applied research [1,2]. Many clustering methods appear in the literature, but they differ in the notion of similarity. For example, the k-means algorithm [3] represents each cluster by a centroid, and objects near the same centroid are deemed similar; the DBSCAN algorithm [4] defines the notion of density and deems the objects in a continuous region with a density exceeding a specified threshold as similar; some studies measure the similarity using the concept of symmetry.
The k-means algorithm is an example of the partitioning-based clustering methods, and most partitioning-based clustering methods can find only spherical clusters [5]. In contrast, the DBSCAN algorithm is an example of the density-based clustering methods, which can not only find clusters of arbitrary shapes but also detect outliers [5]. Although a density-based clustering method usually requires more execution time than a partitioning-based clustering method, it can often discover meaningful clustering results that a partitioning-based clustering method cannot reveal. Several applications of clustering to real-world problems use both approaches to extract different clustering results from the same dataset, highlighting different aspects of the data.
The Density Peak Clustering (DPC) algorithm, proposed by Rodriguez and Laio [6], is a new density-based clustering method that has received much attention over the past few years [7,8,9,10,11,12,13,14,15,16,17]. It accelerates the clustering process by first searching for the density peaks in a dataset and then constructing clusters from the density peaks. To search for density peaks, DPC must calculate two quantities for each data point: its local density and its separation distance (see Section 2 for details) [9]. Then, data points with relatively high local density and separation distance are selected as the density peaks. Many works refer to the density peak of a cluster as the "center" of the cluster. Since density-based clustering methods yield clusters of arbitrary shapes, the notion of "center" is somewhat misleading. This work uses "density peak" instead of "center" to avoid confusion.
The contribution of this work is to propose two methods (called ADPC1 and ADPC2) that accelerate the DPC algorithm. The first method, ADPC1, accelerates the calculation of separation distances and yields the same clustering results as the DPC algorithm. The second method, ADPC2, accelerates the DPC algorithm by identifying a significant portion of the non-peak data points and avoiding calculating their separation distances. Since calculating the separation distances for all data points is a time-consuming step with O(N²) time complexity, where N is the number of data points, our proposed methods can significantly speed up the DPC algorithm.
The rest of this work is organized as follows: Section 2 reviews related work, with a focus on the DPC algorithm. Section 3 and Section 4 propose our methods. Section 5 presents the experimental results. Finally, Section 6 concludes this study.

2. Related Works

2.1. Clustering Methods

In the literature, clustering methods have been classified into several categories [18]: partitioning-based methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Partitioning-based methods (e.g., k-means and possibilistic c-means) focus on discovering compact and hyperellipsoidally shaped clusters. With k-means, the clustering results are sensitive to outliers. The possibilistic c-means (PCM) method is resilient to outliers, but it requires an additional parameter γ for each cluster. The adaptive PCM algorithm [19] allows the parameters γ to change as the algorithm evolves.
Hierarchical methods work by iteratively (or recursively) dividing a large cluster into smaller clusters (or by merging small clusters into larger ones). As a result, their clustering results can be represented by a dendrogram. Bianchi et al. [20] proposed a clustering method that forms clusters by iteratively partitioning an undirected graph.
Density-based methods discover clusters that are continuous regions with a high local density within the regions. Unlike the partitioning-based methods, density-based methods yield clusters of arbitrary shapes. Grid-based methods use a grid data structure to quantize the data space into a finite number of cells and perform the clustering operations directly on the cells. Model-based methods try to fit the data to some mathematical model.
Some clustering methods do not fit nicely into the above categorization. For example, subspace clustering [21] methods identify clusters based on their association with subspaces in high-dimensional spaces.

2.2. Density Peak Clustering Algorithm

As described in Section 1, the DPC algorithm [6] must calculate the local density and the separation distance for each data point. Given a dataset X, the local density ρ(x_i) of a data point x_i ∈ X is the number of data points in the neighborhood of x_i. That is:

$$\rho(x_i) = |B(x_i)| \tag{1}$$

where B(x_i) denotes the neighborhood of x_i and is defined as the set of data points in X whose distance to x_i is less than a user-specified parameter d_c. That is:

$$B(x_i) = \{x_j \in X \mid d(x_i, x_j) < d_c\} \tag{2}$$

where d(x_i, x_j) represents the distance between x_i and x_j. Notably, Equations (1) and (2) use the parameter d_c as a hard threshold to derive the neighborhood and the local density of a data point, respectively.
The value of d_c can be chosen so that the average number of neighbors of a data point is around p% of the number of data points in X, and the suggested value [6] for p is between 1 and 2. For small datasets, Rodriguez and Laio [6] suggested using an exponential kernel to calculate the local density, as shown in Equation (3):

$$\rho(x_i) = \sum_{x_j \in X} \exp\left(-\frac{d(x_i, x_j)^2}{d_c^2}\right) \tag{3}$$
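For concreteness, a minimal NumPy sketch of both density definitions follows; the helper names (choose_dc, local_density) and the percentile-based choice of d_c, which roughly matches the p% heuristic above, are our own illustration, not code from [6] or [22]:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def choose_dc(D, p=2.0):
    """Pick d_c so that, on average, a point has about p% of N points as neighbors."""
    pairwise = D[np.triu_indices_from(D, k=1)]  # all N(N-1)/2 pairwise distances
    return np.percentile(pairwise, p)           # the p-th percentile as d_c

def local_density(D, dc, kernel="hard"):
    """Local density via Eq. (1) (hard threshold) or Eq. (3) (exponential kernel)."""
    if kernel == "hard":
        return (D < dc).sum(axis=1) - 1         # exclude the point itself (d = 0)
    return np.exp(-(D / dc) ** 2).sum(axis=1) - 1.0  # drop the self term exp(0) = 1

X = np.random.rand(500, 2)                      # toy two-dimensional dataset
D = squareform(pdist(X))                        # N x N distance matrix
dc = choose_dc(D, p=2.0)
rho = local_density(D, dc, kernel="hard")
```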
The separation distance δ(x_i) of x_i is the minimum distance from x_i to any other data point with a local density greater than ρ(x_i), or the maximum distance from x_i to any other data point in X if no data point has a local density greater than ρ(x_i), as shown in Equation (4):

$$\delta(x_i) = \begin{cases} \min\limits_{j:\,\rho(x_j) > \rho(x_i)} d(x_i, x_j), & \text{if } \rho(x_i) < \max\limits_{x_j \in X} \rho(x_j) \\ \max\limits_{x_j \in X} d(x_i, x_j), & \text{otherwise.} \end{cases} \tag{4}$$
For ease of exposition, we use σ(x_i) to denote the index j of the data point x_j that is nearest to x_i among those with ρ(x_j) > ρ(x_i); if no such data point exists, σ(x_i) is set to i, as shown in Equation (5):

$$\sigma(x_i) = \begin{cases} \operatorname*{argmin}\limits_{j:\,\rho(x_j) > \rho(x_i)} d(x_i, x_j), & \text{if } \rho(x_i) < \max\limits_{x_j \in X} \rho(x_j) \\ i, & \text{otherwise.} \end{cases} \tag{5}$$
Notably, more than one data point may be nearest to x_i among those with a local density greater than ρ(x_i). According to Laio’s Matlab implementation of the DPC algorithm [22], if this situation happens, then σ(x_i) is chosen randomly from the indexes of the data points with the highest local density among these tied nearest points.
Once ρ(x_i) and δ(x_i) of each data point have been determined, the DPC algorithm uses the following assumption to select density peaks: if a data point x_i ∈ X is a density peak, then x_i must be surrounded by many data points (i.e., ρ(x_i) is large) and must be at a relatively high distance from other data points with a local density greater than ρ(x_i) (i.e., δ(x_i) is large). To assist in choosing the density peaks, the DPC algorithm plots each data point in a decision graph, a two-dimensional graph with the local density and the separation distance as the horizontal and vertical axes, respectively. Data points with both high local density and high separation distance are manually selected as the density peaks. Alternatively, one can set a threshold on γ(x_i) = ρ(x_i)·δ(x_i) and select data points with γ(x_i) greater than the threshold as density peaks [6].
After all density peaks have been determined, each density peak acts as the starting point of a cluster, so the number of density peaks equals the number of clusters. Each non-peak data point is assigned to the same cluster as its nearest data point of higher density, i.e., data point x_i is assigned to the cluster that contains x_{σ(x_i)}. Let y_i denote the cluster label of data point x_i; then y_i = y_{σ(x_i)}.
Algorithm 1 shows the DPC algorithm. Notably, it is important to sort the data points in descending order of local density in Step 2 so that calculating δ(x_i) and σ(x_i) in Step 3 and the cluster assignment in Step 6 can be done efficiently. Without Step 2, Step 3 would require scanning all data points in X to find, for each data point x_i, the data points with a local density greater than ρ(x_i). With Step 2, Step 3 only needs to scan the data points located before x_i in X, which reduces the running time of Step 3 by half on average. Additionally, with Step 2, data points with higher local density are processed earlier in Step 6. Since ρ(x_{σ(x_i)}) > ρ(x_i), y_{σ(x_i)} is determined before y_i in Step 6, and thus Step 6 can complete the cluster assignment in O(N) time.
Algorithm 1. DPC algorithm.
Input: the set of data points X ∈ ℝ^{N×M} and the parameters d_c for defining the neighborhood and d_r for selecting density peaks
Output: the label vector of cluster indexes y (N × 1)
Algorithm:
1. Calculate ρ(x_i) for each x_i ∈ X using either (1) or (3).
2. Sort all data points in X in descending order of local density.
3. Calculate δ(x_i) and σ(x_i) for each x_i ∈ X using (4) and (5), respectively.
4. Select data points with ρ(x_i)·δ(x_i) > d_r as density peaks.
5. For each density peak x_i, set y_i = i. // starting point of each cluster
6. For each non-peak data point x_i, set y_i = y_{σ(x_i)}. // cluster assignment
7. Return y.
Appendix A describes Laio’s implementation details for Step 3 of the DPC algorithm. Specifically, we discuss how it handles two ambiguous situations when calculating the separation distance using Equation (4).
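As a concrete illustration, the sketch below is our own Python rendering of Steps 2–6 of Algorithm 1, reading Equations (4) and (5) literally; it reuses the D and rho arrays from the earlier sketch and is not Laio’s Matlab code, whose tie handling is discussed in Appendix A:

```python
import numpy as np

def dpc(D, rho, dr):
    """Sketch of Steps 2-6 of Algorithm 1, following Eqs. (4) and (5) literally."""
    N = len(rho)
    order = np.argsort(-rho)                    # Step 2: descending local density
    delta = np.empty(N)
    sigma = np.empty(N, dtype=int)
    for k in range(N):                          # Step 3
        i = order[k]
        preds = order[:k]                       # points sorted before x_i
        denser = preds[rho[preds] > rho[i]]     # keep only strictly denser points
        if denser.size == 0:                    # rho(x_i) equals the maximal density
            delta[i], sigma[i] = D[i].max(), i
        else:
            j = denser[np.argmin(D[i, denser])]
            delta[i], sigma[i] = D[i, j], j
    peaks = np.flatnonzero(rho * delta > dr)    # Step 4: threshold on gamma
    y = -np.ones(N, dtype=int)
    y[peaks] = peaks                            # Step 5: each peak starts a cluster
    for i in order:                             # Step 6: y[sigma[i]] is set before y[i]
        if y[i] < 0:                            # (assumes each density maximum is a peak)
            y[i] = y[sigma[i]]
    return y
```

Because the loop in Step 6 visits points in the Step-2 order, each non-peak’s predecessor x_{σ(x_i)} is already labeled when x_i is reached, which is exactly the O(N) assignment argument above.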

3. Accelerating DPC by Scanning Neighbors Only

As described earlier, for each data point x_i ∈ X, Step 3 of the DPC algorithm in Algorithm 1 requires scanning half of X on average to find x_{σ(x_i)}, i.e., the data point nearest to x_i with a local density greater than ρ(x_i). Observation 1 shows that we can find x_{σ(x_i)} by scanning only the neighbors of x_i if the local density of x_i is less than the maximal local density of its neighbors. Most data points satisfy this condition, and the size of a data point’s neighborhood is much smaller than the size of X, so the time complexity of Step 3 can be reduced from O(N²) to O(Nb), where N denotes the number of data points in X and b denotes the average neighborhood size.
Observation 1.
If ρ(x_i) < max_{x_j ∈ B(x_i)} ρ(x_j) for some data point x_i ∈ X, then the data point nearest to x_i with a local density greater than ρ(x_i) is in B(x_i), i.e., x_{σ(x_i)} ∈ B(x_i).
Based on Observation 1, we rewrite Equations (4) and (5) as Equations (6) and (7) below. In Algorithm 2, we propose an accelerated version of DPC (called ADPC1), which produces the same clustering results as DPC, but in less time:
$$\delta(x_i) = \begin{cases} \min\limits_{j:\,x_j \in B(x_i) \,\wedge\, \rho(x_j) > \rho(x_i)} d(x_i, x_j), & \text{if } B(x_i) \neq \emptyset \text{ and } \rho(x_i) < \max\limits_{x_j \in B(x_i)} \rho(x_j) \\ \min\limits_{j:\,x_j \in X \,\wedge\, \rho(x_j) > \rho(x_i)} d(x_i, x_j), & \text{if } B(x_i) = \emptyset \text{ or } \big(\rho(x_i) = \max\limits_{x_j \in B(x_i)} \rho(x_j) \text{ and } \rho(x_i) \neq \max\limits_{x_j \in X} \rho(x_j)\big) \\ \max\limits_{x_j \in X} d(x_i, x_j), & \text{if } \rho(x_i) = \max\limits_{x_j \in X} \rho(x_j). \end{cases} \tag{6}$$
$$\sigma(x_i) = \begin{cases} \operatorname*{argmin}\limits_{j:\,x_j \in B(x_i) \,\wedge\, \rho(x_j) > \rho(x_i)} d(x_i, x_j), & \text{if } B(x_i) \neq \emptyset \text{ and } \rho(x_i) < \max\limits_{x_j \in B(x_i)} \rho(x_j) \\ \operatorname*{argmin}\limits_{j:\,x_j \in X \,\wedge\, \rho(x_j) > \rho(x_i)} d(x_i, x_j), & \text{if } B(x_i) = \emptyset \text{ or } \big(\rho(x_i) = \max\limits_{x_j \in B(x_i)} \rho(x_j) \text{ and } \rho(x_i) \neq \max\limits_{x_j \in X} \rho(x_j)\big) \\ i, & \text{if } \rho(x_i) = \max\limits_{x_j \in X} \rho(x_j). \end{cases} \tag{7}$$
Algorithm 2. ADPC1 algorithm.
Input: the set of data points X ∈ ℝ^{N×M} and the parameters d_c for defining the neighborhood and d_r for selecting density peaks
Output: the label vector of cluster indexes y (N × 1)
Algorithm:
1. Calculate ρ(x_i) and B(x_i) for each x_i ∈ X using either (1) and (2) or (3) and (2).
2. Sort all data points in X in descending order of local density.
3. Calculate δ(x_i) and σ(x_i) for each x_i ∈ X using (6) and (7).
4. Select data points with ρ(x_i)·δ(x_i) > d_r as density peaks.
5. For each density peak x_i, set y_i = i. // starting point of each cluster
6. For each non-peak data point x_i, set y_i = y_{σ(x_i)}. // cluster assignment
7. Return y.
The parts that differ from DPC in Algorithm 1 are Steps 1 and 3.
Notably, in Step 1 of Algorithm 1, the DPC algorithm uses B(x_i) to calculate the local density ρ(x_i), but afterwards B(x_i) is no longer needed. In Algorithm 2, however, the ADPC1 algorithm must keep B(x_i) for calculating δ(x_i) and σ(x_i) in Step 3. If ρ(x_i) < max_{x_j ∈ B(x_i)} ρ(x_j), then we only need to scan B(x_i) to calculate σ(x_i) and δ(x_i); otherwise, σ(x_i) and δ(x_i) are calculated the same way as in the DPC algorithm, i.e., by scanning half of the dataset X on average. Since the local density of a data point is usually less than the maximal local density of its neighbors, ADPC1 can greatly reduce the execution time. Appendix B describes the implementation details for Step 3 of the ADPC1 algorithm.
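A sketch of this neighbor-first computation is shown below (our own code, not the paper’s implementation); it assumes a neighbor list B kept from Step 1, e.g., B = [np.flatnonzero((D[i] < dc) & (np.arange(len(D)) != i)) for i in range(len(D))], plus the D and rho arrays from the earlier sketches:

```python
import numpy as np

def adpc1_delta_sigma(D, rho, B):
    """Sketch of ADPC1's Step 3 (Eqs. (6) and (7)): scan only B(x_i) when it
    contains a denser point; otherwise fall back to DPC's full scan."""
    N = len(rho)
    order = np.argsort(-rho)
    delta = np.empty(N)
    sigma = np.empty(N, dtype=int)
    for k in range(N):
        i = order[k]
        denser_nbrs = [j for j in B[i] if rho[j] > rho[i]]
        if denser_nbrs:                              # first case of Eqs. (6)/(7)
            j = min(denser_nbrs, key=lambda n: D[i, n])
            delta[i], sigma[i] = D[i, j], j
        else:                                        # remaining cases, as in DPC
            preds = order[:k]
            denser = preds[rho[preds] > rho[i]]
            if denser.size == 0:                     # x_i has the maximal density
                delta[i], sigma[i] = D[i].max(), i
            else:
                j = denser[np.argmin(D[i, denser])]
                delta[i], sigma[i] = D[i, j], j
    return delta, sigma
```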

4. Accelerating DPC by Skipping Non-Peaks

Both DPC and ADPC1 need to calculate the separation distance δ(x_i) for each data point x_i. Recall that the purpose of calculating δ(x_i) is to assist in determining whether x_i is a density peak. Therefore, if we can determine that x_i is a non-peak data point at an early stage, then there is no need to calculate δ(x_i). Observation 2 shows a necessary condition for a density peak, which can be used to detect most non-peak data points in a dataset.
Observation 2.
If ρ(x_i) < max_{x_j ∈ B(x_i)} ρ(x_j) for some data point x_i ∈ X, then x_i cannot be a density peak.
If x_i is not a density peak, then we can skip calculating δ(x_i) and simply assign it a small value, say 0. However, without calculating δ(x_i), we do not know σ(x_i), i.e., the index of the data point nearest to x_i with a local density greater than ρ(x_i). Notably, σ(x_i) is needed for cluster assignment in Step 6 of the DPC and ADPC1 algorithms. To resolve this problem, we use the index of the data point with the highest local density in the neighborhood of x_i as a surrogate for σ(x_i) and redefine Equations (6) and (7) as Equations (8) and (9) below:
$$\delta(x_i) = \begin{cases} 0, & \text{if } B(x_i) \neq \emptyset \text{ and } \rho(x_i) < \max\limits_{x_j \in B(x_i)} \rho(x_j) \\ \min\limits_{j:\,x_j \in X \,\wedge\, \rho(x_j) > \rho(x_i)} d(x_i, x_j), & \text{if } B(x_i) = \emptyset \text{ or } \big(\rho(x_i) \geq \max\limits_{x_j \in B(x_i)} \rho(x_j) \text{ and } \rho(x_i) \neq \max\limits_{x_j \in X} \rho(x_j)\big) \\ \max\limits_{x_j \in X} d(x_i, x_j), & \text{if } \rho(x_i) = \max\limits_{x_j \in X} \rho(x_j). \end{cases} \tag{8}$$
$$\sigma(x_i) = \begin{cases} \operatorname*{argmax}\limits_{j:\,x_j \in B(x_i)} \rho(x_j), & \text{if } B(x_i) \neq \emptyset \text{ and } \rho(x_i) < \max\limits_{x_j \in B(x_i)} \rho(x_j) \\ \operatorname*{argmin}\limits_{j:\,x_j \in X \,\wedge\, \rho(x_j) > \rho(x_i)} d(x_i, x_j), & \text{if } B(x_i) = \emptyset \text{ or } \big(\rho(x_i) \geq \max\limits_{x_j \in B(x_i)} \rho(x_j) \text{ and } \rho(x_i) \neq \max\limits_{x_j \in X} \rho(x_j)\big) \\ i, & \text{if } \rho(x_i) = \max\limits_{x_j \in X} \rho(x_j). \end{cases} \tag{9}$$
Notably, Equations (8) and (9) modify only the first case of Equations (6) and (7), i.e., the case where the local density of x_i is less than the maximal local density of its neighbors. Based on Equations (8) and (9), we propose another accelerated version of DPC (called ADPC2), which is the same as ADPC1 in Algorithm 2 except that Step 3 of ADPC2 uses Equations (8) and (9) instead of Equations (6) and (7) to calculate δ(x_i) and σ(x_i), as shown in Algorithm 3. Notably, because ADPC1 and ADPC2 calculate σ(x_i) differently, their clustering results can differ slightly. Appendix C describes the implementation details for Step 3 of the ADPC2 algorithm, and a code sketch of the changed step appears after Algorithm 3.
Algorithm 3. ADPC2 algorithm.
Input: the set of data points X ∈ ℝ^{N×M} and the parameters d_c for defining the neighborhood and d_r for selecting density peaks
Output: the label vector of cluster indexes y (N × 1)
Algorithm:
1. Calculate ρ(x_i) and B(x_i) for each x_i ∈ X using either (1) and (2) or (3) and (2).
2. Sort all data points in X in descending order of local density.
3. Calculate δ(x_i) and σ(x_i) for each x_i ∈ X using (8) and (9).
4. Select data points with ρ(x_i)·δ(x_i) > d_r as density peaks.
5. For each density peak x_i, set y_i = i. // starting point of each cluster
6. For each non-peak data point x_i, set y_i = y_{σ(x_i)}. // cluster assignment
7. Return y.
The parts that differ from DPC in Algorithm 1 are Steps 1 and 3.
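To illustrate the change, here is the Step-3 sketch adapted to Equations (8) and (9); only the first branch differs from the ADPC1 sketch in Section 3, and it uses the same D, rho, and B arrays (all names are our own illustration):

```python
import numpy as np

def adpc2_delta_sigma(D, rho, B):
    """Sketch of ADPC2's Step 3 (Eqs. (8) and (9)): an early-detected non-peak
    gets delta = 0 and its densest neighbor as a surrogate for sigma."""
    N = len(rho)
    order = np.argsort(-rho)
    delta = np.empty(N)
    sigma = np.empty(N, dtype=int)
    for k in range(N):
        i = order[k]
        if len(B[i]) > 0 and rho[B[i]].max() > rho[i]:  # Observation 2: not a peak
            delta[i] = 0.0                              # no distance scan at all
            sigma[i] = B[i][np.argmax(rho[B[i]])]       # surrogate: densest neighbor
        else:                                           # possible peak: as in DPC
            preds = order[:k]
            denser = preds[rho[preds] > rho[i]]
            if denser.size == 0:
                delta[i], sigma[i] = D[i].max(), i
            else:
                j = denser[np.argmin(D[i, denser])]
                delta[i], sigma[i] = D[i, j], j
    return delta, sigma
```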

5. Performance Study

5.1. Test Datasets

In this study, we use 12 well-known two-dimensional synthetic datasets to demonstrate the performance of the proposed algorithms. Dataset Spiral [23] consists of three spiral-shaped clusters. Dataset Flame [24] consists of two non-Gaussian clusters that differ in size and shape. Dataset Aggregation [25] consists of seven perceptually distinct (non-Gaussian) clusters. Dataset R15 [26] consists of 15 similar Gaussian clusters positioned on concentric circles. Dataset D31 [26] consists of 31 similar Gaussian clusters positioned along random curves. Datasets A1, A2, and A3 [27] contain 20, 35, and 50 circular clusters, respectively, where each cluster has 150 points. Datasets S1, S2, S3, and S4 [28] each contain 15 Gaussian clusters, where the degree of cluster overlap increases from S1 to S4. Appendix D gives a detailed characterization of these datasets.

5.2. Experiment Setup

The experiment was divided into two tests. Test 1 used a hard threshold to calculate the local density, as defined in Equations (1) and (2); Test 2 used an exponential kernel, as defined in Equation (3). In both tests, the value of d_c for defining the neighborhood was determined by the parameter p, as suggested in [6] and described in Section 2. We varied the value of p from 0.5 to 4 with a step size of 0.5. A large p implies a large d_c and, consequently, a large neighborhood.
In this experimental study, we compared the performance of the proposed ADPC1 and ADPC2 against DPC. Recall that both ADPC1 and ADPC2 accelerate the derivation of the separation distances of those data points whose local density is less than the maximal local density of their neighbors. Thus, we calculated the proportion (denoted by R̆) of such data points in a dataset for various p values, i.e., R̆ = N̆/N, where N̆ is the number of such data points and N is the total number of data points in the dataset. Usually, both N̆ and R̆ grow with the neighborhood size (i.e., with a larger d_c or p). Thus, the proposed ADPC1 and ADPC2 should perform better with a larger p.
Since the three algorithms differ only in how they calculate the separation distance, we collected and compared their execution times for calculating the local density and the separation distance, i.e., Steps 1 to 3 of Algorithms 1–3. Then, for ease of comparison, we calculated the percentage of execution time improvement of ADPC1 (or ADPC2) over DPC as the difference between the execution times of DPC and ADPC1 (or ADPC2), divided by the execution time of DPC.
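In code, this measurement is the simple ratio below (a sketch; run_dpc and run_adpc are hypothetical callables wrapping Steps 1–3 of the respective algorithms):

```python
import time

def improvement_pct(run_dpc, run_adpc):
    """Percentage of execution time improvement over DPC, as defined in Section 5.2."""
    t0 = time.perf_counter(); run_dpc()
    t_dpc = time.perf_counter() - t0
    t0 = time.perf_counter(); run_adpc()
    t_adpc = time.perf_counter() - t0
    return 100.0 * (t_dpc - t_adpc) / t_dpc
```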

5.3. Experiment Results

5.3.1. Test 1: Use a Fixed Threshold for Local Density

In Test 1, a fixed threshold is used to determine the neighborhood for calculating the local density of each data point. Table 1 shows the value of R̆, i.e., the proportion of data points with a local density less than the maximal local density of their neighbors. According to Table 1, except for some combinations of small datasets (e.g., the Spiral, Flame, Aggregation, and R15 datasets) and small p values, the value of R̆ is usually greater than 80%, indicating that a large proportion of the data points in a dataset can benefit from ADPC1 and ADPC2 to accelerate the calculation of their separation distances. Please refer to Table A2 in Appendix E for the value of N̆, i.e., the number of data points with a local density less than the maximal local density of their neighbors.
A larger p implies a larger d_c, and thus a larger neighborhood range and probably more neighbors within it. Intuitively, for a data point with a large number of neighbors, it becomes less likely that its local density exceeds the maximal local density of its neighbors. Therefore, as the value of p increases, the value of R̆ tends to increase (with some exceptions).
Table 2 shows the percentage of execution time improvement of ADPC1 and ADPC2 over DPC. Except for the two small datasets Spiral and Flame at p = 0.5, both ADPC1 and ADPC2 substantially reduced the execution time of DPC. ADPC2 took less time than ADPC1 for most dataset and p combinations. For the execution times of the three algorithms, please see Table A4 in Appendix E.
For most cases in Table 1, the values of R̆ were large and did not change much as the value of p increased. As a result, the impact of p’s value on the execution time improvement is not obvious in Table 2. To show the impact of R̆ on the percentage of execution time improvement, consider the case of dataset D31 at p = 3 and 3.5. In Table 1, the value of R̆ dropped from 87.94% at p = 3 to 65.61% at p = 3.5. Correspondingly, Table 2 shows that at p = 3, ADPC1 (or ADPC2) improved the execution time over DPC by 77.86% (or 80.44%), but at p = 3.5, by only 47.06% (or 48.53%). This example shows that a large R̆ helps ADPC1 and ADPC2 achieve a large percentage of execution time improvement. However, if a small p is applied to a small dataset, then the resulting R̆ value is too small, causing ADPC2 to run slower than DPC (e.g., datasets Flame and Spiral at p = 0.5).

5.3.2. Test 2: Use an Exponential Kernel for Local Density

In Test 2, an exponential kernel (see Equation (3)) is used to calculate the local density of each data point. Table 3 shows the value of R̆ for various dataset and p combinations. Please refer to Table A3 in Appendix E for the value of N̆. Similar to Table 1 in Test 1, a large proportion of the data points can benefit from ADPC1 and ADPC2. Furthermore, each value of R̆ in Table 3 is greater than its corresponding value in Table 1. That is, for the same dataset and the same p value, an even larger proportion of data points can benefit from ADPC1 and ADPC2 when the exponential kernel, rather than a fixed threshold, is used to calculate the local density. In Test 2, a larger p value always yields a larger R̆ value in Table 3. These results are consistent with those of Test 1.
Table 4 shows the percentage of execution time improvement of ADPC1 and ADPC2 over DPC. ADPC1 always took less time than DPC, except at p = 0.5 for the Spiral dataset; ADPC2 always took less time than DPC, except at p = 0.5 for the Flame dataset. In general, both ADPC1 and ADPC2 required substantially less execution time than DPC. ADPC2 usually achieved a higher improvement than ADPC1, but the difference is small. For the execution times of the three algorithms, please see Table A5 in Appendix E.
Comparing Table 2 and Table 4 shows that the execution time improvement is greater in Test 1 than in Test 2. In Test 1, calculating the local density of a data point requires simply counting the number of data points in its neighborhood (see Equations (1) and (2)). In Test 2, however, calculating the local density of a data point is much more time-consuming because it requires evaluating an exponential function N − 1 times, where N is the number of data points in the dataset (see Equation (3)). The execution time collected in this study covers the calculation of both the local density and the separation distance. All three algorithms use the same method to calculate the local density and differ only in how they calculate the separation distance. That is, the execution time improvement of ADPC1 and ADPC2 over DPC is due entirely to the improved calculation of the separation distance. Since much more time is spent on calculating the local density in Test 2 than in Test 1, the percentage of execution time improvement is smaller in Test 2.

6. Conclusions

As discussed in Section 3, if the local density of a data point x_i is less than the largest local density of its neighbors, then ADPC1 and ADPC2 can reduce the time complexity of calculating the separation distance of x_i from O(N) to O(|B(x_i)|), where N denotes the number of data points in the dataset and |B(x_i)| denotes the number of neighbors of x_i. Thus, the effectiveness of both ADPC1 and ADPC2 depends on the proportion of data points satisfying this condition. The experimental results in Table 1 and Table 3 show that most data points in a dataset satisfy this condition, except for some small datasets using a small neighborhood setting. Consequently, both ADPC1 and ADPC2 improve the execution time of DPC, as shown in Table 2 and Table 4. Furthermore, in most cases, ADPC2 requires less execution time than ADPC1.
Consider the case where all data points in a continuous region have the same local density. Then, no data point in the region has a local density less than the largest local density of its neighbors, so neither ADPC1 nor ADPC2 can accelerate the computation of the separation distance for the data points in this region. If the dataset contains many such regions, the advantage of ADPC1 and ADPC2 diminishes. However, according to Table 1 and Table 3, both ADPC1 and ADPC2 are advantageous except for small datasets with a small neighborhood range (i.e., a small d_c).
The proposed methods focus on accelerating the calculation of the separation distance. However, it is also possible to improve the DPC algorithm by accelerating the calculation of the local density [9]. In addition, the DPC algorithm has several shortcomings that have received much attention in the literature. First, choosing proper values for DPC’s parameters is not straightforward, but these values can greatly affect the quality of the clustering results. To resolve this problem, Ref. [7] applied the concept of heat diffusion and Ref. [8] employed the potential entropy of the data field to determine the value of d_c. Additionally, Ref. [12] proposed a comparative technique to choose the density peaks. Thus, how to make the DPC algorithm more adaptive to datasets with less human intervention is worthy of further investigation.
The local density of a data point x_i can be defined from two different perspectives. One is to specify a fixed distance and count the number of data points within that distance from x_i; the DPC algorithm adopts this perspective. The other is to specify a fixed number of neighbors and measure the distances from these neighbors to x_i. Refs. [13,14] adopted this perspective and defined new methods to calculate the local density based on the k-nearest neighbors of x_i. Since the definition of the local density significantly affects the clustering results, how to choose a proper definition is an important issue worthy of further investigation for density-based clustering algorithms.
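For contrast, a minimal sketch of the second perspective follows, assuming the distance matrix D from the earlier sketches; it is one simple instantiation of a k-nearest-neighbor density, not the exact formulas of [13,14]:

```python
import numpy as np

def knn_density(D, k=10):
    """A simple k-NN-style local density (illustrative only): the smaller a
    point's mean distance to its k nearest neighbors, the denser the point."""
    knn_dist = np.sort(D, axis=1)[:, 1:k + 1]   # column 0 is the self-distance 0
    return 1.0 / knn_dist.mean(axis=1)
```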
Our future work intends to extend the DPC algorithm into a hierarchical clustering algorithm. Conceptually, the DPC algorithm builds a directed acyclic graph over all data points with an out-degree of at most one. It then selects several data points from the graph as density peaks. Finally, it removes the outgoing links of the density peaks, breaking the graph into several subgraphs, each of which represents a cluster. By imposing an ordering on the density peaks and incrementally removing their outgoing links according to this ordering, it is possible to yield the clustering results as a dendrogram. Furthermore, integrating the notion of central symmetry [29] or point symmetry [30] with the DPC algorithm for detecting symmetric objects is also worthy of further investigation.

Funding

This research is supported by the Ministry of Science and Technology, Taiwan, under grant MOST 106-2221-E-155-038.

Acknowledgments

The author acknowledges the Innovation Center for Big Data and Digital Convergence at Yuan Ze University for supporting this study.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Implementation Details for Calculating Separation Distance in DPC

Consider the case of more than one data point with a local density equal to the maximal local density in X. According to Equation (4), the separation distance of any data point x_i with the maximal local density will be set to the maximal distance from x_i to any point in X, i.e., max_{x_j ∈ X} d(x_i, x_j). Consequently, all data points with the maximal local density have high separation distances and thus will be chosen as density peaks to form individual clusters, even though some of these data points may be near each other. Notably, many data points with an equal local density are less likely to occur when Equation (3) is used to calculate the local density because the exponential kernel in Equation (3) yields a floating-point value. However, the local density calculated using Equation (1) is an integer, so data points with an equal local density are common.
Laio’s Matlab implementation of the DPC algorithm [22] resolves this problem as follows. Recall that in Step 2 of the DPC algorithm in Algorithm 1, all data points in X are sorted in descending order of local density, i.e., ρ(x_i) ≥ ρ(x_j) for i < j. After Step 2, Laio used the ordering of the data points’ positions in X, instead of the ordering on local density, to calculate separation distances. Specifically, Laio used Equations (A1) and (A2) instead of Equations (4) and (5). Notably, in this work, we use x_i to denote the ith data point in X, so whenever the ordering of the data points in X is rearranged, the data point referred to as x_i also changes. Although more than one data point may have the same local density, each position in X can be taken by only one data point:
$$\delta(x_i) = \begin{cases} \max\limits_{x_j \in X} d(x_i, x_j), & \text{if } i = 1 \\ \min\limits_{j < i} d(x_i, x_j), & \text{if } i > 1 \end{cases} \tag{A1}$$
$$\sigma(x_i) = \begin{cases} 1, & \text{if } i = 1 \\ \operatorname*{argmin}\limits_{j < i} d(x_i, x_j), & \text{if } i > 1 \end{cases} \tag{A2}$$
According to Equation (A1), only the separation distance of the first data point x_1 in X is set to the maximal distance, and for each data point x_i with i ≠ 1, we only scan the data points located before x_i in X. Notably, with Equations (A1) and (A2), it is possible that σ(x_i) = j (i ≠ 1) but ρ(x_j) = ρ(x_i), because the local densities after Step 2 of the DPC algorithm in Algorithm 1 are non-increasing but not strictly decreasing. However, with Equations (4) and (5), if σ(x_i) = j and i ≠ j, then ρ(x_j) > ρ(x_i) must hold. Thus, Equation (A1) slightly modifies the definition of the separation distance in Equation (4), and the difference is illustrated in Figure A1.
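In code, the positional scheme of Equations (A1) and (A2) reduces to the short sketch below (our rendering, not Laio’s Matlab; order is the index array produced by the Step-2 sort):

```python
import numpy as np

def laio_delta_sigma(D, order):
    """Sketch of Eqs. (A1)/(A2): after sorting, scan positions rather than
    densities, so ties in local density are broken by position in X."""
    N = D.shape[0]
    delta = np.empty(N)
    sigma = np.empty(N, dtype=int)
    first = order[0]
    delta[first], sigma[first] = D[first].max(), first   # the i = 1 case
    for k in range(1, N):
        i, preds = order[k], order[:k]   # every point placed before x_i, ties included
        j = preds[np.argmin(D[i, preds])]
        delta[i], sigma[i] = D[i, j], j
    return delta, sigma
```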
Figure A1. Difference between using Equation (4) and using Equation (A1) to calculate the separation distance.
Figure A2 shows Laio’s implementation for Step 3 of the DPC algorithm based on Equations (A1) and (A2). Only the first data point is handled differently from the rest of the data points. Figure A3 shows the implementation of Step 3 of the DPC algorithm based on Equations (4) and (5). Data points with the maximal local density are handled in the same manner in Figure A2 and Figure A3. However, for data points with a local density less than the maximal local density, Figure A3 faithfully implements Equations (4) and (5) to ensure that no data point with the same local density as x_i is scanned when calculating δ(x_i), as illustrated in Figure A1.
Figure A2. Implementation details of Step 3 of the DPC algorithm (in Algorithm 1) based on Laio’s implementation [22].
Figure A3. Implementation details of Step 3 of the DPC algorithm (in Algorithm 1) based on Equations (4) and (5).

Appendix B. Implementation Details for Calculating Separation Distance in ADPC1

Figure A4 gives a detailed description of Step 3 of the ADPC1 algorithm in Algorithm 2. Notably, before this step, all data points in X have been sorted in descending order of local density. To resolve the problem of multiple data points with the maximal local density, we adopt the same approach described in Appendix A, as shown in the first for loop of Figure A4. That is, only the separation distance of the first data point with the maximal local density in X is set to the maximal distance. For each subsequent data point x_i with the maximal local density, the separation distance δ(x_i) is set to the minimal distance from x_i to the data points located before x_i in X. Notably, the data points with the maximal local density are handled in the same manner in Figure A2, Figure A3, and Figure A4. The second for loop in Figure A4 applies Equations (6) and (7) to process the data points with a local density less than the maximal local density.
Figure A4. Implementation details of Step 3 of the ADPC1 algorithm (in Algorithm 2) based on Equations (6) and (7).

Appendix C. Implementation Details for Calculating Separation Distance in ADPC2

Figure A5 gives a detailed description of Step 3 of the ADPC2 algorithm in Algorithm 3. To resolve the problem of multiple data points with the maximal local density, we adopt the same approach described in Appendix A, as shown in the first for loop of Figure A5. The second for loop in Figure A5 applies Equations (8) and (9) to process the data points with a local density less than the maximal local density.
Figure A5. Implementation details of Step 3 of the ADPC2 algorithm (in Algorithm 3) based on Equations (8) and (9).

Appendix D. Datasets

Figure A6 and Figure A7 show the data distribution of the 12 two-dimensional synthetic datasets used in Section 5. Table A1 describes the number of clusters and the number of points in these datasets.
Figure A6. Data distribution of the 12 datasets (part 1).
Figure A7. Data distribution of the 12 datasets (part 2).
Table A1. Number of points and number of clusters in the 12 datasets.

| Dataset | Number of Clusters | Number of Points |
|---|---|---|
| Spiral | 3 | 312 |
| Flame | 2 | 240 |
| Aggregation | 7 | 788 |
| R15 | 15 | 600 |
| D31 | 31 | 3100 |
| A1 | 20 | 3000 |
| A2 | 35 | 5250 |
| A3 | 50 | 7500 |
| S1 | 15 | 5000 |
| S2 | 15 | 5000 |
| S3 | 15 | 5000 |
| S4 | 15 | 5000 |

Appendix E. More Experimental Results

Table A2 and Table A3 show the number N̆ of data points with a local density less than the maximal local density of their neighbors in Tests 1 and 2, respectively. Table A4 and Table A5 show the execution times of the three algorithms (DPC, ADPC1, and ADPC2) in Tests 1 and 2, respectively.
Table A2. The value of N̆ for various dataset and p combinations (Test 1 uses a fixed threshold).

| Dataset (N) | p = 0.5 | p = 1 | p = 1.5 | p = 2 | p = 2.5 | p = 3 | p = 3.5 | p = 4 |
|---|---|---|---|---|---|---|---|---|
| Spiral (N = 312) | 62 | 90 | 110 | 139 | 161 | 187 | 208 | 232 |
| Flame (N = 240) | 75 | 143 | 177 | 208 | 213 | 221 | 226 | 231 |
| Aggregation (N = 788) | 609 | 704 | 744 | 760 | 763 | 761 | 761 | 768 |
| R15 (N = 600) | 348 | 490 | 537 | 558 | 563 | 570 | 564 | 567 |
| D31 (N = 3100) | 2952 | 3028 | 3034 | 3037 | 3013 | 2726 | 2034 | 2482 |
| A1 (N = 3000) | 2854 | 2955 | 2967 | 2968 | 2971 | 2973 | 2964 | 2962 |
| A2 (N = 5250) | 5140 | 5201 | 5192 | 5185 | 5202 | 5214 | 5236 | 5237 |
| A3 (N = 7500) | 7398 | 7418 | 7413 | 7456 | 7477 | 7483 | 7486 | 7490 |
| S1 (N = 5000) | 4305 | 4806 | 4912 | 4936 | 4945 | 4956 | 4974 | 4973 |
| S2 (N = 5000) | 4351 | 4834 | 4934 | 4953 | 4950 | 4954 | 4968 | 4961 |
| S3 (N = 5000) | 4474 | 4882 | 4942 | 4959 | 4977 | 4970 | 4977 | 4982 |
| S4 (N = 5000) | 4355 | 4818 | 4930 | 4955 | 4964 | 4968 | 4972 | 4976 |
Table A3. The value of N̆ for various dataset and p combinations (Test 2 uses an exponential kernel).

| Dataset (N) | p = 0.5 | p = 1 | p = 1.5 | p = 2 | p = 2.5 | p = 3 | p = 3.5 | p = 4 |
|---|---|---|---|---|---|---|---|---|
| Spiral (N = 312) | 148 | 251 | 286 | 301 | 308 | 309 | 309 | 309 |
| Flame (N = 240) | 118 | 188 | 219 | 231 | 232 | 234 | 234 | 235 |
| Aggregation (N = 788) | 725 | 769 | 775 | 778 | 779 | 780 | 781 | 781 |
| R15 (N = 600) | 439 | 527 | 556 | 572 | 576 | 581 | 584 | 584 |
| D31 (N = 3100) | 2983 | 3044 | 3061 | 3067 | 3069 | 3070 | 3075 | 3085 |
| A1 (N = 3000) | 2914 | 2967 | 2977 | 2978 | 2980 | 2980 | 2981 | 2983 |
| A2 (N = 5250) | 5175 | 5212 | 5214 | 5216 | 5227 | 5236 | 5239 | 5243 |
| A3 (N = 7500) | 7424 | 7447 | 7453 | 7475 | 7483 | 7491 | 7493 | 7493 |
| S1 (N = 5000) | 4592 | 4880 | 4944 | 4963 | 4978 | 4983 | 4983 | 4984 |
| S2 (N = 5000) | 4605 | 4910 | 4968 | 4980 | 4982 | 4984 | 4985 | 4985 |
| S3 (N = 5000) | 4727 | 4932 | 4966 | 4979 | 4983 | 4985 | 4987 | 4988 |
| S4 (N = 5000) | 4652 | 4910 | 4953 | 4971 | 4974 | 4980 | 4985 | 4987 |
Table A4. The execution time in seconds (Test 1 uses a fixed threshold).

| Dataset | Algorithm | p = 0.5 | p = 1 | p = 1.5 | p = 2 | p = 2.5 | p = 3 | p = 3.5 | p = 4 |
|---|---|---|---|---|---|---|---|---|---|
| Spiral | DPC | 0.038101 | 0.046879 | 0.079211 | 0.052139 | 0.04688 | 0.046879 | 0.046879 | 0.046879 |
| Spiral | ADPC1 | 0.035094 | 0.041108 | 0.036096 | 0.032086 | 0.046879 | 0.031252 | 0.031252 | 0.046879 |
| Spiral | ADPC2 | 0.04512 | 0.033088 | 0.04512 | 0.046878 | 0.031251 | 0.046845 | 0.04686 | 0.015646 |
| Flame | DPC | 0.031253 | 0.031255 | 0.031273 | 0.031252 | 0.031253 | 0.031252 | 0.031252 | 0.031253 |
| Flame | ADPC1 | 0.015626 | 0.031253 | 0.015626 | 0 | 0.015599 | 0.015628 | 0 | 0.015628 |
| Flame | ADPC2 | 0.031255 | 0.015625 | 0.015605 | 0.015627 | 0.015627 | 0 | 0.0156 | 0.015606 |
| Aggregation | DPC | 0.252673 | 0.265685 | 0.281293 | 0.296938 | 0.296938 | 0.296938 | 0.312564 | 0.281312 |
| Aggregation | ADPC1 | 0.081215 | 0.078132 | 0.062506 | 0.062507 | 0.062506 | 0.078133 | 0.062506 | 0.078133 |
| Aggregation | ADPC2 | 0.0781 | 0.062507 | 0.04688 | 0.04688 | 0.046901 | 0.062507 | 0.078133 | 0.062505 |
| R15 | DPC | 0.156265 | 0.171892 | 0.171892 | 0.171892 | 0.171892 | 0.187519 | 0.171891 | 0.171891 |
| R15 | ADPC1 | 0.125012 | 0.078131 | 0.078131 | 0.04688 | 0.046879 | 0.04688 | 0.046873 | 0.046878 |
| R15 | ADPC2 | 0.093759 | 0.07813 | 0.046879 | 0.046879 | 0.046878 | 0.031253 | 0.031253 | 0.031253 |
| D31 | DPC | 4.281704 | 4.344213 | 4.375495 | 4.375465 | 4.375464 | 4.234823 | 3.187837 | 3.625386 |
| D31 | ADPC1 | 0.883543 | 0.7657 | 0.742746 | 0.781333 | 0.812588 | 0.937631 | 1.687681 | 1.672092 |
| D31 | ADPC2 | 0.875124 | 0.71418 | 0.687573 | 0.7032 | 0.734489 | 0.828213 | 1.640798 | 1.578292 |
| A1 | DPC | 4.016051 | 4.078557 | 4.125437 | 4.141065 | 4.145534 | 4.160485 | 4.234813 | 4.187945 |
| A1 | ADPC1 | 0.81259 | 0.687572 | 0.687606 | 0.734455 | 0.781332 | 0.79696 | 0.828212 | 0.89072 |
| A1 | ADPC2 | 0.796961 | 0.656319 | 0.640726 | 0.67195 | 0.687573 | 0.687573 | 0.718826 | 0.762816 |
| A2 | DPC | 12.48567 | 12.4857 | 12.59509 | 12.59508 | 12.59509 | 12.68885 | 12.76698 | 12.7982 |
| A2 | ADPC1 | 2.078376 | 1.984586 | 2.093839 | 2.203357 | 2.328372 | 2.484632 | 2.564224 | 2.672159 |
| A2 | ADPC2 | 2.000502 | 1.922079 | 2.250205 | 2.000213 | 2.093969 | 2.187732 | 2.234609 | 2.297118 |
| A3 | DPC | 25.42454 | 25.53396 | 25.69023 | 25.72854 | 26.01839 | 26.11215 | 26.40587 | 26.33089 |
| A3 | ADPC1 | 3.926412 | 4.189912 | 4.47909 | 4.547357 | 4.769292 | 5.194029 | 5.486006 | 5.506579 |
| A3 | ADPC2 | 3.83774 | 3.802774 | 3.953544 | 4.1411 | 4.281703 | 4.420601 | 4.561412 | 4.742328 |
| S1 | DPC | 11.42309 | 11.81372 | 11.65747 | 11.71463 | 11.75125 | 11.79816 | 11.79813 | 11.82938 |
| S1 | ADPC1 | 3.922292 | 2.422134 | 2.140852 | 2.125225 | 2.140851 | 2.234612 | 2.344002 | 2.437762 |
| S1 | ADPC2 | 3.8584 | 2.328372 | 1.984588 | 1.93774 | 1.922111 | 1.953329 | 2.035082 | 2.125227 |
| S2 | DPC | 11.21994 | 11.45434 | 11.52083 | 11.5481 | 11.68874 | 11.64186 | 11.65748 | 11.67311 |
| S2 | ADPC1 | 3.687891 | 2.312742 | 2.031496 | 2.062718 | 2.15648 | 2.218986 | 2.344 | 2.469012 |
| S2 | ADPC2 | 3.641011 | 2.203357 | 1.906452 | 1.890826 | 1.922079 | 1.984584 | 2.047092 | 2.105681 |
| S3 | DPC | 11.29808 | 11.48559 | 11.56373 | 11.56373 | 11.62623 | 11.59498 | 11.65748 | 11.67311 |
| S3 | ADPC1 | 3.21909 | 2.14085 | 2.031468 | 2.031467 | 2.109567 | 2.234612 | 2.343999 | 2.469044 |
| S3 | ADPC2 | 3.187838 | 2.015836 | 1.890857 | 1.86247 | 1.906452 | 1.97208 | 2.031466 | 2.109599 |
| S4 | DPC | 11.25118 | 11.51685 | 11.57936 | 11.62623 | 11.56371 | 11.64183 | 11.71999 | 11.70437 |
| S4 | ADPC1 | 3.515999 | 2.344002 | 2.078346 | 2.093971 | 2.203359 | 2.265865 | 2.359625 | 2.437758 |
| S4 | ADPC2 | 3.484713 | 2.250242 | 1.937706 | 1.906451 | 1.968992 | 1.984586 | 2.062751 | 2.099317 |
Table A5. The execution time in seconds (Test 2 uses an exponential kernel).

| Dataset | Algorithm | p = 0.5 | p = 1 | p = 1.5 | p = 2 | p = 2.5 | p = 3 | p = 3.5 | p = 4 |
|---|---|---|---|---|---|---|---|---|---|
| Spiral | DPC | 0.217607 | 0.227606 | 0.24064 | 0.230642 | 0.218788 | 0.234399 | 0.218773 | 0.218806 |
| Spiral | ADPC1 | 0.243648 | 0.210588 | 0.199559 | 0.177449 | 0.17189 | 0.218742 | 0.171892 | 0.187521 |
| Spiral | ADPC2 | 0.201514 | 0.179477 | 0.177471 | 0.215608 | 0.171891 | 0.18752 | 0.171869 | 0.171893 |
| Flame | DPC | 0.12501 | 0.125012 | 0.125014 | 0.14064 | 0.125013 | 0.14064 | 0.125014 | 0.14064 |
| Flame | ADPC1 | 0.109386 | 0.109386 | 0.109386 | 0.125014 | 0.093759 | 0.109406 | 0.109387 | 0.109387 |
| Flame | ADPC2 | 0.12501 | 0.109385 | 0.125013 | 0.109386 | 0.10938 | 0.109386 | 0.109384 | 0.109385 |
| Aggregation | DPC | 1.299488 | 1.250166 | 1.250131 | 1.281362 | 1.281407 | 1.250169 | 1.250164 | 1.250162 |
| Aggregation | ADPC1 | 1.078868 | 1.062644 | 1.062612 | 1.062614 | 1.062644 | 1.062645 | 1.078269 | 1.078274 |
| Aggregation | ADPC2 | 1.11083 | 1.062613 | 1.046983 | 1.047022 | 1.047018 | 1.047019 | 1.047018 | 1.047018 |
| R15 | DPC | 0.779632 | 0.750111 | 0.750057 | 0.750113 | 0.750075 | 0.765738 | 0.734483 | 0.73445 |
| R15 | ADPC1 | 0.703232 | 0.671957 | 0.656351 | 0.640727 | 0.656351 | 0.640728 | 0.625097 | 0.625066 |
| R15 | ADPC2 | 0.703232 | 0.671978 | 0.656351 | 0.625098 | 0.640724 | 0.640691 | 0.625097 | 0.60947 |
| D31 | DPC | 20.34588 | 19.8146 | 19.90442 | 19.53329 | 19.65834 | 19.5177 | 19.54898 | 19.64271 |
| D31 | ADPC1 | 17.12682 | 16.74525 | 16.61117 | 16.39237 | 16.56423 | 16.39233 | 16.4581 | 16.59551 |
| D31 | ADPC2 | 17.11119 | 16.64242 | 16.36111 | 16.25173 | 16.28676 | 16.35301 | 16.53297 | 16.37674 |
| A1 | DPC | 18.58013 | 18.08004 | 17.95503 | 17.98625 | 17.8769 | 17.8988 | 17.95503 | 17.8769 |
| A1 | ADPC1 | 15.57978 | 15.26725 | 15.18911 | 15.15783 | 15.12657 | 15.15679 | 15.16565 | 15.19659 |
| A1 | ADPC2 | 15.42351 | 15.15789 | 15.04847 | 15.01722 | 14.90703 | 15.00159 | 14.92343 | 14.87658 |
| A2 | DPC | 56.31846 | 55.53715 | 55.05268 | 55.67778 | 55.25958 | 55.44339 | 55.5684 | 55.64653 |
| A2 | ADPC1 | 47.64571 | 46.62474 | 46.12989 | 46.31741 | 47.2863 | 46.45805 | 46.27054 | 46.87997 |
| A2 | ADPC2 | 47.3644 | 46.12753 | 45.97363 | 46.37992 | 46.12989 | 45.84858 | 45.91012 | 46.16624 |
| A3 | DPC | 115.6921 | 112.1025 | 112.9651 | 115.0125 | 112.3618 | 112.9651 | 112.1213 | 112.8714 |
| A3 | ADPC1 | 95.73069 | 93.50614 | 95.91672 | 95.73115 | 95.51016 | 94.68273 | 95.79348 | 96.21016 |
| A3 | ADPC2 | 96.21336 | 94.86226 | 94.79854 | 93.96594 | 94.73864 | 93.69754 | 94.52285 | 94.63775 |
| S1 | DPC | 53.11501 | 51.63048 | 50.92332 | 50.58346 | 50.44285 | 50.78661 | 50.19283 | 51.5836 |
| S1 | ADPC1 | 45.36419 | 43.80152 | 42.70859 | 42.8327 | 42.33265 | 42.0513 | 42.2279 | 42.50451 |
| S1 | ADPC2 | 45.77048 | 43.614 | 42.9889 | 41.89504 | 42.33735 | 41.86379 | 41.8169 | 41.92629 |
| S2 | DPC | 51.88047 | 50.97416 | 50.89606 | 50.00531 | 50.36472 | 50.08341 | 50.73976 | 50.56783 |
| S2 | ADPC1 | 44.75324 | 42.71937 | 42.00446 | 42.00442 | 42.22323 | 44.39533 | 42.66078 | 42.61389 |
| S2 | ADPC2 | 44.58285 | 42.48891 | 41.85598 | 41.86382 | 41.86418 | 41.7544 | 41.9107 | 41.75443 |
| S3 | DPC | 51.89613 | 50.14595 | 50.17026 | 49.89592 | 50.03655 | 50.63037 | 50.13029 | 50.03656 |
| S3 | ADPC1 | 43.45777 | 41.86382 | 41.75443 | 41.78568 | 41.64504 | 41.86382 | 42.22323 | 41.88364 |
| S3 | ADPC2 | 44.23907 | 41.84819 | 42.20761 | 41.77006 | 42.27011 | 41.65029 | 41.87944 | 41.89504 |
| S4 | DPC | 51.45296 | 50.1772 | 50.42723 | 50.5679 | 50.36469 | 50.05215 | 50.00527 | 50.23967 |
| S4 | ADPC1 | 44.00467 | 42.13493 | 41.84819 | 42.52013 | 42.05134 | 42.00442 | 42.274 | 42.28573 |
| S4 | ADPC2 | 44.47347 | 41.88291 | 42.08256 | 42.03568 | 42.17635 | 41.66067 | 41.86382 | 41.84976 |

References

1. Aggarwal, C.C.; Reddy, C.K. Data Clustering: Algorithms and Applications; Chapman and Hall/CRC: Boca Raton, FL, USA, 2014.
2. Pham, G.; Lee, S.-H.; Kwon, O.-H.; Kwon, K.-R. A watermarking method for 3D printing based on Menger curvature and k-mean clustering. Symmetry 2018, 10, 97.
3. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 18–21 June 1965; pp. 281–297.
4. Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 1996, 96, 226–231.
5. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2011; p. 696.
6. Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492.
7. Mehmood, R.; Zhang, G.; Bie, R.; Dawood, H.; Ahmad, H. Clustering by fast search and find of density peaks via heat diffusion. Neurocomputing 2016, 208, 210–217.
8. Wang, S.; Wang, D.; Li, C.; Li, Y.; Ding, G. Clustering by fast search and find of density peaks with data field. Chin. J. Electron. 2016, 25, 397–402.
9. Bai, L.; Cheng, X.; Liang, J.; Shen, H.; Guo, Y. Fast density clustering strategies based on the k-means algorithm. Pattern Recognit. 2017, 71, 375–386.
10. Mehmood, R.; El-Ashram, S.; Bie, R.; Dawood, H.; Kos, A. Clustering by fast search and merge of local density peaks for gene expression microarray data. Sci. Rep. 2017, 7, 45602.
11. Liu, S.; Zhou, B.; Huang, D.; Shen, L. Clustering mixed data by fast search and find of density peaks. Math. Probl. Eng. 2017, 2017, 7.
12. Li, Z.; Tang, Y. Comparative density peaks clustering. Expert Syst. Appl. 2018, 95, 236–247.
13. Du, M.; Ding, S.; Jia, H. Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowl.-Based Syst. 2016, 99, 135–145.
14. Yaohui, L.; Zhengming, M.; Fang, Y. Adaptive density peak clustering based on k-nearest neighbors with aggregating strategy. Knowl.-Based Syst. 2017, 133, 208–220.
15. Ding, S.; Du, M.; Sun, T.; Xu, X.; Xue, Y. An entropy-based density peaks clustering algorithm for mixed type data employing fuzzy neighborhood. Knowl.-Based Syst. 2017, 133, 294–313.
16. Yang, X.-H.; Zhu, Q.-P.; Huang, Y.-J.; Xiao, J.; Wang, L.; Tong, F.-C. Parameter-free Laplacian centrality peaks clustering. Pattern Recognit. Lett. 2017, 100, 167–173.
17. Cheng, S.; Duan, Y.; Fan, X.; Zhang, D.; Cheng, H. Review of fast density-peaks clustering and its application to pediatric white matter tracts. In Annual Conference on Medical Image Understanding and Analysis; Springer International Publishing: Cham, Switzerland, 2017; pp. 436–447.
18. Han, J.; Kamber, M.; Pei, J. Cluster analysis: Basic concepts and methods. In Data Mining, 3rd ed.; Han, J., Kamber, M., Pei, J., Eds.; Morgan Kaufmann: Boston, MA, USA, 2012; pp. 443–495.
19. Xenaki, S.D.; Koutroumbas, K.D.; Rontogiannis, A.A. A novel adaptive possibilistic clustering algorithm. IEEE Trans. Fuzzy Syst. 2016, 24, 791–810.
20. Bianchi, G.; Bruni, R.; Reale, A.; Sforzi, F. A min-cut approach to functional regionalization, with a case study of the Italian local labour market areas. Optim. Lett. 2016, 10, 955–973.
21. Deng, Z.; Choi, K.-S.; Jiang, Y.; Wang, J.; Wang, S. A survey on soft subspace clustering. Inf. Sci. 2016, 348, 84–106.
22. Laio, A. Matlab Implementation of the Density Peak Algorithm. Available online: http://people.sissa.it/~laio/Research/Clustering_source_code/cluster_dp.tgz (accessed on 27 May 2019).
23. Chang, H.; Yeung, D.-Y. Robust path-based spectral clustering. Pattern Recognit. 2008, 41, 191–203.
24. Fu, L.; Medico, E. FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinf. 2007, 8, 3.
25. Gionis, A.; Mannila, H.; Tsaparas, P. Clustering aggregation. ACM Trans. Knowl. Discov. Data 2007, 1, 4.
26. Veenman, C.J.; Reinders, M.J.T.; Backer, E. A maximum variance cluster algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 1273–1280.
27. Kärkkäinen, I.; Fränti, P. Dynamic Local Search Algorithm for the Clustering Problem; University of Joensuu: Joensuu, Finland, 2002.
28. Fränti, P.; Virmajoki, O. Iterative shrinking method for clustering problems. Pattern Recognit. 2006, 39, 761–775.
29. Lin, J.; Peng, H.; Xie, J.; Zheng, Q. Novel clustering algorithm based on central symmetry. In Proceedings of the International Conference on Machine Learning and Cybernetics, Shanghai, China, 26–29 August 2004; pp. 1329–1334.
30. Bandyopadhyay, S.; Saha, S. A point symmetry-based clustering technique for automatic evolution of clusters. IEEE Trans. Knowl. Data Eng. 2008, 20, 1441–1457.
Table 1. The value of R̆ for various dataset and p combinations (using a fixed threshold).

| Dataset | p = 0.5 | p = 1 | p = 1.5 | p = 2 | p = 2.5 | p = 3 | p = 3.5 | p = 4 |
|---|---|---|---|---|---|---|---|---|
| Spiral | 19.87% | 28.85% | 35.26% | 44.55% | 51.60% | 59.94% | 66.67% | 74.36% |
| Flame | 31.25% | 59.58% | 73.75% | 86.67% | 88.75% | 92.08% | 94.17% | 96.25% |
| Aggregation | 77.28% | 89.34% | 94.42% | 96.45% | 96.83% | 96.57% | 96.57% | 97.46% |
| R15 | 58.00% | 81.67% | 89.50% | 93.00% | 93.83% | 95.00% | 94.00% | 94.50% |
| D31 | 95.23% | 97.68% | 97.87% | 97.97% | 97.19% | 87.94% | 65.61% | 80.06% |
| A1 | 95.13% | 98.50% | 98.90% | 98.93% | 99.03% | 99.10% | 98.80% | 98.73% |
| A2 | 97.90% | 99.07% | 98.90% | 98.76% | 99.09% | 99.31% | 99.73% | 99.75% |
| A3 | 98.64% | 98.91% | 98.84% | 99.41% | 99.69% | 99.77% | 99.81% | 99.87% |
| S1 | 86.10% | 96.12% | 98.24% | 98.72% | 98.90% | 99.12% | 99.48% | 99.46% |
| S2 | 87.02% | 96.68% | 98.68% | 99.06% | 99.00% | 99.08% | 99.36% | 99.22% |
| S3 | 89.48% | 97.64% | 98.84% | 99.18% | 99.54% | 99.40% | 99.54% | 99.64% |
| S4 | 87.10% | 96.36% | 98.60% | 99.10% | 99.28% | 99.36% | 99.44% | 99.52% |
Table 2. Percentage of execution time improvement over DPC (using a fixed threshold).

| Dataset | Algorithm | p = 0.5 | p = 1 | p = 1.5 | p = 2 | p = 2.5 | p = 3 | p = 3.5 | p = 4 |
|---|---|---|---|---|---|---|---|---|---|
| Spiral | ADPC1 | 7.89% | 12.31% | 54.43% | 38.46% | 0.00% | 33.33% | 33.33% | 0.00% |
| Spiral | ADPC2 | −18.42% | 29.42% | 43.04% | 10.09% | 33.34% | 0.07% | 0.04% | 66.62% |
| Flame | ADPC1 | 50.00% | 0.01% | 50.03% | 100.00% | 50.09% | 49.99% | 100.00% | 50.00% |
| Flame | ADPC2 | −0.01% | 50.01% | 50.10% | 50.00% | 50.00% | 100.00% | 50.08% | 50.07% |
| Aggregation | ADPC1 | 67.86% | 70.59% | 77.78% | 78.95% | 78.95% | 73.69% | 80.00% | 72.23% |
| Aggregation | ADPC2 | 69.09% | 76.47% | 83.33% | 84.21% | 84.21% | 78.95% | 75.00% | 77.78% |
| R15 | ADPC1 | 20.00% | 54.55% | 54.55% | 72.73% | 72.73% | 75.00% | 72.73% | 72.73% |
| R15 | ADPC2 | 40.00% | 54.55% | 72.73% | 72.73% | 72.73% | 83.33% | 81.82% | 81.82% |
| D31 | ADPC1 | 79.36% | 82.37% | 83.02% | 82.14% | 81.43% | 77.86% | 47.06% | 53.88% |
| D31 | ADPC2 | 79.56% | 83.56% | 84.29% | 83.93% | 83.21% | 80.44% | 48.53% | 56.47% |
| A1 | ADPC1 | 79.77% | 83.14% | 83.33% | 82.26% | 81.15% | 80.84% | 80.44% | 78.73% |
| A1 | ADPC2 | 80.16% | 83.91% | 84.47% | 83.77% | 83.41% | 83.47% | 83.03% | 81.79% |
| A2 | ADPC1 | 83.35% | 84.11% | 83.38% | 82.51% | 81.51% | 80.42% | 79.92% | 79.12% |
| A2 | ADPC2 | 83.98% | 84.61% | 82.13% | 84.12% | 83.37% | 82.76% | 82.50% | 82.05% |
| A3 | ADPC1 | 84.56% | 83.59% | 82.57% | 82.33% | 81.67% | 80.11% | 79.22% | 79.09% |
| A3 | ADPC2 | 84.91% | 85.11% | 84.61% | 83.90% | 83.54% | 83.07% | 82.73% | 81.99% |
| S1 | ADPC1 | 65.66% | 79.50% | 81.64% | 81.86% | 81.78% | 81.06% | 80.13% | 79.39% |
| S1 | ADPC2 | 66.22% | 80.29% | 82.98% | 83.46% | 83.64% | 83.44% | 82.75% | 82.03% |
| S2 | ADPC1 | 67.13% | 79.81% | 82.37% | 82.14% | 81.55% | 80.94% | 79.89% | 78.85% |
| S2 | ADPC2 | 67.55% | 80.76% | 83.45% | 83.63% | 83.56% | 82.95% | 82.44% | 81.96% |
| S3 | ADPC1 | 71.51% | 81.36% | 82.43% | 82.43% | 81.86% | 80.73% | 79.89% | 78.85% |
| S3 | ADPC2 | 71.78% | 82.45% | 83.65% | 83.89% | 83.60% | 82.99% | 82.57% | 81.93% |
| S4 | ADPC1 | 68.75% | 79.65% | 82.05% | 81.99% | 80.95% | 80.54% | 79.87% | 79.17% |
| S4 | ADPC2 | 69.03% | 80.46% | 83.27% | 83.60% | 82.97% | 82.95% | 82.40% | 82.06% |
Table 3. The value of R̆ for various dataset and p combinations (using an exponential kernel).

| Dataset | p = 0.5 | p = 1 | p = 1.5 | p = 2 | p = 2.5 | p = 3 | p = 3.5 | p = 4 |
|---|---|---|---|---|---|---|---|---|
| Spiral | 47.44% | 80.45% | 91.67% | 96.47% | 98.72% | 99.04% | 99.04% | 99.04% |
| Flame | 49.17% | 78.33% | 91.25% | 96.25% | 96.67% | 97.50% | 97.50% | 97.92% |
| Aggregation | 92.01% | 97.59% | 98.35% | 98.73% | 98.86% | 98.98% | 99.11% | 99.11% |
| R15 | 73.17% | 87.83% | 92.67% | 95.33% | 96.00% | 96.83% | 97.33% | 97.33% |
| D31 | 96.23% | 98.19% | 98.74% | 98.94% | 99.00% | 99.03% | 99.19% | 99.52% |
| A1 | 97.13% | 98.90% | 99.23% | 99.27% | 99.33% | 99.33% | 99.37% | 99.43% |
| A2 | 98.57% | 99.28% | 99.31% | 99.35% | 99.56% | 99.73% | 99.79% | 99.87% |
| A3 | 98.99% | 99.29% | 99.37% | 99.67% | 99.77% | 99.88% | 99.91% | 99.91% |
| S1 | 91.84% | 97.60% | 98.88% | 99.26% | 99.56% | 99.66% | 99.66% | 99.68% |
| S2 | 92.10% | 98.20% | 99.36% | 99.60% | 99.64% | 99.68% | 99.70% | 99.70% |
| S3 | 94.54% | 98.64% | 99.32% | 99.58% | 99.66% | 99.70% | 99.74% | 99.76% |
| S4 | 93.04% | 98.20% | 99.06% | 99.42% | 99.48% | 99.60% | 99.70% | 99.74% |
Table 4. Percentage of execution time improvement over DPC (using an exponential kernel).

| Dataset | Algorithm | p = 0.5 | p = 1 | p = 1.5 | p = 2 | p = 2.5 | p = 3 | p = 3.5 | p = 4 |
|---|---|---|---|---|---|---|---|---|---|
| Spiral | ADPC1 | −11.97% | 7.48% | 17.07% | 23.06% | 21.44% | 6.68% | 21.43% | 14.30% |
| Spiral | ADPC2 | 7.40% | 21.15% | 26.25% | 6.52% | 21.43% | 20.00% | 21.44% | 21.44% |
| Flame | ADPC1 | 12.50% | 12.50% | 12.50% | 11.11% | 25.00% | 22.21% | 12.50% | 22.22% |
| Flame | ADPC2 | 0.00% | 12.50% | 0.001% | 22.22% | 12.51% | 22.22% | 12.50% | 22.22% |
| Aggregation | ADPC1 | 16.98% | 15.00% | 15.00% | 17.07% | 17.07% | 15.00% | 13.75% | 13.75% |
| Aggregation | ADPC2 | 14.52% | 15.00% | 16.25% | 18.29% | 18.29% | 16.25% | 16.25% | 16.25% |
| R15 | ADPC1 | 9.80% | 10.42% | 12.49% | 14.58% | 12.50% | 16.33% | 14.89% | 14.89% |
| R15 | ADPC2 | 9.80% | 10.42% | 12.49% | 16.67% | 14.58% | 16.33% | 14.89% | 17.02% |
| D31 | ADPC1 | 15.82% | 15.49% | 16.55% | 16.08% | 15.74% | 16.01% | 15.81% | 15.51% |
| D31 | ADPC2 | 15.90% | 16.01% | 17.80% | 16.80% | 17.15% | 16.21% | 15.43% | 16.63% |
| A1 | ADPC1 | 16.15% | 15.56% | 15.40% | 15.73% | 15.38% | 15.32% | 15.54% | 14.99% |
| A1 | ADPC2 | 16.99% | 16.16% | 16.19% | 16.51% | 16.61% | 16.19% | 16.88% | 16.78% |
| A2 | ADPC1 | 15.40% | 16.05% | 16.21% | 16.81% | 14.43% | 16.21% | 16.73% | 15.75% |
| A2 | ADPC2 | 15.90% | 16.94% | 16.49% | 16.70% | 16.52% | 17.31% | 17.38% | 17.04% |
| A3 | ADPC1 | 17.25% | 16.59% | 15.09% | 16.76% | 15.00% | 16.18% | 14.56% | 14.76% |
| A3 | ADPC2 | 16.84% | 15.38% | 16.08% | 18.30% | 15.68% | 17.06% | 15.70% | 16.15% |
| S1 | ADPC1 | 14.59% | 15.16% | 16.13% | 15.32% | 16.08% | 17.20% | 15.87% | 17.60% |
| S1 | ADPC2 | 13.83% | 15.53% | 15.58% | 17.18% | 16.07% | 17.57% | 16.69% | 18.72% |
| S2 | ADPC1 | 13.74% | 16.19% | 17.47% | 16.00% | 16.17% | 11.36% | 15.92% | 15.73% |
| S2 | ADPC2 | 14.07% | 16.65% | 17.76% | 16.28% | 16.88% | 16.63% | 17.40% | 17.43% |
| S3 | ADPC1 | 16.26% | 16.52% | 16.77% | 16.25% | 16.77% | 17.31% | 15.77% | 16.29% |
| S3 | ADPC2 | 14.75% | 16.55% | 15.87% | 16.29% | 15.52% | 17.74% | 16.46% | 16.27% |
| S4 | ADPC1 | 14.48% | 16.03% | 17.01% | 15.91% | 16.51% | 16.08% | 15.46% | 15.83% |
| S4 | ADPC2 | 13.56% | 16.53% | 16.55% | 16.87% | 16.26% | 16.77% | 16.28% | 16.70% |
