Article

Fast Searching Density Peak Clustering Algorithm Based on Shared Nearest Neighbor and Adaptive Clustering Center

Key Laboratory of Advanced Control and Optimization for Chemical Processes, Ministry of Education, East China University of Science and Technology, Shanghai 200237, China
*
Author to whom correspondence should be addressed.
Symmetry 2020, 12(12), 2014; https://doi.org/10.3390/sym12122014
Submission received: 16 November 2020 / Revised: 30 November 2020 / Accepted: 3 December 2020 / Published: 6 December 2020

Abstract

A clustering analysis algorithm reveals the internal relationships among data without prior knowledge and gathers data with common attributes into groups. To address the problem that existing algorithms require prior knowledge, we propose a fast searching density peak clustering algorithm based on the shared nearest neighbor and adaptive clustering center (DPC-SNNACC). It automatically ascertains the position of the knee point in the decision graph according to the characteristics of different datasets, and further determines the number of clustering centers without human intervention. First, an improved calculation method of local density based on the symmetric distance matrix is proposed. Then, the position of the knee point is obtained by calculating the change in the difference between decision values. Finally, experimental and comparative evaluations on several datasets from diverse domains establish the viability of the DPC-SNNACC algorithm.

1. Introduction

Clustering is a process of "grouping by nature". As an important research topic in data mining and artificial intelligence, clustering is an unsupervised pattern recognition method that works without the guidance of prior information [1]. It aims at finding potential similarities and groupings, so that the distance between two points in the same cluster is as small as possible, while the distance between points in different clusters is as large as possible. A large number of studies have been devoted to solving clustering problems. Generally speaking, clustering algorithms can be divided into multiple categories such as partition-based, model-based, hierarchical-based, grid-based, density-based, and their combinations. Various clustering algorithms have been successfully applied in many fields [2].
With the emergence of big data, there is an increasing demand for clustering algorithms that can automatically understand and summarize data. Traditional clustering algorithms cannot handle data with hundreds of dimensions, which leads to low efficiency and poor results. It is therefore urgent to improve existing clustering algorithms or to propose new ones that are more stable while preserving the accuracy of clustering [3].
In June 2014, Rodriguez et al. published "Clustering by fast search and find of density peaks" (referred to as CFSFDP) in Science [4]. This is a new clustering algorithm based on density and distance. The performance of the CFSFDP algorithm is superior to that of other traditional clustering algorithms in many aspects. First, the CFSFDP algorithm is efficient and straightforward: it significantly reduces the computation time because it does not iterate over an objective function. Second, cluster centers can be found intuitively with the help of a decision graph. Third, the CFSFDP algorithm can recognize groups regardless of their shape and spatial dimension. Consequently, shortly after it was proposed, the algorithm was broadly applied in computer vision [5], image recognition [6], and other fields.
Although the CFSFDP algorithm has distinct advantages over other clustering algorithms, it also has some disadvantages, which are as follows:
  • The measurement method for calculating local density and distance still needs to be improved. It does not consider the processing of complex datasets, which makes it difficult to achieve the expected clustering results when dealing with datasets with various densities, multi-scales, or other complex characteristics.
  • The allocation strategy of the remaining points after finding the cluster centers may lead to a “domino effect”, whereby once one point is assigned wrongly, there may be many more points subsequently mis-assigned.
  • As the cut-off distance d_c may differ between datasets, it is usually challenging to determine d_c appropriately. Additionally, the clustering results are sensitive to the choice of d_c.
In 2018, Rui Liu et al. proposed the shared-nearest-neighbor-based clustering by fast search and find of density peaks (SNN-DPC) algorithm to solve some fundamental problems of the CFSFDP algorithm [7]. The SNN-DPC algorithm is characterized by the following innovations.
  • A novel metric to indicate local density is put forward. It makes full use of neighbor information to show the characteristics of data points. This standard not only works for simple datasets, but also applies to complex datasets with multiple scales, cross twining, different densities, or high dimensions.
  • By improving the distance calculation approach, an adaptive measurement method of distance from the nearest point with large density is proposed. The new approach considers both distance factors and neighborhood information. Thus, it can compensate for the points in the low-density cluster, which means that the possibilities of selecting the correct center point are raised.
  • For the non-center points, a two-step point allocation is carried out in the SNN-DPC algorithm. The unassigned points are divided into "determined subordinate points" and "uncertain subordinate points". In the process of calculation, the points are filtered continuously, which improves the possibility of correctly allocating non-center points and avoids the "domino effect".
Although the SNN-DPC algorithm makes up for some deficiencies of the CFSFDP algorithm, it still has some unsolved problems. Most obviously, the number of clusters in the dataset needs to be input manually. Alternatively, if the number of clusters is unknown, cluster centers can be selected through the decision graph. It is worth noting that both methods require some prior knowledge, so the clustering results are less reliable because they depend heavily on human judgment. In order to solve the problem of adaptively determining the cluster centers, a fast searching density peak clustering algorithm based on shared nearest neighbor and adaptive clustering center, called DPC-SNNACC, is proposed in this paper. The main innovations of the algorithm are as follows:
  • The original local density measurement method is updated, and a new density measurement approach is proposed. This approach can enlarge the gap between decision values γ to prepare for automatically determining the cluster center. The improvement can not only be applied to simple datasets, but also to complex datasets with multi-scale and cross winding.
  • A novel and fast method is proposed that can select the number of clustering centers automatically according to the decision values. The “knee point” of decision values can be adjusted according to information of the dataset, then the clustering centers are determined. This method can find the cluster centers of all clusters quickly and accurately.
The rest of this paper is organized as follows. In Section 2, we introduce some research achievements related to the CFSFDP algorithm. In Section 3, we present the basic definitions and processes of the CFSFDP algorithm and the SNN-DPC algorithm. In Section 4, we propose the improvements of SNN-DPC algorithm and introduce the processes of the DPC-SNNACC algorithm. Simultaneously, the complexity of the algorithm is analyzed according to processes. In Section 5, we first introduce some datasets used in this paper, and then discuss some arguments about the parameters used in the DPC-SNNACC algorithm. Further experiments are conducted in Section 6. In this part, the DPC-SNNACC algorithm is compared with other classical clustering algorithms. Finally, the advantages and disadvantages of the DPC-SNNACC algorithm are summarized, and we also point out some future research directions.

2. Related Works

In this section, we briefly review several kinds of widely used clustering methods and elaborate on the improvement of the CFSFDP algorithm.
K-means is one of the most widely used partition-based clustering algorithms because it is easy to implement, is efficient, and has been successfully applied to many practical case studies. The core idea of K-means is to update the cluster center represented by the centroid of the data point through iterative calculation, and the iterative process will continue until the convergence criterion is met [8]. Although K-means is simple and has a high computing efficiency in general, there still exist some drawbacks. For example, the clustering result is sensitive to K; moreover, it is not suitable for finding clusters with a nonconvex shape. PAM [9] (Partitioning Around Medoids), CLARA [10] (Clustering Large Applications), CLARANS [11] (Clustering Large Applications based upon RANdomized Search), and AP [12] (Affinity Propagation) are also typical partition-based clustering algorithms. However, they also fail to find non-convex shape clusters.
The hierarchical-based clustering method merges the nearest pair of clusters from bottom to top. Typical algorithms include BIRCH [13] (Balanced Iterative Reducing and Clustering using Hierarchies) and ROCK [14] (RObust Clustering using linKs).
The grid-based clustering method divides the original data space into a grid structure with a certain size for clustering. STING [15] (A Statistical Information Grid Approach to Spatial Data Mining) and CLIQUE [16] (clustering in QUEst) are typical algorithms of this type, and their complexity relative to the data size is very low. However, it is difficult to scale these methods to higher-dimensional spaces.
The density-based clustering method assumes that an area with a high density of points in the data space constitutes a cluster. DBSCAN [17] (Density-Based Spatial Clustering of Applications with Noise) is the most famous density-based clustering algorithm. It uses two parameters to determine whether the neighborhood of a point is dense: the neighborhood radius eps and the minimum number of points MinPts required within that neighborhood. A minimal usage example is given below.
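For illustration only, the following sketch runs the scikit-learn implementation of DBSCAN on toy data; the parameters eps and min_samples correspond to eps and MinPts above, and the data and parameter values are arbitrary choices rather than settings taken from this paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two well-separated Gaussian blobs (illustrative values only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
               rng.normal(1.0, 0.1, (50, 2))])

# eps is the neighborhood radius and min_samples is MinPts, the minimum
# number of points required for a neighborhood to be considered dense.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(np.unique(labels))  # cluster ids; -1 would mark noise points
```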
Based on the CFSFDP algorithm, many algorithms have been put forward to make some improvements. Research on the CFSFDP algorithm mainly involves the following three aspects:
1. The density measurements of the CFSFDP algorithm.
Xu proposed a method to adaptively choose the cut-off distance [18]. Using the characteristics of Improved Sheather–Jones (ISJ) estimation, the method can accurately estimate d_c.
The Delta-Density based clustering with a Divide-and-Conquer strategy clustering algorithm (3DC) has also been proposed [19]. It is based on the Divide-and-Conquer strategy and the density-reachable concept in Density-Based Spatial Clustering of Applications with Noise (referred to as DBSCAN).
Xie proposed a density peak searching and point assigning algorithm based on the fuzzy weighted K-nearest neighbor (FKNN-DPC) technique to solve the problem of the non-uniformity of point density measurements in the CFSFDP algorithm [20]. This approach uses K-nearest neighbor information to define the local density of points and to search and discover cluster centers.
Du proposed density peak clustering based on K-nearest neighbors (DPC-KNN), which introduces the concept of K-nearest neighbors (KNN) to CFSFDP and provides another option for computing the local density [21].
Qi introduced a new metric for density that eliminates the effect of d_c on the clustering results [22]. This method uses a cluster diffusion algorithm to distribute the remaining points.
Liu suggested calculating two kinds of densities, one based on k nearest neighbors and one based on local spatial position deviation, to handle datasets with mixed density clusters [23].
2. Automatically determine the group numbers.
Bie proposed the Fuzzy-CFSFDP algorithm [24]. This algorithm uses fuzzy rules to select centers for different density peaks, and then the number of final clustering centers is determined by judging whether there are similar internal patterns between density peaks and merging density peaks.
Li put forward the concept of potential cluster center based on the CFSFDP algorithm [25], and considered that if the shortest distance between a potential cluster center and a known cluster center is less than the cut-off distance d_c, then the potential cluster center is redundant. Otherwise, it is considered as the center of another group.
Lin proposed an algorithm [26] that used the radius of neighborhood to automatically select a group of possible density peaks, then used potential density peaks as density peaks, and used CFSFDP to generate preliminary clustering results. Finally, single link clustering was used to reduce the number of clusters. The algorithm can avoid the clustering allocation problem in CFSFDP.
3. Application of the CFSFDP algorithm.
Zhong and Huang applied the improved density and distance-based clustering method to the actual evaluation process to evaluate the performance of enterprise asset management (EAM) [27]. This method greatly reduces the resource investment in manual data analysis and performance sorting.
Shi et al. used the CFSFDP algorithm for scene image clustering [5]. Chen et al. applied it to obtain a possible age estimate according to a face image [6]. Additionally, Li et al. applied the CFSFDP algorithm and entropy information to detect and remove the noise data field from datasets [25].

3. Clustering by Fast Search and Find of Density Peaks (CFSFDP) Algorithm and Shared-Nearest-Neighbor-Based Clustering by Fast Search and Find of Density Peaks (SNN-DPC) Algorithm

3.1. Clustering by Fast Search and Find of Density Peaks (CFSFDP) Algorithm

It is not necessary to consider the probability distribution or multi-dimensional density in the CFSFDP algorithm, as its performance is not affected by the space dimension, which is why it can handle high-dimensional data. Furthermore, it requires neither an iterative process nor additional parameters; the approach is robust with respect to the choice of its only parameter, the cut-off distance d_c.
This algorithm is based on a critical assumption: points with a higher local density and a relatively large distance from other high-density points are more likely to be cluster centers. Therefore, for each data point x_i, only two variables need to be considered, namely its local density ρ_i and its distance δ_i.
The local density ρ_i of data point x_i is defined as Equation (1).

\rho_i = \sum_{x_j \in X,\; x_j \neq x_i} \chi(d_{ij} - d_c), \qquad \chi(a) = \begin{cases} 1, & a < 0 \\ 0, & a \geq 0 \end{cases}    (1)

where d_ij is the Euclidean distance between points x_i and x_j, and d_c is the cut-off distance, i.e., the neighborhood radius of a point; d_c is a hyper-parameter that must be specified by the user.
Equation (1) means that the local density is equal to the number of data points whose distance to x_i is less than the cut-off distance.
δ_i is computed as the minimum distance between point x_i and any point x_j with a higher density than x_i, as defined in Equation (2).

\delta_i = \min_{j:\, \rho_j > \rho_i} (d_{ij})    (2)

It should be noted that for the point with the highest density, its δ value is conventionally obtained by Equation (3).

\delta_{i_{max}} = \max_{j} (d_{ij})    (3)

Then, the points with simultaneously high ρ_i and high δ_i are selected as cluster centers, and the CFSFDP algorithm uses the decision value γ_i of each data point x_i to express the possibility of its becoming a cluster center. The calculation method is shown in Equation (4).

\gamma_i = \rho_i \, \delta_i    (4)

From Equation (4), the higher ρ_i and δ_i are, the larger the decision value is. In other words, a point x_i with higher ρ_i and δ_i is more likely to be chosen as a cluster center. The algorithm introduces a representation called a decision graph to help users select centers; the decision graph plots δ_i as a function of ρ_i for each point.
There is no need to specify the number of clusters in advance, as the algorithm can find the density peaks and identify them as cluster centers, but it requires the user to select the number of clusters by identifying outliers in the decision graph.
After finding the centers, each remaining point is assigned to the cluster of its nearest neighbor of higher density; the information needed for this step is already obtained when calculating δ_i. A minimal sketch of these computations follows.
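As a concrete illustration of Equations (1)–(4), the sketch below computes ρ, δ, and γ with NumPy and SciPy. It is a minimal reading of the definitions, not the authors' implementation, and the cut-off distance dc must be chosen by the caller.

```python
import numpy as np
from scipy.spatial.distance import cdist

def cfsfdp_quantities(X, dc):
    """Compute rho, delta, and gamma as defined in Equations (1)-(4).

    X is an (n, d) array of points and dc is the user-chosen cut-off distance.
    """
    d = cdist(X, X)                        # pairwise Euclidean distances d_ij
    rho = (d < dc).sum(axis=1) - 1         # Eq. (1): neighbors closer than dc (self excluded)
    n = len(X)
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if higher.size:                    # Eq. (2): nearest point of higher density
            delta[i] = d[i, higher].min()
        else:                              # Eq. (3): convention for the densest point
            delta[i] = d[i].max()
    gamma = rho * delta                    # Eq. (4): decision value
    return rho, delta, gamma
```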

3.2. Shared-Nearest-Neighbor-Based Clustering by Fast Search and Find of Density Peaks (SNN-DPC) Algorithm

The SNN-DPC algorithm introduces an indirect distance and density measurement method, taking the influence of the neighbors of each point into consideration and using the concept of shared neighbors to describe the local density of points and the distances between them.
Generally speaking, the larger the number of neighbors shared by two points, the higher the similarity of the two points.
For any two points x_i and x_j in dataset X, the intersection of the K-nearest-neighbor set of x_i and the K-nearest-neighbor set of x_j is defined as the set of shared nearest neighbors, denoted SNN(x_i, x_j). Equation (5) expresses this definition.

SNN(x_i, x_j) = \Gamma(x_i) \cap \Gamma(x_j)    (5)

where Γ(x_i) represents the set of K-nearest neighbors of point x_i and Γ(x_j) represents the set of K-nearest neighbors of point x_j.
For any two points x_i and x_j in dataset X, the SNN similarity is defined as Equation (6).

Sim(x_i, x_j) = \begin{cases} \dfrac{|SNN(x_i, x_j)|^2}{\sum_{p \in SNN(x_i, x_j)} (d_{ip} + d_{jp})}, & \text{if } x_i, x_j \in SNN(x_i, x_j) \\ 0, & \text{otherwise} \end{cases}    (6)

In other words, the SNN similarity is calculated only if points x_i and x_j exist in each other's K-neighbor set; otherwise, the SNN similarity is zero.
Then, the local density is calculated by Equation (7).

\rho_i = \sum_{x_j \in L(x_i)} Sim(x_i, x_j)    (7)

where L(x_i) is the set of the k points with the highest similarity to x_i. The local density is therefore the sum of the similarities of the k points most similar to point x_i.
From this definition, it can be seen that the calculation of the local density ρ_i not only uses distance information, but also captures information about the clustering structure through the SNN similarity, which fully reveals the internal relevance between points. A sketch of Equations (5)–(7) is given below.
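The following sketch implements Equations (5)–(7) directly. It reflects one possible reading in which each K-nearest-neighbor set includes the point itself, and it is written for clarity rather than speed.

```python
import numpy as np
from scipy.spatial.distance import cdist

def snn_similarity_and_density(X, k):
    """Sketch of Equations (5)-(7): SNN similarity matrix and local density."""
    d = cdist(X, X)
    n = len(X)
    knn = np.argsort(d, axis=1)[:, :k]            # K-nearest-neighbor sets Gamma(x_i)
    knn_sets = [set(row) for row in knn]
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if i in knn_sets[j] and j in knn_sets[i]:      # mutual-neighbor test of Eq. (6)
                shared = knn_sets[i] & knn_sets[j]         # SNN(x_i, x_j), Eq. (5)
                denom = sum(d[i, p] + d[j, p] for p in shared)
                if denom > 0:
                    sim[i, j] = sim[j, i] = len(shared) ** 2 / denom
    # Eq. (7): local density is the sum of the k largest similarities of each point.
    rho = np.sort(sim, axis=1)[:, -k:].sum(axis=1)
    return sim, rho
```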
Regarding δ, the SNN-DPC algorithm also uses the distance from the nearest point of larger density, but adds a compensation mechanism based on neighborhood distances, so that the δ value of a point in a low-density cluster may also be high. The definition of this distance is shown in Equation (8).

\delta_i = \min_{x_j:\, \rho_j > \rho_i} \left[ d_{ij} \left( \sum_{p \in \Gamma(x_i)} d_{ip} + \sum_{q \in \Gamma(x_j)} d_{jq} \right) \right]    (8)

The δ value of the point with the highest density is the largest δ_j among all the remaining points in the dataset X, as given by Equation (9).

\delta_{i_{max}} = \max_{x_j \in X \setminus \{x_i\}} (\delta_j)    (9)

This distance not only considers the distance factor, but also the neighbor information of each point, thereby compensating the points in low-density clusters and making δ more reliable. In other words, this method can be adapted to datasets with different densities. A sketch follows.
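A minimal sketch of Equations (8)–(9), reusing the K-nearest-neighbor sets from the previous sketch; the handling of ties in density is a simplification.

```python
import numpy as np
from scipy.spatial.distance import cdist

def snn_delta(X, rho, k):
    """Sketch of Equations (8)-(9): distance to the nearest higher-density point,
    weighted by the summed distances of both points to their own k nearest neighbors."""
    d = cdist(X, X)
    n = len(X)
    knn = np.argsort(d, axis=1)[:, :k]
    comp = d[np.arange(n)[:, None], knn].sum(axis=1)   # sum of distances to own k neighbors
    delta = np.full(n, -1.0)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if higher.size:                                # Eq. (8)
            delta[i] = (d[i, higher] * (comp[i] + comp[higher])).min()
    unset = delta < 0                                  # highest-density point(s)
    delta[unset] = delta[~unset].max() if (~unset).any() else d.max()  # Eq. (9)
    return delta
```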
The definition of the decision value is the same as in the CFSFDP algorithm. The formula is given in Equation (10).

\gamma_i = \rho_i \, \delta_i    (10)

According to the decision graph, the cluster centers can then be determined.
In the SNN-DPC algorithm, the unallocated points are divided into two categories: inevitable subordinate points and possible subordinate points.
Let x_i and x_j be two different points in dataset X. Only when at least half of the k neighborhoods of x_i and x_j are shared can they be assigned directly to the same cluster. The condition is expressed by Equation (11).

|\{\, p \mid p \in \Gamma(x_i) \wedge p \in \Gamma(x_j) \,\}| \geq k/2    (11)

If an unassigned point does not meet the criterion for an inevitable subordinate point, it is defined as a possible subordinate point. This condition is reflected by Equation (12).

0 < |\{\, p \mid p \in \Gamma(x_i) \wedge p \in \Gamma(x_j) \,\}| < k/2    (12)

According to Equation (11), the inevitable subordinate points can be allocated first. For the points that do not meet this condition, the neighborhood information is used to further determine the cluster they belong to. During the whole process of the algorithm, this information is updated continuously to achieve better results. The shared-neighborhood test of Equation (11) is sketched below.
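The shared-neighborhood test of Equation (11) can be sketched as follows; recomputing the distance matrix inside the function is only for self-containment.

```python
import numpy as np
from scipy.spatial.distance import cdist

def is_inevitable_subordinate(X, k, i, j):
    """Eq. (11): x_i and x_j can be placed in the same cluster directly if they
    share at least half of their k nearest neighbors."""
    d = cdist(X, X)
    knn = np.argsort(d, axis=1)[:, :k]
    shared = set(knn[i]) & set(knn[j])    # neighbors common to both points
    return len(shared) >= k / 2
```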

3.3. Analysis of the SNN-DPC Algorithm

Compared with the CFSFDP algorithm, the SNN-DPC algorithm is a significant improvement. By considering the information of shared neighbors, the SNN-DPC algorithm is applicable to datasets with different conditions, and its two-step allocation method for the non-center points allows it to cluster variable-density and non-convex datasets. However, a method for determining the cluster centers without prior knowledge has not been explored in the SNN-DPC algorithm; in other words, manual intervention is still needed. Moreover, this approach is more effective for datasets with one unique density peak per cluster or with a small number of clusters, because in such cases it is easier to identify the data points with relatively large ρ and δ from the decision graph. For the following cases, however, the decision graph shows obvious limitations:
  • For datasets with an unknown number of clusters, choosing the cluster number is greatly affected by human subjectivity.
  • In addition to the apparent density peaks in the decision graph, some data points with relatively large ρ and small δ, or relatively small ρ and large δ, may be cluster centers. Such points are easily overlooked, so too few cluster centers may be selected, and data points from different clusters are then mistakenly merged into the same group.
  • If there are multiple density peaks in the same group, these points can be wrongly selected as redundant cluster centers, resulting in the same cluster being improperly divided into sub-clusters.
  • When dealing with datasets with many clustering centers, it is also easier to choose the wrong clustering centers.
The SNN-DPC algorithm is therefore sensitive to the selection of clustering centers, and it is likely that too few or too many clustering centers are selected manually. This defect is more prominent when dealing with certain particular datasets.

4. Fast Searching Density Peak Clustering Algorithm Based on Shared Nearest Neighbor and Adaptive Clustering Center (DPC-SNNACC) Algorithm

Through the analysis in Section 3.3, it can be seen that the SNN-DPC algorithm needs to know the number of clusters in advance. A method that can adaptively find the number of clusters on the basis of the SNN-DPC algorithm was proposed to solve this problem. The improvements in the DPC-SNNACC algorithm are mainly reflected in two aspects. On one hand, an improved calculation method of local density is proposed. On the other hand, the position of the knee point is obtained by calculating the change of the difference between decision values. Then, we can get the center-points and a suitable number of clusters. In this part, we will elaborate on the improvements.

4.1. Local Density Calculation Method

The local density of the SNN-DPC algorithm is defined through the similarity and the number of shared neighbors between two points. For some datasets with unbalanced sample numbers and different densities between certain clusters, the distinction between center and non-center points is vague when choosing the cluster centers from the decision graph. In order to better identify the clustering centers, we use the squared value of the original local density. The enhanced local density ρ_i is defined as Equation (13).

\rho_i = \left( \sum_{x_j \in L(x_i)} Sim(x_i, x_j) \right)^2    (13)

where L(x_i) is the set of the k points with the highest similarity to x_i, and Sim(x_i, x_j) stands for the SNN similarity, which is calculated based on a symmetric distance matrix.
Through experimental analysis, we can conclude that changing the density has a greater impact on the γ value than changing δ. Thus, in order to limit the complexity of the algorithm, we only modify ρ and keep δ unchanged. In addition, the γ values follow the change of ρ according to Equation (10): if the difference in ρ between points increases, so does the difference in γ. We use γ' to denote the decision values ranked in ascending order; we can then plot a novel decision graph with the point index on the x-axis and γ' on the y-axis. The subsequent analysis of the decision values is based on this new decision graph, and the specific analysis of γ' is illustrated in Section 5. A minimal sketch of the enhanced decision values is given below.
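A minimal sketch of the enhanced decision values, assuming the SNN similarity matrix sim and the distances delta computed in the earlier sketches:

```python
import numpy as np

def enhanced_decision_values(sim, delta, k):
    """Eq. (13) and Eq. (10): square the SNN local density, multiply by delta,
    and also return the decision values sorted in ascending order (gamma')."""
    rho = np.sort(sim, axis=1)[:, -k:].sum(axis=1) ** 2   # Eq. (13): squared SNN density
    gamma = rho * delta                                    # Eq. (10): decision value
    return gamma, np.sort(gamma)                           # gamma and ascending gamma'
```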

4.2. Adaptive Selection of Cluster Centers

Through observation, it can easily be inferred that the γ' values of the cluster center points are relatively large and far away from the decision values of the non-center points, whereas the γ' values of the non-center points are usually small and remain largely unchanged. Based on this characteristic, we propose a method to find the knee point, described by Equations (14)–(17).
The search range for clustering centers can be restricted to the points with the largest γ' values. On one hand, reducing the search range of the decision values reduces the computation time, as we do not need to search the complete data. On the other hand, removing elements that have little impact on the results also has a positive effect on the accuracy of the clustering results. Taking the square root is a quick and straightforward way to reduce the size of a value. If the number of data points in dataset X is n and DN is the integer closest to √n, then we restrict the processing range of the points to [DN, n].

k_p = \max \{\, i \mid |\mu_{i+1} - \mu_i| \geq \theta,\; i = n - DN + 1, n - DN + 2, \ldots, n - 2 \,\}    (14)

Specifically,

DN = [\sqrt{n}\,]    (15)

\theta = \frac{1}{DN - 2} \sum_{i = n - DN + 1}^{n - 2} |\mu_{i+1} - \mu_i|    (16)

\mu_i = \gamma'_{i+1} - \gamma'_i, \quad i = n - DN + 1, n - DN + 2, \ldots, n - 1    (17)

where [·] denotes rounding to the nearest integer and μ_i is the difference between adjacent γ' values. The knee point k_p is the largest index for which the change in the difference between adjacent γ' values exceeds a threshold; we use the average change in difference over the DN largest γ' values as the threshold θ. The specific reasons for choosing DN in this way are explained in Section 5.4.
After obtaining the position of the knee point k_p, the number of clusters is the number of γ' values from γ'_{k_p} to γ'_{max}, and the points corresponding to these γ' values are the cluster centers of the respective clusters.
We take the Aggregation dataset (described in Section 5.1) as an example. It includes 788 data points in total (n = 788), so DN = [√n] = 28, and the subsequent calculation only needs to consider the 28 largest γ' values. First, the 27 difference values μ_i are calculated from the 28 largest γ' values according to Equation (17); then the average value θ is calculated from the 26 changes in μ_i by Equation (16). Next, according to Equation (14), the largest index whose change in difference is greater than θ is found to determine k_p. For this dataset, k_p = max{ i | |μ_{i+1} − μ_i| ≥ θ, i = 761, 762, ..., 786 } = 782, so the knee point is at position 782; the cluster centers are the seven points corresponding to γ'_782 through γ'_788. This procedure is sketched in code below.
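The knee-point search of Equations (14)–(17) can be sketched as follows; indices are handled so that the Aggregation example above (n = 788, DN = 28, k_p = 782) yields seven centers.

```python
import numpy as np

def number_of_centers(gamma):
    """Sketch of Equations (14)-(17): locate the knee point in the ascending
    decision values and return the resulting number of cluster centers."""
    g = np.sort(gamma)                        # gamma', ascending
    n = len(g)
    dn = int(round(np.sqrt(n)))               # Eq. (15): DN = [sqrt(n)]
    mu = np.diff(g[n - dn:])                  # Eq. (17): DN-1 adjacent differences
    jumps = np.abs(np.diff(mu))               # |mu_{i+1} - mu_i|, DN-2 values
    theta = jumps.mean()                      # Eq. (16): average change as threshold
    # Eq. (14): largest 1-based position in gamma' whose change exceeds theta.
    kp = (n - dn + 1) + np.where(jumps >= theta)[0].max()
    return n - kp + 1                         # centers are the points ranked kp..n

# For a dataset like Aggregation (n = 788), dn = 28 and the procedure inspects
# only the 28 largest decision values before reporting the number of centers.
```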
As shown in Figure 1, the red horizontal line in the figure distinguishes the selected group center points from the non-center points. The color points above the red line are the clustering center points chosen by the algorithm, and the blue points below the red line are the non-center points.

4.3. Processes

The whole process of the DPC-SNNACC algorithm still includes three major steps: the calculation of ρ and δ , the selection of cluster centers, and the distribution of non-center points.
Suppose the dataset X = (x_1, ..., x_n) and the number of neighbors k are given. Clustering aims to divide the dataset X into NC classes, where NC is the number of clusters. The set of cluster centers is Center = {c_1, c_2, ..., c_NC}, and the result of clustering is Φ = {C_1, C_2, ..., C_NC}. In the following steps, D_{n×n} is the Euclidean distance matrix, ρ is the local density, δ is the distance from the nearest point of larger density, γ is the decision value, and γ' is the decision value sorted in ascending order. Φ = {C_1, C_2, ..., C_NC, x_p, x_{p+1}, ...} represents the initial clustering result, where x_p, x_{p+1}, ... are the non-central points that are still unassigned. M is an ergodic matrix whose rows correspond to unallocated points and whose columns correspond to clusters. The symbols used in this paper are summarized in Table A1 of the Appendix.
Step 1:
Initialize dataset X and standardize all data points so that their values are in the range [0, 1]. Then, calculate the distance matrix D_{n×n} = {d_ij}.
Step 2:
Calculate the similarity matrix of SNN according to Equation (5).
Step 3:
According to Equation (13), calculate local density ρ .
Step 4:
Calculate the distance from the nearest larger density point δ according to Equations (8) and (9).
Step 5:
Calculate the decision value γ according to Equation (10), and arrange it in ascending order to get the sorted γ .
Step 6:
According to the knee point condition defined in Equation (14), determine the position of the knee point k_p and hence the number of clusters NC (the points corresponding to γ'_{k_p} through γ'_{max}, i.e., NC = n − k_p + 1); the set of cluster centers Center = {c_1, c_2, ..., c_NC} is then determined.
Step 7:
Initialize queue Q , push all center points into Q .
Step 8:
Take the head of queue Q as x_a and remove it from Q; find the k nearest neighbors of x_a, whose set is denoted K_a.
Step 9:
Take an unallocated point x ∈ K_a. If x meets the condition defined in Equation (11), then classify the data point x into the cluster where x_a is located and add x to the end of the queue Q. Otherwise, continue to the next point in K_a and determine its assignment. After all the points in K_a have been judged, return to step 8 to process the next determined subordinate point in queue Q.
Step 10:
When Q = ∅, the initial clustering result is Φ = {C_1, C_2, ..., C_NC, x_p, x_{p+1}, ...}.
Step 11:
Find all unallocated points x_p, x_{p+1}, ... and re-number them. Then, define an ergodic matrix M for the distribution of possible subordinate points; the rows of M index the unassigned points and the columns represent the clusters.
Step 12:
Find the maximum value M_imax in each row of M, and assign the unallocated point of that row to the cluster in which M_imax is located. Update matrix M until all points are assigned.
Step 13:
Output the final clustering results Φ = { C 1 , C 2 , , C N C } .
We used a simple dataset as an example. This dataset contains 16 data points, X = (x_1, ..., x_16), which can be divided into two categories according to their relative positions. Suppose the number of neighbors k is 5. As shown in Figure 2, there are two clusters in X: the red points numbered 1 to 8 belong to one cluster, and the blue points numbered 9 to 16 form the other cluster. Clustering aims to divide the dataset X into these two classes.
According to steps 1 to 6 above, we can calculate the ρ and δ of every point in dataset X. Then, the γ of each point can be calculated and arranged in ascending order to obtain γ'; the distribution of γ' is shown in Figure 3. Through step 6, we can easily obtain the number of clusters. Furthermore, the cluster center set is Center = {c_1, c_2}, corresponding to the points x_4 and x_10 for the two clusters, respectively.
To determine the inevitable subordinate points, we push all points of Center into Q, so Q = {x_4, x_10}. First, we take x_4 as x_a and simultaneously pop x_4 out of queue Q, so the queue becomes Q = {x_10}. Then, we find the five neighbors of x_4, K_4 = {x_4, x_2, x_6, x_7, x_5}. Point x_2 meets the condition defined in Equation (11), so we add x_2 to the cluster of x_4 and push x_2 into Q, giving Q = {x_10, x_2}. Next, the points in K_4 are checked in order, and the same steps are repeated until Q is empty. The initial clustering result is Φ = {C_1, C_2, x_8, x_9}.
The unallocated points are x_8 and x_9. The five neighbors of x_8 are {x_8, x_9, x_1, x_3, x_2} and the neighbors of x_9 are {x_9, x_8, x_14, x_15, x_3}. We can define the ergodic matrix M as shown in Table 1. In matrix M, the rows represent the unallocated points x_8 and x_9, and the columns stand for Cluster 1 centered on x_4 and Cluster 2 centered on x_10.
Table 1 shows that most of the neighbors of point x_8 belong to Cluster 1, so x_8 is assigned to Cluster 1 with x_4 as the cluster center. Similarly, point x_9 belongs to Cluster 2 with x_10 as the cluster center. Finally, all points in dataset X are assigned, and we obtain the clustering result. A sketch of this possible-subordinate-point assignment is given below.
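One possible reading of Steps 11–12, in which the ergodic matrix M holds, for each unassigned point, the number of its k nearest neighbors already belonging to each cluster:

```python
import numpy as np

def assign_possible_subordinates(knn, labels):
    """Sketch of Steps 11-12: each still-unlabeled point is assigned to the cluster
    holding most of its k nearest neighbors (the row maximum of the ergodic matrix M).
    labels uses -1 for unassigned points; knn is the (n, k) neighbor-index array."""
    labels = labels.copy()
    while (labels == -1).any():
        progressed = False
        for i in np.where(labels == -1)[0]:
            neighbor_labels = labels[knn[i]]
            counts = np.bincount(neighbor_labels[neighbor_labels >= 0])
            if counts.size:                   # at least one labeled neighbor
                labels[i] = counts.argmax()   # cluster with the row maximum M_imax
                progressed = True
        if not progressed:                    # guard against points with no labeled neighbors
            break
    return labels
```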

4.4. Analysis of Complexity

In this section, we analyze the time complexity of the algorithm according to the algorithm steps in Section 4.3. The time complexity corresponding to each step is analyzed as follows.
Step 1:
Normalize the points into the range [0, 1], so the time complexity is about O(n). Calculate the distance between each pair of points; the time complexity is O(n^2).
Step 2:
Calculate the number of shared nearest neighbors between each pair of points; the time complexity is O(n^2). The intersection can be computed within O(k) using a hash table. Generally speaking, the time complexity of step 2 is O(k n^2).
Step 3:
Calculate the initial local density according to the number of shared nearest neighbors k and square it, so the time complexity of step 3 is O(kn).
Step 4:
Calculate the distance from the nearest point of larger density, so the time complexity is O(n^2).
Step 5:
Calculate the decision value for each point and sort the values in ascending order, so the complexity is O(n log n).
Step 6:
Calculate the changes in the differences between adjacent decision values among the n points, so the time complexity is about O(n).
Steps 7–10:
The total complexity is O(NC n^2).
The total time complexity of Steps 7–10 is the basic loop O(NC) times the highest complexity within the loop, which is O(n^2). As the k nearest neighbors of each point are already available from the earlier steps, looking them up is recorded as O(1). Therefore, the total time complexity is O(NC n^2).
Steps 11–13:
The total complexity is O((NC + k) n^2).
The total time complexity of Steps 11–13 is the basic loop O(n) times the highest complexity in the loop, which is O(kn) or O(NC n); therefore, the total time complexity is O(k n^2) or O(NC n^2), and we can combine them into O((NC + k) n^2).
In summary, the time complexity of the entire DPC-SNNACC algorithm is O((NC + k) n^2).

5. Discussion

Before the experiment, some parameters should be discussed. The purpose was to find the best value of each parameter related to the DPC-SNNACC algorithm. First, some datasets and metrics are introduced. Second, we discuss the performance of the DPC-SNNACC algorithm from several aspects, including k, γ', and DN. The optimal parameter values corresponding to the optimal metrics were found through comparative experiments.

5.1. Introduction to Datasets and Metrics

The performance of a clustering algorithm is usually verified on benchmark datasets. In this paper, we applied 14 commonly used datasets [28,29,30,31,32,33,34,35,36], containing eight synthetic datasets in Table 2 and four UCI (University of California, Irvine) real datasets in Table 3. The tables list the basic information, including the number of data records, the number of clusters, and the data dimensions. The datasets in Table 2 are two-dimensional for the convenience of graphic display; compared with the synthetic datasets, the dimensions of the real datasets in Table 3 are usually greater than 2.
The evaluation metrics for clustering algorithms include internal and external metrics. Generally speaking, internal metrics are suitable when the data labels are unknown, while external metrics reflect the clustering quality well when the data labels are known. As the datasets used in this experiment are labeled, several external evaluation metrics were used to judge the accuracy of the clustering results, including normalized mutual information (NMI) [37], adjusted mutual information (AMI) [38], adjusted Rand index (ARI) [38], F-measure [39], accuracy [40], and the Fowlkes-Mallows index (FMI) [41]. The maximum value of each of these metrics is 1, and the larger the value, the higher the accuracy. Most of these metrics can be computed directly, as sketched below.
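When ground-truth labels are available, several of these external metrics can be computed with scikit-learn; the following sketch covers NMI, AMI, ARI, and FMI, while the F-measure and accuracy variants used for clustering require an additional cluster-to-class matching step and are therefore left out here.

```python
from sklearn import metrics

def external_scores(true_labels, pred_labels):
    """Four of the external metrics used in this paper, computed with scikit-learn."""
    return {
        "NMI": metrics.normalized_mutual_info_score(true_labels, pred_labels),
        "AMI": metrics.adjusted_mutual_info_score(true_labels, pred_labels),
        "ARI": metrics.adjusted_rand_score(true_labels, pred_labels),
        "FMI": metrics.fowlkes_mallows_score(true_labels, pred_labels),
    }
```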

5.2. The Influence of k on the Metrics

The only parameter that needs to be determined in the DPC-SNNACC algorithm is the number of nearest neighbors k . In order to analyze the impact of k on the algorithm, the Aggregation dataset was used as shown in Figure 4.
To select the optimal number of neighbors, we increased the number of neighbors k from 5 to 100. For the lower boundary, if the number of neighbors is too low, the density information becomes too sparse to reflect similarity, and a small k may cause errors on some datasets; thus, the lower limit was set to 5. For the upper limit, an excessively large k makes the algorithm more complex and slower to run, and also degrades the results; the analysis of k shows that increasing it further has no additional impact on the results, so further tests are of little significance. We therefore set 100 as the upper limit.
When k ranges from 5 to 100, the corresponding metrics oscillate noticeably, and the trends of the selected metrics AMI, ARI, and FMI are more or less the same. Therefore, we can use a single metric to stand for the others: for example, when the AMI metric reaches its optimal value, the other external metrics are close to their optimal values as well, and the corresponding k is taken as the best number of neighbors. Additionally, the change in the metrics shows that each metric value tends to stabilize as k increases, whereas an excessively large k leads to a decrease in the metrics. Therefore, if a fixed k value were defined in advance without experiment, the optimal clustering result could not be obtained and the significance of the similarity measurement would be lost. In the case of the Aggregation dataset, the metric values were highest when k = 35, so 35 can be selected as the best neighbor number for the Aggregation dataset. This selection procedure is sketched below.
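As an illustration of this selection procedure, the sketch below scans k and keeps the value with the highest AMI; cluster_fn is a hypothetical callable standing in for a run of the DPC-SNNACC algorithm that returns predicted labels.

```python
from sklearn.metrics import adjusted_mutual_info_score

def best_k(X, true_labels, cluster_fn, k_values=range(5, 101)):
    """Scan the neighbor count k and keep the value giving the highest AMI.
    cluster_fn(X, k) is a placeholder for running DPC-SNNACC and returning labels."""
    scores = {k: adjusted_mutual_info_score(true_labels, cluster_fn(X, k))
              for k in k_values}
    return max(scores, key=scores.get), scores
```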

5.3. The Comparison of γ Values between SNN-DPC and DPC-SNNACC

Because we change the calculation method of the local density, the difference between adjacent γ' values increases. Two different datasets are used to illustrate the comparison between the two algorithms in Figure 5 and Figure 6. The values of k used are the best parameters for the respective algorithms.
Comparing and analyzing the above figures shows that both the SNN-DPC algorithm and the DPC-SNNACC algorithm select the correct number of cluster centers on the Jain and Spiral datasets; the distinction lies in the differences between the γ' values. In other words, the improved method shows a larger gap between the cluster centers and the non-center points, which indicates that the DPC-SNNACC algorithm can be used to identify the cluster centers while reducing unnecessary errors.

5.4. The Influence of Different D N on Metrics

As mentioned above, DN represents the search range of the clustering centers and is the integer closest to √n. In order to reduce the search range of the decision values γ' of the cluster centers, we use the integer closest to √n as the search scale for the clustering centers. In this part, the effects of using several other values instead of √n for DN are analyzed further; this section uses the Aggregation and Compound datasets to illustrate the problem.
As can be seen from Figure 7, Figure 8 and Figure 9, DN = √n yields a better number of clusters than the other values of DN in terms of the distribution of γ' and the clustering results. Furthermore, when DN = n/2, the algorithm loses the exact number of clusters.
Table 4 uses a number of metrics to describe the problem objectively; it reports the clustering situations when DN takes different values, with bold type indicating the best clustering situation. DN = √n obtained higher metrics than n and n/2, indicating that √n is the best value for determining DN.
It can be seen from Figure 10, Figure 11 and Figure 12 that the different values of DN obtained the same number of clusters, but the corresponding metrics were quite different. Comparing Figure 10b and Figure 11b, it can clearly be seen that the case of √n correctly separated each cluster, while the case of n divided a complete cluster into three parts and merged two upper-left clusters that should have been separated.
The conclusions from Table 5 are similar to those from Table 4: when n and n/2 were selected as DN, the performance was not as good as with √n. This means that choosing √n as the search range of the decision values is reasonable for determining the cluster centers.

6. Experiment

6.1. Preprocessing and Parameter Selection

Before starting the experiment, we had to eliminate the effects of missing values and dimension differences on the datasets. For missing values, we could replace them with the average of all valid values of the same dimension. For data preprocessing, we used the “min-max normalization method”, as shown in Equation (18), to make all data linearly map into [0, 1]. In this way, we could reduce the influence of different measures on the experimental results, eliminate dimension differences, and improve the calculation efficiency [42].
x'_{ij} = \frac{x_{ij} - \min(x_j)}{\max(x_j) - \min(x_j)}    (18)

where x'_ij is the normalized value of the i-th sample in the j-th dimension, x_ij is the original value of the i-th sample in the j-th dimension, and x_j denotes the original data of the entire j-th dimension. A sketch of this normalization follows.
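A minimal sketch of Equation (18), with a guard for constant columns (an assumption not discussed in the paper):

```python
import numpy as np

def min_max_normalize(X):
    """Eq. (18): column-wise min-max normalization into [0, 1].
    Constant columns are mapped to 0 to avoid division by zero."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range
```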
First, to reflect the actual performance of the algorithms more objectively, we adjusted the parameters for each dataset to ensure the best performance of each algorithm. Specifically, the exact cluster numbers were given to K-means, SNN-DPC, and CFSFDP, whereas the DPC-SNNACC algorithm does not receive the number of clusters in advance; it is calculated automatically by the algorithm itself. When running the SNN-DPC and the DPC-SNNACC algorithms, we increased the number of neighbors k from 5 to 100 and conducted experiments to find the best clustering neighbor number k. The authors of the traditional CFSFDP algorithm provide a rule of thumb that sets the number of neighbors between 1% and 2% of the total number of points; we approached the best result by adjusting the d_c value accordingly in preliminary experiments.

6.2. Results on Synthetic Datasets

In this part, we tested the DPC-SNNACC algorithm with eight synthetic datasets. Figure 13 and Figure 14 intuitively show the visual comparisons among the four algorithms by using two-dimensional images. The color points represent the points assigned to different groups, and the pentagram points are clustering centers.
The comprehensive clustering results on the Aggregation dataset in Figure 13 showed that better clustering results could be achieved by both the SNN-DPC and the DPC-SNNACC algorithms, whereas a complete cluster was split by the K-means and CFSFDP algorithms. To be specific, the SNN-DPC and the DPC-SNNACC algorithms found the same clustering centers, which means that the improved algorithm achieved gratifying results in determining the centers. By contrast, the K-means algorithm in Figure 13a divided a complete cluster at the bottom of the two-dimensional image from the middle into two independent clusters and merged the left part with two similar groups into one cluster, while the CFSFDP algorithm in Figure 13b divided it into upper and lower clusters, because several data points of the upper group were connected to an adjacent small group and were therefore combined into a single cluster.
The clustering results shown in Figure 14 demonstrate that the SNN-DPC and the DPC-SNNACC algorithms identified the cluster correctly, but K-means and CFSFDP failed to do so. For the K-means algorithm, a small part of the lower cluster was allocated to the upper cluster, so the whole dataset looked as if it had been cut diagonally into two parts; the clustering result was obviously not desirable. For the CFSFDP algorithm, both sides of the lower cluster were allocated to the upper cluster, again leading to an upper part and a lower part, which was still not what we expected.
Table 6 shows the results in terms of the NMI, AMI, ARI, F-measure, Accuracy, FMI, NC, and Time evaluation metrics on the synthetic datasets, where the symbol "–" indicates an entry that is not applicable, and the best results are shown in bold.
As reported in Table 6, the improved algorithm competes favorably with the other three clustering algorithms. The NC column shows the number of clusters; the NC of the first three algorithms is given in advance, whereas the number of clusters of the DPC-SNNACC algorithm is determined adaptively by the improved method. The values in the NC column are the same for each dataset, indicating that the DPC-SNNACC algorithm could correctly identify the number of clusters in the selected datasets. Compared with K-means, the other three algorithms could obtain fairly good clustering performance. The Spiral dataset is usually used to verify the ability of an algorithm to handle cross-winding datasets; it is clear that the K-means algorithm failed to perform well, while the other three clustering algorithms reached the optimal result. This may be because the other three algorithms use both ρ and δ to describe each point, which makes the characteristics more distinguishable, whereas the K-means algorithm only uses distance to classify points and thus cannot adequately represent the characteristics of points; in other words, the K-means algorithm cannot deal with cross-winding datasets. In some cases, the CFSFDP algorithm cannot recognize the clusters effectively and misassigns a number of points, which may be due to its simplistic density estimation as well as its approach to allocating the remaining points. The DPC-SNNACC and the SNN-DPC algorithms outperformed the other clustering algorithms on nearly all of these synthetic datasets, except for CFSFDP on Jain. Furthermore, every algorithm obtained good clustering results on the R15 dataset. Similarly, on the D31 dataset, the results of K-means and CFSFDP were almost as good as those of the SNN-DPC and the DPC-SNNACC algorithms, which means that these algorithms can be applied to datasets with many groups and larger sizes. Although the DPC-SNNACC algorithm runs a little longer, its running time is still within an acceptable range.

6.3. Results on UCI (University of California, Irvine) Datasets

In this part, the improved algorithm was subjected to further tests with four UCI real datasets.
The Wine dataset consists of three clusters with few intersections among them. From the results shown in Figure 15, all four algorithms could divide the data into three clusters, but there were some differences in the classification of the boundary regions. For instance, the CFSFDP algorithm had overlapping parts between the second and third clusters from top to bottom, while for the other algorithms the three clusters were roughly independent. In addition, the less-than-perfect result of CFSFDP is also reflected in the distance between the centers of the second and third groups, which was much smaller than for the other algorithms. The other three algorithms could basically achieve correct clustering results.
Figure 16 shows the performance of the four algorithms on the WDBC dataset. As shown in the figure, all of the algorithms except CFSFDP found the cluster centers of the breast cancer database (WDBC). Specifically, as the distribution density of the dataset is uneven while the overall density is similar, the CFSFDP algorithm classified most of the points into one group, resulting in a poor clustering effect. The K-means, DPC-SNNACC, and SNN-DPC algorithms could find the cluster centers correctly and allocate the points reasonably, but a closer inspection revealed that DPC-SNNACC and SNN-DPC could distinguish the two clusters well, while the K-means algorithm had some crossing points in the middle between the two clusters.
Again, the performances of the four algorithms were benchmarked in terms of NMI, AMI, ARI, F-measure, Accuracy, and FMI. Table 7 displays the performance of the four algorithms on the various datasets; the symbol "–" in the table means that the entry has no actual value, and the best results are shown in bold. As with the synthetic datasets, the DPC-SNNACC algorithm could correctly identify the number of clusters, which means that the improved algorithm can adaptively determine the cluster centers. Furthermore, the SNN-DPC and DPC-SNNACC algorithms were superior to the other algorithms in terms of the metrics. Essentially every metric of the DPC-SNNACC algorithm was on par with the SNN-DPC algorithm, indicating great potential in clustering. Although the metric values of DPC-SNNACC on the WDBC dataset were slightly lower than those of the SNN-DPC algorithm, they were much higher than those of K-means. Furthermore, the CFSFDP algorithm had trouble with the WDBC dataset, which indicates that the DPC-SNNACC algorithm is more robust than CFSFDP. In terms of time, the DPC-SNNACC algorithm was similar to the others, with differences within 0.7 s.
Through the above-detailed analysis of the performance of the DPC-SNNACC algorithm with other clustering algorithms on the synthetic and real-world datasets from the UCI, we can conclude that the DPC-SNNACC algorithm had better performance than the other common algorithms in clustering, which substantiates its potentiality in clustering. Most importantly, it could find the cluster centers adaptively, which is not a common characteristic of the other algorithms.

6.4. Running Time

The execution efficiency of an algorithm is usually an important metric for evaluating its performance, and running time is often used to represent execution efficiency. This section compares the DPC-SNNACC algorithm with the SNN-DPC, CFSFDP, and K-means algorithms in terms of time complexity. At the same time, the clustering time on the synthetic and real datasets is compared in light of the analysis in Section 4.3 and Section 4.4 to judge the advantages and disadvantages of the DPC-SNNACC algorithm.
Table 8 shows the time complexity comparison of the four clustering algorithms. It can be seen from the table that the time complexity of the K-means algorithm is the lowest, the time complexity of the CFSFDP algorithm ranks second, and the time complexities of the SNN-DPC and the DPC-SNNACC algorithms are the highest. However, k and NC are usually much smaller than n, so they have little effect on the time complexity of the algorithms.
The last columns of Table 6 and Table 7 show the time needed by each algorithm to cluster different datasets. Table analysis showed that for most datasets, the time of the K-means algorithm was the shortest, followed by CFSFDP. The time of the SNN-DPC and the DPC-SNNACC algorithms were longer, which is consistent with the time complexity, but the difference in the running time of each dataset algorithm was less than 1 s, which can be acceptable. Therefore, even if the time of the DPC-SNNACC algorithm is not optimal, it is still desirable.

7. Conclusions

In this paper, in order to solve the problem that the SNN-DPC algorithm needs to select cluster centers through a decision graph or needs to input the cluster number manually, we proposed an improved method called DPC-SNNACC. By optimizing the calculation method of local density, the difference in the local density among different points becomes larger as does the difference in the decision values. Then, the knee point is obtained by calculating the change in decision values. The points with a high decision value are selected as clustering centers, and the number of clustering centers is adaptively obtained. In this way, the DPC-SNNACC algorithm can solve the problem of clustering for unknown or unfamiliar datasets.
The experimental and comparative evaluation of several datasets from diverse domains established the viability of the DPC-SNNACC algorithm. It could correctly obtain the clustering centers, and almost every metric met the standard of the SNN-DPC algorithm, which was superior to the traditional CFSFDP and K-means algorithms. Moreover, the DPC-SNNACC algorithm is applicable to datasets of different dimensions and sizes. Although it has some shortcomings, such as a relatively long running time, this remains within an acceptable range. In general, the DPC-SNNACC algorithm not only retains the advantages of the SNN-DPC algorithm, but also solves the problem of determining the cluster number adaptively. Furthermore, it is robust to noise and to differences in cluster density.
In future work, first, we can further explore the clustering algorithm based on shared neighbors, find a more accurate method to automatically determine k , and simplify the process of determining the algorithm parameters. Second, the DPC-SNNACC algorithm can be combined with other algorithms to give full play to the advantages of other algorithms and make up for the shortcomings of the DPC-SNNACC algorithm. Third, the algorithm can be applied to some practical problems to increase its applicability.

Author Contributions

Conceptualization, Y.L.; Formal analysis, Y.L.; Supervision, M.L.; Validation, M.L.; Writing—original draft, Y.L.; Writing—review & editing, M.L. and Y.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fundamental Research Funds for the Central Universities, grant number 222201917006.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Summary of the meaning of symbols used in this paper.
Symbol: Meaning
d_c: The cut-off distance, i.e., the neighborhood radius of a point
X = (x_1, ..., x_n): The dataset, with x_i as its i-th data point
n: The number of records in the dataset X
d_ij: The Euclidean distance between points x_i and x_j
D = {d_ij}: The distances of all pairs of data points in X
ρ = (ρ_1, ..., ρ_n): The local density
δ = (δ_1, ..., δ_n): The distance from the nearest point of larger density
γ = (γ_1, ..., γ_n): The decision value, the element-wise product of ρ and δ
γ': The decision values γ sorted in ascending order
k: The number of K-nearest neighbors considered
Γ(x_i) = (x_1, ..., x_k): The set of K-nearest neighbors of point x_i
SNN(x_i, x_j): The set of shared nearest neighbors of points x_i and x_j, possibly empty
L(x_i) = (x_1, ..., x_k): The set of k points with the highest similarity to point x_i
Sim(x_i, x_j): The SNN similarity between points x_i and x_j
NC: The number of clusters in the experiment
DN: The integer closest to √n
k_p: The knee point
μ_i: The difference between adjacent γ' values
θ: The threshold for judging the knee point
Center = {c_1, c_2, ..., c_NC}: The set of cluster centers in the experiment
Φ = {C_1, C_2, ..., C_NC}: The representation of the final clustering result
x_p, x_{p+1}, ...: The unassigned non-central points (possible subordinate points)
Φ = {C_1, C_2, ..., C_NC, x_p, x_{p+1}, ...}: The representation of the initial clustering result
M: The ergodic matrix for the distribution of possible subordinate points
Q: The queue of points waiting to be processed
K_a: The set of K-nearest neighbors of point x_a

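To connect the symbols Γ(x_i), SNN(x_i), and Sim{x_i, x_j} in Table A1 to concrete operations, the sketch below builds the symmetric distance matrix, extracts each point's k nearest neighbors, and counts shared neighbors for mutually neighboring points. The pure counting form of the similarity is an assumption chosen for illustration; the exact SNN similarity used by SNN-DPC and DPC-SNNACC may weight the shared neighbors differently.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_sets(X, k):
    """Gamma(x_i): the k nearest neighbors of every point (self excluded)."""
    d = cdist(X, X)                      # symmetric Euclidean distance matrix d_ij
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

def snn_similarity(X, k):
    """Sim(x_i, x_j): a simple shared-nearest-neighbor similarity (assumed form)."""
    nbrs = [set(row) for row in knn_sets(X, k)]
    n = len(X)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Count shared neighbors only for mutually neighboring points.
            if j in nbrs[i] and i in nbrs[j]:
                shared = nbrs[i] & nbrs[j]      # Gamma(x_i) intersected with Gamma(x_j)
                sim[i, j] = sim[j, i] = len(shared)
    return sim

# Toy usage on five 2-D points: pairs from different groups get zero similarity.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
print(snn_similarity(X, k=2))
```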
Figure 1. Distribution of γ on the Aggregation dataset.
Figure 2. A simple dataset as the example.
Figure 3. Distribution of γ in the simple dataset.
Figure 4. Changes in the various metrics of the Aggregation dataset.
Figure 5. Comparison of γ values between shared-nearest-neighbor-based clustering by fast search and find of density peaks (SNN-DPC) and the fast searching density peak clustering algorithm based on shared nearest neighbor and adaptive clustering center (DPC-SNNACC) in the Jain dataset. (a) γ values of SNN-DPC; (b) γ values of DPC-SNNACC.
Figure 6. Comparison of γ values between SNN-DPC and DPC-SNNACC in the Spiral dataset. (a) γ values of SNN-DPC; (b) γ values of DPC-SNNACC.
Figure 7. Clustering results when D_N = √n in the Aggregation dataset. (a) Distribution of γ in the Aggregation dataset; (b) Clustering result of the Aggregation dataset.
Figure 8. Clustering results when D_N = n in the Aggregation dataset. (a) Distribution of γ in the Aggregation dataset; (b) Clustering result of the Aggregation dataset.
Figure 9. Clustering results when D_N = n/2 in the Aggregation dataset. (a) Distribution of γ in the Aggregation dataset; (b) Clustering result of the Aggregation dataset.
Figure 10. Clustering results when D_N = √n in the Compound dataset. (a) Distribution of γ in the Compound dataset; (b) Clustering result of the Compound dataset.
Figure 11. Clustering results when D_N = n in the Compound dataset. (a) Distribution of γ in the Compound dataset; (b) Clustering result of the Compound dataset.
Figure 12. Clustering results when D_N = n/2 in the Compound dataset. (a) Distribution of γ in the Compound dataset; (b) Clustering result of the Compound dataset.
Figure 13. The clustering results on Aggregation by the four algorithms. (a) K-means on Aggregation; (b) CFSFDP on Aggregation; (c) SNN-DPC on Aggregation; (d) DPC-SNNACC on Aggregation.
Figure 14. The clustering results on Flame by the four algorithms. (a) K-means on Flame; (b) CFSFDP on Flame; (c) SNN-DPC on Flame; (d) DPC-SNNACC on Flame.
Figure 15. The clustering results on Wine by the four algorithms. (a) K-means on Wine; (b) CFSFDP on Wine; (c) SNN-DPC on Wine; (d) DPC-SNNACC on Wine.
Figure 16. The clustering results on the WDBC (Wisconsin Diagnostic Breast Cancer) dataset by the four algorithms. (a) K-means on WDBC; (b) CFSFDP on WDBC; (c) SNN-DPC on WDBC; (d) DPC-SNNACC on WDBC.
Table 1. Ergodic matrix M of the example dataset.

| | Cluster 1 | Cluster 2 |
|---|---|---|
| x_8 | 3 | 0 |
| x_9 | 1 | 2 |
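Table 1 reads as follows: the unassigned point x_8 has three of its k nearest neighbors in cluster 1 and none in cluster 2, so it is attached to cluster 1, while x_9 (one neighbor in cluster 1, two in cluster 2) goes to cluster 2. The sketch below is a simplified, single-pass version of this counting-and-voting step; the queue-based propagation over Q and the rule for points with no assigned neighbors are omitted.

```python
import numpy as np

def assign_by_neighbor_vote(labels, knn_indices, unassigned):
    """Assign each unassigned point to the cluster holding most of its k neighbors.

    labels: array of cluster ids, -1 for points not yet assigned.
    knn_indices: maps point index -> its k nearest neighbors (the set K_a).
    unassigned: point indices still waiting to be processed.

    A simplified, single-pass version of the distribution driven by the
    ergodic matrix M; the published algorithm iterates via a queue Q.
    """
    n_clusters = labels.max() + 1
    for p in unassigned:
        counts = np.zeros(n_clusters, dtype=int)     # one row of the matrix M
        for q in knn_indices[p]:
            if labels[q] >= 0:
                counts[labels[q]] += 1
        if counts.any():
            labels[p] = int(np.argmax(counts))       # majority vote, as in Table 1
    return labels

# Toy usage mirroring Table 1 (clusters indexed from 0):
# point 8 sees 3 neighbors in cluster 0, point 9 sees 1 vs. 2.
labels = np.array([0, 0, 0, 1, 1, 1, 0, 1, -1, -1])
knn = {8: [0, 1, 2], 9: [0, 3, 4]}
print(assign_by_neighbor_vote(labels, knn, unassigned=[8, 9]))
```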
Table 2. Synthetic datasets.

| Name | No. of Records | No. of Clusters | No. of Dimensions |
|---|---|---|---|
| Aggregation | 788 | 7 | 2 |
| Spiral | 312 | 3 | 2 |
| Jain | 373 | 2 | 2 |
| R15 | 600 | 15 | 2 |
| Flame | 240 | 2 | 2 |
| Compound | 399 | 6 | 2 |
| Path-based | 300 | 3 | 2 |
| D31 | 3100 | 31 | 2 |
Table 3. UCI (University of California, Irvine) real datasets.

| Name | No. of Records | No. of Clusters | No. of Dimensions |
|---|---|---|---|
| Wine | 178 | 3 | 13 |
| Iris | 150 | 3 | 4 |
| Seeds | 210 | 3 | 7 |
| WDBC | 569 | 2 | 30 |
Table 4. Metrics corresponding to different D_N in the Aggregation dataset.

| D_N | N_C | NMI | AMI | ARI | FMI | F-Measure |
|---|---|---|---|---|---|---|
| √n | 7 | 0.9555 | 0.94998 | 0.95938 | 0.96814 | 0.97843 |
| n | 12 | 0.80363 | 0.66523 | 0.51377 | 0.63036 | 0.69423 |
| n/2 | 12 | 0.80363 | 0.66523 | 0.51377 | 0.63036 | 0.69423 |
Table 5. Metrics corresponding to different D_N in the Compound dataset.

| D_N | N_C | NMI | AMI | ARI | FMI | F-Measure |
|---|---|---|---|---|---|---|
| √n | 6 | 0.8361 | 0.80864 | 0.83369 | 0.87443 | 0.86421 |
| n | 6 | 0.69173 | 0.68154 | 0.51401 | 0.62995 | 0.66917 |
| n/2 | 6 | 0.69173 | 0.68154 | 0.51401 | 0.62995 | 0.66917 |
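For scale, the D_N value in the first rows of Tables 4 and 5 can be checked directly from the dataset sizes in Table 2: rounding the square root of n gives 28 for Aggregation (n = 788) and 20 for Compound (n = 399), far smaller than the n and n/2 settings in the remaining rows. The one-line check below is only a worked illustration of the definition in Table A1.

```python
import math

# D_N is defined in Table A1 as the integer closest to the square root of n.
for name, n in [("Aggregation", 788), ("Compound", 399)]:
    print(name, round(math.sqrt(n)))  # Aggregation -> 28, Compound -> 20
```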
Table 6. Performances of the different clustering algorithms on different synthetic datasets.

Aggregation dataset
| Algorithm | NMI | AMI | ARI | F-Measure | Accuracy | FMI | N_C | Time/s |
|---|---|---|---|---|---|---|---|---|
| K-means | 0.8499 | 0.8008 | 0.7099 | 0.8169 | 0.9086 | 0.7731 | 7 | 0.8363 |
| CFSFDP | 0.8941 | 0.8596 | 0.7548 | 0.8477 | 0.9543 | 0.8071 | 7 | 0.6420 |
| SNN-DPC | 0.955 | 0.9499 | 0.9593 | 0.9787 | 0.9784 | 0.9681 | 7 | 1.1680 |
| DPC-SNNACC | 0.955 | 0.9499 | 0.9593 | 0.9787 | 0.9784 | 0.9681 | 7 | 1.1938 |

Spiral dataset
| Algorithm | NMI | AMI | ARI | F-Measure | Accuracy | FMI | N_C | Time/s |
|---|---|---|---|---|---|---|---|---|
| K-means | 0.0030 | 0 | 0 | 0.3492 | 0.3462 | 0.3276 | 3 | 0.2830 |
| CFSFDP | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 0.4221 |
| SNN-DPC | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 0.4981 |
| DPC-SNNACC | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 0.5000 |

Jain dataset
| Algorithm | NMI | AMI | ARI | F-Measure | Accuracy | FMI | N_C | Time/s |
|---|---|---|---|---|---|---|---|---|
| K-means | 0.5287 | 0.4915 | 0.5766 | 0.8876 | 0.8820 | 0.8199 | 2 | 0.2292 |
| CFSFDP | 0.6456 | 0.6103 | 0.7055 | 0.9251 | 0.9222 | 0.8779 | 2 | 0.4116 |
| SNN-DPC | 0.5604 | 0.5211 | 0.5935 | 0.8927 | 0.887 | 0.8272 | 2 | 0.5303 |
| DPC-SNNACC | 0.5068 | 0.4667 | 0.5146 | 0.8680 | 0.8605 | 0.7904 | 2 | 0.6070 |

R15 dataset
| Algorithm | NMI | AMI | ARI | F-Measure | Accuracy | FMI | N_C | Time/s |
|---|---|---|---|---|---|---|---|---|
| K-means | 0.9942 | 0.9937 | 0.9927 | 0.9966 | 0.9966 | 0.9932 | 15 | 0.4304 |
| CFSFDP | 0.9942 | 0.9937 | 0.9927 | 0.9966 | 0.9966 | 0.9932 | 15 | 0.7821 |
| SNN-DPC | 0.9942 | 0.9937 | 0.9927 | 0.9966 | 0.9966 | 0.9932 | 15 | 0.9916 |
| DPC-SNNACC | 0.9942 | 0.9937 | 0.9927 | 0.9966 | 0.9966 | 0.9932 | 15 | 1.0347 |

Flame dataset
| Algorithm | NMI | AMI | ARI | F-Measure | Accuracy | FMI | N_C | Time/s |
|---|---|---|---|---|---|---|---|---|
| K-means | 0.4843 | 0.4692 | 0.5117 | 0.8611 | 0.8583 | 0.7643 | 2 | 0.2755 |
| CFSFDP | 0.4131 | 0.4030 | 0.3269 | 0.7903 | 0.7873 | 0.6786 | 2 | 0.3983 |
| SNN-DPC | 0.8993 | 0.8974 | 0.9501 | 0.9875 | 0.9875 | 0.9768 | 2 | 0.4230 |
| DPC-SNNACC | 0.8993 | 0.8974 | 0.9501 | 0.9875 | 0.9875 | 0.9768 | 2 | 0.4750 |

Compound dataset
| Algorithm | NMI | AMI | ARI | F-Measure | Accuracy | FMI | N_C | Time/s |
|---|---|---|---|---|---|---|---|---|
| K-means | 0.6834 | 0.6466 | 0.4988 | 0.6210 | 0.7894 | 0.6114 | 6 | 0.6626 |
| CFSFDP | 0.7373 | 0.6968 | 0.5460 | 0.7183 | 0.8320 | 0.6491 | 6 | 0.7726 |
| SNN-DPC | 0.8526 | 0.8451 | 0.8348 | 0.8603 | 0.8746 | 0.8752 | 6 | 0.6762 |
| DPC-SNNACC | 0.8361 | 0.8286 | 0.8336 | 0.8642 | 0.8696 | 0.8744 | 6 | 0.6807 |

Path-based dataset
| Algorithm | NMI | AMI | ARI | F-Measure | Accuracy | FMI | N_C | Time/s |
|---|---|---|---|---|---|---|---|---|
| K-means | 0.5469 | 0.5098 | 0.4613 | 0.6959 | 0.7433 | 0.6616 | 3 | 0.2523 |
| CFSFDP | 0.5529 | 0.5165 | 0.4678 | 0.7015 | 0.7466 | 0.6653 | 3 | 0.4284 |
| SNN-DPC | 0.9013 | 0.9 | 0.9293 | 0.9767 | 0.9766 | 0.9529 | 3 | 0.7557 |
| DPC-SNNACC | 0.9013 | 0.9 | 0.9293 | 0.9767 | 0.9766 | 0.9529 | 3 | 0.7920 |

D31 dataset
| Algorithm | NMI | AMI | ARI | F-Measure | Accuracy | FMI | N_C | Time/s |
|---|---|---|---|---|---|---|---|---|
| K-means | 0.9576 | 0.9526 | 0.9162 | 0.9462 | 0.9480 | 0.9192 | 31 | 1.9972 |
| CFSFDP | 0.9577 | 0.9556 | 0.9363 | 0.9683 | 0.9683 | 0.9384 | 31 | 2.2940 |
| SNN-DPC | 0.9658 | 0.9642 | 0.9509 | 0.9758 | 0.9757 | 0.9525 | 31 | 11.6866 |
| DPC-SNNACC | 0.9658 | 0.9642 | 0.9509 | 0.9758 | 0.9757 | 0.9525 | 31 | 11.0848 |
Table 7. Performances of the different clustering algorithms on different UCI datasets.

Wine dataset
| Algorithm | NMI | AMI | ARI | F-Measure | Accuracy | FMI | N_C | Time/s |
|---|---|---|---|---|---|---|---|---|
| K-means | 0.8346 | 0.8301 | 0.8471 | 0.9488 | 0.9494 | 0.8984 | 3 | 1.4173 |
| CFSFDP | 0.7104 | 0.7064 | 0.6723 | 0.8775 | 0.8821 | 0.7834 | 3 | 1.7788 |
| SNN-DPC | 0.8781 | 0.8735 | 0.8991 | 0.9661 | 0.9662 | 0.9329 | 3 | 1.2852 |
| DPC-SNNACC | 0.8781 | 0.8735 | 0.8991 | 0.9661 | 0.9662 | 0.9329 | 3 | 1.9896 |

Iris dataset
| Algorithm | NMI | AMI | ARI | F-Measure | Accuracy | FMI | N_C | Time/s |
|---|---|---|---|---|---|---|---|---|
| K-means | 0.7419 | 0.7331 | 0.7163 | 0.8852 | 0.8866 | 0.8112 | 3 | 0.9315 |
| CFSFDP | 0.6586 | 0.5741 | 0.4530 | 0.7095 | 0.6667 | 0.6855 | 3 | 1.3219 |
| SNN-DPC | 0.9143 | 0.9123 | 0.9222 | 0.9732 | 0.9733 | 0.9478 | 3 | 1.3140 |
| DPC-SNNACC | 0.9143 | 0.9123 | 0.9222 | 0.9732 | 0.9733 | 0.9478 | 3 | 1.3814 |

Seeds dataset
| Algorithm | NMI | AMI | ARI | F-Measure | Accuracy | FMI | N_C | Time/s |
|---|---|---|---|---|---|---|---|---|
| K-means | 0.6654 | 0.6614 | 0.6934 | 0.8858 | 0.8857 | 0.7949 | 3 | 1.2054 |
| CFSFDP | 0.7238 | 0.7172 | 0.7341 | 0.8977 | 0.9 | 0.8230 | 3 | 1.4785 |
| SNN-DPC | 0.7513 | 0.7486 | 0.7962 | 0.9291 | 0.9285 | 0.8636 | 3 | 1.3827 |
| DPC-SNNACC | 0.7513 | 0.7486 | 0.7962 | 0.9291 | 0.9285 | 0.8636 | 3 | 1.4527 |

WDBC dataset
| Algorithm | NMI | AMI | ARI | F-Measure | Accuracy | FMI | N_C | Time/s |
|---|---|---|---|---|---|---|---|---|
| K-means | 0.6232 | 0.6109 | 0.7301 | 0.927 | 0.9279 | 0.8769 | 2 | 3.2554 |
| CFSFDP | 0.6849 | 0.6274 | 0.7256 | | | | 2 | 3.9900 |
| SNN-DPC | 0.7568 | 0.7521 | 0.8503 | 0.9611 | 0.9613 | 0.9304 | 2 | 3.7370 |
| DPC-SNNACC | 0.6926 | 0.6895 | 0.799 | 0.9474 | 0.9478 | 0.9054 | 2 | 4.2873 |
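The external indices reported in Tables 6 and 7 can be reproduced for any predicted labeling with standard library calls; a minimal scikit-learn sketch is shown below with placeholder labelings. NMI, AMI, ARI, and FMI come directly from sklearn.metrics, while the F-Measure and Accuracy columns additionally require matching predicted cluster labels to ground-truth classes (e.g., via the Hungarian algorithm), which is omitted here.

```python
from sklearn import metrics

# Placeholder labelings; in the experiments these would be the ground truth of a
# dataset from Table 2 or 3 and the output of one of the four algorithms.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 2]

print("NMI", metrics.normalized_mutual_info_score(y_true, y_pred))
print("AMI", metrics.adjusted_mutual_info_score(y_true, y_pred))
print("ARI", metrics.adjusted_rand_score(y_true, y_pred))
print("FMI", metrics.fowlkes_mallows_score(y_true, y_pred))
```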
Table 8. Comparison of the time complexity of different algorithms.

| Algorithm | Time Complexity |
|---|---|
| K-means | O(N_C · n) |
| CFSFDP | O(n²) |
| SNN-DPC | O((N_C + k) · n²) |
| DPC-SNNACC | O((N_C + k) · n²) |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
