Density Peak Clustering Algorithm Considering Topological Features

Abstract: Clustering algorithms play an important role in data mining and image processing. Breakthroughs in algorithm precision and method directly affect the direction and progress of subsequent research. At present, clustering algorithms are mainly divided into hierarchical, density-based, grid-based and model-based types. This paper studies the Clustering by Fast Search and Find of Density Peaks (CFSFDP) algorithm, a recent density-based clustering method. The algorithm requires no iterative process, has few parameters and achieves high precision. However, we found that it does not consider the original topological characteristics of the data. We also found that clustering data resembles the social network nodes discussed in DeepWalk, which satisfy a power-law distribution. In this study, we therefore incorporate the topological characteristics of the graph into the clustering algorithm. Building on previous studies, we propose a clustering algorithm that adds the topological characteristics of the original data to the CFSFDP algorithm. Our experimental results show that the clustering algorithm with topological features significantly improves the clustering effect, indicating that the addition of topological features is effective and feasible.


Introduction
With the advent of the era of big data, information grows rapidly [1,2]. The influx of massive data makes the statistics and screening of important information more difficult. Cluster analysis is an important statistical analysis method, mainly used to solve classification problems. It is a technique for grouping similar information into meaningful subclasses of data and finding patterns embedded in the underlying structure of massive data [3][4][5]. Cluster analysis has been widely used in computer vision, database knowledge discovery, image processing and other fields [6][7][8][9]. Owing to its wide applicability, many clustering algorithms have been invented, including K-means, the Affinity Propagation (AP) algorithm, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Ordering Points to Identify the Clustering Structure (OPTICS) and the Clustering by Fast Search and Find of Density Peaks (CFSFDP) algorithm [9,10]. Although these methods solve many data clustering problems, each has its own limitations. Therefore, as data complexity increases and clustering methods are studied further, improving and extending the classical methods has become the mainstream research direction. Below, we introduce the advantages and disadvantages of the above clustering algorithms.
The K-means algorithm adopts the alternating minimization method to solve non-convex optimization problems [11,12] and is a representative prototype-based clustering method built on an objective function. It divides a given data set into K clusters, where K is specified by the user, and has a high execution efficiency. However, the K-means algorithm needs the number of clustering centers to be specified in advance. Different K values have a great impact on the experimental results and, in practice, multiple attempts are required to obtain the optimal K value. Firstly, the subjectivity of the K value means that clustering results often differ between runs. Secondly, the algorithm is not well suited to non-convex data sets, for which it is difficult to obtain satisfactory clustering results.
The AP clustering algorithm [13,14] is an unsupervised clustering algorithm that realizes the clustering process through the interaction between data points. Its advantages are that there is no need to define the number of classes in advance, the clustering centers are obtained adaptively through the iterative process, and the location and number of class centers are identified automatically from the data points so as to maximize the sum of similarities of all similar points. The algorithm also has disadvantages. Since the AP algorithm needs to calculate the similarity between each pair of data objects in advance, the memory footprint is high when there are many data objects. Moreover, its high time complexity, $O(N^3)$, is another shortcoming.
The DBSCAN algorithm [15,16] is a clustering algorithm based on data point density. It places no constraint on the shape of the data set and can obtain class clusters of arbitrary shapes. The DBSCAN algorithm needs an appropriate density threshold; points with a density greater than this threshold become "core points", and the algorithm then aggregates all points within the distance threshold of a core point into the same class. The advantage of this algorithm is that the number of clustering centers does not need to be determined in advance, and noise points and outliers can be eliminated well. However, DBSCAN also has deficiencies. When the density of the sample set is not uniform and the spacing between clusters varies greatly, the clustering quality is poor, which is mainly caused by the limitations of the distance threshold and the density threshold. Generally, the determination of these two parameters depends on the researcher's experience, which also limits the algorithm.
The OPTICS algorithm [17] is an improved version of DBSCAN, so it is also a density-based clustering algorithm. DBSCAN takes two input parameters: ϵ (the distance threshold) and MinPts (the density threshold). Different parameter choices can lead to very different clustering results, so DBSCAN is too sensitive to its input parameters. OPTICS was proposed to help select appropriate parameters for DBSCAN and to reduce this sensitivity. Although OPTICS also needs two inputs, it is not sensitive to ϵ (generally, ϵ is fixed as infinity). The algorithm does not explicitly generate a clustering of the data; it only sorts the objects in the data set to obtain an ordered list. From this ordered list a decision diagram can be compiled, and via this decision diagram the data set can be clustered for different values of ϵ. In short, the ordered list is first obtained with a fixed MinPts and infinite ϵ, then the decision diagram is created, from which the clustering of the data is read off for any specific value of ϵ.
CFSFDP is a new clustering algorithm based on density. The algorithm has many strengths: it can discover clusters of arbitrary shape, it needs fewer parameters and samples, it requires no iterative classification, it handles unbalanced data sets well (where the number of points in different clusters varies greatly), and it performs better on large-scale data sets. However, there are some shortcomings. The algorithm relies on two important quantities, the local density and the cut-off distance. The choice of cut-off distance directly affects the accuracy of the clustering results, and the algorithm ignores the topological relationships among the data.
Since the emergence of Word2vec, many fields have used its ideas for embedding; the Item2vec algorithm, for example, applies them to item sequences. Graph embedding technology is based on graph structure. The DeepWalk method in graph embedding treats the node sequence obtained by a random walk as a sentence, obtains part of the network information from the truncated random walk sequence, and then learns the latent representation of nodes, that is, the topological characteristics, from this partial information. To solve the problem that the CFSFDP algorithm does not consider the topological characteristics of data, we propose a new idea: use the DeepWalk algorithm to represent the latent information of the data, and apply this representation as topological characteristics in data clustering to improve clustering accuracy. The general process is shown in Figure 1. On the basis of the CFSFDP algorithm, we combine the topological features obtained by the DeepWalk algorithm to form a new clustering algorithm. As far as we know, there are few studies on the combination of topological features and clustering algorithms, mainly because topological features are not easy to obtain. We apply the DeepWalk approach, originally designed for social network characteristics, to solve this problem. Our motivation is twofold: (1) solve the problem of difficult topological feature acquisition; (2) improve the clustering precision by combining topological features with the CFSFDP clustering algorithm.

DPCTF Algorithm
The algorithm is based on the CFSFDP algorithm and DeepWalk [7]. This method is based on previous theories and topological features. This section briefly reviews the CFSFDP, DeepWalk and topological characteristics.

Clustering by Fast Search and Find of Density Peaks
CFSFDP is a clustering algorithm based on density peaks. The core idea is to calculate the local density and the nearest distance using the cut-off distance, draw the decision graph from the local density and the nearest distance, and subjectively select the clustering center points on the decision graph to cluster the data [18,19].
When performing clustering, the CFSFDP algorithm first needs to determine the center point of each class. It assumes that the local density of a cluster center is higher than that of the surrounding data points, and that the distance between the cluster center and any data point with a higher local density is large. For a given data set, the CFSFDP algorithm calculates two quantities for each data point $i$: its local density $\rho_i$ and its distance $\delta_i$ from the nearest data point whose local density is higher than its own.
The local density of a data point can be calculated in two ways: with the cutoff kernel or with the Gaussian kernel. The cutoff kernel subtracts the cutoff distance from the distance between two data points; if the result is less than zero, that is, if the distance between the data points is less than the preset cutoff distance, the function takes the value one, otherwise it takes the value zero:

$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c), \qquad \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \geq 0 \end{cases}$$

In the formula for $\rho_i$, $d_{ij}$ is the distance between data points $i$ and $j$, and $d_c$ is the cutoff distance. The local density $\rho_i$ obtained in this way equals the number of data points whose distance from point $i$ is less than $d_c$. In other words, the more points lie closer to point $i$ than $d_c$, the greater the density. The Gaussian kernel method is often used for high-dimensional cluster data. The main idea of the kernel clustering method is to map the data points in the input space to a high-dimensional feature space through a nonlinear mapping, and to select an appropriate Mercer kernel function to replace the inner product of the nonlinear mapping so that clustering is carried out in the feature space. The nonlinear mapping increases the probability that the data points are linearly separable; that is, it can better distinguish, extract and amplify useful features, so as to achieve more accurate clustering, and the algorithm converges faster. Gaussian kernel clustering maps the data points in data space to a high-dimensional feature space. Then, a sphere with the smallest radius that covers all data points is found in the feature space, and the sphere is mapped back to the data space to obtain a set of contour lines containing all data points. These contour lines are the boundaries of the clusters. All points surrounded by the same closed contour line belong to the same cluster.
The difference between calculating local density with the cutoff kernel and Gaussian kernel is that the former results in a discrete value and the latter in a continuous value.
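The discrete versus continuous behavior can be illustrated with a short Python sketch. The function name `local_density` and its arguments are hypothetical, and the Gaussian form used here, a sum of `exp(-(d_ij / d_c)^2)` over the other points, is the form commonly used for density peak clustering; the text above does not state it explicitly, so treat it as an assumption.

```python
import numpy as np

def local_density(dist, d_c, kernel="cutoff"):
    """Local density of every point from a pairwise distance matrix.

    dist   : (n, n) symmetric matrix with zeros on the diagonal
    d_c    : cutoff distance
    kernel : "cutoff" (discrete count) or "gaussian" (continuous, assumed form)
    """
    dist = np.asarray(dist, dtype=float)
    if kernel == "cutoff":
        # Count points closer than d_c; subtract 1 to exclude the point itself.
        return (dist < d_c).sum(axis=1) - 1
    if kernel == "gaussian":
        # Smooth weight per neighbour; subtract the self term exp(0) = 1.
        return np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0
    raise ValueError("kernel must be 'cutoff' or 'gaussian'")
```

With the cutoff kernel, several points can share exactly the same integer density, whereas the Gaussian kernel usually breaks such ties because every neighbor contributes a distance-dependent weight.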
The distance $\delta_i$ of a data point is obtained from the distances between that point and the data points with higher local density, and can be calculated according to Formula (1):

$$\delta_i = \min_{j:\,\rho_j > \rho_i} d_{ij} \tag{1}$$

That is, $\delta_i$ is measured by computing the minimum distance between point $i$ and any other point with higher density; for the point with the highest density, $\delta_i = \max_j d_{ij}$ is used. After calculating $\rho_i$ and $\delta_i$ for each point, the two attributes are used as the coordinate axes of a two-dimensional decision graph, as shown in Figure 2. In the decision graph, the points with larger $\rho_i$ and $\delta_i$ are manually selected as the center points of the class clusters. The literature also suggests a formula, $\gamma_i = \rho_i \delta_i$, to better select the clustering centers. The closer a point is to the upper right corner of the decision graph, the greater its $\gamma_i$. After all the class cluster centers are selected, the CFSFDP algorithm assigns each remaining point to the same cluster as its nearest neighbor of higher local density.
The flow of the algorithm is shown in the following steps:
Step 1: Preset the distance threshold $d_c$ and calculate the local density $\rho_i$ of each point;
Step 2: Sort the points by density from high to low;
Step 3: For the point of highest density, let $\delta_i = \max_j d_{ij}$; for every other point, calculate $\delta_i$ according to Formula (1) and store the corresponding label;
Step 4: Make a decision graph based on the parameters $\rho_i$ and $\delta_i$, and select the center points of the clusters;
Step 5: According to the cluster center points, data object labels and density boundary threshold, divide the remaining points into each cluster or boundary region.
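The flow above can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cfsfdp` is hypothetical, the cutoff kernel is used for the density, the manual choice of centers on the decision graph is replaced by taking the `n_centers` points with the largest product of density and distance, and the boundary-region handling of Step 5 is omitted.

```python
import numpy as np

def cfsfdp(points, d_c, n_centers):
    """Density peak clustering sketch; returns (rho, delta, labels)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise Euclidean distances.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Step 1: local density = number of other points closer than d_c.
    rho = (dist < d_c).sum(axis=1) - 1
    # Step 2: point indices sorted by decreasing density (ties keep index order).
    order = np.argsort(-rho, kind="stable")
    # Step 3: distance to, and index of, the nearest point of higher density.
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    delta[order[0]] = dist[order[0]].max()  # convention for the densest point
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]  # points earlier in the density order
        j = higher[np.argmin(dist[i, higher])]
        delta[i] = dist[i, j]
        nearest_higher[i] = j
    # Step 4: automatic stand-in for the manual choice on the decision graph.
    gamma = rho * delta
    centers = np.argsort(-gamma, kind="stable")[:n_centers]
    if order[0] not in centers:  # on ties, keep the global density peak
        centers[-1] = order[0]
    # Step 5: in density order, inherit the label of the nearest denser point.
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_centers)
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return rho, delta, labels
```

Because points are labeled in order of decreasing density, the nearest denser neighbor of each point already has a label when that point is reached, so a single pass suffices and no iteration is needed.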

DeepWalk
DeepWalk is the earliest node vectorization model based on Word2vec [20]; it maps every node in a network to a low-dimensional vector. Put simply, DeepWalk uses a vector to represent each node in the network and expects these vectors to express the relationships among the nodes; that is, the more closely related two nodes are in the original network, the closer their corresponding vectors are in the vector space. The main idea is to imitate the way Word2vec processes text by constructing random walk paths of nodes on the network. Given the sequences of nodes generated by the random walks [21][22][23], the Skipgram and Hierarchical Softmax models [24] are used to model the probability of nodes in the random walk sequences, maximize the probability of node occurrence in those sequences and, finally, learn the parameters by stochastic gradient descent.
DeepWalk takes a graph as input and generates a latent representation as output. Applying this method to the Karate Club network gives the result shown in Figure 3. Figure 3a shows the layout of the Karate Club data. Figure 3b shows the distribution of the two latent dimensions output by DeepWalk in the coordinate system. We observed that the distribution of data points in Figure 3b was similar to that of the original data points, but there were also linearly separable parts. This finding can help with data clustering. The algorithm proceeds as follows:
Input: network/graph;
Step 1: Run random walks over the graph to obtain node sequences;
Step 2: Feed the node sequences into Skipgram to update the model: map each node to its current representation vector and use stochastic gradient descent to maximize the probability of the nodes appearing in its context;
Output: representation.

Random Walk Generator
The so-called random walk method is used to select random walk paths repeatedly in the network and eventually form a path through the network. Starting at a particular endpoint, each step of the walk randomly selects one of the edges connected to the current node, then moves along the selected edge to the next vertex, and so on.
The random walk obtains the local information of the network from truncated random walk sequences and uses it to learn the vector representation of nodes, which makes it easy to parallelize. When the network structure changes slightly, new random walks can be generated for the changed part alone to update the learned model, which improves learning efficiency. If the nodes of a network follow a power-law distribution, the number of occurrences of the nodes in the random walk sequences should also follow a power-law distribution.
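A truncated random walk generator of this kind can be sketched in a few lines of Python. The names `random_walk` and `build_corpus`, the adjacency-list input format and the parameters `walks_per_node` and `walk_length` are illustrative assumptions, not the DeepWalk reference implementation.

```python
import random

def random_walk(adjacency, start, walk_length, rng):
    """Truncated random walk with at most walk_length nodes, starting at start."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adjacency[walk[-1]]
        if not neighbors:  # isolated node: the walk cannot continue
            break
        # Pick one of the edges of the current node uniformly at random.
        walk.append(rng.choice(neighbors))
    return walk

def build_corpus(adjacency, walks_per_node, walk_length, seed=0):
    """Generate walks_per_node walks from every node, in shuffled node order."""
    rng = random.Random(seed)
    nodes = list(adjacency)
    corpus = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)
        for node in nodes:
            corpus.append(random_walk(adjacency, node, walk_length, rng))
    return corpus
```

Each walk depends only on the adjacency lists it visits, which is why walks can be generated in parallel, or regenerated for just the part of the graph that changed.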

Skipgram
Skipgram is an implementation method for the vector representation of text in Word2vec. It is a kind of neural probabilistic language model. The main idea of Skipgram is to use a word to predict its context. The context consists of the words in the left and right windows of a given word, and the window size is set manually. The Skipgram language model maximizes the probability of the words appearing in the context. In this way, Skipgram learns a representation of the characteristics of words. In DeepWalk, the authors apply the vector representation of text in Word2vec to social network relations, and indicate that, as long as the data satisfy a power-law distribution, the vector representation of nodes can be realized by this method. Following this idea, we also compute vector representations of the clustering data nodes.
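To make the role of the window concrete, the following Python sketch lists the (center, context) training pairs that a Skipgram model would be asked to predict for one node sequence. The function name `skipgram_pairs` is hypothetical, and the sketch covers only pair generation, not the neural model or its training.

```python
def skipgram_pairs(walk, window):
    """Return (center, context) pairs for one sequence and a given window size."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)              # left edge of the window
        hi = min(len(walk), i + window + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                       # a node is not its own context
                pairs.append((center, walk[j]))
    return pairs
```

For example, `skipgram_pairs([1, 2, 3], window=1)` returns `[(1, 2), (2, 1), (2, 3), (3, 2)]`; applied to random walk sequences, nodes that are close in the graph co-occur in many pairs and therefore end up with similar vectors.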

Topological Structure
Topological relation [25][26][27][28] is a geometric concept, which includes the relations of adjacency, association, inclusion and connection among points, lines and planes. This concept has been extended to image processing. Topological structure is one of the basic properties of an image and often contains more accurate image information than simple feature points, such as corner points and inflection points. Topology is the property of a graph that remains unchanged after topological transformation. For example, Figure 4 shows a rectangular plane containing a point X. After a two-dimensional topological transformation (stretching and compression) of the rectangle, its dimensions and their ratio all change, but point X remains inside the rectangle. This invariant property, that point X is inside the rectangle, is called a topological property or a topological feature [29][30][31][32]. Topological structure plays an important role in image segmentation and classification, but it is not widely used in data clustering, which leaves a research gap in combining topological structure information with clustering algorithms. In this paper, topological characteristics and a clustering algorithm are combined to improve clustering results.

Density Peak Clustering Algorithm Considering Topological Features
The CFSFDP algorithm still has some flaws. When the CFSFDP algorithm is used to process a data set, the clustering effect is often not ideal in practical applications, and the main reason may be the selection of the data set and the cutoff distance [33][34][35]. By reviewing the classical clustering algorithms, we find that they do not consider the topological characteristics of the data set. Therefore, in order to verify the effect of topology on clustering, the following experimental scheme was proposed. The specific operation process is shown in Figure 5. As an important feature of graphs, topological features play an important role in image processing; for example, topologies are often introduced when drawing satellite maps [36][37][38][39][40][41][42][43]. However, in the field of clustering research, topological features are rarely applied. Therefore, we propose a clustering algorithm based on topological characteristics, which we call the Density Peak Clustering Algorithm Considering Topological Features (DPCTF).
First, we need to analyze and preprocess the original data. As can be seen from Figure 6, the original data is in the point set form commonly used by clustering algorithms, and the topological structure of the data is not considered. We transform the original data into graph form, with a topological structure, by finding the connections between data points. Through analysis of the original data, we find that the data contains three features. The first feature is the initial data point, the second feature is every data point other than the initial point, and the third feature is the distance between feature one and feature two. We processed the original data according to feature three and obtained the topological structure of the original data points.
Through the analysis of these features, we found that the distances between points follow a certain rule. Based on this rule, we connected pairs of data points whose distance is less than $d_t$ ($d_t$ is a distance threshold value; throughout the test, we used $d_t = 0.35$ to achieve the best experimental effect), so as to form a topological relationship. In the resulting data, each row represents the topological relationship between one data point and all other connectable data points.
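The thresholding step can be sketched in Python as an adjacency list in which each entry lists the points connectable to one data point. The function name `threshold_graph` and the dictionary output format are assumptions made for illustration; only the rule "connect two points when their distance is below the threshold" comes from the text, and Euclidean distance on normalized data is assumed.

```python
import numpy as np

def threshold_graph(points, threshold=0.35):
    """Connect every pair of points whose Euclidean distance is below threshold.

    Returns a dict mapping each point index to the list of connected indices.
    """
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    adjacency = {}
    for i in range(len(points)):
        close = np.flatnonzero(dist[i] < threshold)
        adjacency[i] = [int(j) for j in close if j != i]  # no self-loops
    return adjacency
```

A point with an empty list is isolated in the graph; how such points are treated by the later embedding step is not specified in the text, so a practical implementation would need to decide that explicitly.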
Connectable data points are points that meet the preset condition. After preprocessing the original data, we obtained its topology. This topology provides the input form required by DeepWalk: we imported the graph obtained in the previous step into DeepWalk and obtained the topological characteristics of the graph through the random walk and Skipgram in DeepWalk. It is important to note that the dimension of DeepWalk's output features is adjustable and theoretically unconstrained; the original work suggests choosing multiples of four, and we tried one and four dimensions (to keep the dimensionality of the result from becoming too high). From the results, we found that, for the data we processed, the one-dimensional feature was more effective.
After obtaining the output of DeepWalk, we combined the topological features with the original data point by point, which yields new data of higher dimension. We call this the cluster data set considering topological characteristics. We then fed this new data set into the CFSFDP algorithm to verify the clustering effect.
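The combination step amounts to appending the embedding columns to the coordinate columns of each point. The Python sketch below assumes that the embedding rows are in the same order as the data points and that each column is rescaled to [0, 1] by min-max normalization; the exact normalization scheme is not stated in the text, and the function names are hypothetical.

```python
import numpy as np

def min_max_normalize(data):
    """Rescale every column to the range [0, 1]; constant columns become 0."""
    data = np.asarray(data, dtype=float)
    lo = data.min(axis=0)
    span = data.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return (data - lo) / span

def build_cluster_data(points, embedding):
    """Append topological features to the original points, column-wise."""
    points = np.asarray(points, dtype=float)
    embedding = np.asarray(embedding, dtype=float).reshape(len(points), -1)
    return min_max_normalize(np.hstack([points, embedding]))
```

For two-dimensional points and a one-dimensional embedding, the result has three columns, and this augmented array is what would be passed to the density peak clustering step.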
First, we used the density peak clustering algorithm to perform rough clustering of the data in order to obtain the number of class clusters $K$; $l_{ij}$ are the class labels, where $i$ denotes the $i$th sample and $j$ denotes the $j$th cluster. The topological features are used as weights to calculate $u_{ij}$, where $u_{ij}$ represents the degree of membership of sample $x_i$ in the $j$th cluster; $x_i$ is a sample with d-dimensional features and is represented by its topological feature; $c_j$ is the topological feature of the center point of the $j$th cluster and also has d dimensions, where we use one dimension. Our goal is to minimize the objective function $J$, which is a process of iteratively calculating the memberships and the cluster centers until they reach the optimum. It is worth noting that, for a single sample $x_i$, the sum of its memberships over all clusters is 1, that is, $\sum_{j=1}^{K} u_{ij} = 1$. The iteration stops when

$$\max_{i,j} \left| u_{ij}^{(t+1)} - u_{ij}^{(t)} \right| < \varepsilon,$$

where $t$ is the number of iterative steps and $\varepsilon$ is the error threshold. After several iterations, the degree of membership tends to be stable; that is, it is considered to have reached a better state. This process converges to a minimum of the objective $J$.
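The iteration described here resembles fuzzy c-means, so the sketch below implements the standard fuzzy c-means updates as an illustration only. The exact objective function is not recoverable from the text; the fuzzifier `m = 2`, the squared Euclidean distance, the function name `fuzzy_c_means` and the use of the rough clustering result as initial centers are all assumptions.

```python
import numpy as np

def fuzzy_c_means(data, centers, m=2.0, eps=1e-5, max_iter=100):
    """Standard fuzzy c-means; returns (memberships, centers).

    data    : (n, d) samples (here: feature vectors of the points)
    centers : (k, d) initial centers, e.g. from a rough clustering
    """
    data = np.asarray(data, dtype=float)
    centers = np.asarray(centers, dtype=float).copy()
    u = None
    for _ in range(max_iter):
        # Squared distance of every sample to every center, shape (n, k).
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        # Membership update; each row is normalized to sum to 1.
        inv = d2 ** (-1.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        # Center update: mean of the samples weighted by membership^m.
        w = new_u ** m
        centers = (w.T @ data) / w.sum(axis=0)[:, None]
        # Stop when the largest membership change falls below eps.
        converged = u is not None and np.abs(new_u - u).max() < eps
        u = new_u
        if converged:
            break
    return u, centers
```

Normalizing each row of the membership matrix enforces the constraint that the memberships of a single sample sum to 1, and the stopping test mirrors the error threshold on the change in membership between consecutive iterations.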

Experiments
In this section, we verify the clustering effect of the DPCTF algorithm through comparative experiments. We describe the experimental environment, the data types and the analysis of results in turn.

Environmental Design
We experimented on a computer with an i7-6700 @ 3.40 GHz CPU, 12 GB of RAM and an NVIDIA GeForce GTX 1050 Ti (4096 MB) GPU. The software used was MATLAB R2016b and PyCharm 2018. We calculated the clustering effect obtained by the CFSFDP algorithm without topological features and the corresponding clustering effect after adding topological features, and then compared the two.
In comparison with the traditional CFSFDP, the performance on several synthetic data sets was evaluated. Table 1 and Figure 7 show the details of the data used. As different data sets have different domains, we normalized the characteristics of all data to ensure the consistency of the data sets and to test the algorithm under the same conditions.

Table 1. Details of the synthetic data sets.
Data set       Instances   Attributes   Classes
Jain           373         3            2
Spiral         312         3            3
Flame          240         3            2
Aggregation    788         3            7

Figure 7. (a) Jain data, which has two classes, one relatively dense and the other relatively sparse, which is challenging for clustering to some extent; (b) spiral data, which is composed of three curve-shaped classes, on which K-means, DBSCAN and other clustering algorithms are not effective enough, making it a challenging test of our ideas; (c) flame data set, with evenly distributed data points, which tests the performance of the clustering algorithm on non-discontinuous data sets; (d) aggregation data set, composed of seven classes, which tests the ability of the algorithm to deal with multiple categories. The above data sets cover most of the challenges faced by common clustering algorithms, and we verify our algorithm by its actual clustering effect on them.

Results
Through comparative experiments, we achieved the following results.

Discussion
Through experimental comparison, we found that the DPCTF algorithm is better than the CFSFDP algorithm at processing data points that are separated by long distances. In the 'jain' data set, the data points present two distribution conditions: the points in the upper part are relatively sparsely spaced, and the points in the lower part are relatively densely spaced. If the $d_c$ setting is too large, the original two clusters are grouped into one cluster, while if the setting is too small, the data is divided into three clusters, as shown in Figure 8a. The DPCTF algorithm with topological features expresses each data point by a vector obtained through DeepWalk, which strengthens the connection between the data. According to the clustering results in Figure 8b, adding topological features improves the accuracy of clustering. The comparison experiment on the flame data also confirms our conjecture. The two data points on the lower right are classified as outliers by the CFSFDP algorithm, while DPCTF assigns them to the cluster with similar topological characteristics. We conclude that the DPCTF clustering algorithm can assign data points that lie far from the cluster center, but not far enough to form new clusters, to clusters with similar characteristics. In order to verify the relationship between the clustering effect of DPCTF and the topological feature dimension, we conducted a comparative experiment on the 'jain' data set, which shows obvious differences in clustering results. The experimental method is as follows: we obtained the topological features of the graph of the 'jain' data set through DeepWalk and obtained the one-dimensional, two-dimensional and four-dimensional topological features of each data point by setting the output dimension parameter. By combining features of different dimensions with the original data, we obtained the results shown in Figure 9.
It can be clearly seen from the clustering distribution map that the clustering results are better when one-dimensional topological features are selected. In order to verify the robustness of the algorithm, we chose a data set from the University of California Irvine (UCI) repository that is commonly used for clustering verification. We took the 'iris' data set as an example. The 'iris' data consists of measurements of flowers from three species of the iris plant. The number of pattern categories is three and the feature dimension is four. Each category has 50 pattern samples, for 150 samples in total. With the DPCTF algorithm, we obtained accurate classification results, and the accuracy rate was higher than that of the CFSFDP algorithm. The classification results are shown in Figure 10. After unifying the data dimensions, we selected the first two columns of data for two-dimensional display. In general, the DPCTF algorithm obtains better clustering results than the CFSFDP algorithm on the same data. At the same time, researchers need a good understanding of the data in order to choose the optimal graph acquisition strategy. The DPCTF algorithm solves the problem of difficult-to-obtain topological features by means of the DeepWalk method. By selecting the feature dimension and the graph acquisition strategy, the precision of the clustering is improved, which lays a foundation for subsequent automatic clustering and image segmentation.

Conclusions
The results show that data points augmented with topological features achieve better clustering performance. Most clustering algorithms consider only the distance characteristics between data points. In the course of this study, we found that clustering data behaves similarly to social network data: when the topological characteristics are represented, points that share topological relationships form a clustered distribution. Since current clustering algorithms ignore topological features, we combined a clustering algorithm with the topological characteristics obtained by DeepWalk, a method designed for social networks. By adding the topological characteristics, more precise clustering is realized.
In this paper, the DPCTF algorithm is proposed to improve the CFSFDP algorithm, following a review of the classical clustering algorithms. Firstly, the topological characteristics of the original data are obtained by preprocessing the original data; the clustering accuracy is then improved by combining the CFSFDP algorithm with the topological characteristics. Experimental results show that our DPCTF method outperforms the traditional CFSFDP method on many data sets.
In general, the proposed DPCTF has the following advantages: it can effectively link the deep features of the clustering data and improve the clustering accuracy, and it is robust to different data types. At the same time, the graphs for different clustering data currently have to be constructed manually. By testing different data, we found that the way the graph is obtained differs between data types; for example, for a clustering-based image segmentation algorithm applied to images after superpixel processing (achieved by defining the color characteristics of each image block), the graph construction method needs to capture the intrinsic relationship between image blocks. As a result, the method needs more testing and validation to find a unified standard for how the graph is generated, which lays the foundation for our next study.
In future research, we will continue our study in this direction, combining clustering algorithms that add topological characteristics with image segmentation techniques, with the eventual aim of segmenting medical images and natural images.