Clustering is one of the most important research topics in the machine learning and data mining communities. It aims at grouping samples into several clusters such that samples in the same cluster are similar, while samples in different clusters are dissimilar. From a practical perspective, clustering has been applied in many areas, such as genomic data analysis [1], image segmentation [2], social network analysis [3], market analysis [4] and anomaly detection [5]. However, with the development of information technology, traditional clustering algorithms face two challenges: the structure of samples has become much more complex than before, and the number of samples has increased sharply. Therefore, it is necessary to develop advanced clustering algorithms to address these challenges.
To mitigate the first challenge, Ester et al. [6] utilized the density of points to explore the underlying clusters of complex data. Specifically, the density of a sample is defined by counting the number of samples within a region of specified radius around it; samples whose density is above a specified threshold form clusters. Another similar algorithm is clustering by fast search and find of density peaks [7]. This algorithm is based on two assumptions: a cluster center corresponds to a sample with high local density, and it lies far from other cluster centers. Under these assumptions, the algorithm constructs a decision graph, whose vertical axis reflects the local density of each sample and whose horizontal axis reflects the distance between a sample and its nearest sample with higher local density. Based on this graph, the algorithm manually selects samples with both high local density and large distance from other samples as cluster centers. Then, it assigns each non-center sample to its nearest cluster center. However, it is difficult to correctly determine the density of samples, and the performance of these algorithms on large numbers of samples is not satisfactory.
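The density-peaks procedure above can be sketched as follows. This is a minimal illustration, not the reference implementation: we use a Gaussian kernel for the local density (the basic form in [7] simply counts samples within radius `dc`), and we pick the centers automatically by the largest density-distance products instead of selecting them manually from the decision graph.

```python
import numpy as np

def density_peaks(X, dc, n_centers):
    """Sketch of clustering by fast search and find of density peaks [7].

    dc is the neighbourhood radius; n_centers is fixed here, whereas the
    original algorithm selects centers by hand from the decision graph.
    """
    n = len(X)
    # pairwise Euclidean distances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # local density; a Gaussian kernel is used here (the paper's basic
    # form counts samples within radius dc)
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0
    # delta: distance to the nearest sample with strictly higher density
    delta = np.zeros(n)
    for i in range(n):
        higher = np.flatnonzero(rho > rho[i])
        if higher.size == 0:           # the global density peak
            delta[i] = d[i].max()
        else:
            delta[i] = d[i, higher].min()
    # centers: samples with the largest rho * delta products
    centers = np.argsort(rho * delta)[-n_centers:]
    # assign every sample to its nearest center
    labels = centers[np.argmin(d[:, centers], axis=1)]
    return labels, centers
```

On well-separated blobs, the density peak of each blob has both a high density and a large `delta` (the distance to the other blob), so the two peaks dominate the product and one center is chosen per blob.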
To combat the second challenge, Frey et al. [8] proposed a clustering technique called affinity propagation clustering (APC), which propagates affinity messages between samples to search for a high-quality set of clusters [9]. APC has shown its usefulness in image segmentation [10], gene expression analysis [12] and text summarization [13]. Xiao et al. [14] proposed a semi-supervised APC that utilizes pairwise constraints (a must-link constraint means two samples are known to be in the same cluster, and a cannot-link constraint means two samples belong to different clusters [15]) to adjust the similarity between samples. It is a non-trivial task for APC to choose a suitable p (the predefined preference of choosing a sample as a cluster center). To mitigate this issue, Wang et al. [16] suggested adaptively scanning preference values in the search space to seek the optimal clustering results. Xia et al. [17] put forward two variants of APC based on local and global techniques to speed up APC. The local technique accelerates APC by decreasing the number of iterations and decomposing the original similarity matrix between samples into sub-matrices. The global technique randomly samples a subset of instances from the original dataset and takes these sampled instances as landmark instances (the others as non-landmark instances). Next, it applies APC on the landmark instances to optimize centers and then assigns each non-landmark instance to its closest center. It repeats the above steps for a predefined number of iterations and keeps the best final result. Recently, Serdah et al. [18] proposed another efficient algorithm which not only improves the accuracy but also accelerates APC on large-scale datasets. This algorithm employs random fragmentation or k-means to divide the samples into several subsets. Then, it uses k-affinity propagation (KAP) [19] to group each subset into k clusters and obtains the local cluster exemplars (the representative point of each cluster) in each subset. Next, the inverse weighted clustering algorithm [20] is applied on all local cluster exemplars to select well-suited global exemplars for all the samples. Finally, each sample is assigned to the respective cluster based on its similarity with respect to the global exemplars.
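For reference, the core message-passing loop of APC [8] can be sketched as below. This is a bare-bones illustration under common defaults (shared preference p on the diagonal, damping factor 0.5, a fixed iteration budget), not a faithful reproduction of any of the variants above.

```python
import numpy as np

def affinity_propagation(S, p, n_iter=200, damping=0.5):
    """Minimal sketch of affinity propagation [8].

    S is an n x n similarity matrix (e.g. negative squared Euclidean
    distance); p is the shared preference placed on the diagonal.
    """
    n = S.shape[0]
    S = S.astype(float).copy()
    np.fill_diagonal(S, p)
    R = np.zeros((n, n))   # responsibilities r(i, k)
    A = np.zeros((n, n))   # availabilities  a(i, k)
    idx = np.arange(n)
    for _ in range(n_iter):
        # r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[idx, best]
        AS[idx, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[idx, best] = S[idx, best] - second
        R = damping * R + (1 - damping) * R_new
        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())       # keep r(k,k) unclipped
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()           # a(k,k) is not clipped
        A_new = np.minimum(A_new, 0)
        A_new[idx, idx] = diag
        A = damping * A + (1 - damping) * A_new
    # exemplars: samples whose self-evidence a(k,k) + r(k,k) is positive
    E = (A + R).diagonal()
    exemplars = np.flatnonzero(E > 0)
    if exemplars.size == 0:
        exemplars = np.array([int(E.argmax())])
    labels = exemplars[np.argmax(S[:, exemplars], axis=1)]
    labels[exemplars] = exemplars
    return labels, exemplars
```

Note that the assignment in the last step relies on the input similarity S directly, which is exactly the Euclidean-distance dependence that the path-based variants discussed next try to remove.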
APC and the aforementioned variants of APC, however, are only suitable for datasets with simple structures (i.e., non-overlapping or non-intersecting clusters), since they mainly depend on the Euclidean distance between samples. When dealing with datasets with complex structures, the performance of these approaches degrades sharply. Walter et al. [21] suggested a path-based similarity for APC (pAPC) to explore the complex structure of samples. Particularly, for a pair of samples, pAPC finds the longest segment of each path directly or indirectly connecting the two samples. It then chooses the shortest segment among all the found segments and takes the length of that segment as the path-based similarity between the pair. Next, pAPC inputs this similarity into APC to cluster the samples. Zhang et al. [22] proposed APC based on geodesic distances (APCGD). APCGD uses the negative geodesic distance between samples, instead of the negative Euclidean distance, as the similarity between samples. These approaches improve the performance of APC on datasets with complex structures. However, their performance is unstable, APCGD involves tuning more than two scale parameters, and both of them are very time consuming. Guo et al. [23] applied k-path edge centrality [24] on social networks to efficiently compute the similarity between network nodes and then employed APC based on that similarity to discover communities.
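The path-based quantity used by pAPC [21] is the minimax distance: for each pair of samples, take the longest edge of every connecting path and keep the minimum over paths. A standard way to compute it for all pairs, sketched below, is a Floyd-Warshall-style relaxation over the complete distance graph; this is our own illustration, not code from any of the cited papers.

```python
import numpy as np

def minimax_distance(D):
    """All-pairs path-based (minimax) distance, as used by pAPC [21].

    D is an n x n matrix of pairwise Euclidean distances.  The
    path-based distance between i and j is the smallest, over all paths
    from i to j, of the longest edge on the path.  The Floyd-Warshall
    style update below computes it in O(n^3).
    """
    M = D.copy()
    for k in range(M.shape[0]):
        # relax every pair through intermediate point k: a path via k is
        # as long as its longer half
        M = np.minimum(M, np.maximum(M[:, k][:, None], M[k, :][None, :]))
    return M
```

For points 0, 1, 2, 10 on a line, the minimax distance between 0 and 2 is 1 (the longest hop on the path 0-1-2), while the distance between 0 and 10 is 8 (the unavoidable gap between 2 and 10), so samples connected by a chain of short hops stay close even when they are far apart in Euclidean terms.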
To improve the performance of APC on samples distributed with complex structures (as shown by the toy examples in the Experiment section), we propose an APC based on path-based similarity (APC-PS for short). Different from pAPC, we first apply APC based on the negative Euclidean distance to find exemplars (or cluster centers). Next, we introduce the path-based similarity to explore the underlying structure between samples and exemplars, and then assign each non-exemplar sample to its closest exemplar using the path-based similarity. Our study on toy examples shows that the proposed path-based similarity can better capture the underlying structure of samples than the widely used similarity based on negative Euclidean distance. The empirical study on publicly available datasets demonstrates that APC-PS achieves better results than the original APC, pAPC, APCGD and other representative clustering algorithms, and that it is more robust to the input value of p than APC.
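The assignment step we describe could be sketched as follows. This is only an illustration of the idea under our own naming: the exemplars are assumed to have already been found by plain APC on the negative Euclidean similarity, and each sample is then attached to the exemplar with the smallest path-based (minimax) distance rather than the smallest Euclidean distance.

```python
import numpy as np

def assign_by_path_similarity(X, exemplars):
    """Sketch of a path-based assignment step: attach every sample to
    the exemplar with the smallest minimax distance (the function name
    and signature are illustrative, not from the paper)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # minimax distances via a Floyd-Warshall-style relaxation
    M = D.copy()
    for k in range(len(X)):
        M = np.minimum(M, np.maximum(M[:, k][:, None], M[k, :][None, :]))
    exemplars = np.asarray(exemplars)
    return exemplars[np.argmin(M[:, exemplars], axis=1)]
```

On elongated or chain-shaped clusters, a sample at the tip of a chain is linked to its own exemplar by many short hops, so the minimax distance keeps it in the correct cluster even when the Euclidean distance to a foreign exemplar is comparable.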
The rest of this paper is organized as follows. APC and the proposed APC-PS are introduced in Section 2. Section 3 provides the experimental results on synthetic datasets and UCI datasets [25]. Conclusions and future work are presented in Section 4.