Abstract
K-Means is a well-known algorithm for unsupervised clustering, widely used for its simplicity and efficiency. Its long-standing popularity has stimulated researchers to investigate its properties further. A critical property is the strong dependence of K-Means on the seeding method used to initialize the centroids: poor initialization causes K-Means to get stuck in a sub-optimal local solution. This paper proposes DPCC (Density Peaks of Candidate Centroids), a novel seeding method for K-Means. DPCC rests on genetic concepts and density peaks to define an initial solution close to the optimal one. First, a population of J elitist candidate solutions is built, that is, solutions capable of yielding a reduced clustering cost. Although none of these individual solutions may be close to the optimal one, the candidate centroids, as experimentally confirmed, tend to concentrate around the ground truth centroids. Subsequent generations of the population are therefore created by repeating the k-nearest neighbors (kNN) procedure for different values of the parameter k, and by estimating density through the reverse nearest neighbors (RNN) relationship of each centroid. Centroid density peaks are then exploited to rearrange the solutions of the population so as to extract a candidate solution, which is finally optimized by K-Means. The paper describes the design and operation of DPCC, which is currently implemented in parallel Java. The clustering effectiveness of DPCC is demonstrated by applications to both benchmark and real-world datasets, and the results are compared with those of competing algorithms.
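As an illustration of the density estimate mentioned above, the following minimal Java sketch counts, for each candidate centroid, how many other candidates include it among their k nearest neighbors (its reverse nearest neighbors). It is not the DPCC implementation: the class name `RnnDensitySketch`, the method names, the use of Euclidean distance, and the choice of the raw RNN count as the density value are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper class, not part of the DPCC code base.
final class RnnDensitySketch {

    // Squared Euclidean distance (assumed metric) between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Indices of the k nearest neighbors of points[i], excluding i itself.
    static int[] kNearest(double[][] points, int i, int k) {
        Integer[] others = new Integer[points.length - 1];
        int pos = 0;
        for (int j = 0; j < points.length; j++) {
            if (j != i) {
                others[pos++] = j;
            }
        }
        // Arrays.sort on objects is stable, so equidistant points keep index order.
        Arrays.sort(others,
                Comparator.comparingDouble((Integer j) -> squaredDistance(points[i], points[j])));
        int[] result = new int[Math.min(k, others.length)];
        for (int n = 0; n < result.length; n++) {
            result[n] = others[n];
        }
        return result;
    }

    // counts[j] = number of points that list j among their k nearest neighbors.
    // A larger count is read here as a higher local density (an assumption of this sketch).
    static int[] rnnCounts(double[][] points, int k) {
        int[] counts = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            for (int j : kNearest(points, i, k)) {
                counts[j]++;
            }
        }
        return counts;
    }
}
```

Calling `RnnDensitySketch.rnnCounts(points, k)` for several values of k mirrors, in a simplified way, the idea of repeating the kNN procedure with different k; candidates with the highest counts are those around which other candidates gather, while isolated candidates receive a count of zero.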