1. Introduction
Unsupervised clustering [1] is a common technique for mining information from the sample data points of a dataset. Application domains include image segmentation, machine learning, artificial intelligence, medicine, biology, and so forth. The goal is to partition the points of a dataset, X, into groups (called clusters) in such a way that the points of the same cluster are similar to one another, but dissimilar from the points assigned to other clusters. Each cluster is characterized by a representative point called its centroid. The similarity function often coincides with the Euclidean distance when the dataset points are vectors with D numerical coordinates (dimensions):
$$d(p,q)=\sqrt{\sum_{i=1}^{D}(p_i-q_i)^2}$$
An often-used centroid-based clustering algorithm is K-Means [2,3], where the number K of clusters/centroids is an input parameter. K-Means is frequently used because of its simplicity and efficiency [4]. K-Means, though, is better suited to datasets with spherical clusters. In addition, it strongly depends on the initialization of centroids (seeding) [5,6]. A bad initialization quickly causes K-Means to become stuck in a local sub-optimal solution. K-Means acts as a local refiner of centroids and cannot search for centroids globally in the dataset. To mitigate these limitations, careful seeding methods can be used [7,8,9], together with repeating the algorithm many times (Repeated K-Means), each time with a different initialization, in order to obtain an acceptable solution, as reflected by some clustering accuracy measures. Random Swap [10,11] integrates K-Means in its behavior and bases its operation on a global search for centroids. At each iteration, a swap is accomplished, which replaces a randomly chosen centroid with a randomly selected dataset point. Provided a suitable number of swap iterations is executed, Random Swap has been proven able, in many cases, to find a solution close to the optimal one.
A different yet popular clustering algorithm, designed for non-spherical clusters, is based on the concept of density peaks (DP) [12]. DP relies on the observation that, naturally, centroids are points of higher local density, whose neighborhood is made up of points with lower density. In addition, a candidate centroid is a point of higher density that is relatively far from points with an even higher density. Two data point quantities regulate the application of DP: the density (rho) and the minimal distance to the nearest point with a higher density (delta). A fundamental problem is the estimation of the point density. In the original proposal [12], the concept of a cutoff distance d_c (distance threshold) is introduced. Such a distance is the radius of a hyperball in the D-dimensional space. The density is established by (logically) counting the points that fall in the ball. The authors in [12] adopted the rule of thumb of choosing a value of d_c that ensures that about 1% to 2% of the dataset points fall in the hyperball. DP clustering is finalized by considering both the rho and delta values of points. Centroids are chosen as points with high values of both rho and delta. The remaining points are assigned to clusters/centroids according to the delta attribute. However, it has been pointed out that the basic DP approach cannot recognize clusters with an arbitrary geometry.
In the last few years, several proposals have appeared in the literature to improve specifically the estimation of density and thus handle clusters of an arbitrary shape. A notable class of works is based on the k-Nearest Neighbors (kNN) technique, with k as an input parameter. In [13], from the k-th nearest neighbor distance of a data point p, indicated as d_k(p), the kNN neighborhood of p, kNN(p), is first defined: kNN(p) = {q ∈ X : d(p,q) ≤ d_k(p)}, where d(p,q) is the distance between p and q. Then, the local density of p is inferred through a Gaussian exponential kernel. The proposal in [13] adds to the clustering method the principal component analysis (PCA) to deal with the “curse of dimensionality”, that is, the problem of a high number of point coordinates (dimensions) that can cause the Euclidean distance to lose its meaning. A (hopefully) reduced number of coordinates is preliminarily detected, which preserves data variations and the consistency of the Euclidean distance.
The kNN approach was further specialized with the concepts of reverse and natural Nearest Neighbors [14,15]. From the k-th nearest distance of a point p, d_k(p), the Reverse Nearest Neighbors (RNN) of p, RNN(p), is defined as the set of points q which have p within their distance d_k(q), that is: RNN(p) = {q ∈ X : d(p,q) ≤ d_k(q)}. Two nearest neighbor points, p and q, are said to be natural neighbors when, as in the friendship relationship of human societies, each one belongs to the RNN of the partner: p ∈ RNN(q) ∧ q ∈ RNN(p). A further concept concerns the definition of the Mutual Nearest Neighbors (MNN) set of a point p, defined as: MNN(p) = kNN(p) ∩ RNN(p). A benefit of the RNN-based approaches in [14,15] is the possibility of inferring a suitable value of k in the algorithm, by automatically checking increasing search radius values, starting from 2, until a stable search state is reached, where the friendship relation among points stabilizes.
A common problem of density estimation by kNN-based approaches is the high computational cost, O(N²), due to the need to compute all the pairwise distances of points. Such a cost can limit the applicability of kNN to large datasets. Fränti et al. [16,17] developed a fast density peaks algorithm (FastDP) based on the approximate construction of the kNN graph through an efficient divisive technique. The divisive technique randomly selects two points in the dataset, and then the dataset points are partitioned into two groups according to which of the two points is nearer. The two subsets are recursively split until their size falls under a given threshold, in which case the corresponding kNN subgraph is directly built (brute force). The divisive technique is repeated many times in the algorithm’s first part, until the number of edge changes in the last iteration becomes smaller than a given tolerance. In the second part, the resulting subgraphs are combined by neighborhood propagation (the NN-Descent method [16,18]), that is, by evaluating neighbors of neighbors. Instead of creating the graph from scratch, the previous graph is updated by reviewing the k-Nearest Neighbors of each point in light of the information that emerged in the latest solution. The approximate kNN graph is finally used to estimate each point’s density and delta values. Density is evaluated as the inverse of the mean distance to the k-Nearest Neighbors. As demonstrated in [16], FastDP can handle different data types, provided a distance function is available.
This paper proposes ParDP, a density peaks-based algorithm and tool that depends on kNN to estimate density. ParDP does not build an approximate kNN graph as in [16]. Instead, the O(N²) cost is smoothed by exploiting Java parallelism [4,9,11,19].
A preliminary version of ParDP, where a suitable global cutoff distance d_c was initially tentatively tuned, was presented as a conference paper [20]. With respect to the conference paper, the following new contributions are added.
Using the kNN approach for inferring points’ density and delta values. In particular, ParDP can exploit kNN for inferring a global cutoff distance d_c, from which a Gaussian kernel evaluates densities. Alternatively, densities can be directly estimated from the kNN distances.
Supporting the principal component analysis (PCA) for (hopefully) reducing the high number of dimensions of some datasets while preserving the meaning of the Euclidean distance.
Optimizing the use of Java parallel streams [19] in all the basic steps of ParDP, ranging from the evaluation of the density and delta values, to the final clustering, to the implementation of various clustering measures.
Applying ParDP to several benchmark and real-world datasets.
Adapting the Java implementation of ParDP to also realize, PCA included, the Du et al. [13] density peaks algorithm (here referred to as ParDPC-KNN), as well as a density peaks algorithm based on the Mutual Nearest Neighbors (MNN) approach [19] (here referred to as ParDPC-MNN).
Verifying ParDP correctness and performance by comparing it to ParDPC-KNN and ParDPC-MNN.
This paper is structured as follows.
Section 2 reviews the fundamental concepts of the basic density peaks clustering algorithm (DP).
Section 3 discusses the design and implementation aspects of the proposed ParDP algorithm.
Section 4 provides a summary of the clustering accuracy indexes used in the practical experiments.
Section 5 is devoted to the experimental work. ParDP is applied to 23 challenging synthetic and real-world datasets, split into three groups. The clustering results are compared with those achieved by competing tools. Finally,
Section 6 draws some conclusions about the developed work and indicates directions for future work.
2. Density Peaks-Basic Algorithm
The density peaks basic algorithm (DP) [12] rests on the intuition that centroids are points of higher local density, surrounded by points of lower density. As a consequence, a centroid is relatively distant from other points of a higher density. To formalize this intuition, two quantities are introduced for each point x_i: its density (ρ_i) and its nearest distance (δ_i) to a point with a higher density. Such a nearest point is called, in [16,17], the big brother of x_i. It is evident that, for a point to be a candidate centroid, it should have a high value of both ρ_i and δ_i. Equivalently, a candidate centroid should have a high value of a third quantity (gamma), γ_i = ρ_i · δ_i. In this paper, as in [16], the gamma strategy is adopted for choosing the centroids.
Unlike, e.g., K-Means [2,3], the number K of centroids is not necessarily an input parameter for DP. The value of K can, in many cases, be deduced by using the so-called decision graph, which depicts δ vs. ρ or, better, γ vs. ρ. Candidate centroids are expected to naturally emerge as the points with the highest values of γ in the decision graph. In the case the dataset is (logically) sorted by decreasing γ values, as in this work, candidate centroids are the first K points after which a sudden decrease (knee point) occurs in the plot of γ vs. the dataset point indexes. However, reading the centroids from the decision graph is not always clear. In addition, some datasets come with a desired value of K, which suggests picking the first K points in the dataset sorted by decreasing γ.
A fundamental problem in DP is the definition of the points’ density. In [12], the cutoff distance d_c is assumed, that is, the common radius of a hyperball in the D-dimensional space, centered on any point x_i. All the points that fall within d_c establish the density ρ_i:
$$\rho_i=\sum_{j \ne i} \chi(d_{ij}-d_c)$$
where d_ij is the distance between x_i and x_j, and χ(x) = 1 if x < 0; otherwise, χ(x) = 0. In reality, in the implementation of DP, the authors of [12] used a Gaussian kernel instead for estimating the density:
$$\rho_i=\sum_{j \ne i} e^{-\left(\frac{d_{ij}}{d_c}\right)^{2}}$$
Such a choice [21] can be preferable because it avoids points having the same density, as may occur when counting with the cutoff distance. The following defines the delta feature of a point x_i:
$$\delta_i=\begin{cases}\min_{j:\,\rho_j>\rho_i} d_{ij}, & \text{if } \exists j:\rho_j>\rho_i\\ \max_{j} d_{ij}, & \text{otherwise}\end{cases}$$
This definition ensures that the point with the maximum density has its δ set to the maximum distance between itself and any other point in the dataset. Otherwise, δ_i is set to the distance of the nearest point with a higher density.
Following the evaluation of ρ, δ, and then of γ, DP finalizes the clustering in a standard way. First, K centroids are selected, according to the γ values, and then they are assigned a label. After that, the labels of the centroids are propagated to all the remaining points through the rule that a point acquires the label of its big brother.
DP can identify outliers by introducing a threshold ρ_b, so that points that have a density less than ρ_b are potential outliers or noise points. ρ_b qualifies as the maximum density of the points in the halo region of a cluster, which is made of points assigned to this cluster but that are within the cutoff radius d_c of points belonging to other clusters.
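For illustration, the following is a minimal sequential Java sketch of how ρ (Gaussian kernel) and δ (big brother) could be computed; the array names and the precomputed distance matrix are assumptions made for the example, not the actual DP or ParDP code.

// Minimal sketch: Gaussian-kernel density and big-brother distance (delta).
// Assumes a precomputed distance matrix d[i][j] and a cutoff distance dc.
static void rhoDelta(double[][] d, double dc, double[] rho, double[] delta, int[] bigBrother) {
    int n = d.length;
    for (int i = 0; i < n; i++) {            // density: sum of Gaussian contributions
        double r = 0.0;
        for (int j = 0; j < n; j++)
            if (j != i) r += Math.exp(-(d[i][j] / dc) * (d[i][j] / dc));
        rho[i] = r;
    }
    for (int i = 0; i < n; i++) {            // delta: distance to the nearest denser point
        double best = Double.MAX_VALUE, dmax = Double.MIN_VALUE;
        int bb = -1;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            if (d[i][j] > dmax) dmax = d[i][j];
            if (rho[j] > rho[i] && d[i][j] < best) { best = d[i][j]; bb = j; }
        }
        delta[i] = (bb >= 0) ? best : dmax;  // densest point: delta = max distance
        bigBrother[i] = bb;                  // -1 marks the (unique) densest point
    }
}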
3. Development of the ParDP Algorithm
The design of ParDP was influenced by the basic DP work [12] and by works on kNN clustering [13,14,15,16,22]. The concepts of DP are inherited, namely the ρ, δ, and γ features of the dataset points and the standard recursive clustering algorithm. Differences from DP are introduced in the way the points’ density is evaluated.
Two kNN-based approaches are supported. As a preliminary task, the first k distinct distances to the nearest neighbors are determined for each data point. Then the average of these k distances is computed. In the first approach, similar to the work described in [16], the point density (mass divided by volume) is immediately estimated as the inverse of the average distance. In the second approach, the average distance is maintained in the data point as a local cutoff distance d_c(p). When all the dataset points have their local d_c in place, a global d_c is defined by sorting all the local d_c values in ascending order and extracting the median value as the common radius [12,22], from which the points’ density is then estimated through a Gaussian kernel.
Both approaches naturally tend to limit the influence of outliers on the clustering process. For an outlier, the average of the first k nearest distances tends to be higher than that of non-outlier points. As a consequence, in the first approach, the density of the outlier is minimal, and the point has a lower probability of being chosen as a centroid. In the second approach, the choice of the median value of the local d_c(s), paired with the use of a Gaussian kernel, again ensures that the density measure of the outlier point is very small.
The validity and efficacy of each supported kNN approach depend ultimately on the cluster shapes of the handled dataset, as confirmed experimentally (see later in this paper).
Algorithm 1 provides a pseudo-code of ParDP’s behavior. For some datasets, it can be helpful to preliminarily scale the dataset point coordinates, e.g., by normalization, by dividing each coordinate by the maximum value of all features, and so forth. Steps such as pca(), kNN(), delta(), gamma(), and clustering() directly correspond to Java methods in the ParDP implementation.
Algorithm 1. Pseudo-code of basic ParDP operations.

Input: dataset X and (if any) ground truth centroids or partitions
  scale the data, if required
  pca()
  kNN()
  delta()
  gamma()
  if (a decision graph is required)
    Output: persist the rho, delta, and gamma values of the data points
  else
    clustering()
    Output: some accuracy measures of the achieved clustering solution
3.1. Step pca()
The following steps of principal component analysis [13,23] detect the most important coordinates of the data points.
Make each coordinate dimension (feature) have a zero mean.
Define the covariance matrix C of the matrix X associated with the dataset.
Compute the eigenvalues λ_j, that is, the diagonal elements of matrix C, by the inner products:
$$\lambda_j=c_{jj}=\frac{1}{N}\sum_{i=1}^{N} x_{ij}\,x_{ij},\quad j=1,\ldots,D$$
Extract the vector of eigenvalues from the matrix C.
Sort the eigenvalues vector by decreasing values.
Select the first D' ≤ D sorted eigenvalues that ensure, e.g., that most of the total variance is retained:
$$\frac{\sum_{j=1}^{D'}\lambda_j}{\sum_{j=1}^{D}\lambda_j}\ge t$$
for a given threshold t (e.g., 95%).
Practically, only the indexes of the most important coordinates (those having higher eigenvalues) are retained during the sorting of the eigenvalues. Such indexes are then used when computing the Euclidean distance between two data points.
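The following is a minimal Java sketch of this coordinate-selection procedure under the assumptions above (zero-mean features, per-feature variances used as eigenvalue estimates, and a retained-variance threshold); the method and variable names are illustrative, not those of the actual ParDP code.

import java.util.Arrays;
import java.util.Comparator;

// Sketch of the pca() step: select the indexes of the most "important" coordinates.
// data[i][j] is the j-th coordinate of the i-th point; t is the retained-variance fraction.
static int[] selectCoordinates(double[][] data, double t) {
    int n = data.length, d = data[0].length;
    double[] mean = new double[d], var = new double[d];
    for (double[] p : data)                       // per-feature mean
        for (int j = 0; j < d; j++) mean[j] += p[j] / n;
    for (double[] p : data)                       // per-feature variance (zero-mean data)
        for (int j = 0; j < d; j++) var[j] += (p[j] - mean[j]) * (p[j] - mean[j]) / n;
    Integer[] idx = new Integer[d];
    for (int j = 0; j < d; j++) idx[j] = j;
    Arrays.sort(idx, Comparator.comparingDouble((Integer j) -> var[j]).reversed());
    double total = Arrays.stream(var).sum(), acc = 0.0;
    int dPrime = 0;
    while (dPrime < d && acc / total < t) acc += var[idx[dPrime++]];
    return Arrays.stream(idx, 0, dPrime).mapToInt(Integer::intValue).toArray();
}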
3.2. Step kNN()
This step realizes the two approaches to the density evaluation of ParDP. A global boolean parameter permits kNN() to adapt its behavior to one of the two approaches. The value of the parameter k is important. Algorithm 1 is supposed to be repeated for different values of k, whose “best” value must be tuned to the dataset. For example, several datasets used in the experiments reported in this paper were investigated by using the range [5…200] for k. The effects of a particular value of k on the clustering can be checked on the decision graph or by observing some clustering measures (see later in this paper). The execution of the kNN() step (see also Section 2) assigns to each point p its estimated density, but also the maximal distance of p from any other point in the dataset. Such information is used by the subsequent delta() step. For each value of k, the kNN() step costs O(N²) to compute all the pairwise distances of points. To this cost, the heap sorting cost, O(N log N), of the dataset by ascending local cutoff distances of data points is added if the cutoff approach of ParDP is chosen. The dominant O(N²) cost is smoothed by using Java parallel streams (see Section 3.6).
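Algorithm 2 in Section 3.6 shows the non-cutoff approach; the following minimal sketch outlines the cutoff-based alternative (local average kNN distance per point, global d_c as the median, then Gaussian densities). The array-based representation and names are assumptions made for illustration and do not necessarily match the ParDP source.

import java.util.Arrays;
import java.util.stream.IntStream;

// Sketch of the cutoff-based density estimation.
// avgKnnDist[i] is the average of the k nearest (distinct) distances of point i,
// as produced by the first part of the kNN() step; d[i][j] are pairwise distances.
static double[] cutoffDensities(double[][] d, double[] avgKnnDist) {
    int n = d.length;
    double[] sorted = avgKnnDist.clone();
    Arrays.sort(sorted);                              // ascending local cutoff distances
    double dc = sorted[n / 2];                        // global dc = median local value
    double[] rho = new double[n];
    IntStream.range(0, n).parallel().forEach(i -> {   // each i only writes rho[i]
        double r = 0.0;
        for (int j = 0; j < n; j++)
            if (j != i) r += Math.exp(-(d[i][j] / dc) * (d[i][j] / dc));
        rho[i] = r;
    });
    return rho;
}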
3.3. Step delta()
This step is responsible for assigning the information about the big brother to each data point. Toward this goal, first, the dataset is sorted by decreasing values of the points’ density. Then, for each point p, the points located at its left in the ordered sequence of data points are considered, and the distance d(p,q) to each of them is evaluated. Finally, the identity (index) and the distance of the point nearest to p are stored in p as its big brother. In the case of the point that has the maximal density, its big brother is set, by convention, to the point that has the maximal distance from it, as determined in the kNN() step.
In this step, dataset points are sorted by descending values of the density feature. In the next gamma() step, the dataset has to be sorted by decreasing values of the gamma feature of the data points. To cope with these conflicting needs, and also to maintain the integrity of some fundamental information, like the index of the big brother of a point, ParDP behaves as follows. In no case are points moved within the dataset during a sorting operation. Rather, an index data structure is used that indirectly references the points of the dataset, so that the effects of sorting are reflected only in the references kept in the index structure. At the time of fixing the identity of the big brother of a point p, the absolute, never-changing physical index (that each point stores in itself) of the big brother (iDelta) and its distance (delta) are stored in p. This technique makes it possible to sort the dataset points, at different times and according to different ordering criteria.
The delta() step mainly costs O(N log N) for heap sorting the dataset by the descending density of the points, to which the cost of searching the big brother of each point is added.
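A minimal sketch of the indirect index sorting and big-brother search follows; the array names (iDelta, dMax) and the use of an Integer index array are illustrative assumptions, not the actual ParDP code.

import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.IntStream;

// Sketch of the delta() step with an indirect index sort: points are never moved,
// only the index array is permuted. rho[] and dMax[] come from the kNN() step;
// iDelta[] (big-brother index) and delta[] are filled here.
static void delta(double[][] d, double[] rho, double[] dMax, int[] iDelta, double[] delta) {
    int n = rho.length;
    Integer[] index = new Integer[n];
    for (int i = 0; i < n; i++) index[i] = i;
    Arrays.sort(index, Comparator.comparingDouble((Integer i) -> rho[i]).reversed());
    IntStream.range(1, n).parallel().forEach(pos -> {   // position 0 is the densest point
        int p = index[pos];
        double best = Double.MAX_VALUE;
        int bb = -1;
        for (int left = 0; left < pos; left++) {         // only denser points lie at the left
            int q = index[left];
            if (d[p][q] < best) { best = d[p][q]; bb = q; }
        }
        iDelta[p] = bb;                                  // physical index of the big brother
        delta[p] = best;
    });
    int top = index[0];
    delta[top] = dMax[top];                              // densest point: delta = max distance
    iDelta[top] = -1;                                    // marker (farthest-point convention omitted)
}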
3.4. Step gamma()
This step sorts the dataset by decreasing γ values, that is, the product of ρ (density) and δ (distance to the big brother) of the data points. Such an ordering is key to inferring the number of clusters suited for a dataset or to assessing the validity of an a priori known value of K. All these activities can be carried out on the decision graph. Three equivalent types of decision graphs can be exploited: δ vs. ρ, γ vs. ρ, or γ vs. the (sorted dataset point) index.
The gamma() step has an O(N log N) cost for heap sorting the dataset by the descending gamma values of the points.
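A compact sketch of this step, under the same indirect-sorting assumption used above, is given below.

import java.util.Arrays;
import java.util.Comparator;

// Sketch of the gamma() step: gamma = rho * delta, indirectly sorted in decreasing order.
// The returned index array drives both the decision graph and the centroid selection.
static Integer[] gamma(double[] rho, double[] delta) {
    int n = rho.length;
    double[] g = new double[n];
    for (int i = 0; i < n; i++) g[i] = rho[i] * delta[i];
    Integer[] index = new Integer[n];
    for (int i = 0; i < n; i++) index[i] = i;
    Arrays.sort(index, Comparator.comparingDouble((Integer i) -> g[i]).reversed());
    return index;   // g[index[pos]] vs. pos is the third form of decision graph
}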
3.5. Step clustering()
This step selects the first K points in the dataset, sorted by decreasing gamma values, as centroids (see Section 3.4); assigns to each centroid a label identical to its index; and then proceeds with the final clustering as in the basic DP algorithm (see Section 2) [12]. In particular, each point is assigned the label (cluster) of its big brother. In the case the big brother is still unassigned, the procedure continues, recursively, by moving to its big brother and so forth, until a centroid or an already assigned big brother is found, from which its label is propagated back to all the points on the recursion return path.
The clustering() step costs about O(N). Overall, the ParDP execution costs O(N²), dominated by the kNN() step.
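The recursive label propagation can be sketched as follows, assuming labels are stored in an int array and -1 marks unassigned points (illustrative names, not necessarily those of the ParDP code).

import java.util.Arrays;

// Sketch of the clustering() step: the first K points by decreasing gamma become centroids,
// then labels are propagated recursively along the big-brother chain.
static void clustering(Integer[] sortedByGamma, int[] iDelta, int[] label, int K) {
    Arrays.fill(label, -1);                            // -1 marks unassigned points
    for (int c = 0; c < K; c++) label[sortedByGamma[c]] = c;
    for (int i = 0; i < label.length; i++) assign(i, iDelta, label);
}

static int assign(int i, int[] iDelta, int[] label) {
    if (label[i] >= 0) return label[i];                // centroid or already assigned point
    // the densest point has the largest delta, so it is normally among the selected centroids
    label[i] = assign(iDelta[i], iDelta, label);       // follow the big brother, propagate back
    return label[i];
}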
3.6. Java Implementation Issues
The power and flexibility of ParDP derive from its implementation, which is based on Java parallel streams and lambda expressions [19]. The same parallel and functional programming mechanisms were previously successfully applied, for example, to improve the execution of K-Means [4] and Random Swap [9,11]. A stream is a view of a data collection (e.g., an array like the dataset managed by ParDP), making it possible to express element operations functionally by lambda expressions. A parallel stream allows the underlying fork/join mechanism to be exploited, which splits the data into multiple segments, processes the segments by separate threads and, finally, combines the partial results generated by the worker threads. A key benefit of parallel streams is that they take advantage of the high-performance computing environment provided by a multi/many-core machine, where lock-free parallelism is used, with the requirement that lambda expressions do not modify shared data.
The concise and elegant programming style ensured by parallel streams and lambda expressions is exemplified in Algorithm 2, where the non-cutoff-based approach of ParDP for estimating the points’ density is shown. First, a stream is opened on the dataset to process points in parallel. Computation is embedded in the map() operation, which works on a stream, transforms it, and returns a new stream. map() is an intermediate operation. The execution of the various map() operations is triggered by a terminal operation, like forEach(). Each map() receives a UnaryOperator<DataPoint> object whose apply() method is formulated by a lambda expression, which receives a point, p; transforms it (p is purposely mapped onto itself); and, finally, returns p. A key point of the realization in Algorithm 2 is that each point, p, only modifies itself (the rho and dMax fields of p are modified). Absolutely no other changes are made to the shared data. The overall programming style only specifies
what to do. The parallel organization and execution of the code is delegated to the built-in fork/join control structure [
19].
Algorithm 2. An excerpt of the kNN() method for the non-cutoff approach to density.

…
Stream<DataPoint> pStream = Stream.of(dataset);
if (PARALLEL) pStream = pStream.parallel();
pStream
  .map(p -> {
    // detect distances to the first distinct k nearest neighbors of p
    double[] dkNN = new double[k];
    for (int j = 0; j < k; ++j) dkNN[j] = Double.MAX_VALUE;
    double dMax = Double.MIN_VALUE; // maximal distance of p to any other point
    for (int i = 0; i < N; ++i) {
      if (i != p.getID()) {
        double d = p.distance(dataset[i]);
        if (d > dMax) dMax = d;
        int j = 0;
        while (j < k) {
          if (d < dkNN[j]) {
            // right shift from j to k - 1
            for (int h = k - 2; h >= j; --h) dkNN[h + 1] = dkNN[h];
            dkNN[j] = d;
            break;
          } else if (d == dkNN[j]) break;
          j++;
        }
      }
    }
    double ad = 0;
    for (int j = 0; j < k; ++j) ad = ad + dkNN[j];
    p.setRho(k / ad); // inverse of average kNN distances
    p.setDMax(dMax);
    return p;
  })
  .forEach(p -> {}); // trigger of map computations
Parallel streams are systematically adopted in ParDP, anywhere parallelism can be exploited. For example, in the delta() step of Algorithm 1, first the dataset is sorted by decreasing density values, then a parallel stream is opened and used to determine the big brother of each point, p. Parallel streams are also used for computing internal or external clustering accuracy measures.
4. Clustering Accuracy Measures
The validity or quality of a clustering solution generated by a given algorithm can be measured by some internal or external indexes [9,24]. Internal indexes are related to the internal composition of the points of clusters. Two well-known indexes in this class are the Sum-of-Squared Errors (SSE) and the Silhouette coefficient (SC) [11,25]. The SSE reveals how close the points assigned to the various clusters c_j are to their relevant centroids μ_j:
$$SSE=\sum_{j=1}^{K}\sum_{x_i \in c_j}\lVert x_i-\mu_j\rVert^{2}$$
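As an illustration of how such measures can be computed with parallel streams in ParDP-like code, a minimal SSE sketch follows; the array-based representation of points, labels, and centroids is an assumption made for the example.

import java.util.stream.IntStream;

// Sketch: SSE computed in parallel over the points; centroids[c] is the centroid of cluster c.
static double sse(double[][] points, int[] label, double[][] centroids) {
    return IntStream.range(0, points.length).parallel()
        .mapToDouble(i -> {
            double[] mu = centroids[label[i]];
            double s = 0.0;
            for (int j = 0; j < mu.length; j++) {      // squared Euclidean distance to centroid
                double diff = points[i][j] - mu[j];
                s += diff * diff;
            }
            return s;
        })
        .sum();
}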
The Silhouette coefficient, SC, is a joint measure of the internal compactness (cohesion) of clusters and their separation, which, in turn, is related to the overlapping degree of the clusters. The silhouette s(x_i) of a point x_i is computed as:
$$s(x_i)=\frac{b(x_i)-a(x_i)}{\max\{a(x_i),\,b(x_i)\}}$$
where a(x_i) is the average distance of x_i from the remaining points of its own cluster, and b(x_i) is the minimal average distance of x_i to all the points of each other cluster. The overall SC is then defined as:
$$SC=\frac{1}{N}\sum_{i=1}^{N}s(x_i)$$
SC ranges in [-1, 1]. A value close to 1 mirrors a good separation of the clusters in the achieved solution. A value close to 0 expresses a high overlapping of the clusters. A value that tends to -1 indicates wrong clustering.
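A straightforward O(N²) sketch of the per-point silhouette is shown below; it assumes a precomputed pairwise distance matrix and integer labels, and is illustrative rather than the actual ParDP implementation.

// Sketch: silhouette s(i) of point i, given pairwise distances d[][] and cluster labels in [0..K-1].
static double silhouette(int i, double[][] d, int[] label, int K) {
    double[] sum = new double[K];                 // sum of distances from i to each cluster
    int[] count = new int[K];                     // cluster sizes (excluding i itself)
    for (int j = 0; j < d.length; j++) {
        if (j == i) continue;
        sum[label[j]] += d[i][j];
        count[label[j]]++;
    }
    double a = count[label[i]] > 0 ? sum[label[i]] / count[label[i]] : 0.0;
    double b = Double.MAX_VALUE;
    for (int c = 0; c < K; c++)                   // minimal average distance to another cluster
        if (c != label[i] && count[c] > 0) b = Math.min(b, sum[c] / count[c]);
    return Math.max(a, b) > 0 ? (b - a) / Math.max(a, b) : 0.0;  // singleton cluster: s = 0
}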
It is worth noting that the SSE or the SC can directly be used as the optimization function in K-Means, Random Swap, and similar algorithms. DP, though, is not driven directly by the optimization of such internal indexes. DP, instead, is better evaluated by the use of external measures of the similarity (or dissimilarity) degree of two solutions, S1 and S2. Each solution is composed of a vector of centroids and a vector of partitions, which are sets of points (indexes) assigned to the same cluster. Labels are the integers in [1..K]. The number K can be different in the two solutions, although it is usually the same value (as assumed in the following). In some cases, one of the two solutions can be a ground truth (GT), correct (“optimal”) solution. The GT can be generated by the designer of a synthetic or benchmark dataset to allow the assessment of the accuracy of a particular solution (S) predicted by a clustering algorithm. In other cases, the ground truth can be a “golden solution” generated by an assumed correct algorithm. A correct solution can be available as a collection of ground truth centroids or of ground truth partition labels assigned to all the points, e.g., of a real-world dataset. The GT mirrors the best partition of the points into clusters.
External quality indexes include the Cluster Index (CI) [26], the Generalized Cluster Index (GCI) [27], the Adjusted Rand Index (ARI) [24], the accuracy index (ACC) [28], and the Normalized Mutual Information (NMI) index [24]. External indexes can be pair-matching or set-matching measures. In pair-matching measures, the centroids of the two solutions can be mapped onto each other and, possibly, in the two directions: from S1 to S2 and from S2 to S1. Alternatively, pairs of points can be matched among the clusters (partitions) of the two solutions. In set-matching measures, partitions are mapped onto each other between the two solutions.
The Cluster Index (CI) maps each centroid of S1 to the nearest one in S2, according to the Euclidean distance. The number of orphans in S2, that is, the number of centroids upon which no centroid of S1 is mapped, is then counted. The CI continues by also mapping the centroids of S2 onto S1 and by counting the resultant orphans in S1. The maximal number of orphans in the two directions of mapping defines the value of the CI. The CI ranges from 0 to K. The case where CI = 0 indicates that S1 is structurally correct with respect to S2. A CI > 0 mirrors the number of wrongly detected clusters/centroids.
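A minimal sketch of the CI computation (orphan counting in both mapping directions) is given below; the representation of centroids as coordinate arrays is an assumption made for the example.

// Sketch: Cluster Index between two centroid sets c1 and c2 (same K assumed).
static int clusterIndex(double[][] c1, double[][] c2) {
    return Math.max(orphans(c1, c2), orphans(c2, c1));
}

// Counts the centroids of "to" upon which no centroid of "from" is mapped.
static int orphans(double[][] from, double[][] to) {
    boolean[] covered = new boolean[to.length];
    for (double[] f : from) {                        // map each centroid to its nearest target
        int best = 0;
        double bestD = Double.MAX_VALUE;
        for (int j = 0; j < to.length; j++) {
            double d = 0.0;
            for (int h = 0; h < f.length; h++) d += (f[h] - to[j][h]) * (f[h] - to[j][h]);
            if (d < bestD) { bestD = d; best = j; }
        }
        covered[best] = true;
    }
    int count = 0;
    for (boolean c : covered) if (!c) count++;
    return count;
}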
The CI can be practically applied to measure the similarity between the centroids of a predicted clustering solution and the ground truth centroids available for the dataset. The set-matching-based Generalized Cluster Index (GCI) extends the CI to the cases where the partitions of a particular solution are mapped to the ground truth partitions available for a dataset. In this case, the Jaccard distance [11,27] between two partitions (sets of points), A and B, can be exploited for establishing the nearness:
$$JD(A,B)=1-\frac{|A \cap B|}{|A \cup B|}$$
The pair-matching-based ARI index [24,29] matches pairs of points between S and GT, practically, from detected to ground truth clusters. It is defined as:
$$ARI=\frac{2(ad-bc)}{(a+b)(b+d)+(a+c)(c+d)}$$
where a counts the number of pairs of points that belong to the same cluster in S and the same cluster in GT; b counts the number of pairs that are in the same cluster in S but in different clusters in GT; c counts the number of pairs that are in distinct clusters in S but in the same cluster in GT; finally, d counts the number of pairs that belong to different clusters in S and different clusters in GT. The four quantities can be computed from the contingency matrix [24]. The ARI index ranges in [0..1]. Good clustering is mirrored by an ARI value close to 1.
The accuracy index ACC is defined as:
$$ACC=\frac{1}{N}\sum_{i=1}^{N}\delta\big(GT(x_i),\,map(S(x_i))\big)$$
where δ(·,·) is the Kronecker operator, which returns 1 if the two arguments are equal, and 0 otherwise. The notation S(x_i) indicates the cluster (partition) of the solution S to which the data point x_i is assigned. The function map(·) maps the partition in S containing x_i to a partition of GT. In this work, the set-matching-based, bi-directional definition of the ACC proposed in [28] is adopted. The measure exploits the contingency matrix, separately built in the two directions of mapping. ACC ranges in [0..1], where values close to 1 mirror good clustering.
The NMI index is an information theoretic measure that can be computed using, again, the contingency matrix [24]. It is defined in terms of the Mutual Information (MI) and the Entropy (H) measures of the partitions of the two solutions, GT and S, to be compared:
$$MI(GT,S)=\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{n_{ij}}{N}\log\frac{N\,n_{ij}}{|gt_i|\,|s_j|}$$
where 0·log 0 is assumed to be 0. The quantity n_ij is the element <i,j> of the contingency matrix. The cardinality of a partition like gt_i (s_j) is also the sum of the elements of the row i (column j) of the contingency matrix. The entropy of a solution, e.g., S, is defined as:
$$H(S)=-\sum_{j=1}^{K}\frac{|s_j|}{N}\log\frac{|s_j|}{N}$$
where the ratio between the cardinality of the partition s_j and the dataset size N expresses the probability of the cluster s_j. Finally, the NMI can be defined, in a case (see [24]), as:
$$NMI(GT,S)=\frac{2\,MI(GT,S)}{H(GT)+H(S)}$$
Also, the NMI ranges in [0..1], with values near 1 denoting good clustering. As noted in [28], although it is not normalized, the ACC measure is very similar to the NMI measure.
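For illustration, a compact Java sketch of the NMI computation from two label vectors, under the above definitions (contingency matrix, natural logarithm, and the 2·MI/(H(GT)+H(S)) normalization), follows.

// Sketch: NMI between a ground truth labeling gt and a predicted labeling s (labels in [0..K-1]).
static double nmi(int[] gt, int[] s, int K) {
    int n = gt.length;
    double[][] cont = new double[K][K];              // contingency matrix
    for (int i = 0; i < n; i++) cont[gt[i]][s[i]]++;
    double[] row = new double[K], col = new double[K];
    for (int i = 0; i < K; i++)
        for (int j = 0; j < K; j++) { row[i] += cont[i][j]; col[j] += cont[i][j]; }
    double mi = 0.0, hGt = 0.0, hS = 0.0;
    for (int i = 0; i < K; i++)
        for (int j = 0; j < K; j++)
            if (cont[i][j] > 0)                      // 0*log0 is taken as 0
                mi += (cont[i][j] / n) * Math.log(n * cont[i][j] / (row[i] * col[j]));
    for (int i = 0; i < K; i++) if (row[i] > 0) hGt -= (row[i] / n) * Math.log(row[i] / n);
    for (int j = 0; j < K; j++) if (col[j] > 0) hS -= (col[j] / n) * Math.log(col[j] / n);
    return 2.0 * mi / (hGt + hS);
}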
All the above accuracy indexes are implemented in Java and can be used to assess the quality of the clustering solutions computed by ParDP and similar tools. For simplicity, though, the ACC measure is preferred to the NMI index in the experiments reported in this paper.
6. Conclusions
Density peaks-based clustering (DP) [12] proves effective in the handling of datasets with arbitrary cluster shapes. This paper proposes a new algorithm named ParDP, based on the k-Nearest Neighbors (kNN) technique [13,16], which can flexibly host different strategies for evaluating density, which is the core issue in DP operations. ParDP is currently implemented in Java using parallel streams and lambda expressions. The use of parallelism is key to smoothing out the kNN computational cost of O(N²) required to compute all pairwise distances among points. This paper describes the design of ParDP, which includes the principal component analysis (PCA) for possibly reducing the number of point dimensions (coordinates) strictly needed for calculating the Euclidean distance. ParDP modularity is exploited, in a significant case, for building two alternative tools: ParDPC-kNN, based on the proposal in [13], and ParDPC-MNN, which rests on the concept of Mutual Nearest Neighbors (MNN) advocated in [14,15] for defining the point densities. These two tools, also equipped with PCA, are used for comparison purposes with ParDP in the clustering of several challenging synthetic and real-world datasets. ParDP appears capable of outperforming the competing tools on many practical datasets.
The work will continue according to the following points. First, to continue experimenting with ParDP for clustering irregular and complex datasets. Second, to develop additional strategies for density estimation, tailored to the needs of particular datasets. Third, to extend the application of ParDP to non-numerical datasets [16], where the Euclidean distance has to be replaced by a different metric. Fourth, to compare the clustering capabilities of ParDP with those of powerful tools based on Fuzzy C-Means with Particle Swarm Optimization [32]. Fifth, to port ParDP to the Theatre actor system [33], executing on top of the Spark distributed architecture, to deal with larger datasets.