To facilitate a clearer introduction of the DEALER algorithm, we first present the efficient computation of the ρ and DCM values within each partition. Unlike the CDC algorithm, which relies solely on the DCM metric to distinguish between boundary points and internal points within clusters, the proposed DEALER algorithm incorporates ρ, a metric reflecting the sparsity of the region where a data point resides, alongside the DCM metric. As illustrated in Figure 3 in Section 1, this hybrid metric not only maintains the ability to identify sparse clusters but also alleviates issues related to weak connectivity, thereby enhancing the effectiveness of the clustering algorithm. Specifically, the algorithm operates in two steps. First, it computes the DCM value and the ρ value for each data point; using the DCM values, it performs an initial classification, dividing the points into core point candidates and boundary point candidates. Then, the core candidate set is further refined by leveraging the ρ values: if a point within the candidate set for internal points has a relatively low ρ value, it is determined not to be an internal point. This screening process is applied to the candidate set, effectively distinguishing internal points from boundary points. Compared to traditional directional-center clustering methods, the incorporation of ρ effectively mitigates weak connectivity issues. Meanwhile, in contrast to conventional density-based clustering, the directional-center assessment based on DCM values demonstrates superior performance in identifying sparse clusters. Under our proposed distributed computing framework, once the data partitioning is complete, the ρ and DCM values for data points within each partition can be computed independently, without requiring inter-node communication.
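To make the distributed structure concrete, the following minimal Python sketch (our own illustration, not the paper's implementation) maps a toy per-partition computation over independent partitions. The brute-force local KNN here merely stands in for the Z-CF, DCM, and ρ computations detailed in the following subsections, and all names are illustrative.

```python
# Toy sketch of the per-partition structure: each worker processes its own
# partition independently, with no inter-node communication. The brute-force
# local KNN stands in for the Z-CF/DCM/density computations; names are ours.
import math
from multiprocessing import Pool

def local_knn(partition, k=3):
    """Brute-force KNN within one partition; a stand-in for the Z-CF search."""
    result = []
    for p in partition:
        neighbours = sorted((q for q in partition if q != p),
                            key=lambda q: math.dist(p, q))[:k]
        result.append((p, neighbours))
    return result

if __name__ == "__main__":
    partitions = [[(0, 0), (1, 0), (0, 1), (2, 2)],
                  [(10, 10), (11, 10), (10, 11), (12, 12)]]
    with Pool(processes=2) as pool:
        print(pool.map(local_knn, partitions))   # each partition handled independently
```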
3.2.1. Computation of DCM
Based on DCM, data points are categorized as either boundary points or internal points. To achieve this, the DCM values for all data points in dataset P must first be computed. The calculation method for DCM values is provided in Equation (1) in Section 2.1. As indicated by the formula, the computation of DCM requires identifying the k-nearest neighbors (KNNs). The primary computational cost of the DEALER algorithm lies in the distance calculations involved in finding the KNN points, which constitute the bottleneck that constrains the speed of the clustering process. When the dataset size becomes large, the O(n²) time complexity of these pairwise distance calculations becomes prohibitively expensive. Thus, improving the algorithm's speed is essential to meet the demands of large-scale data processing. To accelerate the KNN search, this section proposes a filtering strategy based on the z-value filling curve, referred to as the Z-CF (z-value-based computing filter) algorithm. Before introducing the Z-CF algorithm, we first prove two theorems.
Theorem 1. In a d-dimensional space, given two data points p = (p_1, …, p_d) and q = (q_1, …, q_d), where p_i ≤ q_i for every i ∈ {1, …, d}, the following holds: z(p) ≤ z(q).

Proof. For two data points p and q, where p_i ≤ q_i for every i ∈ {1, …, d}, we refer to the method described in Lemma 4 of reference [14], which derives the z-value of a point by interleaving the coordinate values bit by bit, alternating across dimensions from the most significant bit to the least significant bit. Since this interleaving preserves the coordinate-wise ordering, it implies that z(p) ≤ z(q). □
Theorem 2. Given two data points p and q in a d-dimensional space, if the distance between them satisfies dist(p, q) ≤ s, then the following holds: z(p⁻) ≤ z(q) ≤ z(p⁺). Here, p⁻ and p⁺ are new data points derived by subtracting and adding s to each coordinate value of the data point p, respectively. That is, p⁻ = (p_1 − s, …, p_d − s) and p⁺ = (p_1 + s, …, p_d + s).

Proof. If dist(p, q) ≤ s, then for every i ∈ {1, …, d}, it holds that p_i − s ≤ q_i ≤ p_i + s. Based on Theorem 1, we can conclude that z(p⁻) ≤ z(q) ≤ z(p⁺). □
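The two properties can be checked numerically. The following minimal Python sketch (our own toy example) computes z-values by bit interleaving for non-negative integer coordinates and verifies the monotonicity of Theorem 1 and the interval bound of Theorem 2; the function name zvalue and the fixed bit width are illustrative choices rather than the paper's notation.

```python
# A minimal sketch: z-values via bit interleaving for non-negative integer
# coordinates, plus numeric checks of Theorems 1 and 2 on toy points.
import math

def zvalue(point, bits=16):
    """Interleave one bit from each coordinate, most significant bit first."""
    z = 0
    for b in range(bits - 1, -1, -1):
        for x in point:
            z = (z << 1) | ((x >> b) & 1)
    return z

# Theorem 1: coordinate-wise p_i <= q_i implies z(p) <= z(q).
p, q = (3, 5), (4, 5)
assert zvalue(p) <= zvalue(q)

# Theorem 2: if dist(p, q) <= s, then z(p-) <= z(q) <= z(p+),
# where p-/p+ shift every coordinate of p by -s/+s.
p, q, s = (10, 12), (12, 13), 3                  # dist(p, q) = sqrt(5) <= 3
p_minus = tuple(c - s for c in p)                # (7, 9)
p_plus  = tuple(c + s for c in p)                # (13, 15)
assert math.dist(p, q) <= s
assert zvalue(p_minus) <= zvalue(q) <= zvalue(p_plus)
```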
Since data points, when mapped from high-dimensional space to one-dimensional space via the z-value filling curve, cannot be guaranteed to be ordered strictly by distance, it is necessary to first identify the potential range for the KNN points, then compute the distances within that range and sort the points to obtain the exact KNN. Building on the two theorems above, after arranging the data points in ascending order by their z-values, we can efficiently identify the KNN points of any given data point p. This is achieved by iteratively searching in both the forward and backward directions along the z-value axis, selecting the closer of the two candidate points at each step. After k iterations, we obtain k initial candidate points; let q_k denote the farthest of these candidates. The distance dist(p, q_k) is necessarily no less than the farthest distance among p's true KNN points, since any k distinct points include at least one point no closer than the true k-th nearest neighbor. According to Theorem 2, taking s = dist(p, q_k), the positions of p's KNN points along the z-value axis fall within a specific range: after these k comparisons, p's KNN points must lie within the interval [z(p⁻), z(p⁺)], that is, within a neighborhood centered at p with a radius of dist(p, q_k). Within this defined range, an exact KNN search can then be conducted. Compared to traditional KNN searches, incorporating the z-value index significantly reduces the search space by narrowing the potential candidate set from the entire dataset to a localized subset. This approach substantially decreases the number of distance computations required, thereby improving computational efficiency. The detailed strategy and analysis are presented as follows.
First, identify the initial potential KNN points. For a given d-dimensional dataset P, all data points are mapped into one-dimensional space via the z-value filling curve so that the z-value coordinates correspond one-to-one with the original space coordinates. Next, the data points are sorted by their z-values. On the z-value axis, the initial potential KNN points are identified. For any data point p, search for one point in each direction along the z-value axis: one in the positive direction, denoted q⁺, and one in the negative direction, denoted q⁻. The distances between p and these two points, dist(p, q⁺) and dist(p, q⁻), are calculated, and the point with the smaller distance is selected as the first initial KNN point. Then, along the direction of the closer point on the z-value axis, search for the next point, calculate its distance to p, and compare it with the previously identified farther point. The point with the smaller distance is selected as the second initial KNN point, and this process continues until k points are identified.
Next, determine the potential range for the KNN points. For any given data point p, the initially identified KNN points are sorted in ascending order of distance, and the point that is the k-th closest, denoted as q_k, is selected. The potential range for the KNN points is defined as a neighborhood centered at p with a radius of dist(p, q_k).
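The following Python sketch illustrates these two steps for a single point, assuming the points are already sorted by their z-values (e.g., with the zvalue helper above) and that k is smaller than the dataset size; the function and variable names are our own.

```python
# Sketch of the initial-candidate step: walk outward from a point along the
# z-value axis, selecting the closer frontier point at each step.
import math

def initial_candidates(points, idx, k):
    """Return k initial candidate indices (sorted by distance to points[idx])
    and the radius dist(p, q_k) to the farthest of them."""
    p = points[idx]
    left, right = idx - 1, idx + 1
    chosen = []                                   # (distance, index) pairs
    while len(chosen) < k:
        d_left = math.dist(p, points[left]) if left >= 0 else math.inf
        d_right = math.dist(p, points[right]) if right < len(points) else math.inf
        if d_left <= d_right:
            chosen.append((d_left, left)); left -= 1      # advance the left frontier
        else:
            chosen.append((d_right, right)); right += 1   # advance the right frontier
    chosen.sort()                                 # ascending by distance
    radius = chosen[-1][0]                        # dist(p, q_k)
    return [i for _, i in chosen], radius
```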
As illustrated in Figure 9, instead of performing a search over the entire data space, a one-dimensional linear search within the interval [z(p⁻), z(p⁺)] is sufficient to identify the KNN points of data point p, where p⁻ and p⁺ are obtained by taking s = dist(p, q_k). This is because, for any data point q such that dist(p, q) ≤ dist(p, q_k), according to Theorem 2, the z-value of q must lie within the interval [z(p⁻), z(p⁺)]. Therefore, there is no need to compute the distances from q to p for the ranges where z(q) < z(p⁻) or z(q) > z(p⁺), as it is guaranteed that dist(p, q) will be greater than dist(p, q_k).
It is important to note that if, for data points p and q, the z-value of q lies within the interval [z(p⁻), z(p⁺)] and the z-value of p lies within the interval [z(q⁻), z(q⁺)], then when calculating the KNN for data point p, the distance dist(p, q) needs to be calculated once. Similarly, when calculating the KNN for data point q, the distance dist(q, p) must also be calculated once. This results in redundant distance calculations and increases unnecessary computational overhead, as illustrated in Figure 10.
To optimize the computational strategy and reduce unnecessary overhead, this study adopts a one-sided computation approach. Specifically, for a given data point p, we calculate the distances only to the points on one side of the interval: either to the data points in the right-hand interval [z(p), z(p⁺)] or to those in the left-hand interval [z(p⁻), z(p)]. Both methods yield the same result, and in this paper, the first method is employed, as shown in Figure 11.
The proposed strategy is based on Theorem 3.
Theorem 3. Given a dataset P and any data point p, suppose there exists a subset S ⊆ P such that for every q ∈ S, dist(p, q) ≤ s. The subset S is then divided into two subsets, denoted as S_l and S_r, where S = S_l ∪ S_r. For every q ∈ S_l, z(q) < z(p), and for every q ∈ S_r, z(q) ≥ z(p). Then, with p⁻, p⁺, q⁻, and q⁺ defined as in Theorem 2, we have the following:

z(q) ≤ z(p) ≤ z(q⁺) for all q ∈ S_l, (5)

z(p) ≤ z(q) ≤ z(p⁺) for all q ∈ S_r. (6)

Proof. According to Theorem 2, for all q ∈ S_r, we have z(p⁻) ≤ z(q) ≤ z(p⁺). Since z(q) ≥ z(p), it follows that z(p) ≤ z(q) ≤ z(p⁺), thus proving Equation (6). A similar argument, applying Theorem 2 to q (so that z(q⁻) ≤ z(p) ≤ z(q⁺)) and noting that z(q) < z(p), can be used to prove Equation (5). □
Based on Theorem 3, when calculating the distances for data point p, Equation (6) guarantees that every candidate q ∈ S_r falls within p's right-hand interval [z(p), z(p⁺)], so p computes dist(p, q) for these points directly. Conversely, by Equation (5), for every q ∈ S_l the point p falls within q's right-hand interval [z(q), z(q⁺)], so the distance dist(p, q) is obtained when q scans its own right-hand interval and does not need to be recomputed for p. In this way, each pairwise distance within the candidate ranges is computed only once, which reduces the computational load and makes the Z-CF algorithm more efficient.
The Z-CF algorithm uses the z-value filling curve to map points from high-dimensional space into one-dimensional space. The z-value filling curve tends to keep points that are close in the original space close in the one-dimensional space, thereby narrowing the KNN search range. To further reduce redundant calculations, a one-sided calculation strategy is employed to optimize the algorithm. The specific computational process of the Z-CF algorithm is described in Algorithm 1.
Algorithm 1 Z-CF
Input: Dataset P, KNN value k
Output: KNN sets and distances for all points in P
1: Transform all data points into one-dimensional space using Equation (2) to obtain z(p) for every p ∈ P
2: Sort the data points by their z-values in ascending order
3: KNN(p) ← ∅ for every p ∈ P
4: for each data point p in P do // Find the initial KNN points
5:   Q ← P \ {p} // Q represents the subset of P with removable points
6:   cnt ← 0
7:   while cnt < k do
8:     q⁺ ← nearest unselected point after p in Q on the z-value axis, q⁻ ← nearest unselected point before p
9:     compute dist(p, q⁺) and dist(p, q⁻)
10:    q ← the closer of q⁺ and q⁻
11:    KNN(p) ← KNN(p) ∪ {q}
12:    remove q from Q; cnt ← cnt + 1
13:  end while
14:  Sort KNN(p) by increasing distance in dist(p, ·)
15:  r_p ← dist(p, q_k), the distance to the farthest initial candidate
16: end for
17: for each data point p in P do // Refine within the one-sided candidate range
18:   Compute z(p⁻) and z(p⁺) with s = r_p
19:   for each q such that z(p) < z(q) ≤ z(p⁺) do
20:     if dist(p, q) < r_p then
21:       replace the current k-th nearest neighbor of p with q
22:       Update KNN(p) and r_p
23:     end if
24:     if dist(p, q) < r_q then
25:       replace the current k-th nearest neighbor of q with p
26:       Update KNN(q) and r_q
27:     end if
28:   end for
29: end for
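For reference, the following Python sketch restates the two phases of Algorithm 1 under the same assumptions as the earlier sketches (non-negative integer coordinates, k smaller than the dataset size); it reuses the illustrative zvalue and initial_candidates helpers defined above, and names such as zcf_knn are ours rather than the paper's.

```python
# Sketch of Algorithm 1: initial candidates per point, then a one-sided
# refinement scan over the right-hand z-value interval. Reuses the zvalue and
# initial_candidates helpers sketched earlier; names are illustrative.
import math

def zcf_knn(data, k):
    pts = sorted(data, key=zvalue)                        # sort by z-value
    zs = [zvalue(p) for p in pts]
    knn, radius = {}, {}
    for i in range(len(pts)):                             # phase 1: initial candidates
        knn[i], radius[i] = initial_candidates(pts, i, k)
    for i, p in enumerate(pts):                           # phase 2: one-sided refinement
        z_plus = zvalue(tuple(math.ceil(c + radius[i]) for c in p))   # z(p+)
        j = i + 1
        while j < len(pts) and zs[j] <= z_plus:           # scan the right-hand interval only
            d = math.dist(p, pts[j])
            if d < radius[i] and j not in knn[i]:         # q improves p's KNN
                _replace_farthest(knn, radius, pts, i, j, d)
            if d < radius[j] and i not in knn[j]:         # p improves q's KNN
                _replace_farthest(knn, radius, pts, j, i, d)
            j += 1
    return knn, radius                                    # indices refer to the sorted order

def _replace_farthest(knn, radius, pts, i, j, d):
    """Swap i's current farthest neighbor for j and refresh i's radius."""
    ranked = sorted((math.dist(pts[i], pts[m]), m) for m in knn[i])
    ranked[-1] = (d, j)                                   # replace the k-th neighbor
    knn[i] = [m for _, m in ranked]
    radius[i] = max(dist for dist, _ in ranked)
```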
3.2.2. Computation of ρ
Although clustering methods based on DCM values ensure the identification of sparse clusters, they perform poorly in addressing weak connectivity issues. To maintain the capability of identifying sparse clusters while mitigating weak connectivity problems, a hybrid metric incorporating the ρ value is introduced. Unlike traditional density computation methods, the calculation of the ρ value in the DEALER algorithm does not require specifying a neighborhood radius. Instead, it directly uses the reciprocal of the Euclidean distance between a data point and its k-th nearest neighbor, as determined during the DCM computation process, to represent the data point's density. This approach offers two key advantages. First, it does not introduce additional computational overhead, as it merely applies a reciprocal operation to pre-computed distances. Second, the reciprocal of the distance to the k-th nearest neighbor is positively correlated with density as traditionally defined: the greater the distance to the k-th neighbor, the smaller the reciprocal, indicating sparser surrounding points and lower density, and vice versa. Moreover, using the reciprocal of the distance to the k-th nearest neighbor as a density metric inherently normalizes the density, which facilitates subsequent data processing tasks, such as data filtering. The formal definition of the density is provided as follows:
Definition 3 (K-Nearest Neighbor Limited Density). For any data point p ∈ P, if q_k denotes the k-th nearest neighbor of p in the data space, then the reciprocal of the Euclidean distance between p and q_k is defined as the k-nearest neighbor limited density ρ(p) of the data point p.
Figure 12 illustrates the density computation process. As shown, for a given data point p, assuming k = 5, five nearest neighbors can be identified, denoted as q_1, q_2, q_3, q_4, and q_5 and sorted by their distance to p in ascending order. Based on Figure 12, the formula for calculating the density of p is presented in Equation (7):

ρ(p) = 1 / dist(p, q_k). (7)

Here, ρ(p) represents the density of the data point p, q_k is the k-th nearest neighbor of p in the data space (corresponding to point q_5 in the figure), and dist(p, q_k) denotes the Euclidean distance between p and q_k.
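Since the k-th nearest-neighbor distance is already available from the KNN search, the density reduces to a single reciprocal per point, as the following minimal sketch shows (variable and function names are illustrative):

```python
# Minimal sketch: the k-nearest-neighbor limited density is the reciprocal of
# the distance to the k-th nearest neighbor, reusing distances already computed
# during the KNN search.
def knn_limited_density(knn_distances):
    """knn_distances: distances from a point to its KNNs, sorted in ascending order."""
    return 1.0 / knn_distances[-1]              # Equation (7): 1 / dist(p, q_k)

print(knn_limited_density([0.4, 0.7, 1.1, 1.6, 2.0]))   # k = 5 -> 0.5
```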
Based on the above analysis, to accurately differentiate between boundary points and core points and further enhance clustering accuracy, it is necessary to classify data objects using a combination of two metrics: the density ρ and the local DCM. Within the same cluster, boundary points generally have lower density than core points, while their DCM values are higher. Since the core points uniquely determine a cluster, it is essential to ensure that all core points are correctly identified. To achieve this, a two-step strategy is employed: data objects are first filtered based on their DCM values, followed by a secondary filtering based on density values. Specifically, data objects in the dataset are divided into two subsets based on their DCM values: a core point candidate set and a boundary point candidate set. Next, density values are applied to further filter the objects in the core point candidate set. If the density of a data object is smaller than the densities of more than half of its k-nearest neighbors (KNNs), the object is reclassified as a boundary point. This approach is justified because, for objects in the core point candidate set, the KNN points are relatively evenly distributed; if the density of a data object is exceeded by more than a certain proportion (typically set to 50%) of its KNN points, the object is likely closer to the cluster boundary, and the center of the cluster to which it belongs is biased toward the region with higher-density points. This methodology provides a clear distinction between core points and boundary points within the dataset, laying a solid foundation for subsequent clustering operations.
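The following Python sketch illustrates the two-step screening described above. The DCM threshold used for the initial split and the function names are our own assumptions; the 50% proportion follows the default mentioned in the text.

```python
# Sketch of the two-step screening: a DCM threshold (assumed here) gives the
# core-point candidates, and a candidate is demoted to a boundary point when
# more than `ratio` of its KNNs have higher density.
def classify_points(dcm, density, knn, dcm_threshold, ratio=0.5):
    """dcm, density: dicts point-id -> value; knn: dict point-id -> list of neighbor ids."""
    labels = {}
    for p in dcm:
        if dcm[p] >= dcm_threshold:                      # high DCM -> boundary candidate
            labels[p] = "boundary"
            continue
        # Core candidate: demote it if more than `ratio` of its KNNs are denser.
        denser = sum(1 for q in knn[p] if density[q] > density[p])
        labels[p] = "boundary" if denser > ratio * len(knn[p]) else "core"
    return labels

dcm = {0: 0.1, 1: 0.2, 2: 0.8}
density = {0: 0.50, 1: 0.20, 2: 0.30}
knn = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(classify_points(dcm, density, knn, dcm_threshold=0.5))
# {0: 'core', 1: 'boundary', 2: 'boundary'}
```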