Article

A Clustering Algorithm for Large Datasets Based on Detection of Density Variations

by Adrián Josué Ramírez-Díaz *, José Francisco Martínez-Trinidad and Jesús Ariel Carrasco-Ochoa
Instituto Nacional de Astrofísica, Óptica y Electrónica, Luis Enrique Erro # 1, Tonantzintla, Puebla 72840, Mexico
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(14), 2272; https://doi.org/10.3390/math13142272
Submission received: 8 May 2025 / Revised: 20 June 2025 / Accepted: 24 June 2025 / Published: 15 July 2025

Abstract

Clustering algorithms help handle unlabeled datasets. In large datasets, density-based clustering algorithms effectively capture the intricate structures and varied distributions that these datasets often exhibit. However, although these algorithms adapt to large datasets and build clusters with arbitrary shapes by identifying low-density regions, they usually struggle to identify density variations. This paper proposes a Variable DEnsity Clustering Algorithm for Large datasets (VDECAL) to address this limitation. VDECAL introduces a large-dataset partitioning strategy that allows working with manageable subsets and prevents workload imbalance. Within each partition, relevant object subsets characterized by attributes such as density, position, and overlap ratio are computed to identify both low-density regions and density variations, thereby facilitating cluster building. Extensive experiments on diverse datasets show that VDECAL effectively detects density variations, improving clustering quality and runtime performance compared to state-of-the-art DBSCAN-based algorithms developed for clustering large datasets.
MSC:
62H30; 68T09

1. Introduction

Clustering aims to divide datasets into groups called clusters. The objects of a cluster must be similar to each other and different from those of other clusters [1,2,3,4,5]. The clustering problem has appeared in various disciplines and contexts, reflecting the broad applicability of this technique, probably due to one of its main characteristics: the ability to deal with unlabeled data [6,7,8,9,10]. The broad applicability of clustering has motivated the development of several algorithms adapted to different types of datasets, such as stream datasets [11], categorical or numerical datasets [12], high dimensional datasets [13,14], point clouds [15], large datasets [16], and others. Recently, due to the increasing amount of data generated by current information systems, the development of clustering algorithms capable of processing large datasets has become an important research topic [17,18,19].
Clustering large datasets presents several challenges, including the development of efficient techniques to handle them effectively [20]. However, this is not the only challenge to consider. Clustering large datasets should also address the diversity in the data, which follows unknown distributions [21,22]. Therefore, several clustering algorithms for large datasets employ approaches that allow the construction of clusters with arbitrary shapes, i.e., that do not necessarily follow a Gaussian distribution [23].
The density-based clustering approach builds clusters by establishing connectivity relationships between objects in the dataset and identifying regions with at least a predetermined number of points. This approach creates clusters with arbitrary shapes and allows for dealing with noisy data [24,25,26]. However, due to the inherent properties of the density-based approach, it requires a high number of comparisons between objects in the dataset [27]. For this reason, density-based clustering algorithms must use scalable approaches such as sampling, data projection, parallelism, and distributed techniques to process large datasets [28,29,30,31].
In addition to using scalable approaches, techniques for building clusters rapidly and reducing the number of comparisons between the elements in the dataset, without losing the advantages of density-based clustering, are required. The most common way to build clusters with arbitrary shapes is to identify high-density regions separated by low-density regions.
However, a practical issue in large datasets arises when clusters are not separated by low-density regions, but by density variations in regions with approximately homogeneous densities [32,33]. Building clusters using information about density variations is helpful in problems such as segmenting objects in medical images [34,35,36,37] or reconstructing objects from point clouds generated by sensors [38,39,40]. These problems have in common the detection of clusters of interest by identifying changes in the distribution of objects belonging to the dataset.
Building clusters using density variations has been studied for small datasets but has not been addressed for large datasets [32,33,41,42,43]. For this reason, this paper introduces a clustering algorithm that builds clusters with arbitrary shapes in large datasets, called the Variable DEnsity Clustering Algorithm for Large datasets (VDECAL). VDECAL identifies density variations to build clusters, aiming to improve clustering quality while processing large datasets. Our experiments on synthetic and standard public datasets, commonly used to assess clustering algorithms, demonstrate that the proposed algorithm, VDECAL, offers the best compromise between runtime and quality.
Our contributions include the following:
  • A new large-dataset partitioning strategy that allows working with manageable subsets and prevents workload imbalance. This strategy could be used in other large-dataset clustering algorithms.
  • A new strategy to build clusters by identifying density variations and low-density regions in large datasets.
  • Based on the above, we also contribute the Variable DEnsity Clustering Algorithm for Large datasets (VDECAL).
The rest of this work is organized as follows. Section 2 presents the related work. Section 3 introduces the proposed algorithm. Section 4 shows an experimental evaluation of the proposed algorithm. Section 5 presents the conclusions and future work.

2. Related Work

The increasing amount of data generated by current information systems requires the development of algorithms capable of processing large amounts of data by overcoming the challenges that large datasets present, such as volume, variety, and velocity [44].
To address the growing amount of information, distributed architectures (such as Apache Spark and Hadoop) have been designed [23,27,45,46,47]. For this reason, most recent density-based algorithms are designed to be scalable for these architectures, employing techniques like partitioning, sampling, and others to handle large datasets [28].
Numerous algorithms address clustering in large datasets. This work focuses on density-based methods, as they manage challenges common in large datasets, such as noise and arbitrary cluster shapes, among others. These methods are particularly well-suited for capturing the complex structures and variability found in such data. The analyzed methods are presented below.
DBSCAN [48] is the cornerstone of density-based clustering, so it is not surprising that algorithms for processing large datasets are based on it. SDBSCAN [30], proposed in 2016, provides a version of DBSCAN adapted to large datasets. To do this, SDBSCAN uses three phases: partitioning, processing, and clustering. First, the large dataset is randomly partitioned; second, each partition is analyzed with DBSCAN; and third, the resulting clusters are grouped based on the distance between their centroids. Like SDBSCAN, other algorithms use the partitioning, processing, and clustering phases to handle large datasets, executing most of the process in parallel to reduce the runtime. DBSCAN-PSM [49] improves the partitioning process and the parallel processing using a k-d tree, which reduces the number of I/O operations. In addition, the clustering process was also enhanced by reviewing the proximity of the local clusters’ borders.
The RNN-DBSCAN algorithm [50], proposed in 2018, builds clusters using a nearest-neighbor approach to identify density variations. For this, the k-nearest neighbors of every object are calculated, and the distances between them are compared to determine whether they can belong to the same cluster. A k-d tree is employed to speed up the search for the k-nearest neighbors. Its authors do not address the processing of large datasets.
Finally, RS-DBSCAN [51], proposed in 2024, is an algorithm designed to process large datasets, built on top of the RNN-DBSCAN algorithm. RS-DBSCAN partitions the dataset, and every partition is processed independently with RNN-DBSCAN. From the local clusters, a set of representative objects is selected. Then, RNN-DBSCAN is applied to classify all representative objects globally. Finally, the dataset is clustered using these representative objects. This process is faster than RNN-DBSCAN; however, unlike RNN-DBSCAN, it cannot detect density variations.
Another widely used technique in density-based algorithms for processing large datasets is the identification of density peaks (high-density regions), which allows cluster centers to be obtained quickly by drawing a decision diagram based on the calculation of local density and relative distance. Below are some algorithms that build clusters based on density peaks [52].
FastDPeak [53], proposed in 2020, is an algorithm built on top of the DPeak algorithm and designed to process large datasets. FastDPeak builds clusters by identifying the dataset’s density peaks and also uses a three-phase partitioning process. The partitioning phase is performed using the kNN algorithm. The processing phase applies the ideas of the DPeak algorithm to the partitions created in the first phase. The results are grouped by creating a DPeak tree that contains information about the density peaks and the objects associated with these peaks. Using the DPeak tree, the clusters are built.
The KDBSCAN algorithm [54], proposed in 2021, is also a three-phase partitioning algorithm. In the first phase, the dataset is partitioned using the K-Means++ algorithm [55] with a reduced number of iterations, without achieving convergence; this algorithm works in the same way as K-Means, with the difference that the selection of the initial centroids is not random. In the second phase, KDBSCAN, like the SDBSCAN and DBSCAN-PSM algorithms, uses DBSCAN to process the partitions. In the clustering phase, the "close" clusters are identified by analyzing the distances between their centroids, and those close clusters are then grouped again with DBSCAN.
KNN-BLOCK DBSCAN [56], proposed in 2021, improves the quality of DBSCAN and speeds up the processing of large datasets. This algorithm merges the partitioning and processing phases using FLANN, a fast approximate nearest-neighbor search method. In this phase, three groups of objects are created: core blocks, non-core blocks, and noise. In the merging phase, clusters are formed by connecting core blocks when they are densely reachable; non-core blocks near clusters are merged into those clusters, while noise and non-core blocks that are not near any cluster are discarded. KNN-BLOCK DBSCAN builds clusters based on identifying density peaks; however, this approach assumes that the cluster center has a higher density than its surroundings and that there is a low-density region between clusters, as Huan Yan stated in the paper on the ANN-DPC algorithm [57], which explores this approach on smaller datasets.
In 2023, the GB-DP algorithm [58] was introduced as a three-phase clustering algorithm based on detecting density peaks. First, GB-DP partitions the dataset using K-Means; the resulting clusters are called granular balls. It then uses the characteristics of these granular balls, such as density and distance, to build a decision graph for identifying density peaks. In the last phase, granular balls and the points they represent are assigned to their nearest peaks, allowing clusters to be built. This way of building clusters is partition-based rather than density-based.
Though the mentioned algorithms employ partitioning to create dataset sections, it is important to note that partitioning is not the only option. Other algorithms select a dataset sample to create a preliminary version of the clusters. This reduces runtime by performing operations with a higher load on the selected sample. Usually, after processing the sample, these algorithms extrapolate the results to the rest of the dataset.
For example, in 2019, DBSCAN++ [59] was proposed as an alternative to using DBSCAN on large datasets by following sample selection. DBSCAN++ selects a random subset of the dataset that is large enough to be meaningful but small enough to be processed by DBSCAN. DBSCAN processes the selected subset to build the clusters. The rest of the dataset is clustered by assigning each element to the nearest cluster. This algorithm reduces the runtime and maintains the quality of DBSCAN as long as the selected subset is representative of the dataset.
Improved K-Means [60], proposed in 2020, has emerged as an alternative that combines partition-based algorithm techniques with density-based algorithm techniques. This can be used to build arbitrarily shaped clusters in large datasets. Improved K-Means, like DBSCAN++, selects a sample from the dataset and uses it to identify reference objects that represent the highest density regions. Clusters are formed using these reference objects. The remaining objects are assigned to the closest cluster, using the reference objects and a distance function.
The OP-DBSCAN algorithm [61], proposed in 2022, also uses sample selection to analyze large datasets. OP-DBSCAN selects a set of objects from the dataset close to each other; this set is called the operational dataset. Then, the operational dataset is clustered using DBSCAN. Next, OP-DBSCAN finds the boundaries of the previously computed clusters. Finally, OP-DBSCAN computes the objects closest to the previously determined boundaries; these objects now form the operational dataset. This process is repeated until the entire dataset is analyzed, constantly updating the operational dataset. Although the sample is more controlled and computed multiple times, the goal of selecting a sample is to reduce the number of comparisons between objects, thus reducing the runtime.
UP-DPC [62], proposed in 2024, is also a sampling-based framework for cluster building, operating through four steps. It begins by sampling a random subset of the dataset and identifying local density peaks through partitioning. Then, it constructs a similarity graph between these peaks and the samples, which is refined to improve accuracy. Finally, the density of the local peaks is calculated and the most significant ones are selected as initial cluster labels, thus clustering all objects in the dataset. Although UP-DPC uses density-based techniques to identify density peaks, clusters are built by assigning objects to the nearest peak, a partitional technique that limits the types of clusters that can be built.
As seen in the works above, current density-based clustering algorithms for large datasets aim to identify clusters by detecting high-density regions separated by low-density boundaries. This approach relies on well-defined, low-density separation zones, which can limit the effectiveness of clustering algorithms like SDBSCAN and KDBSCAN. Moreover, traditional partitioning methods, such as random partitioning or K-Means, may produce imbalanced partitions. Thus, a research opportunity lies in developing clustering algorithms for large datasets that identify cluster boundaries via low-density areas or density fluctuations, and that use balanced partitioning to ensure an equitable workload distribution when clustering large datasets.

3. Proposed Algorithm

This section introduces the proposed Variable DEnsity Clustering Algorithm for Large datasets (VDECAL). The main goal of VDECAL is to create clusters with arbitrary shapes in large datasets using density-based techniques. Unlike the previously reviewed clustering algorithms, which focus on identifying low-density regions that can be used as cluster boundaries, VDECAL also identifies variations in the density of nearby regions, allowing the separation of clusters with density variations. This approach aims to improve clustering quality without increasing the runtime on large datasets.

3.1. Preliminary Definitions

Density-based clustering is an approach in which clusters are defined as sets of objects distributed in the data space in a continuous region with high object density (with many objects). Regions with few objects are called low-density regions. In this way, clusters are separated by low-density regions. Objects in low-density regions are considered as noise [25].
One way to quantify the density of a region is to calculate the number of objects near a reference point. The set of objects near a reference point is called the neighborhood of the reference point. To determine if two objects are nearby, a parameter ϵ indicates the maximum distance between two objects to be considered close [63]; in this sense, an ϵ-neighborhood is defined as follows.
Definition 1.
ϵ-Neighborhood. Consider a set of objects $O = \{o_1, o_2, \ldots, o_n\}$ in a k-dimensional space $\mathbb{R}^k$, a distance function $Distance: \mathbb{R}^k \times \mathbb{R}^k \to \mathbb{R}$, and a distance threshold $\epsilon \in \mathbb{R}^+$. The ϵ-neighborhood of an object x, denoted by $N_\epsilon(x)$, is as follows:
$$N_\epsilon(x) = \{\, o \in O \mid Distance(x, o) \le \epsilon \,\}$$
A second parameter, minPts, indicates whether the neighborhood of an object is dense enough to be considered a high-density region or, on the contrary, whether it is a low-density region. An object whose neighborhood reaches the specified density (minPts) is called a core object [63].
Definition 2.
Core object. Consider a set of objects $O = \{o_1, o_2, \ldots, o_n\}$ in a k-dimensional space $\mathbb{R}^k$, a distance threshold $\epsilon \in \mathbb{R}^+$, and a minimum density $minPts \in \mathbb{Z}^+$. An object $o_i \in O$ is called a core object if $|N_\epsilon(o_i)| \ge minPts$.
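The two definitions above can be checked directly with a short NumPy sketch; the function names and the choice of Euclidean distance are our own illustrative assumptions, not part of the formal definitions.
```python
import numpy as np

def eps_neighborhood(x, O, eps):
    """Definition 1: objects of O within distance eps of x (Euclidean distance assumed)."""
    dists = np.linalg.norm(O - x, axis=1)   # Distance(x, o) for every o in O
    return O[dists <= eps]

def is_core_object(o, O, eps, min_pts):
    """Definition 2: o is a core object if its eps-neighborhood contains at least min_pts objects."""
    return len(eps_neighborhood(o, O, eps)) >= min_pts
```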
Identifying core objects in a dataset and connecting them through their common objects is the technique most commonly used by density-based algorithms to build clusters. However, this requires comparisons among multiple objects, and in large datasets this process can be time-consuming. To address this problem, our proposed algorithm reduces the number of calculations by using only selected reference points to determine the density of a region. For this purpose, we introduce the concept of a relevant object subset.
Definition 3.
Relevant object subset. Consider a set of objects $O = \{o_1, o_2, \ldots, o_n\}$ in a k-dimensional space $\mathbb{R}^k$, a distance threshold $\epsilon \in \mathbb{R}^+$, and an object $c \in \mathbb{R}^k$. A relevant object subset R in the set O is represented as a 3-tuple $(M, \epsilon_R, c)$, where $c \in \mathbb{R}^k$ is the centroid of the relevant object subset, $\epsilon_R \in \mathbb{R}^+$ defines the size of the neighborhood, and $M \subseteq O$ contains the objects within the relevant object subset such that $M = N_{\epsilon_R}(c)$.
Therefore, the density of a relevant object subset is defined as follows.
Definition 4.
Density of a relevant object subset. Consider a relevant object subset $R = (M, \epsilon_R, c)$. The density of R, denoted as $\rho(R)$, is the number of objects in the set M:
$$\rho(R) = |M|$$
Relevant object subsets with a density less than minPts represent low-density regions. The drawback of using only low-density regions for cluster building is that some datasets may lack defined low-density regions. In such cases, clusters could be identified by assessing density variations between neighboring relevant object subsets.
Figure 1a depicts a dataset where a low-density region separates two clusters. In this scenario, the parameters ϵ and minPts can be used to define the low-density region between the clusters and separate them. Figure 1b represents a dataset where no low-density region exists between the clusters; only a density variation separates them.

3.2. Variable DEnsity Clustering Algorithm for Large Datasets (VDECAL)

VDECAL aims to build clusters with arbitrary shapes in large datasets, and it is divided into three stages: dataset partitioning, finding relevant object subsets, and clustering relevant object subsets. In the first stage, dataset partitioning, the dataset is divided into independent segments, which reduces the number of comparisons between objects. In the next stage, finding relevant object subsets, VDECAL computes relevant object subsets for each partition. Finally, in the last stage, clustering relevant object subsets, the characteristics of the relevant object subsets, such as proximity, density, and their associated objects, are used to build the resulting clusters.

3.2.1. Dataset Partitioning

A common approach in the literature is to partition the dataset with K-Means++ without running it to convergence; this produces good results because it creates partitions of nearby objects. However, unbalanced subsets may occur, and when they do, the workload accumulates in the subsets with more objects. In extreme cases, this could render the partitioning stage useless. To address imbalanced subsets, we propose a recursive partitioning method.
The proposed dataset partitioning splits the dataset into two subsets using K-Means++ (without reaching convergence). If the size of any resulting subset exceeds the partition size t, that subset is partitioned again using K-Means++. This process continues until every partition contains at most t objects. This method is described in Algorithm 1.
Algorithm 1 Algorithm for dataset partitioning.
  • procedure Partition(Dataset D, partition size t)
  •      Initialize an empty list partitionList
  •      if number of elements in D > t then
  •          Divide D into two parts D1 and D2 using the K-Means++ algorithm.
  •          subsets1 = Partition(D1, t)
  •          subsets2 = Partition(D2, t)
  •          Merge subsets1 and subsets2 into partitionList.
  •      else
  •          Add D to partitionList as a partition.
  •      end if
  •      return partitionList
  • end procedure
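A minimal Python sketch of Algorithm 1 is given below. It assumes scikit-learn's KMeans with k-means++ initialization and a deliberately small max_iter to avoid full convergence; the function name, the iteration budget, and the assumption that every split produces two non-empty parts are our own choices, not part of the original specification.
```python
import numpy as np
from sklearn.cluster import KMeans

def partition(D, t, max_iter=5, random_state=None):
    """Recursively split D (an n x k array) until every partition holds at most t objects."""
    if len(D) <= t:
        return [D]                          # small enough: keep as a single partition
    # Split into two subsets with K-Means++ initialization, without running to convergence.
    # The sketch assumes each split separates the data into two non-empty parts.
    km = KMeans(n_clusters=2, init="k-means++", n_init=1,
                max_iter=max_iter, random_state=random_state).fit(D)
    parts = []
    for label in (0, 1):
        parts.extend(partition(D[km.labels_ == label], t, max_iter, random_state))
    return parts
```
For example, calling partition(D, t=int(0.2 * len(D))) reproduces the 20% partition size suggested later in Section 3.3.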
Figure 2 compares two methods for dividing a dataset: K-Means++ and our proposed approach. In K-Means++ configured to create four partitions (see Figure 2a), we observe that most of the dataset’s objects are concentrated in a single partition. This leads to an uneven distribution of objects across partitions, causing workload imbalance. In contrast, our method (see Figure 2b), with t = 2000 , creates two partitions in the same region, effectively distributing the workload more evenly.

3.2.2. Finding Relevant Object Subsets

This stage aims to find a set of relevant object subsets for each partition P generated in the first stage, using a distance threshold ϵ. These relevant object subsets are computed according to Definition 3, which states that a relevant object subset is a 3-tuple $R = (M, \epsilon_R, c)$: the set M contains the objects represented by the relevant object subset, the distance $\epsilon_R$ defines the size of the neighborhood that the relevant object subset R covers, and c is the centroid of the relevant object subset R.
All objects in a partition P are analyzed to compute the relevant object subsets. For each object o in P that has not been previously added to a relevant object subset, a new relevant object subset $R = (M, \epsilon_R, c)$ is created from o. All objects p in P whose distance to o is less than or equal to ϵ are selected, whether or not they already belong to another relevant object subset; these selected objects, together with o, form the set M. From M, the centroid c is calculated as the mean of all the objects in M, and $\epsilon_R$ is defined as the distance from c to the farthest object in M. The process ends when every object in P has been added to one or more relevant object subsets. This process is shown in Algorithm 2.
Algorithm 2 Algorithm for finding relevant object subsets in a partition.
  • procedure FindRelevantObjectSubsets(partition P, distance threshold ϵ)
  •      Initialize an empty list relevantObjectSubsets
  •      for each object o in P do
  •           if o is not visited then
  •               M = { p ∈ P such that Distance(o, p) ≤ ϵ }
  •               c = centroid of M
  •               ϵ_R = max over m_i in M of Distance(c, m_i)
  •              Create a new relevant object subset R = (M, ϵ_R, c)
  •              Add R to relevantObjectSubsets
  •              Mark o and each object in M as visited
  •           end if
  •      end for
  •      return relevantObjectSubsets
  • end procedure
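Under the same assumptions as before (NumPy arrays and Euclidean distance), Algorithm 2 can be sketched as follows; the tuple layout mirrors Definition 3 and the function name is ours.
```python
import numpy as np

def find_relevant_object_subsets(P, eps):
    """Return a list of (M, eps_R, c) tuples (Definition 3) covering all objects of partition P."""
    visited = np.zeros(len(P), dtype=bool)
    subsets = []
    for i, o in enumerate(P):
        if visited[i]:
            continue
        # M: every object of P within eps of o, whether or not it was already covered.
        members = np.linalg.norm(P - o, axis=1) <= eps
        M = P[members]
        c = M.mean(axis=0)                              # centroid of the subset
        eps_R = np.linalg.norm(M - c, axis=1).max()     # distance from c to the farthest member
        subsets.append((M, eps_R, c))
        visited |= members                              # mark the members (including o) as visited
    return subsets
```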

3.2.3. Clustering Relevant Object Subsets

The final stage of the VDECAL algorithm involves building clusters using the relevant object subsets identified in the previous stage. This process relies on identifying overlaps and density variations among these relevant object subsets. Building clusters from the relevant object subsets begins by discarding those not meeting the minimum density requirement set by the parameter minPts (see Definition 4).
The next step is to identify the relevant object subsets in the same cluster. Each pair of relevant object subsets is examined to determine if their neighborhoods overlap and have similar densities. The criterion for determining when two relevant object subsets are close enough to share neighborhoods, meaning they overlap, is given in Definition 5.
Definition 5.
Overlap of relevant object subsets. Consider two relevant object subsets $R_1 = (M_1, \epsilon_1, c_1)$ and $R_2 = (M_2, \epsilon_2, c_2)$ and a distance function $Distance: \mathbb{R}^k \times \mathbb{R}^k \to \mathbb{R}$. $R_1$ has an overlap with $R_2$ if the following condition is fulfilled:
$$Distance(c_1, c_2) \le \epsilon_1 + \epsilon_2$$
To determine whether there is a density variation between two relevant object subsets $R_1$ and $R_2$, we first identify their neighboring relevant object subsets. The neighboring relevant object subsets of $R_1$, denoted by $\Omega(R_1)$, are those that overlap with $R_1$; similarly, the neighboring relevant object subsets of $R_2$, denoted by $\Omega(R_2)$, are those that overlap with $R_2$ (Definition 5). Once these neighbors are identified, the density $\rho(R)$ of each neighboring relevant object subset is obtained (Definition 4). Then, we calculate the mean ($\bar{x}$) and standard deviation ($\sigma$) of these densities, separately for the neighbors of $R_1$ and for the neighbors of $R_2$. Next, we calculate the Z-score of $R_1$ relative to the neighbors of $R_2$, and of $R_2$ relative to the neighbors of $R_1$, according to Definition 6.
Definition 6.
Z-score of $R_2$ relative to the neighbors of $R_1$. Let $R_1 = (M_1, \epsilon_1, c_1)$ and $R_2 = (M_2, \epsilon_2, c_2)$ be two relevant object subsets. The Z-score of $R_2$ relative to the neighbors of $R_1$ is defined as:
$$Z(R_2, R_1) = \frac{\rho(R_2) - \bar{x}_{\Omega(R_1)}}{\sigma_{\Omega(R_1)}}$$
where $\rho(R_2)$ is the density of $R_2$ (see Definition 4), $\bar{x}_{\Omega(R_1)}$ is the mean of the densities of the neighboring relevant object subsets of $R_1$, and $\sigma_{\Omega(R_1)}$ is the standard deviation of the densities of the neighboring relevant object subsets of $R_1$.
The following definition states when there is a density variation between two relevant object subsets.
Definition 7.
Density variation between two relevant object subsets. Consider two relevant object subsets $R_1 = (M_1, \epsilon_1, c_1)$ and $R_2 = (M_2, \epsilon_2, c_2)$, and a density variation threshold $\delta \in \mathbb{R}^+$. There exists a density variation between $R_1$ and $R_2$ if any of the following conditions is fulfilled:
$$\delta \le |Z(R_1, R_2)| \quad \text{or} \quad \delta \le |Z(R_2, R_1)|.$$
The parameter δ should be interpreted in terms of the absolute Z-score, which indicates how many standard deviations a relevant object subset lies from the mean of its neighbors. As a result, the values for δ must be between 0 and 4 since, according to Shiffler [64], absolute Z-scores cannot exceed 4 in sets of limited size.
VDECAL constructs an adjacency matrix using the information about the relevant object subsets, which can then be clustered using the overlap and density properties. Each cell (i, j) of the adjacency matrix contains a 1 if the relevant object subsets $R_i$ and $R_j$ overlap and do not present a density variation, and 0 otherwise. With this adjacency matrix, the final clusters are built by computing the connected components [65] (p. 552). Finally, each object in the dataset is assigned to a cluster by identifying the relevant object subsets to which it belongs. If it belongs only to relevant object subsets within the same cluster, it is assigned to that cluster. The object is labeled as noise if it belongs to a relevant object subset that was discarded for not meeting the minPts threshold. Finally, if the object belongs to relevant object subsets in different clusters, it is assigned to the cluster containing the relevant object subset with the closest centroid ($c_R$).
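The overlap test (Definition 5), the Z-score (Definition 6), and the connected-component step can be sketched as below, using SciPy's connected_components on a sparse adjacency matrix. The function names are ours, and the per-subset neighbor lists (indices of overlapping subsets) are assumed to be precomputed.
```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def overlaps(R1, R2):
    """Definition 5: the neighborhoods of R1 and R2 overlap."""
    (_, eps1, c1), (_, eps2, c2) = R1, R2
    return np.linalg.norm(c1 - c2) <= eps1 + eps2

def z_score(density, neighbor_densities):
    """Definition 6: Z-score of a subset's density relative to a neighborhood's density statistics."""
    if len(neighbor_densities) == 0:
        return 0.0
    mean, std = np.mean(neighbor_densities), np.std(neighbor_densities)
    return 0.0 if std == 0 else (density - mean) / std

def cluster_subsets(subsets, densities, neighbors, delta):
    """Link overlapping subsets without a density variation (Definition 7); label connected components."""
    n = len(subsets)
    adj = np.zeros((n, n), dtype=np.int8)
    for i in range(n):
        for j in range(i + 1, n):
            if not overlaps(subsets[i], subsets[j]):
                continue
            z_ij = z_score(densities[i], [densities[k] for k in neighbors[j]])
            z_ji = z_score(densities[j], [densities[k] for k in neighbors[i]])
            if abs(z_ij) < delta and abs(z_ji) < delta:   # no density variation detected
                adj[i, j] = adj[j, i] = 1
    _, labels = connected_components(csgraph=csr_matrix(adj), directed=False)
    return labels   # one cluster label per relevant object subset
```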
In summary, given a large dataset, the proposed algorithm begins by partitioning the input dataset with K-Means++, as explained in Section 3.2.1. On each partition, the relevant object subsets are computed as described in Section 3.2.2, which includes calculating the objects they represent, their centroids, and their neighborhood sizes. Next, the objects included in low-density relevant object subsets are labeled as noise. As described in Section 3.2.3, once the relevant object subsets of all partitions have been computed, a graph connecting them is built: edges are created between overlapping relevant object subsets (Definition 5) that have similar density levels (Definition 7). The clusters emerge as the connected components of this graph. In the final step, each object not labeled as noise is assigned to a cluster by identifying the relevant object subsets to which it belongs. This clustering process is summarized in Algorithm 3.
Algorithm 3 VDECAL algorithm.
  • procedure VDECAL(D, ϵ, minPts, δ, t)
  •      Create a partition list of the dataset D using Partition(D, t).
  •      for each partition P in the partition list do
  •           Find relevant object subsets in P using FindRelevantObjectSubsets(P, ϵ).
  •           Mark objects in relevant object subsets with density less than minPts as noise.
  •      end for
  •      Build a graph connecting relevant object subsets with overlapping neighborhoods (Definition 5) and density variation below δ (Definition 7).
  •      Extract clusters as connected components of the graph.
  •      for each object o in D not marked as noise do
  •           Assign o to the cluster of the nearest relevant object subset in its ϵ-neighborhood.
  •      end for
  • end procedure
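A minimal end-to-end usage sketch of Algorithm 3 is given below, reusing the hypothetical helpers sketched in the previous subsections (partition, find_relevant_object_subsets, overlaps, cluster_subsets). The input file name and parameter values are illustrative, and the final per-object assignment step is omitted.
```python
import numpy as np

# Illustrative driver for Algorithm 3 using the hypothetical helpers sketched above.
D = np.loadtxt("dataset.csv", delimiter=",")          # hypothetical input file
eps, min_pts, delta = 1.5, 3, 2.5                     # example values (Compound dataset, Section 4.1)
t = int(0.2 * len(D))                                 # 20% partition size, as suggested in Section 3.3

subsets = []
for P in partition(D, t):
    subsets.extend(find_relevant_object_subsets(P, eps))

densities = [len(M) for (M, _, _) in subsets]         # Definition 4
dense = [i for i, d in enumerate(densities) if d >= min_pts]   # subsets kept; the rest is noise
neighbors = {a: [b for b in range(len(dense))
                 if b != a and overlaps(subsets[dense[a]], subsets[dense[b]])]
             for a in range(len(dense))}
labels = cluster_subsets([subsets[i] for i in dense],
                         [densities[i] for i in dense],
                         neighbors, delta)            # cluster label per kept relevant object subset
```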

3.3. Effect of the t, ϵ, minPts, and δ Parameters

VDECAL has four parameters: partition size (t), neighborhood size (ϵ), minimum number of objects per relevant object subset (minPts), and density variation threshold (δ). Small values of t generate more, smaller partitions, producing subsets with insufficient information to build relevant object subsets. Conversely, higher values generate fewer, larger partitions, which slows down the construction of relevant object subsets. According to our experiments, shown later, setting t to 20% of the dataset size provides good results.
High values of ϵ produce fewer relevant object subsets and reduce VDECAL’s runtime but may blur cluster boundaries, whereas low values of ϵ improve boundary detection but increase the runtime. The parameter minPts allows VDECAL to discard relevant object subsets in low-density regions, thereby differentiating high-density areas by separating relevant object subsets with overlapping neighborhoods. Since ϵ and minPts are highly data-dependent, it is difficult to recommend values for them.
VDECAL uses the Z-score to determine density variations. Shiffler [64] stated that absolute Z-scores cannot exceed 4, so δ must range from 0 to 4. VDECAL does not detect density variations when δ = 4; as δ decreases, more density variations are identified, and when δ = 0, every relevant object subset constitutes its own cluster. From our experiments, shown later, fixing δ at around 2.5 produces good results.
Setting values for the VDECAL’s parameters, as with other density-based clustering algorithms, remains a challenging issue due to its dependence on the specific characteristics of the dataset.

3.4. Time Complexity

For this analysis, we considered the three stages of the algorithm: partitioning, processing, and clustering. The partitioning stage involves recursively dividing the dataset into two partitions using K-Means++ until each partition contains a maximum of t points. The time complexity of K-Means++ is $O(2in)$, where 2 is the number of partitions, i represents the iterations to determine the centroids, and n represents the number of reviewed objects. The worst-case scenario occurs when K-Means++ separates only one object in each iteration. We intentionally keep i small because achieving full convergence is not necessary for our purposes. In this context, the time complexity for the first stage can be expressed as:
$$\sum_{x=0}^{n-t} 2i(n - x) = i(n - t + 1)(t + n) = in^2 + in - it^2 + it$$
where t represents the maximum number of elements per partition. By ignoring the constant terms, we can focus on the dominant term, which in this case is $n^2$, indicating a quadratic complexity of $O(n^2)$ for the first stage.
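For completeness, a step-by-step expansion of the sum above, under the assumption that the index runs from $x = 0$ to $n - t$ (one object separated per recursive call):
$$\sum_{x=0}^{n-t} 2i(n - x) = 2i\left[(n-t+1)\,n - \frac{(n-t)(n-t+1)}{2}\right] = i(n - t + 1)(n + t) = in^2 + in - it^2 + it.$$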
For the processing stage, the worst case occurs when a relevant object subset is built for every object in the partition. Since creating relevant object subsets involves comparing the objects in the partition against the relevant object subsets found, the number of operations in the worst case is $O(t^2 (n/t)) = O(tn)$, where t is the maximum number of objects in a partition and $n/t$ is the number of partitions. However, since t is a constant, the complexity of this stage is $O(n)$.
Finally, the clustering stage analyzes all the relevant object subsets found in the partitions. To determine whether two relevant object subsets can be merged, each one needs to be compared against the rest, requiring $r^2$ operations, where r is the number of relevant object subsets. The worst case occurs when the number of relevant object subsets equals the number of objects in the dataset, resulting in $O(n^2)$. To build the clusters, the connected components need to be computed, which takes $O((n+m)\log n)$ ([66], ch. 25, p. 24), where m is the number of overlapping relevant object subsets. Therefore, the complexity of this stage is $O(n^2 + (n+m)\log n) = O(n^2)$ in the worst case.
The partitioning stage is $O(n^2)$, the processing stage is $O(n)$, and the clustering stage is $O(n^2)$. Considering the dominant terms of each stage, the time complexity of the VDECAL algorithm is $O(n^2)$. This complexity is similar to that of other recent density-based algorithms when their worst-case behavior is also considered [51,54]. For this reason, we include a runtime evaluation to empirically assess the algorithm’s performance.

4. Experiments and Results

This section presents the experiments conducted to assess the performance of the proposed algorithm, VDECAL. The experiments encompass an analysis of the algorithm’s stability and convergence, as well as its quality, runtime, and scalability regarding the number of objects and dimensions.
We include a comparison against state-of-the-art density-based clustering algorithms. The selected algorithms include DBSCAN [48], which serves as the reference algorithm for density-based clustering; IKMEANS [60], an algorithm for large datasets that performs sampling on the dataset; KDBSCAN [54], which is an improvement of DBSCAN for large datasets based on dataset partitioning; KNN-BLOCK DBSCAN [56], another algorithm for large datasets that identifies objects called core blocks; RNNDBSCAN [50], an algorithm that detects density variations; and RSDBSCAN [51], a recent enhancement of RNNDBSCAN designed for large datasets.
To allow a fair comparison with the state-of-the-art clustering algorithms, we implemented each algorithm in Python (latest v. 3.13.5), using the same data structures for input/output processing for all algorithms. Experiments were conducted on a computer with two Intel Xeon E5-2620 processors at 2.40 GHz, 256 GB RAM, and Linux Ubuntu 22.

4.1. Stability and Convergence

Clustering algorithms often involve randomness, which can impact their results and lead to variations across different runs. VDECAL introduces randomness during the dataset partitioning stage by employing K-Means++, which selects random objects as initial centroids for each partition. This experiment aims to evaluate the stability and convergence of VDECAL. To assess these characteristics, we performed multiple runs of the algorithm using different random seeds and the corresponding object processing orders generated by these seeds, while keeping all other parameters constant.
For this experiment, we used the Compound dataset from Tomas Barton’s repository [67]. Compound is widely used to assess density-based clustering algorithms, as it contains clusters with arbitrary shapes, noise, concentric clusters, clusters with varying densities, and clusters with close boundaries. The parameters used were t = 200 (resulting in two partitions), ϵ = 1.5 , m i n P t s = 3 , and δ = 3 . These values allow VDECAL to build the known clusters in the Compound dataset.
We executed the algorithm four times, using the same parameters while varying only the random seeds and processing orders in each run. For each execution, we recorded the clusters produced and the relevant object subsets used to build them. Figure 3 shows the results of these runs.
In this figure, objects labeled as noise are shown in gray, while the remaining objects are colored according to their assigned cluster. Larger circles represent the relevant object subsets identified by VDECAL, each colored to match its corresponding cluster. Figure 3 shows that all four runs converge to the same number of clusters with similar shapes. The objects in the concentric clusters (blue and purple) and the arbitrarily shaped cluster (orange) were always identified correctly, despite being derived from different relevant object subsets. The two remaining clusters (pink and green), which have ambiguous boundaries, were slightly affected by the algorithm’s random component. In Figure 3a, these clusters contained 27 and 43 objects, respectively; in Figure 3b, 28 and 42; in Figure 3c, 26 and 41; and in Figure 3d, 25 and 40 objects. The affected objects are located near cluster boundaries, where it is unclear whether an object belongs to one cluster or another or whether it should be considered noise; this confirms VDECAL’s stability.

4.2. Quality

Three metrics were used to evaluate clustering quality, i.e., Adjusted Rand Index (ARI) [68], Adjusted Mutual Information (AMI) [69], and Normalized Mutual Information (NMI) [70], since these metrics are widely used in the literature for evaluating clustering algorithms. ARI adjusts for chance and measures the similarity between two clusterings. AMI, also adjusted for chance, quantifies the shared information between the clusters and the ground truth, accounting for label distribution. NMI is based on Mutual Information (MI), which calculates how much knowing one clustering reduces the uncertainty about the other, and NMI normalizes this score. For all three metrics, a score of 1 implies a perfect matching. As the value decreases, the quality of the clustering also decreases.
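These three metrics are available in scikit-learn, which is one straightforward way to compute them; the labels below are a toy example, not data from our experiments.
```python
from sklearn.metrics import (adjusted_rand_score,
                             adjusted_mutual_info_score,
                             normalized_mutual_info_score)

ground_truth = [0, 0, 1, 1, 2, 2]     # reference labels (toy example)
predicted    = [0, 0, 1, 2, 2, 2]     # labels produced by a clustering algorithm

print("ARI:", adjusted_rand_score(ground_truth, predicted))
print("AMI:", adjusted_mutual_info_score(ground_truth, predicted))
print("NMI:", normalized_mutual_info_score(ground_truth, predicted))
```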
All used metrics are external evaluation metrics and require labeled data to evaluate the clustering results. Therefore, we use a set of ten synthetic datasets (see Table 1) from Tomas Barton’s repository [67], a benchmark widely used in the literature to evaluate density-based clustering algorithms. These datasets present diverse challenges, including arbitrary shapes, geometric shapes with noise, overlapping clusters separable by density, variations between clusters, smooth transitions between clusters, outliers, imbalanced clusters, differences in size and shapes, and the identification of weak connections, among others [71].
Since these datasets have been widely used for evaluating clustering algorithms, the parameter values that produce good clustering results for DBSCAN are well-known ( ϵ d b s , m i n P t s d b s ) [56]. Additionally, IKMEANS, KDBSCAN, KNN_BLOCK_DBSCAN, and our proposed algorithm (VDECAL) use the same parameters ϵ and minPts. Thus, we employed these same values in our experiments and conducted further evaluations, changing these parameter values. For ϵ, we varied the values from ϵ_dbs/2 to 2ϵ_dbs with increments of 0.1, and for minPts, a range from 1 to 2·minPts_dbs with increments of 1. For those algorithms that require the number of nearest neighbors (K) as a parameter, such as RSDBSCAN and RNNDBSCAN, tests were conducted using values within the range of 2 to 30, with increments of 1, as recommended by the authors. For IKMEANS, which uses a sample size, the samples were established between 10% and 50% of the dataset size, in intervals of 5%. For those algorithms that require a number of partitions, such as KDBSCAN and RS-DBSCAN, the number of partitions was set between 1 and 4. Finally, for VDECAL, the density variation threshold δ was varied between 1 and 4 with increments of 0.1; moreover, the parameter t, defining the maximum partition size, took values from 20% to 60% of the dataset size, with increments of 10%. Table 2 shows the combinations of parameter values that produce the best-quality results for each algorithm.
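The parameter sweeps described above can be generated programmatically; below is a sketch of the grid used for VDECAL, where the variable names and the reference values eps_dbs, minpts_dbs, and n are placeholders for the per-dataset values.
```python
import numpy as np
from itertools import product

eps_dbs, minpts_dbs, n = 1.0, 4, 5000                              # placeholder per-dataset values

eps_values    = np.arange(eps_dbs / 2, 2 * eps_dbs + 1e-9, 0.1)    # eps_dbs/2 .. 2*eps_dbs, step 0.1
minpts_values = range(1, 2 * minpts_dbs + 1)                       # 1 .. 2*minPts_dbs, step 1
delta_values  = np.arange(1.0, 4.0 + 1e-9, 0.1)                    # 1 .. 4, step 0.1
t_values      = [int(f * n) for f in (0.2, 0.3, 0.4, 0.5, 0.6)]    # 20% .. 60% of the dataset size

grid = list(product(eps_values, minpts_values, delta_values, t_values))
print(len(grid), "parameter combinations for VDECAL")
```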
Table 3 presents the best results of all the tests. These results show that our proposed algorithm, VDECAL, achieved the best result in 6 out of the 10 datasets evaluated, making it the top-performing algorithm. It was followed by RNNDBSCAN, which achieved the best results on three datasets: on one of them (2G unbalance) it was superior, while on the other two (Jain and Aggregation) it tied with VDECAL. It is important to highlight that both VDECAL and RNNDBSCAN are algorithms that consider density variations when building clusters. However, RNNDBSCAN is not designed for large datasets, which is reflected in the runtime experiments.
These experiments show that VDECAL achieved the best results on Jain, Aggregation, Flame, and Pathbased datasets. These datasets have clusters with arbitrary shapes and have density variations in continuous regions, which shows the proposed algorithm’s applicability to these kinds of datasets.

Detection of Density Variations

VDECAL’s ability to detect density variations allows it to create clusters that other algorithms cannot achieve. To illustrate this, we used a dataset named dense-disk-5000, also obtained from Tomas Barton’s repository [67]. This dataset has 5000 objects grouped in two clusters with different densities, the densest cluster is in the center of the dataset surrounded by a less dense cluster. The main characteristic of dense-disk-5000 is that there is no low-density area to separate the clusters; instead, there is a density variation between them (Figure 4a).
Upon analyzing the clusters built by the different algorithms with this dataset, we observed that DBSCAN marked the objects within the less dense cluster as noise (see the points in light gray), obtaining only the densest cluster (Figure 4b). However, this solution results in some information being discarded. KDBSCAN (Figure 4c) and KNN-BLOCK-DBSCAN (Figure 4d) have the same issue as DBSCAN. The IKMEANS algorithm (Figure 4e) detected a high-density zone in the center of the dataset; however, the proximity among the objects of the less dense cluster leads to the incorrect detection of the cluster’s borders. The RNN-DBSCAN algorithm (Figure 4f) can detect density variations, which allows it to separate the objects of the densest cluster; however, it cannot group the objects of the less dense cluster into a single cluster and instead creates several clusters. The RS-DBSCAN algorithm (Figure 4g) built clusters similarly to RNN-DBSCAN, but the densest cluster’s borders were incorrectly identified. Finally, VDECAL (Figure 4h) detected the density variations and separated the objects into two clusters of different densities, resulting in a better separation of the two clusters.

4.3. Runtime

The datasets used in the quality evaluation pose challenges for cluster building; however, most of them contain a small number of objects, which limits the runtime analysis for large datasets. To assess this aspect, following Gholizadeh [54] and Hanafi [61], we used seven datasets from the UCI Machine Learning Repository [72], as they offer diversity in the number of dimensions and objects. In [54,61], values for ϵ and minPts are also suggested for each dataset, normalized within [0,1]. Consequently, we adopted these same values to set these parameters in the evaluated algorithms. Table 4 shows the dataset name, the number of objects, the dimensions, and the values of ϵ and minPts used for IKMEANS, KDBSCAN, KNN_BLOCK_DBSCAN, and VDECAL. Additionally, IKMEANS uses a sample size parameter set to 10% of the dataset size, a value recommended by its authors. KDBSCAN and RS-DBSCAN use the number of partitions as a parameter, set to 5; higher values resulted in most of the dataset being marked as noise, preventing cluster formation. Our proposed algorithm, VDECAL, employs the δ parameter set to 2.5 for all the tests, a value experimentally determined to allow detecting density variations. It also uses the t parameter, set to 20% of the dataset size, equivalent to creating five partitions. On the other hand, the RNN-DBSCAN and RS-DBSCAN algorithms use a K parameter instead of these parameters, which was set to the same value as minPts, as recommended by the authors.
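Since [54,61] report ϵ and minPts for features normalized within [0,1], a min–max scaling step precedes clustering; the sketch below uses scikit-learn's MinMaxScaler as one plausible way to do this, and the file name is hypothetical.
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.loadtxt("skin_segmentation.csv", delimiter=",")   # hypothetical raw UCI dataset
X_scaled = MinMaxScaler().fit_transform(X)                # every feature rescaled to [0, 1]
```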
In this experiment, we analyze the runtime of six algorithms (VDECAL, RNNDBSCAN, RSDBSCAN, IKMEANS, KDBSCAN, KNN_BLOCK_DBSCAN) across the datasets listed in Table 4. Figure 5 shows the results on a graph with a logarithmic scale, where the runtime was recorded for each algorithm on each dataset; the datasets are ordered by increasing size, and the reported times include only effective CPU processing time, excluding input/output time. Note that if a dataset does not have a reported runtime in the figure, it is because the algorithm exceeded 24 h, highlighting its limitations with large datasets.
KNN_BLOCK_DBSCAN processed 8 out of the 12 datasets, but exceeded the 24-h limit for the remaining four. Moreover, in all the datasets KNN_BLOCK_DBSCAN could process, it had the highest runtime out of the six clustering algorithms.
RNNDBSCAN and RSDBSCAN processed 11 out of the 12 datasets, but exceeded the 24-h limit for the largest dataset (Poker-hand). RNNDBSCAN’s runtime is among the highest. RSDBSCAN’s runtime exhibited a similar growth to RNNDBSCAN’s; however, RSDBSCAN outperformed RNNDBSCAN regarding runtime because it is an enhanced version for large datasets.
KDBSCAN processed 10 out of the 12 datasets. In medium-sized datasets such as Dry_beam, Letter, and Nomao, KDBSCAN outperformed VDECAL regarding runtime. However, as the dataset size increased, KDBSCAN’s runtime grew faster than VDECAL’s, making it unable to process the two largest datasets within the 24-h limit. On the Gas and Position dataset, KDBSCAN’s runtime (15 s) outperformed VDECAL’s (up to 17 min); however, this difference arises because KDBSCAN marked all objects in the dataset as noise.
IKMEANS and VDECAL completed all 12 datasets within the 24-h limit. On the smaller datasets, VDECAL outperformed the IKMEANS runtime. On the largest datasets (Skin Segmentation, Gas and Position, 3D Road Network, and Poker-hand), IKMEANS shows better runtime compared to VDECAL. However, according to Section 4.2, IKMEANS is the algorithm with the worst clustering quality.
These results indicate that the proposed algorithm, VDECAL, was faster than KNN_BLOCK_DBSCAN and RS-DBSCAN, two of the fastest, most accurate, and most recent clustering algorithms for large datasets, while obtaining better or similar quality results.

4.4. Scalability

To assess VDECAL’s scalability, we conducted two experiments. The first experiment evaluates how the algorithm’s runtime scales with increasing dataset size by varying the number of objects. The second experiment evaluates the effect of data dimensionality on runtime, maintaining the number of objects fixed while varying the number of dimensions. The dataset used for these experiments is HIGGS, obtained from the UCI Machine Learning Repository [70]. This dataset has 11,000,000 objects, each with a dimensionality of 28. The parameters used for evaluation were set at epsilon = 1.0, minPts = 1100, and a partition size of 110,000. These parameters were chosen to generate two clusters within a reasonable runtime. While the clusters may not be optimal, they offer acceptable results given the time and computational limitations. We conducted each test ten times to account for the random component of VDECAL.
In the first experiment, we evaluated the scalability of VDECAL in terms of dataset size by increasing the number of objects. We used incremental portions of the HIGGS dataset, starting with 10% and progressively adding increments of 10% until reaching the full dataset size of 11,000,000 objects. Figure 6a depicts the results of how the runtime increases with the dataset size. The x-axis shows the portion of the dataset used, and the y-axis indicates the corresponding runtime (in minutes) required for VDECAL to process each portion. Each point in the figure has error bars, indicating the variability in runtime for the ten runs. Figure 6a shows that the VDECAL algorithm exhibits a proportional increase in runtime as the number of objects increases. This means that the algorithm scales well with increasing input size, as the runtime does not grow drastically. Thus, the VDECAL algorithm has good scalability, making it suitable for applications where the input size can vary significantly. The linear relationship between input size and runtime ensures predictable performance as the number of objects increases.
The second experiment evaluates the impact of data dimensionality on the runtime performance of VDECAL. For this experiment, we used only 10% of the original dataset. Initially, we used the 28 dimensions from the original dataset and progressively added 28 additional dimensions at each sample until we reached a total of 280 dimensions. To ensure that the distance between data objects does not change, all newly added dimensions were set to zero, which allows us to evaluate the impact of dimensionality on runtime without altering the data. Figure 6b depicts how runtime increases as the number of dimensions increases. The x-axis shows the number of dimensions, and the y-axis indicates the corresponding runtime (in minutes). Each point in the figure has error bars, indicating the variability in runtime for the ten runs. The results indicate that the runtime grows linearly as dimensionality increases. Figure 6b also shows that the runtime remains consistent across different dimensions, indicating that the VDECAL algorithm exhibits good scalability. In summary, both figures highlight the good scalability of VDECAL with increasing data size and dimensionality.
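The zero-padding used to grow the dimensionality without altering pairwise distances can be reproduced with a short NumPy sketch; the function name is ours, and the block size of 28 mirrors the original HIGGS dimensionality.
```python
import numpy as np

def pad_dimensions(X, extra_blocks, block_size=28):
    """Append extra_blocks * block_size all-zero columns; Euclidean distances remain unchanged."""
    zeros = np.zeros((X.shape[0], extra_blocks * block_size))
    return np.hstack([X, zeros])

# Example: grow a 28-dimensional sample to 280 dimensions (9 extra blocks of 28 zero columns).
# X_280 = pad_dimensions(X_28, extra_blocks=9)
```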

5. Conclusions and Future Works

This paper introduced VDECAL, a density-based clustering algorithm for efficiently handling large datasets. VDECAL operates through three stages. Initially, it partitions the dataset into segments of similar sizes to evenly distribute the computational workload. Subsequently, it identifies relevant object subsets characterized by attributes such as density, position, and overlap ratio, which are used for cluster building. Finally, using these relevant object subsets, VDECAL identifies low-density regions and density variations in the high-density regions to build the clusters. This way of building clusters enables VDECAL to reduce runtime without sacrificing clustering quality.
Our evaluation of VDECAL focuses on randomness, clustering quality, runtime performance, and scalability. Regarding randomness, our experiments show that VDECAL produces consistent results across different initial seeds and processing orders. Despite the random initialization, the algorithm consistently converges to similar clusters. Regarding clustering quality, from our experiments on a well-known repository widely used for assessing density-based clustering algorithms, we found that VDECAL shows a quality improvement compared to the state of the art. The improvement in quality is obtained thanks to the density variation detection performed by our proposal, an unexplored approach in density-based clustering algorithms for large datasets.
Regarding runtime, the experiments show that our algorithm outperforms KNN BLOCK DBSCAN and RS-DBSCAN, among the fastest, most accurate, and most recent clustering algorithms for large datasets. Although IKMEANS was the fastest, its clustering quality is low compared to the other algorithms. The scalability experiments illustrate how VDECAL’s runtime changes when the number of objects and dimensions are increased. For dimensionality, VDECAL’s runtime is proportional to the number of dimensions. However, concerning the number of objects, VDECAL exhibits a slower growth rate, which is crucial when processing large datasets. This is achieved through the balanced distribution of workload proposed in the VDECAL partitioning stage.
The runtime efficiency of VDECAL relies on keeping a low number of relevant object subsets concerning the total dataset size, which is a limitation of our proposed algorithm. Another limitation is that in our proposed algorithm, the parameters require manual adjustment for each dataset. Future research could explore integrating statistical analysis techniques to detect density variations among relevant object subsets, potentially eliminating the need for specific thresholds as parameters. Additionally, considering the prevalent use of parallel processing techniques for accelerating computation and handling larger datasets, implementing VDECAL within frameworks such as MapReduce could offer promising improvements in scalability and performance.

Author Contributions

Conceptualization, A.J.R.-D., J.F.M.-T. and J.A.C.-O.; methodology, A.J.R.-D., J.F.M.-T. and J.A.C.-O.; software, A.J.R.-D.; validation, A.J.R.-D.; formal analysis, A.J.R.-D.; investigation, A.J.R.-D.; resources, A.J.R.-D.; data curation, A.J.R.-D.; writing—original draft preparation, A.J.R.-D.; writing—review and editing, J.F.M.-T. and J.A.C.-O.; visualization, A.J.R.-D.; supervision, J.F.M.-T. and J.A.C.-O. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Secretaría de Ciencia, Humanidades, Tecnología e Innovación scholarship, grant number 778974.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors acknowledge the scholarship support from the Secretaría de Ciencia, Humanidades, Tecnología e Innovación (Grant 778974).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Jain, A.K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 2010, 31, 651–666. [Google Scholar] [CrossRef]
  2. Dalal, M.A.; Harale, N.D. A Survey on Clustering in Data Mining. In Proceedings of the ICWET ’11: International Conference & Workshop on Emerging Trends in Technology, Mumbai, India, 25–26 February 2011; Association for Computing Machinery: New York, NY, USA, 2011; pp. 559–562. [Google Scholar]
  3. Aggarwal, C.C. An introduction to cluster analysis. In Data Clustering; Chapman and Hall/CRC: New York, NY, USA, 2018; pp. 1–28. [Google Scholar]
  4. Giordani, P.; Ferraro, M.B.; Martella, F. Introduction to Clustering. In An Introduction to Clustering with R; Springer: Berlin/Heidelberg, Germany, 2020; pp. 3–5. [Google Scholar]
  5. Kumar, S.; Katiyar, V.; Katiyar, D. A Review on Data Mining and Techniques of Clustering Algorithms. In Challenges and Opportunities for Innovation in India; CRC Press: Boca Raton, FL, USA, 2025; p. 223. [Google Scholar]
  6. Oyewole, G.J.; Thopil, G.A. Data clustering: Application and trends. Artif. Intell. Rev. 2023, 56, 6439–6475. [Google Scholar] [CrossRef] [PubMed]
  7. Pitafi, S.; Anwar, T.; Sharif, Z. A taxonomy of machine learning clustering algorithms, challenges, and future realms. Appl. Sci. 2023, 13, 3529. [Google Scholar] [CrossRef]
  8. Govender, P.; Sivakumar, V. Application of k-means and hierarchical clustering techniques for analysis of air pollution: A review (1980–2019). Atmos. Pollut. Res. 2020, 11, 40–56. [Google Scholar] [CrossRef]
  9. Gan, G.; Valdez, E.A. Data clustering with actuarial applications. N. Am. Actuar. J. 2020, 24, 168–186. [Google Scholar] [CrossRef]
  10. Benabdellah, A.C.; Benghabrit, A.; Bouhaddou, I. A survey of clustering algorithms for an industrial context. Procedia Comput. Sci. 2019, 148, 291–302. [Google Scholar] [CrossRef]
  11. Kolajo, T.; Daramola, O.; Adebiyi, A. Big data stream analysis: A systematic literature review. J. Big Data 2019, 6, 47. [Google Scholar] [CrossRef]
  12. Dinh, D.T.; Huynh, V.N.; Sriboonchitta, S. Clustering mixed numerical and categorical data with missing values. Inf. Sci. 2021, 571, 418–442. [Google Scholar] [CrossRef]
  13. Steinbach, M.; Ertöz, L.; Kumar, V. The challenges of clustering high dimensional data. In New Directions in Statistical Physics; Springer: Berlin/Heidelberg, Germany, 2004; pp. 273–309. [Google Scholar]
  14. Choi, C.; Hong, S.Y. MDST-DBSCAN: A Density-Based Clustering Method for Multidimensional Spatiotemporal Data. ISPRS Int. J. Geo-Inf. 2021, 10, 391. [Google Scholar] [CrossRef]
  15. Su, Z.; Du, S.; Hao, J.; Han, B.; Ge, P.; Wang, Y. NELD-EC: Neighborhood-Effective-Line-Density-Based Euclidean Clustering for Point Cloud Segmentation. Sensors 2025, 25, 1174. [Google Scholar] [CrossRef]
  16. Chen, L.; Pan, Z.; Yuan, L. Study on Clustering Computing Methods of Big Data. In Proceedings of the ICITEE-2019: 2nd International Conference on Information Technologies and Electrical Engineering, Zhuzhou, China, 6–7 December 2019; Association for Computing Machinery: New York, NY, USA, 2020. [Google Scholar]
  17. Tole, A.A. Big data challenges. Database Syst. J. 2013, 4, 31–40. [Google Scholar]
  18. Nasser, T.; Tariq, R. Big data challenges. J. Comput. Eng. Inf. Technol. 2015, 9307, 2. [Google Scholar]
  19. Amanullah, M.A.; Habeeb, R.A.A.; Nasaruddin, F.H.; Gani, A.; Ahmed, E.; Nainar, A.S.M.; Akim, N.M.; Imran, M. Deep learning and big data technologies for IoT security. Comput. Commun. 2020, 151, 495–517. [Google Scholar] [CrossRef]
  20. Zaki, U.H.H.; Kamsani, I.I.; Ahmad Fadzil, A.F.; Idrus, Z.; Kandogan, E. Big Data: Issues and Challenges in Clustering Data Visualization. J. Adv. Res. Appl. Sci. Eng. Technol. 2024, 51, 150–159. [Google Scholar] [CrossRef]
  21. Fahim, A.; Salem, A.E.; Torkey, F.; Ramadan, M.; Saake, G. Scalable Varied Density Clustering Algorithm for Large Datasets. J. Softw. Eng. Appl. 2010, 3, 10. [Google Scholar] [CrossRef]
  22. Wani, A.A. Comprehensive analysis of clustering algorithms: Exploring limitations and innovative solutions. PeerJ Comput. Sci. 2024, 10, e2286. [Google Scholar] [CrossRef]
  23. Dafir, Z.; Lamari, Y.; Slaoui, S.C. A survey on parallel clustering algorithms for big data. Artif. Intell. Rev. 2021, 54, 2411–2443. [Google Scholar] [CrossRef]
  24. Ajmal, O.; Mumtaz, S.; Arshad, H.; Soomro, A.; Hussain, T.; Attar, R.W.; Alhomoud, A. Enhanced Parameter Estimation of DENsity CLUstEring (DENCLUE) Using Differential Evolution. Mathematics 2024, 12, 2790. [Google Scholar] [CrossRef]
  25. Campello, R.J.; Kröger, P.; Sander, J.; Zimek, A. Density-based clustering. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2020, 10, e1343. [Google Scholar] [CrossRef]
  26. Zou, Y.; Wang, Z.; Wang, X.; Lv, T. A Clustering Algorithm Based on Local Relative Density. Electronics 2025, 14, 481. [Google Scholar] [CrossRef]
  27. Khader, M.; Al-Naymat, G. Density-based algorithms for big data clustering using MapReduce framework: A Comprehensive Study. ACM Comput. Surv. (CSUR) 2020, 53, 93. [Google Scholar] [CrossRef]
  28. Mahdi, M.A.; Hosny, K.M.; Elhenawy, I. Scalable clustering algorithms for big data: A review. IEEE Access 2021, 9, 80015–80027. [Google Scholar] [CrossRef]
  29. Jiang, H.; Li, J.; Yi, S.; Wang, X.; Hu, X. A new hybrid method based on partitioning-based DBSCAN and ant clustering. Expert Syst. Appl. 2011, 38, 9373–9381. [Google Scholar] [CrossRef]
  30. Luo, G.; Luo, X.; Gooch, T.F.; Tian, L.; Qin, K. A Parallel DBSCAN Algorithm Based on Spark. In Proceedings of the 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom), Atlanta, GA, USA, 8–10 October 2016; pp. 548–553. [Google Scholar] [CrossRef]
  31. Malzer, C.; Baum, M. A hybrid approach to hierarchical density-based cluster selection. In Proceedings of the 2020 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), Virtual, 14–16 September 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 223–228. [Google Scholar]
  32. Almazroi, A.A.; Atwa, W. An Improved Clustering Algorithm for Multi-Density Data. Axioms 2022, 11, 411. [Google Scholar] [CrossRef]
  33. Borah, B.; Bhattacharyya, D. A clustering technique using density difference. In Proceedings of the 2007 International Conference on Signal Processing, Communications and Networking, Chennai, India, 22–24 February 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 585–588. [Google Scholar]
  34. Khan, M.A.; Jang, J.H.; Iqbal, N.; Jamil, H.; Naqvi, S.S.A.; Khan, S.; Kim, J.C.; Kim, D.H. Enhancing patient rehabilitation predictions with a hybrid anomaly detection model: Density-based clustering and interquartile range methods. CAAI Trans. Intell. Technol. 2025; early view. [Google Scholar]
  35. Chen, P.; Fan, X.; Liu, R.; Tang, X.; Cheng, H. Fiber segmentation using a density-peaks clustering algorithm. In Proceedings of the 2015 IEEE 12th International Symposium on Biomedical Imaging (ISBI), Brooklyn, NY, USA, 16–19 April 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 633–637. [Google Scholar]
  36. Bauer, S.; Wiest, R.; Nolte, L.P.; Reyes, M. A survey of MRI-based medical image analysis for brain tumor studies. Phys. Med. Biol. 2013, 58, R97. [Google Scholar] [CrossRef]
  37. Fletcher-Heath, L.M.; Hall, L.O.; Goldgof, D.B.; Murtagh, F.R. Automatic segmentation of non-enhancing brain tumors in magnetic resonance images. Artif. Intell. Med. 2001, 21, 43–63. [Google Scholar] [CrossRef]
  38. Wen, M.; Cho, S.; Chae, J.; Sung, Y.; Cho, K. Range image-based density-based spatial clustering of application with noise clustering method of three-dimensional point clouds. Int. J. Adv. Robot. Syst. 2018, 15, 1729881418762302. [Google Scholar] [CrossRef]
  39. Klasing, K.; Wollherr, D.; Buss, M. A clustering method for efficient segmentation of 3D laser data. In Proceedings of the 2008 IEEE International Conference on Robotics and Automation, Pasadena, CA, USA, 19–23 May 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 4043–4048. [Google Scholar]
  40. Bogoslavskyi, I.; Stachniss, C. Efficient online segmentation for sparse 3D laser scans. PFG- Photogramm. Remote Sens. Geoinf. Sci. 2017, 85, 41–52. [Google Scholar] [CrossRef]
  41. McInnes, L.; Healy, J. Accelerated hierarchical density based clustering. In Proceedings of the 2017 IEEE International Conference on Data Mining Workshops (ICDMW), New Orleans, LA, USA, 18–21 November 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 33–42. [Google Scholar]
  42. Qian, L.; Plant, C.; Böhm, C. Density-Based Clustering for Adaptive Density Variation. In Proceedings of the 2021 IEEE International Conference on Data Mining (ICDM), Auckland, New Zealand, 7–10 December 2021; pp. 1282–1287. [Google Scholar] [CrossRef]
  43. Qian, J.; Zhou, Y.; Han, X.; Wang, Y. MDBSCAN: A multi-density DBSCAN based on relative density. Neurocomputing 2024, 576, 127329. [Google Scholar] [CrossRef]
  44. Storey, V.C.; Song, I.Y. Big data technologies and management: What conceptual modeling can do. Data Knowl. Eng. 2017, 108, 50–67. [Google Scholar] [CrossRef]
  45. Heidari, S.; Alborzi, M.; Radfar, R.; Afsharkazemi, M.A.; Rajabzadeh Ghatari, A. Big data clustering with varied density based on MapReduce. J. Big Data 2019, 6, 77. [Google Scholar] [CrossRef]
  46. Bathla, G.; Aggarwal, H.; Rani, R. A novel approach for clustering big data based on MapReduce. Int. J. Electr. Comput. Eng. 2018, 8, 1711–1719. [Google Scholar] [CrossRef]
  47. Kaur, K.; Bharti, V. A Survey on Big Data—Its Challenges and Solution from Vendors. In Big Data Processing Using Spark in Cloud; Springer: Berlin/Heidelberg, Germany, 2019; pp. 1–22. [Google Scholar]
  48. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the KDD’96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; Volume 96, pp. 226–231. [Google Scholar]
  49. Jing, W.; Zhao, C.; Jiang, C. An improvement method of DBSCAN algorithm on cloud computing. Procedia Comput. Sci. 2019, 147, 596–604. [Google Scholar] [CrossRef]
  50. Bryant, A.; Cios, K. RNN-DBSCAN: A Density-Based Clustering Algorithm Using Reverse Nearest Neighbor Density Estimates. IEEE Trans. Knowl. Data Eng. 2018, 30, 1109–1121. [Google Scholar] [CrossRef]
  51. Chen, Y.; Yang, Y.; Pei, S.; Chen, Y.; Du, J. A simple rapid sample-based clustering for large-scale data. Eng. Appl. Artif. Intell. 2024, 133, 108551. [Google Scholar] [CrossRef]
  52. Wei, X.; Peng, M.; Huang, H.; Zhou, Y. An overview on density peaks clustering. Neurocomputing 2023, 554, 126633. [Google Scholar] [CrossRef]
  53. Chen, Y.; Hu, X.; Fan, W.; Shen, L.; Zhang, Z.; Liu, X.; Du, J.; Li, H.; Chen, Y.; Li, H. Fast density peak clustering for large scale data based on kNN. Knowl.-Based Syst. 2020, 187, 104824. [Google Scholar] [CrossRef]
  54. Gholizadeh, N.; Saadatfar, H.; Hanafi, N. K-DBSCAN: An improved DBSCAN algorithm for big data. J. Supercomput. 2021, 77, 6214–6235. [Google Scholar] [CrossRef]
  55. Arthur, D.; Vassilvitskii, S. K-Means++: The Advantages of Careful Seeding. In Proceedings of the SODA ’07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; pp. 1027–1035. [Google Scholar]
  56. Chen, Y.; Zhou, L.; Pei, S.; Yu, Z.; Chen, Y.; Liu, X.; Du, J.; Xiong, N.N. KNN-BLOCK DBSCAN: Fast Clustering for Large-Scale Data. IEEE Trans. Syst. Man, Cybern. Syst. 2021, 51, 3939–3953. [Google Scholar] [CrossRef]
  57. Yan, H.; Wang, M.; Xie, J. ANN-DPC: Density peak clustering by finding the adaptive nearest neighbors. Knowl.-Based Syst. 2024, 294, 111748. [Google Scholar] [CrossRef]
  58. Cheng, D.; Li, Y.; Xia, S.; Wang, G.; Huang, J.; Zhang, S. A Fast Granular-Ball-Based Density Peaks Clustering Algorithm for Large-Scale Data. IEEE Trans. Neural Networks Learn. Syst. 2024, 35, 17202–17215. [Google Scholar] [CrossRef] [PubMed]
  59. Jang, J.; Jiang, H. DBSCAN++: Towards fast and scalable density clustering. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Chaudhuri, K., Salakhutdinov, R., Eds.; Proceedings of Machine Learning Research; PMLR, 2019; Volume 97, pp. 3019–3029. [Google Scholar]
  60. Lu, W. Improved K-means clustering algorithm for big data mining under Hadoop parallel framework. J. Grid Comput. 2020, 18, 239–250. [Google Scholar] [CrossRef]
  61. Hanafi, N.; Saadatfar, H. A fast DBSCAN algorithm for big data based on efficient density calculation. Expert Syst. Appl. 2022, 203, 117501. [Google Scholar] [CrossRef]
  62. Ma, L.; Yang, G.; Yang, Y.; Chen, X.; Lu, J.; Gong, Z.; Hao, Z. UP-DPC: Ultra-scalable parallel density peak clustering. Inf. Sci. 2024, 660, 120114. [Google Scholar] [CrossRef]
  63. Kriegel, H.P.; Pfeifle, M. Density-Based Clustering of Uncertain Data. In Proceedings of the KDD ’05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, Chicago, IL, USA, 21–24 August 2005; Association for Computing Machinery: New York, NY, USA, 2005; pp. 672–677. [Google Scholar]
  64. Shiffler, R.E. Maximum Z Scores and Outliers. Am. Stat. 1988, 42, 79–80. [Google Scholar] [CrossRef]
  65. Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Introduction to Algorithms; MIT Press: Cambridge, MA, USA, 2022. [Google Scholar]
  66. Atallah, M.; Blanton, M. Algorithms and Theory of Computation Handbook, Volume 1: General Concepts and Techniques; Chapman & Hall/CRC Applied Algorithms and Data Structures Series; CRC Press: Boca Raton, FL, USA, 2009. [Google Scholar]
  67. Barton, T. Clustering-Benchmark. 2019. Available online: https://github.com/deric/clustering-benchmark (accessed on 3 October 2024).
  68. Halkidi, M.; Batistakis, Y.; Vazirgiannis, M. Cluster validity methods: Part I. ACM Sigmod Rec. 2002, 31, 40–45. [Google Scholar] [CrossRef]
  69. Romano, S.; Vinh, N.X.; Bailey, J.; Verspoor, K. Adjusting for chance clustering comparison measures. J. Mach. Learn. Res. 2016, 17, 1–32. [Google Scholar]
  70. Zhang, P. Evaluating accuracy of community detection using the relative normalized mutual information. J. Stat. Mech. Theory Exp. 2015, 2015, P11006. [Google Scholar] [CrossRef]
  71. Thrun, M.C.; Ultsch, A. Clustering benchmark datasets exploiting the fundamental clustering problems. Data Brief 2020, 30, 105501. [Google Scholar] [CrossRef]
  72. Kelly, M.; Longjohn, R.; Nottingham, K. The UCI Machine Learning Repository. 2023. Available online: https://archive.ics.uci.edu (accessed on 3 October 2024).
Figure 1. Cluster separation in density-based clustering. (a) Clusters separated by a low-density region. (b) Adjacent clusters with density variation.
Figure 2. Division of a dataset using partition-based methods: (a) K-Means++ partitions vs. (b) our density-aware partitioning.
Figure 3. Clusters and relevant object subsets produced by VDECAL on the Compound dataset using different random seeds and orders: (a) Run 1; (b) Run 2; (c) Run 3; (d) Run 4.
Figure 4. Clustering results for the dense-disk-5000 dataset. Light gray points represent noise. (a) Ground truth. (b–d) DBSCAN, KDBSCAN, and KNN-BLOCK-DBSCAN keep only the densest cluster. (e) IKMEANS with incorrect borders. (f,g) RNN-DBSCAN and RS-DBSCAN create multiple clusters. (h) VDECAL achieves the best separation.
Figure 5. Runtime analysis of evaluated clustering algorithms on datasets of different sizes.
Figure 6. VDECAL scalability: (a) Scalability regarding the number of objects; (b) Scalability regarding dimensions per object.
Table 1. Synthetic datasets used to evaluate clustering quality.
Dataset            Objects   Clusters   Dimensions
R15                600       15         2
Jain               373       2          2
complex8           2551      8          2
aggregation        788       7          2
Flame              240       2          2
2G_unbalance.txt   1050      2          2
Pathbased          300       3          2
Cluto-t4-8k        8000      7          2
Compound           399       6          2
Complex9           3031      9          2
Table 2. Parameter values that produce the best-quality results.
Dataset | VDECAL (t, ϵ, mPts, δ) | DBSCAN (ϵ, mPts) | IKM (ϵ, mPts) | KDBSCAN (P, ϵ, mPts) | RNNDBSCAN (K) | RSDBSCAN (K1, K2) | KNNBDBS (ϵ, mPts)
R15400, 0.5, 4, 30.35, 50.7, 12, 0.3, 1122, 50.39, 5
Jain200, 3.3, 1, 2.72.3, 42.6, 42, 2.26, 1179, 42.34, 5
complex81300, 15.6, 3, 1.7114.89, 411.3, 22, 14, 1172, 1014.9, 4
Aggregation400, 1.8, 9, 1.91.6, 91.83, 22, 1.6, 9142, 101.5, 8
Flame120, 1.3, 4, 2.71.4, 91.45, 32, 1.45, 91711, 31.4, 10
2G_unbalance600, 0.14, 7, 2.70.1, 71.25, 12, 0.1, 4143, 80.12, 9
pathbased150, 1.7, 6, 21.6, 72.0, 22, 1.5, 6113, 51.92, 8
cluto-t4-8k4000, 9.7, 18, 2.49.5, 2010.6, 22, 8.4, 15163, 169, 15
compound200, 1.5, 2, 31.3, 36.0, 42, 1.5, 2105, 141.48, 3
complex91600, 15, 5, 2.712.1, 511.5, 22, 12.2, 1252, 1714.9, 5
Table 3. Performance on artificial datasets. Bold values indicate the best result(s) for each dataset.
Dataset         Metric   VDECAL   DBSCAN   IKM      KDBS     RNNDBS   RSDBS    KNNBDBS
R15             ARI      0.9927   0.9356   0.9264   0.8523   0.9891   0.9927   0.9500
                AMI      0.9938   0.9416   0.9591   0.9183   0.9907   0.9938   0.9544
                NMI      0.9942   0.9458   0.9626   0.9280   0.9913   0.9942   0.9577
Jain            ARI      1.0000   0.9349   0.3138   0.9420   1.0000   0.5313   0.9345
                AMI      1.0000   0.8476   0.5177   0.8795   1.0000   0.4141   0.8457
                NMI      1.0000   0.8485   0.5202   0.8800   1.0000   0.4154   0.8467
complex8        ARI      0.9723   0.9932   0.2360   0.9858   0.8735   0.5023   0.9490
                AMI      0.9737   0.9868   0.6085   0.9741   0.9239   0.6135   0.9472
                NMI      0.9739   0.9869   0.6168   0.9744   0.9245   0.6154   0.9478
aggregation     ARI      0.9956   0.9828   0.6318   0.9767   0.9956   0.9421   0.9781
                AMI      0.9923   0.9763   0.7588   0.9273   0.9923   0.9248   0.9705
                NMI      0.9924   0.9767   0.7638   0.9313   0.9924   0.9257   0.9710
Flame           ARI      0.9714   0.9387   0.3424   0.9396   0.9666   0.8229   0.9440
                AMI      0.9345   0.8722   0.5047   0.8873   0.9266   0.7293   0.8808
                NMI      0.9350   0.8731   0.5097   0.8880   0.9269   0.7302   0.8816
2G_unbalance    ARI      0.9778   0.8244   0.0451   0.8565   0.9784   0.6592   0.8821
                AMI      0.9496   0.8094   0.1511   0.8369   0.9652   0.5411   0.8601
                NMI      0.9497   0.8102   0.1540   0.8377   0.9653   0.5421   0.8608
pathbased       ARI      0.9396   0.8032   0.5184   0.7738   0.5324   0.4671   0.8799
                AMI      0.9169   0.7799   0.5624   0.7381   0.5509   0.5319   0.8467
                NMI      0.9174   0.7813   0.5670   0.7454   0.5582   0.5362   0.8476
cluto-t4-8k     ARI      0.9522   0.9699   0.3878   0.9682   0.3866   0.6241   0.9709
                AMI      0.9376   0.9545   0.5011   0.9520   0.6595   0.6861   0.9575
                NMI      0.9377   0.9546   0.5017   0.9521   0.6618   0.6865   0.9576
compound        ARI      0.9385   0.9419   0.7957   0.9712   0.8817   0.6248   0.9634
                AMI      0.9305   0.9162   0.8372   0.9386   0.8994   0.6846   0.9315
                NMI      0.9320   0.9180   0.8403   0.9403   0.9020   0.6850   0.9330
complex9        ARI      1.0000   0.8806   0.4232   0.9470   0.9915   0.5480   0.9313
                AMI      1.0000   0.9491   0.6085   0.9525   0.9856   0.6852   0.9686
                NMI      1.0000   0.9494   0.6097   0.9527   0.9857   0.6864   0.9687
Table 4. Datasets and their parameters used in the runtime evaluation.
Dataset             Objects     Dimensions   ϵ       minPts
Abalone             4177        8            0.035   4
Dry_bean            13,611      16           0.16    4
Gas-drift           13,910      128          0.40    5
Magic               19,020      10           0.06    4
Letter              20,000      16           0.35    4
Nomao               34,465      118          1.98    5
Statlog (Shuttle)   58,000      9            0.02    5
Accelerometer       153,000     4            0.065   4
Skin Segmentation   245,057     3            0.01    4
Gas and position    416,153     9            0.003   20
3D Road Network     434,874     4            0.01    4
Poker-hand          1,000,000   10           0.25    10
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
