Grid-Based Clustering Using Boundary Detection

Clustering can be divided into five categories: partitioning, hierarchical, model-based, density-based, and grid-based algorithms. Among them, grid-based clustering is highly efficient in handling spatial data. However, the traditional grid-based clustering algorithms still face many problems: (1) Parameter tuning: density thresholds are difficult to adjust; (2) Data challenge: clusters with overlapping regions and varying densities are not well handled. We propose a new grid-based clustering algorithm named GCBD that can solve the above problems. Firstly, the density estimation of nodes is defined using the standard grid structure. Secondly, GCBD uses an iterative boundary detection strategy to distinguish core nodes from boundary nodes. Finally, two clustering strategies are combined to group core nodes and assign boundary nodes. Experiments on 18 datasets demonstrate that the proposed algorithm outperforms 6 grid-based competitors.


Introduction
Clustering is the process of grouping similar objects together [1]. As a fundamental data mining task, it can be used either independently or as a preprocessing step before other data mining tasks. Clustering plays an important role in many scientific fields [2], including earth sciences [3,4], biology [5][6][7], and economics [8,9].
Generally, clustering can be divided into five categories: partitioning [10,11], hierarchical [12,13], model-based [14,15], density-based [16][17][18], and grid-based algorithms. Partitioned clustering is designed to discover clusters in the data by optimizing a given objective function. Hierarchical clustering deals with the clustering problem by constructing a tree diagram. Model-based clustering uses a probabilistic methodology to optimize the match between some mathematical models and the data. Density-based and grid-based solutions are two closely related categories that attempt to explore the data space at a high level of granularity.
In recent decades, many grid-based clustering algorithms have been developed. In these algorithms, the data space is partitioned into a finite number of cells to form a grid structure. Clusters correspond to connected regions of high-density cells. As most grid-based clustering algorithms rely on calculations of cell density, these algorithms may also be considered density-based. Indeed, some grid-based clustering algorithms are developed by improving density-based clustering algorithms. Among density-based clustering algorithms, DBSCAN [19] and DPC [20] are the most widely used and have the most variants.
The computation cost of grid-based clustering is determined by the number of grid cells, independent of dataset size. Generally, grid-based clustering is more efficient than other clustering algorithms for large-scale spatial data since the number of cells is significantly smaller than the number of data points. Although grid-based clustering algorithms greatly improve computational efficiency, they still face some of the following problems.

• They are sensitive to the density threshold parameter, which may be difficult to obtain.
• They may not be sufficient to achieve desired clustering results for data with varying densities.
• They are not able to handle boundary regions between some adjacent clusters well.
We aim at alleviating the above-mentioned issues and propose a grid-based clustering using boundary detection, named GCBD. Firstly, the density estimation of nodes is defined using the standard grid structure. Secondly, GCBD uses an iterative boundary detection strategy to distinguish core nodes from boundary nodes. Finally, DBSCAN and DPC clustering strategies are combined to group core nodes and assign boundary nodes.
The rest of this paper is organized as follows. We survey the related work in Section 2. Our proposed GCBD algorithm is presented in Section 3. Section 4 presents the experimental results. Conclusions and suggestions for future work are given in Section 5.

Grid-Based Clustering
In this subsection, we discuss some classical grid-based clustering algorithms. To our knowledge, Schikuta [21] introduces the first grid-based hierarchical clustering algorithm called GRIDCLUS. In GRIDCLUS, data points are designated to blocks in the grid structure such that their topological distributions are maintained. GRIDCLUS calculates the density for each block. The blocks are clustered iteratively in order of descending density to form a nested sequence of nonempty, disjoint clusters. Schikuta and Erhart [22] further propose the BANG algorithm to improve the inefficiency of GRIDCLUS in terms of grid structure size and neighbor search. Wang et al. [23] propose a statistical information grid-based clustering method (STING) to cluster spatial data. In contrast to GRIDCLUS and BANG, STING divides the spatial area into rectangular cells and uses a hierarchical grid structure for storing the cells. Using the hierarchical grid structure may generate rougher cluster boundaries, which reduces clustering quality. As a solution to the problem, Sheikholeslami et al. [24] propose WaveCluster, a grid-based and density-based clustering approach utilizing wavelet transforms. In this algorithm, wavelet transforms are applied to the spatial data feature space to detect arbitrary shape clusters at different scales.
A significant issue with grid-based algorithms is their scalability to higher-dimensional data, since the time complexity grows as the number of grid cells increases. To address the challenge, Agrawal et al. [25] invent the algorithm CLIQUE (clustering in quest). CLIQUE seeks dense rectangular cells in all subspaces by applying a bottom-up scheme. Clusters are generated as the connected areas of dense cells. OptiGrid (optimal grid clustering) [26] significantly modifies CLIQUE. OptiGrid constructs the best cutting hyperplanes through a set of projections to obtain an optimal grid partitioning. The above algorithms are very sensitive to the density threshold, and this parameter is difficult to adjust.
In grid-based clustering, variants of the DBSCAN algorithm [19] are the most closely related to our algorithm. Wu et al. [27] propose a density- and grid-based (DGB) clustering algorithm inspired by grid partitioning and DBSCAN. The DGB algorithm only needs to calculate distances between grid nodes instead of distances between data points. Therefore, the algorithm processes spatial data more efficiently. Like DBSCAN, however, the algorithm cannot recognize clusters with varying densities. Uncu et al. [28] propose a three-step clustering algorithm (called GRIDBSCAN) to address the issue. The first step ensures homogeneous density in each grid by selecting appropriate grids. In the second step, cells that have similar densities are merged. Lastly, the DBSCAN algorithm is executed. However, these algorithms are not able to handle boundary regions between some adjacent clusters well. The main difference between the two algorithms and ours is the way they divide core and boundary nodes (or cells). The two algorithms apply a fixed, global density threshold to classify core and boundary nodes (or cells). In contrast, our algorithm uses an iterative boundary detection strategy to divide core and boundary nodes.

Density-Based Clustering
This subsection discusses several density-based clustering algorithms that are most relevant to our algorithm.
DBSCAN [19] is one of the most important density-based techniques. In DBSCAN, clusters are assumed to be connected high-density regions separated by low-density regions. The core points are points in the dense part of a cluster, while the boundary points are points that belong to a cluster but are not surrounded by a dense neighborhood. The primary difference between DBSCAN and our algorithm is that DBSCAN is density-based while our algorithm is grid-based. Our algorithm is similar to DBSCAN in that it utilizes the notions of reachability and connectivity to find maximally connected components of nodes. Despite this, they differ in their definitions of connectivity. DBSCAN defines connectivity between points and then generates a clustering of points (including core points and boundary points) according to connectivity. Our algorithm, however, defines connectivity between core nodes and then generates cluster cores for the core nodes based on their connectivity.
Rodriguez and Laio [20] propose density peaks clustering (DPC), a density-based algorithm. The algorithm assumes that cluster centers are surrounded by neighbors with lower local densities and that they are at a relatively large distance from any points with a higher local density. Our algorithm and DPC are similar in how the nodes (or points) are assigned, but the mechanisms differ considerably. Unlike DPC's assignment mechanism, where each non-center point is assigned to the same cluster as its nearest neighbor with higher density, our assignment mechanism is a multilevel approach, where each boundary node is assigned to the same cluster as the node with the highest density among its nearest neighbors in the inner layers.

DBSCAN Algorithm
DBSCAN [19] is one of the most widely used density-based clustering algorithms. It can identify arbitrary-shaped clusters and clusters with noise (i.e., outliers). In DBSCAN, there are two key parameters:
• ε (or eps): a distance threshold. Two points are considered neighbors if the distance between them is less than or equal to eps.
• k (or minPts): the minimum number of neighbors within a given radius.
Based on these two parameters, DBSCAN makes several definitions:
• Core point: A point is a core point if there are at least minPts points (including the point itself) within its surrounding area of radius eps.
• Density-connected: Two points x_p and x_q are density-connected if x_p is directly or transitively reachable from x_q, or x_q is directly or transitively reachable from x_p.
• Boundary point: A point is a boundary point if it is reachable from a core point and there are fewer than minPts points within its surrounding area.
• Noise point (or outlier): A point is a noise point (or outlier) if it is neither a core point nor reachable from any core point.
• Cluster: A cluster w.r.t. eps and minPts is a non-empty maximal subset of the dataset such that every pair of points in the cluster is density-connected.
To explain the notions of core point, boundary point, and noise point, we provide an example in Figure 1. Red points are core points because there are at least 4 points within their surrounding area of radius eps; this area is shown with circles in the figure. The green points are boundary points because they are reachable from a core point (i.e., they lie in the surrounding area of a core point) and have fewer than 4 points within their neighborhood. The points x_2 and x_3 have two points (including themselves) within their neighborhood (i.e., the surrounding area of radius eps). Finally, x_4 is a noise point because it is not a core point and cannot be reached from any core point.
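The three-way classification described above can be sketched in a short, self-contained example. This is illustrative only; the toy points, eps, and minPts values are ours, not from the paper:

```python
import numpy as np

def classify_points(X, eps=1.0, min_pts=4):
    """Label each point as 'core', 'boundary', or 'noise' per DBSCAN's definitions."""
    n = len(X)
    # Pairwise Euclidean distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Neighborhood test includes the point itself (d[i, i] == 0 <= eps).
    neighbors = d <= eps
    counts = neighbors.sum(axis=1)
    is_core = counts >= min_pts
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif neighbors[i][is_core].any():   # reachable from some core point
            labels.append("boundary")
        else:
            labels.append("noise")
    return labels

# A dense blob of 5 points, one fringe point near it, and one far outlier.
X = np.array([[0.0, 0.0], [0.3, 0.0], [0.0, 0.3], [0.3, 0.3], [0.15, 0.15],
              [0.7, 0.0], [5.0, 5.0]])
labels = classify_points(X, eps=0.5, min_pts=4)
```

The fringe point is reachable from a core point but has too few neighbors of its own, so it is labeled a boundary point; the outlier is reachable from no core point and is labeled noise.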

DPC Algorithm
Rodriguez and Laio [20] present density peaks clustering (DPC), an algorithm that combines density-based clustering with centroid-based clustering. It is based on the assumptions that cluster centers are surrounded by neighbors with lower local densities and that they are far away from any points of higher density. Two quantities describe each point x_i: its local density ρ_i and its distance δ_i from the nearest point of higher density. The local density ρ_i of x_i is calculated as

$$\rho_i = \sum_{j \neq i} \chi\big(d(x_i, x_j) - d_c\big), \qquad \chi(a) = \begin{cases} 1, & a < 0 \\ 0, & \text{otherwise,} \end{cases}$$

where d(x_i, x_j) is the distance between points x_i and x_j, and the "cutoff distance" d_c is a user-specified parameter.
For the point x_i with the highest density, DPC defines δ_i = max_j d(x_i, x_j). For the other points, δ_i is the minimum distance between point x_i and any other point x_j whose ρ_j is higher than ρ_i:

$$\delta_i = \min_{j:\, \rho_j > \rho_i} d(x_i, x_j),$$

where x_j ∈ X and X denotes the whole dataset. Based on DPC's center assumption, density peaks with large ρ and δ are manually selected as centers by inspecting a decision graph (i.e., a ρ-δ plot). Subsequently, each non-center point is allocated to the same cluster as its nearest point of higher density.
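The two DPC quantities can be computed directly from a distance matrix. The sketch below uses the cutoff kernel for ρ and the usual convention for δ; the toy data are ours:

```python
import numpy as np

def dpc_quantities(X, d_c):
    """Compute DPC's local density rho (cutoff kernel) and distance delta."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # rho_i: number of other points closer than d_c (self-distance 0 is subtracted).
    rho = ((d < d_c).sum(axis=1) - 1).astype(float)
    n = len(X)
    delta = np.empty(n)
    for i in range(n):
        higher = rho > rho[i]
        if higher.any():
            # Minimum distance to any point of strictly higher density.
            delta[i] = d[i, higher].min()
        else:
            # Convention for the highest-density point(s).
            delta[i] = d[i].max()
    return rho, delta

# Three close points form one dense group; one point sits far away.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0]])
rho, delta = dpc_quantities(X, d_c=0.5)
```

The isolated point has zero density but a small δ (it is near no density peak of its own), while the dense points receive the large-δ convention, matching DPC's intuition that centers combine high ρ with high δ.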

Proposed Algorithm
In this section, we describe our proposed clustering algorithm in detail and analyze its computational complexity.

Standard Grid Structure
As with most grid-based clustering algorithms, the first step in GCBD is to create a grid structure that divides the data space into a finite number of cells. To simplify subsequent calculations, we create a standard grid structure.
Let X = {x_1, x_2, . . . , x_n} be a dataset consisting of n data points, where each data point has m features, i.e., x_i = {x_i1, x_i2, . . . , x_im}. Let x_j^max and x_j^min denote the maximum and minimum values of the j-th feature, respectively. Assume that each dimension is divided into l equal intervals. Features in the original dataset are scaled to fall between 1 and l+1 by the scaling function Φ(·):

$$\Phi(x_{ij}) = 1 + l \cdot \frac{x_{ij} - x_j^{\min}}{x_j^{\max} - x_j^{\min}}. \qquad (3)$$
Let A = A_1 × A_2 × · · · × A_m denote the transformed feature space, where A_1, A_2, . . . , A_m are the domains of the features (dimensions) of A. We define the notion of a standard grid structure.
Definition 1 (Standard grid structure). A grid structure is called a standard grid structure, if the transformed feature space is divided into l intervals of equal length after each transformed feature is scaled by Equation (3).
Let v = {v_1, v_2, · · · , v_m} be a node in the standard grid structure. We obtain the following property.

Property 1.
The nodes in the standard grid structure are located only at integer coordinates, i.e., v_j ∈ {1, 2, . . . , l + 1} for j = 1, . . . , m.
To explain some notions of the standard grid structure, Figure 2 provides an example in a two-dimensional space. Green-shaded rectangles are cells in the standard grid structure. Red intersection points in the grid are called nodes. Blue points are transformed data points.
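As a rough illustration of the standard grid structure, the sketch below scales each feature linearly onto [1, l + 1], after which grid nodes sit at integer coordinates. The exact form of the paper's Equation (3) is not reproduced verbatim in this extraction, so the linear map here is an assumption:

```python
import numpy as np

def scale_to_grid(X, l):
    """Scale each feature linearly onto [1, l + 1] (an assumed form of Eq. (3)).
    Assumes every feature is non-constant (max > min)."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 1 + l * (X - lo) / (hi - lo)

X = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
Z = scale_to_grid(X, l=4)
# After scaling, grid nodes occupy the integer coordinates 1, 2, ..., l + 1,
# and every scaled point lies inside the grid.
```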

Density Estimation
In density-based clustering, the densities of data points are computed. The traditional grid-based clustering calculates the densities of cells. In the proposed algorithm, we focus on the nodes' density.
At the t-th iteration, we define V^(t) as the set of active nodes and X^(t) as the set of active points. To estimate the densities of nodes, we calculate the similarity between nodes and data points. In the j-th dimension, a local scaling function of the node v ∈ V^(t) and the data point x_i ∈ X^(t) is given by

$$f_j(v, x_i) = \max\big(0,\; 1 - |v_j - x_{ij}|\big),$$

where v_j is the coordinate of the grid node v ∈ V^(t) in the j-th dimension and x_ij is the (transformed) coordinate of the data point x_i ∈ X^(t) in the j-th dimension. Using f, the node's density value at the t-th iteration is given by

$$\rho^{(t)}(v) = \sum_{x_i \in X^{(t)}} \prod_{j=1}^{m} f_j(v, x_i).$$

It is worth noting that each data point affects only the densities of the vertices (nodes) of the cell in which it is located. There may be a large number of nodes with a density value of 0 in the standard grid. To improve computational efficiency, we use a sparse tensor to store the densities of the nonzero nodes.
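One plausible reading of the node-density estimate, in which each scaled point distributes a product of per-dimension linear weights to the 2^m corner nodes of its cell, can be sketched as follows. The exact kernel is our assumption, since the paper's equations are not reproduced verbatim in this extraction; the sparse dictionary stands in for the sparse tensor mentioned in the text:

```python
import itertools
import numpy as np

def node_densities(Z):
    """Accumulate node densities sparsely: each scaled point contributes
    prod_j max(0, 1 - |v_j - x_j|) to the corner nodes of its cell
    (an assumed reconstruction of the density estimate)."""
    n, m = Z.shape
    dens = {}  # sparse storage: integer node coordinates -> density
    for x in Z:
        base = np.floor(x).astype(int)
        for corner in itertools.product((0, 1), repeat=m):
            v = base + np.array(corner)
            # Product of per-dimension "tent" weights; zero beyond the cell.
            w = np.prod(np.maximum(0.0, 1.0 - np.abs(v - x)))
            if w > 0:
                dens[tuple(v)] = dens.get(tuple(v), 0.0) + w
    return dens

Z = np.array([[1.5, 1.5]])       # one point in the cell with corners (1,1)..(2,2)
dens = node_densities(Z)
```

With this kernel each point's total contribution sums to 1 across its cell's corners, so only nodes of occupied cells ever receive nonzero density, matching the sparsity observation in the text.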

Boundary Detection
Inspired by border-peeling clustering [29], we use an iterative boundary detection strategy to divide the core and boundary nodes. The GCBD algorithm will classify a portion of the nodes of V (t) as boundary nodes and assign an inactive status to them during every boundary detection iteration.
The inactive nodes of V^(t) are defined using a specific cut-off value c^(t), as follows:

$$V_U^{(t)} = \{ v \in V^{(t)} : \rho^{(t)}(v) \le c^{(t)} \}.$$

As in [16], a percentile is used to indirectly provide the series of cut-off values: the inactive nodes are those whose density values fall below the 10th percentile of the active nodes' densities.
The inactive data points of X^(t) are given by

$$X_U^{(t)} = \{ x_i \in X^{(t)} : \min_{v \in V_U^{(t)}} \mathrm{dist}_C(x_i, v) < 1 \},$$

where dist_C is the Chebyshev distance. At the next iteration, the active nodes are given by

$$V^{(t+1)} = V^{(t)} \setminus V_U^{(t)}.$$

Similarly, the active points at the next iteration are given by

$$X^{(t+1)} = X^{(t)} \setminus X_U^{(t)}.$$

At the end of all iterations, the set of active nodes and the set of active data points are V^(T+1) and X^(T+1), where T is the number of iterations.
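The peeling loop can be sketched as follows. This simplified version keeps the node densities fixed across iterations rather than re-estimating them on the shrinking active set, and it uses the 10th-percentile cutoff described in the text:

```python
import numpy as np

def peel_boundaries(dens, T, pct=10):
    """Iteratively deactivate the lowest-density nodes (a simplified sketch of
    the boundary-detection loop; densities are held fixed here)."""
    active = dict(dens)
    boundary = set()
    for _ in range(T):
        cutoff = np.percentile(list(active.values()), pct)
        peeled = {v for v, d in active.items() if d <= cutoff}
        if len(peeled) == len(active):   # never peel every remaining node
            break
        boundary |= peeled
        active = {v: d for v, d in active.items() if v not in peeled}
    return active, boundary   # core nodes, boundary nodes

# Ten 1-D nodes with densities 1..10; each iteration peels the sparsest layer.
dens = {(i,): float(d) for i, d in enumerate([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])}
core, boundary = peel_boundaries(dens, T=2)
```

After two iterations the two lowest-density nodes have been marked as boundary nodes and the remaining eight survive as core nodes, without any user-specified density threshold.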

Definition 2 (Core node).
A node v is called a core node if it remains active after all iterations, i.e., if v ∈ V^(T+1).

Definition 3 (Core point).
A point x is called a core point if it belongs to the set X^(T+1), i.e., if x ∈ X^(T+1).

Definition 4 (Boundary node).
A node v is called a boundary node if it is deactivated during some iteration, i.e., if v ∈ V_U^(t) for some t ∈ {1, . . . , T}.

Definition 5 (Boundary point).
A point x is called a boundary point if it is deactivated during some iteration, i.e., if x ∈ X_U^(t) for some t ∈ {1, . . . , T}.

Connection Strategy
Next, we introduce the merging and assignment steps for nodes.

Merging Step
Inspired by DBSCAN [19], we devise a merging step to classify the core nodes. We define the following notions.
Definition 6 (Reachable). Two core nodes v p and v q are reachable if dist C (v p , v q ) = 1.
Definition 7 (Connected). Two core nodes v p and v q are connected if they are directly or transitively reachable.
Definition 8 (Cluster core). A cluster core Ĉ is a non-empty maximal subset of V^(T+1) such that every pair of nodes in Ĉ is connected.
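Under Definitions 6-8, the merging step amounts to finding connected components of core nodes, where two nodes are reachable when their Chebyshev distance is 1. A sketch with a simple flood fill (the toy node set is ours):

```python
from collections import deque
import itertools

def cluster_cores(core_nodes):
    """Group core nodes into maximal connected components, where two nodes
    are reachable when their Chebyshev distance is 1 (Definitions 6-8)."""
    remaining = set(core_nodes)
    m = len(next(iter(remaining)))
    # Every nonzero offset in {-1, 0, 1}^m lies at Chebyshev distance 1.
    offsets = [o for o in itertools.product((-1, 0, 1), repeat=m) if any(o)]
    cores = []
    while remaining:
        seed = remaining.pop()
        comp, queue = {seed}, deque([seed])
        while queue:
            v = queue.popleft()
            for o in offsets:
                u = tuple(a + b for a, b in zip(v, o))
                if u in remaining:
                    remaining.remove(u)
                    comp.add(u)
                    queue.append(u)
        cores.append(comp)
    return cores

# Two separated groups: {(1,1),(1,2),(2,2)} and the diagonal pair {(5,5),(6,6)}.
nodes = {(1, 1), (1, 2), (2, 2), (5, 5), (6, 6)}
cores = cluster_cores(nodes)
```

Note that diagonal neighbors such as (5, 5) and (6, 6) are reachable, since the Chebyshev distance treats all 3^m − 1 surrounding nodes as distance 1.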

Assignment Step
Inspired by DPC [20], we devise an assignment step to assign boundary nodes to clusters. For each boundary node v_p ∈ V_U^(t), we find a node v_q ∈ V^(t+1) with the highest density among its nearest neighbors. We associate v_p with v_q and form the resulting clusters C = {C_1, C_2, · · · , C_k}.

Mapping of Points to Clusters
Each point is assigned to the same cluster as its nearest node. To improve computational efficiency, we use the "round" function to find the matching node for a point, instead of using a distance function to calculate distances between the point and all nodes.
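The rounding trick can be sketched as follows: since nodes sit at integer coordinates, rounding a scaled point's coordinates yields its nearest node in O(m) time, with no distance scan over the grid (the node labels below are illustrative):

```python
import numpy as np

def map_points(Z, node_labels):
    """Map each scaled point to the cluster of its nearest node by rounding
    its coordinates, instead of scanning all nodes."""
    out = []
    for x in Z:
        v = tuple(int(c) for c in np.round(x))
        out.append(node_labels.get(v, -1))   # -1: nearest node unlabeled
    return out

node_labels = {(1, 1): 0, (2, 2): 0, (5, 5): 1}
Z = np.array([[1.2, 0.9], [4.6, 5.3], [3.0, 3.0]])
assigned = map_points(Z, node_labels)
```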

Algorithm Description and Complexity Analysis
The proposed algorithm is summarized in Algorithm 1.

Algorithm 1: GCBD algorithm
Input: A set of points X = {x_1, · · · , x_n}; the number of intervals, l; the number of iterations, T
Output: The clustering result C = {C_1, C_2, · · · , C_k}
1 Create a standard grid structure based on Definition 1.
2 Calculate the initial node density for each node according to Equation (5).
3 Categorize nodes as core nodes or boundary nodes using the iterative boundary detection strategy.
4 Merge core nodes based on Definition 8.
5 Assign boundary nodes to cluster cores.
6 Map points to clusters via their nearest nodes.

We analyze the time complexity of GCBD. Assume that the number of points, the number of dimensions, and the number of intervals in each dimension are denoted by n, m, and l, respectively. The number of all nodes is g, and the number of nonzero (sparse) nodes is g′. The number of iterations is T. A standard grid structure can be constructed in O(n) time. The time complexity of calculating the initial node density is O(gn). In the worst case, the time complexity of the third step is O(T(g′^2 + g′n)) = O(g′^2 + g′n), where T is a small constant. The merging step and the assignment step can be completed in O(g′ log g′) time. The time complexity of the last step is O(n). The time complexity of GCBD is therefore approximately O(g′n) if g′ ≪ n, and O(g′^2) otherwise. We also give the time complexities of some existing algorithms. The time complexity of DGB [27] is O(g^2), where g is the number of nodes. The time complexity of WaveCluster [24] is O(c) ≈ O(g), where c is the number of cells. The time complexity of CLIQUE [25] is O(c^2) ≈ O(g^2). The time complexity of BANG [22] is O(n log n), where n is the number of points. The time complexity of GRIDBSCAN [28] is O(c log c). The time complexity of OptiGrid [26] is O(n log n).

Experiments
In the following, the performance of the proposed clustering is studied in comparison to several grid-based clustering algorithms on various datasets.

Experiment Setup
We compare the performance of GCBD with 6 grid-based clustering algorithms: DGB [27], WaveCluster [24], CLIQUE [25], BANG [22], GRIDBSCAN [28], and OptiGrid [26]. We search for the best clustering result under the following parameter settings. GCBD, DGB, WaveCluster, CLIQUE, and OptiGrid all require a parameter l describing the number of intervals; its range is [5, 50]. In GCBD, the parameter T indicates the number of iterations; its value lies in the range [2, 12]. DGB and WaveCluster have one parameter ε; we set ε ranging from 0.01 to 0.1 with step 0.01. In WaveCluster and BANG, the parameter h indicates the number of levels; its value falls in [1, 5]. CLIQUE, BANG, and GRIDBSCAN all require a density threshold c; its range is [0, 5]. GRIDBSCAN has one parameter p; we set p ranging from 0.01 to 0.1 with step 0.01.

Results on Synthetic Datasets
In Figures 3-14, we present Mickey, Gu, Jain, ThreeD, DiffD, Moons, Shape3, Handl, Yinyang, T4, T7, and SF as examples to illustrate the superiority of our algorithm. Each cluster is indicated by solid dots of a different color, and hollow black dots indicate noisy data.
In Figures 3 and 4, the first two rows correspond to the clustering results of Mickey and Gu, two unbalanced datasets. The Mickey dataset is perfectly clustered by DGB, CLIQUE, BANG, and our algorithm. Compared to the Mickey dataset, the two spherical clusters of the Gu dataset are closer together, which makes clustering more challenging. Our algorithm is the only one that clusters this dataset perfectly. Figures 5-7 show clustering results for three datasets with different densities (Jain, ThreeD, and DiffD). A perfect clustering of the Jain dataset is achieved by DGB and our algorithm. On the ThreeD dataset, the results of GCBD and DGB are nearly perfect. On the DiffD dataset, only our algorithm finds the correct number of clusters; some algorithms (DGB and WaveCluster) incorrectly classify low-density clusters as noise, and others (CLIQUE and BANG) merge two adjacent high-density clusters into one class. Figures 8-10 show clustering results for three datasets that have clusters with overlapping regions (Moons, Shape3, and Handl). On these three datasets, only our algorithm discovers the overall structure of the clusters; almost all comparison algorithms produce false merges between adjacent clusters. Figures 11-14 correspond to the clustering results of four datasets with various shapes (Yinyang, T4, T7, and SF). The Yinyang dataset is perfectly clustered by DGB and our algorithm. On T4, T7, and SF, our algorithm outperforms the others.
Table 2 shows the quantitative results on these datasets. On all datasets, GCBD outperforms the other algorithms in terms of all metrics. The experimental results show that our algorithm can identify clusters with different sizes, varying densities, overlapping regions, and arbitrary shapes.

Results on Real-World Datasets
We evaluate the clustering quality of all algorithms on 6 real-world datasets. The datasets are preprocessed the same way as in [33]. Table 3 illustrates that GCBD outperforms all comparison algorithms on most real-world datasets. For the high-dimensional dataset ORL, our algorithm improves F1 by an average of 0.30 compared to the others. Compared with DGB, WaveCluster, CLIQUE, BANG, GRIDBSCAN, and OptiGrid, GCBD has significantly better AMI and FMI. On the Dermatology dataset, our algorithm improves AMI and FMI by an average of 0.23 and 0.30 compared to the others. GCBD improves F1 by an average of 0.51 over WaveCluster, GRIDBSCAN, and OptiGrid. Furthermore, its F1 is far superior to DGB, CLIQUE, and BANG. On the Control dataset, GCBD has better AMI, FMI, and F1 than the other algorithms; in particular, the improvement in F1 is more significant. On the Dig dataset, GCBD outperforms all comparison algorithms. On the Optdigits dataset, our algorithm has better AMI, FMI, and F1 than the others; specifically, the F1 score of GCBD improves by an average of 0.36 over the other algorithms. On the Satimage dataset, GCBD still achieves better clustering results: it improves AMI, FMI, and F1 by an average of 0.25, 0.30, and 0.32 compared to the other algorithms. In summary, our algorithm improves AMI, FMI, and F1 by an average of 0.23, 0.27, and 0.29 over the others.

Running Time
This subsection compares the running times of GCBD and 6 competitors (DGB, WaveCluster, CLIQUE, BANG, GRIDBSCAN, and OptiGrid) on synthetic datasets of different sizes (n = 1000 to 10,000 in steps of 1000). For a fair comparison, the number of intervals (a parameter common to GCBD, DGB, WaveCluster, CLIQUE, and OptiGrid) is kept at 20. Running times are reported as the average and standard deviation over 50 repeated experiments. All experiments are performed in the Matlab environment on a PC with an Intel(R) Core(TM) i7-9700F CPU and 32 GB RAM. Figure 15 shows the average and standard deviation of the running times for GCBD, DGB, WaveCluster, CLIQUE, BANG, GRIDBSCAN, and OptiGrid. Note that the y-axis is plotted on a base-10 log scale. We see that BANG is significantly slower than the other algorithms (GCBD, DGB, WaveCluster, CLIQUE, GRIDBSCAN, and OptiGrid). GCBD is the second-fastest in more than half of the cases. The scalability comparison shows that our algorithm is competitive.

Conclusions
This paper presents a novel grid-based clustering algorithm for clusters with different sizes, varying densities, overlapping regions, and arbitrary shapes. Specifically, we define a density estimation of nodes based on a standard grid structure. We use an iterative boundary detection strategy to distinguish core nodes from boundary nodes. Therefore, the density threshold does not need to be specified by the user. In addition, the iterative density estimation and boundary detection can discover the boundary regions between adjacent clusters well, which facilitates the processing of clusters with varying densities and overlapping regions. Finally, the adopted connectivity strategy is beneficial for identifying clusters of arbitrary shapes.
This algorithm is mainly applicable to low-dimensional data; in the case of very high-dimensional data, the number of grid cells grows rapidly with dimensionality. To embed the data into a suitable dimension, dimension reduction techniques such as Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP) may be adopted. To better cluster high-dimensional data, our future work is to introduce the idea of subspaces to alleviate the "curse of dimensionality".

Conflicts of Interest:
The authors declare no conflict of interest.