Article

Generalizing Local Density for Density-Based Clustering

Jun-Lin Lin 1,2
1 Department of Information Management, Yuan Ze University, Taoyuan 32003, Taiwan
2 Innovation Center for Big Data and Digital Convergence, Yuan Ze University, Taoyuan 32003, Taiwan
Symmetry 2021, 13(2), 185; https://doi.org/10.3390/sym13020185
Submission received: 4 January 2021 / Revised: 16 January 2021 / Accepted: 19 January 2021 / Published: 24 January 2021

Abstract

Discovering densely-populated regions in a dataset of data points is an essential task for density-based clustering. To do so, it is often necessary to calculate each data point’s local density in the dataset. Various definitions for the local density have been proposed in the literature. These definitions can be divided into two categories: Radius-based and k Nearest Neighbors-based. In this study, we find the commonality between these two types of definitions and propose a canonical form for the local density. With the canonical form, the pros and cons of the existing definitions can be better explored, and new definitions for the local density can be derived and investigated.

1. Introduction

Density-based clustering is the task of detecting densely-populated regions (called clusters) separated by sparsely-populated or empty regions in a dataset of data points. It is an unsupervised process that can discover clusters of arbitrary shapes [1]. Many density-based clustering algorithms have been proposed in the literature [2,3,4,5,6,7,8,9], but most of them adopt their own definitions of local density. Since clusters are derived based on each data point's local density, using an inappropriate definition of local density could yield poor clustering results. Thus, it is crucial to define local density properly for density-based clustering.
This study divides the definitions for local density in the literature into two categories: Radius-based and k Nearest Neighbors-based (or kNN-based for short). Radius-based local density uses a radius to specify the neighborhood of a data point, and the data points within a data point’s neighborhood mainly determine the local density of the data point. In contrast, kNN-based local density uses the k nearest neighbors or the reverse k nearest neighbors of a data point to derive its local density.
In this study, we propose a canonical form for local density. All previous definitions of local density can be viewed as special cases of this canonical form, which decomposes the definition of local density into three parts: the contribution set, the contribution function, and the integration operator. The contribution set of a data point specifies the set of data points that contribute to the data point's local density. The contribution function calculates the contribution of one data point to the local density of another. The integration operator combines the contributions of the data points in the contribution set to yield the local density.
The advantage of using this canonical form is twofold. First, it allows us to make explicit the implicit differences between definitions of local density. For example, in Section 2.2, we show that the kNN-based local densities defined in [6,7] implicitly use a radius equal to 1 and √k, respectively. Second, this canonical form facilitates exploring the pros and cons of the existing definitions of local density. We can then combine these definitions' merits to derive definitions of local density suitable for the problem at hand.
The rest of this paper is organized as follows. Section 2 reviews the existing definitions for local density. Section 3 proposes the canonical form for local density and shows how these definitions fit the canonical form. Section 4 describes how to derive new definitions for local density using this canonical form. Section 5 conducts an experiment to show how the three parts (i.e., contribution set, contribution function, and integration operator) of the canonical form affect local density distribution. Section 6 concludes this paper.

2. Review of Local Density

Most density-based clustering algorithms require calculating each data point’s local density to derive clusters in the dataset. However, there is no standard definition for a data point’s local density. Many definitions for local density have been proposed in the literature. Based on the parameters used in the definitions, we can divide the existing definitions into two categories. A radius-based definition uses a parameter ϵ for the radius of a data point’s neighborhood, and a kNN-based definition uses a parameter k to limit the scope of the data points involved to the k nearest neighbors. In this section, we review these two types of definitions. For ease of exposition, some notations are defined in Table 1.

2.1. Radius-Based Local Density

As described earlier, a radius-based local density uses the parameter ϵ to specify the radius of a data point's neighborhood. Consider a dataset X = {x_1, x_2, …, x_n} of n data points and the local density ρ(x_i) of a data point x_i ∈ X. A radius-based local density ensures that the data points within x_i's neighborhood contribute substantially to ρ(x_i) and that the data points outside x_i's neighborhood contribute little or nothing to ρ(x_i). In what follows, we describe two definitions of the radius-based local density from the literature.
In [4], the local density of a data point is defined as the number of data points within the data point’s neighborhood, which is given as follows:
\rho(x_i) = \sum_{x_j \in X} \chi\!\left(\frac{d(x_i, x_j)}{\epsilon}\right) \quad (1)
where
\chi(d) = \begin{cases} 1 & \text{if } d < 1 \\ 0 & \text{otherwise} \end{cases} \quad (2)
and d(x_i, x_j) is the distance between data points x_i and x_j. Thus, each data point x_j ∈ X with d(x_i, x_j) < ϵ contributes 1 to ρ(x_i). In [2], the constraint d(x_i, x_j) ≤ ϵ is adopted instead of d(x_i, x_j) < ϵ, i.e., each data point x_j ∈ X with d(x_i, x_j) ≤ ϵ contributes 1 to ρ(x_i). However, this change should not make a significant difference in ρ(x_i).
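To make the count-based definition concrete, here is a minimal sketch in Python (our own function name; brute-force O(n²) distances, assuming the dataset fits in a NumPy array):

```python
import numpy as np

def radius_count_density(X, eps):
    """Local density per Equation (1): the number of data points x_j with
    d(x_i, x_j) < eps. Note the sum runs over all x_j in X, so each point
    also counts itself (a constant offset of 1)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distances
    return (dist < eps).sum(axis=1)            # chi(d/eps): 1 iff d/eps < 1
```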
Instead of using the radius ϵ as a hard threshold as in Equation (1), Ref. [4] also proposed a local density definition that uses an exponential kernel, as shown in Equation (3).
\rho(x_i) = \sum_{x_j \in X} e^{-\left(\frac{d(x_i, x_j)}{\epsilon}\right)^2} \quad (3)
With Equation (3), each data point x_j ∈ X contributes e^{-(d(x_i,x_j)/ϵ)^2} to ρ(x_i). Notably, e^{-(d(x_i,x_j)/ϵ)^2} is an inverse S-shaped function of d(x_i,x_j)/ϵ with an inflection point at d(x_i,x_j)/ϵ = 1/√2. That is, the value of e^{-(d(x_i,x_j)/ϵ)^2} decreases at an increasing speed as d(x_i,x_j)/ϵ approaches 1/√2 from 0, and then at a decreasing speed once d(x_i,x_j)/ϵ exceeds 1/√2. Thus, to be exact, Equation (3) uses a soft threshold at d(x_i,x_j) = ϵ/√2 rather than at d(x_i,x_j) = ϵ. Figure 1 shows the curves of e^{-(d(x_i,x_j)/ϵ)^2} and its first and second derivatives with respect to d(x_i,x_j)/ϵ. The three black dots indicate that the inflection point occurs where the first derivative reaches its minimum and the second derivative crosses zero.
The proper value of ϵ is dataset-dependent. Thus, instead of setting the value of ϵ directly, Ref. [4] used another parameter, p, to derive ϵ. Specifically, ϵ is set to the distance at the top p% of all pairs' distances in X, and 1 ≤ p ≤ 2 is recommended. Alternatively, Ref. [5] used the parameter k to determine the value of ϵ.
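A sketch of this heuristic for deriving ϵ from p (the exact rounding of the cut position is our assumption, not specified in [4]):

```python
import numpy as np

def eps_from_pair_percentile(X, p=2.0):
    """Set eps to the p-th percentile (counted from the smallest) of the
    n(n-1)/2 pairwise distances, following the heuristic in [4]."""
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    pair_dists = np.sort(dist[np.triu_indices(n, k=1)])  # each pair once
    pos = max(int(round(len(pair_dists) * p / 100.0)) - 1, 0)
    return pair_dists[pos]
```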

2.2. kNN-Based Local Density

Although the radius-based local density is intuitive and straightforward, using the same radius for all data points may be inappropriate for some datasets. The kNN-based local density adopts a different approach, restricting the data points that contribute to the local density to the k nearest neighbors. In what follows, we describe six definitions of the kNN-based local density from the literature.
In [6], a data point’s local density is defined using an exponential kernel and the distances to k nearest neighbors, as shown in Equation (4).
\rho(x_i) = \sum_{x_j \in N_k(x_i)} e^{-d(x_i, x_j)} \quad (4)
where N_k(x_i) denotes the set of k nearest neighbors of x_i. Notably, e^{-d(x_i,x_j)} is a monotonically decreasing function of d(x_i,x_j), and its derivative with respect to d(x_i,x_j) is -e^{-d(x_i,x_j)}, a monotonically increasing function of d(x_i,x_j). As d(x_i,x_j) increases from 0, the value of e^{-d(x_i,x_j)} drops at an exponentially decreasing speed. This property may have significantly different effects on different datasets. For example, if the maximum distance between any x_i ∈ X and x_j ∈ N_k(x_i) is small, then a fixed change in d(x_i,x_j) causes a large change in e^{-d(x_i,x_j)}. In contrast, if the minimum distance between any x_i ∈ X and x_j ∈ N_k(x_i) is large, then the same change in d(x_i,x_j) causes only a small change in e^{-d(x_i,x_j)}. This inconsistent behavior arises because Equation (4) is not unit-less. Alternatively, the function e^{-d(x_i,x_j)} can be interpreted as a unit-less function e^{-d(x_i,x_j)/ϵ} with a fixed radius ϵ = 1 for any dataset.
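A minimal sketch of Equation (4) (our helper name; brute-force neighbor search):

```python
import numpy as np

def knn_exp_density(X, k):
    """Local density per Equation (4): sum of exp(-d(x_i, x_j)) over the
    k nearest neighbors N_k(x_i). The kernel is not unit-less: rescaling
    the data rescales all densities nonlinearly."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)            # exclude x_i itself
    knn_dist = np.sort(dist, axis=1)[:, :k]   # k smallest distances per row
    return np.exp(-knn_dist).sum(axis=1)
```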
Reference [7] used the mean of x_i's squared distances to its k nearest neighbors to derive ρ(x_i), as shown in Equation (5):
\rho(x_i) = e^{-\frac{1}{k}\sum_{x_j \in N_k(x_i)} d(x_i, x_j)^2} \quad (5)
Similar to Equation (4), ρ(x_i) in Equation (5) is a monotonically decreasing function of (1/k) Σ_{x_j ∈ N_k(x_i)} d(x_i,x_j)^2 and is not unit-less. We can rewrite Equation (5) to move the summation out of the exponent as follows.
\rho(x_i) = e^{-\sum_{x_j \in N_k(x_i)} \left(\frac{d(x_i, x_j)}{\sqrt{k}}\right)^2} = \prod_{x_j \in N_k(x_i)} e^{-\left(\frac{d(x_i, x_j)}{\sqrt{k}}\right)^2} \quad (6)
Similar to e^{-(d(x_i,x_j)/ϵ)^2} in Equation (3), e^{-(d(x_i,x_j)/√k)^2} in Equation (6) is an inverse S-shaped function of d(x_i,x_j)/√k with an inflection point at d(x_i,x_j)/√k = 1/√2. The function e^{-(d(x_i,x_j)/√k)^2} can also be interpreted as a unit-less function e^{-(d(x_i,x_j)/ϵ)^2} with a fixed radius ϵ = √k for any dataset. That is, Equation (6) uses the parameter k to implicitly derive the radius ϵ, which controls the position of the inflection point of the inverse S-shaped function e^{-(d(x_i,x_j)/√k)^2}.
Reference [5] proposed a kNN-based unit-less definition for ρ ( x i ) , which is similar to Equation (3) but limits the data points contributing to ρ ( x i ) only to N k ( x i ) , as shown in Equation (7).
\rho(x_i) = \sum_{x_j \in N_k(x_i)} e^{-\left(\frac{d(x_i, x_j)}{\epsilon}\right)^2} \quad (7)
Reference [5] also used the parameter k to determine the value of ϵ as follows:
\epsilon = \mu_k + \sqrt{\frac{1}{|X| - 1} \sum_{x_i \in X} \left(\delta_i^k - \mu_k\right)^2} \quad (8)

\mu_k = \frac{1}{|X|} \sum_{x_i \in X} \delta_i^k \quad (9)
where δ_i^k is the distance between x_i and its k-th nearest neighbor, and μ_k is the mean of δ_i^k over all data points in X. Equation (8) derives ϵ as μ_k plus one standard deviation of δ_i^k, and thus a larger k yields a larger ϵ.
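A sketch of Equations (8) and (9) (we use the sample standard deviation, matching the 1/(|X|−1) factor in Equation (8)):

```python
import numpy as np

def eps_from_kth_nn(X, k):
    """Set eps to mu_k plus one standard deviation of the k-th nearest
    neighbor distances delta_i^k (Equations (8) and (9), following [5])."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    delta_k = np.sort(dist, axis=1)[:, k - 1]   # k-th NN distance per point
    return delta_k.mean() + delta_k.std(ddof=1)
```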
Reference [8] used the distance between x i and the mean of its k nearest neighbors to derive ρ ( x i ) , as follows:
\rho(x_i) = e^{-d(x_i, \bar{x}_i)^2} \quad (10)

\bar{x}_i = \frac{1}{k} \sum_{x_j \in N_k(x_i)} x_j \quad (11)
This definition can yield counterintuitive results because taking the mean of the k nearest neighbors discards their distribution. For example, consider two nearest neighbors y_i^1 and y_i^2 of x_i located at opposite sides of x_i with d(x_i, y_i^1) = d(x_i, y_i^2). Then x̄_i coincides with x_i, so ρ(x_i) = e^0 = 1 regardless of the values of d(x_i, y_i^1) and d(x_i, y_i^2), which contradicts the intuition that larger d(x_i, y_i^1) and d(x_i, y_i^2) should result in a smaller ρ(x_i).
Reference [8] also proposed using the number of reverse k nearest neighbors as the local density, as follows:
\rho(x_i) = |R_k(x_i)| \quad (12)
where R_k(x_i) = {x_j ∈ X | x_i ∈ N_k(x_j)} is the set of reverse k nearest neighbors of x_i. This definition can yield ρ(x_i) = 0 for a data point x_i even though x_i lies in a densely-populated region. Thus, it should be used with caution.
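A sketch of Equation (12) (our helper name; each point receives one vote per appearance in another point's k-nearest-neighbor list):

```python
import numpy as np

def reverse_knn_density(X, k):
    """Local density per Equation (12): |R_k(x_i)|, the number of reverse
    k nearest neighbors. Can legitimately return 0, hence the caution
    noted in the text."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    knn_idx = np.argsort(dist, axis=1)[:, :k]   # N_k(x_j) for every x_j
    rho = np.zeros(len(X), dtype=int)
    np.add.at(rho, knn_idx.ravel(), 1)          # one vote per appearance
    return rho
```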
To avoid the bias of the k nearest neighbors, Ref. [10] proposed using mutual k nearest neighbors to define local density, as follows:
SNN(x_i, x_j) = \left(N_k(x_i) \cup \{x_i\}\right) \cap \left(N_k(x_j) \cup \{x_j\}\right) \quad (13)

Sim(x_i, x_j) = \begin{cases} \dfrac{|SNN(x_i, x_j)|^2}{\sum_{x_p \in SNN(x_i, x_j)} \left(d(x_i, x_p) + d(x_j, x_p)\right)} & \text{if } x_i, x_j \in SNN(x_i, x_j) \\ 0 & \text{otherwise} \end{cases} \quad (14)

\rho(x_i) = \sum_{x_j \in L(x_i)} Sim(x_i, x_j) \quad (15)
where SNN(x_i, x_j) is the set of mutual (shared) k nearest neighbors of x_i and x_j; Sim(x_i, x_j) is the similarity between x_i and x_j; and L(x_i) is the set of k data points chosen from X \ {x_i} with the largest Sim(x_i, x_j).
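A direct, unoptimized sketch of Equations (13)–(15) (O(n²) loops; the guard against a zero denominator is our addition):

```python
import numpy as np

def snn_density(X, k):
    """Shared-nearest-neighbor density of Equations (13)-(15) [10]."""
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    d = dist.copy()
    np.fill_diagonal(d, np.inf)
    # Closed neighborhoods N_k(x_i) U {x_i}, stored as sets of indices.
    nbr = [set(np.argsort(d[i])[:k].tolist()) | {i} for i in range(n)]
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            snn = nbr[i] & nbr[j]              # SNN(x_i, x_j), Eq. (13)
            if i in snn and j in snn:          # mutual-kNN condition of Eq. (14)
                denom = sum(dist[i, p] + dist[j, p] for p in snn)
                if denom > 0:
                    sim[i, j] = sim[j, i] = len(snn) ** 2 / denom
    # Eq. (15): sum the k largest similarities to other points.
    return np.sort(sim, axis=1)[:, -k:].sum(axis=1)
```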

3. Canonical Form for Local Density

In this section, we first propose the canonical form for local density. Then, we show how the existing definitions for local density fit the canonical form.

3.1. Canonical Form

Based on the review in Section 2, this section proposes a canonical form for local density. Consider a dataset X and a data point x_i ∈ X. The canonical form for the local density ρ(x_i) comprises three parts: the contribution set C_i, the contribution function c(x_i, x_j), and the integration operator. The contribution set C_i ⊆ X is the set of data points contributing to ρ(x_i). Three values for C_i are commonly used in the literature: N_k(x_i), X, and B_ϵ(x_i) = {x_j ∈ X | d(x_i, x_j) < ϵ}. The first, N_k(x_i), is the set of k nearest neighbors of x_i, where k is a parameter [5,6,7]. The second, X, is the entire dataset [4]. The third, B_ϵ(x_i), uses ϵ to specify the radius of a data point's neighborhood, so that only the data points within the neighborhood of x_i contribute to ρ(x_i) [2,4].
The contribution function c ( x i , x j ) calculates the contribution of a data point x j C i to the density of x i . A general form for c ( x i , x j ) is proposed as follows:
c(x_i, x_j) = e^{-\left(\frac{d(x_i, x_j)}{\epsilon}\right)^m} \quad (16)
where ϵ is the radius of a data point's neighborhood. In the literature, the value of the exponent m is 1, 2, or ∞. In practice, any m ≥ 1 can be used to achieve a different effect, as discussed further in Section 4.
The integration operator integrates the contributions of the data points in C i to yield ρ ( x i ) . In the literature, either the summation Σ or the product Π operator is used. Thus, the canonical form for local density can be defined using Equation (17) or Equation (18), as follows:
\rho(x_i) = \sum_{x_j \in C_i} c(x_i, x_j) \quad (17)

\rho(x_i) = \prod_{x_j \in C_i} c(x_i, x_j) \quad (18)
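To make the decomposition concrete, the following sketch implements the canonical form with pluggable parts (the function names and calling convention are our own choices, not the paper's):

```python
import numpy as np

def pairwise_dist(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def knn_set(k):
    """Contribution set C_i = N_k(x_i)."""
    def C(i, dist):
        d = dist[i].copy()
        d[i] = np.inf
        return np.argsort(d)[:k]
    return C

def ball_set(eps):
    """Contribution set C_i = B_eps(x_i) = {x_j : d(x_i, x_j) < eps}."""
    def C(i, dist):
        mask = dist[i] < eps
        mask[i] = False
        return np.flatnonzero(mask)
    return C

def canonical_density(X, contrib_set, m=2.0, eps=1.0, op="sum"):
    """rho(x_i) per Equations (16)-(18): combine the contributions
    c(x_i, x_j) = exp(-(d/eps)^m) over C_i with Sigma or Pi."""
    dist = pairwise_dist(X)
    combine = np.sum if op == "sum" else np.prod
    rho = np.empty(len(X))
    for i in range(len(X)):
        idx = contrib_set(i, dist)                          # C_i
        rho[i] = combine(np.exp(-(dist[i, idx] / eps) ** m))
    return rho
```

For example, with k = 10 and ϵ derived as in Equation (8), canonical_density(X, knn_set(10), m=2, eps=eps, op="sum") reproduces Equation (7).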

3.2. Fitting the Existing Definitions to the Canonical Form

Based on the canonical form defined in Section 3.1, we can derive most of the definitions for local density reviewed in Section 2, and Table 2 summarizes the results. We have excluded the definition in Equation (10) because it tends to conflict with the basic property of local density, as described in Section 2.
Notably, we have transformed Equation (1) to Equation (19) below such that it can match the canonical form in Equation (17):
\rho(x_i) = \sum_{x_j \in X} e^{-\left(\frac{d(x_i, x_j)}{\epsilon}\right)^\infty} \quad (19)
Here, e^{-(d(x_i,x_j)/ϵ)^∞} = 1 if 0 ≤ d(x_i,x_j)/ϵ < 1, and e^{-(d(x_i,x_j)/ϵ)^∞} = 0 if d(x_i,x_j)/ϵ > 1. Thus, Equations (1) and (19) yield exactly the same results except at d(x_i,x_j)/ϵ = 1, where Equation (1) has c(x_i,x_j) = 0 but Equation (19) has c(x_i,x_j) = e^{-1}.
Similarly, we have transformed Equation (12) to Equation (20) below such that it can match the canonical form in Equation (17).
\rho(x_i) = \sum_{x_j \in R_k(x_i)} 1 \quad (20)
Additionally, Equation (15) is rewritten as Equation (21) to avoid using L ( x i ) .
\rho(x_i) = \sum_{x_j \in N_k(x_i) \cap R_k(x_i)} Sim(x_i, x_j) \quad (21)
Notably, by Equation (14), Sim(x_i, x_j) ≠ 0 only if x_i, x_j ∈ SNN(x_i, x_j), and by Equation (13), SNN(x_i, x_j) \ {x_i} ⊆ N_k(x_i) contains at most k data points. Thus, we can replace L(x_i) in Equation (15) by N_k(x_i) ∩ R_k(x_i), or simply N_k(x_i), to speed up the computation.
By fitting the existing definitions to the canonical form, we can see that most of them use a radius ϵ, explicitly or implicitly. With Table 2, we can better explore the pros and cons of these definitions. For example, Equation (4) uses a fixed radius ϵ = 1, and Equation (6) uses the radius ϵ = √k, which depends only on the parameter k. Neither considers the distribution of the data points in the dataset when determining ϵ. Consequently, the chosen value of ϵ may not adapt well to different datasets. In contrast, Equations (3), (7), and (19) not only use a parameter (p or k) but also consider the distribution of the data points to decide a proper value for ϵ.

4. Derive New Definitions Using the Canonical Form

As described in Section 3.1, there are three parts in the canonical form for local density. We can combine possible values for the three parts from the existing definitions to form new definitions for local density. However, some combinations may generate undesirable results, e.g., replacing the contribution set N k in Equation (6) with X . Thus, it is crucial to understand how the possible values for the three parts affect the results.
First, consider the integration operator in the canonical form. As shown in the second column of Table 2, most of the existing definitions of local density use the summation operator Σ. We can replace the summation operator Σ with the product operator Π (or vice versa) to yield new definitions of local density. The operators Π and Σ affect the local density differently. For example, if the value of Σ_{x_j∈C_i} c(x_i,x_j) is fixed, then the more evenly distributed the values of c(x_i,x_j) for all x_j ∈ C_i, the larger the value of Π_{x_j∈C_i} c(x_i,x_j). Conversely, if the value of Π_{x_j∈C_i} c(x_i,x_j) is fixed, then the more unevenly distributed the values of c(x_i,x_j) for all x_j ∈ C_i, the larger the value of Σ_{x_j∈C_i} c(x_i,x_j). Notably, the contribution c(x_i,x_j) grows as the distance d(x_i,x_j) decreases. If we intend to give a higher local density to data points with more evenly distributed distances to their respective neighbors in C_i, then the product operator Π should be adopted. Otherwise, the summation operator Σ should be used, as in most cases.
Next, consider the contribution function c(x_i,x_j). Its general form, defined in Equation (16), contains two parameters: the exponent m and the radius ϵ. First, focus on the impact of using different values of m. We can view e^{-(d(x_i,x_j)/ϵ)^m} in Equation (16) as a function of d(x_i,x_j)/ϵ. Figure 2 shows that the value of m affects the shape of the function curve. For m > 1, e^{-(d(x_i,x_j)/ϵ)^m} is an inverse S-shaped function of d(x_i,x_j)/ϵ with an inflection point at d(x_i,x_j)/ϵ = ((m−1)/m)^{1/m}. As m approaches infinity, the inflection point approaches d(x_i,x_j)/ϵ = 1, where e^{-(d(x_i,x_j)/ϵ)^m} = e^{-1}, and the function approximates the step function in Equation (2). Notably, if m = 1, e^{-(d(x_i,x_j)/ϵ)^m} is not an inverse S-shaped function. The function curves for m = 1, 1.5, 2, 3, 4, and 50 are shown in Figure 2, where the positions of the inflection points are indicated with solid circles. To choose a suitable value of m, we can check whether the problem at hand prefers that a small increase in d(x_i,x_j) not cause too large a decrease in c(x_i,x_j) when d(x_i,x_j) < ϵ. If so, a large value of m should be adopted to move the inflection point to the right, i.e., closer to d(x_i,x_j)/ϵ = 1.
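The inflection point quoted above follows from a short calculation (our own algebra; it reduces to 1/√2 for m = 2, matching Section 2.1, and tends to 1 as m → ∞):

```latex
% Inflection point of the contribution function.
% Let u = d(x_i, x_j)/\epsilon and f(u) = e^{-u^m}. Then
f'(u)  = -m\, u^{m-1} e^{-u^m}, \qquad
f''(u) = m\, u^{m-2} e^{-u^m} \left( m\, u^{m} - (m-1) \right).
% Setting f''(u) = 0 for u > 0 yields
u^{m} = \frac{m-1}{m}
\quad\Longrightarrow\quad
u = \sqrt[m]{\frac{m-1}{m}} .
```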
Next, consider the radius ϵ of a data point's neighborhood. The value of ϵ should be dataset-dependent. For example, in [4], ϵ is set to the distance at the top p% of all pairs' distances in X, where p is a parameter. The intuition of this method is to have ⌊p(n − 1)/200⌋ data points within a data point's neighborhood on average. However, this method tends to emphasize the dense regions and overlook the sparse regions in the dataset. We denote the radius derived using this method by ϵ_p. In [5], ϵ is set to the mean plus one standard deviation of all data points' distances to their respective k-th nearest neighbors (see Equation (8)). This method is sensitive to the outliers in the dataset and to the value of k. We denote the radius derived using this method by ϵ_k.
To avoid the shortcomings of the above two methods, we integrate both and propose a new method, shown in Algorithm 1. The new method requires two parameters, k and P. First, it collects the distance of each data point to its k-th nearest neighbor. Then, it sorts these distances in ascending order and sets ϵ to the value at the P-th percentile location, i.e., the ⌈P × n/100⌉-th distance, where n is the number of data points in the dataset. This new method considers each data point's k-th nearest neighbor instead of the top p% of all pairs' distances. Thus, it is less likely to overlook the sparse regions in the dataset. Furthermore, because the new method does not use the mean and standard deviation, it is less sensitive to outliers than the second method. We denote the radius derived using this method by ϵ_kP.
Algorithm 1: The proposed method to derive ϵ.
Input: the set of data points X ∈ ℝ^{n×m}, k, and P
Output: the radius ϵ
1. Set S = {δ_i^k | x_i ∈ X}, where δ_i^k is the distance between x_i and its k-th nearest neighbor.
2. Sort the elements in S in ascending order.
3. Set s = ⌈P × n / 100⌉.
4. Set ϵ to the s-th element in S.
5. Return ϵ.
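A sketch of Algorithm 1 in Python (the ceiling in step 3 follows the ⌈P × n/100⌉ reading above; brute-force distances):

```python
import numpy as np

def derive_eps(X, k, P):
    """Algorithm 1: eps is the P-th percentile of the distances between
    all data points and their k-th nearest neighbors."""
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    delta_k = np.sort(dist, axis=1)[:, k - 1]  # step 1: k-th NN distances
    delta_k.sort()                             # step 2: ascending order
    s = int(np.ceil(P * n / 100.0))            # step 3: s = ceil(P*n/100)
    return delta_k[s - 1]                      # step 4: the s-th element
```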
Finally, consider the contribution set C_i. As described in Section 3.1, N_k(x_i), X, and B_ϵ(x_i) are three commonly used values for C_i. Setting C_i = X allows every data point to contribute to ρ(x_i). It should be used only when the adopted c(x_i,x_j) is near zero for any data point x_j far from x_i (e.g., Equation (16) with a large m). For a data point x_i in a dense region, its k nearest neighbors are likely to be located within its neighborhood, i.e., N_k(x_i) ⊆ B_ϵ(x_i). However, for x_i in a sparse region, B_ϵ(x_i) ⊆ N_k(x_i) usually holds.
Using the product operator Π with C_i = X (i.e., ρ(x_i) = Π_{x_j∈X} c(x_i,x_j)) is a poor combination. Most of the data points in X are far from x_i; thus, this combination multiplies many small contributions c(x_i,x_j), rendering ρ(x_i) too small to represent the local density of x_i properly. In contrast, using the summation operator Σ with C_i = X does not cause such a problem.
Using the product operator Π with C_i = B_ϵ(x_i) can also produce strange results. For example, let h be the current local density of x_i, and let y ∉ X be a new data point such that d(x_i, y) is less than the distance between x_i and x_i's nearest neighbor in X. Intuitively, adding y to X should increase the local density of x_i. However, according to Equation (16), c(x_i,x_j) is between 0 and 1 for any two data points x_i and x_j. Thus, with the addition of y to X, the local density of x_i becomes h·c(x_i,y), which is less than the original local density h. Thus, the combination of the product operator Π and C_i = B_ϵ(x_i) is also a poor definition of local density.

5. Experiment

5.1. Experiment Design

For brevity, we use a tuple with four components to describe a definition of local density: the first component indicates the integration operator, the second the contribution set, and the third and fourth the exponent m and the radius ϵ of the contribution function, respectively. For example, the row for Equation (7) in Table 2 can be represented as (Σ, N_k, 2, ϵ_k). This representation facilitates modifying an existing definition to create new ones. For example, (Π, N_k, 2, ϵ_k), (Σ, N_k, 20, ϵ_k), and (Σ, N_k, 2, ϵ_kP) are three new definitions modified from (Σ, N_k, 2, ϵ_k).
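A hypothetical helper mapping such a tuple onto the canonical-form sketch from Section 3.1 (it reuses canonical_density, knn_set, and ball_set from that sketch; all names are ours):

```python
import numpy as np

def density_from_tuple(X, op, cset, m, eps, k=10):
    """Evaluate a (operator, contribution set, m, eps) definition."""
    if cset == "Nk":
        C = knn_set(k)                       # C_i = N_k(x_i)
    elif cset == "ball":
        C = ball_set(eps)                    # C_i = B_eps(x_i)
    else:                                    # C_i = X (minus x_i itself)
        C = lambda i, dist: np.delete(np.arange(len(dist)), i)
    return canonical_density(X, C, m=m, eps=eps, op=op)

# The benchmark definition (Sigma, N_k, 2, eps_k) from [5]:
# rho = density_from_tuple(X, "sum", "Nk", 2, eps_from_kth_nn(X, 10))
```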
This experiment comprises four tests. In each test, we use the definition (Σ, N_k, 2, ϵ_k) proposed in [5] as the benchmark and vary one component of the tuple to study how that component affects the results. In Test 1, we compare the three ways (i.e., ϵ_p, ϵ_k, and ϵ_kP, described in Section 4) to derive the radius ϵ. Here, ϵ_p and ϵ_kP are derived by setting the parameters p = 2 and P = 75, respectively, and k is set from 5 to 50 in steps of 5 for both ϵ_k and ϵ_kP. Test 2 compares the three definitions (Σ, N_k, 2, ϵ_k), (Σ, X, 2, ϵ_k), and (Σ, B_ϵ(x_i), 2, ϵ_k) to study the impact of different contribution sets C_i. Test 3 compares the three definitions (Σ, N_k, 2, ϵ_k), (Σ, N_k, 4, ϵ_k), and (Σ, N_k, 8, ϵ_k) to study the impact of different values of the exponent m. Test 4 compares the two definitions (Σ, N_k, 2, ϵ_k) and (Π, N_k, 2, ϵ_k) to study the impact of different integration operators. In Tests 2 to 4, the parameter k is set to 10 to derive ϵ_k and N_k.
This experiment uses 16 well-known two-dimensional synthetic datasets. Table 3 shows the number of points and the number of clusters in these datasets.

5.2. Test 1: Comparing the Radii ϵ_p, ϵ_k, and ϵ_kP

Test 1 compares the radii ϵ_p, ϵ_k, and ϵ_kP derived by the three methods described in Section 4. Obviously, increasing p and P increases ϵ_p and ϵ_kP, respectively.
Figure 3 shows the values of ϵ_p, ϵ_k, and ϵ_kP obtained by setting p = 2, P = 75, and k = 5 to 50 in steps of 5. The larger the value of k, the larger the values of ϵ_k and ϵ_kP. In most cases, ϵ_k > ϵ_kP. For smaller datasets, ϵ_p tends to be smaller than ϵ_k and ϵ_kP. It appears that the size of the dataset influences ϵ_p, ϵ_k, and ϵ_kP differently. Let n denote the size of the dataset X. The number of possible pairs of data points in X is n(n−1)/2. Since ϵ_p is set to the (n(n−1)/2 × p/100)-th smallest value of all pairs' distances in X, the rank that determines ϵ_p grows quadratically with n. In contrast, ϵ_kP is set to the (n × P/100)-th smallest value of the distances between all data points and their k-th nearest neighbors, so the rank that determines ϵ_kP grows only linearly with n. Thus, the dataset size has a greater impact on ϵ_p than on ϵ_kP.
Two small datasets (Compound and Path_based) and two large datasets (D31 and A2) are selected to show the impact of the dataset size on ϵ_p, ϵ_k, and ϵ_kP. Three definitions, (Σ, N_k, 2, ϵ_k), (Σ, N_k, 2, ϵ_p), and (Σ, N_k, 2, ϵ_kP), are used to calculate each data point's local density, where the values of ϵ_p, ϵ_k, and ϵ_kP (shown in Table 4) are derived by setting p = 2, P = 75, and k = 10. Notably, (Σ, N_k, 2, ϵ_k) is the definition proposed in [5].
In Figure 4, the color scale legend to the right of each subfigure indicates the local density. For the two small datasets (Compound and Path_based), ϵ_p < ϵ_kP < ϵ_k, and thus using ϵ = ϵ_k or ϵ_kP results in more data points with high local density than using ϵ = ϵ_p, as shown in the upper two rows of Figure 4. In contrast, for the two large datasets (D31 and A2), ϵ_p > ϵ_k > ϵ_kP, and thus using ϵ = ϵ_p results in more data points with high local density than using ϵ = ϵ_k or ϵ_kP, as shown in the lower two rows of Figure 4.

5.3. Test 2: Impact of the Contribution Set C i on Local Density

Test 2 adopts three definitions, (Σ, N_k, 2, ϵ_k), (Σ, X, 2, ϵ_k), and (Σ, B_ϵ(x_i), 2, ϵ_k), to calculate local density and evaluates the impact of using different values for C_i. Here, k is set to 10 to derive ϵ_k and N_k. The results are shown in Figure 5, where the subfigures in the same row are the results for a dataset and the subfigures in the same column are the results using the same method to determine C_i.
In Figure 5, the color scale legend to the right of each subfigure indicates the local density. A large local density range is usually preferred because it provides more discrepancy when comparing the local density among data points. Using C_i = X yields a larger local density range than using C_i = B_ϵ(x_i) or C_i = N_k(x_i) because it combines all data points' contributions and Test 2 adopts the summation operator. Using C_i = N_k(x_i) results in a much smaller range of local density than using C_i = B_ϵ(x_i), indicating that, for a data point x_i in a densely-populated region, N_k(x_i) ⊆ B_ϵ(x_i) usually holds.
In the literature, all kNN-based methods (e.g., Equations (4), (6), and (7) in Table 2) adopt C_i = N_k(x_i) to calculate the local density. Figure 5 shows that replacing C_i = N_k(x_i) with C_i = B_ϵ(x_i) or C_i = X enlarges the range of local density. Using C_i = N_k(x_i) tends to place more data points within the high-density regions (see the subfigures in column 3 of Figure 5). For example, the subfigure for the Flame dataset using C_i = N_k(x_i) shows that a majority of the data points have high local densities, making it difficult to partition the two densely-populated regions in the dataset. It is better to have each densely-populated region surrounded by low-density data points to facilitate clustering, e.g., the subfigure for the Aggregation dataset using C_i = B_ϵ(x_i). Therefore, overall, using C_i = B_ϵ(x_i) is preferred.
However, for datasets containing both high-density and low-density clusters (e.g., the Path_based and Unbalance datasets in the last two rows of Figure 5), using C_i = N_k(x_i) or C_i = B_ϵ(x_i) tends to yield very low local density for the data points in the low-density clusters. A density-based clustering algorithm must handle this situation carefully to avoid omitting the low-density clusters.

5.4. Test 3: Impact of the Exponent m on Local Density

Test 3 varies the value of m in the contribution function c(x_i,x_j) = e^{-(d(x_i,x_j)/ϵ)^m} to study the impact of m on the local density. Specifically, we compare three definitions, (Σ, N_k, 2, ϵ_k), (Σ, N_k, 4, ϵ_k), and (Σ, N_k, 8, ϵ_k), where k is set to 10 to derive ϵ_k and N_k. The results are shown in Figure 6, where the subfigures in the same row are the results for a dataset, and the subfigures in the same column are the results using the same value of m.
Comparing the subfigures in the same row of Figure 6 shows that a larger m causes more data points to have a higher local density. For datasets with well-separated clusters (e.g., R15), using a large m helps identify the cores of the clusters. However, for datasets with poorly separated clusters (e.g., S4), using a large m makes it challenging to spot the boundary between two adjacent clusters. For datasets containing both high-density and low-density clusters (e.g., Unbalance), the impact of m on the local density is not significant.

5.5. Test 4: Impact of the Integration Operator ( Π or Σ ) on Local Density

Test 4 studies the impact of the integration operator (Π or Σ) using the two definitions (Σ, N_k, 2, ϵ_k) and (Π, N_k, 2, ϵ_k) to calculate local density. As in Tests 2 and 3, k is set to 10 to derive ϵ_k and N_k. The results are shown in Figure 7, where the subfigures in the same column are the results using the same integration operator.
The contribution function c(x_i,x_j) in Equation (16) yields a value between 0 and 1, so using the product operator Π to integrate the data points' contributions results in a smaller local density than using the summation operator Σ. Using Π tends to leave only a small portion of data points with a higher local density, and thus it helps to identify the density peaks in the dataset. However, for datasets containing both high-density and low-density clusters (e.g., the Path_based and Unbalance datasets), using Π cannot find the density peaks in the low-density clusters.

6. Conclusions

In this study, we first divided the existing definitions of local density into two categories, radius-based and kNN-based, and showed that a kNN-based definition is implicitly radius-based. We then proposed a canonical form that decomposes the definition of local density into three parts: the integration operator (Σ or Π), the contribution set C_i, and the contribution function c(x_i,x_j). Furthermore, the contribution function can be controlled with a radius ϵ and an exponent m. Thus, a definition of local density can be represented as a four-component tuple (Σ or Π, C_i, m, ϵ), from which new definitions can be derived. Based on our analysis and experiments, we conclude with the following guidelines for developing new definitions of local density:
(Π, B_ϵ(x_i), *, *) and (Π, X, *, *) should be avoided because they can produce results contradicting the notion of local density; for example, they can assign a low density to a should-be high-density data point. Here, '*' represents a don't-care term;
The product operator Π should be used only when the size of the contribution set C_i is fixed for every data point, e.g., C_i = N_k(x_i);
In most cases, the summation operator Σ should be adopted; however, the product operator Π helps to identify the density peaks in a dataset;
The value of ϵ should be dataset-dependent, e.g., ϵ_p, ϵ_k, and ϵ_kP. Notably, ϵ_p is sensitive to the dataset's size, ϵ_k is sensitive to the parameter k and to outliers in the dataset, and ϵ_kP provides a compromise between them;
The value of m should be ≥ 2 so that the contribution function c(x_i,x_j) has an inflection point at d(x_i,x_j)/ϵ = ((m−1)/m)^{1/m}. The greater the value of m, the closer the inflection point is to d(x_i,x_j)/ϵ = 1.
Notably, the above (Σ or Π, C_i, m, ϵ) representation assumes that the contribution function c(x_i,x_j) = e^{-(d(x_i,x_j)/ϵ)^m} is adopted. That is, given the parameters m and ϵ, the value of c(x_i,x_j) depends only on the distance d(x_i,x_j). However, in recent studies [8,10], c(x_i,x_j) may involve not only x_i and x_j but also their k nearest neighbors. In such cases, a three-component tuple (Σ or Π, C_i, c(x_i,x_j)) should be adopted to represent a definition of local density, where c(x_i,x_j) may require additional parameters, e.g., k for the k nearest neighbors. Furthermore, c(x_i,x_j) can incorporate a symmetric distance based on the mutual k nearest neighbors of x_i and x_j, as was done in [10]. Other symmetric distance measures can also be adopted.
With only a single definition of local density, it can be challenging to identify clusters in a dataset containing clusters of different densities. Future studies can address how to apply the proposed canonical form to this problem. For example, a stepwise approach can be adopted, where each step uses a different definition of local density to target clusters with a specific feature. The proposed canonical form facilitates changing the density definition at different stages of a clustering approach. The effective integration of the canonical form and a clustering approach remains under-studied.

Funding

This research is supported by the Ministry of Science and Technology, Taiwan, under Grant MOST 108-2221-E-155-013.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data available in a publicly accessible repository. Please refer to the references in Table 3 for availability.

Acknowledgments

The author acknowledges the Innovation Center for Big Data and Digital Convergence at Yuan Ze University for supporting this study.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques, 3rd ed.; Morgan Kaufmann Publishers Inc.: Waltham, MA, USA, 2011.
2. Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 226–231.
3. Ankerst, M.; Breunig, M.M.; Kriegel, H.-P.; Sander, J. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, USA, 1–3 June 1999; pp. 49–60.
4. Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492–1496.
5. Liu, Y.; Ma, Z.; Fang, Y. Adaptive density peak clustering based on K-nearest neighbors with aggregating strategy. Knowl. Based Syst. 2017, 133, 208–220.
6. Xie, J.; Gao, H.; Xie, W.; Liu, X.; Grant, P.W. Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors. Inf. Sci. 2016, 354, 19–40.
7. Du, M.; Ding, S.; Jia, H. Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowl. Based Syst. 2016, 99, 135–145.
8. Liu, Y.; Liu, D.; Yu, F.; Ma, Z. A Double-Density Clustering Method Based on "Nearest to First in" Strategy. Symmetry 2020, 12, 747.
9. Lin, J.-L.; Kuo, J.-C.; Chuang, H.-W. Improving Density Peak Clustering by Automatic Peak Selection and Single Linkage Clustering. Symmetry 2020, 12, 1168.
10. Lv, Y.; Liu, M.; Xiang, Y. Fast Searching Density Peak Clustering Algorithm Based on Shared Nearest Neighbor and Adaptive Clustering Center. Symmetry 2020, 12, 2014.
11. Chang, H.; Yeung, D.-Y. Robust path-based spectral clustering. Pattern Recognit. 2008, 41, 191–203.
12. Fu, L.; Medico, E. FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinform. 2007, 8, 3.
13. Gionis, A.; Mannila, H.; Tsaparas, P. Clustering aggregation. ACM Trans. Knowl. Discov. Data 2007, 1, 4.
14. Jain, A.K.; Law, M.H. Data clustering: A user's dilemma. In Proceedings of the 2005 International Conference on Pattern Recognition and Machine Intelligence, Kolkata, India, 20–22 December 2005; pp. 1–10.
15. Veenman, C.J.; Reinders, M.J.T.; Backer, E. A maximum variance cluster algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 1273–1280.
16. Zahn, C.T. Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters. IEEE Trans. Comput. 1971, 100, 68–86.
17. Kärkkäinen, I.; Fränti, P. Dynamic Local Search Algorithm for the Clustering Problem; A-2002-6; University of Joensuu: Joensuu, Finland, 2002.
18. Fränti, P.; Virmajoki, O. Iterative shrinking method for clustering problems. Pattern Recognit. 2006, 39, 761–775.
19. Rezaei, M.; Fränti, P. Set Matching Measures for External Cluster Validity. IEEE Trans. Knowl. Data Eng. 2016, 28, 2173–2186.
Figure 1. The horizontal axis is d(x_i, x_j)/ϵ, and the vertical axis shows the values of e^{-(d(x_i,x_j)/ϵ)^2} (in red) and its first (in blue) and second (in purple) derivatives with respect to d(x_i, x_j)/ϵ.
Figure 2. The contribution c(x_i, x_j) for different values of m. The horizontal axis is d(x_i, x_j)/ϵ, and the vertical axis is the contribution c(x_i, x_j) = e^{-(d(x_i,x_j)/ϵ)^m}, as defined in Equation (16).
Figure 3. The radii ϵ_p, ϵ_k, and ϵ_kP for p = 2, P = 75, and k = 5 to 50. The horizontal axis is the value of k, and the vertical axis is the value of the radius.
Figure 4. The local densities calculated using ϵ_p, ϵ_k, or ϵ_kP for the data points in four datasets (Path_based, Compound, D31, and A2). The horizontal and vertical coordinates show the positions of the data points, and the color indicates the local density.
Figure 5. The local densities calculated using C_i = X, B_ϵ(x_i), or N_k(x_i). The horizontal and vertical coordinates show the positions of the data points, and the color indicates the local density.
Figure 6. The local densities calculated using m = 2, 4, or 8 in c(x_i, x_j). The horizontal and vertical coordinates show the positions of the data points, and the color indicates the local density.
Figure 7. The local densities calculated using different integration operators (Π or Σ). The horizontal and vertical coordinates show the positions of the data points, and the color indicates the local density.
Table 1. Notations.
X = {x_1, …, x_n}: the dataset of n data points to be clustered
ρ(x_i): the local density of a data point x_i ∈ X
d(x_i, x_j): the distance between two data points x_i and x_j
ϵ: the radius of a data point's neighborhood
ϵ_p: the radius derived from the top p% of all pairs' distances (first used in Section 4)
ϵ_k: the radius derived using the parameter k and Equation (8) (first used in Section 4)
ϵ_kP: the radius derived using the P-th percentile of the distances between all data points and their k-th nearest neighbors (first used in Section 4)
N_k(x_i): the set of k nearest neighbors of x_i (first used in Equation (4))
R_k(x_i): the set of reverse k nearest neighbors of x_i (first used in Equation (12))
y_i^j: the j-th nearest neighbor of x_i (first used in Section 2.2)
δ_i^j: the distance between x_i and its j-th nearest neighbor y_i^j (first used in Equation (8))
C_i: the set of data points that contribute to the density of x_i (first used in Equation (17))
c(x_i, x_j): the contribution of x_j to the density of x_i (first used in Equation (17))
Table 2. Equations (3), (4), (6), (7) and (19)–(21) fit the canonical form defined in Equations (16)–(18).
Equation | Π or Σ | C_i | c(x_i, x_j) | m | ϵ
(19) ρ(x_i) = Σ_{x_j∈X} e^{-(d(x_i,x_j)/ϵ)^∞} | Σ | X | e^{-(d(x_i,x_j)/ϵ)^∞} | ∞ | ϵ is set to the distance at the top p% of all pairs' distances in X, where p is a parameter [4]
(3) ρ(x_i) = Σ_{x_j∈X} e^{-(d(x_i,x_j)/ϵ)^2} | Σ | X | e^{-(d(x_i,x_j)/ϵ)^2} | 2 | ϵ is set to the distance at the top p% of all pairs' distances in X, where p is a parameter [4]
(4) ρ(x_i) = Σ_{x_j∈N_k(x_i)} e^{-(d(x_i,x_j)/1)^1} | Σ | N_k(x_i) | e^{-(d(x_i,x_j)/1)^1} | 1 | 1
(6) ρ(x_i) = Π_{x_j∈N_k(x_i)} e^{-(d(x_i,x_j)/√k)^2} | Π | N_k(x_i) | e^{-(d(x_i,x_j)/√k)^2} | 2 | √k
(7) ρ(x_i) = Σ_{x_j∈N_k(x_i)} e^{-(d(x_i,x_j)/ϵ)^2} | Σ | N_k(x_i) | e^{-(d(x_i,x_j)/ϵ)^2} | 2 | ϵ is derived from the distance between each data point and its k-th nearest neighbor using Equation (8) [5]
(20) ρ(x_i) = Σ_{x_j∈R_k(x_i)} 1 | Σ | R_k(x_i) | 1 | - | -
(21) ρ(x_i) = Σ_{x_j∈L(x_i)} Sim(x_i,x_j) | Σ | N_k(x_i) ∩ R_k(x_i) | Sim(x_i,x_j) | - | -
Table 3. Number of points and number of clusters in the 16 synthetic datasets.
Dataset | Number of Clusters | Number of Points
Spiral [11] | 3 | 312
Flame [12] | 2 | 240
Aggregation [13] | 7 | 788
Jain [14] | 2 | 373
D31 [15] | 31 | 3100
R15 [15] | 15 | 600
Compound [16] | 6 | 399
A1 [17] | 20 | 3000
A2 [17] | 35 | 5250
A3 [17] | 50 | 7500
S1 [18] | 15 | 5000
S2 [18] | 15 | 5000
S3 [18] | 15 | 5000
S4 [18] | 15 | 5000
Path_based [11] | 3 | 300
Unbalance [19] | 8 | 6500
Table 4. ϵ_p (p = 2), ϵ_k (k = 10), and ϵ_kP (k = 10 and P = 75) for four datasets.
Dataset | Compound | Path_based | D31 | A2
ϵ_p | 0.182606 | 0.223688 | 0.203595 | 0.206687
ϵ_kP | 0.280839 | 0.522962 | 0.094954 | 0.071405
ϵ_k | 0.430744 | 0.558793 | 0.114488 | 0.088676
