1. Introduction
Density-based clustering is the task of detecting densely populated regions (called clusters) separated by sparsely populated or empty regions in a data set. It is an unsupervised process that can discover clusters of arbitrary shapes [1]. Many density-based clustering algorithms have been proposed in the literature [2,3,4,5,6,7,8,9], but most of them adopt their own definitions of local density. Since clusters are derived based on each data point's local density, using an inappropriate definition of local density could yield poor clustering results. Thus, it is crucial to define local density properly for density-based clustering.
This study divides the definitions for local density in the literature into two categories: Radius-based and k Nearest Neighbors-based (or kNN-based for short). Radius-based local density uses a radius to specify the neighborhood of a data point, and the data points within a data point’s neighborhood mainly determine the local density of the data point. In contrast, kNN-based local density uses the k nearest neighbors or the reverse k nearest neighbors of a data point to derive its local density.
In this study, we propose a canonical form for local density. All previous definitions for local density can be viewed as special cases of the canonical form. The canonical form decomposes a local density definition into three parts: the contribution set, the contribution function, and the integration operator. The contribution set of a data point specifies the set of data points that contribute to the data point's local density. The contribution function calculates the contribution of a data point to the local density of another data point. The integration operator combines the contributions of the data points in the contribution set to yield the local density.
The advantage of using this canonical form is twofold. First, it allows us to interpret the implicit differences between definitions for local density. For example, in Section 2.2, we show that the kNN-based local densities defined in [6,7] implicitly use a radius equal to one and √k, respectively. Second, this canonical form facilitates exploring the pros and cons of the existing definitions for local density. We can then combine these definitions' merits to derive definitions for local density suitable for the problem at hand.
The rest of this paper is organized as follows. Section 2 reviews the existing definitions for local density. Section 3 proposes the canonical form for local density and shows how these definitions fit the canonical form. Section 4 describes how to derive new definitions for local density using this canonical form. Section 5 conducts an experiment to show how the three parts of the canonical form (i.e., contribution set, contribution function, and integration operator) affect the local density distribution. Section 6 concludes this paper.
2. Review of Local Density
Most density-based clustering algorithms require calculating each data point's local density to derive clusters in the dataset. However, there is no standard definition for a data point's local density, and many definitions have been proposed in the literature. Based on the parameters used in the definitions, we can divide the existing definitions into two categories. A radius-based definition uses a parameter ε for the radius of a data point's neighborhood, and a kNN-based definition uses a parameter k to limit the scope of the data points involved to the k nearest neighbors. In this section, we review these two types of definitions. For ease of exposition, some notations are defined in Table 1.
2.1. Radius-Based Local Density
As described earlier, a radius-based local density uses a parameter ε to specify the radius of a data point's neighborhood. Consider a dataset D of n data points and the local density ρ_i of a data point x_i ∈ D. A radius-based local density ensures that the data points within x_i's neighborhood have a large contribution to ρ_i and that the data points outside x_i's neighborhood have little or no contribution to ρ_i. In what follows, we describe two definitions for the radius-based local density in the literature.
In [4], the local density ρ_i of a data point x_i is defined as the number of data points within the data point's neighborhood, which is given as follows:

ρ_i = Σ_{x_j ∈ D∖{x_i}} χ(d_ij),  (1)

χ(d) = 1 if d < ε, and 0 otherwise,  (2)

where χ is the step function of Equation (2) and d_ij is the distance between data points x_i and x_j. Thus, each data point x_j with d_ij < ε contributes 1 to ρ_i. In [2], the constraint d_ij ≤ ε is adopted instead of d_ij < ε, i.e., each data point x_j with d_ij ≤ ε contributes 1 to ρ_i. However, this change should not make a significant difference in ρ_i.
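As a concrete illustration, the count-based density of Equations (1) and (2) can be sketched in a few lines of NumPy; the function name and toy data below are ours, and Euclidean distance is assumed:

```python
import numpy as np

def radius_density(X, eps):
    """Local density of each point: the number of other points strictly
    within distance eps (Equation (1) with the step function chi of
    Equation (2))."""
    # Pairwise Euclidean distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    within = d < eps                 # the step function chi
    np.fill_diagonal(within, False)  # a point does not count itself
    return within.sum(axis=1)

# Three tightly packed points and one far-away outlier.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
print(radius_density(X, eps=0.5).tolist())  # [2, 2, 2, 0]
```

The clustered points each see two neighbors within the radius, while the outlier sees none.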
Instead of using the radius ε as a hard threshold as in Equation (1), [4] proposed a local density definition that uses an exponential kernel, as shown in Equation (3):

ρ_i = Σ_{x_j ∈ D∖{x_i}} exp(−(d_ij/ε)²).  (3)

With Equation (3), each data point x_j contributes exp(−(d_ij/ε)²) to ρ_i. Notably, exp(−(d_ij/ε)²) is an inverse S-shaped function of d_ij with an inflection point at ε/√2. That is, the value of exp(−(d_ij/ε)²) decreases at an increasing speed as d_ij approaches ε/√2 from 0, and then at a decreasing speed after d_ij grows beyond ε/√2. Thus, to be exact, Equation (3) uses a soft threshold at ε/√2, instead of at ε.
Figure 1 shows the curves of exp(−(d/ε)²) and its first and second derivatives with respect to d. The three black dots indicate that the inflection point occurs where the first and second derivatives reach their minimum and zero, respectively.
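The location of the kernel's inflection point can be checked directly; the sketch below (names ours) evaluates the analytic second derivative of the kernel and also implements the soft-threshold density of Equation (3):

```python
import numpy as np

def gaussian_density(X, eps):
    """Equation (3): rho_i = sum over j != i of exp(-(d_ij/eps)**2)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    contrib = np.exp(-(d / eps) ** 2)
    np.fill_diagonal(contrib, 0.0)  # a point does not contribute to itself
    return contrib.sum(axis=1)

# Analytic second derivative of the kernel f(d) = exp(-(d/eps)**2):
# f''(d) = (4*d**2/eps**4 - 2/eps**2) * exp(-(d/eps)**2),
# which changes sign exactly at the inflection point d = eps/sqrt(2).
def f2(d, eps=1.0):
    return (4 * d**2 / eps**4 - 2 / eps**2) * np.exp(-(d / eps) ** 2)

print(f2(0.70) < 0 < f2(0.72))  # True: the sign change brackets 1/sqrt(2) ~ 0.707
```

The second derivative is negative just below ε/√2 and positive just above it, confirming the soft threshold described above.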
The proper value for ε is dataset-dependent. Thus, instead of setting the value for ε directly, Ref. [4] used another parameter, p, to derive ε. Specifically, ε is set to the distance at the top p% of all pairs' distances in D, and 1 ≤ p ≤ 2 is recommended. Alternatively, Ref. [5] used the parameter k to determine the value of ε.
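One plausible reading of this percentile rule can be sketched as follows; the function name and the use of `np.percentile` over the unique pairs are our choices:

```python
import numpy as np

def epsilon_from_percentile(X, p=2.0):
    """Set eps to the distance at the p-th percentile of all pairwise
    distances in D (our reading of the parameter-p rule in [4])."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    pair_d = d[np.triu_indices(n, k=1)]  # each of the n(n-1)/2 pairs once
    return float(np.percentile(pair_d, p))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
print(round(epsilon_from_percentile(X, p=2.0), 3))
```

A larger p yields a larger ε, so p directly controls how many pairs fall inside a neighborhood on average.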
2.2. kNN-Based Local Density
Although the radius-based local density is intuitive and straightforward, using the same radius for all data points may be inappropriate for some datasets. The kNN-based local density adopts a different approach by restricting the contributions to a data point's local density to its k nearest neighbors. In what follows, we describe four definitions of the kNN-based local density in the literature.
In [6], a data point's local density is defined using an exponential kernel and the distances to its k nearest neighbors, as shown in Equation (4):

ρ_i = exp(−(1/k) Σ_{x_j ∈ kNN(x_i)} d_ij),  (4)

where kNN(x_i) denotes the set of k nearest neighbors of x_i. Notably, ρ_i is a monotonically decreasing function of d_ij. Its derivative with respect to d_ij is −(1/k)ρ_i, which is a monotonically increasing function of d_ij. As d_ij increases from 0, the value of ρ_i drops at an exponentially decreasing speed. Such a property may cause significantly different effects for different datasets. For example, if the maximum distance between any x_i and x_j is small, then a fixed change to d_ij will cause a large change to ρ_i. In contrast, if the minimum distance between any x_i and x_j is large, then a fixed change to d_ij will only cause a small change to ρ_i. The cause of such inconsistent behavior is that Equation (4) is not unit-less. Alternatively, the function in Equation (4) can be interpreted as a unit-less function exp(−(1/k) Σ_{x_j ∈ kNN(x_i)} (d_ij/ε)) with a fixed radius ε = 1 for any dataset.
Reference [7] used the mean of x_i's squared distances to its k nearest neighbors to derive ρ_i, as shown in Equation (5):

ρ_i = exp(−(1/k) Σ_{x_j ∈ kNN(x_i)} d_ij²).  (5)

Similar to Equation (4), ρ_i in Equation (5) is a monotonically decreasing function of d_ij and is not unit-less. We can rewrite Equation (5) to remove the summation in the exponent as follows:

ρ_i = Π_{x_j ∈ kNN(x_i)} exp(−(d_ij/√k)²).  (6)

Similar to exp(−(d_ij/ε)²) in Equation (3), each factor exp(−(d_ij/√k)²) in Equation (6) is an inverse S-shaped function of d_ij with an inflection point at √(k/2). The function can also be interpreted as a unit-less function exp(−(d_ij/ε)²) with a fixed radius ε = √k for any dataset. That is, Equation (6) uses the parameter k to implicitly derive the radius ε = √k, which controls the position of the inflection point of the inverse S-shaped function.
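Assuming Equation (5) takes the mean-of-squared-distances form described in the text, its equivalence with the per-neighbor product form can be verified numerically (toy distances below are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 5
d = rng.uniform(0.1, 2.0, size=k)  # toy distances to the k nearest neighbors

# Equation (5): exponential of the mean of squared distances.
rho_sum = np.exp(-(d ** 2).sum() / k)
# Equation (6): product of per-neighbor kernels with the implicit radius sqrt(k).
rho_prod = np.prod(np.exp(-(d / np.sqrt(k)) ** 2))

print(bool(np.isclose(rho_sum, rho_prod)))  # True: the two forms coincide
```

The rewrite works because the exponential of a sum equals the product of exponentials, with 1/k absorbed as the radius √k.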
Reference [5] proposed a kNN-based unit-less definition for ρ_i, which is similar to Equation (3) but limits the data points contributing to ρ_i to kNN(x_i), as shown in Equation (7):

ρ_i = Σ_{x_j ∈ kNN(x_i)} exp(−(d_ij/ε)²).  (7)

Reference [5] also used the parameter k to determine the value of ε as follows:

ε = μ + √((1/n) Σ_{x_i ∈ D} (d_i^k − μ)²),  (8)

where d_i^k is the distance between x_i and its k-th nearest neighbor, and μ is the mean of d_i^k over all data points in D. Equation (8) derives ε as μ plus the standard deviation of d_i^k, and thus a larger k yields a larger ε.
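A minimal sketch of the rule in Equation (8), assuming Euclidean distance and with the function name ours:

```python
import numpy as np

def epsilon_mean_plus_std(X, k):
    """Equation (8): eps = mean + one standard deviation of every data
    point's distance to its k-th nearest neighbor."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    kth = np.sort(d, axis=1)[:, k]  # column 0 is the zero self-distance
    return kth.mean() + kth.std()

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
print(round(epsilon_mean_plus_std(X, k=3), 3))
```

Because every d_i^k grows with k, the mean term grows with k, which is why larger k tends to yield larger ε; the standard deviation term is also what makes the rule sensitive to outliers.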
Reference [8] used the distance between x_i and the mean of its k nearest neighbors to derive ρ_i: the smaller the distance between x_i and the mean x̄_i = (1/k) Σ_{x_j ∈ kNN(x_i)} x_j, the larger ρ_i. This definition could yield counterintuitive results because using the mean of the k nearest neighbors discards their distribution. For example, consider the case of two nearest neighbors x_j and x_l of x_i (k = 2) located on opposite sides of x_i with x_i as their midpoint. Then, x̄_i coincides with x_i, so ρ_i remains unchanged regardless of the values of d_ij and d_il, which contradicts the intuition that larger d_ij and d_il should result in a smaller ρ_i.
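The counterexample is easy to reproduce numerically; the sketch below (variable names ours) shows that the distance to the neighbor mean stays zero no matter how far the two neighbors move apart:

```python
import numpy as np

# Counterexample (k = 2): x_i lies exactly midway between its two nearest
# neighbors, so the mean of the neighbors coincides with x_i no matter how
# far away they are, and a distance-to-mean density never changes.
xi = np.array([0.0, 0.0])
for spread in (0.1, 1.0, 10.0):
    xj = np.array([-spread, 0.0])
    xl = np.array([spread, 0.0])
    mean_nn = (xj + xl) / 2
    print(np.linalg.norm(xi - mean_nn))  # 0.0 for every spread
```

Any density that depends only on this distance therefore assigns x_i the same value whether its neighbors are at distance 0.1 or 10.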
Reference [8] also proposed using the number of reverse k nearest neighbors as the local density, as follows:

ρ_i = |RkNN(x_i)|,

where RkNN(x_i) = {x_j ∈ D | x_i ∈ kNN(x_j)} is the set of reverse k nearest neighbors of x_i. This definition could render a data point x_i having ρ_i = 0 even though x_i is in a densely-populated region. Thus, this definition should be used with caution.
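Reverse-kNN counts can be computed by letting each point "vote" for its own k nearest neighbors; a sketch with names of our choosing:

```python
import numpy as np

def reverse_knn_counts(X, k):
    """rho_i = |RkNN(x_i)|: the number of data points that include x_i
    among their own k nearest neighbors."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    knn = np.argsort(d, axis=1)[:, :k]   # each point's k nearest neighbors
    counts = np.zeros(len(X), dtype=int)
    for nbrs in knn:
        counts[nbrs] += 1                # each point votes for its k neighbors
    return counts

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
counts = reverse_knn_counts(X, k=3)
print(counts.min(), counts.max(), counts.sum())  # the sum is always n*k
```

Since the counts always sum to n·k, a low count at one point necessarily implies an inflated count elsewhere, which is one way to see why ρ_i = 0 can occur even in dense regions.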
To avoid the bias of the k nearest neighbors, [10] proposed using mutual k nearest neighbors to define local density. In this definition, MkNN(x_i) is the set of mutual k nearest neighbors of x_i (i.e., x_j ∈ MkNN(x_i) if and only if x_j ∈ kNN(x_i) and x_i ∈ kNN(x_j)); s_ij is the similarity between x_i and x_j; and the local density of x_i is derived from the set of k data points chosen from D with the largest s_ij.
4. Deriving New Definitions Using the Canonical Form
As described in Section 3.1, there are three parts in the canonical form for local density. We can combine possible values for the three parts from the existing definitions to form new definitions for local density. However, some combinations may generate undesirable results, e.g., replacing the contribution set kNN(x_i) in Equation (6) with D∖{x_i}. Thus, it is crucial to understand how the possible values for the three parts affect the results.
First, consider the integration operator in the canonical form. As shown in the second column of Table 2, most of the existing definitions for local density use the summation operator Σ. We can replace the summation operator Σ with the product operator Π (or vice versa) to yield new definitions for local density. The operators Σ and Π affect the local density differently. For example, if the value of Σ_{x_j ∈ C_i} f(d_ij) is fixed, then the more evenly distributed the values of f(d_ij) for all x_j ∈ C_i, the larger the value of Π_{x_j ∈ C_i} f(d_ij). On the contrary, if the value of Π_{x_j ∈ C_i} f(d_ij) is fixed, then the more unevenly distributed the values of f(d_ij) for all x_j ∈ C_i, the larger the value of Σ_{x_j ∈ C_i} f(d_ij). Notably, the contribution f(d_ij) grows as the distance d_ij decreases. If we intend to give higher local density to those data points with more evenly distributed distances to their respective neighbors in C_i, then the product operator Π should be adopted. Otherwise, the summation operator Σ should be used, as in most cases.
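This sum-versus-product behavior is a direct consequence of the AM-GM inequality and is easy to demonstrate; the toy contribution profiles below are ours:

```python
import numpy as np

# Two contribution profiles with the same total (k = 4 neighbors each):
even = np.array([0.25, 0.25, 0.25, 0.25])
uneven = np.array([0.85, 0.05, 0.05, 0.05])

print(bool(np.isclose(even.sum(), uneven.sum())))  # True: summation cannot tell them apart
print(float(np.prod(even)), float(np.prod(uneven)))
print(bool(np.prod(even) > np.prod(uneven)))       # True: product rewards the even profile
```

With equal sums, the evenly distributed profile has a strictly larger product, so the product operator favors points whose neighbors contribute uniformly.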
Next, consider the contribution function f. The general form of f, defined in Equation (16), contains two parameters, the exponent m and the radius ε, i.e., f(d_ij) = exp(−(d_ij/ε)^m). First, focus on the impact of using different values for m. We can view f in Equation (16) as a function of d_ij.
Figure 2 shows that the value of m affects the shape of the function curve. For m > 1, f is an inverse S-shaped function of d_ij with an inflection point at ε((m − 1)/m)^(1/m). As the value of m approaches infinity, the inflection point approaches ε, and the function f approximates the step function in Equation (2). Notably, if m = 1, f is not an inverse S-shaped function. The function curves for m = 1, 1.5, 2, 3, 4, and 50 are shown in Figure 2, where the positions of the inflection points are indicated with solid circles. To choose a suitable value for m, we can check whether the problem at hand prefers that a small increase in d_ij not cause too large a decrease in f(d_ij) when d_ij is small relative to ε. If this is the case, then a large value for m should be adopted to move the inflection point to the right, i.e., closer to ε.
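Assuming the general form f(d) = exp(−(d/ε)^m) stated for Equation (16), setting f″ = 0 gives the inflection point in closed form; a quick sketch (function name ours):

```python
import numpy as np

def inflection_point(m, eps=1.0):
    """Inflection point of f(d) = exp(-(d/eps)**m) for m > 1:
    d* = eps * ((m - 1) / m) ** (1 / m)."""
    return eps * ((m - 1) / m) ** (1 / m)

for m in (1.5, 2, 3, 4, 50):
    print(m, round(inflection_point(m), 4))
# m = 2 recovers eps/sqrt(2); as m grows, the inflection point approaches
# eps and f approaches the step function of Equation (2).
```

For m = 2 this reproduces the soft threshold ε/√2 of Equation (3), and the point moves monotonically toward ε as m grows.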
Next, consider the radius ε of a data point's neighborhood. The value of ε should be dataset-dependent. For example, in [4], ε is set to the distance at the top p% of all pairs' distances in D, where p is a parameter. This method's intuition is to have ⌊p(n − 1)/200⌋ data points within a data point's neighborhood on average. However, this method tends to emphasize the dense regions and overlook the sparse regions in the dataset. We denote the radius derived using this method by ε_p. In [5], ε is set to the mean plus one standard deviation of all data points' distances to their respective k-th nearest neighbors (see Equation (8)). This method is sensitive to the outliers in the dataset and to the value of k. We denote the radius derived using this method by ε_k.
To avoid the shortcomings of the above two methods, we integrate both methods and propose a new method, shown in Algorithm 1. The new method requires two parameters: k and P. First, it collects the distance of each data point to its k-th nearest neighbor. Then, it sorts these distances in ascending order and sets ε to the distance at the P-th percentile location, i.e., the ⌈P·n/100⌉-th distance, where n is the number of data points in the dataset. This new method considers each data point's k-th nearest neighbor instead of the top p% of all pairs' distances. Thus, it is less likely to overlook the sparse regions in the dataset. Furthermore, because the new method does not use the mean and standard deviation, it is less sensitive to outliers than the second method. We denote the radius derived using this method by ε_{k,P}.
Algorithm 1: The proposed method to derive ε.
Input: the set D of n data points; parameters k and P. Output: the radius ε.
1. Set S = {d_i^k | x_i ∈ D}, where d_i^k is the distance between x_i and its k-th nearest neighbor.
2. Sort the elements in S in ascending order.
3. Set s = ⌈P·n/100⌉.
4. Set ε = the s-th element in S.
5. Return ε.
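Algorithm 1 can be sketched as follows; Euclidean distance, the function name, and the ceil-based 1-based rank in step 3 are our reading:

```python
import numpy as np

def epsilon_kp(X, k, P):
    """Algorithm 1: collect each point's distance to its k-th nearest
    neighbor, sort ascending, and return the element at the P-th
    percentile location (rank assumed to be ceil(P*n/100), 1-based)."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    kth = np.sort(d, axis=1)[:, k]          # step 1; column 0 is the self-distance
    S = np.sort(kth)                        # step 2
    s = max(1, int(np.ceil(P * n / 100)))   # step 3
    return float(S[min(s, n) - 1])          # step 4

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
print(round(epsilon_kp(X, k=4, P=50), 3))
```

Because the percentile is taken over k-th-neighbor distances rather than all pairwise distances, sparse regions still influence the chosen ε.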
Finally, consider the contribution set C_i. As described in Section 3.1, D∖{x_i}, kNN(x_i), and RkNN(x_i) are three commonly used values for C_i. Setting C_i = D∖{x_i} allows every data point to contribute to ρ_i. It should only be used when the adopted f(d_ij) is near zero for any data point x_j far from x_i (e.g., Equation (16) with a large m value). For a data point x_i in a dense region, its k nearest neighbors are likely to be located within its neighborhood, i.e., d_i^k ≤ ε. However, for x_i in a sparse region, d_i^k > ε usually holds.
Using the product operator Π with C_i = D∖{x_i} is a poor combination. Most of the data points in D∖{x_i} are far from x_i; thus, this combination involves multiplying many small contributions f(d_ij), rendering a small ρ_i that fails to represent the local density of x_i properly. In contrast, using the summation operator Σ with C_i = D∖{x_i} does not cause such a problem.
Using the product operator Π with C_i = RkNN(x_i) could also render strange results. For example, let ρ_i be the current local density of x_i, and let x_j be a new member of RkNN(x_i) whose distance d_ij is less than the distance between x_i and x_i's nearest neighbor in RkNN(x_i). Intuitively, adding x_j to RkNN(x_i) should increase the local density of x_i. However, according to Equation (16), f(d_ij) is between 0 and 1 for any two data points x_i and x_j. Thus, with the addition of x_j to RkNN(x_i), the local density of x_i becomes ρ_i · f(d_ij), which is less than the original local density ρ_i. Thus, the combination of the product operator Π and C_i = RkNN(x_i) is also a poor definition for local density.
6. Conclusions
In this study, we first divided the existing definitions for local density into two categories, radius-based and kNN-based. It was shown that a kNN-based definition is implicitly radius-based. Then, we proposed a canonical form to decompose the definition of local density into three parts: the integration operator (Σ or Π), the contribution set C_i, and the contribution function f. Furthermore, the contribution function f can be controlled with a radius ε and an exponent m. Thus, a definition for local density can be represented as a tuple of four components (Σ or Π, C_i, ε, m), which can be used to derive new definitions for local density. We conclude the following guidelines for developing new definitions for local density based on our analysis and experiment:
- (Π, D∖{x_i}, *, *) and (Π, RkNN(x_i), *, *) should be avoided because they could incur results contradicting the notion of local density. For example, they could yield a low density for a should-be high-density data point. Here, '*' is used to represent a do-not-care term;
- The product operator Π could be used only when the size of the contribution set is fixed for every data point, e.g., C_i = kNN(x_i);
- In most cases, the summation operator Σ should be adopted. However, the product operator Π helps to identify the density peaks in a dataset;
- The value for ε should be dataset-dependent, e.g., ε_p, ε_k, and ε_{k,P}. Notably, ε_p is sensitive to the dataset's size, ε_k is sensitive to the parameter k and the outliers in the dataset, and ε_{k,P} provides a compromise between them;
- The value of m should be ≥ 2 so that the contribution function f has an inflection point at ε((m − 1)/m)^(1/m). The greater the value of m, the closer the inflection point is to ε.
Notably, using the above (Σ or Π, C_i, ε, m) representation assumes that the contribution function f in Equation (16) is adopted. That is, given the parameters ε and m, the value of f(d_ij) depends only on the distance d_ij. However, in recent studies [8,10], f may involve not only x_i and x_j but also their k nearest neighbors. In such cases, a tuple of three components (Σ or Π, C_i, f) should be adopted to represent a definition for local density, where f may require additional parameters, e.g., k for the k nearest neighbors. Furthermore, f could incorporate a symmetric distance based on the mutual k nearest neighbors of x_i and x_j, as was done in [10]. Other symmetric distance metrics can also be adopted.
Using only one local density definition can make it challenging to identify clusters in a dataset containing clusters with different densities. Future studies can address how to apply the proposed canonical form to handle this problem. For example, we can adopt a stepwise approach in which each step uses a different definition of local density to target clusters with a specific feature. The proposed canonical form can facilitate changing the density definition at different stages of a clustering approach. The effective integration of the canonical form and a clustering approach is currently under-studied.