Article

Generalizing Local Density for Density-Based Clustering

Jun-Lin Lin 1,2
1 Department of Information Management, Yuan Ze University, Taoyuan 32003, Taiwan
2 Innovation Center for Big Data and Digital Convergence, Yuan Ze University, Taoyuan 32003, Taiwan
Symmetry 2021, 13(2), 185; https://doi.org/10.3390/sym13020185
Submission received: 4 January 2021 / Revised: 16 January 2021 / Accepted: 19 January 2021 / Published: 24 January 2021

Abstract

Discovering densely-populated regions in a dataset of data points is an essential task for density-based clustering. To do so, it is often necessary to calculate each data point’s local density in the dataset. Various definitions for the local density have been proposed in the literature. These definitions can be divided into two categories: Radius-based and k Nearest Neighbors-based. In this study, we find the commonality between these two types of definitions and propose a canonical form for the local density. With the canonical form, the pros and cons of the existing definitions can be better explored, and new definitions for the local density can be derived and investigated.

1. Introduction

Density-based clustering is the task of detecting densely-populated regions (called clusters) separated by sparsely-populated or empty regions in a dataset of data points. It is an unsupervised process that can discover clusters of arbitrary shapes [1]. Many density-based clustering algorithms have been proposed in the literature [2,3,4,5,6,7,8,9], but most of them adopt their own definitions of local density. Since clusters are derived based on each data point's local density, using an inappropriate definition of local density could yield poor clustering results. Thus, it is crucial to define local density properly for density-based clustering.
This study divides the definitions for local density in the literature into two categories: Radius-based and k Nearest Neighbors-based (or kNN-based for short). Radius-based local density uses a radius to specify the neighborhood of a data point, and the data points within a data point’s neighborhood mainly determine the local density of the data point. In contrast, kNN-based local density uses the k nearest neighbors or the reverse k nearest neighbors of a data point to derive its local density.
In this study, we propose a canonical form for local density. All previous definitions of local density can be viewed as special cases of this canonical form, which decomposes the definition of local density into three parts: the contribution set, the contribution function, and the integration operator. The contribution set of a data point specifies the set of data points that contribute to the data point's local density. The contribution function calculates the contribution of one data point to the local density of another. The integration operator combines the contributions of the data points in the contribution set to yield the local density.
The advantage of using this canonical form is twofold. First, it allows us to make explicit the implicit differences between definitions of local density. For example, in Section 2.2, we show that the kNN-based local densities defined in [6,7] implicitly use a radius equal to 1 and √k, respectively. Second, this canonical form facilitates exploring the pros and cons of the existing definitions of local density. We can then combine these definitions' merits to derive definitions of local density suitable for the problem at hand.
The rest of this paper is organized as follows. Section 2 reviews the existing definitions for local density. Section 3 proposes the canonical form for local density and shows how these definitions fit the canonical form. Section 4 describes how to derive new definitions for local density using this canonical form. Section 5 conducts an experiment to show how the three parts (i.e., contribution set, contribution function, and integration operator) of the canonical form affect local density distribution. Section 6 concludes this paper.

2. Review of Local Density

Most density-based clustering algorithms require calculating each data point’s local density to derive clusters in the dataset. However, there is no standard definition for a data point’s local density. Many definitions for local density have been proposed in the literature. Based on the parameters used in the definitions, we can divide the existing definitions into two categories. A radius-based definition uses a parameter ϵ for the radius of a data point’s neighborhood, and a kNN-based definition uses a parameter k to limit the scope of the data points involved to the k nearest neighbors. In this section, we review these two types of definitions. For ease of exposition, some notations are defined in Table 1.

2.1. Radius-Based Local Density

As described earlier, a radius-based local density uses the parameter ϵ to specify the radius of a data point's neighborhood. Consider a dataset X = {x_1, x_2, …, x_n} of n data points and the local density ρ(x_i) of a data point x_i ∈ X. A radius-based local density ensures that the data points within x_i's neighborhood contribute substantially to ρ(x_i) and that the data points outside x_i's neighborhood contribute little or nothing to ρ(x_i). In what follows, we describe two definitions of the radius-based local density from the literature.
In [4], the local density of a data point is defined as the number of data points within the data point’s neighborhood, which is given as follows:
\rho(x_i) = \sum_{x_j \in X} \chi\!\left(\frac{d(x_i, x_j)}{\epsilon}\right) \quad (1)
where
\chi(d) = \begin{cases} 1 & \text{if } d < 1 \\ 0 & \text{otherwise} \end{cases} \quad (2)
and d(x_i, x_j) is the distance between data points x_i and x_j. Thus, each data point x_j ∈ X with d(x_i, x_j) < ϵ contributes 1 to ρ(x_i). In [2], the constraint d(x_i, x_j) ≤ ϵ is adopted instead of d(x_i, x_j) < ϵ, i.e., each data point x_j ∈ X with d(x_i, x_j) ≤ ϵ contributes 1 to ρ(x_i). However, this change should not make a significant difference in ρ(x_i).
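To make the count-based definition concrete, here is a minimal sketch in Python (our own function name; brute-force O(n²) distances, assuming the dataset fits in a NumPy array):

```python
import numpy as np

def radius_count_density(X, eps):
    """Local density per Equation (1): the number of data points x_j with
    d(x_i, x_j) < eps. Note the sum runs over all x_j in X, so each point
    also counts itself (a constant offset of 1)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distances
    return (dist < eps).sum(axis=1)            # chi(d/eps): 1 iff d/eps < 1
```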
Instead of using the radius ϵ as a hard threshold as in Equation (1), Ref. [4] also proposed a local density definition that uses an exponential kernel, as shown in Equation (3).
\rho(x_i) = \sum_{x_j \in X} e^{-\left(\frac{d(x_i, x_j)}{\epsilon}\right)^2} \quad (3)
With Equation (3), each data point x_j ∈ X contributes e^{-(d(x_i,x_j)/ϵ)^2} to ρ(x_i). Notably, e^{-(d(x_i,x_j)/ϵ)^2} is an inverse S-shaped function of d(x_i,x_j)/ϵ with an inflection point at d(x_i,x_j)/ϵ = 1/√2. That is, the value of e^{-(d(x_i,x_j)/ϵ)^2} decreases at an increasing speed as d(x_i,x_j)/ϵ approaches 1/√2 from 0, and then at a decreasing speed once d(x_i,x_j)/ϵ exceeds 1/√2. Thus, to be exact, Equation (3) uses a soft threshold at d(x_i,x_j) = ϵ/√2 rather than at d(x_i,x_j) = ϵ. Figure 1 shows the curves of e^{-(d(x_i,x_j)/ϵ)^2} and its first and second derivatives with respect to d(x_i,x_j)/ϵ. The three black dots indicate that the inflection point occurs where the first derivative reaches its minimum and the second derivative crosses zero.
The proper value of ϵ is dataset-dependent. Thus, instead of setting the value of ϵ directly, Ref. [4] used another parameter, p, to derive ϵ. Specifically, ϵ is set to the distance at the top p% of all pairs' distances in X, and 1 ≤ p ≤ 2 is recommended. Alternatively, Ref. [5] used the parameter k to determine the value of ϵ.
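A sketch of this heuristic for deriving ϵ from p (the exact rounding of the cut position is our assumption, not specified in [4]):

```python
import numpy as np

def eps_from_pair_percentile(X, p=2.0):
    """Set eps to the p-th percentile (counted from the smallest) of the
    n(n-1)/2 pairwise distances, following the heuristic in [4]."""
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    pair_dists = np.sort(dist[np.triu_indices(n, k=1)])  # each pair once
    pos = max(int(round(len(pair_dists) * p / 100.0)) - 1, 0)
    return pair_dists[pos]
```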

2.2. kNN-Based Local Density

Although the radius-based local density is intuitive and straightforward, using the same radius for all data points may be inappropriate for some datasets. The kNN-based local density adopts a different approach, restricting the data points that contribute to the local density to the k nearest neighbors. In what follows, we describe six definitions of the kNN-based local density from the literature.
In [6], a data point’s local density is defined using an exponential kernel and the distances to k nearest neighbors, as shown in Equation (4).
\rho(x_i) = \sum_{x_j \in N_k(x_i)} e^{-d(x_i, x_j)} \quad (4)
where N_k(x_i) denotes the set of k nearest neighbors of x_i. Notably, e^{-d(x_i,x_j)} is a monotonically decreasing function of d(x_i,x_j), and its derivative with respect to d(x_i,x_j) is -e^{-d(x_i,x_j)}, a monotonically increasing function of d(x_i,x_j). As d(x_i,x_j) increases from 0, the value of e^{-d(x_i,x_j)} drops at an exponentially decreasing speed. This property may have significantly different effects on different datasets. For example, if the maximum distance between any x_i ∈ X and x_j ∈ N_k(x_i) is small, then a fixed change in d(x_i,x_j) causes a large change in e^{-d(x_i,x_j)}. In contrast, if the minimum distance between any x_i ∈ X and x_j ∈ N_k(x_i) is large, then the same change in d(x_i,x_j) causes only a small change in e^{-d(x_i,x_j)}. This inconsistent behavior arises because Equation (4) is not unit-less. Alternatively, the function e^{-d(x_i,x_j)} can be interpreted as a unit-less function e^{-d(x_i,x_j)/ϵ} with a fixed radius ϵ = 1 for any dataset.
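A minimal sketch of Equation (4) (our helper name; brute-force neighbor search):

```python
import numpy as np

def knn_exp_density(X, k):
    """Local density per Equation (4): sum of exp(-d(x_i, x_j)) over the
    k nearest neighbors N_k(x_i). The kernel is not unit-less: rescaling
    the data rescales all densities nonlinearly."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)            # exclude x_i itself
    knn_dist = np.sort(dist, axis=1)[:, :k]   # k smallest distances per row
    return np.exp(-knn_dist).sum(axis=1)
```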
Reference [7] used the mean of x_i's squared distances to its k nearest neighbors to derive ρ(x_i), as shown in Equation (5):
\rho(x_i) = e^{-\frac{1}{k}\sum_{x_j \in N_k(x_i)} d(x_i, x_j)^2} \quad (5)
Similar to Equation (4), ρ(x_i) in Equation (5) is a monotonically decreasing function of (1/k) Σ_{x_j ∈ N_k(x_i)} d(x_i,x_j)^2 and is not unit-less. We can rewrite Equation (5) to move the summation out of the exponent as follows.
\rho(x_i) = e^{-\sum_{x_j \in N_k(x_i)} \left(\frac{d(x_i, x_j)}{\sqrt{k}}\right)^2} = \prod_{x_j \in N_k(x_i)} e^{-\left(\frac{d(x_i, x_j)}{\sqrt{k}}\right)^2} \quad (6)
Similar to e^{-(d(x_i,x_j)/ϵ)^2} in Equation (3), e^{-(d(x_i,x_j)/√k)^2} in Equation (6) is an inverse S-shaped function of d(x_i,x_j)/√k with an inflection point at d(x_i,x_j)/√k = 1/√2. The function e^{-(d(x_i,x_j)/√k)^2} can also be interpreted as a unit-less function e^{-(d(x_i,x_j)/ϵ)^2} with a fixed radius ϵ = √k for any dataset. That is, Equation (6) uses the parameter k to implicitly derive the radius ϵ, which controls the position of the inflection point of the inverse S-shaped function e^{-(d(x_i,x_j)/√k)^2}.
Reference [5] proposed a kNN-based unit-less definition for ρ ( x i ) , which is similar to Equation (3) but limits the data points contributing to ρ ( x i ) only to N k ( x i ) , as shown in Equation (7).
\rho(x_i) = \sum_{x_j \in N_k(x_i)} e^{-\left(\frac{d(x_i, x_j)}{\epsilon}\right)^2} \quad (7)
Reference [5] also used the parameter k to determine the value of ϵ as follows:
\epsilon = \mu_k + \sqrt{\frac{1}{|X| - 1} \sum_{x_i \in X} \left(\delta_i^k - \mu_k\right)^2} \quad (8)

\mu_k = \frac{1}{|X|} \sum_{x_i \in X} \delta_i^k \quad (9)
where δ_i^k is the distance between x_i and its k-th nearest neighbor, and μ_k is the mean of δ_i^k over all data points in X. Equation (8) derives ϵ as μ_k plus one standard deviation of δ_i^k, and thus a larger k yields a larger ϵ.
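A sketch of Equations (8) and (9) (we use the sample standard deviation, matching the 1/(|X|−1) factor in Equation (8)):

```python
import numpy as np

def eps_from_kth_nn(X, k):
    """Set eps to mu_k plus one standard deviation of the k-th nearest
    neighbor distances delta_i^k (Equations (8) and (9), following [5])."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    delta_k = np.sort(dist, axis=1)[:, k - 1]   # k-th NN distance per point
    return delta_k.mean() + delta_k.std(ddof=1)
```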
Reference [8] used the distance between x i and the mean of its k nearest neighbors to derive ρ ( x i ) , as follows:
\rho(x_i) = e^{-d(x_i, \bar{x}_i)^2} \quad (10)

\bar{x}_i = \frac{1}{k} \sum_{x_j \in N_k(x_i)} x_j \quad (11)
This definition can yield counterintuitive results because taking the mean of the k nearest neighbors discards their distribution. For example, consider two nearest neighbors y_i^1 and y_i^2 of x_i located at opposite sides of x_i with d(x_i, y_i^1) = d(x_i, y_i^2). Then x̄_i coincides with x_i, so ρ(x_i) = e^0 = 1 regardless of the values of d(x_i, y_i^1) and d(x_i, y_i^2), which contradicts the intuition that larger d(x_i, y_i^1) and d(x_i, y_i^2) should result in a smaller ρ(x_i).
Reference [8] also proposed using the number of reverse k nearest neighbors as the local density, as follows:
\rho(x_i) = |R_k(x_i)| \quad (12)
where R_k(x_i) = {x_j ∈ X | x_i ∈ N_k(x_j)} is the set of reverse k nearest neighbors of x_i. This definition can yield ρ(x_i) = 0 for a data point x_i even though x_i lies in a densely-populated region. Thus, it should be used with caution.
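A sketch of Equation (12) (our helper name; each point receives one vote per appearance in another point's k-nearest-neighbor list):

```python
import numpy as np

def reverse_knn_density(X, k):
    """Local density per Equation (12): |R_k(x_i)|, the number of reverse
    k nearest neighbors. Can legitimately return 0, hence the caution
    noted in the text."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    knn_idx = np.argsort(dist, axis=1)[:, :k]   # N_k(x_j) for every x_j
    rho = np.zeros(len(X), dtype=int)
    np.add.at(rho, knn_idx.ravel(), 1)          # one vote per appearance
    return rho
```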
To avoid the bias of the k nearest neighbors, Ref. [10] proposed using mutual k nearest neighbors to define local density, as follows:
SNN(x_i, x_j) = \left(N_k(x_i) \cup \{x_i\}\right) \cap \left(N_k(x_j) \cup \{x_j\}\right) \quad (13)

Sim(x_i, x_j) = \begin{cases} \dfrac{|SNN(x_i, x_j)|^2}{\sum_{x_p \in SNN(x_i, x_j)} \left(d(x_i, x_p) + d(x_j, x_p)\right)} & \text{if } x_i, x_j \in SNN(x_i, x_j) \\ 0 & \text{otherwise} \end{cases} \quad (14)

\rho(x_i) = \sum_{x_j \in L(x_i)} Sim(x_i, x_j) \quad (15)
where SNN(x_i, x_j) is the set of mutual (shared) k nearest neighbors of x_i and x_j; Sim(x_i, x_j) is the similarity between x_i and x_j; and L(x_i) is the set of k data points chosen from X \ {x_i} with the largest Sim(x_i, x_j).
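A direct, unoptimized sketch of Equations (13)–(15) (O(n²) loops; the guard against a zero denominator is our addition):

```python
import numpy as np

def snn_density(X, k):
    """Shared-nearest-neighbor density of Equations (13)-(15) [10]."""
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    d = dist.copy()
    np.fill_diagonal(d, np.inf)
    # Closed neighborhoods N_k(x_i) U {x_i}, stored as sets of indices.
    nbr = [set(np.argsort(d[i])[:k].tolist()) | {i} for i in range(n)]
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            snn = nbr[i] & nbr[j]              # SNN(x_i, x_j), Eq. (13)
            if i in snn and j in snn:          # mutual-kNN condition of Eq. (14)
                denom = sum(dist[i, p] + dist[j, p] for p in snn)
                if denom > 0:
                    sim[i, j] = sim[j, i] = len(snn) ** 2 / denom
    # Eq. (15): sum the k largest similarities to other points.
    return np.sort(sim, axis=1)[:, -k:].sum(axis=1)
```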

3. Canonical Form for Local Density

In this section, we first propose the canonical form for local density. Then, we show how the existing definitions for local density fit the canonical form.

3.1. Canonical Form

Based on the review in Section 2, this section proposes a canonical form for local density. Consider a dataset X and a data point x_i ∈ X. The canonical form for the local density ρ(x_i) comprises three parts: the contribution set C_i, the contribution function c(x_i, x_j), and the integration operator. The contribution set C_i ⊆ X is the set of data points contributing to ρ(x_i). Three values for C_i are commonly used in the literature: N_k(x_i), X, and B_ϵ(x_i) = {x_j ∈ X | d(x_i, x_j) < ϵ}. The first, N_k(x_i), is the set of k nearest neighbors of x_i, where k is a parameter [5,6,7]. The second, X, is the entire dataset [4]. The third, B_ϵ(x_i), uses ϵ to specify the radius of a data point's neighborhood, so that only the data points within the neighborhood of x_i contribute to ρ(x_i) [2,4].
The contribution function c ( x i , x j ) calculates the contribution of a data point x j C i to the density of x i . A general form for c ( x i , x j ) is proposed as follows:
c(x_i, x_j) = e^{-\left(\frac{d(x_i, x_j)}{\epsilon}\right)^m} \quad (16)
where ϵ is the radius of a data point's neighborhood. In the literature, the value of the exponent m is 1, 2, or ∞. In practice, any m ≥ 1 can be used to achieve a different effect, as discussed further in Section 4.
The integration operator integrates the contributions of the data points in C i to yield ρ ( x i ) . In the literature, either the summation Σ or the product Π operator is used. Thus, the canonical form for local density can be defined using Equation (17) or Equation (18), as follows:
\rho(x_i) = \sum_{x_j \in C_i} c(x_i, x_j) \quad (17)

\rho(x_i) = \prod_{x_j \in C_i} c(x_i, x_j) \quad (18)
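To make the decomposition concrete, the following sketch implements the canonical form with pluggable parts (the function names and calling convention are our own choices, not the paper's):

```python
import numpy as np

def pairwise_dist(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def knn_set(k):
    """Contribution set C_i = N_k(x_i)."""
    def C(i, dist):
        d = dist[i].copy()
        d[i] = np.inf
        return np.argsort(d)[:k]
    return C

def ball_set(eps):
    """Contribution set C_i = B_eps(x_i) = {x_j : d(x_i, x_j) < eps}."""
    def C(i, dist):
        mask = dist[i] < eps
        mask[i] = False
        return np.flatnonzero(mask)
    return C

def canonical_density(X, contrib_set, m=2.0, eps=1.0, op="sum"):
    """rho(x_i) per Equations (16)-(18): combine the contributions
    c(x_i, x_j) = exp(-(d/eps)^m) over C_i with Sigma or Pi."""
    dist = pairwise_dist(X)
    combine = np.sum if op == "sum" else np.prod
    rho = np.empty(len(X))
    for i in range(len(X)):
        idx = contrib_set(i, dist)                          # C_i
        rho[i] = combine(np.exp(-(dist[i, idx] / eps) ** m))
    return rho
```

For example, with k = 10 and ϵ derived as in Equation (8), canonical_density(X, knn_set(10), m=2, eps=eps, op="sum") reproduces Equation (7).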

3.2. Fitting the Existing Definitions to the Canonical Form

Based on the canonical form defined in Section 3.1, we can derive most of the definitions for local density reviewed in Section 2, and Table 2 summarizes the results. We have excluded the definition in Equation (10) because it tends to conflict with the basic property of local density, as described in Section 2.
Notably, we have transformed Equation (1) to Equation (19) below such that it can match the canonical form in Equation (17):
\rho(x_i) = \sum_{x_j \in X} e^{-\left(\frac{d(x_i, x_j)}{\epsilon}\right)^\infty} \quad (19)
Here, e^{-(d(x_i,x_j)/ϵ)^∞} = 1 if 0 ≤ d(x_i,x_j)/ϵ < 1, and e^{-(d(x_i,x_j)/ϵ)^∞} = 0 if d(x_i,x_j)/ϵ > 1. Thus, Equations (1) and (19) yield exactly the same results except at d(x_i,x_j)/ϵ = 1, where Equation (1) has c(x_i,x_j) = 0 but Equation (19) has c(x_i,x_j) = e^{-1}.
Similarly, we have transformed Equation (12) to Equation (20) below such that it can match the canonical form in Equation (17).
\rho(x_i) = \sum_{x_j \in R_k(x_i)} 1 \quad (20)
Additionally, Equation (15) is rewritten as Equation (21) to avoid using L ( x i ) .
\rho(x_i) = \sum_{x_j \in N_k(x_i) \cap R_k(x_i)} Sim(x_i, x_j) \quad (21)
Notably, by Equation (14), Sim(x_i, x_j) ≠ 0 only if x_i, x_j ∈ SNN(x_i, x_j), and by Equation (13), SNN(x_i, x_j) \ {x_i} ⊆ N_k(x_i) contains at most k data points. Thus, we can replace L(x_i) in Equation (15) by N_k(x_i) ∩ R_k(x_i), or simply N_k(x_i), to speed up the computation.
By fitting the existing definitions to the canonical form, we can see that most of them use a radius ϵ, explicitly or implicitly. With Table 2, we can better explore the pros and cons of these definitions. For example, Equation (4) uses a fixed radius ϵ = 1, and Equation (6) uses the radius ϵ = √k, which depends only on the parameter k. Neither considers the distribution of the data points in the dataset when determining ϵ. Consequently, the chosen value of ϵ may not adapt well to different datasets. In contrast, Equations (3), (7), and (19) not only use a parameter (p or k) but also consider the distribution of the data points to decide a proper value for ϵ.

4. Derive New Definitions Using the Canonical Form

As described in Section 3.1, there are three parts in the canonical form for local density. We can combine possible values for the three parts from the existing definitions to form new definitions for local density. However, some combinations may generate undesirable results, e.g., replacing the contribution set N k in Equation (6) with X . Thus, it is crucial to understand how the possible values for the three parts affect the results.
First, consider the integration operator in the canonical form. As shown in the second column of Table 2, most of the existing definitions of local density use the summation operator Σ. We can replace the summation operator Σ with the product operator Π (or vice versa) to yield new definitions of local density. The operators Π and Σ affect the local density differently. For example, if the value of Σ_{x_j∈C_i} c(x_i,x_j) is fixed, then the more evenly distributed the values of c(x_i,x_j) for all x_j ∈ C_i, the larger the value of Π_{x_j∈C_i} c(x_i,x_j). Conversely, if the value of Π_{x_j∈C_i} c(x_i,x_j) is fixed, then the more unevenly distributed the values of c(x_i,x_j) for all x_j ∈ C_i, the larger the value of Σ_{x_j∈C_i} c(x_i,x_j). Notably, the contribution c(x_i,x_j) grows as the distance d(x_i,x_j) decreases. If we intend to give a higher local density to data points with more evenly distributed distances to their respective neighbors in C_i, then the product operator Π should be adopted. Otherwise, the summation operator Σ should be used, as in most cases.
Next, consider the contribution function c(x_i,x_j). Its general form, defined in Equation (16), contains two parameters: the exponent m and the radius ϵ. First, focus on the impact of using different values of m. We can view e^{-(d(x_i,x_j)/ϵ)^m} in Equation (16) as a function of d(x_i,x_j)/ϵ. Figure 2 shows that the value of m affects the shape of the function curve. For m > 1, e^{-(d(x_i,x_j)/ϵ)^m} is an inverse S-shaped function of d(x_i,x_j)/ϵ with an inflection point at d(x_i,x_j)/ϵ = ((m−1)/m)^{1/m}. As m approaches infinity, the inflection point approaches d(x_i,x_j)/ϵ = 1, where e^{-(d(x_i,x_j)/ϵ)^m} = e^{-1}, and the function approximates the step function in Equation (2). Notably, if m = 1, e^{-(d(x_i,x_j)/ϵ)^m} is not an inverse S-shaped function. The function curves for m = 1, 1.5, 2, 3, 4, and 50 are shown in Figure 2, where the positions of the inflection points are indicated with solid circles. To choose a suitable value of m, we can check whether the problem at hand prefers that a small increase in d(x_i,x_j) not cause too large a decrease in c(x_i,x_j) when d(x_i,x_j) < ϵ. If so, a large value of m should be adopted to move the inflection point to the right, i.e., closer to d(x_i,x_j)/ϵ = 1.
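The inflection point quoted above follows from a short calculation (our own algebra; it reduces to 1/√2 for m = 2, matching Section 2.1, and tends to 1 as m → ∞):

```latex
% Inflection point of the contribution function.
% Let u = d(x_i, x_j)/\epsilon and f(u) = e^{-u^m}. Then
f'(u)  = -m\, u^{m-1} e^{-u^m}, \qquad
f''(u) = m\, u^{m-2} e^{-u^m} \left( m\, u^{m} - (m-1) \right).
% Setting f''(u) = 0 for u > 0 yields
u^{m} = \frac{m-1}{m}
\quad\Longrightarrow\quad
u = \sqrt[m]{\frac{m-1}{m}} .
```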
Next, consider the radius ϵ of a data point's neighborhood. The value of ϵ should be dataset-dependent. For example, in [4], ϵ is set to the distance at the top p% of all pairs' distances in X, where p is a parameter. The intuition of this method is to have ⌊p(n − 1)/200⌋ data points within a data point's neighborhood on average. However, this method tends to emphasize the dense regions and overlook the sparse regions in the dataset. We denote the radius derived using this method by ϵ_p. In [5], ϵ is set to the mean plus one standard deviation of all data points' distances to their respective k-th nearest neighbors (see Equation (8)). This method is sensitive to the outliers in the dataset and to the value of k. We denote the radius derived using this method by ϵ_k.
To avoid the shortcomings of the above two methods, we integrate both and propose a new method, shown in Algorithm 1. The new method requires two parameters, k and P. First, it collects the distance of each data point to its k-th nearest neighbor. Then, it sorts these distances in ascending order and sets ϵ to the value at the P-th percentile location, i.e., the ⌈P × n/100⌉-th distance, where n is the number of data points in the dataset. This new method considers each data point's k-th nearest neighbor instead of the top p% of all pairs' distances. Thus, it is less likely to overlook the sparse regions in the dataset. Furthermore, because the new method does not use the mean and standard deviation, it is less sensitive to outliers than the second method. We denote the radius derived using this method by ϵ_kP.
Algorithm 1: The proposed method to derive ϵ.
Input: the set of data points X ∈ ℝ^{n×m}, k, and P
Output: the radius ϵ
1. Set S = {δ_i^k | x_i ∈ X}, where δ_i^k is the distance between x_i and its k-th nearest neighbor.
2. Sort the elements in S in ascending order.
3. Set s = ⌈P × n / 100⌉.
4. Set ϵ to the s-th element in S.
5. Return ϵ.
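A sketch of Algorithm 1 in Python (the ceiling in step 3 follows the ⌈P × n/100⌉ reading above; brute-force distances):

```python
import numpy as np

def derive_eps(X, k, P):
    """Algorithm 1: eps is the P-th percentile of the distances between
    all data points and their k-th nearest neighbors."""
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    delta_k = np.sort(dist, axis=1)[:, k - 1]  # step 1: k-th NN distances
    delta_k.sort()                             # step 2: ascending order
    s = int(np.ceil(P * n / 100.0))            # step 3: s = ceil(P*n/100)
    return delta_k[s - 1]                      # step 4: the s-th element
```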
Finally, consider the contribution set C_i. As described in Section 3.1, N_k(x_i), X, and B_ϵ(x_i) are three commonly used values for C_i. Setting C_i = X allows every data point to contribute to ρ(x_i). It should be used only when the adopted c(x_i,x_j) is near zero for any data point x_j far from x_i (e.g., Equation (16) with a large m). For a data point x_i in a dense region, its k nearest neighbors are likely to be located within its neighborhood, i.e., N_k(x_i) ⊆ B_ϵ(x_i). However, for x_i in a sparse region, B_ϵ(x_i) ⊆ N_k(x_i) usually holds.
Using the product operator Π with C_i = X (i.e., ρ(x_i) = Π_{x_j∈X} c(x_i,x_j)) is a poor combination. Most of the data points in X are far from x_i; thus, this combination multiplies many small contributions c(x_i,x_j), rendering ρ(x_i) too small to represent the local density of x_i properly. In contrast, using the summation operator Σ with C_i = X does not cause such a problem.
Using the product operator Π with C_i = B_ϵ(x_i) can also produce strange results. For example, let h be the current local density of x_i, and let y ∉ X be a new data point such that d(x_i, y) is less than the distance between x_i and x_i's nearest neighbor in X. Intuitively, adding y to X should increase the local density of x_i. However, according to Equation (16), c(x_i,x_j) is between 0 and 1 for any two data points x_i and x_j. Thus, with the addition of y to X, the local density of x_i becomes h·c(x_i,y), which is less than the original local density h. Thus, the combination of the product operator Π and C_i = B_ϵ(x_i) is also a poor definition of local density.

5. Experiment

5.1. Experiment Design

For brevity, we use a tuple with four components to describe a definition of local density: the first component indicates the integration operator, the second the contribution set, and the third and fourth the exponent m and the radius ϵ of the contribution function, respectively. For example, the row for Equation (7) in Table 2 can be represented as (Σ, N_k, 2, ϵ_k). This representation facilitates modifying an existing definition to create new ones. For example, (Π, N_k, 2, ϵ_k), (Σ, N_k, 20, ϵ_k), and (Σ, N_k, 2, ϵ_kP) are three new definitions modified from (Σ, N_k, 2, ϵ_k).
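A hypothetical helper mapping such a tuple onto the canonical-form sketch from Section 3.1 (it reuses canonical_density, knn_set, and ball_set from that sketch; all names are ours):

```python
import numpy as np

def density_from_tuple(X, op, cset, m, eps, k=10):
    """Evaluate a (operator, contribution set, m, eps) definition."""
    if cset == "Nk":
        C = knn_set(k)                       # C_i = N_k(x_i)
    elif cset == "ball":
        C = ball_set(eps)                    # C_i = B_eps(x_i)
    else:                                    # C_i = X (minus x_i itself)
        C = lambda i, dist: np.delete(np.arange(len(dist)), i)
    return canonical_density(X, C, m=m, eps=eps, op=op)

# The benchmark definition (Sigma, N_k, 2, eps_k) from [5]:
# rho = density_from_tuple(X, "sum", "Nk", 2, eps_from_kth_nn(X, 10))
```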
This experiment comprises four tests. In each test, we use the definition (Σ, N_k, 2, ϵ_k) proposed in [5] as the benchmark and vary one component of the tuple to study how that component affects the results. In Test 1, we compare the three ways (i.e., ϵ_p, ϵ_k, and ϵ_kP, described in Section 4) to derive the radius ϵ. Here, ϵ_p and ϵ_kP are derived by setting the parameters p = 2 and P = 75, respectively, and k is set from 5 to 50 in steps of 5 for both ϵ_k and ϵ_kP. Test 2 compares the three definitions (Σ, N_k, 2, ϵ_k), (Σ, X, 2, ϵ_k), and (Σ, B_ϵ(x_i), 2, ϵ_k) to study the impact of different contribution sets C_i. Test 3 compares the three definitions (Σ, N_k, 2, ϵ_k), (Σ, N_k, 4, ϵ_k), and (Σ, N_k, 8, ϵ_k) to study the impact of different values of the exponent m. Test 4 compares the two definitions (Σ, N_k, 2, ϵ_k) and (Π, N_k, 2, ϵ_k) to study the impact of different integration operators. In Tests 2 to 4, the parameter k is set to 10 to derive ϵ_k and N_k.
This experiment uses 16 well-known two-dimensional synthetic datasets. Table 3 shows the number of points and the number of clusters in these datasets.

5.2. Test 1: Comparing the Radii ϵ_p, ϵ_k, and ϵ_kP

Test 1 compares the radii ϵ_p, ϵ_k, and ϵ_kP derived by the three methods described in Section 4. Obviously, increasing p and P increases ϵ_p and ϵ_kP, respectively.
Figure 3 shows the values of ϵ_p, ϵ_k, and ϵ_kP obtained by setting p = 2, P = 75, and k = 5 to 50 in steps of 5. The larger the value of k, the larger the values of ϵ_k and ϵ_kP. In most cases, ϵ_k > ϵ_kP. For smaller datasets, ϵ_p tends to be smaller than ϵ_k and ϵ_kP. It appears that the size of the dataset influences ϵ_p, ϵ_k, and ϵ_kP differently. Let n denote the size of the dataset X. The number of possible pairs of data points in X is n(n−1)/2. Since ϵ_p is set to the (n(n−1)/2 × p/100)-th smallest value of all pairs' distances in X, the rank that determines ϵ_p grows quadratically with n. In contrast, ϵ_kP is set to the (n × P/100)-th smallest value of the distances between all data points and their k-th nearest neighbors, so the rank that determines ϵ_kP grows only linearly with n. Thus, the dataset size has a greater impact on ϵ_p than on ϵ_kP.
Two small datasets (Compound and Path_based) and two large datasets (D31 and A2) are selected to show the impact of the dataset size on ϵ_p, ϵ_k, and ϵ_kP. Three definitions, (Σ, N_k, 2, ϵ_k), (Σ, N_k, 2, ϵ_p), and (Σ, N_k, 2, ϵ_kP), are used to calculate each data point's local density, where the values of ϵ_p, ϵ_k, and ϵ_kP (shown in Table 4) are derived by setting p = 2, P = 75, and k = 10. Notably, (Σ, N_k, 2, ϵ_k) is the definition proposed in [5].
In Figure 4, the color scale legend to the right of each subfigure indicates the local density. For the two small datasets (Compound and Path_based), ϵ_p < ϵ_kP < ϵ_k, and thus using ϵ = ϵ_k or ϵ_kP results in more data points with high local density than using ϵ = ϵ_p, as shown in the upper two rows of Figure 4. In contrast, for the two large datasets (D31 and A2), ϵ_p > ϵ_k > ϵ_kP, and thus using ϵ = ϵ_p results in more data points with high local density than using ϵ = ϵ_k or ϵ_kP, as shown in the lower two rows of Figure 4.

5.3. Test 2: Impact of the Contribution Set C i on Local Density

Test 2 adopts three definitions, (Σ, N_k, 2, ϵ_k), (Σ, X, 2, ϵ_k), and (Σ, B_ϵ(x_i), 2, ϵ_k), to calculate local density and evaluates the impact of using different values for C_i. Here, k is set to 10 to derive ϵ_k and N_k. The results are shown in Figure 5, where the subfigures in the same row are the results for a dataset and the subfigures in the same column are the results using the same method to determine C_i.
In Figure 5, the color scale legend to the right of each subfigure indicates the local density. A large local density range is usually preferred because it provides more discrepancy when comparing the local density among data points. Using C_i = X yields a larger local density range than using C_i = B_ϵ(x_i) or C_i = N_k(x_i) because it combines all data points' contributions and Test 2 adopts the summation operator. Using C_i = N_k(x_i) results in a much smaller range of local density than using C_i = B_ϵ(x_i), indicating that, for a data point x_i in a densely-populated region, N_k(x_i) ⊆ B_ϵ(x_i) usually holds.
In the literature, all kNN-based methods (e.g., Equations (4), (6), and (7) in Table 2) adopt C_i = N_k(x_i) to calculate the local density. Figure 5 shows that replacing C_i = N_k(x_i) with C_i = B_ϵ(x_i) or C_i = X enlarges the range of local density. Using C_i = N_k(x_i) tends to place more data points within the high-density regions (see the subfigures in column 3 of Figure 5). For example, the subfigure for the Flame dataset using C_i = N_k(x_i) shows that a majority of the data points have high local densities, making it difficult to partition the two densely-populated regions in the dataset. It is better to have each densely-populated region surrounded by low-density data points to facilitate clustering, e.g., the subfigure for the Aggregation dataset using C_i = B_ϵ(x_i). Therefore, overall, using C_i = B_ϵ(x_i) is preferred.
However, for datasets containing both high-density and low-density clusters (e.g., the Path_based and Unbalance datasets in the last two rows of Figure 5), using C_i = N_k(x_i) or C_i = B_ϵ(x_i) tends to yield very low local density for the data points in the low-density clusters. A density-based clustering algorithm must handle this situation carefully to avoid omitting the low-density clusters.

5.4. Test 3: Impact of the Exponent m on Local Density

Test 3 varies the value of m in the contribution function c(x_i,x_j) = e^{-(d(x_i,x_j)/ϵ)^m} to study the impact of m on the local density. Specifically, we compare three definitions, (Σ, N_k, 2, ϵ_k), (Σ, N_k, 4, ϵ_k), and (Σ, N_k, 8, ϵ_k), where k is set to 10 to derive ϵ_k and N_k. The results are shown in Figure 6, where the subfigures in the same row are the results for a dataset, and the subfigures in the same column are the results using the same value of m.
Comparing the subfigures in the same row of Figure 6 shows that a larger m causes more data points to have a higher local density. For datasets with well-separated clusters (e.g., R15), using a large m helps identify the cores of the clusters. However, for datasets with poorly separated clusters (e.g., S4), using a large m makes it challenging to spot the boundary between two adjacent clusters. For datasets containing both high-density and low-density clusters (e.g., Unbalance), the impact of m on the local density is not significant.

5.5. Test 4: Impact of the Integration Operator ( Π or Σ ) on Local Density

Test 4 studies the impact of the integration operator (Π or Σ) using the two definitions (Σ, N_k, 2, ϵ_k) and (Π, N_k, 2, ϵ_k) to calculate local density. As in Tests 2 and 3, k is set to 10 to derive ϵ_k and N_k. The results are shown in Figure 7, where the subfigures in the same column are the results using the same integration operator.
The contribution function c(x_i,x_j) in Equation (16) yields a value between 0 and 1, so using the product operator Π to integrate the data points' contributions results in a smaller local density than using the summation operator Σ. Using Π tends to leave only a small portion of data points with a higher local density, and thus it helps to identify the density peaks in the dataset. However, for datasets containing both high-density and low-density clusters (e.g., the Path_based and Unbalance datasets), using Π cannot find the density peaks in the low-density clusters.

6. Conclusions

In this study, we first divided the existing definitions of local density into two categories, radius-based and kNN-based, and showed that a kNN-based definition is implicitly radius-based. We then proposed a canonical form that decomposes the definition of local density into three parts: the integration operator (Σ or Π), the contribution set C_i, and the contribution function c(x_i,x_j). Furthermore, the contribution function can be controlled with a radius ϵ and an exponent m. Thus, a definition of local density can be represented as a four-component tuple (Σ or Π, C_i, m, ϵ), from which new definitions can be derived. Based on our analysis and experiments, we conclude with the following guidelines for developing new definitions of local density:
(Π, B_ϵ(x_i), *, *) and (Π, X, *, *) should be avoided because they can produce results contradicting the notion of local density; for example, they can assign a low density to a should-be high-density data point. Here, '*' represents a don't-care term;
The product operator Π should be used only when the size of the contribution set C_i is fixed for every data point, e.g., C_i = N_k(x_i);
In most cases, the summation operator Σ should be adopted; however, the product operator Π helps to identify the density peaks in a dataset;
The value of ϵ should be dataset-dependent, e.g., ϵ_p, ϵ_k, and ϵ_kP. Notably, ϵ_p is sensitive to the dataset's size, ϵ_k is sensitive to the parameter k and to outliers in the dataset, and ϵ_kP provides a compromise between them;
The value of m should be ≥ 2 so that the contribution function c(x_i,x_j) has an inflection point at d(x_i,x_j)/ϵ = ((m−1)/m)^{1/m}. The greater the value of m, the closer the inflection point is to d(x_i,x_j)/ϵ = 1.
Notably, the above (Σ or Π, C_i, m, ϵ) representation assumes that the contribution function c(x_i,x_j) = e^{-(d(x_i,x_j)/ϵ)^m} is adopted. That is, given the parameters m and ϵ, the value of c(x_i,x_j) depends only on the distance d(x_i,x_j). However, in recent studies [8,10], c(x_i,x_j) may involve not only x_i and x_j but also their k nearest neighbors. In such cases, a three-component tuple (Σ or Π, C_i, c(x_i,x_j)) should be adopted to represent a definition of local density, where c(x_i,x_j) may require additional parameters, e.g., k for the k nearest neighbors. Furthermore, c(x_i,x_j) can incorporate a symmetric distance based on the mutual k nearest neighbors of x_i and x_j, as was done in [10]. Other symmetric distance measures can also be adopted.
With only a single definition of local density, it can be challenging to identify clusters in a dataset containing clusters of different densities. Future studies can address how to apply the proposed canonical form to this problem. For example, a stepwise approach can be adopted, where each step uses a different definition of local density to target clusters with a specific feature. The proposed canonical form facilitates changing the density definition at different stages of a clustering approach. The effective integration of the canonical form and a clustering approach remains under-studied.

Funding

This research is supported by the Ministry of Science and Technology, Taiwan, under Grant MOST 108-2221-E-155-013.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data available in a publicly accessible repository. Please refer to the references in Table 3 for availability.

Acknowledgments

The author acknowledges the Innovation Center for Big Data and Digital Convergence at Yuan Ze University for supporting this study.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques, 3rd ed.; Morgan Kaufmann Publishers Inc.: Waltham, MA, USA, 2011.
2. Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 226–231.
3. Ankerst, M.; Breunig, M.M.; Kriegel, H.-P.; Sander, J. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, USA, 1–3 June 1999; pp. 49–60.
4. Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492–1496.
5. Liu, Y.; Ma, Z.; Fang, Y. Adaptive density peak clustering based on K-nearest neighbors with aggregating strategy. Knowl. Based Syst. 2017, 133, 208–220.
6. Xie, J.; Gao, H.; Xie, W.; Liu, X.; Grant, P.W. Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors. Inf. Sci. 2016, 354, 19–40.
7. Du, M.; Ding, S.; Jia, H. Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowl. Based Syst. 2016, 99, 135–145.
8. Liu, Y.; Liu, D.; Yu, F.; Ma, Z. A Double-Density Clustering Method Based on "Nearest to First in" Strategy. Symmetry 2020, 12, 747.
9. Lin, J.-L.; Kuo, J.-C.; Chuang, H.-W. Improving Density Peak Clustering by Automatic Peak Selection and Single Linkage Clustering. Symmetry 2020, 12, 1168.
10. Lv, Y.; Liu, M.; Xiang, Y. Fast Searching Density Peak Clustering Algorithm Based on Shared Nearest Neighbor and Adaptive Clustering Center. Symmetry 2020, 12, 2014.
11. Chang, H.; Yeung, D.-Y. Robust path-based spectral clustering. Pattern Recognit. 2008, 41, 191–203.
12. Fu, L.; Medico, E. FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinform. 2007, 8, 3.
13. Gionis, A.; Mannila, H.; Tsaparas, P. Clustering aggregation. ACM Trans. Knowl. Discov. Data 2007, 1, 4.
14. Jain, A.K.; Law, M.H. Data clustering: A user's dilemma. In Proceedings of the 2005 International Conference on Pattern Recognition and Machine Intelligence, Kolkata, India, 20–22 December 2005; pp. 1–10.
15. Veenman, C.J.; Reinders, M.J.T.; Backer, E. A maximum variance cluster algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 1273–1280.
16. Zahn, C.T. Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters. IEEE Trans. Comput. 1971, 100, 68–86.
17. Kärkkäinen, I.; Fränti, P. Dynamic Local Search Algorithm for the Clustering Problem; A-2002-6; University of Joensuu: Joensuu, Finland, 2002.
18. Fränti, P.; Virmajoki, O. Iterative shrinking method for clustering problems. Pattern Recognit. 2006, 39, 761–775.
19. Rezaei, M.; Fränti, P. Set Matching Measures for External Cluster Validity. IEEE Trans. Knowl. Data Eng. 2016, 28, 2173–2186.
Figure 1. The horizontal axis is d(x_i, x_j)/ϵ, and the vertical axis shows the values of e^{-(d(x_i,x_j)/ϵ)^2} (in red) and its first (in blue) and second (in purple) derivatives with respect to d(x_i, x_j)/ϵ.
Figure 2. The contribution c(x_i, x_j) for different values of m. The horizontal axis is d(x_i, x_j)/ϵ, and the vertical axis is the contribution c(x_i, x_j) = e^{-(d(x_i,x_j)/ϵ)^m}, as defined in Equation (16).
Figure 3. The radii ϵ_p, ϵ_k, and ϵ_kP for p = 2, P = 75, and k = 5 to 50. The horizontal axis is the value of k, and the vertical axis is the value of the radius.
Figure 4. The local densities calculated using ϵ_p, ϵ_k, or ϵ_kP for the data points in four datasets (Path_based, Compound, D31, and A2). The horizontal and vertical coordinates show the positions of the data points, and the color indicates the local density.
Figure 5. The local densities calculated using C_i = X, B_ϵ(x_i), or N_k(x_i). The horizontal and vertical coordinates show the positions of the data points, and the color indicates the local density.
Figure 6. The local densities calculated using m = 2, 4, or 8 in c(x_i, x_j). The horizontal and vertical coordinates show the positions of the data points, and the color indicates the local density.
Figure 7. The local densities calculated using different integration operators (Π or Σ). The horizontal and vertical coordinates show the positions of the data points, and the color indicates the local density.
Table 1. Notations.
X = {x_1, …, x_n}: the dataset of n data points to be clustered
ρ(x_i): the local density of a data point x_i ∈ X
d(x_i, x_j): the distance between two data points x_i and x_j
ϵ: the radius of a data point's neighborhood
ϵ_p: the radius derived from the top p% of all pairs' distances (first used in Section 4)
ϵ_k: the radius derived using the parameter k and Equation (8) (first used in Section 4)
ϵ_kP: the radius derived using the P-th percentile of the distances between all data points and their k-th nearest neighbors (first used in Section 4)
N_k(x_i): the set of k nearest neighbors of x_i (first used in Equation (4))
R_k(x_i): the set of reverse k nearest neighbors of x_i (first used in Equation (12))
y_i^j: the j-th nearest neighbor of x_i (first used in Section 2.2)
δ_i^j: the distance between x_i and its j-th nearest neighbor y_i^j (first used in Equation (8))
C_i: the set of data points that contribute to the density of x_i (first used in Equation (17))
c(x_i, x_j): the contribution of x_j to the density of x_i (first used in Equation (17))
Table 2. Equations (3), (4), (6), (7) and (19)–(21) fit the canonical form defined in Equations (16)–(18).
Equation | Π or Σ | C_i | c(x_i, x_j) | m | ϵ
(19) ρ(x_i) = Σ_{x_j∈X} e^{-(d(x_i,x_j)/ϵ)^∞} | Σ | X | e^{-(d(x_i,x_j)/ϵ)^∞} | ∞ | ϵ is set to the distance at the top p% of all pairs' distances in X, where p is a parameter [4]
(3) ρ(x_i) = Σ_{x_j∈X} e^{-(d(x_i,x_j)/ϵ)^2} | Σ | X | e^{-(d(x_i,x_j)/ϵ)^2} | 2 | ϵ is set to the distance at the top p% of all pairs' distances in X, where p is a parameter [4]
(4) ρ(x_i) = Σ_{x_j∈N_k(x_i)} e^{-(d(x_i,x_j)/1)^1} | Σ | N_k(x_i) | e^{-(d(x_i,x_j)/1)^1} | 1 | 1
(6) ρ(x_i) = Π_{x_j∈N_k(x_i)} e^{-(d(x_i,x_j)/√k)^2} | Π | N_k(x_i) | e^{-(d(x_i,x_j)/√k)^2} | 2 | √k
(7) ρ(x_i) = Σ_{x_j∈N_k(x_i)} e^{-(d(x_i,x_j)/ϵ)^2} | Σ | N_k(x_i) | e^{-(d(x_i,x_j)/ϵ)^2} | 2 | ϵ is derived from the distance between each data point and its k-th nearest neighbor using Equation (8) [5]
(20) ρ(x_i) = Σ_{x_j∈R_k(x_i)} 1 | Σ | R_k(x_i) | 1 | - | -
(21) ρ(x_i) = Σ_{x_j∈L(x_i)} Sim(x_i,x_j) | Σ | N_k(x_i) ∩ R_k(x_i) | Sim(x_i,x_j) | - | -
Table 3. Number of points and number of clusters in the 16 synthetic datasets.
Dataset | Number of Clusters | Number of Points
Spiral [11] | 3 | 312
Flame [12] | 2 | 240
Aggregation [13] | 7 | 788
Jain [14] | 2 | 373
D31 [15] | 31 | 3100
R15 [15] | 15 | 600
Compound [16] | 6 | 399
A1 [17] | 20 | 3000
A2 [17] | 35 | 5250
A3 [17] | 50 | 7500
S1 [18] | 15 | 5000
S2 [18] | 15 | 5000
S3 [18] | 15 | 5000
S4 [18] | 15 | 5000
Path_based [11] | 3 | 300
Unbalance [19] | 8 | 6500
Table 4. ϵ_p (p = 2), ϵ_k (k = 10), and ϵ_kP (k = 10 and P = 75) for four datasets.
Dataset | Compound | Path_based | D31 | A2
ϵ_p | 0.182606 | 0.223688 | 0.203595 | 0.206687
ϵ_kP | 0.280839 | 0.522962 | 0.094954 | 0.071405
ϵ_k | 0.430744 | 0.558793 | 0.114488 | 0.088676
