# Generalizing Local Density for Density-Based Clustering


## Abstract


## 1. Introduction

## 2. Review on Local Density

#### 2.1. Radius-Based Local Density

#### 2.2. kNN-Based Local Density

## 3. Canonical Form for Local Density

#### 3.1. Canonical Form

#### 3.2. Fitting the Existing Definitions to the Canonical Form

## 4. Derive New Definitions Using the Canonical Form

**Algorithm 1:** The proposed method to derive $\epsilon$.

**Input:** the set of data points $X \in \mathbb{R}^{n \times m}$, $k$, and $P$.
**Output:** the radius $\epsilon$.

1. Set $S = \{\delta_i^k \mid x_i \in X\}$, where $\delta_i^k$ is the distance between $x_i$ and its k-th nearest neighbor.
2. Sort the elements of $S$ in ascending order.
3. Set $s = \lceil \frac{P \times n}{100} \rceil$.
4. Set $\epsilon$ to the s-th element of $S$.
5. Return $\epsilon$.
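The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration, not the paper's reference implementation: the function name is made up, distances are Euclidean, and the pairwise-distance matrix is built by brute force.

```python
import numpy as np

def derive_epsilon_kP(X, k, P):
    """Derive epsilon as the P-th percentile of the distances between
    each point and its k-th nearest neighbor (Algorithm 1 sketch)."""
    n = X.shape[0]
    # Brute-force pairwise Euclidean distances, shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # delta_i^k: each row sorted ascending; index 0 is the point itself
    # (distance 0), so index k is the k-th nearest neighbor.
    delta_k = np.sort(dist, axis=1)[:, k]
    # Steps 2-4: sort S, take the s-th element (1-based, s = ceil(P*n/100)).
    S = np.sort(delta_k)
    s = int(np.ceil(P * n / 100.0))
    return S[s - 1]
```

With P = 100 this returns the largest k-th-neighbor distance; smaller P discards the largest (outlier-driven) values, which is how $\epsilon_{kP}$ compromises between $\epsilon_p$ and $\epsilon_k$.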

## 5. Experiment

#### 5.1. Experiment Design

#### 5.2. Test 1: Comparing the Radii $\epsilon_p$, $\epsilon_k$, and $\epsilon_{kP}$

#### 5.3. Test 2: Impact of the Contribution Set $C_i$ on Local Density

#### 5.4. Test 3: Impact of the Exponent m on Local Density

#### 5.5. Test 4: Impact of the Integration Operator ($\Pi$ or $\Sigma$) on Local Density

## 6. Conclusions

- ($\Pi$, $B_\epsilon(x_i)$, \*, \*) and ($\Pi$, $X$, \*, \*) should be avoided because they can produce results that contradict the notion of local density; for example, they can assign a low density to a data point that should have a high density. Here, '\*' denotes a don't-care term;
- The product operator $\Pi$ should be used only when the size of the contribution set $C_i$ is fixed for every data point, e.g., $C_i = N_k(x_i)$;
- In most cases, the summation operator $\Sigma$ should be adopted; however, the product operator $\Pi$ helps to identify the density peaks in a dataset;
- The value of $\epsilon$ should be dataset-dependent, e.g., $\epsilon_p$, $\epsilon_k$, and $\epsilon_{kP}$. Notably, $\epsilon_p$ is sensitive to the dataset's size, $\epsilon_k$ is sensitive to the parameter k and to outliers in the dataset, and $\epsilon_{kP}$ provides a compromise between them;
- The value of m should be at least 2 so that the contribution function $c(x_i, x_j)$ has an inflection point at $\frac{\mathrm{d}(x_i, x_j)}{\epsilon} = \sqrt[m]{\frac{m-1}{m}}$. The greater the value of m, the closer the inflection point is to $\frac{\mathrm{d}(x_i, x_j)}{\epsilon} = 1$.
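The canonical form summarized in these conclusions, $\rho(x_i) = \mathrm{OP}_{x_j \in C_i}\, e^{-(\mathrm{d}(x_i, x_j)/\epsilon)^m}$, can be sketched as follows. The function name and keyword arguments are illustrative assumptions, not the paper's API; distances are Euclidean and brute-force.

```python
import numpy as np

def local_density(X, epsilon, m=2, contribution="all", k=None, op="sum"):
    """Local density in the canonical form (operator, C_i, c, m, epsilon).
    contribution: 'all' -> C_i = X, 'ball' -> B_eps(x_i), 'knn' -> N_k(x_i).
    op: 'sum' for the Sigma operator, anything else for Pi (product)."""
    n = X.shape[0]
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    c = np.exp(-((d / epsilon) ** m))        # contribution c(x_i, x_j)
    rho = np.empty(n)
    for i in range(n):
        if contribution == "all":
            idx = np.arange(n)               # C_i = X
        elif contribution == "ball":
            idx = np.where(d[i] <= epsilon)[0]   # C_i = B_eps(x_i)
        else:                                # C_i = N_k(x_i), self excluded
            idx = np.argsort(d[i])[1:k + 1]
        rho[i] = c[i, idx].sum() if op == "sum" else c[i, idx].prod()
    return rho
```

Note how the product operator interacts with a variable-sized $C_i$: with `contribution="ball"`, a point with many in-ball neighbors multiplies many factors below 1, which is exactly the contradiction the first bullet warns about; with `contribution="knn"` every point multiplies the same number of factors, so $\Pi$ is safe.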

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Han, J.; Kamber, M.; Pei, J. *Data Mining: Concepts and Techniques*, 3rd ed.; Morgan Kaufmann Publishers Inc.: Waltham, MA, USA, 2011.
2. Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 226–231.
3. Ankerst, M.; Breunig, M.M.; Kriegel, H.-P.; Sander, J. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, USA, 1–3 June 1999; pp. 49–60.
4. Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. *Science* **2014**, *344*, 1492–1496.
5. Liu, Y.; Ma, Z.; Fang, Y. Adaptive density peak clustering based on K-nearest neighbors with aggregating strategy. *Knowl. Based Syst.* **2017**, *133*, 208–220.
6. Xie, J.; Gao, H.; Xie, W.; Liu, X.; Grant, P.W. Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors. *Inf. Sci.* **2016**, *354*, 19–40.
7. Du, M.; Ding, S.; Jia, H. Study on density peaks clustering based on k-nearest neighbors and principal component analysis. *Knowl. Based Syst.* **2016**, *99*, 135–145.
8. Liu, Y.; Liu, D.; Yu, F.; Ma, Z. A Double-Density Clustering Method Based on "Nearest to First in" Strategy. *Symmetry* **2020**, *12*, 747.
9. Lin, J.-L.; Kuo, J.-C.; Chuang, H.-W. Improving Density Peak Clustering by Automatic Peak Selection and Single Linkage Clustering. *Symmetry* **2020**, *12*, 1168.
10. Lv, Y.; Liu, M.; Xiang, Y. Fast Searching Density Peak Clustering Algorithm Based on Shared Nearest Neighbor and Adaptive Clustering Center. *Symmetry* **2020**, *12*, 2014.
11. Chang, H.; Yeung, D.-Y. Robust path-based spectral clustering. *Pattern Recognit.* **2008**, *41*, 191–203.
12. Fu, L.; Medico, E. FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data. *BMC Bioinform.* **2007**, *8*, 3.
13. Gionis, A.; Mannila, H.; Tsaparas, P. Clustering aggregation. *ACM Trans. Knowl. Discov. Data* **2007**, *1*, 4.
14. Jain, A.K.; Law, M.H. Data clustering: A user's dilemma. In Proceedings of the 2005 International Conference on Pattern Recognition and Machine Intelligence, Kolkata, India, 20–22 December 2005; pp. 1–10.
15. Veenman, C.J.; Reinders, M.J.T.; Backer, E. A maximum variance cluster algorithm. *IEEE Trans. Pattern Anal. Mach. Intell.* **2002**, *24*, 1273–1280.
16. Zahn, C.T. Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters. *IEEE Trans. Comput.* **1971**, *100*, 68–86.
17. Kärkkäinen, I.; Fränti, P. *Dynamic Local Search Algorithm for the Clustering Problem*; Technical Report A-2002-6; University of Joensuu: Joensuu, Finland, 2002.
18. Fränti, P.; Virmajoki, O. Iterative shrinking method for clustering problems. *Pattern Recognit.* **2006**, *39*, 761–775.
19. Rezaei, M.; Fränti, P. Set Matching Measures for External Cluster Validity. *IEEE Trans. Knowl. Data Eng.* **2016**, *28*, 2173–2186.

**Figure 1.** The horizontal axis is $\frac{\mathrm{d}(x_i, x_j)}{\epsilon}$, and the vertical axis shows the value of $e^{-\left(\frac{\mathrm{d}(x_i, x_j)}{\epsilon}\right)^2}$ (in red) and its first (in blue) and second (in purple) derivatives with respect to $\frac{\mathrm{d}(x_i, x_j)}{\epsilon}$.

**Figure 2.** The contribution $c(x_i, x_j)$ for different values of m. The horizontal axis is $\frac{\mathrm{d}(x_i, x_j)}{\epsilon}$, and the vertical axis is the contribution $c(x_i, x_j) = e^{-\left(\frac{\mathrm{d}(x_i, x_j)}{\epsilon}\right)^m}$, as defined in Equation (16).
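The inflection point visible in Figure 2 (and cited in the conclusions) follows from differentiating the contribution function twice with respect to $t = \frac{\mathrm{d}(x_i, x_j)}{\epsilon}$:

```latex
c(t)   = e^{-t^{m}}, \qquad
c'(t)  = -m\,t^{m-1}\,e^{-t^{m}}, \qquad
c''(t) = m\,t^{m-2}\,e^{-t^{m}}\bigl(m\,t^{m} - (m-1)\bigr).
```

For $m \ge 2$ and $t > 0$, $c''(t) = 0$ exactly when $t^m = \frac{m-1}{m}$, i.e., at $t = \sqrt[m]{\frac{m-1}{m}}$, which approaches 1 as $m \to \infty$.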

**Figure 3.** The radii $\epsilon_p$, $\epsilon_k$, and $\epsilon_{kP}$ for p = 2, P = 75, and k = 5 to 50. The horizontal axis is the value of k, and the vertical axis is the value of the radius.

**Figure 4.** The local densities calculated using $\epsilon_p$, $\epsilon_k$, or $\epsilon_{kP}$ for the data points in four datasets (i.e., Path_based, Compound, D31, and A2). The horizontal and vertical coordinates show the positions of the data points, and the color indicates the value of the local density.

**Figure 5.** The local densities calculated using $C_i = X$, $B_\epsilon(x_i)$, or $N_k(x_i)$. The horizontal and vertical coordinates show the positions of the data points, and the color indicates the value of the local density.

**Figure 6.** The local densities calculated using m = 2, 4, or 10 in $c(x_i, x_j)$. The horizontal and vertical coordinates show the positions of the data points, and the color indicates the value of the local density.

**Figure 7.** The local densities calculated using different integration operators ($\Pi$ or $\Sigma$). The horizontal and vertical coordinates show the positions of the data points, and the color indicates the value of the local density.

**Table 1.** Symbols used in this paper.

Symbol | Meaning
---|---
$X = \{x_1, \cdots, x_n\}$ | the dataset of n data points to be clustered
$\rho(x_i)$ | the local density of a data point $x_i \in X$
$\mathrm{d}(x_i, x_j)$ | the distance between two data points $x_i$ and $x_j$
$\epsilon$ | the radius of a data point's neighborhood
$\epsilon_p$ | the radius derived from the top p% of all pairwise distances (first used in Section 4)
$\epsilon_k$ | the radius derived using the parameter k and Equation (8) (first used in Section 4)
$\epsilon_{kP}$ | the radius derived using the P-th percentile of the distances between all data points and their k-th nearest neighbors (first used in Section 4)
$N_k(x_i)$ | the set of k nearest neighbors of $x_i$ (first used in Equation (4))
$R_k(x_i)$ | the set of reverse k nearest neighbors of $x_i$ (first used in Equation (12))
$y_i^j$ | the j-th nearest neighbor of $x_i$ (first used in Section 2.2)
$\delta_i^j$ | the distance between $x_i$ and its j-th nearest neighbor $y_i^j$ (first used in Equation (8))
$C_i$ | the set of data points that contribute to the density of $x_i$ (first used in Equation (17))
$c(x_i, x_j)$ | the contribution of $x_j$ to the density of $x_i$ (first used in Equation (17))

**Table 2.** Equations (3), (4), (6), (7) and (19)–(21) fitted to the canonical form defined in Equations (16)–(18).

Equation | $\Pi$ or $\Sigma$ | $C_i$ | $c(x_i, x_j)$ | $m$ | $\epsilon$
---|---|---|---|---|---
(19) $\sum_{x_j \in X} e^{-\left(\frac{\mathrm{d}(x_i, x_j)}{\epsilon}\right)^{\infty}}$ | $\Sigma$ | $X$ | $e^{-\left(\frac{\mathrm{d}(x_i, x_j)}{\epsilon}\right)^{\infty}}$ | $\infty$ | $\epsilon$ is set to the distance at the top p% of all pairwise distances in $X$, where p is a parameter [4].
(3) $\sum_{x_j \in X} e^{-\left(\frac{\mathrm{d}(x_i, x_j)}{\epsilon}\right)^{2}}$ | $\Sigma$ | $X$ | $e^{-\left(\frac{\mathrm{d}(x_i, x_j)}{\epsilon}\right)^{2}}$ | 2 | $\epsilon$ is set to the distance at the top p% of all pairwise distances in $X$, where p is a parameter [4].
(4) $\sum_{x_j \in N_k(x_i)} e^{-\left(\frac{\mathrm{d}(x_i, x_j)}{1}\right)}$ | $\Sigma$ | $N_k(x_i)$ | $e^{-\left(\frac{\mathrm{d}(x_i, x_j)}{1}\right)}$ | 1 | 1
(6) $\prod_{x_j \in N_k(x_i)} e^{-\left(\frac{\mathrm{d}(x_i, x_j)}{\sqrt{k}}\right)^{2}}$ | $\Pi$ | $N_k(x_i)$ | $e^{-\left(\frac{\mathrm{d}(x_i, x_j)}{\sqrt{k}}\right)^{2}}$ | 2 | $\sqrt{k}$
(7) $\sum_{x_j \in N_k(x_i)} e^{-\left(\frac{\mathrm{d}(x_i, x_j)}{\epsilon}\right)^{2}}$ | $\Sigma$ | $N_k(x_i)$ | $e^{-\left(\frac{\mathrm{d}(x_i, x_j)}{\epsilon}\right)^{2}}$ | 2 | $\epsilon$ is derived from the distance between each data point and its k-th nearest neighbor using Equation (8) [5].
(20) $\sum_{x_j \in R_k(x_i)} 1$ | $\Sigma$ | $R_k(x_i)$ | 1 | |
(21) $\sum_{x_j \in L(x_i)} \mathrm{Sim}(x_i, x_j)$ | $\Sigma$ | $N_k(x_i) \cap R_k(x_i)$ | $\mathrm{Sim}(x_i, x_j)$ | |
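Among the fitted definitions, Equation (20) is the only purely count-based one: the density of $x_i$ is simply $|R_k(x_i)|$, the number of points that include $x_i$ among their own k nearest neighbors. A brute-force sketch (the function name is illustrative, distances are Euclidean):

```python
import numpy as np

def reverse_knn_density(X, k):
    """Density of Equation (20): rho(x_i) = |R_k(x_i)|, the number of
    points that count x_i among their own k nearest neighbors."""
    n = X.shape[0]
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    # knn[i] holds the indices of the k nearest neighbors of x_i
    # (index 0 of the sorted row is x_i itself, so it is skipped).
    knn = np.argsort(d, axis=1)[:, 1:k + 1]
    rho = np.zeros(n, dtype=int)
    for i in range(n):
        for j in knn[i]:
            rho[j] += 1        # x_i is a reverse neighbor of x_j
    return rho
```

Because every point contributes exactly k reverse-neighbor votes, the densities always sum to $n \times k$; isolated points receive few or no votes.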

**Table 3.** The datasets used in the experiments.

Dataset | Number of Clusters | Number of Points
---|---|---
Spiral [11] | 3 | 312
Flame [12] | 2 | 240
Aggregation [13] | 7 | 788
Jain [14] | 2 | 373
D31 [15] | 31 | 3100
R15 [15] | 15 | 600
Compound [16] | 6 | 399
A1 [17] | 20 | 3000
A2 [17] | 35 | 5250
A3 [17] | 50 | 7500
S1 [18] | 15 | 5000
S2 [18] | 15 | 5000
S3 [18] | 15 | 5000
S4 [18] | 15 | 5000
Path_based [11] | 3 | 300
Unbalance [19] | 8 | 6500

**Table 4.** $\epsilon_p$ (p = 2), $\epsilon_k$ (k = 10), and $\epsilon_{kP}$ (k = 10 and P = 75) for four datasets.

Dataset | Compound | Path_Based | D31 | A2
---|---|---|---|---
$\epsilon_p$ | 0.182606 | 0.223688 | 0.203595 | 0.206687
$\epsilon_{kP}$ | 0.280839 | 0.522962 | 0.094954 | 0.071405
$\epsilon_k$ | 0.430744 | 0.558793 | 0.114488 | 0.088676

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Lin, J.-L.
Generalizing Local Density for Density-Based Clustering. *Symmetry* **2021**, *13*, 185.
https://doi.org/10.3390/sym13020185
