# Common Nearest Neighbor Clustering—A Benchmark


## Abstract


## 1. Introduction

## 2. Molecular Dynamics Simulations

## 3. Materials and Methods

#### 3.1. Common Nearest Neighbor Algorithm

0. Choose R and N.
1. **Cluster initialization:** Choose an arbitrary data point ${\mathbf{x}}_{i}$ from the set of unclustered data points U as the first member of the cluster C. Set ${\mathbf{x}}_{\mathrm{current}}$ to ${\mathbf{x}}_{i}$ (first line in Figure 4g).
2. **Cluster expansion:**
   - (a) Add every unclustered data point that lies within $2R$ of ${\mathbf{x}}_{\mathrm{current}}$ and fulfils the density criterion with respect to ${\mathbf{x}}_{\mathrm{current}}$ to the cluster C (second line in Figure 4g), and remove it from U. Keep a list L of the newly added data points.
   - (b) Choose a data point ${\mathbf{x}}_{j}$ from L and set ${\mathbf{x}}_{\mathrm{current}}$ to ${\mathbf{x}}_{j}$.
   - (c) Repeat steps (a) and (b) until no further data point can be added to C (third and fourth line in Figure 4g).
   - (d) Save C to file and reset C to an empty list, $C=\left\{\right\}$.
3. Repeat steps 1 and 2 until all data points are assigned to a cluster.
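The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation [23]: the function name `cnn_cluster`, the brute-force distance matrix, the first-in-first-out order in which points are taken from L, and the `require_mutual` switch are all choices made here. The density criterion is read from the caption of Figure 4: two data points must share at least N neighbors within the radius R and, when `require_mutual` is set, must also lie within R of each other.

```python
from collections import deque

import numpy as np


def cnn_cluster(points, R, N, require_mutual=True):
    """Sketch of common nearest neighbor clustering (hypothetical helper).

    Returns a list of clusters, each a sorted list of point indices.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Brute-force pairwise Euclidean distances (O(n^2) memory, sketch only).
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Neighbors of each point within radius R, excluding the point itself.
    neighbors = [set(np.flatnonzero(dist[i] <= R).tolist()) - {i} for i in range(n)]

    def density_criterion(i, j):
        # Assumed reading of Figure 4: mutual neighbors and >= N shared neighbors.
        if require_mutual and dist[i, j] > R:
            return False
        return len(neighbors[i] & neighbors[j]) >= N

    unclustered = set(range(n))  # the set U
    clusters = []
    while unclustered:  # step 3: repeat until every point is assigned
        seed = min(unclustered)  # step 1: any choice works; min() is deterministic
        unclustered.remove(seed)
        cluster = [seed]  # the cluster C
        queue = deque([seed])  # the list L of points still to be visited
        while queue:  # steps 2a to 2c
            current = queue.popleft()
            candidates = [j for j in unclustered if dist[current, j] <= 2 * R]
            for j in candidates:
                if density_criterion(current, j):
                    unclustered.remove(j)
                    cluster.append(j)
                    queue.append(j)
        clusters.append(sorted(cluster))  # step 2d: store C, then start a new one
    return clusters
```

In this sketch, every data point ends up in some cluster, so isolated points form clusters of size one. Clusters smaller than a minimal size M (see Table 1) can then be relabeled as noise in a post-processing step.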

#### 3.2. Hierarchical Clustering

#### 3.3. Dataset Reduction

#### 3.4. Centroid Index at Cluster Level

#### 3.5. Markov Chain Monte Carlo Sampling

## 4. Results

#### 4.1. Benchmark

#### 4.1.1. Cluster Size and Shape

#### 4.1.2. Number of Clusters

#### 4.1.3. Unbalanced Data-Point Density

#### 4.1.4. Cluster Overlap

#### 4.1.5. Dimensionality

#### 4.1.6. Comparison to Other Cluster Algorithms

#### 4.2. Core-Set Models of Molecular Dynamics

#### 4.3. Run-Time Analysis

## 5. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning |
---|---|
CI | Centroid index at cluster level |
CNN | Common nearest neighbor |
DBSCAN | Density-based spatial clustering of applications with noise |
kM | kMeans algorithm |
Max | Further point heuristic |
MCMC | Markov chain Monte Carlo |
MD | Molecular dynamics |
MSM | Markov state model |
PCA | Principal component analysis |
PES | Potential energy surface |

## References

1. JeraldBeno, T.R.; Karnan, M. Dimensionality Reduction: Rough Set Based Feature Reduction. Int. J. Sci. Res. Publ. **2012**, 2, 1–6.
2. Karypis, G.; Han, E.H.; Kumar, V. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. IEEE Trans. Comput. **1999**, 32, 68–75.
3. Fu, L.; Medico, E. FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinform. **2007**, 8, 3.
4. Keller, B.; Daura, X.; van Gunsteren, W.F. Comparing geometric and kinetic cluster algorithms for molecular simulation data. J. Chem. Phys. **2010**, 132, 074110.
5. Jarvis, R.A.; Patrick, E.A. Clustering Using a Similarity Measure Based on Shared Near Neighbors. IEEE Trans. Comp. **1973**, C-22, 1025–1034.
6. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the KDD-96 the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 226–231.
7. Comaniciu, D.; Meer, P. Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell. **2002**, 24, 603–619.
8. Ankerst, M.; Breunig, M.M.; Kriegel, H.P.; Sander, J. OPTICS: Ordering Points To Identify the Clustering Structure. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, USA, 1–3 June 1999; pp. 49–60.
9. Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science **2014**, 344, 1492–1496.
10. Liu, S.; Zhu, L.; Sheong, F.K.; Wang, W.; Huang, X. Adaptive partitioning by local density-peaks: An efficient density-based clustering algorithm for analyzing molecular dynamics trajectories. J. Comput. Chem. **2016**, 38, 152–160.
11. Jain, A.K.; Topchy, A.; Law, M.H.C.; Buhmann, J.M. Landscape of Clustering Algorithms. In Proceedings of the ICPR’04 17th International Conference on Pattern Recognition, Cambridge, UK, 23–26 August 2004; Volume 1, pp. 260–263.
12. Kärkkäinen, I.; Fränti, P. Dynamic Local Search Algorithm for the Clustering Problem; Technical Report A-2002-6; University of Joensuu: Joensuu, Finland, 2002.
13. Fränti, P.; Virmajoki, O. Iterative shrinking method for clustering problems. Pattern Recognit. **2006**, 39, 761–765.
14. Zhang, T.; Ramakrishnan, R.; Livny, M. BIRCH: A new data clustering algorithm and its applications. Data Min. Knowl. Discov. **1997**, 1, 141–182.
15. Kärkkäinen, I.; Fränti, P. Gradual model generator for single-pass clustering. Pattern Recognit. **2007**, 40, 784–795.
16. Fränti, P.; Virmajoki, O.; Hautamäki, V. Fast agglomerative clustering using a k-nearest neighbor graph. IEEE Trans. Pattern Anal. Mach. Intell. **2006**, 28, 1875–1881.
17. Rezaei, M.; Fränti, P. Set-matching methods for external cluster validity. IEEE Trans. Knowl. Data Eng. **2016**, 28, 2173–2186.
18. Gionis, A.; Mannila, H.; Tsaparas, P. Clustering aggregation. ACM Trans. Knowl. Discov. Data **2007**, 1, 1–30.
19. Zahn, C.T. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. **1971**, 100, 68–86.
20. Veenman, C.J.; Reinders, M.J.T.; Backer, E. A maximum variance cluster algorithm. IEEE Trans. Pattern Anal. Mach. Intell. **2002**, 24, 1273–1280.
21. Jain, A.K.; Law, M.H.C. Data Clustering: A User’s Dilemma. In Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2005; pp. 1–10.
22. Chang, H.; Yeung, D.Y. Robust path-based spectral clustering. Pattern Recognit. **2008**, 41, 191–203.
23. Lemke, O.; Keller, B.G. CNNClustering. Available online: https://github.com/BDGSoftware/CNNClustering (accessed on 6 January 2017).
24. Lemke, O.; Keller, B.G. Density-based cluster algorithms for the identification of core sets. J. Chem. Phys. **2016**, 145, 164104.
25. Sarich, M.; Banisch, R.; Hartmann, C.; Schütte, C. Markov State Models for Rare Events in Molecular Dynamics. Entropy **2014**, 16, 258–286.
26. Vanden-Eijnden, E.; Venturoli, M.; Ciccotti, G.; Elber, R. On the assumptions underlying milestoning. J. Chem. Phys. **2008**, 129, 174102.
27. Schütte, C. Conformational Dynamics: Modelling, Theory, Algorithm, and Application to Biomolecules. Habilitation Thesis, Konrad-Zuse-Zentrum für Informationstechnik, Berlin, Germany, 1999.
28. Schütte, C.; Noé, F.; Lu, J.; Sarich, M.; Vanden-Eijnden, E. Markov state models based on milestoning. J. Chem. Phys. **2011**, 134, 204105.
29. Schütte, C.; Sarich, M. A critical appraisal of Markov state models. Eur. Phys. J. Spec. Top. **2015**, 224, 2445.
30. Frenkel, D.; Smit, B. Understanding Molecular Simulations; Academic Press: San Diego, CA, USA, 1996.
31. Allen, M.P.; Tildesley, D.J. Computer Simulation of Liquids; Oxford University Press: New York, NY, USA, 1987.
32. Leach, A.R. Molecular Modelling; Addison Wesley Longman: Essex, UK, 1996.
33. Hanske, J.; Aleksić, S.; Ballaschk, M.; Jurk, M.; Shanina, E.; Beerbaum, M.; Schmieder, P.; Keller, B.G.; Rademacher, C. Intradomain Allosteric Network Modulates Calcium Affinity of the C-Type Lectin Receptor Langerin. J. Am. Chem. Soc. **2016**, 138, 12176–12186.
34. Witek, J.; Keller, B.G.; Blatter, M.; Meissner, A.; Wagner, T.; Riniker, S. Kinetic Models of Cyclosporin A in Polar and Apolar Environments Reveal Multiple Congruent Conformational States. J. Chem. Inf. Model. **2016**, 56, 1547–1562.
35. Tsai, C.J.; Nussinov, R. A Unified View of “How Allostery Works”. PLoS Comput. Biol. **2014**, 10, e1003394.
36. Ball, G.H.; Hall, D.J. A clustering technique for summarizing multivariate data. Behav. Sci. **1967**, 12, 153–155.
37. Fränti, P.; Rezaei, M.; Zhao, Q. Centroid index: Cluster level similarity measure. Pattern Recognit. **2014**, 47, 3034–3045.
38. Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H.; Teller, E. Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. **1953**, 21, 1087–1092.
39. Fränti, P.; Sieranoja, S. Clustering datasets. Algorithms **2017**. submitted.
40. Arthur, D.; Vassilvitskii, S. K-means++: The advantages of careful seeding. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; pp. 1027–1035.
41. Scherer, M.K.; Trendelkamp-Schroer, B.; Paul, F.; Pérez-Hernández, G.; Hoffmann, M.; Plattner, N.; Wehmeyer, C.; Prinz, J.H.; Noé, F. PyEMMA 2: A Software Package for Estimation, Validation, and Analysis of Markov Models. J. Chem. Theory Comput. **2015**, 11, 5525–5542.
42. Lloyd, S.P. Least squares quantization in PCM. IEEE Trans. Inf. Theory **1982**, 28, 129–137.
43. Gonzalez, T.F. Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci. **1985**, 38, 293–306.
44. Fränti, P.; Mariescu-Istodor, R.; Zhong, C. XNN graph. Joint Int. Workshop Struct. Syntactic Stat. Pattern Recognit. **2016**, LNCS 10029, 207–217.
45. Schwantes, C.R.; Pande, V.S. Modeling Molecular Kinetics with tICA and the Kernel Trick. J. Chem. Theory Comput. **2015**, 11, 600–608.
46. Aghabozorgi, S.; Shirkhorshidi, A.S.; Wah, T.Y. Time-series clustering—A decade review. Inf. Syst. **2015**, 53, 16–38.
47. Mariescu-Istodor, R.; Fränti, P. Grid-Based Method for GPS Route Analysis for Retrieval. ACM Trans. Algorithm **2017**, 3, 1–28.
48. Chandrakala, S.; Sekhar, C.C. A density based method for multivariate time series clustering in kernel feature space. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008.
49. Hamprecht, F.A.; Peter, C.; Daura, X.; Thiel, W.; van Gunsteren, W.F. A strategy for analysis of (molecular) equilibrium simulations: Configuration space density estimation, clustering, and visualization. J. Chem. Phys. **2001**, 114, 2079–2089.
50. Schütte, C.; Fischer, A.; Huisinga, W.; Deuflhard, P. A Direct Approach to Conformational Dynamics Based on Hybrid Monte Carlo. J. Comput. Phys. **1999**, 151, 146–168.
51. Swope, W.C.; Pitera, J.W.; Suits, F. Describing Protein Folding Kinetics by Molecular Dynamics Simulations. J. Phys. Chem. B **2004**, 108, 6571–6581.
52. Chodera, J.D.; Singhal, N.; Pande, V.S.; Dill, K.A.; Swope, W.C. Automatic discovery of metastable states for the construction of Markov models of macromolecular conformational dynamics. J. Chem. Phys. **2007**, 126, 155101.
53. Buchete, N.V.; Hummer, G. Coarse Master Equations for Peptide Folding Dynamics. J. Phys. Chem. B **2008**, 112, 6057–6069.
54. Keller, B.; Hünenberger, P.; van Gunsteren, W.F. An Analysis of the Validity of Markov State Models for Emulating the Dynamics of Classical Molecular Systems and Ensembles. J. Chem. Theory Comput. **2011**, 7, 1032–1044.
55. Prinz, J.H.; Wu, H.; Sarich, M.; Keller, B.; Senne, M.; Held, M.; Chodera, J.D.; Schütte, C.; Noé, F. Markov models of molecular kinetics: Generation and validation. J. Chem. Phys. **2011**, 134, 174105.
56. Sarich, M.; Noé, F.; Schütte, C. On the Approximation Quality of Markov State Models. Multisc. Model. Simul. **2010**, 8, 1154–1177.
57. Nüske, F.; Keller, B.G.; Pérez-Hernández, G.; Mey, A.S.J.S.; Noé, F. Variational Approach to Molecular Kinetics. J. Chem. Theory Comput. **2014**, 10, 1739–1752.
58. Vitalini, F.; Noé, F.; Keller, B.G. A Basis Set for Peptides for the Variational Approach to Conformational Kinetics. J. Chem. Theory Comput. **2015**, 11, 3992–4004.
59. Fackeldey, K.; Röblitz, S.; Scharkoi, O.; Weber, M. Soft Versus Hard Metastable Conformations in Molecular Simulations; Technical Report 11-27; ZIB: Berlin, Germany, 2011.
60. Weber, M.; Fackeldey, K.; Schütte, C. Set-free Markov state model building. J. Chem. Phys. **2017**, 146, 124133.

**Figure 1.** (**a**) Potential energy surface (PES) spanned by two coordinates $x_1$ and $x_2$; (**b**) trajectories within the PES (projection onto $x_1$ and $x_2$); (**c**) Boltzmann distribution of the PES divided into three states; (**d**) core sets (blue) within the Boltzmann distribution; all non-assigned data points are outside the core sets.

**Figure 2.** Performance of the common nearest neighbor (CNN) algorithm on a variety of two-dimensional (2D) datasets; the parameter sets are presented in Table 1; non-assigned data points (noise) are highlighted in light gray; where available, the true centroids are highlighted in blue. ${}^{A}$ For the Birch3 dataset, a three-step hierarchical clustering approach was applied. The original clusters after the first step are highlighted in light red and light blue.

**Figure 3.** Molecular dynamics (MD) simulation trajectory of Langerin [33] (**a**) projected onto the first two principal components. Each core set (highlighted in different colors) represents a minimum of the potential energy surface spanned by the principal components and is linked to (**b**) a metastable state. (The core sets were identified by common nearest neighbor clustering, where the distance metric was the Euclidean distance in the space spanned by the first two principal components. Preprocessing the MD dataset by principal component analysis is, however, not strictly necessary [4,24].) (**c**) Distribution of the distance Met260-Gly290 (highlighted in (**b**) by orange spheres).

**Figure 4.** Clustering process using the common nearest neighbor (CNN) algorithm: (**a**) A dataset that consists of two clusters and several noise points. (**b**–**e**) Depiction of the density criteria that determine whether two data points ${\mathbf{x}}_{i}$ and ${\mathbf{x}}_{j}$ belong to the same cluster: (**b**) The first density criterion, N shared nearest neighbors, is fulfilled. (**c**) The second density criterion, ${\mathbf{x}}_{i}$ and ${\mathbf{x}}_{j}$ are nearest neighbors with respect to the other data point, is fulfilled. (**d**) Both criteria (**b**,**c**) are fulfilled at the same time. Hence, ${\mathbf{x}}_{i}$ and ${\mathbf{x}}_{j}$ belong to the same cluster. (**e**) One criterion is fulfilled, the other is not. Hence, ${\mathbf{x}}_{i}$ and ${\mathbf{x}}_{j}$ do not belong to the same cluster. (**f**) ${\mathbf{x}}_{a}$ and ${\mathbf{x}}_{b}$ fulfil no density criteria but belong to the same cluster as they are density-connected (reproduced from Lemke, O.; Keller, B.G. “Density-based cluster algorithms for the identification of core sets”, J. Chem. Phys. **2016**, 145, 164104, with the permission of AIP Publishing). (**g**) Depiction of a clustering process over several cluster expansion steps (iterations). We note that the cluster is expanded by adding single data points to the current cluster that fulfil the density criterion with respect to a data point within the cluster.

**Figure 5.** Distance distribution of different datasets (left axis, black) and distance distribution of the isolated clusters (right axis, colored according to the clusters in Figure 2). The third row shows a zoomed view of the datasets in the second row. The dashed line denotes the chosen cutoff distance R for each dataset.

**Figure 6.** Hierarchical clustering: (**a**) the density of a molecular dynamics simulation trajectory projected onto two meaningful coordinates; (**b**) the dataset of the trajectory; (**c**) a reduced dataset (red) for the clustering; (**d**) outcome of the first clustering step; (**e**) outcome of the second clustering step; (**f**) final core sets. A direct step from (**b**) to (**f**) is not possible because of the dataset size. A direct step from (**c**) to (**e**) is not possible because of the large difference in cluster density.

**Figure 7.** Further evaluation of the clustering: (**a**) Parameters R (solid, left axis) and N (dashed, right axis) for the clustering of the G2 datasets for 2 (blue), 8 (red) and 128 (green) dimensions. For N, the minimal value is reported for which a split of the two clusters is observed. The colored region highlights the cluster sizes that are possible to extract. We note that beyond $\sigma =60$ for two dimensions and $\sigma =70$ for eight dimensions, no split of the clusters is observed. (**b**) Clustering for the Birch3 (upper) and the Path-based (lower) datasets using the common nearest neighbor (CNN), the DBSCAN and the density peaks algorithms.

**Figure 8.** (**a**) Core sets of the two-dimensional (2D) potential energy surface (PES) using different parameter sets. A smaller core set corresponds to a smaller cutoff radius. (**b**,**c**) Implied time scales (colored according to the upper right cluster in (**a**)) of (**b**) the slower process and (**c**) the faster process. The black dashed line denotes the convergence limit. (**d**) Eigenvectors corresponding to the slow processes: stationary distribution (left), slower process (middle), and faster process (right). The eigenvectors describe transitions of probability density between states with different sign (highlighted by the color).

**Figure 9.** Run-time of the clustering step (without calculation of the distance matrix) for different dataset sizes for the Birch2 dataset (blue) and a sample of the two-dimensional Boltzmann distribution (green; Section 4.2). The blue area highlights the scaling of the clustering step between linear (lower bound) and quadratic (upper bound) behavior. The dashed lines represent 1 h, 1 min and 1 s, respectively. All calculations were performed on an Intel(R) Core(TM) i5-4590 CPU at 3.30 GHz with 16 GB of RAM.

**Table 1.** Properties of the benchmark datasets: n: dataset size; d: dimensionality; ${k}_{R}$: reference number of clusters. Parameters for the common nearest neighbor (CNN) algorithm: R and N: cluster parameter set; M: minimal cluster size. Parameters for the density-based spatial clustering of applications with noise (DBSCAN) algorithm: $Eps$: cutoff distance; $MP$: minimum number of neighbors; M: minimal cluster size. ${}^{A}$ Hierarchical clustering was necessary. ${}^{B}$ The size was reduced from 100,000 by extracting every 10th data point.

Name | $n$ | $d$ | $k_R$ | CNN $R$ | CNN $N$ | CNN $M$ | DBSCAN $Eps$ | DBSCAN $MP$ | DBSCAN $M$ |
---|---|---|---|---|---|---|---|---|---|
A sets [12] | | | | | | | | | |
A1 | 3000 | 2 | 20 | $1 \times 10^{3}$ | 9 | 10 | 800 | 9 | 10 |
A2 | 5200 | 2 | 35 | $1 \times 10^{3}$ | 9 | 10 | 800 | 9 | 10 |
A3 | 7500 | 2 | 50 | $1 \times 10^{3}$ | 9 | 10 | 800 | 9 | 10 |
S sets [13] | | | | | | | | | |
S1 | 5000 | 2 | 15 | $2.5 \times 10^{4}$ | 5 | 10 | $2.0 \times 10^{4}$ | 9 | 10 |
S2 | 5000 | 2 | 15 | $2.5 \times 10^{4}$ | 17 | 10 | $2.0 \times 10^{4}$ | 9 | 10 |
S3 | 5000 | 2 | 15 | $2.5 \times 10^{4}$ | 25 | 10 | $1.7 \times 10^{4}$ | 9 | 10 |
S4 | 5000 | 2 | 15 | $2.5 \times 10^{4}$ | 30 | 10 | $1.7 \times 10^{4}$ | 9 | 10 |
Birch sets [14] | | | | | | | | | |
Birch1 | 10,000 ${}^{B}$ | 2 | 100 | $2.5 \times 10^{4}$ | 23 | 10 | $1.5 \times 10^{4}$ | 9 | 10 |
Birch2 | 10,000 ${}^{B}$ | 2 | 100 | $2.5 \times 10^{3}$ | 20 | 10 | $1.5 \times 10^{3}$ | 9 | 10 |
Birch3 ${}^{A}$ | 10,000 ${}^{B}$ | 2 | 100 | $3 \times 10^{4}$ | 10 | 10 | --- | --- | --- |
--- | --- | --- | --- | $2 \times 10^{4}$ | 10 | 10 | --- | --- | --- |
--- | --- | --- | --- | $1.5 \times 10^{4}$ | 10 | 10 | $1.5 \times 10^{4}$ | 9 | 10 |
Dim sets [15,16] | | | | | | | | | |
Dim-32 | 1024 | 32 | 16 | 10 | 5 | 10 | 15 | 9 | 10 |
Dim-64 | 1024 | 64 | 16 | 10 | 5 | 10 | 15 | 9 | 10 |
Dim-128 | 1024 | 128 | 16 | 10 | 5 | 10 | 15 | 9 | 10 |
Dim-256 | 1024 | 256 | 16 | 10 | 5 | 10 | 15 | 9 | 10 |
Dim-512 | 1024 | 512 | 16 | 10 | 5 | 10 | 20 | 9 | 10 |
Dim-1024 | 1024 | 1024 | 16 | 10 | 5 | 10 | 20 | 9 | 10 |
Unbalance [17] | | | | | | | | | |
Unbalance | 6500 | 2 | 8 | $1 \times 10^{4}$ | 5 | 10 | $6 \times 10^{3}$ | 9 | 10 |
Shape sets | | | | | | | | | |
Aggregation [18] | 788 | 2 | 7 | 2.50 | 17 | 10 | --- | --- | --- |
Compound [19] | 399 | 2 | 6 | 2.00 | 15 | 10 | --- | --- | --- |
D31 [20] | 3100 | 2 | 31 | 0.45 | 4 | 10 | --- | --- | --- |
Flame [3] | 240 | 2 | 2 | 2.00 | 15 | 10 | --- | --- | --- |
Jain [21] | 373 | 2 | 2 | 2.50 | 3 | 10 | --- | --- | --- |
Path-based [22] | 300 | 2 | 3 | 1.40 | 2 | 5 | 1.4 | 2 | 5 |
R15 [20] | 600 | 2 | 15 | 0.60 | 9 | 10 | --- | --- | --- |
Spiral [22] | 312 | 2 | 3 | 3.00 | 3 | 10 | --- | --- | --- |
t4.8k [2] | 8000 | 2 | 6 | 10.00 | 15 | 10 | --- | --- | --- |

**Table 2.** Parameters used for the construction of a generated potential energy surface as shown in Figure 1a.

$i$ | $a_{ix}$ | $b_{ix}$ | $a_{iy}$ | $b_{iy}$ | $B_{i}$ |
---|---|---|---|---|---|
1 | 0.8 | −45 | 0.9 | 40 | 10 |
2 | 0.8 | 0 | 0.9 | −65 | 0 |
3 | 0.8 | 45 | 0.9 | 55 | 10 |

**Table 3.** Clustering results for different datasets. The left part shows the number of isolated clusters ${k}_{C}$ for the common nearest neighbor (CNN) and the DBSCAN algorithm (compared to the reference number of clusters ${k}_{R}$) as well as the percentage of data points declared as noise. The right part shows the centroid index at cluster level (CI) [37] using the CNN algorithm, the DBSCAN algorithm and the kMeans(++) algorithm (kM(++)) [40] as implemented in PyEMMA [41]. The CI values are compared to reference values provided by the authors of [39] for the kMeans algorithm [42] using a random initialization (kM) and using a further point heuristic (Max) [43] initialization. The reference data were averaged over 5000 runs. ${}^{A}$ Hierarchical clustering was necessary. ${}^{B}$ After the last clustering step.

Name | $k_R$ | CNN $k_C$ | CNN noise | DBSCAN $k_C$ | DBSCAN noise | CI (CNN) | CI (DBSCAN) | CI (kM(++)) | CI (kM [39]) | CI (kM (Max) [39]) |
---|---|---|---|---|---|---|---|---|---|---|
A sets [12] | | | | | | | | | | |
A1 | 20 | 20 | 22% | 19 | 19% | 0 | 1 | 1 | 2.5 | 1.0 |
A2 | 35 | 35 | 22% | 34 | 18% | 0 | 1 | 1 | 4.5 | 2.6 |
A3 | 50 | 50 | 22% | 50 | 18% | 0 | 1 | 2 | 6.6 | 2.9 |
S sets [13] | | | | | | | | | | |
S1 | 15 | 15 | 4% | 16 | 6% | 0 | 1 | 0 | 1.8 | 0.7 |
S2 | 15 | 15 | 28% | 14 | 12% | 0 | 2 | 1 | 1.4 | 1.0 |
S3 | 15 | 16 | 46% | 15 | 20% | 1 | 5 | 0 | 1.3 | 0.7 |
S4 | 15 | 16 | 48% | 12 | 14% | 1 | 4 | 1 | 0.9 | 1.0 |
Birch sets [14] | | | | | | | | | | |
Birch1 | 100 | 100 | 34% | 95 | 14% | 0 | 9 | 4 | 6.6 | 5.5 |
Birch2 | 100 | 100 | 9% | 100 | 3% | 0 | 0 | 0 | 16.6 | 7.3 |
Birch3 ${}^{A}$ | 100 | 21 | 2% | --- | --- | --- | --- | --- | --- | --- |
--- | --- | 35 | 11% | --- | --- | --- | --- | --- | --- | --- |
--- | --- | 42 | 32% | 41 | 6% | 58 ${}^{B}$ | 59 | 16 | --- | --- |
Dim sets [15,16] | | | | | | | | | | |
Dim-32 | 16 | 16 | 50% | 16 | 29% | 0 | 0 | 0 | 3.6 | 0.0 |
Dim-64 | 16 | 16 | 47% | 16 | 27% | 0 | 0 | 0 | 3.7 | --- |
Dim-128 | 16 | 16 | 51% | 16 | 32% | 0 | 0 | 0 | 3.8 | --- |
Dim-256 | 16 | 16 | 60% | 16 | 34% | 0 | 0 | 0 | 3.9 | --- |
Dim-512 | 16 | 15 | 71% | 16 | 25% | 1 | 0 | 0 | 4.1 | --- |
Dim-1024 | 16 | 16 | 78% | 16 | 26% | 0 | 0 | 0 | 3.8 | --- |
Unbalance [17] | | | | | | | | | | |
Unbalance | 8 | 8 | 1% | 8 | 4% | 0 | 0 | 0 | 3.6 | 0.9 |
Shape sets | | | | | | | | | | |
Aggregation [18] | 7 | 7 | 9% | --- | --- | --- | --- | --- | --- | --- |
Compound [19] | 6 | 5 | 27% | --- | --- | --- | --- | --- | --- | --- |
D31 [20] | 31 | 31 | 17% | --- | --- | --- | --- | --- | --- | --- |
Flame [3] | 2 | 2 | 15% | --- | --- | --- | --- | --- | --- | --- |
Jain [21] | 2 | 3 | 1% | --- | --- | --- | --- | --- | --- | --- |
Path-based [22] | 3 | 13 | 3% | 13 | 3% | --- | --- | --- | --- | --- |
R15 [20] | 15 | 15 | 5% | --- | --- | --- | --- | --- | --- | --- |
Spiral [22] | 3 | 3 | 0% | --- | --- | --- | --- | --- | --- | --- |
t4.8k [2] | 6 | 6 | 13% | --- | --- | --- | --- | --- | --- | --- |
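As an illustration of the CI values reported in Table 3, the following Python sketch computes a centroid index between two sets of centroids. It follows our reading of the definition in [37]: each centroid of one set is mapped to its nearest centroid in the other set, the centroids that receive no mapping ("orphans") are counted, and the CI is the larger of the two orphan counts. The function names `orphan_count` and `centroid_index` are hypothetical and are not taken from any of the cited software packages.

```python
import numpy as np


def orphan_count(source, target):
    """Number of target centroids that are nearest to no source centroid."""
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    # Pairwise Euclidean distances, shape (len(source), len(target)).
    d = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)  # nearest target centroid for every source centroid
    return len(target) - len(np.unique(nearest))


def centroid_index(centroids_a, centroids_b):
    """Symmetric centroid index: the larger of the two one-directional counts."""
    return max(orphan_count(centroids_a, centroids_b),
               orphan_count(centroids_b, centroids_a))


reference = [(0, 0), (10, 0), (20, 0)]
found = [(0, 0), (0.5, 0), (20, 0)]  # one reference cluster split, one missed
print(centroid_index(found, reference))  # prints 1
```

A value of 0 indicates that the two solutions agree at the level of clusters. When the two sets differ in size, the CI is at least the difference in the number of centroids.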

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Lemke, O.; Keller, B.G. Common Nearest Neighbor Clustering—A Benchmark. *Algorithms* **2018**, *11*, 19.
https://doi.org/10.3390/a11020019
