# Scikit-Dimension: A Python Package for Intrinsic Dimension Estimation

^{1}

^{2}

^{3}

^{4}

^{5}

^{*}

## Abstract

**:**

`scikit-dimension`, an open-source Python package for intrinsic dimension estimation. The

`scikit-dimension`package provides a uniform implementation of most of the known ID estimators based on the scikit-learn application programming interface to evaluate the global and local intrinsic dimension, as well as generators of synthetic toy and benchmark datasets widespread in the literature. The package is developed with tools assessing the code quality, coverage, unit testing and continuous integration. We briefly describe the package and demonstrate its use in a large-scale (more than 500 datasets) benchmarking of methods for ID estimation for real-life and synthetic data.

## 1. Introduction

`scikit-dimension`, an open-source Python package for global and local intrinsic dimension (ID) estimation. The package has two main objectives: (i) foster research in ID estimation by providing code to benchmark algorithms and a platform to share algorithms; and (ii) democratize the use of ID estimation by providing user-friendly implementations of algorithms using the scikit-learn application programming interface (API) [1].

`scikit-dimension`, these two notions are not distinguished.

`scikit-dimension`can be used to derive a consensus measure of data dimensionality by averaging multiple individual measures. The latter can be a robust measure of data dimensionality in various applications.

## 2. Materials and Methods

#### 2.1. Software Features

`Scikit-dimension`is an open-source software available at https://github.com/j-bac/scikit-dimension (accessed on 18 October 2021).

`Scikit-dimension`consists of two modules. The id module provides ID estimators, and the datasets module provides synthetic benchmark datasets.

#### 2.1.1. id Module

#### 2.1.2. Datasets Module

#### 2.2. Development

`Scikit-dimension`is built according to the scikit-learn API [1] with support for Linux, MacOS, Windows and Python $>=$ 3.6. The code style and API design are based on the guidelines of scikit-learn, with the NumPy [34] documentation format, and continuous integration on all three platforms. The online documentation is built using Sphinx and hosted with ReadTheDocs.

#### 2.3. Dependencies

`Scikit-dimension`depends on a limited number of external dependencies on the user side for ease of installation and maintenance:

#### 2.4. Related Software

`scikit-dimension`.

`scikit-dimension`is the first Python implementation of an extensive collection of ID methods. Compared to similar efforts in other languages, the package puts emphasis on estimators, quantifying various properties of high-dimensional data geometry, such as the concentration of measure. It is the only package to include ID estimation based on linear separability of data, using Fisher discriminants [4,32,50,51].

## 3. Results

#### 3.1. Benchmarking `Scikit-Dimension` on a Large Collection of Datasets

`scikit-dimension`to a wide range of real-life datasets of various configurations and sizes, we performed a large-scale benchmarking of

`scikit-dimension`, using the collection of datasets from the

`OpenML`repository [52]. We selected those datasets having at least 1000 observations and 10 features, without missing values. We excluded those datasets which were difficult to fetch, either because of their size or an error in the

`OpenML`API. After filtering out repetitive entries, 499 datasets were collected. Their number of observations varied from 1002 to 9,199,930, and their number of features varied from 10 to 13,196. We focused only on numerical variables, and we subsampled the number of rows in the matrix to a maximum of 100,000. All dataset features were scaled to unit interval using Min/Max scaling. In addition, we filtered out approximate non-unique columns and rows in the data matrices since some of the ID methods could be affected by the presence of identical (or approximately identical) rows or columns.

#### 3.1.1. `Scikit-Dimension` ID Estimator Method Features

`scikit-dimension`, with default parameter values, including 7 methods based on application of principal component analysis (“linear” or PCA-based ID methods), and 12 based on application of various other principles, including correlation dimension and concentration of measure-based methods (“nonlinear” ID methods).

#### 3.1.2. Metanalysis of `Scikit-Dimension` ID Estimates

`scikit-dimension`, each dataset was characterized by a vector of 19 measurements of intrinsic dimensionality. The resulting matrix of ID values contained 2.5% missing values, which were imputed, using the standard IterativeImputer from the

`sklearn`Python package.

## 4. Conclusions

`scikit-dimension`is to our knowledge the first package implemented in Python, containing implementations of the most-used estimators of ID.

`scikit-dimension`on a large collection of real-life and synthetic datasets revealed that different estimators of ID possess internal consistency and that the ensemble of ID estimators allows us to achieve more robust classification of datasets into low- or high-dimensionality.

`scikit-dimension`package can be used, but this description is by no means comprehensive.

`scikit-dimension`will continuously seek to incorporate new estimators and benchmark datasets introduced in the literature, or new features, such as alternative nearest neighbor search for local ID estimates. The package will also include new ID estimators, which can be derived using the most recent achievements in understanding the properties of high-dimensional data geometry [71,75].

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

`OpenML`repository, CytoTRACE website https://cytotrace.stanford.edu/ (section “Downloads”, accessed on 18 October 2021), from the Data Portal of National Cancer Institute https://portal.gdc.cancer.gov/ (accessed on 18 October 2021).

## Conflicts of Interest

## References

- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res.
**2011**, 12, 2825–2830. [Google Scholar] - Bishop, C.M. Neural Networks for Pattern Recognition; Oxford University Press: Oxford, UK, 1995. [Google Scholar] [CrossRef]
- Fukunaga, K. Intrinsic dimensionality extraction. In Pattern Recognition and Reduction of Dimensionality, Handbook of Statistics; Krishnaiah, P.R., Kanal, L.N., Eds.; North-Holland: Amsterdam, The Netherlands, 1982; Volume 2, pp. 347–362. [Google Scholar]
- Albergante, L.; Bac, J.; Zinovyev, A. Estimating the effective dimension of large biological datasets using Fisher separability analysis. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8. [Google Scholar]
- Giudice, M.D. Effective Dimensionality: A Tutorial. Multivar. Behav. Res.
**2020**, 1–16. [Google Scholar] [CrossRef] - Palla, K.; Knowles, D.; Ghahramani, Z. A nonparametric variable clustering model. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2012; Volume 4, pp. 2987–2995. [Google Scholar]
- Giuliani, A.; Benigni, R.; Sirabella, P.; Zbilut, J.P.; Colosimo, A. Nonlinear Methods in the Analysis of Protein Sequences: A Case Study in Rubredoxins. Biophys. J.
**2000**, 78, 136–149. [Google Scholar] [CrossRef][Green Version] - Jiang, H.; Kim, B.; Guan, M.Y.; Gupta, M.R. To Trust Or Not To Trust A Classifier. In NeurIPS; Montreal Convention Centre: Montreal, QC, Canada, 2018; pp. 5546–5557. [Google Scholar] [CrossRef]
- Bac, J.; Zinovyev, A. Lizard Brain: Tackling Locally Low-Dimensional Yet Globally Complex Organization of Multi-Dimensional Datasets. Front. Neurorobotics
**2020**, 13, 110. [Google Scholar] [CrossRef][Green Version] - Hino, H. ider: Intrinsic Dimension Estimation with R. R J.
**2017**, 9, 329–341. [Google Scholar] [CrossRef] - Campadelli, P.; Casiraghi, E.; Ceruti, C.; Rozza, A. Intrinsic Dimension Estimation: Relevant Techniques and a Benchmark Framework. Math. Probl. Eng.
**2015**, 2015, 759567. [Google Scholar] [CrossRef][Green Version] - Camastra, F.; Staiano, A. Intrinsic dimension estimation: Advances and open problems. Inf. Sci.
**2016**, 328, 26–41. [Google Scholar] [CrossRef] - Little, A.V.; Lee, J.; Jung, Y.; Maggioni, M. Estimation of intrinsic dimensionality of samples from noisy low-dimensional manifolds in high dimensions with multiscale SVD. In Proceedings of the 2009 IEEE/SP 15th Workshop on Statistical Signal Processing, Cardiff, UK, 31 August–3 September 2009; pp. 85–88. [Google Scholar] [CrossRef]
- Hein, M.; Audibert, J.Y. Intrinsic dimensionality estimation of submanifolds in R
^{d}. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005; ACM: New York, NY, USA, 2005; pp. 289–296. [Google Scholar] [CrossRef][Green Version] - Mirkes, E.; Allohibi, J.; Gorban, A.N. Fractional Norms and Quasinorms Do Not Help to Overcome the Curse of Dimensionality. Entropy
**2020**, 22, 1105. [Google Scholar] [CrossRef] [PubMed] - Golovenkin, S.E.; Bac, J.; Chervov, A.; Mirkes, E.M.; Orlova, Y.V.; Barillot, E.; Gorban, A.N.; Zinovyev, A. Trajectories, bifurcations, and pseudo-time in large clinical datasets: Applications to myocardial infarction and diabetes data. GigaScience
**2020**, 9, giaa128. [Google Scholar] [CrossRef] [PubMed] - Zinovyev, A.; Sadovsky, M.; Calzone, L.; Fouché, A.; Groeneveld, C.S.; Chervov, A.; Barillot, E.; Gorban, A.N. Modeling Progression of Single Cell Populations Through the Cell Cycle as a Sequence of Switches. bioRxiv
**2021**. [Google Scholar] [CrossRef] - Grassberger, P.; Procaccia, I. Measuring the strangeness of strange attractors. Phys. D Nonlinear Phenom.
**1983**, 9, 189–208. [Google Scholar] [CrossRef] - Farahmand, A.M.; Szepesvári, C.; Audibert, J.Y. Manifold-adaptive dimension estimation. In Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, USA, 20–24 June 2007; pp. 265–272. [Google Scholar] [CrossRef]
- Amsaleg, L.; Chelly, O.; Furon, T.; Girard, S.; Houle, M.E.; Kawarabayashi, K.; Nett, M. Extreme-value-theoretic estimation of local intrinsic dimensionality. Data Min. Knowl. Discov.
**2018**, 32, 1768–1805. [Google Scholar] [CrossRef] - Jackson, D.A. Stopping rules in principal components analysis: A comparison of heuristical and statistical approaches. Ecology
**1993**, 74, 2204–2214. [Google Scholar] [CrossRef] - Fukunaga, K.; Olsen, D.R. An Algorithm for Finding Intrinsic Dimensionality of Data. IEEE Trans. Comput.
**1971**, C-20, 176–183. [Google Scholar] [CrossRef] - Mingyu, F.; Gu, N.; Qiao, H.; Zhang, B. Intrinsic dimension estimation of data by principal component analysis. arXiv
**2010**, arXiv:1002.2050. [Google Scholar] - Hill, B.M. A simple general approach to inference about the tail of a distribution. Ann. Stat.
**1975**, 1163–1174. [Google Scholar] [CrossRef] - Levina, E.; Bickel, P.J. Maximum Likelihood estimation of intrinsic dimension. In Proceedings of the 17th International Conference on Neural Information Processing Systems, Vancouver, Canada, 1 December 2004; MIT Press: Cambridge, MA, USA, 2004; pp. 777–784. [Google Scholar] [CrossRef]
- Haro, G.; Randall, G.; Sapiro, G. Translated poisson mixture model for stratification learning. Int. J. Comput. Vis.
**2008**, 80, 358–374. [Google Scholar] [CrossRef] - Carter, K.M.; Raich, R.; Hero, A.O. On Local Intrinsic Dimension Estimation and Its Applications. IEEE Trans. Signal Process.
**2010**, 58, 650–663. [Google Scholar] [CrossRef][Green Version] - Rozza, A.; Lombardi, G.; Ceruti, C.; Casiraghi, E.; Campadelli, P. Novel high intrinsic dimensionality estimators. Mach. Learn.
**2012**, 89, 37–65. [Google Scholar] [CrossRef] - Ceruti, C.; Bassis, S.; Rozza, A.; Lombardi, G.; Casiraghi, E.; Campadelli, P. DANCo: An intrinsic dimensionality estimator exploiting angle and norm concentration. Pattern Recognit.
**2014**, 47, 2569–2581. [Google Scholar] [CrossRef] - Johnsson, K. Structures in High-Dimensional Data: Intrinsic Dimension and Cluster Analysis. Ph.D. Thesis, Faculty of Engineering, LTH, Perth, Australia, 2016. [Google Scholar]
- Facco, E.; D’Errico, M.; Rodriguez, A.; Laio, A. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Sci. Rep.
**2017**, 7, 12140. [Google Scholar] [CrossRef] [PubMed][Green Version] - Gorban, A.; Golubkov, A.; Grechuk, B.; Mirkes, E.; Tyukin, I. Correction of AI systems by linear discriminants: Probabilistic foundations. Inf. Sci.
**2018**, 466, 303–322. [Google Scholar] [CrossRef][Green Version] - Amsaleg, L.; Chelly, O.; Houle, M.E.; Kawarabayashi, K.; Radovanović, M.; Treeratanajaru, W. Intrinsic dimensionality estimation within tight localities. In Proceedings of the 2019 SIAM International Conference on Data Mining, Calgary, AB, Canada, 2–4 May 2019; SIAM: Philadelphia, PA, USA, 2019; pp. 181–189. [Google Scholar]
- Harris, C.R.; Millman, K.J.; van der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array programming with NumPy. Nature
**2020**, 585, 357–362. [Google Scholar] [CrossRef] - Hunter, J.D. Matplotlib: A 2D graphics environment. Comput. Sci. Eng.
**2007**, 9, 90–95. [Google Scholar] [CrossRef] - The Pandas Development Team.Pandas-Dev/Pandas: Pandas 1.3.4, Zenodo. Available online: https://zenodo.org/record/5574486#.YW50jhpByUk (accessed on 18 October 2021). [CrossRef]
- Lam, S.K.; Pitrou, A.; Seibert, S. Numba: A llvm-based python jit compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, Austin, TX, USA, 15 November 2015; pp. 1–6. [Google Scholar] [CrossRef]
- Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat. Methods
**2020**, 17, 261–272. [Google Scholar] [CrossRef] [PubMed][Green Version] - Johnsson, K. intrinsicDimension: Intrinsic Dimension Estimation (R Package). 2019. Available online: https://rdrr.io/cran/intrinsicDimension/ (accessed on 6 September 2021).
- You, K. Rdimtools: An R package for Dimension Reduction and Intrinsic Dimension Estimation. arXiv
**2020**, arXiv:2005.11107. [Google Scholar] - Denti, Francesco intRinsic: An R package for model-based estimation of the intrinsic dimension of a dataset. arXiv
**2021**, arXiv:2102.11425. - Hein, M.J.Y.A. IntDim: Intrindic Dimensionality Estimation. 2016. Available online: https://www.ml.uni-saarland.de/code/IntDim/IntDim.htm (accessed on 6 September 2021).
- Lombardi, G. Intrinsic Dimensionality Estimation Techniques (MATLAB Package). 2013. Available online: https://fr.mathworks.com/matlabcentral/fileexchange/40112-intrinsic-dimensionality-estimation-techniques (accessed on 6 September 2021).
- Van der Maaten, L. Drtoolbox: Matlab Toolbox for Dimensionality Reduction. 2020. Available online: https://lvdmaaten.github.io/drtoolbox/ (accessed on 6 September 2021).
- Radovanović, M. Tight Local Intrinsic Dimensionality Estimator (TLE) (MATLAB Package). 2020. Available online: https://perun.pmf.uns.ac.rs/radovanovic/tle/ (accessed on 6 September 2021).
- Gomtsyan, M.; Mokrov, N.; Panov, M.; Yanovich, Y. Geometry-Aware Maximum Likelihood Estimation of Intrinsic Dimension (Python Package). 2019. Available online: https://github.com/stat-ml/GeoMLE (accessed on 6 September 2021).
- Gomtsyan, M.; Mokrov, N.; Panov, M.; Yanovich, Y. Geometry-Aware Maximum Likelihood Estimation of Intrinsic Dimension. In Proceedings of the Eleventh Asian Conference on Machine Learning, Nagoya, Japan, 17–19 November 2019; pp. 1126–1141. [Google Scholar]
- Erba, V. pyFCI: A Package for Multiscale-Full-Correlation-Integral Intrinsic Dimension Estimation. 2019. Available online: https://github.com/vittorioerba/pyFCI (accessed on 6 September 2021).
- Granata, D. Intrinsic-Dimension (Python Package). 2016. Available online: https://github.com/dgranata/Intrinsic-Dimension (accessed on 6 September 2021).
- Bac, J.; Zinovyev, A. Local intrinsic dimensionality estimators based on concentration of measure. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar]
- Gorban, A.N.; Makarov, V.A.; Tyukin, I.Y. The unreasonable effectiveness of small neural ensembles in high-dimensional brain. Phys. Life Rev.
**2019**, 29, 55–88. [Google Scholar] [CrossRef] - Vanschoren, J.; van Rijn, J.N.; Bischl, B.; Torgo, L. OpenML: Networked Science in Machine Learning. SIGKDD Explor.
**2013**, 15, 49–60. [Google Scholar] [CrossRef][Green Version] - Gulati, G.; Sikandar, S.; Wesche, D.; Manjunath, A.; Bharadwaj, A.; Berger, M.; Ilagan, F.; Kuo, A.; Hsieh, R.; Cai, S.; et al. Single-cell transcriptional diversity is a hallmark of developmental potential. Science
**2020**, 24, 405–411. [Google Scholar] [CrossRef] - Giuliani, A. The application of principal component analysis to drug discovery and biomedical data. Drug Discov. Today
**2017**, 22, 1069–1076. [Google Scholar] [CrossRef] [PubMed] - Cangelosi, R.; Goriely, A. Component retention in principal component analysis with application to cDNA microarray data. Biol. Direct
**2007**, 2, 2. [Google Scholar] [CrossRef][Green Version] - Johnsson, K.; Soneson, C.; Fontes, M. Low Bias Local Intrinsic Dimension Estimation from Expected Simplex Skewness. IEEE Trans. Pattern Anal. Mach. Intell.
**2015**, 37, 196–202. [Google Scholar] [CrossRef] [PubMed] - Jolliffe, I.T. Principal Component Analysis; Springer Series in Statistics; Springer: Berlin/Heidelberg, Germany, 2002. [Google Scholar]
- Kaiser, H. The Application of Electronic Computers to Factor Analysis. Educ. Psychol. Meas.
**1960**, 20, 141–151. [Google Scholar] [CrossRef] - Frontier, S. Étude de la décroissance des valeurs propres dans une analyse en composantes principales: Comparaison avec le modèle du bâton brisé. J. Exp. Mar. Biol. Ecol.
**1976**, 25, 67–75. [Google Scholar] [CrossRef] - Gorban, A.N.; Sumner, N.R.; Zinovyev, A.Y. Topological grammars for data approximation. Appl. Math. Lett.
**2007**, 20, 382–386. [Google Scholar] [CrossRef][Green Version] - Albergante, L.; Mirkes, E.; Bac, J.; Chen, H.; Martin, A.; Faure, L.; Barillot, E.; Pinello, L.; Gorban, A.; Zinovyev, A. Robust and scalable learning of complex intrinsic dataset geometry via ElPiGraph. Entropy
**2020**, 22, 296. [Google Scholar] [CrossRef][Green Version] - Lähnemann, D.; Köster, J.; Szczurek, E.; McCarthy, D.J.; Hicks, S.C.; Robinson, M.D.; Vallejos, C.A.; Campbell, K.R.; Beerenwinkel, N.; Mahfouz, A.; et al. Eleven grand challenges in single-cell data science. Genome Biol.
**2020**, 21, 1–31. [Google Scholar] [CrossRef] - Chen, H.; Albergante, L.; Hsu, J.Y.; Lareau, C.A.; Lo Bosco, G.; Guan, J.; Zhou, S.; Gorban, A.N.; Bauer, D.E.; Aryee, M.J.; et al. Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM. Nat. Commun.
**2019**, 10, 1–14. [Google Scholar] [CrossRef][Green Version] - Sritharan, D.; Wang, S.; Hormoz, S. Computing the Riemannian curvature of image patch and single-cell RNA sequencing data manifolds using extrinsic differential geometry. Proc. Natl. Acad. Sci. USA
**2021**, 118, e2100473118. [Google Scholar] [CrossRef] - Radulescu, O.; Gorban, A.N.; Zinovyev, A.; Lilienbaum, A. Robust simplifications of multiscale biochemical networks. BMC Syst. Biol.
**2008**, 2, 86. [Google Scholar] [CrossRef][Green Version] - Gorban, A.N.; Zinovyev, A. Principal manifolds and graphs in practice: From molecular biology to dynamical systems. Int. J. Neural Syst.
**2010**, 20, 219–232. [Google Scholar] [CrossRef][Green Version] - Donoho, D.L. High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lect.
**2000**, 1, 1–32. [Google Scholar] - Gorban, A.N.; Tyukin, I.Y. Blessing of dimensionality: Mathematical foundations of the statistical physics of data. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci.
**2018**, 376, 20170237. [Google Scholar] [CrossRef][Green Version] - Kainen, P.C.; Kůrková, V. Quasiorthogonal dimension of euclidean spaces. Appl. Math. Lett.
**1993**, 6, 7–10. [Google Scholar] [CrossRef][Green Version] - Tyukin, I.Y.; Higham, D.J.; Gorban, A.N. On Adversarial Examples and Stealth Attacks in Artificial Intelligence Systems. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–6. [Google Scholar]
- Gorban, A.N.; Grechuk, B.; Mirkes, E.M.; Stasenko, S.V.; Tyukin, I.Y. High-Dimensional Separability for One- and Few-Shot Learning. Entropy
**2021**, 23, 1090. [Google Scholar] [CrossRef] [PubMed] - Amblard, E.; Bac, J.; Chervov, A.; Soumelis, V.; Zinovyev, A. Hubness reduction improves clustering and trajectory inference in single-cell transcriptomic data. bioRxiv
**2021**. [Google Scholar] [CrossRef] - Gionis, A.; Hinneburg, A.; Papadimitriou, S.; Tsaparas, P. Dimension Induced Clustering. In KDD ’05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining; Association for Computing Machinery: New York, NY, USA, 2005; pp. 51–60. [Google Scholar] [CrossRef]
- Allegra, M.; Facco, E.; Denti, F.; Laio, A.; Mira, A. Data segmentation based on the local intrinsic dimension. Sci. Rep.
**2020**, 10, 1–12. [Google Scholar] [CrossRef] [PubMed] - Grechuk, B.; Gorban, A.N.; Tyukin, I.Y. General stochastic separation theorems with optimal bounds. Neural Netw.
**2021**, 138, 33–56. [Google Scholar] [CrossRef]

**Figure 1.**Example usage: generating the Line–Disk–Ball dataset [10]), which has clusters of varying local ID, and coloring points by estimates of local ID obtained by id.lPCA.

**Figure 2.**Illustrating different ID method general characteristics: (

**A**) range of estimated ID values; (

**B**) ability to produce interpretable (positive finite value) result; (

**C**) sensitivity to feature redundancy (after duplicating matrix columns); (

**D**) uniform ID estimation across datasets of similar nature; (

**E**) computational time needed to compute ID for matrices of four characteristic sizes.

**Figure 3.**Characterizing

`OpenML`dataset collection in terms of ID estimates. (

**A**) PCA visualizations of datasets characterized by vectors of 19 ID measures. Size of the point corresponds to the logarithm of the number of matrix entries (${N}_{obj}\times {N}_{var}$). The color corresponds to the mean ID estimate taken as the mean of all ID measure z-scores. (

**B**) Loadings of various methods into the first and the second principal component from (

**A**). (

**C**) Visualization of the mean ID score as a function of data matrix shape. The color is the same as in (

**A**). (

**D**) Correlation matrix between different ID estimates computed over all analyzed datasets.

**Figure 4.**A gallery of UMAP plots computed for a selection of datasets from

`OpenML`collection, with indication of ID estimates, ranked by the ID value estimated using Fisher separability-based method (indicated in the left top corner). The ambient dimension of the data (number of features ${N}_{var}$) is indicated in the bottom left corner, and the color reflects the $ID/{N}_{var}$ ratio, from red (close to 0.0 value) to green (close to 1.0). On the right from the UMAP plot, all 19 ID measures are indicated, with color mapped to the value range, from green (small dimension) to red (high dimension).

**Table 1.**Summary table of ID methods characteristics. The qualitative score changes from “$---$” (worst) to “+++” (best).

Method Name | Short Name(s) | Ref(s) | Valid Result | Insensitivity to Redundancy | Uniform ID Estimate in Similar Datasets | Performance with Many Observations | Performance with Many Features |
---|---|---|---|---|---|---|---|

PCA Fukunaga-Olsen | PCA FO, PFO | [15,22] | +++ | +++ | +++ | +++ | +++ |

PCA Fan | PFN | [23] | +++ | +++ | +++ | +++ | +++ |

PCA maxgap | PMG | [56] | +++ | $---$ | + | +++ | +++ |

PCA ratio | PRT | [57] | +++ | +++ | + | +++ | +++ |

PCA participation ratio | PPR | [57] | +++ | +++ | ++ | +++ | +++ |

PCA Kaiser | PKS | [54,58] | +++ | − | +++ | +++ | +++ |

PCA broken stick | PBS | [55,59] | +++ | $--$ | +++ | +++ | +++ |

Correlation (fractal) dimensionality | CorrInt, CID | [18] | + | +++ | ++ | + | + |

Fisher separability | FisherS, FSH | [4,32] | ++ | +++ | +++ | ++ | +++ |

K-nearest neighbours | KNN | [27] | ++ | $--$ | $--$ | − | ++ |

Manifold-adaptive fractal dimension | MADA, MDA | [19] | − | +++ | +++ | − | + |

Minimum neighbor distance—ML | MIND_ML,MMk, MMi | [28] | +++ | +++ | ++ | ++ | + |

Maximum likelihood | MLE | [25] | ++ | +++ | ++ | ++ | + |

Methods of moments | MOM | [20] | +++ | +++ | +++ | ++ | + |

Estimation within tight localities | TLE | [33] | $--$ | +++ | +++ | ++ | + |

Minimal neighborhood information | TwoNN,TNN | [31] | ++ | +++ | +++ | ++ | +++ |

Angle and norm concentration | DANCo,DNC | [29] | + | +++ | +++ | $---$ | $---$ |

Expected simplex skewness | ESS | [56] | +++ | +++ | +++ | $---$ | $---$ |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Bac, J.; Mirkes, E.M.; Gorban, A.N.; Tyukin, I.; Zinovyev, A.
Scikit-Dimension: A Python Package for Intrinsic Dimension Estimation. *Entropy* **2021**, *23*, 1368.
https://doi.org/10.3390/e23101368

**AMA Style**

Bac J, Mirkes EM, Gorban AN, Tyukin I, Zinovyev A.
Scikit-Dimension: A Python Package for Intrinsic Dimension Estimation. *Entropy*. 2021; 23(10):1368.
https://doi.org/10.3390/e23101368

**Chicago/Turabian Style**

Bac, Jonathan, Evgeny M. Mirkes, Alexander N. Gorban, Ivan Tyukin, and Andrei Zinovyev.
2021. "Scikit-Dimension: A Python Package for Intrinsic Dimension Estimation" *Entropy* 23, no. 10: 1368.
https://doi.org/10.3390/e23101368