1. Introduction
There are several reasons to use dimensionality reduction (DR) in most data-based projects. The first reason is data-hungry problems [1]. This type of problem is very common in the modern world. In practice, it means that the number of observations is less than the number of attributes [2] or even less than the logarithm of the number of attributes [3,4]. The second reason for DR is the difference between the number of attributes and the intrinsic dimension (ID) of the data: there is no reason to spend resources on working with a large number of attributes when the ID of the data is essentially smaller. The difference between the number of attributes and the ID can depend on many factors. In [5], Mirkes et al. presented a table (Table 2 in [5]) with five ID values for 26 datasets. For some datasets, some of the ID values coincide with the number of attributes (for example, for cryotherapy and immunotherapy, the condition number dimension equals the number of attributes: 6 and 7, respectively), but for other datasets, the ID is several times smaller than the number of attributes (for example, for MiniBooNE particle identification, the number of attributes is 50, but the maximum of the tested IDs is 4).
There is no unique, exact definition of ID because many different IDs exist. Bac et al. [6] presented software to calculate 19 IDs, a list that is not exhaustive. In [6], Bac et al. wrote, “The well-known curse of dimensionality, which states that many problems become exponentially difficult in high dimensions, does not depend on the number of features, but on the dataset’s ID [7]. More precisely, the effects of the dimensionality curse are expected to be manifested when $\mathrm{ID} \gtrsim \log M$, where $M$ is the number of data points [8,9]”.
According to [10], there are two main DR approaches: filters and wrappers. Filtering methods do not require feedback on predictor/classifier performance; these methods can be considered universal methods to estimate IDs. In contrast, wrapping methods perform feature selection based on predictor/classifier feedback. A well-known example of a wrapping method is LASSO [11]. The set of selected features depends on the predictor/classifier used, which means that the ID can differ depending on the method. LASSO works for linear regression, generalised linear models, and some other models, but there is no k nearest neighbours (KNN) or decision tree version of it. In this study, we decided to consider only linear filtering DR methods.
There are two main classes of ID: projective and non-projective. For a projective ID, there is a procedure for projecting data points into a lower dimensional space. The most well-known projective DR methods are those based on principal components (PCs). There are many heuristics to identify the number of informative (useful, meaningful, etc.) PCs, some of which are considered below. There are also many non-projective IDs: the correlation (fractal) dimension [12], the manifold-adaptive fractal dimension [13], the method of moments [14], maximum likelihood [15,16,17], and estimators based on the concentration of measure [18,19,20,21,22,23,24].
There are also semi-projective methods (e.g., minimum spanning trees [25]). Such methods allow for the projection of points into a low-dimensional space but require full recalculation of the embedding if any points are added to or removed from the dataset.
Many definitions of ID are based on different properties of “dimensionality”. In this study, we consider only projective linear DR. Reviews of different IDs and their usage can be found in [6,26,27]. Macocco et al. considered IDs for discrete spaces [28].
Almost all linear methods of dimensionality reduction are based on PCs. The standard version of PCs was proposed by Karl Pearson in 1901 [29]. There are many generalisations of PCs for different problems, for example, to search for components that do not explain the overall variance of the dataset but instead select directions useful for solving the problem under consideration. Usually, such generalisations are called supervised principal components (SPCs); some SPCs can be found in [30,31,32,33,34]. Recently, a version of PCs was proposed for the domain adaptation problem [35]. There are also sparse PCs [36], specially developed for genomics problems, where the number of attributes is substantially (tens or even hundreds of times) greater than the number of observations. There are also nonlinear generalisations of PCs; some of them are presented in [37].
An alternative to PC-based linear DR is DR based on linear discriminant analysis (LDA) directions [38]. The main advantage of this method is that it searches for the directions most useful for classification. The main drawback is that it can define only $C-1$ dimensions, where $C$ is the number of classes. Since all benchmarks investigated in this paper are binary classification problems, the maximal LDA dimension for them is one, and this dimension is not appropriate for most of the considered benchmarks.
In this paper, we consider only projective linear DR methods based on PCs. We consider the standard version of PCs and the following rules for defining the number of informative PCs:
The Kaiser rule [39,40];
The broken stick rule [41];
The Fukunaga–Olsen rule, or the condition number of the covariance matrix rule [23,42].
A detailed description of these rules is presented in Section 3. All these rules are widely used, and the question of which of them is preferable remains open. In this study, we tried to answer this question.
We estimated the quality of DR based on the quality of the solution of the classification problem; from our point of view, this is essentially more informative and reasonable than a comparison of IDs.
Our goal is to identify heuristics for choosing the number of PCs that allow us to solve the classification problem successfully.
The general result was as expected: there is no universally best method of DR, but, on average, the Fukunaga–Olsen rule is better than the Kaiser and broken stick rules.
In [43], Ayesha et al. presented a comprehensive review of DR. The authors compared many different DR techniques with respect to their areas of application and general properties:
DR features: supervised/unsupervised, linear/nonlinear, ability to work with data with non-Gaussian distributions, etc. (see Table 2 in [43]);
Robustness to outliers, sensitivity to noise, and robustness to singularity of the data (see Table 3 in [43]);
Areas of data sources (see Table 4 in [43]);
Types of original data: text, images, signals, etc. (see Table 5 in [43]).
The authors did not consider the classification problem and did not compare DR methods according to the quality of the final classifier.
In [44], Tang et al. compared several DR methods for clustering using a purity-based measure (the microaverage of classification accuracy). Their work was similar to our study but considered different DR methods and clustering instead of classification.
In a paper titled “Dimensionality Reduction: A Comparative Review” [10], Van der Maaten et al. presented the results of a similar study. They studied 13 DR methods, one of which was PCA. They used the generalisation error, trustworthiness ((28) in [10]), and continuity ((29) in [10]) as quality measures for a nearest neighbour classifier. According to the authors, PCA does not have parameters, and the authors defined “the target dimensionality in the experiments by means of the maximum likelihood intrinsic dimensionality estimator [16]”. As we can see, the authors defined the number of used components by estimating the ID, and the choice of this particular ID estimator was not motivated. In our study, we compared three different estimations of ID.
In [45], Konstorum et al. compared four DR methods, including PCA, for mass cytometry datasets “with respect to computation time, neighborhood proportion error, residual variance, and ability to cluster known cell types and track differentiation trajectories”. As we can see, the classification problem was outside the scope of their consideration. Unfortunately, the authors simply chose to require a dimension of 2 without any explanation.
Most reviews of DR methods do not consider methods for estimating the number of informative PCs (see, for example, [43,46]). Other reviews and comparative studies simply used a stated method (the maximum likelihood intrinsic dimensionality estimator in [10]; the elbow rule or a fixed eigenvalue threshold, using only the PCs with eigenvalues greater than that threshold, in [47]). Deegalla et al. [48] estimated the number of PCs by exhaustive testing with the selected classifier. This may be the best approach, but it can be very expensive for classifiers other than KNN because of the long time required for fitting classifier parameters.
Another set of studies compared different heuristics for choosing the number of PCs. For example, in [41], Jackson described nine heuristics, including the Kaiser rule and the broken stick rule. The Fukunaga–Olsen rule, or the condition number of the covariance matrix rule, was published in 1971 but was not included in Jackson’s review. Jackson applied PCA to data with a known ID and simply compared the estimated number of PCs with the known ID. This is a robust measure, but it has no relation to the classification problem. The author found that the broken stick method was the best among the compared methods, but in our work, we found that for the classification problem, the broken stick method was the worst.
In [49], Ledesma considered the number of factors to retain and examined four heuristics. The only PCA heuristic included was the Kaiser rule.
In [50], Cangelosi et al. described several heuristics for application to cDNA microarray data. The only conclusion was that several heuristics provided consistent IDs for some databases.
Our goal is to recommend a single heuristic for automatic data pre-processing or for non-experienced researchers. The literature review above shows that, to the best of our knowledge, there has been no comparison of different heuristics for estimating the number of PCs necessary for the classification problem.
The remainder of this paper is organised as follows. Section 2 presents the list of benchmarks used in this study. Section 3 presents descriptions of the used IDs. Section 4 describes the classifiers used in this study. Section 5 presents the testing protocol and the methods of comparison. Section 6 presents the results of testing. Section 7 contains our conclusions.
3. Used Intrinsic Dimensions
In this study, we used only projective DR methods because we need not only to estimate the dimension but also to project the data into a lower dimensional space and solve the classification problem. Since we want to use cross validation to estimate the quality of classifiers (see Section 5), we cannot use semi-projective methods because these methods require complete recalculation of the projections into low-dimensional space if the data are changed. Calculating PCs and projecting on the basis of all data (training and test sets, or all folds) compromises the data because the test set would be used in data pre-processing. As a result, we calculated PCs for the training set only. This means that we have 100 sets of PCs for each dataset (one collection of PCs for each fold).
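As an illustration of this protocol, the following sketch (not the exact code used in this study) refits the PCA step on the training part of every fold, so the test fold never influences the principal components; the scikit-learn breast cancer data, the fixed choice of five PCs, and the KNN classifier are placeholder choices for the example.

```python
# Sketch: per-fold PCA to avoid test-set leakage (illustrative, not the study's exact protocol).
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # stand-in binary classification benchmark

# The PCA step is refitted on the training part of every fold,
# so the test fold never contributes to the principal components.
model = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),            # the number of PCs would come from PCA-K/BS/CN
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean())
```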
We cannot use LDA-based DR methods because this means using projections into one-dimensional space, which is too radical for most datasets.
As a result, we can use only PC-based DR methods, and the question is, “What is the best rule to define the number of PCs to use?”. We consider three of the most popular formal methods of estimating the number of informative PCs:
The Kaiser rule [39,40];
The broken stick rule [41];
The Fukunaga–Olsen rule, or the condition number of the covariance matrix rule [23,42].
There are several informal methods, like the elbow rule [41,79], but these methods are not appropriate for automatic work. There are also methods based on explaining a desired fraction of the total data variance (usually 95%). We do not consider such methods because it is very difficult to clearly explain why the specified value (e.g., 95%) is an appropriate choice for the problem under consideration.
In the next paragraph and in the following three subsections, we follow [5].
Let us consider a dataset ($X$) with $n$ records ($x_1, \ldots, x_n$) and $d$ real-valued attributes ($x_i \in \mathbb{R}^d$). The empirical covariance matrix ($\Sigma$) is symmetric and non-negative definite. The eigenvalues of the matrix $\Sigma$ are non-negative real numbers, denoted in descending order as $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_d \ge 0$. PCs are defined using the eigenvectors of matrix $\Sigma$. If the $i$th eigenvector ($v_i$) is defined, then the $i$th principal coordinate of the data vector ($x$) is the inner product $(x, v_i)$. The Fraction of Variance Explained (FVE) by the $i$th principal component for the dataset ($X$) is
$$f_i = \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}.$$
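For illustration, the eigenvalues and FVE values defined above can be computed as in the following sketch (the randomly generated dataset is only a placeholder):

```python
# Sketch: eigenvalues of the empirical covariance matrix and the FVE of each PC.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # placeholder dataset: n = 200 records, d = 10 attributes

Xc = X - X.mean(axis=0)                  # centre the data
cov = np.cov(Xc, rowvar=False)           # empirical covariance matrix (d x d)
eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigenvalues in descending order: lambda_1 >= ... >= lambda_d
fve = eigvals / eigvals.sum()            # f_i = lambda_i / sum_j lambda_j
print(fve)
```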
3.1. The Kaiser Rule
The Kaiser rule [39,40] states that all principal components with an FVE greater than or equal to the average FVE are informative. The average FVE is $1/d$. Thus, components with $f_i \ge 1/d$ are considered informative and should be retained, and components with $f_i < 1/d$ should not. Another popular version uses a threshold that is twice as low ($f_i \ge 1/(2d)$) and retains more components.
Hereinafter, we use PCA-K as the name of the DR method with the number of PCs defined by the Kaiser rule.
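A minimal sketch of the Kaiser rule, assuming the eigenvalues of the empirical covariance matrix are already available, is shown below (the example eigenvalues are arbitrary):

```python
# Sketch of the Kaiser rule (PCA-K): keep components whose FVE is at least the average FVE 1/d.
import numpy as np

def kaiser_dimension(eigvals, relaxed=False):
    """Number of informative PCs by the Kaiser rule.

    eigvals: eigenvalues of the empirical covariance matrix (any order).
    relaxed: if True, use the twice-lower threshold 1/(2d).
    """
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    fve = eigvals / eigvals.sum()
    d = len(eigvals)
    threshold = 1.0 / (2 * d) if relaxed else 1.0 / d
    return int(np.sum(fve >= threshold))

print(kaiser_dimension([5.0, 2.0, 1.0, 0.5, 0.5]))   # -> 2 for these arbitrary eigenvalues
```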
3.2. The Broken Stick Rule
The broken stick rule [41] compares the set of FVE values $f_1, \ldots, f_d$ with the distribution of random intervals that appear if we break the stick at $d-1$ points randomly and independently sampled from the uniform distribution. Consider a unit interval (stick) randomly broken into $d$ fragments. Let us number these fragments in descending order of their length: $l_1 \ge l_2 \ge \ldots \ge l_d$. The expected length of fragment $i$ is [41]
$$b_i = \frac{1}{d} \sum_{j=i}^{d} \frac{1}{j}.$$
The broken stick rule states that the first $k$ principal components are informative, where $k$ is the maximum number such that $f_i > b_i$ for all $i \le k$.
Hereinafter, we use PCA-BS as the name of the DR method with the number of PCs defined by the broken stick rule.
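A corresponding sketch of the broken stick rule is shown below; a zero result is increased to one, following the convention described in Section 3.4, and the example eigenvalues are arbitrary:

```python
# Sketch of the broken stick rule (PCA-BS), assuming the covariance-matrix eigenvalues are given.
import numpy as np

def broken_stick_dimension(eigvals):
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    fve = eigvals / eigvals.sum()                    # f_i
    d = len(eigvals)
    # b_i = (1/d) * sum_{j=i}^{d} 1/j: expected length of the i-th longest fragment
    b = np.array([np.sum(1.0 / np.arange(i, d + 1)) / d for i in range(1, d + 1)])
    k = 0
    while k < d and fve[k] > b[k]:                   # largest k with f_i > b_i for all i <= k
        k += 1
    return max(k, 1)                                 # a zero result is increased to 1 (Section 3.4)

print(broken_stick_dimension([5.0, 2.0, 1.0, 0.5, 0.5]))   # -> 1 for these arbitrary eigenvalues
```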
3.3. The Fukunaga–Olsen or Condition Number Rule
In many problems, the empirical covariance matrix degenerates or almost degenerates; that is, the smallest eigenvalues are much smaller than the largest one. Consider the projection of the data onto the first $k$ principal components: $x \mapsto V^{\top}x$, where the columns of the matrix ($V$) are the first $k$ eigenvectors of matrix $\Sigma$. The eigenvalues of the empirical covariance matrix of the reduced data are $\lambda_1, \ldots, \lambda_k$. After DR, the condition number (the ratio of the greatest eigenvalue to the lowest eigenvalue) [80] of the reduced covariance matrix should not be too high in order to avoid the multicollinearity problem. The relevant definition [23,42] of intrinsic dimensionality refers directly to the condition number of the reduced covariance matrix: $k$ is the number of informative principal components if it is the smallest number such that $\lambda_{k+1} \le \lambda_1 / C$, where $C$ is the specified condition number (for example, $C = 10$). This approach is hereafter referred to as PCA-CN. The PCA-CN intrinsic dimensionality is defined as the number of eigenvalues of the covariance matrix exceeding a specified fraction of its largest eigenvalue [42].
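A minimal sketch of the PCA-CN rule is shown below; the default value $C = 10$ is an illustrative choice, and the example eigenvalues are arbitrary:

```python
# Sketch of the Fukunaga-Olsen / condition number rule (PCA-CN), assuming the eigenvalues are given.
import numpy as np

def condition_number_dimension(eigvals, C=10.0):
    """Number of eigenvalues exceeding the fraction 1/C of the largest eigenvalue;
    equivalently, the largest reduced space whose condition number stays below C."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    return int(np.sum(eigvals > eigvals[0] / C))

print(condition_number_dimension([5.0, 2.0, 1.0, 0.6, 0.04]))   # -> 4 for these arbitrary eigenvalues
```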
3.4. General Comparison of Three IDs
Before studying the question of which of the three listed IDs is better, it is necessary to investigate the following question: are these IDs statistically significantly different? At first glance at Table 1, it seems that the broken stick rule is too radical: for six datasets, the FVE of the first PC is less than the first broken stick value ($f_1 < b_1$). This means that, according to the broken stick rule, the ID of the data is zero. In such cases, we increase the dimension to 1. On the other hand, there are datasets for which the PCA-BS value is greater than the PCA-K or PCA-CN value. Interestingly, however, PCA-BS was never the preferred ID for any of the considered datasets.
To test the significance of differences in dimensions, Student’s t-test for dependent samples (paired t-test) [81] can be used if we want to compare mean values; the Wilcoxon signed-rank test (WSR test) [82] can be used if we want to compare the medians of two samples; and the Kolmogorov–Smirnov test (KS test) [83] can be used if we want to compare two empirical distributions. We are aware that the KS test was developed for two independent samples, but we could not find a statistical test for paired samples that compares two distributions. Since we have only 22 datasets, we decided to use the 95% confidence level (the critical value is 5%).
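As an illustration, the three tests can be applied to two paired samples of estimated IDs with SciPy as in the following sketch; the ID values are hypothetical placeholders and do not correspond to Table 1:

```python
# Sketch: the three paired comparisons applied to two samples of estimated IDs.
import numpy as np
from scipy import stats

id_pca_k = np.array([6, 4, 9, 3, 7, 5, 8, 2, 6, 4])    # hypothetical PCA-K IDs, one per dataset
id_pca_bs = np.array([2, 1, 4, 1, 3, 2, 3, 1, 2, 1])   # hypothetical PCA-BS IDs, one per dataset

p_t = stats.ttest_rel(id_pca_k, id_pca_bs).pvalue      # paired t-test: compares means
p_w = stats.wilcoxon(id_pca_k, id_pca_bs).pvalue       # Wilcoxon signed-rank: compares medians
p_ks = stats.ks_2samp(id_pca_k, id_pca_bs).pvalue      # two-sample KS: compares distributions

print(p_t, p_w, p_ks)
```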
We applied all three listed tests to the dimensions presented in Table 1. Since we have only three IDs, we can ignore the multiple testing problem [84]. The p-values of the three tests for the three pairs formed by the original dimension and one of the IDs, and for the three pairs of IDs, are presented in Table 2. For the comparison of the original dimension with the IDs, we can see that, formally, all IDs are significantly different from the number of attributes; however, the t-test has a relatively large p-value for PCA-BS. In the ID comparisons, the t-test shows that we do not have enough evidence to reject the hypothesis that all three samples of IDs were taken from populations with the same mean value. From the KS test, we can conclude that we do not have enough evidence to reject the hypothesis that the distributions of IDs for PCA-K and PCA-CN are identical, but both of these distributions are statistically significantly different from the distribution of IDs for PCA-BS at a confidence level of 95%. From the WSR test, we can conclude that the median of PCA-BS is statistically significantly less than the median of PCA-K and the median of PCA-CN at a significance level of 99%. This finding confirms the well-known fact that the Kaiser rule is too conservative and considers too many PCs as informative [79] compared to the broken stick rule. On the other hand, the Fukunaga–Olsen rule is even more conservative than the Kaiser rule.
Let us compare the three listed rules as possible IDs. What properties do we expect from any dimension? This value must, at the very least, be uniquely defined. This means that if our data have a dimension of $d$, then the data have a dimension of $d$ in any space that includes these data. This also means that if we originally had a dataset ($X$) in a space ($S$) and applied the DR method ($P$), then we should have $\mathrm{ID}(P(X)) = \mathrm{ID}(X) \le \dim S$, where $\dim S$ is the dimension of the space $S$. Moreover, we can expect that after a second DR, we will have the same dimension: $\mathrm{ID}(P(P(X))) = \mathrm{ID}(P(X))$.
In [5], Mirkes et al. presented a proof that the Kaiser rule and the broken stick rule almost never satisfy this property (see pages 13–15). On the other hand, the PCA-CN ID satisfies the described property.
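The following sketch illustrates how this property can be checked empirically for the Kaiser rule: estimate the ID, project the data onto that many PCs, and re-estimate the ID of the projection (the anisotropic random data are only a placeholder):

```python
# Sketch: checking whether an ID estimate is stable under its own projection,
# here for the Kaiser rule on placeholder data.
import numpy as np

def kaiser_dim(X):
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    return int(np.sum(eigvals / eigvals.sum() >= 1.0 / len(eigvals)))

def project_on_first_pcs(X, k):
    Xc = X - X.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ vecs[:, ::-1][:, :k]    # columns of the reversed matrix are the leading eigenvectors

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20)) * np.linspace(5.0, 0.1, 20)   # anisotropic placeholder data

k1 = kaiser_dim(X)
k2 = kaiser_dim(project_on_first_pcs(X, k1))
print(k1, k2)   # for the Kaiser rule, k2 is typically smaller than k1, so the estimate is not stable
```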
6. Results
During the described testing protocol, it was found that for some datasets, the fitting of models faced the multicollinearity problem [93] for logistic regression (the QSAR biodegradation, MiniBooNE particle identification, Musk 2, and SPECTF Heart datasets). In our version of LDA (see the fisherDir function in forPaper.py in [90]), we used Tikhonov regularisation [94] to avoid this problem. We did not apply any other approach to resolve this problem that could influence the results, as our goal was to compare IDs. The multicollinearity problem was observed only for the original (non-reduced) datasets, which supports the idea that DR is necessary for some data.
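For illustration, a generic Tikhonov-regularised Fisher direction can be computed as in the following sketch; this is not the fisherDir function from [90], and the regularisation parameter alpha is an arbitrary example value:

```python
# Sketch: a generic Fisher (LDA) direction with Tikhonov regularisation; an illustration,
# not the fisherDir function from the study's repository [90].
import numpy as np

def fisher_direction(X, y, alpha=1e-3):
    """Direction maximising between-class separation, with S_W replaced by
    S_W + alpha * I to avoid a singular (multicollinear) within-class scatter matrix."""
    X0, X1 = X[y == 0], X[y == 1]
    Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) + np.cov(X1, rowvar=False) * (len(X1) - 1)
    w = np.linalg.solve(Sw + alpha * np.eye(X.shape[1]), X1.mean(axis=0) - X0.mean(axis=0))
    return w / np.linalg.norm(w)

# Example on nearly collinear data, where the unregularised S_W would be close to singular.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base, base, base]) + rng.normal(scale=1e-6, size=(200, 3))
y = (base[:, 0] > 0).astype(int)
print(fisher_direction(X, y))
```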
In Table A1, Table A2, Table A3, Table A4, Table A5 and Table A6, we can see that the average-based, t-test-based, and WSR-based rankings are consistent: for all classifiers, they give almost the same mean ranks of IDs (see Table 4). As we can see, the exact values of the mean ranks are different, but for all ranking methods and for all classifiers, the best ID (minimal mean rank) is associated with PCA-CN, and the worst ID (maximal mean rank) is associated with PCA-BS.
The results of the test of the significance of differences in mean ranking are also presented in Table 4. We can see that for KNN and LDA under all three rankings, and for LR under the t-test and WSR-test rankings, PCA-K is statistically significantly better than PCA-BS. For LR under the average ranking, this difference is statistically insignificant.
For all three rankings and for all three classifiers, PCA-CN is statistically significantly better than PCA-BS.
Differences between PCA-CN and PCA-K are statistically insignificant for all rankings and all classifiers, but the mean rank of PCA-CN is less than the mean rank of PCA-K for all rankings and all classifiers.
These observations correlate with the findings reported in Section 3.4: the PCA-BS ID is statistically significantly different from the PCA-K and PCA-CN IDs.