1. Introduction
There are several reasons to use dimensionality reduction (DR) in most data-based projects. The first reason is data-hungry problems [1]. This type of problem is very common in the modern world. In practice, it means that the number of observations is less than the number of attributes [2] or even less than the logarithm of the number of attributes [3,4]. The second reason for DR is the difference between the number of attributes and the intrinsic dimension (ID) of the data: there is no reason to spend resources on working with a large number of attributes when the ID of the data is essentially smaller. The difference between the number of attributes and the ID can depend on many factors. In [5], Mirkes et al. presented a table (Table 2 in [5]) with five ID values for 26 datasets. For some datasets, some of the ID values coincide with the number of attributes (for example, for cryotherapy and immunotherapy, the condition number dimension equals the number of attributes: 6 and 7, respectively), but for other datasets, the ID is several times smaller than the number of attributes (for example, for MiniBooNE particle identification, the number of attributes is 50, but the maximum of the tested IDs is 4).
There is no unique, exact definition of ID because many different IDs exist. Bac et al. [6] presented software to calculate 19 IDs, a list that is not exhaustive. In [6], Bac et al. wrote, “The well-known curse of dimensionality, which states that many problems become exponentially difficult in high dimensions, does not depend on the number of features, but on the dataset’s ID [7]. More precisely, the effects of the dimensionality curse are expected to be manifested when $\mathrm{ID} \gtrsim \log M$, where $M$ is the number of data points [8,9]”.
According to [10], there are two main DR approaches: filters and wrappers. Filtering methods do not require feedback on predictor/classifier performance; these methods can be considered universal methods to estimate IDs. In contrast, wrapping methods perform feature selection based on predictor/classifier feedback. A well-known example of a wrapping method is LASSO [11]. The set of selected features depends on the predictor/classifier used, which means that the ID can differ depending on the method. LASSO works for linear regression, generalised linear models, and some other models, but there is no k nearest neighbours (KNN) or decision tree version of it. In this study, we decided to consider only linear filtering DR methods.
There are two main classes of ID: projective and non-projective. For a projective ID, there is a procedure for projecting data points into a lower dimensional space. The most well-known projective DR methods are those based on principal components (PCs). There are many heuristics to identify the number of informative (useful, meaningful, etc.) PCs, some of which are considered below. There are also many non-projective IDs: the correlation (fractal) dimension [12], the manifold-adaptive fractal dimension [13], the method of moments [14], maximum likelihood [15,16,17], and estimators based on the concentration of measure [18,19,20,21,22,23,24].
There are also semi-projective methods (e.g., minimum spanning trees [25]). Such methods allow for the projection of points into a low-dimensional space but require full recalculation of the embedding if any points are added to or removed from the dataset.
Many definitions of ID are based on different properties of “dimensionality”. In this study, we consider only projective linear DR. Reviews of different IDs and their usage can be found in [6,26,27]. Macocco et al. considered IDs for discrete spaces [28].
Almost all linear methods of dimensionality reduction are based on PCs. The standard version of PCs was proposed by Karl Pearson in 1901 [29]. There are many generalisations of PCs for different problems, for example, to search for components that do not explain the overall variance of the dataset but instead select directions useful for solving the problem under consideration. Usually, such generalisations are called supervised principal components (SPCs); some SPCs can be found in [30,31,32,33,34]. Recently, a version of PCs was proposed for the domain adaptation problem [35]. There are also sparse PCs [36], specially developed for genomics problems, where the number of attributes is substantially (tens or even hundreds of times) greater than the number of observations. There are also nonlinear generalisations of PCs; some of them are presented in [37].
An alternative to PC-based linear DR is DR based on linear discriminant analysis (LDA) directions [38]. The main advantage of this method is that it searches for the directions most useful for classification. The main drawback is that it can define only $C-1$ dimensions, where $C$ is the number of classes. Since all benchmarks investigated in this paper are binary classification problems, the maximal LDA dimension for them is one, and this dimension is not appropriate for most of the considered benchmarks.
In this paper, we consider only projective linear DR methods based on PCs. We consider the standard version of PCs and the following rules for defining the number of informative PCs:
The Kaiser rule [39,40];
The broken stick rule [41];
The Fukunaga–Olsen rule, or the condition number of the covariance matrix rule [23,42].
A detailed description of these rules is presented in Section 3. All these rules are widely used, and the question of which of them is preferable remains open. In this study, we tried to answer this question.
We estimated the quality of DR based on the quality of the solution of the classification problem; from our point of view, this is essentially more informative and reasonable than a comparison of IDs.
Our goal is to identify heuristics for choosing the number of PCs that allow us to solve the classification problem successfully.
The general result was as expected: there is no universally best method of DR, but, on average, the Fukunaga–Olsen rule is better than the Kaiser and broken stick rules.
In [43], Ayesha et al. presented a comprehensive review of DR. The authors compared many different DR techniques with respect to their areas of application and general properties:
DR features: supervised/unsupervised, linear/nonlinear, ability to work with data with non-Gaussian distributions, etc. (see Table 2 in [43]);
Robustness to outliers, sensitivity to noise, and robustness to singularity of the data (see Table 3 in [43]);
Areas of data sources (see Table 4 in [43]);
Types of original data: text, images, signals, etc. (see Table 5 in [43]).
The authors did not consider the classification problem and did not compare DR methods according to the quality of the final classifier.
In [44], Tang et al. compared several DR methods for clustering using a purity-based measure (the microaverage of classification accuracy). Their work was similar to our study but considered different DR methods and clustering instead of classification.
In a paper titled “Dimensionality Reduction: A Comparative Review” [10], Van der Maaten et al. presented the results of a similar study. They studied 13 DR methods, one of which was PCA. They used the generalisation error, trustworthiness ((28) in [10]), and continuity ((29) in [10]) as quality measures for a nearest neighbour classifier. According to the authors, PCA does not have parameters, and the authors defined “the target dimensionality in the experiments by means of the maximum likelihood intrinsic dimensionality estimator [16]”. As we can see, the authors defined the number of used components by estimating the ID, and the choice of this particular ID estimator was not motivated. In our study, we compared three different estimations of ID.
In [45], Konstorum et al. compared four DR methods, including PCA, for mass cytometry datasets “with respect to computation time, neighborhood proportion error, residual variance, and ability to cluster known cell types and track differentiation trajectories”. As we can see, the classification problem was outside the scope of their consideration. Unfortunately, the authors simply chose to require a dimension of 2 without any explanation.
Most reviews of DR methods do not consider methods for estimating the number of informative PCs (see, for example, [43,46]). Other reviews and comparative studies simply used a stated method (the maximum likelihood intrinsic dimensionality estimator in [10]; the elbow rule or a fixed eigenvalue threshold, using only the PCs with eigenvalues greater than that threshold, in [47]). Deegalla et al. [48] estimated the number of PCs by exhaustive testing with the selected classifier. This may be the best approach, but it can be very expensive for classifiers other than KNN because of the long time required for fitting classifier parameters.
Another set of studies compared different heuristics for choosing the number of PCs. For example, in [41], Jackson described nine heuristics, including the Kaiser rule and the broken stick rule. The Fukunaga–Olsen rule, or the condition number of the covariance matrix rule, was published in 1971 but was not included in Jackson’s review. Jackson applied PCA to data with a known ID and simply compared the estimated number of PCs with the known ID. This is a robust measure, but it has no relation to the classification problem. The author found that the broken stick method was the best among the compared methods, but in our work, we found that for the classification problem, the broken stick method was the worst.
In [49], Ledesma considered the number of factors to retain and examined four heuristics. The only PCA heuristic included was the Kaiser rule.
In [50], Cangelosi et al. described several heuristics for application to cDNA microarray data. The only conclusion was that several heuristics provided consistent IDs for some databases.
Our goal is to recommend a single heuristic for automatic data pre-processing or for non-experienced researchers. The literature review above shows that, to the best of our knowledge, there has been no comparison of different heuristics for estimating the number of PCs necessary for the classification problem.
The remainder of this paper is organised as follows. Section 2 presents the list of benchmarks used in this study. Section 3 presents descriptions of the used IDs. Section 4 describes the classifiers used in this study. Section 5 presents the testing protocol and the methods of comparison. Section 6 presents the results of testing. Section 7 contains our conclusions.
3. Used Intrinsic Dimensions
In this study, we used only projective DR methods because we need not only to estimate the dimension but also to project the data into a lower dimensional space and solve the classification problem. Since we want to use cross validation to estimate the quality of classifiers (see Section 5), we cannot use semi-projective methods because these methods require complete recalculation of the projections into low-dimensional space if the data are changed. Calculating PCs and projecting on the basis of all data (training and test sets, or all folds) compromises the data because the test set would be used in data pre-processing. As a result, we calculated PCs for the training set only. This means that we have 100 sets of PCs for each dataset (one collection of PCs for each fold).
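As an illustration of this protocol, the following sketch (not the exact code used in this study) refits the PCA step on the training part of every fold, so the test fold never influences the principal components; the scikit-learn breast cancer data, the fixed choice of five PCs, and the KNN classifier are placeholder choices for the example.

```python
# Sketch: per-fold PCA to avoid test-set leakage (illustrative, not the study's exact protocol).
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # stand-in binary classification benchmark

# The PCA step is refitted on the training part of every fold,
# so the test fold never contributes to the principal components.
model = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),            # the number of PCs would come from PCA-K/BS/CN
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean())
```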
We cannot use LDA-based DR methods because this means using projections into one-dimensional space, which is too radical for most datasets.
As a result, we can use only PC-based DR methods, and the question is, “What is the best rule to define the number of PCs to use?”. We consider three of the most popular formal methods of estimating the number of informative PCs:
The Kaiser rule [39,40];
The broken stick rule [41];
The Fukunaga–Olsen rule, or the condition number of the covariance matrix rule [23,42].
There are several informal methods, like the elbow rule [41,79], but these methods are not appropriate for automatic work. There are also methods based on explaining a desired fraction of the total data variance (usually 95%). We do not consider such methods because it is very difficult to clearly explain why the specified value (e.g., 95%) is an appropriate choice for the problem under consideration.
In the next paragraph and in the following three subsections, we follow [5].
Let us consider a dataset ($X$) with $n$ records ($x_1, \ldots, x_n$) and $d$ real-valued attributes ($x_i \in \mathbb{R}^d$). The empirical covariance matrix ($\Sigma$) is symmetric and non-negative definite. The eigenvalues of the matrix $\Sigma$ are non-negative real numbers, denoted in descending order as $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_d \ge 0$. PCs are defined using the eigenvectors of matrix $\Sigma$. If the $i$th eigenvector ($v_i$) is defined, then the $i$th principal coordinate of the data vector ($x$) is the inner product $(x, v_i)$. The Fraction of Variance Explained (FVE) by the $i$th principal component for the dataset ($X$) is
$$f_i = \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}.$$
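For illustration, the eigenvalues and FVE values defined above can be computed as in the following sketch (the randomly generated dataset is only a placeholder):

```python
# Sketch: eigenvalues of the empirical covariance matrix and the FVE of each PC.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # placeholder dataset: n = 200 records, d = 10 attributes

Xc = X - X.mean(axis=0)                  # centre the data
cov = np.cov(Xc, rowvar=False)           # empirical covariance matrix (d x d)
eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigenvalues in descending order: lambda_1 >= ... >= lambda_d
fve = eigvals / eigvals.sum()            # f_i = lambda_i / sum_j lambda_j
print(fve)
```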
3.1. The Kaiser Rule
The Kaiser rule [39,40] states that all principal components with an FVE greater than or equal to the average FVE are informative. The average FVE is $1/d$. Thus, components with $f_i \ge 1/d$ are considered informative and should be retained, and components with $f_i < 1/d$ should not. Another popular version uses a threshold that is twice as low ($f_i \ge 1/(2d)$) and retains more components.
Hereinafter, we use PCA-K as the name of the DR method with the number of PCs defined by the Kaiser rule.
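A minimal sketch of the Kaiser rule, assuming the eigenvalues of the empirical covariance matrix are already available, is shown below (the example eigenvalues are arbitrary):

```python
# Sketch of the Kaiser rule (PCA-K): keep components whose FVE is at least the average FVE 1/d.
import numpy as np

def kaiser_dimension(eigvals, relaxed=False):
    """Number of informative PCs by the Kaiser rule.

    eigvals: eigenvalues of the empirical covariance matrix (any order).
    relaxed: if True, use the twice-lower threshold 1/(2d).
    """
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    fve = eigvals / eigvals.sum()
    d = len(eigvals)
    threshold = 1.0 / (2 * d) if relaxed else 1.0 / d
    return int(np.sum(fve >= threshold))

print(kaiser_dimension([5.0, 2.0, 1.0, 0.5, 0.5]))   # -> 2 for these arbitrary eigenvalues
```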
3.2. The Broken Stick Rule
The broken stick rule [41] compares the set of FVE values $f_1, \ldots, f_d$ with the distribution of random intervals that appear if we break the stick at $d-1$ points randomly and independently sampled from the uniform distribution. Consider a unit interval (stick) randomly broken into $d$ fragments. Let us number these fragments in descending order of their length: $l_1 \ge l_2 \ge \ldots \ge l_d$. The expected length of fragment $i$ is [41]
$$b_i = \frac{1}{d} \sum_{j=i}^{d} \frac{1}{j}.$$
The broken stick rule states that the first $k$ principal components are informative, where $k$ is the maximum number such that $f_i > b_i$ for all $i \le k$.
Hereinafter, we use PCA-BS as the name of the DR method with the number of PCs defined by the broken stick rule.
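A corresponding sketch of the broken stick rule is shown below; a zero result is increased to one, following the convention described in Section 3.4, and the example eigenvalues are arbitrary:

```python
# Sketch of the broken stick rule (PCA-BS), assuming the covariance-matrix eigenvalues are given.
import numpy as np

def broken_stick_dimension(eigvals):
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    fve = eigvals / eigvals.sum()                    # f_i
    d = len(eigvals)
    # b_i = (1/d) * sum_{j=i}^{d} 1/j: expected length of the i-th longest fragment
    b = np.array([np.sum(1.0 / np.arange(i, d + 1)) / d for i in range(1, d + 1)])
    k = 0
    while k < d and fve[k] > b[k]:                   # largest k with f_i > b_i for all i <= k
        k += 1
    return max(k, 1)                                 # a zero result is increased to 1 (Section 3.4)

print(broken_stick_dimension([5.0, 2.0, 1.0, 0.5, 0.5]))   # -> 1 for these arbitrary eigenvalues
```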
3.3. The Fukunaga–Olsen or Condition Number Rule
In many problems, the empirical covariance matrix degenerates or almost degenerates; that is, the smallest eigenvalues are much smaller than the largest one. Consider the projection of the data onto the first $k$ principal components: $x \mapsto V^{\top}x$, where the columns of the matrix ($V$) are the first $k$ eigenvectors of matrix $\Sigma$. The eigenvalues of the empirical covariance matrix of the reduced data are $\lambda_1, \ldots, \lambda_k$. After DR, the condition number (the ratio of the greatest eigenvalue to the lowest eigenvalue) [80] of the reduced covariance matrix should not be too high in order to avoid the multicollinearity problem. The relevant definition [23,42] of intrinsic dimensionality refers directly to the condition number of the reduced covariance matrix: $k$ is the number of informative principal components if it is the smallest number such that $\lambda_{k+1} \le \lambda_1 / C$, where $C$ is the specified condition number (for example, $C = 10$). This approach is hereafter referred to as PCA-CN. The PCA-CN intrinsic dimensionality is defined as the number of eigenvalues of the covariance matrix exceeding a specified fraction of its largest eigenvalue [42].
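A minimal sketch of the PCA-CN rule is shown below; the default value $C = 10$ is an illustrative choice, and the example eigenvalues are arbitrary:

```python
# Sketch of the Fukunaga-Olsen / condition number rule (PCA-CN), assuming the eigenvalues are given.
import numpy as np

def condition_number_dimension(eigvals, C=10.0):
    """Number of eigenvalues exceeding the fraction 1/C of the largest eigenvalue;
    equivalently, the largest reduced space whose condition number stays below C."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    return int(np.sum(eigvals > eigvals[0] / C))

print(condition_number_dimension([5.0, 2.0, 1.0, 0.6, 0.04]))   # -> 4 for these arbitrary eigenvalues
```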
3.4. General Comparison of Three IDs
Before studying the question of which of the three listed IDs is better, it is necessary to investigate the following question: are these IDs statistically significantly different? At first glance at Table 1, it seems that the broken stick rule is too radical: for six datasets, the FVE of the first PC is less than the first broken stick value ($f_1 < b_1$). This means that, according to the broken stick rule, the ID of the data is zero. In such cases, we increase the dimension to 1. On the other hand, there are datasets for which the PCA-BS value is greater than the PCA-K or PCA-CN value. Interestingly, however, PCA-BS was never the preferred ID for any of the considered datasets.
To test the significance of differences in dimensions, Student’s t-test for dependent samples (paired t-test) [81] can be used if we want to compare mean values; the Wilcoxon signed-rank test (WSR test) [82] can be used if we want to compare the medians of two samples; and the Kolmogorov–Smirnov test (KS test) [83] can be used if we want to compare two empirical distributions. We are aware that the KS test was developed for two independent samples, but we could not find a statistical test for paired samples that compares two distributions. Since we have only 22 datasets, we decided to use the 95% confidence level (the critical value is 5%).
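As an illustration, the three tests can be applied to two paired samples of estimated IDs with SciPy as in the following sketch; the ID values are hypothetical placeholders and do not correspond to Table 1:

```python
# Sketch: the three paired comparisons applied to two samples of estimated IDs.
import numpy as np
from scipy import stats

id_pca_k = np.array([6, 4, 9, 3, 7, 5, 8, 2, 6, 4])    # hypothetical PCA-K IDs, one per dataset
id_pca_bs = np.array([2, 1, 4, 1, 3, 2, 3, 1, 2, 1])   # hypothetical PCA-BS IDs, one per dataset

p_t = stats.ttest_rel(id_pca_k, id_pca_bs).pvalue      # paired t-test: compares means
p_w = stats.wilcoxon(id_pca_k, id_pca_bs).pvalue       # Wilcoxon signed-rank: compares medians
p_ks = stats.ks_2samp(id_pca_k, id_pca_bs).pvalue      # two-sample KS: compares distributions

print(p_t, p_w, p_ks)
```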
We applied all three listed tests to the dimensions presented in Table 1. Since we have only three IDs, we can ignore the multiple testing problem [84]. The p-values of the three tests for the three pairs formed by the original dimension and one of the IDs, and for the three pairs of IDs, are presented in Table 2. For the comparison of the original dimension with the IDs, we can see that, formally, all IDs are significantly different from the number of attributes; however, the t-test has a relatively large p-value for PCA-BS. In the ID comparisons, the t-test shows that we do not have enough evidence to reject the hypothesis that all three samples of IDs were taken from populations with the same mean value. From the KS test, we can conclude that we do not have enough evidence to reject the hypothesis that the distributions of IDs for PCA-K and PCA-CN are identical, but both of these distributions are statistically significantly different from the distribution of IDs for PCA-BS at a confidence level of 95%. From the WSR test, we can conclude that the median of PCA-BS is statistically significantly less than the median of PCA-K and the median of PCA-CN at a significance level of 99%. This finding confirms the well-known fact that the Kaiser rule is too conservative and considers too many PCs as informative [79] compared to the broken stick rule. On the other hand, the Fukunaga–Olsen rule is even more conservative than the Kaiser rule.
Let us compare the three listed rules as possible IDs. What properties do we expect from any dimension? This value must, at the very least, be uniquely defined. This means that if our data have a dimension of $d$, then the data have a dimension of $d$ in any space that includes these data. This also means that if we originally had a dataset ($X$) in a space ($S$) and applied the DR method ($P$), then we should have $\mathrm{ID}(P(X)) = \mathrm{ID}(X) \le \dim S$, where $\dim S$ is the dimension of the space $S$. Moreover, we can expect that after a second DR, we will have the same dimension: $\mathrm{ID}(P(P(X))) = \mathrm{ID}(P(X))$.
In [5], Mirkes et al. presented a proof that the Kaiser rule and the broken stick rule almost never satisfy this property (see pages 13–15). On the other hand, the PCA-CN ID satisfies the described property.
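The following sketch illustrates how this property can be checked empirically for the Kaiser rule: estimate the ID, project the data onto that many PCs, and re-estimate the ID of the projection (the anisotropic random data are only a placeholder):

```python
# Sketch: checking whether an ID estimate is stable under its own projection,
# here for the Kaiser rule on placeholder data.
import numpy as np

def kaiser_dim(X):
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    return int(np.sum(eigvals / eigvals.sum() >= 1.0 / len(eigvals)))

def project_on_first_pcs(X, k):
    Xc = X - X.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ vecs[:, ::-1][:, :k]    # columns of the reversed matrix are the leading eigenvectors

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20)) * np.linspace(5.0, 0.1, 20)   # anisotropic placeholder data

k1 = kaiser_dim(X)
k2 = kaiser_dim(project_on_first_pcs(X, k1))
print(k1, k2)   # for the Kaiser rule, k2 is typically smaller than k1, so the estimate is not stable
```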
6. Results
During the described testing protocol, it was found that for some datasets, the fitting of models faced the multicollinearity problem [93] for logistic regression (the QSAR biodegradation, MiniBooNE particle identification, Musk 2, and SPECTF Heart datasets). In our version of LDA (see the fisherDir function in forPaper.py in [90]), we used Tikhonov regularisation [94] to avoid this problem. We did not apply any other approach to resolve this problem that could influence the results, as our goal was to compare IDs. The multicollinearity problem was observed only for the original (non-reduced) datasets, which supports the idea that DR is necessary for some data.
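For illustration, a generic Tikhonov-regularised Fisher direction can be computed as in the following sketch; this is not the fisherDir function from [90], and the regularisation parameter alpha is an arbitrary example value:

```python
# Sketch: a generic Fisher (LDA) direction with Tikhonov regularisation; an illustration,
# not the fisherDir function from the study's repository [90].
import numpy as np

def fisher_direction(X, y, alpha=1e-3):
    """Direction maximising between-class separation, with S_W replaced by
    S_W + alpha * I to avoid a singular (multicollinear) within-class scatter matrix."""
    X0, X1 = X[y == 0], X[y == 1]
    Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) + np.cov(X1, rowvar=False) * (len(X1) - 1)
    w = np.linalg.solve(Sw + alpha * np.eye(X.shape[1]), X1.mean(axis=0) - X0.mean(axis=0))
    return w / np.linalg.norm(w)

# Example on nearly collinear data, where the unregularised S_W would be close to singular.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base, base, base]) + rng.normal(scale=1e-6, size=(200, 3))
y = (base[:, 0] > 0).astype(int)
print(fisher_direction(X, y))
```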
In Table A1, Table A2, Table A3, Table A4, Table A5 and Table A6, we can see that the average-based, t-test-based, and WSR-based rankings are consistent: for all classifiers, they give almost the same mean ranks of IDs (see Table 4). As we can see, the exact values of the mean ranks are different, but for all ranking methods and for all classifiers, the best ID (minimal mean rank) is associated with PCA-CN, and the worst ID (maximal mean rank) is associated with PCA-BS.
The results of the test of the significance of differences in mean ranking are also presented in Table 4. We can see that for KNN and LDA under all three rankings, and for LR under the t-test and WSR-test rankings, PCA-K is statistically significantly better than PCA-BS. For LR under the average ranking, this difference is statistically insignificant.
For all three rankings and for all three classifiers, PCA-CN is statistically significantly better than PCA-BS.
Differences between PCA-CN and PCA-K are statistically insignificant for all rankings and all classifiers, but the mean rank of PCA-CN is less than the mean rank of PCA-K for all rankings and all classifiers.
These observations correlate with the findings reported in Section 3.4: the PCA-BS ID is statistically significantly different from the PCA-K and PCA-CN IDs.