1. Introduction
Research on dimension reduction techniques has been gaining traction in the past decade. In dimension reduction, the features in the original dataset are replaced with a smaller number of independent features while retaining a sufficiently large proportion of the information in the original dataset. One of the most popular dimension reduction techniques is principal component analysis (PCA). Using PCA, the set of predictor variables in the original dataset is first converted to uncorrelated variables called principal components (PCs), ordered by decreasing variability (see, e.g., Jolliffe [1]). Subsequently, the leading PCs are retained so as to preserve a sufficient level of the information contained in the original dataset. Such a standard PCA algorithm is also called offline PCA. In contrast, online PCA is useful when data arrive sequentially or in batches. In the machine learning domain, exciting new advances in this direction have been made through the works of [2,3,4] and others; we refer to [5] for a detailed review of online PCA. Both offline and online PCA have been applied to various problems, including language recognition (e.g., Nasser et al. [6]), process modeling (e.g., Tang [7]), the selection of stocks (e.g., Hargreaves and Mani [8]), monitoring systems, and image classification (e.g., Horev et al. [9], Feldman et al. [10]). Works on PCA under high-dimension, low-sample-size settings also exist in the literature (e.g., Aoshima et al. [11], Yata and Aoshima [12,13], and others).
Much of the work on online PCA algorithms concerns the minimization of the total compression loss. There is little to no existing research on purely sequential or multi-stage procedures for dimension reduction algorithms like PCA. Recently, Chattopadhyay and Banerjee [14] proposed a two-stage procedure that minimizes the risk associated with the expected cost-compression loss, in other words, the cost-compression risk. This was one of the first contributions in this area, with certain limitations which are outlined below.
Suppose that the dataset consists of n observation vectors, each of dimension p, and that each observation vector follows a p-variate normal distribution with a positive-definite variance–covariance matrix Σ. Using PCA, suppose that the researcher or practitioner wants to retain k out of the p PCs such that at least a pre-specified proportion of the total variation is retained. This is a common practice in the biological sciences, especially when analyzing -omics datasets, where researchers often want to look at only the first two to four PCs (Schneckener et al. [15], Lukk et al. [16], among others). Suppose that the projection matrix of dimension p × p and rank k (k fixed), based on the sample variance–covariance matrix computed using the n observations, is used to compress the data. Furthermore, suppose that c is the cost of sampling each observation vector of dimension p, and A is the probable cost per unit compression loss due to PCA using n observation vectors. Then, the cost-compression loss as proposed by Chattopadhyay and Banerjee [14] is given by (1), and the corresponding risk function is given by (2). Thus, the loss function in (1) includes both the compression loss and the total sampling cost and is termed the cost-compression loss; the corresponding risk function in (2) is termed the cost-compression risk.
Recall that the aim is to reduce the dataset into a k-dimensional subspace in a way that is approximately as good as when the population variance–covariance matrix Σ is known. If Σ is known, the risk function is given by (3), where the projection matrix involved is of size p × p and rank k and is associated with the known Σ. The cost-compression risk defined in (3) can be minimized by using the optimal sample size n* given in (4). Since Σ is assumed to be positive definite, and provided that the variances of each of the p variables exist, this optimal sample size is well defined. Using n*, the optimal cost-compression risk is given by (5). If Σ is known, then one can easily find the optimal sample size required to attain the minimized cost-compression risk in (5). However, in reality, Σ is unknown, and so is the optimal sample size n*. This is the minimum cost-compression risk (MCCR) problem as formulated in Chattopadhyay and Banerjee [14].
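To see the structure of this optimization in a stylized form (this generic calculation is illustrative only and is not the exact risk in (3), whose form depends on the compression loss adopted in (1)), consider a risk that trades a term decaying in n against a linear sampling cost,

R(n) = A g(Σ)/n + c n,

where g(Σ) > 0 is some functional of Σ. Setting dR/dn = −A g(Σ)/n² + c = 0 gives

n* = sqrt(A g(Σ)/c)   and   R(n*) = 2 sqrt(A c g(Σ)),

so the optimal sample size grows as the sampling cost c shrinks, and both n* and the minimized risk depend on Σ; this is precisely why the problem becomes nontrivial when Σ is unknown.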
Under some conditions, the two-stage PCA procedure developed by Chattopadhyay and Banerjee [14] was shown to enjoy the first-order efficiency property. Through small simulations, the authors argued for the existence of the second-order efficiency property. So, in this article, we present a modified two-stage PCA procedure that yields a sample size minimizing the cost-compression risk when the data are assumed to follow a multivariate normal distribution. This article also shows that our procedure enjoys second-order efficiency and second-order risk efficiency properties under some conditions.
The rest of the article is organized as follows: Section 2 presents the modified two-stage PCA procedure, followed by the characteristics of the procedure in Section 3. We present extensive numerical studies in Section 4. Section 5 presents illustrations of the modified two-stage PCA procedure on a real stock market dataset and a large energy consumption dataset. In Section 6, we include some discussion points outlining the strengths and challenges of the proposed method. Section 7 finishes our article with some thoughts and conclusions.
2. Modified Two-Stage PCA Procedure
Recall that the dataset consists of n p-dimensional observation vectors, each of which follows a p-variate normal distribution with variance–covariance matrix Σ, assumed to be positive definite. The goal is to reduce these data of dimension p to a pre-specified dimension k < p. Note that the choice of k is crucial and is discussed momentarily. Suppose that the p eigenvalues of Σ are denoted by λ_i for i = 1, …, p. Furthermore, we assume that either the minimum of all the eigenvalues of Σ or a lower bound for it is known. In any two-stage procedure, the choice of the pilot sample size is very important, because the final sample size is computed based on the information gathered at the pilot stage. If the number of data vectors in the pilot stage is too small, the estimation of Σ may become unstable, potentially affecting the performance of the modified two-stage procedure. If the pilot sample size is larger than required, it may lead to oversampling and a waste of economic resources. Hence, pilot sample size determination is crucial.
Recall that the optimal sample size, using (4), can be written as in (6), where λ_(i) in (6) denotes the i-th ordered eigenvalue of Σ. Thus, given a value of the smallest eigenvalue of Σ (or a lower bound for it), we estimate the number of data vectors in the pilot stage, m, as in (7), where the integer-valued operator in (7) denotes the largest integer smaller than its argument. Here, as c → 0, m → ∞. Next, PCA is applied to the m observations collected in the pilot stage to obtain the sample variance–covariance matrix and its eigenvalues. Then, using these estimates and k, the final sample size N can be computed as in (8), where the projection matrix in (8) is of size p × p and rank k and is associated with the sample variance–covariance matrix. Now, using [17], the sample eigenvalues are consistent for the population eigenvalues λ_i for all i. So, N can be shown to be consistent for n* as c → 0. Next, we outline the detailed algorithm for PCA using the modified two-stage PCA procedure for a prediction problem.
Thus, over the two stages, we collect N observation vectors, each of dimension p, using the modified two-stage PCA procedure as per Algorithm 1, and thereby apply further predictive analytic tools to the final compressed dataset. In the next section, we explore the theoretical properties of our modified two-stage procedure.
Algorithm 1: Modified Two-Stage PCA procedure
START
Input: A, c, and k (with k < p).
- Step 1. Compute the pilot sample size m using (7). Collect the pilot dataset of m observation vectors;
- Step 2. Apply PCA on the pilot dataset and find the sample variance–covariance matrix and its eigenvalues;
- Step 3. Calculate the final sample size N using (8);
- Step 4. Collect the remaining N − m observation vectors so that the combined dataset contains N observation vectors;
- Step 5. Apply PCA on the combined dataset and subsequently obtain the compressed dataset of dimension k;
Output: Implement a suitable predictive analytic/machine learning technique on the compressed dataset.
END
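To make the workflow concrete, the following Python sketch mirrors the steps of Algorithm 1. The rules used for the pilot size and the final size are illustrative stand-ins for (7) and (8) (here taken proportional to sqrt(A/c), scaled by the available eigenvalue information), since the exact expressions are not reproduced in this sketch; sample_row is a hypothetical data source supplied by the user.

import numpy as np

def modified_two_stage_pca(sample_row, A, c, k, lambda_lower):
    # sample_row   : callable returning one p-dimensional observation vector
    # A, c         : cost per unit compression loss and cost per observation vector
    # k            : number of principal components to retain (k < p)
    # lambda_lower : known lower bound on the smallest eigenvalue of Sigma
    # Step 1: pilot sample size (illustrative stand-in for Equation (7)).
    m = max(2, int(np.sqrt(A * lambda_lower / c)) + 1)
    pilot = np.array([sample_row() for _ in range(m)])
    # Step 2: PCA on the pilot data: sample covariance and its eigenvalues.
    S_m = np.cov(pilot, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(S_m))[::-1]
    # Step 3: final sample size (illustrative stand-in for Equation (8)),
    # driven by the estimated eigenvalues discarded by the rank-k projection.
    discarded = eigvals[k:].sum()
    N = max(m, int(np.sqrt(A * discarded / c)) + 1)
    # Step 4: collect the remaining N - m observation vectors.
    if N > m:
        rest = np.array([sample_row() for _ in range(N - m)])
        data = np.vstack([pilot, rest])
    else:
        data = pilot
    # Step 5: PCA on the combined data and projection onto the leading k PCs.
    S_N = np.cov(data, rowvar=False)
    _, vecs = np.linalg.eigh(S_N)
    top_k = vecs[:, ::-1][:, :k]
    compressed = (data - data.mean(axis=0)) @ top_k
    return N, compressed

The compressed output can then be passed to any downstream predictive model, as in the Output step of Algorithm 1.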
3. Characteristics of the Modified Two-Stage PCA Procedure
For fixed A, k, and Σ, the cost-compression risk based on the final sample size N may be computed as in (9). From (8), it can be observed that N can be very large for small values of c. Although c is fixed for a particular experiment, we study the asymptotic properties of the modified two-stage PCA procedure for small values of c. First, we provide a crucial lemma for the modified two-stage PCA procedure; it is crucial in the sense that it ensures that the final number of observation vectors is finite.
Lemma 1. Under the assumption that Σ is positive definite, for any fixed c > 0, the final sample size N is finite, i.e., P(N < ∞) = 1.
The proof of this lemma is similar to that in Chattopadhyay and Banerjee [14]. We now consider the main theorem associated with the modified two-stage PCA procedure.
Theorem 1. If the expectations of the sample eigenvalues are finite for all i = 1, …, p, then the final sample size N obtained with the modified two-stage PCA procedure yields the following properties:
- (i) N/n* → 1 in probability as c → 0.
- (ii) E(N − n*) is bounded as c → 0.
- (iii) The difference between the cost-compression risk in (9) and the optimal risk in (5) is bounded as c → 0, provided the risk in (9) is finite.
Proof. (i) Using Equation (8) and dividing throughout by n*, we obtain (10). Now, since the sample variance–covariance matrix is a consistent estimator of Σ, and hence the sample eigenvalues are consistent for the population eigenvalues, we can prove that N/n* → 1 in probability as c → 0.
(ii) Taking expectations in (10), we write (11). Here, since Σ is positive definite, the relevant term in (11) is negative. Also note the relations in (12) and (13). Let us consider the right-hand side and the left-hand side of Equation (11) separately and let c → 0. Using Equation (13), we obtain (14). This proves the second-order efficiency property for the modified two-stage PCA procedure, which, of course, implies that the procedure is also first-order efficient.
(iii) Here, we proceed along the lines of Chattopadhyay and Banerjee [14] to show that the difference between the two risks is bounded. Subtracting the optimum risk, defined in (5), from the risk obtained using the final sample size, given in (9), we obtain an expression that is bounded above using (ii) together with the finiteness of the expectations involved. Thus, it remains to prove that this difference also has a lower bound. The difference between the two risks is therefore bounded, and the second-order risk efficiency property is satisfied. This also implies that the modified two-stage PCA procedure has first-order risk efficiency. □
Theorem 1 thus ensures that, under appropriate conditions, the final sample size is close to the optimal sample size on average, and that the cost-compression risk attained by the final sample size is close to the minimized cost-compression risk that could be achieved if the population variance–covariance matrix were known.
Remark 1. We note that the finiteness assumption of Theorem 1 is not very restrictive. This holds if the data follows a multivariate normal distribution (using Anderson [17]).
4. A Simulation Study
To verify the results related to the properties of the proposed modified two-stage procedure from Theorem 1, we use simulated data. In the following Monte Carlo simulation study, we simulated observations from a 5-dimensional normal population with mean vector μ and population variance–covariance matrix Σ. For the modified two-stage PCA procedure, we also need to fix the value of k. Here, we present results for k = 3. Other choices of k were also considered but are left out of the article for brevity. Since this is a simulation study, we noted that an eigenvalue decomposition of Σ reveals that k = 3 retains 90% of the variation in the compressed data. For real data analyses, a more practical choice of k is discussed in Section 5.
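As an illustration of how such a study can be organized, the sketch below repeats the procedure over many replications and averages the final sample size and the proportion of variation retained. The 5-dimensional mean vector, covariance matrix, and cost constants used here are placeholders rather than the values used in the article, and modified_two_stage_pca refers to the sketch given after Algorithm 1.

import numpy as np

rng = np.random.default_rng(1)
p, k = 5, 3
mu = np.zeros(p)
Sigma = np.diag([5.0, 3.0, 2.0, 0.6, 0.4])   # placeholder covariance, not the article's
A, c = 100.0, 0.01                           # placeholder cost constants
lambda_lower = 0.4                           # assumed known lower bound on the eigenvalues

def sample_row():
    return rng.multivariate_normal(mu, Sigma)

final_sizes, retained = [], []
for _ in range(1000):                        # the article uses 10,000 replications
    N, compressed = modified_two_stage_pca(sample_row, A, c, k, lambda_lower)
    final_sizes.append(N)
    # crude proxy for the proportion of total variation retained by the k PCs
    retained.append(compressed.var(axis=0, ddof=1).sum() / Sigma.trace())

print("average N :", np.mean(final_sizes),
      "s.e. :", np.std(final_sizes, ddof=1) / np.sqrt(len(final_sizes)))
print("average proportion of variation retained :", np.mean(retained))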
Table 1, Table 2 and Table 3 show the performance of the modified two-stage PCA procedure outlined in Algorithm 1. Each of these three tables considers a different value of the cost c, while different choices of A are considered in the rows. The second and third columns provide the optimal sample size and optimal risk for our cost-compression loss function. This is followed by the pilot sample size as in (7). The fifth column provides the average sample size, computed as in (8) and averaged over 10,000 simulations, along with its standard error. From the sixth column, we observed that the modified two-stage PCA procedure performs remarkably well under each of the configurations when compared with the optimal sample size n*. Although the proposed procedure tends to oversample on average in comparison to the optimal sample size, this oversampling rate remains low (between 2.8% and 4.4%) under each of the configurations. The eighth column shows the average cost-compression risk, computed over 10,000 simulations and accompanied by its standard error. We note that the standard error of the cost-compression risk is low, and this holds for all the configurations. This column is followed by a comparison between the average cost-compression risk and the optimal risk, and we see that these two are remarkably close, as the ratio is very close to 1. Here, we fixed the minimum proportion of variation to be retained. Recall that the eigenvalue decomposition of Σ retains at least this proportion of the variation (90%, as noted above) for a reduced dimension of 3. The next column provides the average variation retained over 10,000 simulations. This is simply computed from the variation retained in the compressed data of size N in each simulation run and then averaged. We are happy to note that the average retained variation is comfortably above the required proportion and very close to the true value of 90%. The last column shows the total time taken (in seconds) to run and compile results from 10,000 simulations. It may be noted that it takes less than a minute to complete such a large number of simulations, which shows that the proposed two-stage PCA procedure is not only asymptotically efficient but also time efficient. Also, our analyses reveal that the choice of k does not affect the performance of the proposed algorithm.
In order to determine whether the second-order efficiency and second-order risk efficiency properties prevail, we repeated simulations such as those reported in Table 1, Table 2 and Table 3 several times under some of the configurations and subsequently implemented the modified two-stage PCA procedure. Figure 1 and Figure 2 exhibit the observed values of the regret and the risk-regret. Our empirical observations show that the regret and the risk-regret remain small and stable across the different choices of A and c. Thus, it is abundantly clear that the modified two-stage PCA procedure satisfies the second-order efficiency and risk efficiency properties. The plots strongly support our findings of asymptotic second-order efficiency and second-order risk efficiency in Theorem 1.
6. Discussion
In this section, we discuss a few aspects of our proposed PCA algorithm.
(a) The proposed online PCA algorithm is developed based on several key assumptions. First, we assume that the data follows a normal distribution and that the cost-compression loss function is a linear combination of the compression loss and the data collection cost. Additionally, we assume that the cost of collecting each observation remains constant throughout the data collection process. Moreover, we assume that the lower bound (or minimum) of the eigenvalues of the population variance–covariance matrix Σ is known. Under these conditions, we propose a modified two-stage PCA algorithm that achieves second-order efficiency in terms of sample size and risk. Notably, no existing online PCA algorithms, including those in [2,3], operate under this specific setting. Consequently, comparing our proposed PCA algorithm with existing online PCA algorithms would not yield a meaningful comparison. Therefore, we refrain from making such a comparison in this work.
(b) The proposed algorithm requires a value of the lower bound (or minimum) of the eigenvalues of the population variance–covariance matrix (Σ). In fact, in several instances, specifying such a lower bound may be beneficial in its own right. We note that, if the smallest eigenvalue of Σ is too close to zero, the matrix becomes nearly singular, making numerical operations unstable. PCA algorithms involving matrix inversions or eigen decompositions can become numerically unstable when eigenvalues are very small. Setting a known lower bound ensures that computations remain well posed and stable. For example, consider a dataset of high-resolution grayscale images (say, 1000 images of a given size). Suppose that, after computing the variance–covariance matrix, we find that some eigenvalues are very close to zero. These small eigenvalues correspond to directions in the data space with extremely low variance, possibly due to noise rather than a meaningful structure. Keeping these components in the PCA will lead to the reconstruction of blurry or distorted images. In this situation, by imposing a lower bound, we discard eigenvalues smaller than that bound, thus removing weak components and ensuring that only meaningful patterns remain.
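A small sketch of this thresholding idea is given below; the image dimensions, the floor value, and the helper name pca_with_eigenvalue_floor are illustrative choices, not values prescribed by the article.

import numpy as np

def pca_with_eigenvalue_floor(X, floor=1e-4):
    # Keep only the principal components whose eigenvalues exceed a floor.
    # X     : (n, p) data matrix, e.g., flattened grayscale images
    # floor : lower bound below which eigenvalues are treated as noise
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(S)            # eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]    # reorder to descending
    keep = vals > floor                       # discard near-singular directions
    return Xc @ vecs[:, keep], vals[keep]

# Example: 1000 "images" with a few strong patterns plus many weak noise directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50)) * np.r_[np.full(5, 3.0), np.full(45, 0.01)]
scores, kept_vals = pca_with_eigenvalue_floor(X, floor=1e-3)
print(scores.shape, kept_vals.min())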
A meaningful choice of the lower bound is crucial, particularly in determining the appropriate sample size for implementing the procedure. A higher value of the bound leads to a larger pilot sample size, whereas a lower value results in a smaller pilot sample. As with any modified two-stage procedure in sequential analysis, a slight deviation from the actual (or optimal) lower bound does not significantly impact the final sample size, apart from adjustments in the second-stage sample size. Thus, a minor misspecification of the lower bound does not cause significant issues.
However, if the lower bound is set excessively high, the pilot sample size may become unnecessarily large, potentially exceeding the optimal sample size. Conversely, if the bound is set too low, the pilot sample may be insufficient, forcing the algorithm to learn the structure from fewer observations. This can increase the variance of the final sample size, affecting the overall efficiency of the procedure. For more details on the impact of the pilot sample size on the final sample size and its standard deviation, we refer to [23,24,25].
(c) The proposed procedure is developed to reduce the data of dimension p to a pre-specified dimension k < p, which requires pre-specifying k. Instead, suppose that the goal is to reduce the dimension of the data from p to any lower dimension, retaining at most a pre-specified proportion of the variability, with the reduced dimension being at least 1. A lower bound on the optimal sample size n* then follows from (4) and is given in (15). Using Equation (15), the pilot sample size is given in (16). In the first stage, we draw a pilot sample of this size. Based on this sample, we apply PCA, find the value of k using the scree plot, and correspondingly obtain the estimated eigenvalues and projection matrix. Then, using these estimates and k, the final sample size can be computed as in (17). With this change, it can be shown that the revised procedure enjoys the characteristics outlined in Lemma 1 and Theorem 1. The proof requires only a minor change and hence is omitted for brevity.
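For the variant described in this subsection, a simple numerical stand-in for reading the scree plot from the pilot sample is to take the smallest number of components whose cumulative explained variance reaches a target proportion; the helper below and the target gamma are illustrative assumptions, not the article's prescription.

import numpy as np

def choose_k(pilot, gamma=0.90):
    # Smallest k whose leading eigenvalues explain at least a fraction gamma
    # of the total variance in the pilot sample (a stand-in for a scree-plot choice).
    S = np.cov(pilot, rowvar=False)
    vals = np.sort(np.linalg.eigvalsh(S))[::-1]
    cum = np.cumsum(vals) / vals.sum()
    return int(min(len(vals), np.searchsorted(cum, gamma) + 1))

The resulting k can then be fed into the final-sample-size step exactly as in Algorithm 1.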
7. Some Concluding Comments
PCA is one of the most popular data reduction techniques. PCA incurs a compression loss while reducing the dimension of the data, and this compression loss also depends on the sample size: the larger the sample size, the smaller the compression loss, but the higher the data collection cost. To account for both the compression loss and the data collection cost, the cost-compression risk function was used. A fixed sample size procedure cannot minimize such a cost-compression risk. As envisaged in an earlier work by the authors, a two-stage PCA procedure may be used; however, that procedure was not shown to be second-order efficient. In this article, we developed a modified two-stage PCA procedure under the constraint that either the minimum eigenvalue of the variance–covariance matrix or a lower bound for it is known.
The proposed modified two-stage PCA procedure is shown to be asymptotically second-order efficient. Additionally, we showed that the risk-regret for this procedure is bounded, thus achieving asymptotic second-order risk efficiency. Using extensive simulations and a real data application, we have shown that the modified two-stage procedure is fast, efficient, and easy to implement. In the presence of historical data or prior information, one may conveniently use the modified two-stage PCA procedure. The choice of k could be determined from the budget or other criteria for future analyses, or from historical data. In the absence of any prior information or historical data, we would recommend the two-stage procedure proposed by [23]. While the latter enjoys only a first-order efficiency property, it would be our only choice given the dearth of existing literature on such algorithms. Our ongoing research involves developing a three-stage PCA procedure that does not require any prior information regarding population parameters, is second-order efficient, and is developed in a distribution-free setting. User discretion is thus advised when choosing one of these procedures for minimizing the cost-compression risk in a dimension reduction problem. The existing literature on multi-stage methods for dimension reduction is sparse, and we are confident that the proposed procedure will prove very useful for problems like dimension reduction, feature selection, and others that require the use of PCA on the impromptu arrival of incoming data.