1. Introduction
Research on dimension reduction techniques has been gaining traction in the past decade. In dimension reduction, the features in the original dataset are replaced with a smaller number of independent features while retaining a sufficiently large proportion of the information in the original dataset. One of the most popular dimension reduction techniques is principal component analysis (PCA). Using PCA, the set of predictor variables in the original dataset is first converted to uncorrelated variables called principal components (PCs), ordered by decreasing variability (see, e.g., Jolliffe [1]). Subsequently, the leading PCs are retained so as to preserve a sufficient level of the information contained in the original dataset. Such a standard PCA algorithm is also called offline PCA. In contrast, online PCA is useful when data arrive sequentially or in batches. In the machine learning domain, exciting new advances in this direction have been made through the works of [2,3,4] and others; we refer to [5] for a detailed review of online PCA. Both offline and online PCA have been applied to various problems, including language recognition (e.g., Nasser et al. [6]), process modeling (e.g., Tang [7]), the selection of stocks (e.g., Hargreaves and Mani [8]), monitoring systems, and image classification (e.g., Horev et al. [9], Feldman et al. [10]). Works on PCA under high-dimension, low-sample-size settings also exist in the literature (e.g., Aoshima et al. [11], Yata and Aoshima [12,13], and others).
Much of the work on online PCA algorithms concerns the minimization of the total compression loss. There is little to no existing research on purely sequential or multi-stage procedures for dimension reduction algorithms like PCA. Recently, Chattopadhyay and Banerjee [14] proposed a two-stage procedure that minimizes the risk associated with the expected cost-compression loss, in other words, the cost-compression risk. This was one of the first contributions in this area, with certain limitations which are outlined below.
Suppose that the dataset consists of n observation vectors, each of dimension p, and that each observation vector follows a p-variate normal distribution with a positive-definite variance–covariance matrix Σ. Using PCA, suppose that the researcher or practitioner wants to retain k out of the p PCs such that at least a pre-specified proportion of the total variation is retained. This is a common practice in the biological sciences, especially when analyzing -omics datasets, where researchers often want to look at only the first two to four PCs (Schneckener et al. [15], Lukk et al. [16], among others). Suppose that the projection matrix of dimension p × p and rank k (k fixed), based on the sample variance–covariance matrix computed using the n observations, is used to compress the data. Furthermore, suppose that c is the cost of sampling each observation vector of dimension p, and A is the probable cost per unit compression loss due to PCA using n observation vectors. Then, the cost-compression loss as proposed by Chattopadhyay and Banerjee [14] is given by (1), and the corresponding risk function is given by (2). Thus, the loss function in (1) includes both the compression loss and the total sampling cost and is termed the cost-compression loss; the corresponding risk function in (2) is termed the cost-compression risk.
Recall that the aim is to reduce the dataset into a k-dimensional subspace in a way that is approximately as good as when the population variance–covariance matrix Σ is known. If Σ is known, the risk function is given by (3), where the projection matrix involved is of size p × p and rank k and is associated with the known Σ. The cost-compression risk defined in (3) can be minimized by using the optimal sample size n* given in (4). Since Σ is assumed to be positive definite, and provided that the variances of each of the p variables exist, this optimal sample size is well defined. Using n*, the optimal cost-compression risk is given by (5). If Σ is known, then one can easily find the optimal sample size required to attain the minimized cost-compression risk in (5). However, in reality, Σ is unknown, and so is the optimal sample size n*. This is the minimum cost-compression risk (MCCR) problem as formulated in Chattopadhyay and Banerjee [14].
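To see the structure of this optimization in a stylized form (this generic calculation is illustrative only and is not the exact risk in (3), whose form depends on the compression loss adopted in (1)), consider a risk that trades a term decaying in n against a linear sampling cost,

R(n) = A g(Σ)/n + c n,

where g(Σ) > 0 is some functional of Σ. Setting dR/dn = −A g(Σ)/n² + c = 0 gives

n* = sqrt(A g(Σ)/c)   and   R(n*) = 2 sqrt(A c g(Σ)),

so the optimal sample size grows as the sampling cost c shrinks, and both n* and the minimized risk depend on Σ; this is precisely why the problem becomes nontrivial when Σ is unknown.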
Under some conditions, the two-stage PCA procedure developed by Chattopadhyay and Banerjee [14] was shown to enjoy the first-order efficiency property. Through small simulations, the authors argued for the existence of the second-order efficiency property. So, in this article, we present a modified two-stage PCA procedure that yields a sample size minimizing the cost-compression risk when the data are assumed to follow a multivariate normal distribution. This article also shows that our procedure enjoys second-order efficiency and second-order risk efficiency properties under some conditions.
The rest of the article is organized as follows: Section 2 presents the modified two-stage PCA procedure, followed by the characteristics of the procedure in Section 3. We present extensive numerical studies in Section 4. Section 5 presents illustrations of the modified two-stage PCA procedure on a real stock market dataset and a large energy consumption dataset. In Section 6, we include some discussion points outlining the strengths and challenges of the proposed method. Section 7 finishes our article with some thoughts and conclusions.
2. Modified Two-Stage PCA Procedure
Recall that the dataset consists of n p-dimensional observation vectors, each of which follows a p-variate normal distribution with variance–covariance matrix Σ, assumed to be positive definite. The goal is to reduce these data of dimension p to a pre-specified dimension k < p. Note that the choice of k is crucial and is discussed momentarily. Suppose that the p eigenvalues of Σ are denoted by λ_i for i = 1, …, p. Furthermore, we assume that either the minimum of all the eigenvalues of Σ or a lower bound for it is known. In any two-stage procedure, the choice of the pilot sample size is very important, because the final sample size is computed based on the information gathered at the pilot stage. If the number of data vectors in the pilot stage is too small, the estimation of Σ may become unstable, potentially affecting the performance of the modified two-stage procedure. If the pilot sample size is larger than required, it may lead to oversampling and a waste of economic resources. Hence, pilot sample size determination is crucial.
Recall that the optimal sample size, using (4), can be written as in (6), where λ_(i) in (6) denotes the i-th ordered eigenvalue of Σ. Thus, given a value of the smallest eigenvalue of Σ (or a lower bound for it), we estimate the number of data vectors in the pilot stage, m, as in (7), where the integer-valued operator in (7) denotes the largest integer smaller than its argument. Here, as c → 0, m → ∞. Next, PCA is applied to the m observations collected in the pilot stage to obtain the sample variance–covariance matrix and its eigenvalues. Then, using these estimates and k, the final sample size N can be computed as in (8), where the projection matrix in (8) is of size p × p and rank k and is associated with the sample variance–covariance matrix. Now, using [17], the sample eigenvalues are consistent for the population eigenvalues λ_i for all i. So, N can be shown to be consistent for n* as c → 0. Next, we outline the detailed algorithm for PCA using the modified two-stage PCA procedure for a prediction problem.
Thus, over the two stages, we collect N observation vectors, each of dimension p, using the modified two-stage PCA procedure as per Algorithm 1, and thereby apply further predictive analytic tools to the final compressed dataset. In the next section, we explore the theoretical properties of our modified two-stage procedure.
Algorithm 1: Modified Two-Stage PCA procedure
START
Input: A, c, and k (with k < p).
- Step 1. Compute the pilot sample size m using (7). Collect the pilot dataset of m observation vectors;
- Step 2. Apply PCA on the pilot dataset and find the sample variance–covariance matrix and its eigenvalues;
- Step 3. Calculate the final sample size N using (8);
- Step 4. Collect the remaining N − m observation vectors so that the combined dataset contains N observation vectors;
- Step 5. Apply PCA on the combined dataset and subsequently obtain the compressed dataset of dimension k;
Output: Implement a suitable predictive analytic/machine learning technique on the compressed dataset.
END
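To make the workflow concrete, the following Python sketch mirrors the steps of Algorithm 1. The rules used for the pilot size and the final size are illustrative stand-ins for (7) and (8) (here taken proportional to sqrt(A/c), scaled by the available eigenvalue information), since the exact expressions are not reproduced in this sketch; sample_row is a hypothetical data source supplied by the user.

import numpy as np

def modified_two_stage_pca(sample_row, A, c, k, lambda_lower):
    # sample_row   : callable returning one p-dimensional observation vector
    # A, c         : cost per unit compression loss and cost per observation vector
    # k            : number of principal components to retain (k < p)
    # lambda_lower : known lower bound on the smallest eigenvalue of Sigma
    # Step 1: pilot sample size (illustrative stand-in for Equation (7)).
    m = max(2, int(np.sqrt(A * lambda_lower / c)) + 1)
    pilot = np.array([sample_row() for _ in range(m)])
    # Step 2: PCA on the pilot data: sample covariance and its eigenvalues.
    S_m = np.cov(pilot, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(S_m))[::-1]
    # Step 3: final sample size (illustrative stand-in for Equation (8)),
    # driven by the estimated eigenvalues discarded by the rank-k projection.
    discarded = eigvals[k:].sum()
    N = max(m, int(np.sqrt(A * discarded / c)) + 1)
    # Step 4: collect the remaining N - m observation vectors.
    if N > m:
        rest = np.array([sample_row() for _ in range(N - m)])
        data = np.vstack([pilot, rest])
    else:
        data = pilot
    # Step 5: PCA on the combined data and projection onto the leading k PCs.
    S_N = np.cov(data, rowvar=False)
    _, vecs = np.linalg.eigh(S_N)
    top_k = vecs[:, ::-1][:, :k]
    compressed = (data - data.mean(axis=0)) @ top_k
    return N, compressed

The compressed output can then be passed to any downstream predictive model, as in the Output step of Algorithm 1.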
3. Characteristics of the Modified Two-Stage PCA Procedure
For fixed A, k, and Σ, the cost-compression risk based on the final sample size N may be computed as in (9). From (8), it can be observed that N can be very large for small values of c. Although c is fixed for a particular experiment, we study the asymptotic properties of the modified two-stage PCA procedure for small values of c. First, we provide a crucial lemma for the modified two-stage PCA procedure; it is crucial in the sense that it ensures that the final number of observation vectors is finite.
Lemma 1. Under the assumption that Σ is positive definite, for any fixed c > 0, the final sample size N is finite, i.e., P(N < ∞) = 1.
The proof of this lemma is similar to that in Chattopadhyay and Banerjee [14]. We now consider the main theorem associated with the modified two-stage PCA procedure.
Theorem 1. If the expectations of the sample eigenvalues are finite for all i = 1, …, p, then the final sample size N obtained with the modified two-stage PCA procedure yields the following properties:
- (i) N/n* → 1 in probability as c → 0.
- (ii) E(N − n*) is bounded as c → 0.
- (iii) The difference between the cost-compression risk in (9) and the optimal risk in (5) is bounded as c → 0, provided the risk in (9) is finite.
Proof. (i) Using Equation (8) and dividing throughout by n*, we obtain (10). Now, since the sample variance–covariance matrix is a consistent estimator of Σ, and hence the sample eigenvalues are consistent for the population eigenvalues, we can prove that N/n* → 1 in probability as c → 0.
(ii) Taking expectations in (10), we write (11). Here, since Σ is positive definite, the relevant term in (11) is negative. Also note the relations in (12) and (13). Let us consider the right-hand side and the left-hand side of Equation (11) separately and let c → 0. Using Equation (13), we obtain (14). This proves the second-order efficiency property for the modified two-stage PCA procedure, which, of course, implies that the procedure is also first-order efficient.
(iii) Here, we proceed along the lines of Chattopadhyay and Banerjee [14] to show that the difference between the two risks is bounded. Subtracting the optimum risk, defined in (5), from the risk obtained using the final sample size, given in (9), we obtain an expression that is bounded above using (ii) together with the finiteness of the expectations involved. Thus, it remains to prove that this difference also has a lower bound. The difference between the two risks is therefore bounded, and the second-order risk efficiency property is satisfied. This also implies that the modified two-stage PCA procedure has first-order risk efficiency. □
Theorem 1 thus ensures that, under appropriate conditions, the final sample size is close to the optimal sample size on average, and that the cost-compression risk attained by the final sample size is close to the minimized cost-compression risk that could be achieved if the population variance–covariance matrix were known.
Remark 1. We note that the finiteness assumption of Theorem 1 is not very restrictive. This holds if the data follows a multivariate normal distribution (using Anderson [17]).
4. A Simulation Study
To verify the results related to the properties of the proposed modified two-stage procedure from Theorem 1, we use simulated data. In the following Monte Carlo simulation study, we simulated observations from a 5-dimensional normal population with mean vector μ and population variance–covariance matrix Σ. For the modified two-stage PCA procedure, we also need to fix the value of k. Here, we present results for k = 3. Other choices of k were also considered but are left out of the article for brevity. Since this is a simulation study, we noted that an eigenvalue decomposition of Σ reveals that k = 3 retains 90% of the variation in the compressed data. For real data analyses, a more practical choice of k is discussed in Section 5.
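As an illustration of how such a study can be organized, the sketch below repeats the procedure over many replications and averages the final sample size and the proportion of variation retained. The 5-dimensional mean vector, covariance matrix, and cost constants used here are placeholders rather than the values used in the article, and modified_two_stage_pca refers to the sketch given after Algorithm 1.

import numpy as np

rng = np.random.default_rng(1)
p, k = 5, 3
mu = np.zeros(p)
Sigma = np.diag([5.0, 3.0, 2.0, 0.6, 0.4])   # placeholder covariance, not the article's
A, c = 100.0, 0.01                           # placeholder cost constants
lambda_lower = 0.4                           # assumed known lower bound on the eigenvalues

def sample_row():
    return rng.multivariate_normal(mu, Sigma)

final_sizes, retained = [], []
for _ in range(1000):                        # the article uses 10,000 replications
    N, compressed = modified_two_stage_pca(sample_row, A, c, k, lambda_lower)
    final_sizes.append(N)
    # crude proxy for the proportion of total variation retained by the k PCs
    retained.append(compressed.var(axis=0, ddof=1).sum() / Sigma.trace())

print("average N :", np.mean(final_sizes),
      "s.e. :", np.std(final_sizes, ddof=1) / np.sqrt(len(final_sizes)))
print("average proportion of variation retained :", np.mean(retained))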
Table 1, Table 2 and Table 3 show the performance of the modified two-stage PCA procedure outlined in Algorithm 1. Each of these three tables considers a different value of the cost c, while different choices of A are considered in the rows. The second and third columns provide the optimal sample size and optimal risk for our cost-compression loss function. This is followed by the pilot sample size as in (7). The fifth column provides the average sample size, computed as in (8) and averaged over 10,000 simulations, along with its standard error. From the sixth column, we observed that the modified two-stage PCA procedure performs remarkably well under each of the configurations when compared with the optimal sample size n*. Although the proposed procedure tends to oversample on average in comparison to the optimal sample size, this oversampling rate remains low (between 2.8% and 4.4%) under each of the configurations. The eighth column shows the average cost-compression risk, computed over 10,000 simulations and accompanied by its standard error. We note that the standard error of the cost-compression risk is low, and this holds for all the configurations. This column is followed by a comparison between the average cost-compression risk and the optimal risk, and we see that these two are remarkably close, as the ratio is very close to 1. Here, we fixed the minimum proportion of variation to be retained. Recall that the eigenvalue decomposition of Σ retains at least this proportion of the variation (90%, as noted above) for a reduced dimension of 3. The next column provides the average variation retained over 10,000 simulations. This is simply computed from the variation retained in the compressed data of size N in each simulation run and then averaged. We are happy to note that the average retained variation is comfortably above the required proportion and very close to the true value of 90%. The last column shows the total time taken (in seconds) to run and compile results from 10,000 simulations. It may be noted that it takes less than a minute to complete such a large number of simulations, which shows that the proposed two-stage PCA procedure is not only asymptotically efficient but also time efficient. Also, our analyses reveal that the choice of k does not affect the performance of the proposed algorithm.
In order to determine whether the second-order efficiency and second-order risk efficiency properties prevail, we repeated simulations such as those reported in Table 1, Table 2 and Table 3 several times under some of the configurations and subsequently implemented the modified two-stage PCA procedure. Figure 1 and Figure 2 exhibit the observed values of the regret and the risk-regret. Our empirical observations show that the regret and the risk-regret remain small and stable across the different choices of A and c. Thus, it is abundantly clear that the modified two-stage PCA procedure satisfies the second-order efficiency and risk efficiency properties. The plots strongly support our findings of asymptotic second-order efficiency and second-order risk efficiency in Theorem 1.
6. Discussion
In this section, we discuss a few aspects of our proposed PCA algorithm.
(a) The proposed online PCA algorithm is developed based on several key assumptions. First, we assume that the data follows a normal distribution and that the cost-compression loss function is a linear combination of the compression loss and the data collection cost. Additionally, we assume that the cost of collecting each observation remains constant throughout the data collection process. Moreover, we assume that the lower bound (or minimum) of the eigenvalues of the population variance–covariance matrix Σ is known. Under these conditions, we propose a modified two-stage PCA algorithm that achieves second-order efficiency in terms of sample size and risk. Notably, no existing online PCA algorithms, including those in [2,3], operate under this specific setting. Consequently, comparing our proposed PCA algorithm with existing online PCA algorithms would not yield a meaningful comparison. Therefore, we refrain from making such a comparison in this work.
(b) The proposed algorithm requires a value of the lower bound (or minimum) of the eigenvalues of the population variance–covariance matrix (Σ). In fact, in several instances, specifying such a lower bound may be beneficial in its own right. We note that, if the smallest eigenvalue of Σ is too close to zero, the matrix becomes nearly singular, making numerical operations unstable. PCA algorithms involving matrix inversions or eigen decompositions can become numerically unstable when eigenvalues are very small. Setting a known lower bound ensures that computations remain well posed and stable. For example, consider a dataset of high-resolution grayscale images (say, 1000 images of a given size). Suppose that, after computing the variance–covariance matrix, we find that some eigenvalues are very close to zero. These small eigenvalues correspond to directions in the data space with extremely low variance, possibly due to noise rather than a meaningful structure. Keeping these components in the PCA will lead to the reconstruction of blurry or distorted images. In this situation, by imposing a lower bound, we discard eigenvalues smaller than that bound, thus removing weak components and ensuring that only meaningful patterns remain.
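A small sketch of this thresholding idea is given below; the image dimensions, the floor value, and the helper name pca_with_eigenvalue_floor are illustrative choices, not values prescribed by the article.

import numpy as np

def pca_with_eigenvalue_floor(X, floor=1e-4):
    # Keep only the principal components whose eigenvalues exceed a floor.
    # X     : (n, p) data matrix, e.g., flattened grayscale images
    # floor : lower bound below which eigenvalues are treated as noise
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(S)            # eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]    # reorder to descending
    keep = vals > floor                       # discard near-singular directions
    return Xc @ vecs[:, keep], vals[keep]

# Example: 1000 "images" with a few strong patterns plus many weak noise directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50)) * np.r_[np.full(5, 3.0), np.full(45, 0.01)]
scores, kept_vals = pca_with_eigenvalue_floor(X, floor=1e-3)
print(scores.shape, kept_vals.min())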
A meaningful choice of the lower bound is crucial, particularly in determining the appropriate sample size for implementing the procedure. A higher value of the bound leads to a larger pilot sample size, whereas a lower value results in a smaller pilot sample. As with any modified two-stage procedure in sequential analysis, a slight deviation from the actual (or optimal) lower bound does not significantly impact the final sample size, apart from adjustments in the second-stage sample size. Thus, a minor misspecification of the lower bound does not cause significant issues.
However, if the lower bound is set excessively high, the pilot sample size may become unnecessarily large, potentially exceeding the optimal sample size. Conversely, if the bound is set too low, the pilot sample may be insufficient, forcing the algorithm to learn the structure from fewer observations. This can increase the variance of the final sample size, affecting the overall efficiency of the procedure. For more details on the impact of the pilot sample size on the final sample size and its standard deviation, we refer to [23,24,25].
(c) The proposed procedure is developed to reduce the data of dimension p to a pre-specified dimension k < p, which requires pre-specifying k. Instead, suppose that the goal is to reduce the dimension of the data from p to any lower dimension, retaining at most a pre-specified proportion of the variability, with the reduced dimension being at least 1. A lower bound on the optimal sample size n* then follows from (4) and is given in (15). Using Equation (15), the pilot sample size is given in (16). In the first stage, we draw a pilot sample of this size. Based on this sample, we apply PCA, find the value of k using the scree plot, and correspondingly obtain the estimated eigenvalues and projection matrix. Then, using these estimates and k, the final sample size can be computed as in (17). With this change, it can be shown that the revised procedure enjoys the characteristics outlined in Lemma 1 and Theorem 1. The proof requires only a minor change and hence is omitted for brevity.
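For the variant described in this subsection, a simple numerical stand-in for reading the scree plot from the pilot sample is to take the smallest number of components whose cumulative explained variance reaches a target proportion; the helper below and the target gamma are illustrative assumptions, not the article's prescription.

import numpy as np

def choose_k(pilot, gamma=0.90):
    # Smallest k whose leading eigenvalues explain at least a fraction gamma
    # of the total variance in the pilot sample (a stand-in for a scree-plot choice).
    S = np.cov(pilot, rowvar=False)
    vals = np.sort(np.linalg.eigvalsh(S))[::-1]
    cum = np.cumsum(vals) / vals.sum()
    return int(min(len(vals), np.searchsorted(cum, gamma) + 1))

The resulting k can then be fed into the final-sample-size step exactly as in Algorithm 1.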
7. Some Concluding Comments
PCA is one of the most popular data reduction techniques. PCA incurs a compression loss while reducing the dimension of the data, and this compression loss also depends on the sample size: the larger the sample size, the smaller the compression loss, but the higher the data collection cost. To account for both the compression loss and the data collection cost, the cost-compression risk function was used. A fixed sample size procedure cannot minimize such a cost-compression risk. As envisaged in an earlier work by the authors, a two-stage PCA procedure may be used; however, that procedure was not shown to be second-order efficient. In this article, we developed a modified two-stage PCA procedure under the constraint that either the minimum eigenvalue of the variance–covariance matrix or a lower bound for it is known.
The proposed modified two-stage PCA procedure is shown to be asymptotically second-order efficient. Additionally, we showed that the risk-regret for this procedure is bounded, thus achieving asymptotic second-order risk efficiency. Using extensive simulations and a real data application, we have shown that the modified two-stage procedure is fast, efficient, and easy to implement. In the presence of historical data or prior information, one may conveniently use the modified two-stage PCA procedure. The choice of k could be determined from the budget or other criteria for future analyses, or from historical data. In the absence of any prior information or historical data, we would recommend the two-stage procedure proposed by [23]. While the latter enjoys only a first-order efficiency property, it would be our only choice given the dearth of existing literature on such algorithms. Our ongoing research involves developing a three-stage PCA procedure that does not require any prior information regarding population parameters, is second-order efficient, and is developed in a distribution-free setting. User discretion is thus advised when choosing one of these procedures for minimizing the cost-compression risk in a dimension reduction problem. The existing literature on multi-stage methods for dimension reduction is sparse, and we are confident that the proposed procedure will prove very useful for problems like dimension reduction, feature selection, and others that require the use of PCA on the impromptu arrival of incoming data.