Analysis of Fat Big Data Using Factor Models and Penalization Techniques: A Monte Carlo Simulation and Application

This article assesses the predictive accuracy of factor models utilizing Partial Least Squares (PLS) and Principal Component Analysis (PCA) in comparison to autometrics and penalization techniques. The simulation exercise examines three types of scenarios by introducing the issues of multicollinearity, heteroscedasticity, and autocorrelation. The number of predictors and the sample size are adjusted to observe their effects. The accuracy of the models is evaluated by calculating the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE). In the presence of severe multicollinearity, the factor approach utilizing PLS demonstrates exceptional performance in comparison to its competitors. Autometrics achieves the lowest RMSE and MAE values across all levels of heteroscedasticity and provides better forecasts under low and moderate autocorrelation. However, the Elastic Smoothly Clipped Absolute Deviation (E-SCAD) forecasts well under severe autocorrelation. In addition to the simulation, we employ a popular Pakistani macroeconomic dataset for the empirical research. The dataset contains 79 monthly variables from January 2013 to December 2020. The competing approaches perform differently than on the simulated datasets, although the PLS factor approach again outperforms its competitors in forecasting, with lower RMSE and MAE values. This is most likely because the real dataset exhibits a high degree of multicollinearity.


Introduction
Regression analysis is a widely recognized statistical method employed in various fields, including finance and the social sciences. The main objective of regression analysis is to create a model that accurately represents the influence of one or more independent variables on a dependent variable. The Ordinary Least Squares (OLS) approach is a frequently employed technique for estimating the unknown parameters of a regression model [1]. The OLS estimates are derived by minimizing the sum of squared residuals. The approach is widely favored because of its high interpretability and ability to generate accurate estimates, provided that the underlying assumptions are met [2].
In the era of big data, dataset formats have changed. Previously, the number of observations, n, was generally much bigger than the number of explanatory variables, p. However, currently, n ≈ p or even n < p is common, referred to as high-dimensional data. These large datasets have presented new issues, such as insufficient degrees of freedom, multicollinearity, and heteroscedasticity, rendering standard linear regression models ineffective. Traditional econometric models do not provide sparse models, which may result in inefficient behavior when n < p. Advanced regression approaches are consequently necessary for enormous datasets, commonly known as big data [3].
The recent developments in the collection of macroeconomic data have led to a great focus on big data. An accurate analysis can be performed if we suitably extract the important information from a huge set of features. However, the performance varies depending on the data dimension and the estimation tool applied. Failure in dimension reduction induces poor output because of redundant variables. Since the influential work on forecasting through the Diffusion Index (DI) by [4], factor models have been considered the common approach for predictive modeling in a data-rich environment. Stock and Watson [5] showed that forecasting via factor models is more accurate than existing forecasting tools like autoregressive forecasts, bagging, pretest methods, empirical Bayes, and Bayesian model averaging. They inferred that the DI is an effective approach to reduce the regression dimension, and that it appears to be difficult to enhance this performance without introducing severe changes to the predictive model. Recent extensions of factor models for forecasting purposes include those of [6][7][8][9][10].
In addition to the DI methodology, sparse regression is another family of tools utilized for dimension reduction and forecasting, and it is particularly well known in the econometrics and statistics fields. Sparse regression tools attempt to keep the relevant features and force the coefficients of the irrelevant features to zero. The benefit of such tools is that they can accommodate the curse of dimensionality that has long been present in macroeconomic time series, while the predictions they produce also help to devise productive monetary policies [11,12].
The sparse regression models can be fitted through penalized regression, also known as shrinkage methods, such as the Least Absolute Shrinkage and Selection Operator (Lasso) of [13], the Smoothly Clipped Absolute Deviation (SCAD) of [14], the Elastic net (Enet) of [15], the Adaptive Lasso of [16], the Adaptive Enet of [17], the Minimax Concave Penalty (MCP) of [18], and the Elastic SCAD (E-SCAD) of [19]. In general, these penalties are collectively referred to as folded concave penalties. It is noteworthy that shrinkage methods can attain both accurate forecasts and consistent feature selection.
The use of these methodologies along with sparse modeling has become well known because they can successfully tackle huge sets of macroeconomic data and are a noticeable alternative to factor models. By employing a reduction in size, the Stochastic Dynamic Factor (SDF) model, which is equivalent to large factor models, can exhibit significant effectiveness [41,42], even when dealing with basic linear attributes from the conventional factor collection. According to [43], it may be necessary to use multiple characteristics-based parameters in order to accurately approximate the SDF. [44] provided formalization and evidence for the longstanding conjecture that, where there is a large number of characteristics-based factors, an unconditional SDF constructed from these factors will converge to the actual, conditional SDF. One can utilize Large Factor Models (LFMs) to construct the genuine, conditional stochastic discount factor (SDF). Although LFMs possess a high level of approximation capacity, they encounter a significant obstacle: they display significant statistical complexity and necessitate the estimation of a vast number of parameters (such as factor weights in the SDF), which greatly surpasses the number of observations. One could expect that a simplified version of the LFMs would perform better when tested with new data because it effectively reduces the problem of overfitting with the available data. [44] disproved this intuition. The studies conducted by [41] demonstrated the importance of complexity in factor pricing models. Specifically, LFMs with higher dimensions and a large number of parameters demonstrate superior performance when tested on data not used during training. These models exploit the numerous nonlinearities that are concealed in the connection between attributes and stock returns.
Similarly, ref. [45] employed a four-layer neural network consisting of 64 neurons in each layer, while ref. [46] utilized a four-layer neural network with four neurons in each layer. These narrow network topologies had a high number of parameters and functioned in regimes that were almost overfitting, as evidenced by the significant changes in their performance between training and testing data, as described in the aforementioned articles. These characteristics render them highly challenging to analyze systematically. Their loss landscape is extremely non-convex, containing many local minima and delivering questionable performance when applied to new data [42].
The big data environment and machine learning tools have currently garnered a great deal of attention in economic analysis [36]. When it comes to macroeconomic forecasting, [29] recommended penalized regression methods; [4,5,47] suggested factor-based models; and, similarly, autometrics was suggested by [48]. Recently, big data was categorized by [24] into three classes, Fat Big Data, Huge Big Data, and Tall Big Data, which can be further illustrated as follows:
• Fat Big Data: the number of covariates (large P) exceeds the number of observations (large N);
• Tall Big Data: the number of covariates (large P) is considerably lower than the number of observations (sufficiently large N);
• Huge Big Data: the number of covariates (large P) is lower than the number of observations (large N).
Here, P and N indicate the number of covariates and the number of observations, respectively. Visually, the three types of big data are depicted in Figure 1.

Earlier research works have focused on independent component analysis, PCA, and sparse PCA for the formulation of factor-based models. However, very few past studies have used the classical method (autometrics) for time series forecasting [22,48,49]. Apart from this, we have not found even a single paper to date in which the forecasting performance of a factor model based on PLS analysis has been explored theoretically. Moreover, various past studies have used penalization techniques such as ridge regression, elastic net, Lasso, adaptive Lasso, and the non-negative garrote, but none of the published works have yet utilized the modified versions of penalization techniques for the forecasting of macroeconomic variables.
This work employs several novel methods in big data analysis to enhance the existing empirical and theoretical research on macroeconomic forecasting by addressing the shortcomings of a recent study which specifically concentrated on Fat Big Data. By utilizing dimension reduction techniques, we develop factor-based models to emphasize the impact of these models on macroeconomic forecasting. To achieve this objective, factor-based models are developed by employing PCA and PLS. In addition, we evaluate both the conventional approach and updated forms of penalization approaches, namely, MCP and E-SCAD. We provide a thorough examination of the predictive capacities of factor models, classical methods, and penalized regression techniques. To summarize the entire discussion, our primary contribution is a comparison of the forecasting performance of penalized regression tools and autometrics with the factor models that have recently been established. The comparison is constructed through the use of exhaustive simulation exercises covering multicollinearity, autocorrelation, and heteroscedasticity, as well as an empirical application to a macroeconomic dataset. The purpose of this research is to develop a more advanced tool that can be used to provide assistance to practitioners and policymakers who are working with fat big data. The improved tool is not restricted to inflation, but can be applied to any macroeconomic time series.
The remaining sections are organized as follows. Section 2 provides a thorough discussion of the factor, classical, and penalized methods. A simulation exercise on the comparative performance of the various forecasting methods is discussed in Section 3. Empirical results and visualization are presented in Section 4. Concluding remarks are given in Section 5.

Methods
To effectively tackle the challenges presented by fat big data, we use a comprehensive set of advanced statistical methodologies, as well as penalization and machine learning techniques. Figure 2 depicts the approaches in detail, including factor models based on PLS and PCA, as well as traditional econometric methods such as autometrics. In addition, we use penalization methods like Lasso, Elastic net, and SCAD to improve predictive accuracy and model selection. Our methodology is intended to address major concerns such as multicollinearity, heteroscedasticity, and autocorrelation, resulting in a robust and dependable forecasting performance.

Factor Models
One of the most widely applied methods in macroeconomic forecasting, under a large set of features, is principal component analysis, which is based on the factor models suggested by [4,5]. The basic notion behind factor models is to distill the unseen, hidden factors from a huge set of features and then to utilize a relatively small number of factors as covariates for predictive modeling. Suppose Z_jk is a potential candidate covariate generated from the following equation:

Z_jk = π_j′ F_k^s + ϵ_jk,  j = 1, 2, ..., M,  k = 1, 2, ..., N,

where F_k^s = (f_1k, f_2k, ..., f_sk)′ is a vector of 's' common factors, π_j′ is a vector of 's' factor loadings, and ϵ_jk denotes the idiosyncratic random term.
The PCR: The formulation of the factor-based model requires the following two steps. In the first step, the F_k^s latent factors are extracted as principal components from all included covariates Z_jk by minimizing the sum of squared residuals

Σ_j Σ_k (Z_jk − π_j′ F_k^s)².

In the second step, the h-period-ahead out-of-sample forecast is constructed by running the PCR as follows:

R_(k+h) = γ′ F_k^s + e_(k+h),

where γ, the vector of estimated coefficients, has dimension 's' and is basically estimated from R_k and F_k. Detailed discussions regarding the factor approach are given by [4,30,50,51]. This method is most commonly employed in the literature on factor models, as PCs are easily generated using singular value decompositions [4,52,53].
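To make the two-step procedure concrete, the sketch below extracts principal-component factors from a simulated predictor matrix and runs the second-stage forecasting regression in R. The object names (Z, y, s, h) and the simulated data are illustrative assumptions, not the settings used in the paper.

```r
# Minimal sketch of the two-step diffusion-index (PCR) forecast.
# Z, y, s, and h are hypothetical names; the data are simulated for illustration.
set.seed(1)
n <- 120; p <- 40                              # moderately large set of candidate predictors
Z <- matrix(rnorm(n * p), n, p)                # candidate covariates Z_jk
y <- Z[, 1] - 0.5 * Z[, 2] + rnorm(n)          # target series

s <- 3                                         # number of retained common factors
pc <- prcomp(Z, center = TRUE, scale. = TRUE)
F_hat <- pc$x[, 1:s, drop = FALSE]             # step 1: estimated factors F_k

h <- 1                                         # forecast horizon
fit <- lm(y[(1 + h):n] ~ F_hat[1:(n - h), ])   # step 2: PCR on lagged factors
y_hat <- drop(c(1, F_hat[n, ]) %*% coef(fit))  # h-step-ahead forecast
```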
It is more likely that the factor approach will provide a poor forecast if the included common factors are dominated by omitted common factors [54]. Likewise, ref. [34] argued that PCA exploits only the factor structure of Z and does not account for the response variable; that is, the factors are extracted irrespective of the variable being forecast. Because the response variable is ignored during factor extraction, the resulting model's forecast can be inaccurate.
The PLS method: This study also considers PLS regression, a widely used alternative to PCA which was first introduced by [55]. The approach is suitable in a mountain-of-data environment (fat big data) and is deemed an alternative to factor models constructed using PCA. In contrast to PCA, PLS yields independent components by utilizing the existing association between the covariates and the corresponding response variable, while it also retains most of the variance of the covariates. PLS has proved to be successful in situations where the number of predictors (P) is sufficiently larger than the number of data points (N) and extreme multicollinearity exists among the covariates [56]. Generally, the PLS approach seeks the directions of maximal variance that help to delineate the covariates as well as the response variable. The mathematical form of the PLS regression can be expressed as

R_t = α_P′ y_t + ε_t,

where y_t = [y_1,t, y_2,t, ..., y_k,t]′ is a vector of covariates of size k × 1 observed at time t = 1, ..., T; α_P is a vector of coefficients with dimension k × 1; and ε_t is a random error. To achieve an h-period-ahead out-of-sample forecast, we may utilize the equation given below:

R̂_(t+h) = α̂_P′ y_t.
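A corresponding sketch with the pls package (one of the R packages listed in Section 3) is shown below; the number of components and the reuse of the simulated Z and y from the previous sketch are assumptions made only for illustration.

```r
# Minimal PLS regression sketch using the pls package.
# Z and y are the simulated objects from the PCR sketch; ncomp = 3 is assumed.
library(pls)

dat <- data.frame(y = y, Z = I(Z))               # keep Z as a single matrix column
fit_pls <- plsr(y ~ Z, data = dat, ncomp = 3,
                scale = TRUE, validation = "CV")  # cross-validated PLS components
summary(fit_pls)                                  # CV error per number of components
pred <- predict(fit_pls, newdata = dat, ncomp = 3)  # fitted/forecast values
```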

Penalized Regression and Classical Approach
In addition to the above factor models, we also consider penalized regression methods, including MCP and E-SCAD, as well as the classical method (autometrics), as both approaches are good alternatives to factor models. Here, we give concise outlines of these approaches, along with citations to thorough discussions of them.

Penalized Regression Methods
The parameters of the included penalized regression methods are estimated according to the following objective function:

α̂ = argmin_α { Σ_t (R_t − y_t′ α)² + g_π(α) },

where π refers to the regularization hyperparameter. The specification of the penalty term g_π(α) differs across the aforementioned penalized techniques; by definition, α is equal to (α_1, α_2, ..., α_n)′. For the selection of the hyperparameter π, we adopt a cross-validation approach in our study, following [36].
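For penalties with readily available software, the tenfold cross-validation step can be sketched as follows; the example uses glmnet for the Lasso/Elastic net case, and the mixing parameter alpha = 0.5 as well as the simulated Z and y are assumptions.

```r
# Sketch of tenfold cross-validation for the regularization hyperparameter.
# Z and y are the simulated objects from above; alpha = 0.5 (Elastic net) is assumed.
library(glmnet)

cv_enet <- cv.glmnet(Z, y, alpha = 0.5, nfolds = 10)  # alpha = 1 would give the Lasso
best_lambda <- cv_enet$lambda.min                     # hyperparameter minimizing CV error
coef(cv_enet, s = "lambda.min")                       # selected (sparse) coefficients
```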

MCP:
The MCP was initially developed by [18]. It corresponds to the penalized family of regression with a penalty term g_π(α) whose derivative takes the form (ℵπ − ϑ)_+/ℵ for ϑ ≥ 0, where ℵ is the concavity tuning parameter. According to Zhang [18], the probability that the MCP penalty chooses the right model tends to 1. Moreover, in terms of Lq-loss, the MCP estimator enjoys oracle properties provided that ℵ and π satisfy certain conditions in a high-dimensional setting [57]. More recently, the MCP has shown very interesting findings in terms of variable selection, estimation, and forecasting [58].
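A hedged sketch of MCP (and SCAD) estimation with the ncvreg package, again reusing the simulated Z and y, is given below; the concavity parameter is left at the package default.

```r
# Sketch of MCP and SCAD estimation with tenfold cross-validation via ncvreg.
# Z and y are the simulated objects from above; tuning defaults are assumed.
library(ncvreg)

cv_mcp  <- cv.ncvreg(Z, y, penalty = "MCP",  nfolds = 10)
cv_scad <- cv.ncvreg(Z, y, penalty = "SCAD", nfolds = 10)
coef(cv_mcp)    # coefficients at the CV-optimal regularization value
coef(cv_scad)
```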
E-SCAD: SCAD was modified by adding the L2 penalty; the resulting method is called elastic SCAD (E-SCAD). In addition to the oracle property, E-SCAD achieves an extra property whereby the penalty function drives the joint inclusion or exclusion of a strongly correlated set of predictors from the model. To accomplish this, the procedure does not require any prior information [19]. Mathematically, the E-SCAD penalty combines the SCAD penalty with a ridge term and is given as follows:

g_(π1,π2)(α) = Σ_j p_π1^SCAD(|α_j|) + π2 Σ_j α_j²,

where p_π1^SCAD(·) denotes the SCAD penalty of [14], and π1 and π2 are the SCAD and ridge regularization parameters, respectively.

Classical Approach
Autometrics is a popular statistical approach which is applicable in the case of huge big data as well as fat big data [24]. In general, the autometrics algorithm consists of five steps. In the initial step, the model is designed in a linear form in which all the covariates are included, called a Generalized Unrestricted Model (GUM). The second step provides the estimates of the unknown parameters and tests them for statistical significance. The third step involves a pre-search process, which is followed by a tree-path search in step four. In step five, the model is selected for forecasting.
We obtain the forecasting model by implementing autometrics on the GUM:

R_k = β_0 + Σ_j β_j Z_jk + ε_k.

For model selection, the super-conservative strategy is considered in this study, which is primarily based on a significance level of one percent; that is, the significance of the estimated coefficients is judged at the one-percent level.
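In outline, the general-to-specific reduction can be reproduced with the gets package, as sketched below; the GUM here simply pools the simulated covariates from the earlier sketches, and t.pval = 0.01 mirrors the one-percent significance level described above.

```r
# Sketch of an autometrics-style GUM reduction with the gets package.
# Z and y are the simulated objects from the earlier sketches.
library(gets)

gum <- arx(y, mxreg = Z, mc = TRUE)   # GUM: intercept plus all candidate covariates
sel <- getsm(gum, t.pval = 0.01)      # multi-path general-to-specific search at 1%
sel                                   # retained regressors form the forecasting model
```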

Monte Carlo Evidence on Forecasting Performance
This section reports simulation exercises intended to explore the predictive power of factor models against the classical and penalization methods. In doing so, we consider three main scenarios: multicollinearity, heteroscedasticity, and autocorrelation. For the multicollinearity cases, three levels of correlation among the set of features are assumed, namely low (0.25), moderate (0.50), and high (0.90), under normally distributed errors. To generate the artificial data for our simulation experiments, we follow the data-generation process of [24,59].

Data-Generating Process (DGP)
The following equation is used to generate the data:

y_i = Σ_j β_j x_ij + ε_i.

The set of covariates x_i is generated from a multivariate normal distribution with a mean of zero, and the pairwise covariance between x_m and x_n is cov(x_m, x_n) = ρ_x^|m−n|, where ρ_x denotes the assumed level of correlation [59]. We consider two candidate sets of 50 and 70 variables, which are further divided into relevant (p) and irrelevant (q) variables, as depicted in Figure 3.
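A minimal sketch of this data-generating step, assuming MASS::mvrnorm for the multivariate normal draw and illustrative values for the sample size, number of candidates, correlation level, and coefficients, is given below.

```r
# Sketch of the DGP: correlated covariates with cov(x_m, x_n) = rho_x^|m - n|,
# a handful of relevant variables, and the rest irrelevant (zero coefficients).
# n, P, p_rel, rho_x, and beta are illustrative choices, not the paper's settings.
library(MASS)

set.seed(123)
n <- 100; P <- 50; p_rel <- 5                 # observations, candidates, relevant covariates
rho_x <- 0.9                                  # severe multicollinearity case
Sigma <- rho_x ^ abs(outer(1:P, 1:P, "-"))    # Toeplitz-type correlation matrix
X <- mvrnorm(n, mu = rep(0, P), Sigma = Sigma)
beta <- c(rep(1, p_rel), rep(0, P - p_rel))   # only the first p_rel variables are active
y <- drop(X %*% beta) + rnorm(n)              # response under normally distributed errors
```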
The second scenario explores the forecast performance in the presence of autocorrelation; more specifically, it examines how the factor models compete with the rival methods when the error term of the model is autocorrelated. The correlation between current and lagged realizations is symbolized by ρ, and the error term is generated as

ε_t = ρ ε_(t−1) + u_t,

where u_t is a white-noise innovation. Our simulation exercises assume three levels of autocorrelation, low, moderate, and high, with ρ ∈ {0.25, 0.5, 0.9}.
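The autocorrelated error term can be simulated, for instance, as a first-order autoregressive process; the sketch below uses arima.sim and reuses the covariates from the previous sketch, with the innovation variance left at its default.

```r
# Sketch: AR(1) errors e_t = rho * e_{t-1} + u_t for the autocorrelation scenario.
# X, beta, and n come from the DGP sketch above; rho = 0.9 is the severe case.
rho <- 0.9
e <- as.numeric(arima.sim(model = list(ar = rho), n = n))
y_ac <- drop(X %*% beta) + e        # response with autocorrelated errors
```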
Similarly, the third scenario focuses on heteroscedasticity, in which the variance of the error term differs across observations and is denoted by δ_k. Thus, we break the variance δ_k into two segments, δ_1 and δ_2. Supposing there are 'n' data points, we set the variance of the first n/2 data points to δ_1 and the variance of the remaining data points to δ_2. Our simulation exercises consider low, moderate, and high levels of heteroscedasticity by adjusting the values of π_i = (σ_1/σ_2), where i = 1, 2, 3 and π_i ∈ {0.1/0.3, 0.2/0.6, 0.3/0.9}. For all penalization techniques and factor models in our study, we select the optimal hyperparameter(s) by means of tenfold cross-validation. We divide the dataset so that 80 percent of the data are used for model training and the remaining data are used for model evaluation in order to compare the prediction capabilities of all procedures. We repeat the process H = 1000 times. The means of the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE) are calculated over the 'H' replications to evaluate the predictive power. Through these two criteria, we can assess the prediction accuracy of all included methods. Smaller values of MAE and RMSE indicate comparatively better forecasts. To obtain the simulation and empirical results, we rely on various packages, namely gets, pls, caret, ncvreg, Metrics, and forecast, under the R programming language.
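The 80/20 split and the two accuracy criteria can be computed as sketched below; rmse and mae come from the Metrics package mentioned above, the simulated X and y are reused from the DGP sketch, and the fitted linear model is only a placeholder for any of the competing methods.

```r
# Sketch of the 80/20 training/evaluation split and RMSE/MAE computation.
# X, y, and n come from the DGP sketch; the linear model is a placeholder.
library(Metrics)

idx_train <- 1:floor(0.8 * n)                     # first 80% of observations for training
idx_test  <- setdiff(1:n, idx_train)              # remaining 20% for evaluation

fit  <- lm(y[idx_train] ~ X[idx_train, 1:5])      # placeholder forecasting model
pred <- drop(cbind(1, X[idx_test, 1:5]) %*% coef(fit))  # out-of-sample predictions

rmse(y[idx_test], pred)                           # Root Mean Square Error
mae(y[idx_test], pred)                            # Mean Absolute Error
```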

Simulation Results
The forecast comparison output obtained from the Monte Carlo exercises is reported in Tables 1-3. The entries in bold show the best-performing model. It can be observed that the performance of all procedures improves as the number of data points increases.
Scenario 1. Considering the cases of low and moderate multicollinearity, the predictive ability of autometrics is more effective than that of the competing methods. However, in the case of a small sample, the RMSE and MAE associated with autometrics are only slightly better than those of the PLS-based factor approach, which clearly indicates that the PLS-based factor approach is strongly competitive. Similarly, despite a considerable improvement in RMSE and MAE for E-SCAD when the sample size is increased, its results are not satisfactory in contrast to autometrics. Moreover, when the numbers of active and inactive variables are increased, autometrics remains dominant, with the lowest RMSE and MAE. In the presence of extreme multicollinearity, the factor approach based on PLS outperforms its rival counterparts in terms of the lowest forecast error, while, according to both error criteria, autometrics stands as a good competing method.
Scenario 2. Based on RMSE and MAE, the forecasting capabilities of autometrics are superior to those of all competing counterparts in the presence of heteroscedasticity. In contrast, the MCP and E-SCAD perform poorly with a small sample size, but as we expand the data window (large sample), their forecasting performance improves dramatically. This indicates that penalized regression models require a large number of data points in order to provide accurate forecasts.

Scenario 3. Despite the addition of more irrelevant variables, autometrics demonstrates remarkable forecasting performance for low and moderate autocorrelation. E-SCAD remains a good competing method, particularly when more observations are used. Considering extreme autocorrelation, E-SCAD provided the lowest RMSE and MAE compared to the other competing counterparts, but autometrics was still a good contender.

Testing on Empirical Dataset
Complementing the simulation exercises, we analyze the macroeconomic time series dataset for Pakistan.
The dataset consists of 79 aggregated and disaggregated variables collected at a monthly frequency over the period from January 2013 to December 2020. The dataset covers the fiscal sector, real sector, financial and monetary sector, and external sector of the economy of Pakistan. The data are taken from the State Bank of Pakistan. The forecasting model is constructed for inflation (INF), and a long list of predictor variables is selected for this model. All the variables are transformed in order to make them stationary prior to the empirical analysis. Generally, a logarithmic transformation is applied to all non-negative time series that are not already in rates [5]. A complete list of the variables utilized in the analysis is given in Table A1 in Appendix A.
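As a small illustration of this transformation step, a non-negative level series would typically be logged and differenced until stationary; the series below is simulated and its name is hypothetical.

```r
# Sketch of the stationarity-inducing transformation applied to a level series.
# 'cpi' is a hypothetical monthly series; the number of differences is data driven.
library(forecast)

cpi <- ts(cumsum(rnorm(96, mean = 0.5)) + 100, start = c(2013, 1), frequency = 12)
log_cpi <- log(cpi)                                 # log transform of a non-negative series
d <- ndiffs(log_cpi)                                # differences suggested by a unit-root test
inf_like <- diff(log_cpi, differences = max(d, 1))  # (approximately) stationary series
```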

Out-of-Sample Inflation Forecasting
The time series is divided into two parts (indicated by a dashed line in Figure 4) in order to facilitate the assessment of out-of-sample forecast accuracy. For model estimation, we utilize the data from January 2013 to February 2019, while the data from March 2019 to December 2020 are used to assess the models' post-sample prediction accuracy multiple steps ahead. Figure 5a,b present the forecasting experiment across the different forecasting methods for one of the core macroeconomic variables of interest (inflation). The forecasting accuracy is given as the RMSE and MAE, represented in our case by a bar chart showing the results of the different methods. The smaller the length of a bar, the better the forecast attained by the model. By observing the bar lengths in Figure 5a,b, we can infer that the PLS-based factor model was superior to its rival counterparts in the post-sample forecast. In contrast, autometrics produced a good forecast, but it was not as satisfactory as that provided by the PLS-based factor model.

Discussion, Implications, and Limitations
In this section, we provide a discussion and explore the implications and limitations of the study. We evaluate the prediction capacities of several statistical models and machine learning technologies in time series forecasting scenarios using both theoretical examination and empirical research.

Discussion
This article explores the predictive power of widely used statistical models against classical and sophisticated machine learning tools, both theoretically and empirically. To be more specific, our core aim was to discover how well the most popular models in the context of time series forecasting, that is, factor models, performed against classical and shrinkage methods. Different sample sizes and numbers of predictor variables were used to evaluate each technique under the conditions of multicollinearity, heteroscedasticity, and autocorrelation. Across the simulation exercises, the relative performance of the methods was found to be consistent. In the presence of low and moderate multicollinearity, based on RMSE and MAE values, autometrics outperformed the other competing counterparts. Considering the extreme case of multicollinearity, the PLS-based factor approach beat its rival counterparts, as it had the lowest forecast error. Considering different levels of heteroscedasticity, the lowest RMSE and MAE values were attained by autometrics, which indicates its dominance over all other methods in post-sample forecasting. Across low and moderate levels of autocorrelation, autometrics produced a better forecast; in contrast, E-SCAD provided the lowest RMSE and MAE values under extreme autocorrelation.

Implications
Complementing the simulation exercise, we carried out an empirical application on a well-known Pakistani macroeconomic dataset. The dataset comprised 79 time series observed at a monthly frequency from January 2013 to December 2020 and was collected from the State Bank of Pakistan. We utilized data from January 2013 to February 2019 for model estimation, and data from March 2019 to December 2020 for evaluating the models' post-sample forecasting accuracy multiple steps ahead. The statistical accuracy measures, namely, RMSE and MAE, were used to compare the post-sample predictive ability of the factor models against autometrics and the ML techniques. Based on both statistical measures, the factor approach derived from PLS produced a better forecast than its competing counterparts. These results are consistent with the findings of [59].

Limitations and Future Avenue
There are several limitations of this study. First, it concentrated merely on linear models and was confined to monthly data. Moreover, the simulation exercise was confined to normally distributed errors, which will generally not be the case for real-world phenomena. Hence, future work can be carried out to fill these gaps.

Figure 2. Schematic representation of fat big data methods.

Figure 3. Classification of candidate variables into p and q variables.

Figure 4. Monthly inflation series against time.

Figure 5. Out-of-sample forecast comparison. The PLS-based factor model outperforms the competing methods.

Table 1. Variable selection under multicollinearity from Monte Carlo simulation. Note: Bold values show a better forecast.

Table 2. Variable selection under heteroscedasticity from Monte Carlo simulation.

Table 3. Variable selection under autocorrelation from Monte Carlo simulation.
