The Lasso and the Factor Zoo-Predicting Expected Returns in the Cross-Section

Abstract: We investigate whether Lasso-type linear methods are able to improve the predictive accuracy of OLS in selecting relevant firm characteristics for forecasting the future cross-section of stock returns. Through extensive Monte Carlo simulations, we show that Lasso-type predictions are superior to OLS when type II errors are a concern. The results change if the aim is to minimize type I errors. Finally, we analyze the predictive performance of the competing methods on the US cross-section of stock returns between 1974 and 2020 and show that only small and micro-cap stocks are highly predictable throughout the entire sample.


Introduction
After years of strong growth in the number of published firm characteristics (FC) claiming to explain differences in average cross-sectional returns, some researchers have more recently shifted their attention to the fundamental question of which statistical method to employ in selecting these variables; see, for example, Harvey et al. [1], McLean and Pontiff [2] or Green et al. [3]. Given that understanding differences in cross-sectional returns has far-reaching implications for finance theory in general, and consequently also for a vast part of the investment management industry, improving these methods is a prerequisite for future finance research. This work aims to contribute to this task by investigating which FC matter when the selection process itself targets prediction, and by highlighting the relative predictive accuracy of various shrinkage methods through an extensive simulation study and an empirical investigation of cross-sectional returns in the US.
More generally, selecting variables, estimating coefficients and predicting noisy targets are common challenges for finance and economics. An important application in the context of selecting FC is the seminal contribution by Fama and French [4], where variable selection is performed based on the multivariate regression framework and where insignificant coefficients are discarded. In particular, they regress cross-sectional returns on several firm characteristics to determine the crucial set of criteria that explain differences in returns. Based on this selection procedure, Fama and French [5] form the well-known Fama-French (FF) three-factor model, which has set the benchmark and raised the bar for detecting new relevant FC. However, these estimates, usually obtained from ordinary least squares (OLS), often suffer from a large variance and, hence, conclusions about the relevance of coefficients come potentially with a high degree of uncertainty.
To overcome the high variance problem of classical linear methods, the machine learning literature has introduced alternative methods for variance reduction by tolerating a small bias. In an important contribution Tibshirani [6] presents the least absolute shrinkage and selection operator (Lasso) method for estimating linear models. It simultaneously performs variable selection and coefficient estimation by shrinkage. To preserve the advantages of absolute shrinkage, Zou [7] proposes a modified version, the so-called adaptive Lasso, such that consistent variable selection can be achieved even under less stringent conditions.
This study contributes to the literature by developing an extensive Monte Carlo simulation to generate a panel of plausible cross-sectional returns, in which a distinct and novel feature is the flexible simulation of high-dimensional FC correlation matrices. This simulation design allows us to investigate extensively the predictive performance of Lasso methods in panels for various error specifications and to highlight potential problems related to the correct selection of FC that contain useful information to predict the cross-section of expected returns.
The primary goal of the paper is to answer the question of whether Lasso-type methods can be useful in predicting differences in expected cross-sectional returns. Secondly, the paper aims to determine which firm characteristics drive these predictions and how they compare to classical approaches. In addition to the empirical evaluation, we use a simulation study to shed light on the properties of the methods in finite samples. For the empirical part of our analysis, we focus on the US cross-section. We include 62 published firm characteristics constructed based on the CRSP/Compustat merged database with monthly data starting from 1974 until 2020.
It is important to note that we constrain our selection and prediction procedure to the linear setting. Specifically, we want to perform our prediction based on a multivariate regression that is consistent with the original scope in which the considered FC were introduced in the literature. Hence, this study builds on the work of Green et al. [3]. In particular, the authors analyze a large set of FC in a linear multivariate Fama and MacBeth [8] regression; we closely follow their data construction and FC pre-selection procedure. However, instead of relying on a multivariate regression, we apply the adaptive Lasso, a true variable selection method.
The simulation results indicate some advantages of the adaptive Lasso over the Lasso in selecting the true set of FC. In contrast, Lasso-type predictions rank consistently better when predictive accuracy is the main objective. We find patterns consistent with the simulation results when predicting US small-cap stock returns, as two of the considered Lasso-type specifications achieve the best predictions. Large-cap stocks are not forecastable with the methods included in this work and the naïve zero return forecast cannot be rejected as being inferior to the included set of linear estimators. These results on the US expected returns cross-section confirm and extend the empirical evidence provided in the previous literature.
The full pooled panel adaptive Lasso selection characterizes 21 FC of relevance for future differences in stock returns. This is in stark contrast to 47 variables selected by the Lasso, 23 by pooled ordinary least squares (POLS) and 13 by POLS inference corrected for multiple testing. The most dominant FC for prediction is based on price information; the most consistently selected is short-term reversal. Moreover, the Fama and French [9] five-factor model is fully represented in the Lasso-based selection, but complemented by additional FC. Although the methods considered in the current study substantially differ, generally this contrasts with the findings of Green et al. [3], as they identify a relatively low-dimensional linear cross-section.
This study contributes to different strands of the literature. First, it contributes to the asset pricing literature by analyzing the usefulness of Lasso-type methods in selecting relevant FC for estimating and predicting expected cross-sectional stock returns; we refer, among others, to Cochrane [10,11], Goyal [12], and Hou et al. [13] for reviews of the different research questions, estimation methods, and introduced FC related to asset pricing. Harvey et al. [1] introduce the concepts of the family-wise error rate and the false discovery rate to the finance literature. Applying the t-value adjustment reveals that many published factors would lose their status as a significant factor. However, the method suffers from shortcomings from a prediction perspective that our work takes into consideration: it does not explicitly take into account the dependence structure of the FC, and it neglects the trade-off between type I and type II errors.
More recently, Kozak et al. [14] investigate the problem from a portfolio perspective in combination with $L_2$ and $L_1$ penalties. The authors identify a sparse set of FC. A mean-variance (MV) optimized portfolio including 50 anomaly variables yields a CAPM alpha similar to the Fama and French [9] five-factor model. Furthermore, the authors include two-dimensional interactions between these 50 FC and show a substantial increase in the alpha of the MV portfolio compared to the case without interactions.
Feng et al. [15] propose a double Lasso model selection methodology to systematically investigate the in-sample marginal contribution to asset pricing of some new, additional factors beyond what is explained by a possibly vast number of already existing ones. They introduce a framework for conducting in-sample statistical inference in such a high-dimensional setting and provide robustness checks to verify the sensitivity of the results with respect to the involved tuning parameters in finite samples. In contrast to their study, our analysis focuses on out-of-sample prediction and on the evaluation of the forecasting accuracy of the resulting Lasso-based models. We provide new evidence about the finite sample properties of the Lasso (and other) estimators to select relevant factors for prediction through an extensive simulation exercise that is broader in terms of competing models and model selection criteria designed for forecasting than the one presented by Feng et al. [15].
Finally, Bryzgalova [16] pays particular attention to problems arising from model misspecification when using shrinkage methods in the context of factor models. She introduces an alternative adaptive weighting scheme based on partial correlations instead of the two-stage procedure used by the adaptive Lasso. The work of Freyberger et al. [17] approaches the problem using non-parametric techniques. DeMiguel et al. [18] analyze the FC selection from a portfolio perspective in a framework that combines shrinkage and mean-variance (MV) optimization. Moreover, a fast-growing strand of the literature addresses the prediction problem from a non-linear perspective; see, for example, Messmer [19], Moritz and Zimmermann [20] or Gu et al. [21].
Second, our study is related to the literature that investigates the finite-sample or asymptotic properties of shrinkage approaches in financial settings. The Lasso introduced by Tibshirani [6] is motivated by the desire to improve OLS estimates without the shortcomings of subset selection and ridge regression. Tibshirani [6] notes that subset selection suffers from high variability, as small changes in the data can easily cause it to select a different model. Zou [7] remarks that subset selection can become computationally infeasible if the number of variables is large. Ridge regression, which penalizes the sum of the squared coefficients ($\ell_2$-norm) in a linear regression framework, on the other hand, has no obvious interpretation because the coefficients are not set exactly to zero. The Lasso estimator optimizes least squares under an additional condition: the sum of the absolute values of the coefficients (known as the $\ell_1$-norm) cannot be larger than a given tolerance value. The inclusion of a penalty term leads to consistent coefficient estimation and variable selection if two necessary conditions are fulfilled, as Meinshausen [22] shows; see Bühlmann and Van De Geer [23] for a detailed discussion. These conditions are too restrictive in many empirical applications. Zou [7] modifies the Lasso insofar as the weight of each coefficient in the penalization term is adaptive. This is achieved by scaling the absolute value of each coefficient with a first-stage estimator such that more highly relevant variables are less strongly affected by the penalty. Setting adaptive weights leads to consistent variable selection and coefficient estimation even if one of the two conditions is not fulfilled.
The previously mentioned consistency properties are developed for a cross-sectional set-up with iid errors. Typically, the majority of applications in finance require the use of time-series or panel data. Moreover, an iid error specification is the exception rather than the rule. Consequently, Medeiros and Mendes [24], Caner and Zhang [25], Caner and Kock [26], Kock and Callot [27], Audrino and Camponovo [28] and Kock [29,30] derive asymptotic properties of the Lasso and the adaptive Lasso in time series and panel settings. In particular, Medeiros and Mendes [24] and Audrino and Camponovo [28] derive consistency properties of the adaptive Lasso in time series environments. Medeiros and Mendes [31] prove that the oracle properties of the adaptive Lasso are preserved for linear time series models even under non-Gaussian, conditionally heteroscedastic and time-dependent errors. Audrino and Camponovo [28] show that the adaptive Lasso combines efficient parameter estimation, variable selection and valid finite sample inference for general time series regression models. We contribute to this strand of the literature by investigating the finite sample properties of Lasso-type estimators through extended simulations in a realistic panel data setting that closely mimics the behavior of the expected returns cross-section with a prediction target.
The paper is organized as follows. Section 2 provides a description of the relevant methodology. This section is followed by a description of the estimation objective and how it relates to a factor structure. Section 3 presents the simulation study. The data are briefly discussed in Section 4. The penultimate section covers the empirical work, including the return prediction and FC selection results. The final section concludes.

Methodology
This section introduces the notation and presents the underlying estimation methods and the statistics we use to evaluate the selection and prediction performance of the methods in our simulations and the empirical analysis.

Notation
Generally, unless explicitly stated otherwise, $n$ refers to stock $n$ of $N_t$ total stocks and $t$ to period (i.e., month) $t$ of $T$ total periods. Moreover, factors are indexed by $c$ of $C$ total factors and belong to the set $\mathcal{C}$, where each $c$ belongs to one of the following three groups or subsets: priced factors are denoted by $p$ (of a total $P$) and define the set $\mathcal{P}$; unpriced factors are denoted by $u$ (of a total $U$, set $\mathcal{U}$); and spurious factors with respect to the return process are denoted by $s$ (of a total $S$, set $\mathcal{S}$). The total number of factors is $P + U + S = C$. A specification of each type of factor is outlined in more detail in Section 2.2. The indexing of FC is identical to that of the factors.

FC and the Return Generating Process
Generally, we assume a Rosenberg [32] and Daniel and Titman [33] type cross-sectional return structure. Covariances are determined by a factor structure, and expected returns are a compensation for factor risk (default assumption). Following Daniel and Titman [33], we consider the following excess return generating process:

$$R^e_{n,t} = \beta_{n,t-1}' f_t + x_{n,t-1}' \delta + \eta_{n,t}, \qquad t = 1, \ldots, T, \quad n = 1, \ldots, N, \tag{1}$$

where $f_t$ is the vector of factor returns of length $C$ and $\eta_{n,t}$ is each stock's idiosyncratic noise component, assumed to be normally distributed and orthogonal to the factors and to other stocks' idiosyncratic components. $x_{n,t} = [1 \; C_{n,t}]$ is the vector of the corresponding $C$ FC and an intercept, with $C_{n,t} = [\mathrm{size}_{n,t}, \mathrm{bm}_{n,t}, \mathrm{mom}_{n,t}, \ldots]$. Each factor, $f_{i,t}$, follows the dynamics

$$f_{i,t} = \mu_i + \epsilon_{i,t},$$

where $\mu_i$ is the risk premium of the $i$-th factor and $\epsilon_{i,t}$, $i = 1, \ldots, C$, is the sequence of independent factor innovations. Moreover, we assume a linear functional relation between the FC and the factor exposures, $\beta_{n,t} = g(x_{n,t}) = a_t + B x_{n,t}$, with $B$ a $C \times C$ matrix of coefficients. Note that $a_t$ cancels once we consider de-meaned cross-sectional returns. Naturally, exposures are time-varying. This is in line with the empirical characteristics, as momentum or value exposures vary with the price level movements of each stock. Model (1) then becomes

$$R^e_{n,t} = x_{n,t-1}' B' \mu + x_{n,t-1}' \delta + \tilde{\eta}_{n,t}, \tag{2}$$

where $\tilde{\eta}_{n,t} = x_{n,t-1}' B' \epsilon_t + \eta_{n,t}$ is the new zero-mean innovation. As a consequence, the linear predictive dependence we aim to measure is of the form

$$R^e_{n,t} = x_{n,t-1}' \gamma + \tilde{\eta}_{n,t}, \qquad \gamma = B' \mu + \delta. \tag{3}$$

To allow for different interpretations of the relationship between FC and expected returns from an asset pricing perspective, we differentiate among three types of factors, namely priced, unpriced, and spurious factors. For the $S$ spurious factors, $\beta_{s,t} = 0$ and $\delta_s = 0$, $\forall s \in \mathcal{S}$.
Examples of priced factors are the market or value factor; for unpriced factors, sector factors; and for spurious factors, an independently created random time series. In a first model setting, we consider risk premia that are always coupled to the underlying risk exposure, that is, $\delta = 0$ and $\gamma_i = b_i' \mu = 0 \; \forall i \in \mathcal{U}$, where $b_i = B_{\cdot,i}$ denotes the $i$-th column of $B$. As an example, the CAPM can be found in this asset pricing model interpretation by considering only one priced factor, the market factor, and no unpriced or spurious factors.
Under a second asset pricing modeling interpretation, we consider a model where $\delta_i$ is not constrained to be equal to zero for the unpriced factors in (3). In case $\gamma_i = \delta_i \neq 0$ for some $i \in \mathcal{U}$, $\delta_i$ measures the sensitivity of expected returns to FC that do not directly compensate for factor risk. It imposes a non-zero covariance between FC ($x_{n,t-1}$) and some factors in case the FC are linked to the non-zero elements in $\delta$, as described in Daniel and Titman [33]. In this model setting, we might have zero-priced factors. In particular, this asset pricing model allows two stocks with an identical book-to-market ratio to have different risk exposures to a book-to-market value factor. Here the return compensation is associated with the book-to-market characteristic, i.e., mispricing, and not with its risk sensitivity to the value dimension. The first asset pricing modeling framework rules this out. The second asset pricing model implies the presence of asymptotic arbitrage. Regardless of the interpretation, both models are estimated using (3).
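As a minimal numerical sketch of the two interpretations, the snippet below computes the characteristic-level coefficients as $\gamma = B'\mu + \delta$, which reduces to $\gamma_i = b_i'\mu$ when $\delta = 0$. The function names and the toy values for $B$, $\mu$ and $\delta$ are illustrative assumptions and not calibrated quantities from the study.

```python
def gamma_from_structure(B, mu, delta):
    """Return gamma with gamma_i = b_i' mu + delta_i,
    where b_i is the i-th column of the square matrix B."""
    C = len(mu)
    return [sum(B[k][i] * mu[k] for k in range(C)) + delta[i] for i in range(C)]


def expected_excess_return(x, gamma):
    """Linear predictive relation x' gamma for a single stock."""
    return sum(x_i * g_i for x_i, g_i in zip(x, gamma))


# Toy example (hypothetical numbers): factor 0 is priced, factor 1 is unpriced.
B = [[1.0, 0.0],
     [0.3, 2.0]]
mu = [0.06, 0.0]

# First interpretation: premia arise only through factor exposure (delta = 0),
# so the characteristic tied to the unpriced factor has gamma_1 = 0.
gamma_risk = gamma_from_structure(B, mu, [0.0, 0.0])

# Second interpretation: characteristic 1 earns a premium directly (delta_1 != 0).
gamma_mispricing = gamma_from_structure(B, mu, [0.0, 0.02])

x = [1.0, -2.0]  # de-meaned characteristics of one stock
print(expected_excess_return(x, gamma_risk))
print(expected_excess_return(x, gamma_mispricing))
```

In the second call, the same characteristics produce a different expected return although the factor exposures are unchanged, which is the mispricing channel described above.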

Methods
We focus on three different linear models, which are defined as:

$$\hat{\beta}^{ols} = \arg\min_{\beta} \, \|Y - X\beta\|_2^2, \tag{4}$$

$$\hat{\beta}^{Lasso}(\lambda) = \arg\min_{\beta} \left( \frac{\|Y - X\beta\|_2^2}{n} + \lambda \|\beta\|_1 \right), \tag{5}$$

$$\hat{\beta}^{adapt}(\lambda) = \arg\min_{\beta} \left( \frac{\|Y - X\beta\|_2^2}{n} + \lambda \sum_{j=1}^{p} \frac{|\beta_j|}{|\hat{\beta}_{init,j}|} \right), \tag{6}$$

where $Y \in \mathbb{R}$ and $X \in \mathbb{R}^p$, with the corresponding response vector $Y_{n \times 1}$, the design matrix $X_{n \times p}$ and the parameter vector $\beta_{p \times 1}$. We slightly deviate from our notation in this subsection and denote the regression coefficient as $\beta$ (vs. $\gamma$). In all other sections we use the term $\beta$ exclusively as a measure of factor exposure, and $\gamma$ as the regression coefficient we aim to estimate. Moreover, throughout this work we treat $Y$ and $X$ as standardized matrices, with $\mu = 0$ and $\sigma = 1$, where the standardization is applied column by column. As defined in (1), $Y$ corresponds to the vector of excess returns, $R^e$, and $X$ to the matrix of FC. Equation (4) defines the ordinary least squares (OLS) estimator, Equation (5) the Lasso estimator (Tibshirani [6]) and (6) the adaptive Lasso (Zou [7]). The Lasso and the adaptive Lasso differ in terms of the penalization term: the adaptive Lasso allows the weights to vary for each parameter. The assigned individual weights are inversely proportional to a first-stage $\beta$ estimate. Zou [7] suggests the use of the OLS estimator, $\hat{\beta}^{ols}$, as $\hat{\beta}_{init}$, unless collinearity is an issue. Bühlmann and Van De Geer [23] set $\hat{\beta}_{init} = \hat{\beta}^{Lasso}(\lambda)$. The use of the Lasso as a first-stage estimator is justified by the screening property of the Lasso, which still allows consistent variable selection of the adaptive Lasso at the second stage. We solely use the Lasso as a first-stage estimator in (6) in this work. The penalty term $\lambda$ used in (5) and (6) is determined by cross-validation (CV), typically five-fold or ten-fold, or by classical selection criteria such as the Bayesian information criterion (BIC) or the Akaike information criterion (AIC). Under the BIC and, accordingly, the AIC, the optimal $\hat{\lambda}$ is the value that minimizes the respective criterion over $\lambda$; see Bühlmann and Van De Geer [23] for the exact expressions. Alternatively, the optimal $\lambda$ can be estimated by cross-validation. Here we randomly split the samples along time points and never within a given period.
Assume we observe $T$ periods, each containing $N$ stocks $S_{1,t}, S_{2,t}, \ldots, S_{N,t}$. Consequently, we can randomly select a training and testing set along the time index $t$. Hence, high cross-sectional correlations cannot cause biased estimates for the optimal $\lambda$.
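The time-based split can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `time_based_folds`, the toy panel and the round-robin assignment of shuffled periods to folds are hypothetical choices and not necessarily the exact procedure used in the study. The key property is that all stocks of a given period end up on the same side of the split.

```python
import random


def time_based_folds(periods, k=5, seed=0):
    """Assign whole periods (never single observations) to k CV folds.

    periods[i] is the period label of pooled observation i.
    Returns a list of (train_indices, test_indices) pairs.
    """
    unique_periods = sorted(set(periods))
    rng = random.Random(seed)
    rng.shuffle(unique_periods)
    # Deal the shuffled periods round-robin into k folds.
    fold_of = {t: pos % k for pos, t in enumerate(unique_periods)}
    folds = []
    for fold in range(k):
        test = [i for i, t in enumerate(periods) if fold_of[t] == fold]
        train = [i for i, t in enumerate(periods) if fold_of[t] != fold]
        folds.append((train, test))
    return folds


# Toy panel (hypothetical): 10 months with 3 stocks each, stacked by month.
periods = [t for t in range(10) for _ in range(3)]
folds = time_based_folds(periods, k=5)
for train, test in folds:
    print(sorted({periods[i] for i in test}))
```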
The empirical set-up makes the shrinkage methods introduced above an attractive choice, as they can reduce the variance at the cost of a slightly higher bias. First, as the variance increases in $p$, the ratio $p/T$ can potentially be high: more than 400 factors have been presented in the literature, while in the best case 50 years of monthly data are available ($T = 600$). Moreover, if some FC are available only for a shorter period of time, we can still perform the regression, as the Lasso methods remain feasible even in a truly high-dimensional problem ($p > T$), a case in which classical OLS is not feasible. Finally, the noise component unambiguously makes up a significant proportion of the return process (even when assuming that the efficient market hypothesis is violated). Hence, the noise variance component has an important impact.
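The two shrinkage estimators of this subsection can be sketched with a plain coordinate descent. This is an illustrative implementation under stated assumptions, not the code used in the study: the objective is scaled as $\|Y - X\beta\|_2^2 / n$ plus the weighted $\ell_1$ penalty, the function names are hypothetical, the penalty levels are fixed by hand rather than chosen by CV, BIC or AIC, and the toy data are simulated.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z towards zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, lam, weights=None, n_sweeps=200):
    """Coordinate descent for (1/n)||y - X b||^2 + lam * sum_j w_j |b_j|.

    An infinite weight excludes the corresponding column altogether.
    """
    n, p = len(X), len(X[0])
    w = [1.0] * p if weights is None else weights
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta for the starting value beta = 0
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if w[j] == float("inf") or col_ss[j] == 0.0:
                continue
            # Covariance of column j with the partial residual (beta_j removed).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * beta[j]
            new_bj = soft_threshold(rho, lam * w[j] / 2.0) / col_ss[j]
            step = new_bj - beta[j]
            if step != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * step
                beta[j] = new_bj
    return beta


def adaptive_lasso(X, y, lam_init, lam):
    """Two-stage estimator: Lasso first stage, then a re-weighted Lasso."""
    init = weighted_lasso(X, y, lam_init)
    # Weights are inversely proportional to the first-stage estimates;
    # variables dropped in the first stage stay excluded.
    weights = [1.0 / abs(b) if b != 0.0 else float("inf") for b in init]
    return weighted_lasso(X, y, lam, weights)


# Simulated toy data (hypothetical): 2 relevant and 3 irrelevant regressors.
rng = random.Random(1)
true_beta = [1.0, 0.0, 0.0, -0.5, 0.0]
X = [[rng.gauss(0.0, 1.0) for _ in true_beta] for _ in range(400)]
y = [sum(x_j * b_j for x_j, b_j in zip(row, true_beta)) + rng.gauss(0.0, 0.1)
     for row in X]

beta_lasso = weighted_lasso(X, y, lam=0.1)
beta_adaptive = adaptive_lasso(X, y, lam_init=0.1, lam=0.1)
print(beta_lasso)
print(beta_adaptive)
```

Using the Lasso as the first stage means that any variable screened out there receives an infinite weight, so the second stage can only refine, never enlarge, the selected set.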

Data Sparsity
It is important to highlight the role played by the assumption of data sparsity connected to the use of the (adaptive) Lasso. Data sparsity is generally an untestable assumption and we consider it only a rough although reasonable approximation of reality. According to Zhang et al. [34] the concept of exact sparsity can be relaxed while still maintaining the same rate of convergence of the Lasso estimator to the true coefficients. They define that a model is sparse if most coefficients are small, in the sense that the sum of their absolute values is below a certain level. Under this general sparsity assumption, it is no longer sensible to select exactly the set of nonzero coefficients. Therefore, in cases where the exact selection consistency is unattainable or undesirable, the authors show that the Lasso is able to select the important variables with coefficients above a certain threshold determined by the controlled bias of the selected model. Thus, under this generalized sparsity concept, the (adaptive) Lasso is able to successfully discriminate between small and large coefficients and identify with high probability the most important firm characteristics; see also Bühlmann and Van De Geer [23] for a general review of the corresponding theory.
Moreover, given that our interest focuses primarily on the predictive ability of the competing approaches, the results discussed by Greenshtein et al. [35], Bickel et al. [36], and Sirimongkolkasem and Drikvandi [37] are reassuring: they highlight the fact that assuming sparsity as an approximation of the true design of the data does not generally significantly degrade the predictive accuracy of the models in high-dimensional settings with a large number of covariates. Greenshtein et al. [35] show that under various sparsity assumptions there is "asymptotically no harm" in considering a large number of covariates (many more than observations) for prediction purposes in a linear regression model under an $\ell_1$-constrained optimization. Bickel et al. [36] provide bounds on the $\ell_p$ prediction loss, $1 \leq p \leq 2$, of the Lasso in a high-dimensional linear regression in terms of the best possible (oracle) approximation under the sparsity constraint. Finally, by comparing different shrinkage approaches in a linear regression simulation setting, Sirimongkolkasem and Drikvandi [37] show that when important covariates are associated with correlated data, the $\ell_1$ and $\ell_2$ prediction performances of the Lasso improve for both sparse and non-sparse high-dimensional settings and sometimes even outperform those of the Ridge regression. The predictive performance of the Lasso remains generally unaffected when the correlated covariates are associated with nuisance and less important variables. Given the previous evidence and the fact that the focus of the current study is on identifying the most relevant methods and firm characteristics for predicting the cross-section of expected returns in a variable selection framework, we do not report results for alternative shrinkage techniques like the Ridge. Predictive performance results using Ridge are qualitatively similar to those presented for the Lasso in Section 5.2.1 and are available from the authors upon request.

Selection and Prediction Evaluation
We apply pooled ordinary least squares as described in (4), where the t-values are based on the Driscoll and Kraay [38] robust standard errors. Next, we set a significance level for the OLS estimates as a rule for determining whether or not a coefficient counts as selected; we set the level to the literature standard of 5%. The impact of multiple testing is gauged by considering t-value corrections as presented by Harvey et al. [1]. Specifically, we use the Bonferroni and Holm adjustments, which control the family-wise error rate. Additionally, we consider the Benjamini, Hochberg and Yekutieli (BHY) adjustment, which controls the false discovery rate; we refer to Harvey et al. [1] for a more complete description of the multiple-testing adjustments. In the case of the Lasso and the adaptive Lasso, the FC selection procedure is straightforward: all nonzero coefficient estimates are considered to be selected. Here we provide estimates for the Lasso and the adaptive Lasso with the regularization strength optimized by BIC, AIC and five-fold cross-validation (CV5).
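To make the three multiple-testing corrections concrete, the sketch below computes adjusted p-values for a handful of hypothetical test results. The function names and the input p-values are illustrative assumptions. The BHY function implements the standard Benjamini-Yekutieli step-up adjustment, which may differ in minor details from the variant presented by Harvey et al. [1].

```python
def bonferroni(pvals):
    """Bonferroni: multiply every p-value by the number of tests M."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]


def holm(pvals):
    """Holm step-down: the k-th smallest p-value is scaled by (M - k + 1)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank = 0 for the smallest p-value
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def bhy(pvals):
    """Benjamini-Yekutieli step-up adjustment (false discovery rate control)."""
    m = len(pvals)
    c_m = sum(1.0 / j for j in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, i in enumerate(order):
        rank = m - pos  # 1-based rank among the ascending p-values
        running_min = min(running_min, m * c_m / rank * pvals[i])
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values of four firm characteristics.
pvals = [0.001, 0.02, 0.04, 0.30]
level = 0.05
for name, adjust in [("Bonferroni", bonferroni), ("Holm", holm), ("BHY", bhy)]:
    selected = [i for i, p in enumerate(adjust(pvals)) if p < level]
    print(name, selected)
```

In this toy example three characteristics pass the unadjusted 5% threshold, but only the first survives any of the three corrections.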
Following the variable selection, we calculate expected returns for each stock at each point in time. In this step, we evaluate two variants of each method. The first case drops all insignificant coefficients in the case of OLS and takes the relevant ones directly into consideration for the prediction. The second variant performs a post-variable selection OLS (PVSOLS).
The prediction quality is then measured based on a cross-sectional average, as proposed by Gu et al. [21]. More formally,

$$l_t = \frac{1}{N_t} \sum_{n=1}^{N_t} \left( R^e_{n,t} - \hat{R}^e_{n,t} \right)^2,$$

where $R^e_{n,t}$ and $\hat{R}^e_{n,t}$ denote the actual and the predicted excess returns over the risk-free rate, respectively. We report two metrics measuring the prediction performance: the simple time series average of $l_t$ and the model confidence set (MCS) as introduced by Hansen et al. [39]. Furthermore, we report the out-of-sample $R^2_{OS}$ following Campbell and Thompson [40].

Simulation Study
In order to analyze the suitability of the Lasso-type estimators in the previously described context, we propose a simulation of cross-sectional returns and FC. It is calibrated such that crucial properties of the cross-sectional return data are satisfied. Although the simulation setting is highly stylized and cannot be a perfect replication of the true underlying data-generating process (DGP), it is helpful for gaining insights into the method's model selection and predictive performance in finite samples under different distributional assumptions in our specific setting; see the literature review in Section 1 on what has already been proved theoretically.

Calibration
The calibration of the return-generating process is as follows:
• We set the number of priced and unpriced factors to 6 each. Even if it were theoretically assumed that unpriced factor risk should not exist, it could empirically still be present. Moreover, we assume that the factors are independent of one another and that the stock market factor explains the highest proportion of variance of all priced factors.
• Firm characteristics fall into one of the following three groups: Group 1 measures factor exposure to priced factors and group 2 to unpriced factors. Group 3 measures FC that are independent of the return-generating process. FC are potentially correlated across groups.
• The signal-to-noise ratio is relatively small, assuming a ratio implying a yearly $R^2$ of 5% (see Appendix A for details on the transformation to monthly $R^2$s). This is in line with empirically documented $R^2$s in the case of the linear model; see, for example, Lewellen [41].
• The stock market factor follows a time-varying volatility process (implying heteroskedasticity for the individual stocks over time as well).
• The simulated return series do not possess any auto-correlation.
• The return-generating process follows (1) with $\epsilon_{c,t+1} \sim N(0, \sigma_f^2) \; \forall c \in \mathcal{P} \cup \mathcal{U} \wedge c \neq 0$ and $\epsilon_{0,t+1} \sim N(0, \sigma_{0,t}^2)$. The index $c = 0$ represents the stock market return, which follows a latent volatility process $\sigma_{0,t}^2$ (see next item). $\epsilon_{c,t+1}$ is set to 0 $\forall c \in \mathcal{S}$ (spurious factors). The elements of the vector of risk premia, $\mu_c$, are drawn from $\mathrm{unif}(0.1, 3.5) \; \forall c \in \mathcal{P} \wedge c \neq 0$ and are set equal to zero $\forall c \in \mathcal{U} \cup \mathcal{S}$. We assume a stock market premium, $\mu_0$, of 5.5% per year. The risk premia are drawn only once per case and are kept constant through each simulation. The market premium is the estimate of the Fama-French market factor.
• $\sigma_{0,t}^2$, the stock market volatility, is modeled as a GARCH(1,1) process, where the estimated $\hat{\sigma}_t$ are obtained by fitting a GARCH(1,1) model to daily observed US stock market returns. The GARCH(1,1) model captures a sufficient fraction of the distributional properties observed in stock returns for our simulation study. Moreover, model performance seems reasonable compared to many less parsimonious approaches (see Hansen and Lunde [42]). Better volatility models exist, but they are beyond the scope of this paper and not of crucial relevance.
• $\eta_{n,t}$ represents the idiosyncratic stock-specific component and is drawn from $N(0, \sigma_{idio}^2)$. Bekaert et al. [43] show that aggregated idiosyncratic volatility varies over time. Despite this empirical evidence, we choose a parsimonious approach to model idiosyncratic volatility. This is mainly motivated by the statistical properties our DGP already possesses.
• $x_{n,t}$ denotes the vector of FC of stock $n$ at $t$ of length $C$. We simulate the characteristics with correlation matrix $\Sigma$, which is obtained following the simulation approach of Hardin et al. [44] and is a crucial feature of our simulation. It is important, as many FC empirically measure similar variations. The base case refers to the constant correlation structure within groups (Algorithm 1) in Hardin et al. [44]. $\Sigma$ is drawn only initially and kept constant in each specification. The empirical correlation structure described in Section 5.1 shows a handful of cases with pairwise correlations around 0.9 and many between 0.4 and 0.5. Our simulated correlation pairs reflect this, in order to investigate the impact of this difference. However, this study does not consider a correlation grouping of more than two FC, or any other forms of more involved linear dependencies.
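The latent market-variance process in the calibration list above can be illustrated with a short GARCH(1,1) simulation. This is a sketch under stated assumptions: the parameter values are hypothetical and are not the estimates obtained from US stock market returns, and the function name is illustrative. The recursion is the standard $\sigma_t^2 = \omega + \alpha \epsilon_{t-1}^2 + \beta \sigma_{t-1}^2$.

```python
import random


def simulate_garch11(omega, alpha, beta, n_periods, seed=0):
    """Simulate conditional variances and shocks of a GARCH(1,1) process.

    sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1}
    """
    if alpha + beta >= 1.0:
        raise ValueError("alpha + beta must be below 1 for a finite variance")
    rng = random.Random(seed)
    variance = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    variances, shocks = [], []
    for _ in range(n_periods):
        shock = rng.gauss(0.0, variance ** 0.5)
        variances.append(variance)
        shocks.append(shock)
        variance = omega + alpha * shock ** 2 + beta * variance
    return variances, shocks


# Hypothetical parameters (not the estimates used in the study).
variances, shocks = simulate_garch11(omega=1e-5, alpha=0.10, beta=0.85, n_periods=600)
print(min(variances), max(variances))
```

The same recursion can be used to extend an estimated variance series when a simulation requires more periods than the fitted sample provides.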
Finally, the collection of all $T$ periods of the simulations can then be stacked together in matrix $X$ and matrix $Y$. The true coefficient of interest is the vector $\mu$, which is estimated by the vector of coefficients $\hat{\gamma}$.
The specification allows flexible simulations under different assumptions, so that the sensitivity of the methods' performance to the simulation parameters can be analyzed. To this end, we run several simulation specifications, each loosening one assumption separately. In the default setting, the correlation matrix, $\Sigma$, is simulated such that we have one high (0.9) and one low (0.4) pairwise correlation between FC from each group. See Figure 1 for a visualization of one realization of the specified correlation matrix simulation.
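A strongly simplified stand-in for the correlation design is sketched below: an identity matrix in which a few disjoint pairs of FC receive a high (0.9) or low (0.4) correlation, followed by Gaussian draws via a Cholesky factor. This is not the algorithm of Hardin et al. [44]; the function names, the chosen pairs, the number of FC and the Gaussian distribution of the characteristics are all assumptions made for illustration.

```python
import math
import random


def pairwise_corr_matrix(n_fc, pair_corr):
    """Identity correlation matrix with selected pairwise correlations filled in."""
    sigma = [[1.0 if i == j else 0.0 for j in range(n_fc)] for i in range(n_fc)]
    for (i, j), rho in pair_corr.items():
        sigma[i][j] = sigma[j][i] = rho
    return sigma


def cholesky(a):
    """Lower-triangular factor of a; fails if a is not positive definite."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(d)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def draw_characteristics(low, rng):
    """One draw of correlated standard-normal characteristics."""
    z = [rng.gauss(0.0, 1.0) for _ in low]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(low))]


# Hypothetical design: FC 0 and 3 highly correlated, FC 1 and 4 weakly correlated.
sigma = pairwise_corr_matrix(6, {(0, 3): 0.9, (1, 4): 0.4})
low = cholesky(sigma)
rng = random.Random(0)
sample = [draw_characteristics(low, rng) for _ in range(2000)]
print(sample[0])
```

Because the correlated pairs are disjoint, the matrix is block diagonal with 2 x 2 blocks and therefore positive definite for any pairwise correlation strictly between -1 and 1.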

Sensitivity Analysis
The behavior of the simulated DGP crucially depends on its calibration defined above. In order to investigate the sensitivity to these choices, we define the following cases:
Case 1: Base case. Default settings.
Case 2: Small T. Default settings, T is set to 240.
Case 3: Large T. Default settings, T is set to 4200 (and N reduced to 800 to keep it computationally tractable). This specification requires a longer than available estimated GARCH(1,1) series; the missing $\sigma_{0,t}^2$ are simply simulated based on the GARCH(1,1) parameter estimates described in Appendix A. (This case is not a realistic scenario for monthly data but is insightful for applications with higher frequencies.)
Case 4: Expected returns as a function of FC instead of factor exposure. Default settings. The premium of the stock market factor is set to zero. Instead, we attach the premium to the FC directly and impose a correlation between the factor exposure and the FC of 0.9. This is in line with our second asset pricing interpretation introduced in Section 2.2.
Case 5: Small N. Default settings, N is set to 250.
Note that each simulation considers a balanced panel. As the actual data consist of an unbalanced panel, we adjust the data as described in Section 5.1. Moreover, the simulation ignores potential measurement errors in FC and assumes that they are measured without errors. Empirically, the most common FC suffering from measurement error are, as mentioned above, market betas.
Figure 1 shows the correlation for the first 25 FC, where FC 0 to 5 refer to the set of FC with positive risk premia, FC 6-11 belong to $\mathcal{U}$ and the rest to $\mathcal{S}$. For example, FC 1 (part of $\mathcal{P}$) and FC 12 (part of $\mathcal{S}$) are correlated with each other with a correlation of around 0.9. Note that the missing 75 FC are uninteresting insofar as their off-diagonal elements are close to zero.

Performance Evaluation
After each data simulation is performed, we report ten specifications covering all three methods and their respective choice variables presented above in the pooled panel framework. We collect the results of the OLS estimates, the Lasso and the adaptive Lasso. The evaluation considers two performance dimensions. First, we show the selection qualities of each method; second, we assess the simulated forecast accuracy, reflecting jointly the selection and parameter estimation qualities.
Knowing the true DGP of the simulation, we can then simply calculate the ratios of correctly classified coefficients, providing insights on type I and type II error behavior. Type II errors are calculated for FC 0-5: a type II error is a failure to reject the null hypothesis of an unpriced FC given that the factor is truly priced. Type I errors are measured for FC 6-100: a type I error is a rejection of the null hypothesis of an unpriced factor exposure given that the FC is tied to an unpriced factor (FC 6-11) or is independent of the returns (FC 12-100). The prediction evaluation follows Section 2.5 and is based on the 240 periods of simulated out-of-sample data.
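The two error ratios can be computed directly from the selection outcome of one simulation run. The following is a minimal sketch, assuming the FC 0-100 labelling used in the text; the function name and the example selection are hypothetical.

```python
import numpy as np

def selection_error_rates(selected, priced):
    """Return (type I rate, type II rate) for one selection outcome.

    selected: boolean array, True if the method keeps the FC.
    priced:   boolean array, True if the FC is truly priced in the DGP.
    """
    selected = np.asarray(selected, dtype=bool)
    priced = np.asarray(priced, dtype=bool)
    # Type II: a truly priced FC is not selected.
    type_2 = np.mean(~selected[priced])
    # Type I: an unpriced or spurious FC is selected.
    type_1 = np.mean(selected[~priced])
    return float(type_1), float(type_2)

n_fc = 101                       # FC 0-100, following the labelling in the text
priced = np.zeros(n_fc, dtype=bool)
priced[:6] = True                # FC 0-5 carry a positive premium
selected = np.zeros(n_fc, dtype=bool)
selected[[1, 2, 3, 4, 5, 6, 12]] = True   # hypothetical outcome of one run
type_1, type_2 = selection_error_rates(selected, priced)
```

In this example the method misses FC 0 (one of six priced FC) and wrongly keeps FC 6 and FC 12 (two of 95 unpriced or spurious FC).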

Case 1: Default settings
The prediction simulation results of case 1 are displayed in Figure 2. The BIC and CV5 Lasso specifications perform best considering the average p-value of the MCS; however, the differences are not large. The weakest among the methods is the standard POLS. Correcting the t-values for multiple testing yields meaningful improvements in this specification. The MSE ratios are more concentrated and only a few differences can be documented; the Bonferroni and Holm t-value adjusted predictions attain the highest values.

Case 2: Small T
Reducing the number of estimation periods increases the difference between the minimum and maximum average p-value. POLS remains the weakest prediction tool and the two Lasso-type methods still perform best, although in the opposite order. Furthermore, the p-value corrected POLS-based predictions rank among the worst-performing methods. Consistent with the wider p-value difference, the relative MSE difference increases compared to the default case. Once more, adjusting the t-values for the variable selection produces the highest MSE on average.

Case 3: Large T
A larger T leads to convergence of the prediction results, as the differences between methods in both the average p-value and the MSE narrow.

Case 4: Direct linear dependence between FC and expected returns
There is no notable difference in the order of the p-values; the two Lasso-type predictions yield the best results.

Case 5: Small N
Compared to the previous four cases, the Lasso CV5 methodology falls by some rank levels; however, the Lasso BIC remains on top. Overall, a lower number of stocks in the cross-section impacts the relative quality of POLS-based predictions negatively when compared to the default case.
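The Bonferroni and Holm adjustments referred to in the prediction results above can be sketched in a few lines. The version below works on p-values rather than t-values and uses a hypothetical set of five p-values; it illustrates the two textbook procedures and is not the authors' implementation.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject a hypothesis if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value with alpha / (m - k)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject

p_values = [0.001, 0.012, 0.02, 0.04, 0.30]  # hypothetical FC p-values
bonf = bonferroni_reject(p_values)
holm = holm_reject(p_values)
```

Holm is never more conservative than Bonferroni: here it keeps the second FC, which Bonferroni discards.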

Case 1: Default settings
The selection simulation results of case 1 are displayed in Table 1. The results reveal distinct differences between the methods applied. For the simulated stock market factor, OLS performs worst, with a type II error rate of around 50%. The Lasso and the adaptive Lasso perform far better, with error rates of around 0% and 15%, respectively, for the stock market factor (in the case of CV5). For the other, significantly less volatile factors carrying a positive risk-premium, the type II error rates are zero in the case of AIC and BIC-based selection. The exceptions are some POLS t-value adjusted estimates and all Lasso- and adaptive Lasso-selected FC based on CV5. Type I errors fall into two cases. First, for FC belonging to set U (unpriced factor FC), OLS has slight advantages over the adaptive Lasso; only the CV5-based adaptive Lasso selections perform comparably. The Lasso performs poorly, with error rates mostly above 50%. The second case of type I errors comprises spurious FC. Correlations of the uninformative FC with any of the two other FC types prove once more to be a problem for the Lasso. For these cases, the more restrictive adaptive Lasso reveals a far better selection performance than the Lasso, as the error rate is zero for all cases. Moreover, the OLS type I error rate behaves as expected, varying around the 5% significance level.

Case 2: Small T
Reducing the number of periods in the simulation reveals differences in the performance compared to the default simulation results, as Table 1 shows. First, type II error rates rise strongly for OLS estimates, whereas only a slight increase is observed for the adaptive Lasso and practically no changes are visible for the Lasso. The picture also changes when looking at type I error rates for FC in set U, where we observe a jump in error ratios for the adaptive Lasso.

Case 3: Large T
As T grows larger, the error ratios decline as expected. The only remarkable exception is found for the Lasso, where once more the correlated FC remain prone to false inference. The error rate of FC 6 is 99% and that of FC 12 reaches 84%, a strong indication that the neighborhood stability condition is likely to be violated in cases of higher correlations. Moreover, higher error ratios are also observed for cases with weaker correlations of around 0.5; see FC cases 7-9 and 12-15.

Case 4: Direct linear dependence between FC and expected returns
This specification yields a performance improvement for all methods, most apparent for OLS, where the classification error comes down from 33% to 23% compared to the default assumption.

Case 5: Small N
We observe some higher type II errors for one FC in the case of adjusted POLS selection and generally lower type I errors for the Lasso-type methods.

Finally, we briefly summarize the simulation results. We show that the adaptive Lasso is superior to OLS when type II errors are a concern. In this case, a Lasso-based selection has only negligible advantages over the adaptive Lasso. The picture changes if we want to minimize type I errors. Here we have to differentiate between two distinct scenarios. First, whenever we encounter an entirely uninformative independent variable, we show that the adaptive Lasso performs best. Second, when the independent variable is related to an unpriced risk factor of the dependent variable, an OLS approach achieves the best results. We note that correlations are a crucial driver behind these results; in particular, the Lasso has problems reaching reasonable type I error ratios when confronted with higher correlations (≈0.9). Moreover, altering the selection mechanism for the optimal λ materially affects the results. BIC is favorable over AIC in the specifications under consideration, as it reduces type I errors without suffering from an increase in type II misclassifications. BIC vs. CV exposes a tradeoff between type I and type II errors. Assigning equal weight to both error types, BIC-based estimation is the preferable tuning method. Additional robustness checks can be found in Appendix A.2.
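To make the difference between the two penalized estimators concrete, the following sketch implements a weighted Lasso by coordinate descent and obtains the adaptive Lasso by setting the penalty weights to the inverse absolute OLS coefficients. It is a minimal illustration under assumptions of our own: the objective scaling, the fixed values of λ, the exponent γ = 1 and the toy data are all hypothetical, and tuning λ by AIC, BIC or cross-validation is left out.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, setting small values to zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def weighted_lasso(x, y, lam, weights, n_iter=200):
    """Coordinate descent for (1/2n)||y - xb||^2 + lam * sum_j weights_j |b_j|."""
    n, p = x.shape
    beta = np.zeros(p)
    col_scale = (x ** 2).sum(axis=0) / n
    residual = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            residual += x[:, j] * beta[j]      # remove FC j from the current fit
            rho = x[:, j] @ residual / n
            beta[j] = soft_threshold(rho, lam * weights[j]) / col_scale[j]
            residual -= x[:, j] * beta[j]      # add the updated FC j back
    return beta

def adaptive_lasso(x, y, lam, gamma=1.0):
    """Adaptive Lasso with penalty weights 1 / |OLS coefficient|^gamma."""
    beta_ols = np.linalg.lstsq(x, y, rcond=None)[0]
    weights = 1.0 / np.abs(beta_ols) ** gamma
    return weighted_lasso(x, y, lam, weights)

# Toy data: two informative and eight uninformative characteristics.
rng = np.random.default_rng(1)
n, p = 2000, 10
x = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:2] = [0.5, -0.3]
y = x @ true_beta + rng.standard_normal(n)

lasso_beta = weighted_lasso(x, y, lam=0.05, weights=np.ones(p))
ada_beta = adaptive_lasso(x, y, lam=0.01)
```

Because weakly estimated coefficients receive large penalty weights, the adaptive version can use a smaller λ for the strong signals while still removing the uninformative columns, which is the mechanism behind its lower type I error rates for spurious FC.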

Data
Our objective is to preserve consistency as much as possible. Therefore the selection, data preparation and the notation of the description of firm characteristics generally follow the approach of Green et al. [3]. The FC data are implemented independently of Green et al. [3]. Our sample period ranges from 1974 to 2020. As in most studies, the analysis considers only CRSP stocks with share codes 10 and 11 which are traded at NYSE, AMEX or NASDAQ; see, for example, Fama and French [4]. Furthermore, we exclude stocks with missing market capitalization data and/or where book values are unavailable. Compustat data are aligned with a standard lag of six months after the fiscal year end date. For example, the data of a firm with fiscal year end date 12/31 are aligned with date 06/30, predicting monthly returns from 6/30 to 7/31. CRSP-based stock/firm characteristics, such as idiosyncratic volatility, beta, maximum return or six-month momentum, are used as of the most recent month end. For example, for the return prediction from 6/30 to 7/31, the max daily return from the period 5/31-6/30 is used. Additionally, following Green et al. [3], some selected Compustat accounting data are set to zero if not available; see Appendix A for details. In processing larger amounts of data, correcting extreme and often implausible values is mostly unavoidable. Correcting these values on a discretionary basis is not feasible; hence, winsorizing the data is a useful strategy to reduce the problem. Therefore, each FC is winsorized at the 1st and 99th percentiles at each point in time. Binary FC like divi, divo, rd and ipo are excluded from the winsorizing procedure. In the next step, missing data are replaced by the mean of the winsorized data at each point in time. Only then can the z-score standardization be applied at each calendar point.
We winsorize the return observations at the 5th and 95th percentiles at each point in time to reduce the weight of outliers in the least-squares setting; therefore, no observations are excluded because of implausible returns. Moreover, returns are only de-meaned for each period and not scaled by their standard deviations. Finally, the data can be stacked and the pooled regressions applied, as each independent variable has mean zero and variance one by construction when z-scores are combined. Note that this is necessary as the Lasso requires a normalized design matrix as input, as described above.
However, differences in the selection of FC are unavoidable. This study employs only FC that do not depend on Compustat quarterly and IBES data. A detailed description of each FC included in the empirical part of this study can be found in Appendix A. Moreover, the β estimates are obtained by regressing rolling weekly stock returns on the market excess returns. The literature often employs an alternative procedure whereby stocks are ranked and sorted into portfolios according to their individual market beta; see, for example, Fama and French [4]. The betas assigned to each stock for estimating the equity market premia are then the betas of the corresponding portfolios. Portfolio beta estimates have been used instead of individual stock betas to reduce potential errors-in-variables issues in the second-stage regression. However, Ang et al. [45] cast doubt on whether portfolio betas are optimal due to the loss of dispersion in individual betas. More details about the specific CRSP and Compustat data and the corresponding data alignment process can be found in Appendix A. The returns used in the prediction regression are the CRSP returns (RET) adjusted by the provided CRSP delisting return (DLRET). Additionally, and for verification purposes, we benchmark our data for selected FC against the FC portfolio returns provided by Kenneth French's Data Library. We find satisfactory $R^2$s, ranging from above 0.99 down to about 0.9 for cases where the FC definition of the benchmark data deviates slightly from the one presented in Green et al. [3]. Furthermore, we follow Fama and French [46] for the size classification, where large-cap stocks are the 1000 stocks with the highest market capitalization, mid-cap stocks rank 1001-2000 and small-cap stocks comprise all stocks with rank >2000.
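The order of the preprocessing steps matters, so a small sketch for a single FC at a single point in time may help. The function name, the use of the population standard deviation and the example values are assumptions for illustration; binary FC would bypass the winsorizing step.

```python
import numpy as np

def prepare_cross_section(values, lower=1.0, upper=99.0):
    """Winsorize, mean-impute and z-score one FC at one point in time."""
    values = np.asarray(values, dtype=float)
    # Step 1: winsorize at the chosen percentiles, ignoring missing entries.
    lo, hi = np.nanpercentile(values, [lower, upper])
    clipped = np.clip(values, lo, hi)          # missing entries stay NaN
    # Step 2: replace missing entries by the mean of the winsorized data.
    filled = np.where(np.isnan(clipped), np.nanmean(clipped), clipped)
    # Step 3: cross-sectional z-score.
    return (filled - filled.mean()) / filled.std()

raw = np.array([0.2, 0.5, np.nan, 0.9, 250.0, -3.0, 0.4, 0.7])  # hypothetical FC values
z = prepare_cross_section(raw)
```

Because the missing value is filled with the cross-sectional mean before standardization, it ends up at a z-score of zero and therefore carries no cross-sectional information into the regression.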
Finally, our industry-adjusted variables always use the 48 sectors downloaded from Kenneth French's Data Library, as the SIC 2 classification is empirically too granular since in many instances the sector group is defined by a single stock.

Empirical Results
The first subsection explains the details of how we construct the required normalized matrix X of the unbalanced panel of FC and returns. This subsection is followed by the empirical analysis of the predictability of cross-sectional stock returns. The third subsection covers the discussion of the selected FC.

Estimation Set-Up
As described above, the approach estimates the coefficients based on a pooled panel set-up. However, simply stacking the data causes two problems.
First, since we are interested in cross-sectional differences, we need to normalize each FC at each point in time to preserve the cross-sectional information. To illustrate the problem, one can think about the book-to-market ratio of single stocks, which certainly fluctuates partially based on market-wide price movements through time; standardizing along the entire panel would then implicitly change the order as time and cross-sectional information get mixed up.
Another issue that needs to be addressed is the unbalanced panel structure, as it implicitly causes the weights of each period in the regression to vary. Assuming there is no correlation between the returns and the number of stocks, we could ignore this issue, but empirically this is not the case; for example, we see that prior to the stock market peak at the beginning of the 2000s we have a much higher number of stocks with unknown return dependence. Therefore, we suggest adjusting the number of stocks in each period to the mean number of stocks per time point. This can be achieved by simply randomly drawing stocks with replacement at each point in time until we have filled the desired sample size.
Figure 3 presents the correlation structure of the FC. It shows overall only five cases with absolute correlation coefficients greater than 0.9. Even though we cannot achieve precisely the same correlation structure in our simulation, we have considered cases with correlations of around 0.9 and hence capture this feature observed in the data in our simulation as well. As we show in the simulation study in Section 3, correlations around 0.9 cause no selection issues for the adaptive Lasso; only a Lasso-based selection appears prone to misclassification. However, we want to avoid including almost identical FC. Hence, before regressing the returns on the full set of FC included in our dataset, we screen the correlations for cases with an absolute correlation greater than 0.95. In such cases, we eliminate the more recently published FC of the affected pair from our analysis. Finally, not all FC are included in our FC analysis due to data problems; we drop: cfp_ia, roic, pchemp_ia, and ipo.
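The resampling step used to equalize the period weights can be sketched as follows. This is one possible reading of the procedure, in which every period is redrawn with replacement to the mean cross-section size; the function names and the toy panel are hypothetical.

```python
import numpy as np

def resample_period(stock_ids, target_size, rng):
    """Draw stocks with replacement until the period has target_size rows."""
    return rng.choice(stock_ids, size=target_size, replace=True)

def balance_panel(panel, rng):
    """panel maps each date to an array of stock identifiers."""
    # Target: the mean number of stocks per time point.
    target = int(round(np.mean([len(ids) for ids in panel.values()])))
    return {date: resample_period(ids, target, rng) for date, ids in panel.items()}

rng = np.random.default_rng(2)
panel = {
    "2000-01": np.arange(0, 6),       # 6 stocks
    "2000-02": np.arange(100, 104),   # 4 stocks
    "2000-03": np.arange(200, 205),   # 5 stocks
}
balanced = balance_panel(panel, rng)
```

After resampling, each period contributes the same number of rows to the stacked regression, so no month is overweighted merely because more stocks were listed at the time.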

Predicting the Cross-Section of Returns
The prediction results presented in this subsection analyze the out-of-sample return predictions from 1992 to 2020. We run monthly rolling and expanding window regressions to form predictions for the upcoming month. The expanding window regression is initially fit with 15 years of observations. The rolling window specification includes three different windows, with 10, 15 and 20 years of data. Moreover, we form five different data groups classified by market capitalization: all, including all stocks available; large, consisting of the highest 1000 ranked stocks; mid, the stocks ranked between 1001 and 2000; large plus mid, the top 2000; small, considering all stocks below a market cap rank of 2000. Hence, we look in total at 20 different data groupings. Note that the results in this section represent aggregated numbers, as we show averages of the four different data estimation windows defined above.
Figure 3. We use the normalized, winsorized and pooled FC data (as used in the full sample regression) to calculate the correlation coefficients. The figure shows five extreme correlation pairs (>0.9). For example, the highest absolute correlation is measured for beta and beta_sq with a coefficient slightly less than 0.95.
Table 2 and Figure 4 present the empirical out-of-sample prediction evaluation. They reveal that the ability to forecast cross-sectional returns varies along the size dimension. Large-cap stocks are not predictable compared to a naïve benchmark for the full sample period, whereas small and micro-cap stocks are highly predictable. When focusing on $R^2$ statistics, we do not find any meaningful differences along the time dimension.
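The difference between the expanding and the rolling estimation windows can be sketched with a small index generator. Periods are represented by integer indices, and the function name and the toy sizes are hypothetical.

```python
def estimation_windows(n_periods, first_target, window=None):
    """Return (start, end, target) triples: fit on periods [start, end), predict target.

    window=None gives an expanding window; an integer gives a rolling
    window of that many periods.
    """
    triples = []
    for target in range(first_target, n_periods):
        start = 0 if window is None else max(0, target - window)
        triples.append((start, target, target))
    return triples

# Toy example with 8 periods, where the first prediction is made for period 5.
expanding = estimation_windows(n_periods=8, first_target=5)
rolling = estimation_windows(n_periods=8, first_target=5, window=3)
```

With monthly data, a 15-year window corresponds to `window=180`; the expanding scheme keeps `start` at zero so that every new month adds one period to the estimation sample.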

Performance Evaluation
Separating the predictions into different size and time buckets allows for a more granular perspective. Figure 4 displays the 12-month aggregated rolling $R^2$ for each size group. The sample including all stocks shows relatively stable predictability, with a period of poorer forecasts between 2003 and 2008. However, small and micro-cap stock returns are predictable until the end of our sample. The MCS p-value for the naïve zero-return forecast is consistently below the 5% significance level. The question of whether these predictable returns are exploitable by investors remains unanswered here. The predictability might reflect risk compensation, or existing trading frictions that simply do not allow prices to reflect all information available at the time. On the other hand, large-cap stock returns are much harder to predict during the entire sample, with the exception of a short period in the early 2000s. The prediction results highlight the importance of evaluating the quality of return predictions conditioned on size, corroborating the empirical evidence shown in the previous literature. Splitting the sample into a pre- and post-2004 period (not reported) yields almost the same results for the predictability of cross-sectional returns as the full-period analysis.
Comparing different linear estimators reveals that the empirical results for all stocks are consistent with the simulation findings. Based on the MSE and $R^2$ metrics, the Lasso specifications perform best, followed by the adaptive Lasso, whereas POLS-based methods show the weakest performance. Moreover, the five-fold cross-validation reaches the best prediction results in the case of Lasso and adaptive Lasso-based predictions. However, the differences are not statistically significant, as we can only reject the Lasso AIC case as not being part of the MCS at the 10% level.
The predictability pattern changes if we consider only large-cap stocks, as all out-of-sample $R^2$s turn negative. Overall, large-cap stocks are not predictable with the selected linear methods measured over the full sample between 1992 and 2020, as the zero return forecast achieves the lowest MSE. The prediction quality slightly improves when considering mid-cap stocks only, with two specifications achieving slightly positive $R^2$s. Doubling the cross-sectional sample size by combining large and mid-cap stocks does not improve the prediction quality in a statistically or economically meaningful way. The small-cap subset shows that small and micro-cap stocks are highly predictable, as the zero return prediction benchmark is statistically rejected at the 1% level and is not part of the MCS. Lasso-type predictions perform best in this case. Furthermore, the size sub-sample analysis underscores the importance of conditioning on size, as the predictability shown for all stocks is mostly driven by a small fraction of the market which accounts for only a negligible share of the US market capitalization. This result echoes the findings of Hou et al. [13], who show that small and micro-cap stocks contribute disproportionately to return characteristics of many published anomalies.
Furthermore, we can see that there are distinct differences in the number of selected FC. The Lasso specifications contain the largest number of FC; on average about 14-42 FC are selected to form expected returns. The more conservative adaptive Lasso selects around 5-28 FC depending on the tuning method. The POLS-based forecasts use a lower dimensional model, including on average between 1 and 18 FC. Counting the pure number of FC can be misleading, as many coefficients might be close to zero and hence the effective dimensionality could be more similar between the methods. If we compare the absolute sum of all coefficients between the different regressions, we still see meaningful differences, although at a different order of magnitude than when purely counting active variables.
Table 2. Out-of-Sample Forecast Evaluation: The MSE and $R^2$s are calculated according to (7)-(8). The MCS indicates the p-value of being part of the set that includes the best model. The monthly $R^2$s are expressed in percentage points. The MSE and ∑ abs(coef) column values are scaled by a factor of $10^3$ and $10^1$, respectively. The columns "Median #" and "Mean #" show the time-series median and mean of the number of active FC.
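A negative out-of-sample $R^2$ in this setting means that the model's forecast has a higher MSE than the naïve zero-return forecast. The sketch below uses the common definition of an out-of-sample $R^2$ benchmarked against a zero forecast; this is an assumption, since Equations (7)-(8) are not reproduced here, and the return figures are hypothetical.

```python
import numpy as np

def oos_r2_vs_zero(realized, predicted):
    """Out-of-sample R^2 relative to a forecast of zero for every stock."""
    realized = np.asarray(realized, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mse_model = np.mean((realized - predicted) ** 2)
    mse_zero = np.mean(realized ** 2)   # MSE of the naive zero-return forecast
    return 1.0 - mse_model / mse_zero

realized = np.array([0.02, -0.01, 0.03, -0.04])     # hypothetical de-meaned returns
good_forecast = np.array([0.01, -0.005, 0.01, -0.02])
r2_good = oos_r2_vs_zero(realized, good_forecast)
r2_zero = oos_r2_vs_zero(realized, np.zeros(4))
r2_bad = oos_r2_vs_zero(realized, -realized)        # forecasts with the wrong sign
```

The zero forecast scores exactly zero by construction, so any specification with a negative value is beaten by the benchmark, which is the pattern reported for large-cap stocks.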

Shrinking the Zoo of Firm Characteristics for Prediction
The analysis considers the years from 1974 to 2020 and primarily emphasizes the selection regression including all stocks. These findings are shown in Table 3. Furthermore, this table shows the regression results of conditioning on large, mid and small-cap-sized stocks. It displays the coefficient estimates of all FC determined by the Lasso, adaptive Lasso, and POLS. Given the results discussed in the previous section in terms of predictive accuracy, we focus on the Lasso procedures optimized using five-fold cross-validation. The discussion in this section stresses mostly the details of the FC selection for the adaptive Lasso and the differences from the alternative selection procedures. As we showed in the simulations, the adaptive Lasso with five-fold cross-validation is able to reduce the number of false positives significantly. As a consequence, the firm characteristics estimated to belong to the active set by the adaptive Lasso are quite reliable and give a clear indication of which characteristics carry relevant information for prediction.
Once more, the results reflect the findings of Hou et al. [13], who document the impact of micro-cap stocks. They are also in line with the work of Green et al. [3], as they emphasize a value-weighted selection exercise. We are interested in the full sample analysis, i.e., the results of the single pooled regression applied at once over all periods to obtain the set of selected FC. Note that we denote the sign of the selected FC in brackets behind each FC when mentioned in the text for the first time and also provide a brief description of the respective FC, but for the following subsection only. For the other subsections, we refer to Appendix A.

FC Selection including All Stocks
Considering all stocks for the selection analysis, the first block of columns in Table 3 contains the FC selection results for the adaptive Lasso, POLS and Lasso based on all sample periods. Note that we drop beta_sq prior to the selection regression for all samples, due to its high correlation with beta within the large and mid-cap stocks. It is not surprising that beta and beta_sq are more highly correlated for large and mid-cap stocks, as the beta of these stocks tends to be more centered around one, causing any quadratic transformation to be more correlated compared to a more dispersed beta measured among small-caps. We find that dimensionality varies starkly between the methods. The adaptive Lasso selects 21 FC, the Lasso 47, POLS with unadjusted t-values 23 and the DFDR-adjusted POLS inference 13. The adaptive Lasso selects four out of five FC associated with the Fama and French [9] five-factor model, with signs of the coefficients that are consistent with that model. Specifically, we identify beta(+) (market), bm(+) (book-to-market), agr(−) (asset growth) and gma(+) (profitability) as part of the set of active FC. Only mve (size) is missing. Many of the 21 FC selected by the adaptive Lasso are based exclusively on price information. This includes the most relevant FC, measured by the absolute size of the coefficient: mom1m(−) (short-term reversal). The prominent and classical 12-month momentum, mom12m(+), takes the spot of the second-most relevant FC. Moreover, the adaptive Lasso selects the following price-related FC: idiovol(−), last month's idiosyncratic return volatility, and maxret(−), the max daily return of the previous month.
We skip all other selected FC and refer to Table 3 instead. Furthermore, the table expresses the differences between the three selection specifications and underscores the relevance of this choice. Generally, these across-study comparisons have to be conducted with caution, as the set of FC and the sample periods can differ and hence impact the inference in an unknown way. POLS on the other hand selects a set of 23 FC, whose elements largely overlap with the set identified by the adaptive Lasso.
The table reveals that a Lasso-based procedure would suggest an even higher dimensional relation between FC and returns. We dispense with the discussion here: as the simulation results showed, the Lasso can be severely affected by false positives and generally overestimates the number of active variables. Overall, we find that a substantial number of the FC inspected do not contain relevant information for predicting returns when considered in a multivariate selection, as 41 of the included 62 FC are not picked by the adaptive Lasso.

FC Selection Conditioned on Size
The selection results conditioned on large-cap stocks only are depicted in the second set of columns in Table 3. Overall, we observe fewer active FC compared to results including all stocks; a total of 14 (vs. 21) are selected. Unsurprisingly, the top-ranked FC resemble the picture described above, where price-related information dominates the overall prediction contribution: mom1m(−), mom6m(+) and chg_mom6m(−) are among the FC with the highest absolute coefficient values. Moreover, OLS selects a sparser set than the adaptive Lasso, with only ten FC. The DFDR adjustments suggest none of the included FC are useful in predicting large-cap stocks. Given the out-of-sample prediction results presented above, this reflects the finding that large-cap stocks are not predictable with the linear methods and FC included in this work. Table 3 also presents the active set of FC conditioned on mid-sized stocks. Strikingly, price-based FC rank highest, with mom6m(+), retvol(−), chg_mom6m(−) and mom1m(−) as the top four contributors. Moreover, the full sample estimated mid-cap dimensionality is higher, reflecting once more some predictability for stocks belonging to an economically less relevant segment of the market.
Finally, we briefly discuss the results including exclusively small-cap stocks. The FC selection, as shown in the last block of Table 3, shows that price-information-driven FC are dominant, as in the mid-sized selection regression. We find consistency in which FC occupy the top ranks, considering the magnitude of short-term reversal. The three highest-ranked FC are: mom1m(−), idiovol(−) and mom12m(+). The OLS and Lasso-based selections deviate once more.

Conclusions
In this work, we propose the application of the adaptive Lasso for predicting cross-sectional stock returns. In particular, this study contributes to a better understanding of the behavior of the adaptive Lasso when applied in panel data settings mimicking the expected returns cross-section dynamics. We perform an extensive Monte Carlo simulation study in which we consider panel data scenarios of low signal-to-noise ratios including heteroscedastic, non-normal and highly cross-sectionally correlated errors. We compare the accuracy of the adaptive Lasso, Lasso and POLS based on the ability to select the truly informative FC for prediction and on the final predictive performance. The selection results show that the Lasso is inferior to its adaptive version in most specifications. In particular, a required condition, most apparent in cases of higher correlations, reveals shortcomings in the Lasso. Despite these apparent selection disadvantages for the standard Lasso, both Lasso-type methods yield improved predictive results over their classical alternatives. The adaptive Lasso appears promising compared to OLS, especially at reducing type II error ratios and controlling FC that suffer from a likely publication bias. POLS-based predictions show the least promising results.
Furthermore, in agreement with the previous literature, we show that the predictability of linear methods based on a rich zoo of firm characteristics is mostly limited to small and micro-cap stocks, the least relevant section of the stock market. Large-cap stocks are not predictable with the linear methods used in this work. Overall, the predictive differences between different linear methods are hard to measure given the potentially non-existing predictability of large-cap stocks. When emphasizing the evaluation based on the less relevant but predictable small and micro-cap segment, we find that Lasso-type predictions perform best. This empirical finding is consistent with the results of the simulation study. An adaptive Lasso selection procedure applied to 62 FC included in this paper and constructed based on US stock data from 1975 to 2020 identifies a highly dimensional return process. We show that a large part of published FC is selected when considered in a multivariate predictive analysis simultaneously; we identify 21 FC of relevance for prediction.

In order to calibrate the simulation to the desired signal-to-noise ratio, we set the volatility of the factors and idiosyncratic volatility as follows. The signal-to-noise ratio (SNR) is defined as: and it is related to the r-squared as follows: Recalling Equation (1) and ignoring the indices, we can write, with f = µ + Hence, we can define the variance of the signal as: Note that the µ is defined as a uniformly distributed random vector and the realized µ are fixed at the beginning of the simulation and can be treated as deterministic.
Furthermore, the variance of the noise can be stated as follows: The first line can be simplified according to Equation (A2) below as all terms involving $\mathrm{Cov}(x, \epsilon)$, $E(\epsilon)$ and $E(x)$ collapse to zero. $\sigma_{\epsilon,1}$ is given by the data reflecting the long-term mean of the stock market GARCH volatility. $\sigma_f$ and $\sigma_\eta$ are calibrated such that each part contributes equally to fit the desired signal-to-noise ratio ($\sigma^2_\eta = (P + R - 1)\sigma^2_f$). The values of $\sigma^2_\eta$ and $\sigma^2_f$ for the desired SNR or the desired $R^2$ then follow straightforwardly.
The variance of the product of two random variables X and Y can be expressed as follows: Moreover, we can show that the $R^2$ of a frequency with length T and a frequency comprising a fraction T/τ of it are related as follows, assuming that x does not change with the time horizon and all terms in $\sigma^2_{\text{noise}}$ are treated as returns with zero auto-correlation: and hence, Combining Equations (A3) and (A1), the following relation holds:
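For orientation, the textbook identities that this calibration relies on are given below. They are standard results and not a reconstruction of the paper's own Equations (A1)-(A3), whose exact form may differ; the product-variance identity is stated for independent $X$ and $Y$, which is an assumption of this sketch.

```latex
\begin{align*}
\mathrm{SNR} &= \frac{\sigma^2_{\text{signal}}}{\sigma^2_{\text{noise}}}, \\
R^2 &= \frac{\sigma^2_{\text{signal}}}{\sigma^2_{\text{signal}} + \sigma^2_{\text{noise}}}
     = \frac{\mathrm{SNR}}{1 + \mathrm{SNR}}, \\
\operatorname{Var}(XY) &= \operatorname{Var}(X)\operatorname{Var}(Y)
     + \operatorname{Var}(X)\,E[Y]^2 + \operatorname{Var}(Y)\,E[X]^2 .
\end{align*}
```

When both variables have mean zero, the last identity reduces to $\operatorname{Var}(XY) = \operatorname{Var}(X)\operatorname{Var}(Y)$, which matches the remark that terms involving the means collapse to zero.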

Appendix A.2. Simulation Study: Additional Robustness Checks
In addition to the sensitivity analysis provided in Section 3.2 this section exhibits additional robustness checks. The results are presented in Table A1 and Figure A1.

Appendix A.2.1. Additional Cases
Case A1: Time-constant stock market volatility. Default settings, except for the assumption on the underlying latent volatility process $\sigma^2_{t,0}$, which we fix for all t to the long-term volatility estimate of US stocks of 15.8%.
Case A2: t-distributed stock market returns. Default settings, except that $\epsilon_{t+1,0}$ is t-distributed, with the GARCH(1,1) estimation also based on t-distributed errors and an estimated number of degrees of freedom, $\nu$, of 7.14. However, differences are not overly strong.

Case A3: Time varying risk-premia
The model is identical to Model 1 except that it is equipped with a time-varying $\mu_t$ instead of the time-constant $\mu$ used before. Note that this case can collapse to the first model from a statistical perspective. For example, assume the following time-varying process: $\mu_t = \mu + \kappa_t$ with $\kappa_t \sim \mathcal{N}(0, \sigma_\kappa)$. Any variation in the risk premia would then be absorbed by the error term and could not be distinguished from it. In a panel regression, we would then simply estimate $\mu = \frac{1}{T}\sum_{t=1}^{T} \mu_t$.

Otherwise, default settings. We impose that $E[\mu_{t,0}]$ is equal to $\mu_0$. The stock market risk premium is replaced with the following AR(1) process: $\mu_{t,c} = c_c + \phi \mu_{t-1,c} + \psi_t$, where the constant is set to $c_c = \mu_c (1 - \phi)$ for the mean constraint to hold, the noise component is $\psi_t \sim \mathcal{N}(0, \sigma^2_{\mu_0})$, and $\phi = 0.2$. The size of the standard error, $\sigma_{\mu_0}$, is set such that it absorbs 1/5 of the unconditional variance of the associated stock market factor (which is reduced accordingly). The errors are independent of each other.

Case A1: Time constant stock market volatility
Table A1 reveals that assuming homoscedastic errors for the stock market factor only marginally affects the error ratios. The only difference we observe is a slight drop in the error ratio for the stock market factor for the OLS approach.
Case A2: t-distributed stock market returns
Using draws from a Student-t distribution instead of the Normal, with corresponding GARCH(1,1) volatility estimates for the market factor, does not affect the relative performance of the three methods, as Table A1 shows.
Case A3: Time varying risk-premia
Assuming an AR(1) process for the mean component of the stock market factor does not change the inference by much. We document only marginal changes for all methods in this case for FC 0.
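For intuition, the AR(1) risk-premium process of Case A3 can be sketched as below. The sketch assumes one particular reading of the calibration, namely that the unconditional variance of the premium, $\sigma^2_\psi / (1 - \phi^2)$, equals one fifth of the factor variance. The function name `simulate_ar1_premium`, the argument names, and the numerical inputs other than $\phi = 0.2$ and the 1/5 share are hypothetical.

```python
import numpy as np


def simulate_ar1_premium(n, mu_bar, phi, factor_var, share=0.2, seed=0):
    """Simulate mu_t = c + phi * mu_{t-1} + psi_t with unconditional mean mu_bar."""
    rng = np.random.default_rng(seed)
    # Assumption: the unconditional variance sigma_psi^2 / (1 - phi^2)
    # is set equal to share * factor_var.
    sigma_psi = np.sqrt(share * factor_var * (1.0 - phi ** 2))
    c = mu_bar * (1.0 - phi)  # constant that enforces E[mu_t] = mu_bar
    shocks = rng.normal(0.0, sigma_psi, size=n)
    mu = np.empty(n)
    mu[0] = mu_bar  # start at the unconditional mean
    for t in range(1, n):
        mu[t] = c + phi * mu[t - 1] + shocks[t]
    return mu


# Illustrative inputs: a 0.5% mean premium and a 4% factor volatility per period.
mu_path = simulate_ar1_premium(200_000, mu_bar=0.005, phi=0.2, factor_var=0.04 ** 2)
print(mu_path.mean(), mu_path.var())
```

With a persistence as low as $\phi = 0.2$, the simulated premium reverts to its mean quickly, which is in line with the small effect on the results reported for this case.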

Code
Our code is available upon request via github.com; please make requests by email. It is written entirely in Python 3.x and should be compatible with Windows and Unix systems. Be aware that the simulations as specified above are memory-intensive: running the main simulation requires at least 90 GB of available RAM.