Article

Machine Learning, Thematic Feature Grouping, and the Magnificent Seven: A Forecasting Analysis

Department of Finance, Strome College of Business, Old Dominion University, Norfolk, VA 23529, USA
*
Author to whom correspondence should be addressed.
J. Risk Financial Manag. 2026, 19(4), 274; https://doi.org/10.3390/jrfm19040274
Submission received: 27 February 2026 / Revised: 30 March 2026 / Accepted: 2 April 2026 / Published: 9 April 2026
(This article belongs to the Section Financial Markets)

Abstract

This study examines the predictability of monthly excess returns for the “Magnificent Seven” U.S. technology firms using machine learning and economically motivated thematic feature grouping. Framed as a focused study of the most systemically consequential equity panel in modern markets (seven firms representing over 30% of the S&P 500), the analysis confronts a small-N, large-P environment in which economically structured dimensionality reduction is essential. Using 154 firm-level characteristics categorized into 13 economic themes, we evaluate linear, penalized, tree-based, and neural network models. Unrestricted models suffer substantial overfitting and fail to outperform the historical average benchmark out-of-sample. In contrast, theme-based models generate economically meaningful and regime-dependent predictive gains. Short-Term Reversal and Seasonality exhibit stronger expansion-period predictability, while Size and Profitability perform better during recessions. Regularized linear models provide the most stable performance in limited-data environments, whereas nonlinear ensemble methods improve only when training windows are extended. The findings underscore the importance of economically structured dimensionality reduction and adaptive factor allocation in managing concentration risk among systemically important mega-cap firms.

1. Introduction

Financial markets are currently characterized by two profound, intersecting trends: the accelerating concentration of market capitalization in a handful of “mega-cap” stocks, and the overwhelming proliferation of variables purported to predict stock returns. This study investigates the nexus of these trends, confronting the challenge of high-dimensional forecasting in the very stocks that dominate the modern market landscape.
The dominance of superstar firms is unprecedented in recent decades. While Bessembinder (2018) documented that long-term stock market wealth creation is driven by a tiny fraction of high-performing stocks, the recent concentration has been exceptionally rapid. Colloquially dubbed the “Magnificent Seven” (Apple, Microsoft, Amazon, Alphabet, Meta Platforms, Tesla, and Nvidia), these firms have seen their collective weight in the S&P 500 surge from under 7% in 2010 to over 30% by 2024 (Figure 1). This concentration has profound implications. When a third of the benchmark’s value rests on seven companies, index returns become highly sensitive to their performance, challenging traditional diversification benefits and introducing significant concentration risk for institutional and retail portfolios benchmarked to the S&P 500. Furthermore, this phenomenon validates the granular hypothesis proposed by Gabaix (2011), which posits that idiosyncratic shocks to sufficiently large firms do not average out, but rather can drive aggregate economic fluctuations. This concentration is precisely why the Magnificent Seven warrant dedicated study: when seven firms represent over 30% of the world’s most widely tracked equity benchmark, understanding their return predictability is a necessity for virtually every institutional portfolio.
Concurrently, academic finance is grappling with the “factor zoo” (Cochrane, 2011). The discovery of hundreds of firm characteristics that appear to predict returns presents both an opportunity and a significant methodological challenge. While this vast information set offers the potential for sophisticated asset-pricing models, it also creates a high-dimensional problem fraught with risks of data mining, overfitting, and spurious correlations.
This paper asks whether the returns of the Magnificent Seven can be systematically forecast using firm characteristics and machine learning. Given the systemic importance of these firms, even modest predictability has large implications for portfolio allocation, risk management, and understanding the drivers of market dynamics in a tech-driven economy. We investigate this question by applying a broad suite of ML methods, ranging from penalized linear models to complex ensemble methods such as boosted trees and Random Forests, to a dataset of 154 firm-specific characteristics.
We argue that forecasting the Magnificent Seven presents a unique challenge that distinguishes it from standard asset-pricing studies. Unlike broad cross-sectional studies that benefit from large datasets (large N), or time series studies of the equity premium that typically use few predictors (small P), our study faces the dual difficulty of a small sample (N = 7), and a high-dimensional predictor set (P = 154). In this “small N, large P” context, the risk of overfitting is severely elevated. Complex machine learning models, while powerful in capturing non-linearities (Gu et al., 2020), are highly susceptible to memorizing noise in small datasets, leading to poor out-of-sample (OOS) performance.
The decision to restrict the sample to the Magnificent Seven is not an incidental data limitation but a deliberate research design choice: these firms collectively account for over 30% of the S&P 500, and their return dynamics now materially influence benchmark performance, institutional asset allocation, and systemic equity risk exposure for virtually every diversified portfolio. We are not studying a random small-N panel; we are studying the panel whose idiosyncratic behavior, through the granular channel identified by Gabaix (2011), can propagate to aggregate market fluctuations.
The existing ML asset-pricing literature has demonstrated the power of flexible, high-dimensional models when applied to large cross-sections of stocks (e.g., Gu et al., 2020). Our paper addresses a complementary but methodologically distinct challenge: what happens when this toolkit is applied to a deliberately small, economically motivated panel where the standard large-N assumptions no longer hold? We show that the answer depends critically on how the predictor space is structured. Unrestricted ML models collapse under the weight of their own flexibility in small samples, while economically organized, theme-based models recover stable and interpretable predictive signals. This finding, that economic structure in small-sample ML is load-bearing rather than merely decorative, has implications that extend well beyond the Magnificent Seven.
To confront this challenge, we introduce a structured approach that incorporates economic-domain knowledge into the ML framework. Building on the intuition of Jensen et al. (2023), we implement a thematic grouping of predictors. We classify the 154 characteristics into meaningful economic themes, such as momentum, value, size, profitability, and investment. We then train and evaluate separate ML models on each thematic subset of features, in addition to a model using the full set of characteristics.
This design serves two crucial purposes. First, it acts as a form of economically motivated regularization. By reducing the dimensionality of the feature space for each model, we aim to mitigate overfitting and improve OOS generalization. Second, it significantly enhances the interpretability of the results. Rather than producing a black-box prediction from hundreds of intermixed variables, we can assess the predictive contribution of different economic narratives. We can ask, for example, whether the returns of these mega-cap growth stocks are better explained by momentum indicators or by fundamental quality metrics.
We rigorously evaluate the performance of each model using rolling-window OOS tests from 2015 through 2023. Predictive accuracy is assessed against the historical average benchmark using the OOS R2 (measuring the reduction in mean squared forecast error) and the success ratio (directional accuracy).
Our findings reveal several noteworthy patterns. First, the difficulty of extracting predictive signals in this high-dimensional, small-sample setting is immediately apparent. An unrestricted model utilizing all 154 firm characteristics simultaneously fails to outperform the naive historical average benchmark. For instance, the unregularized OLS model suffers from severe overfitting, resulting in a disastrous OOS R2 of approximately −70%. This underscores that applying ML without structural constraints in this context is counterproductive.
Second, when we constrain the models to thematic groups, clear and economically meaningful predictive gains emerge. The source of predictive power is not spread uniformly across firm information but resides in specific factor domains. Notably, technical themes demonstrate the strongest predictive power. The Short-Term Reversal theme emerges as the most predictive, with a combined model achieving an OOS R2 of 0.76%, followed closely by seasonality (OOS R2 up to 0.61%). Fundamental themes such as quality (up to 0.43%) and profitability (up to 0.37%) also demonstrate robust predictive power. In contrast, themes like investment showed virtually no positive signal, and, critically, traditional value metrics, often found to be irrelevant for high-growth technology firms, yielded only modest gains.
Third, we find significant state-dependent predictability when conditioning model performance on NBER-defined business cycles. The efficacy of factors is highly contingent on the economic environment. During recessionary periods, fundamental themes associated with stability demonstrate superior performance. For example, the size theme exhibits significantly higher R2 during recessions (R2 of about 0.85–0.94%) than expansions (0.15%), reflecting a “flight-to-safety” effect during crises. Conversely, the Short-Term Reversal anomaly dominates during expansionary periods (R2 ~ 0.88%) but fails dramatically during recessions (R2 ~ −1.63%), indicating that typical mean-reversion patterns break down during market stress.
Finally, we delve into the interpretability of the ML models using feature importance analysis. This analysis reveals that the most reliable predictors across various models correspond to well-established equity anomalies. Key drivers include proxies for persistent investment efficiency (such as one-year CapEx growth and short-term asset growth) and strong composite quality constructs (like the Piotroski F-score and QMJ components). Furthermore, tail-risk measures, such as idiosyncratic volatility and skewness effects, add incremental explanatory power. The interpretability step reinforces confidence in our models by showing they rely on economically sensible predictors.
To summarize, our contribution is threefold. First, we present the first comprehensive assessment of machine learning-based return prediction focused exclusively on the Magnificent Seven stocks, filling a gap regarding the predictability of these systemically important firms. Second, we demonstrate that an economically guided feature grouping approach is essential for managing complexity in a small-sample, high-dimensional setting, improving both prediction and interpretability. Third, we provide new evidence on the state-dependent nature of return predictability for mega-caps, showing that the relevance of factor themes shifts dramatically across business cycles. Beyond its methodological contribution, this study has direct implications for financial risk management. The Magnificent Seven currently account for more than 30% of the S&P 500’s market capitalization, implying that the return dynamics of only a handful of firms now materially influence benchmark performance, institutional asset allocation, and systemic equity risk exposure. Understanding whether their returns are predictable, and under which economic conditions, is therefore not merely an asset-pricing question but a concentration-risk problem. If predictive signals exist and vary across macroeconomic regimes, passive and active managers alike face non-trivial exposure risks tied to factor cyclicality. Our analysis provides evidence that forecasting accuracy is state-dependent and theme-specific, suggesting that adaptive factor allocation may mitigate concentration risk in modern equity markets.
The remainder of the paper is organized as follows. Section 2 reviews the relevant literature, Section 3 describes the data and methodology, Section 4 presents the empirical results, Section 5 reports the theme-level prediction results, Section 6 provides a discussion, and Section 7 concludes.

2. Literature Review

Asset-pricing research has traditionally focused on two primary streams. The first investigates the cross-section of expected returns, aiming to understand how firm-specific characteristics influence stock performance. The second concentrates on time series forecasting, attempting to predict aggregate market movements, often referred to as the equity risk premium. This study resides at the intersection of these streams, applying a high-dimensional set of cross-sectional predictors to a specific, small panel of systemically important firms.
The cross-sectional approach has a long and rich history (e.g., Fama & French, 2008; Lewellen, 2015). Over the decades, this framework has identified hundreds of potential return predictors based on accounting data, market data, and analyst expectations. This proliferation, termed the “factor zoo,” has created significant challenges for traditional methodologies like Fama–MacBeth regressions and portfolio sorts. These methods struggle to manage the high dimensionality and collinearity among signals, making it difficult to discern which factors provide genuinely incremental predictive insight. Consequently, classical methods risk overfitting and often fail to assess predictors jointly. This raises profound concerns about data-snooping and multiple comparisons, where predictors may appear statistically significant purely by chance unless rigorous controls for false discoveries are implemented (Harvey et al., 2016).
The second traditional stream involves forecasting broad market returns using macroeconomic variables. Accurately predicting the equity premium (ERP) is of significant interest due to its implications for asset allocation, the interpretation of asset-pricing models, and the evaluation of market efficiency (Rapach et al., 2013; Spiegel, 2008). However, while some variables exhibit in-sample predictive correlation, their out-of-sample reliability is notoriously weak. A long-standing debate, initiated by Welch and Goyal (2008), centers on whether any empirical model can consistently provide a more accurate forecast of the U.S. equity premium than the simple historical average (HA) benchmark.
The difficulty of ERP prediction stems from the low signal-to-noise ratio inherent in the data, the limited availability of macroeconomic time series, and the likelihood that the relationship between the ERP and its predictors is non-linear and time-varying, often due to structural breaks and model instability (Pettenuzzo & Timmermann, 2011; Rapach et al., 2010). Despite the simplicity of the HA benchmark, researchers have explored numerous sophisticated strategies to outperform it. These attempts include imposing steady-state valuation restrictions (Campbell & Thompson, 2008), forecast combinations (Rapach et al., 2010), accounting for regime shifts (Henkel et al., 2011), incorporating technical indicators (Neely et al., 2014), and utilizing advanced techniques like ridgeless regression (Kelly et al., 2024). However, the results have been largely mixed (Ferreira & Santa-Clara, 2011). Dichtl et al. (2021) conclude that most complex attempts fail to outperform the HA benchmark out-of-sample after accounting for data snooping, echoing persistent skepticism (Welch & Goyal, 2008; Goyal et al., 2024).
The limitations of conventional linear methodologies in both cross-sectional and time series contexts have motivated the exploration of more sophisticated tools. Machine learning techniques, capable of handling high-dimensional data and capturing complex relationships without imposing strong prior model assumptions (Leippold et al., 2022; Akbari et al., 2021), have increasingly been adopted.
Early applications of ML demonstrated promise in specialized financial tasks, establishing the intuition that flexible, data-driven models could capture dynamics missed by linear models. For example, neural networks were applied to option pricing in the 1990s (Hutchinson et al., 1994; Yao et al., 2000). In the realm of credit risk, decision-tree methods found early use in predicting consumer defaults (Khandani et al., 2010; Butaru et al., 2016). Similarly, deep neural networks were utilized to forecast mortgage behaviors (Sadhwani et al., 2021), and ML also informed portfolio optimization (Heaton et al., 2017).
In recent years, burgeoning literature has applied a wide range of ML methods directly to the challenge of predicting stock returns, particularly in the cross-section. A primary goal is to analyze the multitude of firm characteristics within a holistic, multivariate framework, utilizing regularization and dimension-reduction tools.
The seminal study by Gu et al. (2020) conducted a comprehensive comparison of ML methods, finding that they substantially outperform traditional linear regressions. They concluded that the largest gains come from models that allow for non-linearities and interactions, such as neural networks.
Penalized regression methods, such as the Lasso, have been central to this effort, aiming to identify sparsity in the factor zoo. Feng et al. (2020) employed a double-selection Lasso, concluding that only a small subset of published factors are truly robust. Similarly, Freyberger et al. (2020) introduced an adaptive group-lasso approach to model expected returns as a flexible nonlinear function of firm attributes. Beyond penalized regressions, Kelly et al. (2019) introduced instrumented principal component analysis (IPCA), blending ML with equilibrium factor modeling.
Tree-based and ensemble methods, which are inherently suited to capturing interactions and regime shifts, have also shown significant promise (Moritz & Zimmermann, 2016). Bryzgalova et al. (2019) developed an “asset-pricing tree” method to endogenously group stocks into characteristic-based portfolios, achieving higher out-of-sample Sharpe ratios.
Neural networks and deep learning have also made significant inroads. Deep learning architecture has evolved to incorporate economic structure. For instance, Chen et al. (2024) developed a deep neural network asset-pricing model that imposes the no-arbitrage condition from financial theory as a learning objective, bridging machine learning with traditional asset-pricing theory. The success of ML in enhancing predictive accuracy extends beyond equities, with studies demonstrating significant forecasting gains for bond risk premia (Bianchi et al., 2021).
Despite the enthusiasm, a note of caution is warranted. The advent of machine learning does not repeal the fundamental challenges of asset-pricing data. Complex models can overfit noise if not properly regularized and validated. Implementation challenges also remain significant. Avramov et al. (2023) point out that ML-based strategies often concentrate on hard-to-arbitrage stocks, meaning their impressive gross returns may diminish substantially after accounting for real-world trading costs. Furthermore, skepticism remains regarding the superiority of ML in certain contexts. For instance, Beutel et al. (2019) reported that machine learning methods could not beat a simple logit model in predicting banking crises.
While we leverage the methodologies developed in the cross-sectional ML literature (e.g., Gu et al., 2020; Jensen et al., 2023), our focus and context are fundamentally different. The success of ML in cross-sectional studies typically relies on massive datasets (large N), where sophisticated models excel at identifying patterns across thousands of stocks. In contrast, our sample is restricted to only seven firms (small N). An alternative research design would train a large-sample cross-sectional model and then apply it to specific firm subsets; this is precisely the approach taken by Gu et al. (2020). Our study is the deliberate complement: rather than asking whether broad-market patterns transfer to mega-caps, we ask whether these firms’ own characteristics predict their returns in a real-time forecasting setting. This positions our study in a challenging methodological intersection. We face the high dimensionality (large P) of cross-sectional studies but the data scarcity (small N) more typical of time series ERP studies. As observed by Xu and Liu (2024), in truly low-signal-to-noise ratio tasks with limited data, even state-of-the-art ML may offer no improvements over simpler models. Applying complex ML models in this “small N, large P” context significantly elevates the risk of overfitting, demanding careful, economically motivated constraints, such as our thematic grouping approach, rather than relying solely on statistical regularization.

3. Data and Methodology

3.1. Data

We construct a panel dataset of monthly stock returns for seven major U.S. technology companies commonly referred to as the “Magnificent Seven” (Apple, Microsoft, Alphabet (Google), Amazon, Meta (Facebook), Tesla, and Nvidia) from January 2010 through December 2023. The dependent variable in our analysis is each stock’s log excess return. This yields 168 months of data for each of the 7 firms (≈1176 firm-month observations). Our predictor variables consist of a broad set of firm-specific characteristics drawn from the asset-pricing literature. Using Jensen et al.’s (2023) dataset, we include 154 firm-level features capturing diverse effects such as valuation ratios, momentum indicators, profitability and growth metrics, accruals, and other anomalies identified in prior studies. We deliberately exclude market-wide macroeconomic predictors in the primary analysis to focus on firm-level return drivers, although we later consider macro conditions in a conditional performance analysis.
To facilitate interpretation of the many correlated predictors, we follow the thematic groups suggested by Jensen et al. (2023) and organize firm characteristics based on their economic nature. In total, we consider 13 such themes. Grouping characteristics in this manner allows us to aggregate and compare the importance of different economic-factor domains in driving the Magnificent Seven’s returns. An important methodological consideration is the choice of taxonomy itself. We adopt the thematic classification of Jensen et al. (2023) without modification. This choice is deliberate: using a published, peer-reviewed, and widely adopted taxonomy significantly reduces researcher degrees of freedom and the risk of data-mining. We did not design groupings to maximize our results; we applied an established framework, which strengthens the reproducibility and credibility of our findings. We note that testing alternative aggregations, such as collapsing the 13 themes into fewer broad economic categories, is a natural sensitivity check that we leave for future work.
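The theme-based design described above can be sketched as follows. This is an illustrative Python outline, not the paper's code: the characteristic names are hypothetical stand-ins mapped to a toy subset of the Jensen et al. (2023) themes, and a simple OLS fit stands in for the full model suite.

```python
# Illustrative sketch of theme-based feature grouping: one model per theme.
# Characteristic names and data are hypothetical placeholders.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

theme_map = {                           # characteristic -> theme (toy subset)
    "ret_1m": "Short-Term Reversal",
    "ret_12m": "Momentum",
    "be_me": "Value",
    "gross_profit_at": "Profitability",
    "asset_growth_1y": "Investment",
}

features_by_theme = defaultdict(list)
for feat, theme in theme_map.items():
    features_by_theme[theme].append(feat)

# Toy panel: 100 firm-month observations per characteristic.
X = {feat: rng.standard_normal(100) for feat in theme_map}
y = rng.standard_normal(100)

def fit_theme_ols(cols, y):
    """Least-squares fit on one theme's feature subset (with intercept)."""
    A = np.column_stack([np.ones(len(y))] + [X[c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

theme_models = {th: fit_theme_ols(cols, y)
                for th, cols in features_by_theme.items()}
```

In the paper's setting, each of the 13 theme models would be trained and evaluated separately, in addition to the all-feature model.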
Finally, to analyze performance across business cycles, we tag each monthly observation as occurring in a Recession or Expansion using NBER recession dates. This classification enables us to later evaluate whether predictive performance varies between recessionary and non-recessionary periods.

3.2. Methodology

Forecasting framework:
We forecast monthly log excess returns for a balanced panel of the Magnificent Seven ($N = 7$) individual stocks observed over months $t = 1, \ldots, T$. Let $r_{i,t+1}$ denote the log excess return of stock $i \in \{1, \ldots, N\}$ from $t$ to $t+1$, and let $x_{i,t} \in \mathbb{R}^P$ be a vector of predictors observable at time $t$. The predictive relation is
$$r_{i,t+1} = g(x_{i,t}; \theta) + \varepsilon_{i,t+1}, \qquad E[\varepsilon_{i,t+1} \mid x_{i,t}] = 0,$$
where $g(\cdot\,; \theta)$ is an unknown (possibly nonlinear) function learned from the data and $\theta$ denotes model parameters. We estimate $g$ in a pooled panel manner: a single model is trained on the entire set of $(i, t)$ observations and then used to generate stock-level forecasts. Stock fixed effects are economically meaningful given the substantial heterogeneity in the Magnificent Seven's return profiles. Tesla's mean monthly excess return of 4.66% versus Microsoft's 1.88% (Table 1) implies that the intercepts capture systematic differences in growth trajectories, risk compensation, and investor expectations across firms. The estimated fixed effects range from approximately 0.8% to 4.2% per month, confirming that pooling without stock-level intercepts would conflate cross-sectional heterogeneity with time series predictability. To absorb persistent, stock-specific level differences without sacrificing pooling, linear models include stock fixed effects $\alpha_i$, and nonlinear models include stock identifiers as categorical features:
$$r_{i,t+1} = \alpha_i + g(\tilde{x}_{i,t}; \theta) + \varepsilon_{i,t+1},$$
where $\tilde{x}_{i,t}$ may contain standardized predictors. Missing stock-months are allowed; all evaluation metrics below account for time-varying cross-sectional counts $n_t \le N$.
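A minimal sketch of the pooled-panel estimation with stock fixed effects, using simulated data: the intercepts are chosen to mirror the 0.8% to 4.2% monthly range reported above, and the three predictors are placeholders for the 154 characteristics.

```python
# Sketch (simulated data): pooled panel with stock fixed effects alpha_i,
# implemented for a linear model via stock dummy columns.
import numpy as np

rng = np.random.default_rng(1)
N, T, P = 7, 60, 3                          # 7 stocks, 60 months, 3 toy predictors

stock_id = np.repeat(np.arange(N), T)       # panel index i for each row
X = rng.standard_normal((N * T, P))
alpha_true = np.linspace(0.008, 0.042, N)   # monthly intercepts (cf. 0.8%-4.2%)
slope_true = np.array([0.5, -0.3, 0.1])
y = alpha_true[stock_id] + X @ slope_true + 0.05 * rng.standard_normal(N * T)

# Stock fixed effects enter as dummy columns appended to the predictors.
D = np.eye(N)[stock_id]
beta, *_ = np.linalg.lstsq(np.hstack([D, X]), y, rcond=None)
alpha_hat, slope_hat = beta[:N], beta[N:]
```

For the nonlinear models, `stock_id` would instead be passed as a categorical feature, as described in the text.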
Forecasting models:
Following Gu et al. (2020) and subsequent work, we deploy a broad suite of models designed to capture linear structure, shrinkage, low-rank factors, and nonlinear interactions. The linear and penalized class comprises ordinary least squares, Ridge, Lasso, and Elastic Net. We also estimate dimension-reduction approaches (principal components regression and partial least squares) to summarize information in high-dimensional predictors. To allow for flexible nonlinearities, we use tree-based ensembles (Random Forests and gradient-boosted trees, including XGBoost and CatBoost), feed-forward neural networks with one to five hidden layers, ReLU activations, and weight decay, and two additional nonparametric benchmarks: support vector regression and k-nearest neighbors. To mitigate model uncertainty, we also consider forecast combination, using a simple arithmetic average across the $M$ models:
$$\hat{r}^{\,FC}_{i,t+1} = \frac{1}{M} \sum_{m=1}^{M} \hat{r}^{\,(m)}_{i,t+1}$$
Empirical design and training:
Our out-of-sample (OOS) evaluation follows a walk-forward expanding-window scheme. Let $\mathcal{T}_{oos} \subseteq \{1, \ldots, T-1\}$ denote the set of forecast origin months. For each $\tau \in \mathcal{T}_{oos}$, models are trained using only data up to and including month $\tau$ and then used to forecast $r_{i,\tau+1}$. To control complexity without look-ahead, we retune hyperparameters on a pre-specified subset of origin dates (e.g., annually). At each retune date, the in-sample window is split in temporal order into the earliest 85% for training and the most recent 15% for validation. For each model, we search a fixed grid of hyperparameters (see Appendix A for full hyperparameter settings), select the value minimizing validation mean squared forecast error (MSFE), refit the model on the full in-sample window with the selected hyperparameters, and keep those settings until the next retune. Random seeds are fixed to ensure replicability.
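The walk-forward scheme can be sketched as follows, assuming a toy one-predictor ridge model and an illustrative hyperparameter grid; the annual retuning and the temporal 85/15 train/validation split follow the description above.

```python
# Sketch of walk-forward expanding-window forecasting with annual
# retuning on a temporal 85/15 split. Model and grid are placeholders.
import numpy as np

rng = np.random.default_rng(2)
T = 120
x = rng.standard_normal(T)                   # predictor observable at the origin
y = 0.5 * x + 0.3 * rng.standard_normal(T)   # next-month return aligned with x

def fit_ridge(x_tr, y_tr, lam):
    """One-predictor ridge slope (no intercept)."""
    return float(x_tr @ y_tr / (x_tr @ x_tr + lam))

grid = [0.1, 1.0, 10.0]
forecasts, lam_sel = [], None
for tau in range(60, T - 1):                 # forecast origins
    if lam_sel is None or tau % 12 == 0:     # retune once a year
        split = int(0.85 * tau)              # earliest 85% train, last 15% validate
        val_mse = [np.mean((y[split:tau]
                            - fit_ridge(x[:split], y[:split], lam)
                            * x[split:tau]) ** 2)
                   for lam in grid]
        lam_sel = grid[int(np.argmin(val_mse))]
    b = fit_ridge(x[:tau], y[:tau], lam_sel)  # refit on the full window
    forecasts.append(b * x[tau])              # forecast for the next month

forecasts = np.array(forecasts)
realized = y[60:T - 1]
```

No future data ever enter the fit: the hyperparameter is chosen on the most recent in-sample months, then frozen until the next retune date.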
Preprocessing:
Each month, predictor variables are winsorized at the 1st and 99th percentiles; we then cross-sectionally rank all stock characteristics period by period and map the ranks into the [−1, 1] interval, following Kelly et al. (2019) and Freyberger et al. (2020). Targets are not standardized. Our purely historical benchmark is the stock-by-stock historical average (HA), computed using only information available at the forecast origin month. This benchmark serves both as a transparency check and as the denominator for MSFE-based comparisons.
$$\hat{r}^{\,b}_{i,t+1} = \bar{r}_{i,1:t} = \frac{1}{t} \sum_{s=1}^{t} r_{i,s}$$
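A small sketch of the monthly preprocessing and the expanding HA benchmark, assuming a single toy cross-section and a short return history; the function names are our own.

```python
# Sketch: winsorize at the 1st/99th percentiles, map cross-sectional
# ranks into [-1, 1], and compute the expanding historical-average (HA)
# benchmark without look-ahead. Data are illustrative.
import numpy as np

def rank_to_unit_interval(x):
    """Winsorize, then map cross-sectional ranks into [-1, 1]."""
    lo, hi = np.percentile(x, [1, 99])
    x = np.clip(x, lo, hi)                   # winsorize at 1st/99th pct
    ranks = np.argsort(np.argsort(x))        # 0 .. n-1
    return 2.0 * ranks / (len(x) - 1) - 1.0

def historical_average(r):
    """Expanding HA: the mean of r_1..r_t is the forecast for t+1."""
    return np.cumsum(r) / np.arange(1, len(r) + 1)

chars = np.array([3.0, -1.0, 0.5, 2.0, 100.0, -50.0, 1.0])  # one cross-section
z = rank_to_unit_interval(chars)
rets = np.array([0.02, -0.01, 0.03, 0.00])   # one stock's return history
ha = historical_average(rets)
```

The rank transform makes the models insensitive to the scale and outliers of raw characteristics, while the HA forecast at each origin uses only past months of that stock's returns.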
As an additional benchmark, we estimate a Fama–French three-factor (FF3) model for each stock using the same expanding-window scheme. At each forecast origin, we regress each stock’s excess returns on the market, SMB, and HML factors using all available data, then forecast the next-period excess return using the estimated intercept and factor loadings applied to the historical means of the three factors. This provides a natural comparison with the traditional linear factor model approach that is standard in the asset-pricing literature.
Forecast evaluation:
Let $I_t$ denote the set of stocks observed at $t$ with realized returns $r_{i,t+1}$, model predictions $\hat{r}^{\,p}_{i,t+1}$ (for a particular predictive model $p$), and benchmark predictions $\hat{r}^{\,b}_{i,t+1}$. Let $n_t = |I_t|$ and $T_{oos} = |\mathcal{T}_{oos}|$.
Out-of-sample R2 (MSFE reductions)
We report OOS $R^2$ measures that take the canonical form $1 - MSFE_p / MSFE_b$.
Panel $R^2_w$. Weight each stock-month equally within a given month by $w_{i,t} = 1/n_t$; then
$$R^2_w = 1 - \frac{\sum_{t \in \mathcal{T}_{oos}} \sum_{i \in I_t} w_{i,t} \left( r_{i,t+1} - \hat{r}^{\,p}_{i,t+1} \right)^2}{\sum_{t \in \mathcal{T}_{oos}} \sum_{i \in I_t} w_{i,t} \left( r_{i,t+1} - \hat{r}^{\,b}_{i,t+1} \right)^2}$$
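The panel R^2_w computation can be coded directly from the definition above; the two toy months illustrate a time-varying cross-section (n_t of 2 and 1).

```python
# Sketch of the panel out-of-sample R^2 with equal within-month weights
# w_{i,t} = 1/n_t. The toy data are illustrative.
import numpy as np

def panel_oos_r2(months):
    """months: list of (realized, model_fc, benchmark_fc) arrays,
    one triple per month; cross-section size may vary by month."""
    num = den = 0.0
    for r, fp, fb in months:
        w = 1.0 / len(r)                     # equal weights within the month
        num += np.sum(w * (r - fp) ** 2)     # weighted model squared errors
        den += np.sum(w * (r - fb) ** 2)     # weighted benchmark squared errors
    return 1.0 - num / den

months = [
    (np.array([0.02, -0.01]), np.array([0.01, 0.00]), np.array([0.0, 0.0])),
    (np.array([0.03]),        np.array([0.02]),       np.array([0.0])),
]
r2 = panel_oos_r2(months)
```

A positive value means the model's pooled MSFE is below the benchmark's; the HA benchmark has R^2 of exactly zero by construction.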
Directional accuracy:
$$h_t = \frac{1}{n_t} \sum_{i \in I_t} \mathbf{1}\left\{ r_{i,t+1} \, \hat{r}^{\,p}_{i,t+1} > 0 \right\}, \qquad SR = \frac{1}{T_{oos}} \sum_{t \in \mathcal{T}_{oos}} h_t$$
We compute $h^{\,b}_t$ and $SR^{\,b}$ analogously for the benchmark.
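A minimal sketch of the success ratio, averaging hit rates first within months (h_t) and then across months (SR); the data are illustrative.

```python
# Sketch of the success ratio: share of correctly signed forecasts,
# averaged within months and then across months.
import numpy as np

def success_ratio(months):
    """months: list of (realized, forecast) arrays, one pair per month."""
    hits = [np.mean(r * f > 0) for r, f in months]   # h_t per month
    return float(np.mean(hits))                      # SR across months

months = [
    (np.array([0.02, -0.01, 0.03]), np.array([0.01, 0.01, 0.02])),  # 2 of 3
    (np.array([-0.02, 0.01]),       np.array([-0.01, 0.02])),       # 2 of 2
]
sr = success_ratio(months)
```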
Statistical inference:
Inference is conducted at the month level to allow arbitrary cross-sectional correlation within each month. We report heteroskedasticity- and autocorrelation-consistent (HAC) Newey–West standard errors with lag length $L = \lfloor T_{oos}^{1/3} \rfloor$. For nested forecast comparisons against HA, we implement the Clark–West MSPE-adjusted test (Clark & West, 2007) by constructing per-observation adjusted losses, aggregating them within the month, and regressing the monthly differential on a constant; the HAC t-statistic on the intercept yields the one-sided test of improvement over the benchmark.
$$f_{i,t} = \left( e^{\,b}_{i,t} \right)^2 - \left( e^{\,p}_{i,t} \right)^2 + \left( \hat{r}^{\,b}_{i,t+1} - \hat{r}^{\,p}_{i,t+1} \right)^2$$
Aggregate within month:
$$f_t = \frac{1}{n_t} \sum_{i \in I_t} f_{i,t}$$
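The per-month Clark–West aggregation can be sketched as follows for a single cross-section; a positive monthly mean f_t favors the model over the nested benchmark.

```python
# Sketch of the Clark-West MSPE-adjusted loss, aggregated within a month:
# f = e_b^2 - e_p^2 + (fb - fp)^2. Data are illustrative.
import numpy as np

def clark_west_monthly(r, fp, fb):
    """Month-level CW adjusted loss, averaged over the cross-section."""
    eb, ep = r - fb, r - fp
    f = eb ** 2 - ep ** 2 + (fb - fp) ** 2   # per-observation adjusted loss
    return float(np.mean(f))

r  = np.array([0.02, -0.01, 0.03])   # realized returns in one month
fb = np.zeros(3)                     # nested benchmark forecasts
fp = np.array([0.015, -0.005, 0.02]) # model forecasts
f_t = clark_west_monthly(r, fp, fb)
```

In the paper's procedure, the monthly series f_t is then regressed on a constant and the HAC t-statistic on the intercept provides the one-sided test.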
We also computed the Diebold–Mariano statistic (Diebold & Mariano, 1995) using monthly squared-error differentials and the corresponding one-sided alternative.
$$d^{\,SE}_t = MSFE^{\,p}_t - MSFE^{\,b}_t$$
The DM statistic is $DM = \bar{d}^{\,SE} / \widehat{SE}\!\left( \bar{d}^{\,SE} \right)$, with the one-sided alternative $E\left[ d^{\,SE}_t \right] < 0$.
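A sketch of the DM statistic with a Bartlett-kernel Newey–West variance and the floor(T^(1/3)) lag rule used in the paper; the loss differentials here are simulated so that the model beats the benchmark on average.

```python
# Sketch: Diebold-Mariano statistic on monthly squared-error differentials
# d_t = MSFE_t^p - MSFE_t^b, with a Newey-West (Bartlett) variance.
import numpy as np

def newey_west_var(d, L):
    """HAC long-run variance of the sample mean of d, Bartlett weights."""
    d = d - d.mean()
    T = len(d)
    v = np.sum(d * d) / T                    # gamma_0
    for j in range(1, L + 1):
        gamma = np.sum(d[j:] * d[:-j]) / T   # autocovariance at lag j
        v += 2.0 * (1.0 - j / (L + 1)) * gamma
    return v / T                             # variance of the sample mean

def dm_stat(d):
    L = int(np.floor(len(d) ** (1.0 / 3.0)))  # lag rule L = floor(T^(1/3))
    return float(d.mean() / np.sqrt(newey_west_var(d, L)))

rng = np.random.default_rng(3)
d = rng.standard_normal(96) * 1e-4 - 2e-4    # model beats benchmark on average
stat = dm_stat(d)
```

A significantly negative statistic rejects equal accuracy in favor of the model under the one-sided alternative.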
For directional performance, we report the Pesaran–Timmermann test (Pesaran & Timmermann, 1992) of independence, a paired HAC t-test for monthly hit-rate improvements over HA, and a DM test on 0–1 losses constructed from hit-rate indicators. Let $p_t = h_t$. We test $H_0: E[p_t] = 0.5$ vs. $H_1: E[p_t] > 0.5$ using a HAC t-statistic for $\bar{p} - 0.5$.
For the paired test of hit-rate improvement, let $d_t = p^{\,m}_t - p^{\,b}_t$. We report the one-sided HAC t-statistic for $H_0: E[d_t] \le 0$ vs. $H_1: E[d_t] > 0$.
For the DM test on hit-rate losses, define $l^{\,m}_t = 1 - p^{\,m}_t$ and $l^{\,b}_t = 1 - p^{\,b}_t$, with loss differential $d^{\,HL}_t = l^{\,m}_t - l^{\,b}_t = p^{\,b}_t - p^{\,m}_t$.
Conditional performance (recessions vs. expansions):
To assess whether predictability is state-dependent, we compute all accuracy and inference measures separately for NBER recession months and for expansion months, aligning the recession indicator to the OOS timeline. HAC inference is applied within each subset exactly as in the full sample. Let D t R E C     0 , 1 indicate NBER recession months aligned to the OOS timeline. For any R 2 metric above, we computed the same statistic on the subsets t T o o s : D t R E C = 1 (recessions) and t T o o s : D t R E C = 0 (expansions). Inference uses the same HAC procedure within each subset.
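The regime-conditional evaluation can be sketched as follows (a minimal sketch assuming NumPy arrays aligned on the OOS timeline; the function names are illustrative):

```python
import numpy as np

def oos_r2(actual, forecast, benchmark):
    """Out-of-sample R2 relative to the benchmark forecast (HA has R2 = 0)."""
    sse_model = np.sum((actual - forecast) ** 2)
    sse_bench = np.sum((actual - benchmark) ** 2)
    return 1.0 - sse_model / sse_bench

def regime_r2(actual, forecast, benchmark, rec_indicator):
    """OOS R2 computed separately on recession and expansion subsets."""
    rec = np.asarray(rec_indicator, dtype=bool)   # D_t^REC aligned to OOS months
    return (oos_r2(actual[rec], forecast[rec], benchmark[rec]),
            oos_r2(actual[~rec], forecast[~rec], benchmark[~rec]))
```

The same subsetting applies to the HAC-based tests, which are simply re-run on each regime's monthly series.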

4. Empirical Results

Overall Model Performance (In-Sample vs. Out-of-Sample):
The suite of predictive models displays a stark contrast between in-sample fit and out-of-sample forecasting performance. In-sample, several complex machine learning models achieve impressively high R2 and directional accuracy, indicating their ability to capture historical return patterns (Table 2). For example, the CatBoost model explains about 90% of in-sample return variance (in-sample R2 ~ 90.16%) with a success ratio of ~90%. This far exceeds the in-sample fit of a simple linear OLS regression (in-sample R2 ~ 49%) or standard linear factor compression methods like PCR/PLS (in-sample R2 ~ 18–19%). Most regularized linear models (LASSO, Ridge, Elastic Net) and tree-based ensembles (Random Forest, XGBoost) capture between ~17% and 58% of in-sample variance, while simpler dimension-reduction models and shallow neural networks also show moderate in-sample fits in the high-teens to low-30% range. These high in-sample R2 values are paired with correspondingly high in-sample success ratios (often 60–70% or more), indicating the models correctly signed the return forecast in well over half of historical periods. Such results underscore that, with the full benefit of hindsight, many models, especially more flexible ones, can be tuned to closely track past return realizations.
Out-of-sample, however, the picture reverses dramatically. Once tested on new data (post-2015), most models see their R2 shrink to near-zero or negative values (Table 3), highlighting the challenges of genuine predictive power. The majority of models fail to materially beat the historical mean benchmark (which by construction has OOS R2 = 0). For instance, the Elastic Net and LASSO regressions, among the better performers overall, achieve only −0.13% and −0.19% out-of-sample R2 respectively, implying almost no reduction in forecast error variance relative to a naive forecast. A number of models even produce substantially negative OOS R2, meaning their forecasts underperform the simple historical average. The most extreme case is the unconstrained OLS with the full feature set: despite its reasonable in-sample fit, OLS suffers a disastrous OOS R2 ~ −70%, an indication that OLS badly overfit the sample noise. In contrast, models with built-in regularization (shrinkage or ensemble methods) guard against the worst overfitting, generally clustering near OOS R2 ~ 0%. For example, a Ridge regression yields about −0.5% OOS R2 and tree-based GBRT about −8%, while the best neural nets range from roughly −0.3% to −2% in aggregate tests (depending on architecture).
In terms of directional accuracy (success ratio), almost all models exceeded 50%, reflecting the fact that the market had an upward drift over the sample. The historical average (HA) strategy, which essentially always predicts the average positive return in our case, had a success ratio of 60.32%. Many models matched or modestly beat this directional accuracy. For example, gradient-boosted trees (GBRT) achieved about 60.6% correct direction predictions, the highest among individual models, while PCR and Ridge achieved around 60.2%, virtually identical to the benchmark. In contrast, the worst model directionally was k-NN (KNR), with only ~48.9% success, essentially failing to beat random guessing. Most other models clustered in the 55–60% range and had statistically significant PT statistics (e.g., OLS 53.9% with PT = 2.36, p < 0.01; Ridge 58.9% with PT = 2.79, p = 0.0026), implying they consistently picked the correct return sign more than half the time. Importantly, however, high success ratios did not always coincide with high R2. Models like PCR and PLS, for instance, posted significant directional statistics without reducing MSE by much, suggesting that much of the success ratio reflects the market's upward drift rather than genuine cross-sectional skill. Overall, the gap between high in-sample fit and flat out-of-sample performance underscores the importance of evaluating models on fresh data to avoid overestimating their true predictive power.
Performance Across Economic Regimes, Recessions vs. Expansions:
We next examine whether predictive performance differs between recessionary and expansionary market environments. The out-of-sample period includes both economic downturns and recoveries, allowing us to compute conditional R2 in recessions (R2_REC) and expansions (R2_EXP) (Table 4). A striking finding is that several models perform materially better during recessions. For example, the LASSO model, which has an overall out-of-sample R2 ~ −0.19%, manages a positive R2 ~ +0.07% during NBER-designated recessions, versus −0.20% in expansions. A similar pattern is observed for the Elastic Net (R2_REC ~ +0.05% vs. R2_EXP ~ −0.14%) and PCR (R2_REC ~ +0.04% vs. R2_EXP ~ −0.15%). In other words, the regularized linear models exhibit a slight predictive edge specifically in downturn periods, even though their average performance is flat. Notably, the PLS regression achieves about 0.49% R2 in recessions, the highest among linear models, while turning strongly negative (−1.48%) in expansions. This suggests that certain factor payoffs become easier to forecast in bad times, perhaps because risk premia and mispricing effects are more pronounced when markets are under stress. Also, the fact that regularized models manage to retain or improve their accuracy in downturns is an encouraging sign for using these models as part of a defensive forecasting toolkit.
Tree-based ensemble models show an even starker dichotomy. The Random Forest, for instance, has a modestly positive R2_REC ~ +0.61% but falls to −4.1% in expansions. In other words, the RF model adds value in recessions, but severely mispredicts in normal times. In contrast, CatBoost and XGBoost show consistently negative performance in both regimes (e.g., CatBoost R2_REC ~ −5.95%, R2_EXP ~ −6.50%), indicating these models did not find recession-specific traction. Overall, however, the evidence points to better predictability in recessions for many models. This may reflect that cross-sectional return differences widen during market downturns, creating clearer signals that models can exploit. By comparison, in long expansionary periods, stock returns may be driven more by broad market moves and investor optimism that prove harder to forecast cross-sectionally, causing model performance to falter.
Market timing performance:
We evaluate whether model-predicted one-month-ahead stock returns can time the aggregate market by aggregating binary signs across the cross-section. First, we convert predictions into a timing signal using only their sign, $s_{i,t} = \mathrm{sign}\!\left(\hat{r}_{i,t+1}\right) \in \{-1, 0, +1\}$. Next, we normalize the monthly gross exposure to one by scaling signs by the cross-sectional average absolute sign, $g_t = \frac{1}{N_t} \sum_{i=1}^{N_t} |s_{i,t}|$, which yields scaled signals:
$$\tilde{s}_{i,t} = \begin{cases} s_{i,t}/g_t & \text{if } g_t > 0, \\ 0 & \text{if } g_t = 0. \end{cases}$$
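The sign-scaling step above can be sketched as follows (assuming a T × N array of predicted returns; the function name is illustrative):

```python
import numpy as np

def timing_signals(pred_returns):
    """Sign-based timing weights with unit average gross exposure per month.

    pred_returns : (T, N) array of predicted one-month-ahead returns.
    Returns the (T, N) scaled signals s-tilde; in months with any nonzero
    sign, the cross-sectional mean absolute scaled signal equals one.
    """
    s = np.sign(pred_returns)                    # signs in {-1, 0, +1}
    g = np.abs(s).mean(axis=1, keepdims=True)    # cross-sectional mean |sign|
    with np.errstate(divide="ignore", invalid="ignore"):
        tilde = np.where(g > 0, s / g, 0.0)      # zero exposure when g_t = 0
    return tilde
```

This keeps the portfolio's monthly gross exposure fixed, so any Sharpe improvement must come from the placement of net exposure rather than leverage.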
Table 5 reports annualized performance for timing portfolios derived from each predictor. Gradient boosting (GBRT) achieves the strongest risk-adjusted performance, with a Sharpe ratio of 1.47 at a 33.05% average return and 21.56% volatility, dominating the benchmark's Sharpe of 1.25 (34.73% return, 26.67% volatility). Random Forest matches the benchmark's Sharpe (1.25) while reducing volatility to 24.84% at a 32.46% return. PCR, HA, and NN4 essentially mimic the benchmark. Linear shrinkage estimators (LASSO 1.23, ENet 1.20, Ridge 1.16) modestly underperform, while XGBoost (0.86), CatBoost (0.70), OLS (0.37), and KNR (0.03) produce weak risk-adjusted timing in this experiment, indicating that implementation details and model complexity matter for directional accuracy.
Because the strategy uses only signs and enforces unit gross exposure, improvements in Sharpe must come from allocating net exposure in the right months (timing) and/or aligning signs with idiosyncratic winners and losers (selection).

5. Theme-Level Prediction Results

Drilling down, the predictive factors can be grouped into 13 thematic categories (accruals, debt issuance, investment, low leverage, low risk, momentum, profit growth, profitability, quality, seasonality, short-term reversal, size, and value). We evaluate model performance within each theme to identify which types of predictors show the strongest out-of-sample efficacy. Many well-documented anomalies appear even in our small panel of stocks. Figure 2 summarizes the best out-of-sample R2 achieved by any model in each theme over the 2015–2023 out-of-sample period. It is immediately clear that certain themes consistently offered more predictive signal than others.
Short-term reversal emerges as the single most predictive standalone theme. Using past-month return reversals as a predictor of next-month returns, multiple models attain positive and significant out-of-sample R2. In fact, nearly every top model in this theme yields OOS R2 on the order of 0.4–0.5% (Table 6). For example, OLS, LASSO, and Elastic Net all deliver about 0.48–0.50% OOS R2 on the short-term reversal signal. Even methods like CatBoost match that with 0.49%. The best individual model is a Ridge regression (OOS R2 = 0.502%). Moreover, when all models’ forecasts are combined, the reversal theme’s performance jumps further. The combined model achieves OOS R2 ~ 0.76%, the highest of any theme. This indicates the short-term reversal effect was a robust and exploitable regularity in the 2015–2023 sample. The success ratio for reversal-based forecasts is also around 60–61% for top models, underlining the consistency of this anomaly.
Other technical price-based themes also perform strongly. Seasonality shows similarly high predictability (Table 7). A particularly successful model is a second neural network specification (NN2), which achieves OOS R2 ~ 0.61% on the seasonality theme. Regularized linear models do well here too (LASSO 0.37%, Ridge 0.34%), and even the combined forecast is 0.37%. Seasonality’s outperformance is intuitive: persistent calendar anomalies lend themselves to systematic prediction. The momentum theme yields more modest gains—on the order of 0.2–0.3% at best. A Ridge model on momentum factors reaches OOS R2 = 0.27%, and LASSO/ENet about 0.18–0.19%. These results are somewhat underwhelming given the prominence of the momentum anomaly (Table 8); it appears that plain momentum signals were less potent or already arbitraged, leaving only a small residual predictability. Nonetheless, momentum forecasts still achieved success rates near 60%, suggesting they retained some directional value if not large magnitude.
Among fundamental factor themes, the profitability and quality categories stand out as the most useful (Table 9 and Table 10). The profitability theme shows consistent out-of-sample gains across a range of models. A deep neural network (NN5) performs best here, reaching OOS R2 ~ 0.37%, notably high for a single model on a fundamental factor. Regularized linear models are not far behind: both LASSO and Elastic Net obtain about 0.28–0.29% R2, and PCR achieves ~0.28%. These results imply that profitability-related metrics did contain a stable, exploitable return signal in this era. The quality theme also generates solid predictability. Here, linear shrinkage models perform exceptionally well: Elastic Net delivers OOS R2 ~ 0.43% and LASSO 0.41%, the highest in this theme, and Ridge hits ~0.21%. These are sizeable out-of-sample effects for stock selection based purely on quality indicators. By contrast, more complex models did not improve upon the linear ones for quality. Overall, the profitability/quality complex appears to have been one of the more rewarded fundamentals, delivering ~0.3–0.4% predictive R2 and consistently having ~59–60% success rates.
The low leverage theme also shows a meaningful, though slightly smaller, performance edge (Table 11). The best model, a neural network, attains OOS R2 ~ 0.38% on this theme, and notably, a number of linear models (LASSO, Ridge, ENet) yield around 0.25–0.30% as well. This suggests that stocks with stronger balance sheets (lower debt) could be systematically forecast to outperform during the sample, a pattern consistent with a “low leverage anomaly.” Interestingly, the success ratio of the top NN model in this theme is about 61%, highest among the themes, indicating a very consistent signal.
In contrast to the above, several themes show little to no predictive power out-of-sample. The value factor theme, despite historically being a cornerstone of cross-sectional return prediction, shows minimal predictive edge (Table 12). No advanced model significantly beats the baseline using value metrics: the best OOS R2 comes from a PCR model at only 0.32%. The standard linear and penalized models yield slightly negative R2 (e.g., LASSO −0.99%, ENet −1.03%), albeit with success ratios around 60–61%.
The investment theme shows no positive signal (Table 13): virtually all models have negative out-of-sample R2 for this theme. Even the best attempt (Ridge or a shallow NN) only gets close to breakeven (e.g., Ridge −0.67%, NN4 −0.25%), and most others are deeply negative. This suggests that firms’ investment or growth characteristics did not produce a reliable return spread in our sample. Similarly, most low-risk predictors failed (Table 14), consistent with the idea that the low-volatility anomaly may have been arbitraged away or unstable in recent years. The debt issuance theme shows only a mild signal (Table 15): PCR and related models hit about 0.11–0.23% R2, suggesting a modestly predictive effect but nothing very robust.
The accruals-based models exhibit only a very modest ability to predict returns out-of-sample (Table 16). The highest OOS R2 achieved for the accruals theme is around 0.25% (Ridge regression), with most other models clustering near zero or even slightly negative R2. Consistent with this weak fit, the directional success ratio hovers around 59–60% for most accruals models, close to the ~60.3% baseline achieved by the benchmark. For example, the best model's directional accuracy (Random Forest, about 60.6%) is only slightly better than the no-skill historical average approach. This finding aligns with recent literature suggesting that the accrual anomaly is concentrated in smaller, illiquid firms, which are absent from our sample.
Models built on the profit growth theme similarly show only mild predictive power out-of-sample (Table 17). The top-performing model (Ridge) attains an OOS R2 of about 0.26% (with LASSO close behind at 0.25%). As with accruals, the directional success ratios are clustered around the high-50s to 60% range and show minimal improvement over the baseline. No model exceeds 61% directional accuracy, so profit growth adds little to the ability to predict up vs. down months. The slight edge of Ridge/LASSO implies that a bit of shrinkage helps generalize the weak signal present.
The size theme shows more encouraging predictive metrics, yet still modest in absolute terms (Table 18). Several models are able to extract a small positive signal from firm size. In out-of-sample tests, the best model (CatBoost) reaches an OOS R2 of approximately 0.25%, and a cluster of other approaches (LASSO, PCR, Ridge, Elastic Net) are close behind with R2 0.19–0.20%. Thus, unlike accruals or profit growth where only one or two models eked out a positive R2, here, multiple model types perform well, suggesting the size–return relationship is more readily captured. In terms of directional accuracy, the size theme again mirrors previous patterns: most models correctly predict return direction about 59–60% of the time, very close to the baseline 60.3%. Notably, the top performer (CatBoost) attains a 60.4% success ratio. CatBoost’s performance suggests the size–return relationship is largely monotonic and low-dimensional. The key takeaway is that the size theme does contain a small but detectable predictive signal, stronger than those of accruals or profit growth.
Economic Regime Performance by Theme:
The efficacy of each theme’s predictors often varied between recessions and expansions, reflecting that some factor payoffs are regime-dependent. We find that several fundamental themes have predictive power concentrated in recession periods, whereas technical momentum/reversal signals work better in expansions. For example, the profitability theme’s strong overall performance was disproportionately driven by bear markets: both LASSO and Elastic Net regressions on profitability factors show recession R2 of about +0.70%, far higher than their expansion R2 (+0.26%). Similarly, the size theme appears to matter most in recessions. Linear models on the size factor achieve about R2_REC ~ 0.85–0.94% in recessions but only ~0.15% in expansions. Intuitively, during market crises, small-cap stocks tend to drastically underperform large caps—a relatively predictable pattern; during expansions, size-based performance differences are much weaker, rendering the signal noisy. It is notable that this pattern appears even among the handful of largest-cap stocks in our sample. We observe a similar but weaker pattern for the value theme: while overall value had small predictive power, there is a hint of a counter-cyclical effect—e.g., a PLS model on value has R2_REC ~ 0.56% vs. R2_EXP ~ 0.10%, suggesting value strategies performed better in recessions relative to their poor showing in expansions.
By contrast, the short-term reversal anomaly flips behavior across regimes: it performs far worse in recessions and best during normal times. Our models indicate that reversal strategies which normally yield +0.5% R2 turned counterproductive in downturns. For instance, the OLS reversal model that has OOS R2 ~ 0.49% overall sees its performance drop to −5.44% in recessions (and +0.79% in expansions). LASSO and ENet show the same pattern, with R2_REC around −4% and R2_EXP +0.7%. This implies that during acute market stress, short-term losers do not revert to winners as they typically might; instead, losing stocks continue to crash, undermining reversal trades. Only in calmer periods does the classic “oversold bounce-back” behavior manifest strongly. A similar pro-cyclical behavior is observed for seasonality and momentum signals; they are largely an expansion phenomenon. In the seasonality theme, the top models (LASSO, ENet, etc.) have mildly negative R2 in recessions but healthy positive R2 in expansions (e.g., ENet: −1.62% in rec vs. +0.48% in exp). The combined seasonality model achieves R2_EXP ~ 0.40% while dipping below zero in recessions. Seasonality patterns can break down or reverse in tumultuous periods, but hold during steady times, which is consistent with these results.
The quality theme also behaved somewhat pro-cyclically: interestingly, quality signals had negative or zero R2 in recessions (the LASSO/ENet quality models drop to R2_REC ~ −0.33%) but positive in expansions (R2_EXP ~ +0.45%). This might seem counter-intuitive. One might expect quality to shine in bad times, but it could be that in severe downturns all stocks fall dramatically (regardless of quality), compressing cross-sectional differences. In contrast, during expansions, investors discriminate more between high- and low-quality firms, allowing quality factors to drive return spreads. Another possibility is that quality overlaps with growth characteristics that did well in the booming market of the late 2010s, rather than with pure defensive quality.
Looking at the combined model (an average or ensemble of all models in the theme), we observe some similar patterns (Figure 3). Short-term reversal again stands out with the highest positive difference. The combined model’s R2 in expansions is ~0.88% vs. −1.63% in recessions—a +2.5 point advantage. Low leverage and accruals also maintain notable positive gaps, indicating these themes on average predict better in expansions. On the other hand, several themes show negative differences when using the combined model. For example, momentum and profit growth have bars below zero, meaning their averaged predictions perform better during recessions (e.g., momentum’s combined R2 is ~1.44% in recessions vs. −0.58% in expansions). Quality and size also show moderate negative gaps, suggesting their predictive power is relatively higher in downturns on average.
These regime analyses suggest that factor effectiveness is state-dependent: profitability, quality, and size signals helped most during market downturns (offering defensive alpha), whereas reversal and seasonality required benign conditions to flourish. This state dependence also connects directly to the granular hypothesis motivating our study (Gabaix, 2011). If idiosyncratic shocks to the Magnificent Seven can propagate to aggregate market fluctuations, then the regime-dependent nature of their return predictability has macroeconomic implications: the factors that best predict these firms’ returns shift precisely when macroeconomic conditions change, suggesting that the transmission channel from firm-level to aggregate dynamics is itself time-varying. From a practical perspective, this means an investor could rotate factor strategies based on macro conditions by emphasizing quality/defensive factors in recessions and technical factors in expansions to potentially improve overall performance. Our evidence that certain models’ R2 swing from significantly positive in one regime to negative in the other reinforces the importance of incorporating economic conditions into factor allocation decisions.
Model Performance by Type and Complexity:
An important dimension of our results is how the different classes of prediction models (traditional linear models, tree-based ensemble learners, and neural networks) fared across the factor themes. Our analysis reveals that model complexity did not uniformly translate into better predictive performance (Figure 4 and Figure 5). In fact, simpler regularized linear models established a difficult benchmark that more complex models only occasionally exceeded, a finding likely influenced by the constraints of our dataset.
Linear models, particularly regularized regressions like LASSO, Ridge, and Elastic Net, proved to be robust and consistently strong performers. Although limited to additive linear relationships, this proved sufficient for many factor–return linkages. Their inherent resistance to overfitting allowed them to frequently outperform more complex non-linear models out-of-sample. For instance, LASSO and Elastic Net achieved the highest OOS R2 in several key themes, including quality (0.41–0.43%), short-term reversal (0.49%), and profit growth (0.19–0.25%). Ridge regression performed similarly well, leading the momentum theme (0.27%).
These results stand in stark contrast to the unpenalized ordinary least squares (OLS) model, which produced a catastrophic −70% overall OOS R2, underscoring the dangers of unregularized regression in high-dimensional settings. This suggests that while linear relationships are exploitable, controlling model complexity is paramount. Factor-reduction methods like PCR also showed utility, notably being the only model to generate a positive R2 from the value theme (0.32%). On balance, regularized linear models emerged as reliable choices that consistently captured the bulk of the available predictive signal.
Tree-based ensemble models, including Random Forest and gradient-boosted trees (XGBoost, CatBoost), presented a more mixed picture. These models achieved an extremely high fit in-sample but frequently failed to generalize, suggesting a tendency to overfit the training data. For example, XGBoost and other gradient boosting models seldom achieved positive OOS R2 and, in some cases, dramatically misfired.
However, there were exceptions where these models successfully captured non-linear interactions. CatBoost was a top performer for short-term reversal (R2 = 0.485%) and led the size theme (R2 = 0.25%), indicating that non-linearities can be important in specific contexts. The Random Forest model was generally unremarkable, though it did show a unique ability to capture a non-linear “flight-to-quality” pattern during the 2020 market crash. Overall, the flexibility of tree models proved to be a double-edged sword, often capturing noise rather than signal.
Neural networks yielded the most varied results. We tested several architectures, many of which struggled to outperform linear models and showed signs of optimization issues or over-parameterization. Despite these challenges, a few networks identified non-linear patterns that other models missed. A key success was NN2, whose 0.61% OOS R2 in the seasonality theme was the best result for that category. Similarly, our most complex network, NN5, became the top model in profitability (0.37%) and low leverage (0.38%), demonstrating that neural nets can exploit subtle, non-linear combinations of firm characteristics. On the other hand, most neural network applications resulted in marginal or negative OOS R2 across other themes. This variability suggests that neural networks are a high-risk, high-reward tool in this domain.
A critical caveat to our analysis is the relatively small sample size, spanning a decade of monthly data. This limitation is particularly relevant for complex models. Models with high parametric complexity, such as tree-based ensembles and neural networks, are notoriously “data-hungry”. With a limited number of observations, their flexibility can be a disadvantage, as they risk learning spurious patterns specific to the training sample rather than the true underlying signal.
This aligns with a broader consensus in machine learning literature. Studies employing substantially larger datasets, for instance, those spanning multiple decades or incorporating higher-frequency data, often find that neural networks and boosted trees can outperform traditional linear models. Therefore, the strong performance of parsimonious linear models in our study may be less an indictment of non-linear models in principle, and more a reflection of a data-scarce environment. It is plausible that with a more extensive dataset, the capacity of these complex models would become more apparent.
Ultimately, our findings show that regularized linear models provide the most reliable performance. Tree-based models largely underperformed, and neural networks were highly unpredictable, capable of top performance but often failing to generalize. This leads to a crucial practical insight: for many equity return prediction tasks under typical data constraints, researchers can begin with parsimonious models, that is, a sparse linear core that captures the dominant, largely additive relations. Complexity should be introduced only cautiously, via nonlinear “specialists” validated with robust cross-validation, and only where diagnostics and economic priors indicate genuine interaction effects and the data are sufficient to support them.
Feature Importance and Attribution:
Feature attribution analysis reveals that the most reliable predictors (those contributing positively across models) tend to correspond to well-established equity anomalies (Figure 6): investment, profitability/quality, and certain risk premia (volatility and skewness effects). In contrast, features that consistently undermine performance are often those prone to nonpersistent relationships or redundancy: various accruals and earnings adjustments, incidental seasonality patterns, and ultra-short-term technical indicators. There is a degree of model dependency in these outcomes as well. Different algorithms leveraged different information: the latent-factor approaches (PLS, PCR) leaned more on broad composite signals and were more easily misled by nuanced accounting metrics, whereas the penalized regressions (Ridge, ENet, LASSO) homed in on a mix of factor-like features while mitigating (but not entirely avoiding) the influence of noisy variables. The neural network stood out in that it had no single feature that outright harmed its R2, likely due to its ability to model complex interactions and effectively “learn around” useless inputs. This suggests that a sufficiently flexible model can, to some extent, immunize itself against bad predictors by simply assigning them minimal weight, whereas more rigid models suffer unless those features are removed.
Feature-level attributions reconcile several apparent tensions in the theme results and clarify which definitions within a theme are worth retaining. On the positive side, a small set of investment-efficiency proxies (e.g., one-year CapEx growth, short-term asset growth, asset-to-market equity) consistently lift out-of-sample fit when combined with quality and risk controls, even though the broad “investment” theme is weak as a standalone bet. Tail-risk measures (downside beta, idiosyncratic volatility and skewness) add incremental explanatory power despite the poor showing of naive low-volatility/low-beta tilts, indicating that asymmetry in risk is more informative than level. Composite quality constructs (Piotroski F-score, QMJ components) are robust contributors, whereas narrow or noisy accounting ratios are not. On the negative side, dense packets of overlapping accruals measures, several seasonality variants, and ultra-short-horizon technicals systematically erode OOS R2, underscoring that redundancy and fragile definitions (not the underlying concept) are the primary failure modes. The upshot is a “keep the spine, prune the fronds” message: retain a few stable, interpretable proxies within each theme and excise collinear or brittle cousins.
Robustness:
We conduct several robustness checks using the full set of firm-specific characteristics (consistent with the full model in Table 3 and Table 4) to ensure our findings hold under different specifications. First, we vary the rolling estimation window length. Extending the training window from 6 years to 8 years markedly improves out-of-sample fit for all models, especially the more complex ensemble methods (Table 19 and Table 20). For example, XGBoost’s OOS R2 jumps from about −5.8% with a 6-year window to +0.86% with an 8-year window, and CatBoost similarly rises from roughly −5.5% to +0.83%. These once-negative R2s turn positive, allowing the tree-based models to overtake regularized linear models when given additional data. This aligns with the intuition that such non-linear models are “data-hungry”—their performance benefits disproportionately from larger training samples, which mitigate overfitting.
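The rolling-window evaluation underlying these checks can be sketched as follows (an illustrative sketch using a closed-form ridge regression; the window length, penalty, and function names are our assumptions, not the paper's exact configuration):

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients; intercept left unpenalized via centering."""
    xm, ym = X.mean(axis=0), y.mean()
    Xc, yc = X - xm, y - ym
    beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return beta, ym - xm @ beta

def rolling_oos_forecasts(X, y, window, alpha=10.0):
    """One-step-ahead forecasts re-estimated on a rolling window, alongside
    the historical-average benchmark computed on the same window."""
    actuals, preds, bench = [], [], []
    for t in range(window, len(y)):
        beta, b0 = ridge_fit(X[t - window:t], y[t - window:t], alpha)
        preds.append(X[t] @ beta + b0)
        bench.append(y[t - window:t].mean())   # HA benchmark forecast
        actuals.append(y[t])
    return np.array(actuals), np.array(preds), np.array(bench)
```

Varying `window` (e.g., 72 vs. 96 months) reproduces the window-length experiment: each model is re-estimated every month on the trailing data and compared against the historical-average forecast from the same window.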
Second, we use an alternative evaluation window focusing on 2017–2023 (a more recent subsample) (Table 21 and Table 22). In this period, the sophisticated models achieve their highest accuracy: XGBoost attains an OOS R2 of ~2.7%, far above the ~0.5% of the best linear model. In other words, with more recent data (and effectively a longer overall sample), the complex algorithms finally realize a predictive edge, whereas simpler models remain more modest. This suggests that either the late-2010s market conditions offer clearer patterns for non-linear learning or that by accumulating enough observations the ensemble methods were able to unlock greater signal.
Finally, we change the hyperparameter tuning loss from MSE to MAE to check robustness against the choice of error metric (Table 23). The qualitative rankings are essentially unchanged. Regularized linear models still produce only small positive OOS R2 (on the order of 0.2–0.3%), while the more complex learners remain unable to beat the benchmark (XGBoost stays at around −4% OOS R2). Thus, using MAE in training does not materially improve the complex models’ generalization. Taken together, these tests reinforce our central message: model complexity pays off only in sufficiently large-data settings. When sample sizes are limited or the regime is unstable, simpler linear models are more reliable and stable; but as the training window grows or the evaluation period moves to a data-rich, more predictable era, flexible non-linear models begin to outperform, consistent with machine learning literature on the high-data requirements for complex models.

6. Discussion

Implications for Concentration Risk and Index Exposure
The increasing dominance of a small number of mega-cap firms raises structural concerns for benchmark-based investing. When a handful of stocks drive index-level performance, firm-level forecastability has material implications for portfolio-level risk management. Our results indicate that predictive relationships are not uniformly distributed across factor domains and are highly regime-dependent. In particular, profitability and size signals become more informative during recessions, while short-term reversal and seasonality are more effective during expansions. This suggests that passive exposure to mega-cap equities implicitly embeds cyclical factor tilts that may amplify drawdowns or reduce diversification benefits during market stress.
A central methodological insight of this study is that model complexity must be matched to data availability. In a small-N, large-P environment, unrestricted machine learning models exhibit severe overfitting, often underperforming a simple historical average benchmark. Regularized linear models provide more stable out-of-sample performance, while nonlinear ensemble methods only demonstrate advantages when the training window is extended. This finding aligns with the broader literature emphasizing that flexibility without sufficient data can degrade predictive reliability. In practical forecasting applications involving concentrated asset sets, parsimony and economically motivated dimensionality reduction appear essential.
Transaction Costs and Practical Implementation
A natural question concerns the practical implementability of the predictive signals once trading costs are considered. The Magnificent Seven are among the most liquid securities in global equity markets, with bid–ask spreads typically below 1 basis point and minimal market impact for institutional-scale positions. Since our market timing strategy (Table 5) uses only the sign of predicted returns, actual portfolio turnover is limited to months where the directional signal changes, estimated at approximately 20–40% annual turnover. At 2–5 basis points round-trip cost, this implies an annualized cost drag of roughly 4–20 basis points, which is small relative to the strategy’s annualized returns (e.g., GBRT: 33.05% with Sharpe ratio 1.47). Importantly, the concern raised by Avramov et al. (2023) that ML-based strategies often concentrate in hard-to-arbitrage, illiquid stocks does not apply to our sample. The Magnificent Seven are among the most heavily traded and closely arbitraged securities in the world. We emphasize, however, that our paper’s primary contribution is predictability evidence and regime-dependent factor analysis, not a trading strategy recommendation. Practical implementation of the signals documented here would require further engineering, including dynamic position sizing and cost-aware optimization.
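The cost arithmetic above can be sketched in a few lines. Mapping a 20–40% monthly signal-change rate to roughly two to four full round trips per year is our reading of the text’s back-of-envelope estimate; the figures below are illustrative, not a cost model:

```python
def cost_drag_bps(round_trips_per_year, round_trip_cost_bps):
    """Annualized transaction-cost drag (basis points) for a sign-based
    timing strategy that pays a full round trip on each signal flip."""
    return round_trips_per_year * round_trip_cost_bps

# Roughly 2-4 round trips per year at 2-5 bps per round trip
low = cost_drag_bps(2, 2)    # lower bound of the text's range, in bps
high = cost_drag_bps(4, 5)   # upper bound of the text's range, in bps
print(low, high)
```

Even at the upper bound, the drag is an order of magnitude smaller than the annualized returns reported in Table 5.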
Adaptive Factor Allocation
The documented regime dependence of theme-level predictability suggests potential gains from conditional factor rotation strategies. Defensive themes such as size and profitability exhibit stronger performance in recessionary environments, whereas technical themes such as short-term reversal generate greater gains during expansions. These dynamics imply that unconditional factor exposures may obscure meaningful cyclical variation. An adaptive allocation framework, whether rule-based or learned via hierarchical modeling, may enhance forecasting robustness while reducing concentration-related volatility.

7. Conclusions

This study evaluates the predictability of returns for the Magnificent Seven using machine learning models applied to 154 firm characteristics organized into economically motivated themes. Three primary findings emerge.
First, theme-based modeling provides more reliable out-of-sample performance than unrestricted, high-dimensional specifications. Economically structured feature grouping mitigates overfitting and enhances interpretability.
Second, predictive effectiveness is strongly regime-dependent. Fundamental themes such as size and profitability perform better during recessions, while short-term reversal and seasonality are more effective in expansions. These findings highlight the limitations of unconditional forecasting models.
Third, model complexity does not uniformly improve performance in data-constrained environments. Regularized linear models consistently outperform more flexible algorithms unless longer training windows are available. Parsimony and economic structure appear more valuable than algorithmic sophistication in small-sample settings.
Taken together, the results suggest that adaptive, theme-oriented forecasting frameworks may improve risk management in increasingly concentrated equity markets. Future research may explore hierarchical architectures that combine theme-specific specialists with regime-sensitive weighting mechanisms to further enhance robustness and economic interpretability. A natural extension of this work is to test whether the thematic signals identified here, particularly short-term reversal and profitability, generalize to a broader universe of large-cap or technology stocks. We note that several of our strongest thematic findings are directionally consistent with the large cross-sectional results reported by Jensen et al. (2023), which provides indirect evidence of generalizability. Extending the thematic ML framework to broader equity panels represents a logical and important next step in this research agenda.

Author Contributions

Conceptualization, M.J., M.N. and A.C.; methodology, M.J.; validation, M.N. and A.C.; formal analysis, M.J.; data curation, M.J.; writing–original draft preparation, M.J.; writing–review and editing, M.N. and A.C.; visualization, M.J.; supervision, M.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The firm-level characteristics data used in this study are available through the Wharton Research Data Services (WRDS) platform, based on the dataset of Jensen et al. (2023). NBER recession dates are publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Hyperparameter Settings for Machine Learning Models

Forecasting Model    Tuning Parameters

PLS          The number of components: {1, 2, 3, 4, 5, 6, 7, 8}
PCR          The number of components: {1, 2, 3, 4, 5, 6, 7, 8}
LASSO        L1 regularization parameter: [10^−4, 10]
ENet         Constant that multiplies the penalty terms: [10^−4, 10]
             The ENet mixing parameter: {0.2, 0.5, 0.8}
GBRT         The number of boosting stages to perform: {10, 50, 100, 150, 200}
             Maximum depth of the individual regression estimators: {2, 3, 4}
             The minimum number of samples required to be at a leaf node: {1, 3, 5}
RF           The number of trees in the forest: {10, 50, 100, 150, 200}
             Maximum depth of the individual regression estimators: {2, 3, 4}
             The minimum number of samples required to be at a leaf node: {1, 3, 5}
NN1–NN5      Dropout rate: {0.2, 0.4, 0.6, 0.8}
             Learning rate: {0.001, 0.01}
             L2 regularization parameter (also called weight decay): {0.1, 0.01, 0.001}
Ridge        L2 regularization parameter: {10^k | k = 0, 1, 2, ⋯, 20}
SVR          The kernel type to be used in SVR: {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}
             Degree of the polynomial kernel function: {2, 3, 4}
             L2 regularization parameter: {0.1, 0.5, 1}
KNR          Number of neighbors: {3, 4, 5, 6, 7}
             Weight function used in prediction: {‘distance’, ‘uniform’}
             Leaf size: {20, 30, 40}
             Power parameter for the Minkowski metric: {1, 2, 3}
XGBoost      Maximum depth of a tree: {4, 5, 6, 7, 8}
             Step size shrinkage used in update to prevent overfitting: {0.01, 0.1}
             L2 regularization parameter: {0, 0.5, 1}
             L1 regularization parameter: {0, 0.5, 1}
CatBoost     Number of iterations: {50, 100, 150, 200}
             Maximum tree depth: {4, 6, 8}
             Learning rate: {0.01, 0.1}
             L2 leaf regularization (l2_leaf_reg): {1, 3, 5}
             Random strength: {0, 1}
             Bagging temperature: {0, 1}
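As a sketch of how grids like these are used, the following minimal validation-based search over the Ridge penalty grid {10^k} uses closed-form ridge regression on synthetic data (the data, split sizes, and helper names are illustrative assumptions, not the paper’s pipeline):

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients (no intercept, for brevity)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def tune_ridge(X_tr, y_tr, X_val, y_val, alphas):
    """Pick the L2 penalty with the lowest validation MSE, mirroring a
    grid search over the appendix grid {10^k | k = 0, ..., 20}."""
    best_alpha, best_mse = None, np.inf
    for a in alphas:
        beta = ridge_fit(X_tr, y_tr, a)
        mse = np.mean((y_val - X_val @ beta) ** 2)
        if mse < best_mse:
            best_alpha, best_mse = a, mse
    return best_alpha

rng = np.random.default_rng(1)
X = rng.normal(size=(96, 20))                # 8 years of monthly data, 20 features
y = X[:, 0] * 0.1 + rng.normal(0, 0.5, 96)   # weak linear signal plus noise
alphas = [10.0 ** k for k in range(0, 21)]   # the appendix Ridge grid
best = tune_ridge(X[:72], y[:72], X[72:], y[72:], alphas)
print(best)
```

In the paper’s setting the validation split would follow the rolling/expanding window design rather than a single holdout.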

References

  1. Akbari, A., Ng, L., & Solnik, B. (2021). Drivers of economic and financial integration: A machine learning approach. Journal of Empirical Finance, 61, 82–102. [Google Scholar] [CrossRef]
  2. Avramov, D., Cheng, S., & Metzker, L. (2023). Machine learning vs. economic restrictions: Evidence from stock return predictability. Management Science, 69(5), 2587–2619. [Google Scholar] [CrossRef]
  3. Bessembinder, H. (2018). Do stocks outperform treasury bills? Journal of Financial Economics, 129(3), 440–457. [Google Scholar] [CrossRef]
  4. Beutel, J., List, S., & Von Schweinitz, G. (2019). An evaluation of early warning models for systemic banking crises: Does machine learning improve predictions? (IWH Discussion Papers No. 2/2019). Halle Institute for Economic Research (IWH). [Google Scholar]
  5. Bianchi, D., Büchner, M., & Tamoni, A. (2021). Bond risk premiums with machine learning. The Review of Financial Studies, 34(2), 1046–1089. [Google Scholar] [CrossRef]
  6. Bryzgalova, S., Pelger, M., & Zhu, J. (2019). Forest through the trees: Building cross-sections of stock returns. Available online: https://papers.ssrn.com/sol3/Papers.cfm?abstract_id=3493458 (accessed on 19 December 2019).
  7. Butaru, F., Chen, Q., Clark, B., Das, S., Lo, A. W., & Siddique, A. (2016). Risk and risk management in the credit card industry. Journal of Banking & Finance, 72, 218–239. [Google Scholar] [CrossRef]
  8. Campbell, J. Y., & Thompson, S. B. (2008). Predicting excess stock returns out of sample: Can anything beat the historical average? The Review of Financial Studies, 21(4), 1509–1531. [Google Scholar] [CrossRef]
  9. Chen, L., Pelger, M., & Zhu, J. (2024). Deep learning in asset pricing. Management Science, 70(2), 714–750. [Google Scholar] [CrossRef]
  10. Clark, T. E., & West, K. D. (2007). Approximately normal tests for equal predictive accuracy in nested models. Journal of Econometrics, 138(1), 291–311. [Google Scholar] [CrossRef]
  11. Cochrane, J. H. (2011). Presidential address: Discount rates. Journal of Finance, 66(4), 1047–1108. [Google Scholar] [CrossRef]
  12. Dichtl, H., Drobetz, W., Neuhierl, A., & Wendt, V. S. (2021). Data snooping in equity premium prediction. International Journal of Forecasting, 37(1), 72–94. [Google Scholar] [CrossRef]
  13. Diebold, F. X., & Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3), 253–263. [Google Scholar]
  14. Fama, E. F., & French, K. R. (2008). Dissecting anomalies. The Journal of Finance, 63(4), 1653–1678. [Google Scholar] [CrossRef]
  15. Feng, G., Giglio, S., & Xiu, D. (2020). Taming the factor zoo: A test of new factors. The Journal of Finance, 75(3), 1327–1370. [Google Scholar] [CrossRef]
  16. Ferreira, M. A., & Santa-Clara, P. (2011). Forecasting stock market returns: The sum of the parts is more than the whole. Journal of Financial Economics, 100(3), 514–537. [Google Scholar] [CrossRef]
  17. Freyberger, J., Neuhierl, A., & Weber, M. (2020). Dissecting characteristics nonparametrically. The Review of Financial Studies, 33(5), 2326–2377. [Google Scholar] [CrossRef]
  18. Gabaix, X. (2011). The granular origins of aggregate fluctuations. Econometrica, 79(3), 733–772. [Google Scholar] [CrossRef]
  19. Goyal, A., Welch, I., & Zafirov, A. (2024). A comprehensive 2022 look at the empirical performance of equity premium prediction. The Review of Financial Studies, 37(11), 3490–3557. [Google Scholar] [CrossRef]
  20. Gu, S., Kelly, B., & Xiu, D. (2020). Empirical asset pricing via machine learning. The Review of Financial Studies, 33(5), 2223–2273. [Google Scholar] [CrossRef]
  21. Harvey, C. R., Liu, Y., & Zhu, H. (2016). … and the cross-section of expected returns. The Review of Financial Studies, 29(1), 5–68. [Google Scholar] [CrossRef]
  22. Heaton, J. B., Polson, N. G., & Witte, J. H. (2017). Deep learning for finance: Deep portfolios. Applied Stochastic Models in Business and Industry, 33(1), 3–12. [Google Scholar] [CrossRef]
  23. Henkel, S. J., Martin, J. S., & Nardari, F. (2011). Time-varying short-horizon predictability. Journal of Financial Economics, 99(3), 560–580. [Google Scholar] [CrossRef]
  24. Hutchinson, J. M., Lo, A. W., & Poggio, T. (1994). A nonparametric approach to pricing and hedging derivative securities via learning networks. The Journal of Finance, 49(3), 851–889. [Google Scholar] [CrossRef]
  25. Jensen, T. I., Kelly, B., & Pedersen, L. H. (2023). Is there a replication crisis in finance? The Journal of Finance, 78(5), 2465–2518. [Google Scholar] [CrossRef]
  26. Kelly, B. T., Malamud, S., & Zhou, K. (2024). The virtue of complexity in return prediction. The Journal of Finance, 79(1), 459–503. [Google Scholar] [CrossRef]
  27. Kelly, B. T., Pruitt, S., & Su, Y. (2019). Characteristics are covariances: A unified model of risk and return. Journal of Financial Economics, 134(3), 501–524. [Google Scholar] [CrossRef]
  28. Khandani, A. E., Kim, A. J., & Lo, A. W. (2010). Consumer credit-risk models via machine-learning algorithms. Journal of Banking & Finance, 34(11), 2767–2787. [Google Scholar]
  29. Leippold, M., Wang, Q., & Zhou, W. (2022). Machine learning in the Chinese stock market. Journal of Financial Economics, 145(2), 64–82. [Google Scholar] [CrossRef]
  30. Lewellen, J. (2015). The cross-section of expected stock returns. Critical Finance Review, 4(1), 1–44. [Google Scholar] [CrossRef]
  31. Moritz, B., & Zimmermann, T. (2016). Tree-based conditional portfolio sorts: The relation between past and future stock returns. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2740751 (accessed on 5 March 2016).
  32. Neely, C. J., Rapach, D. E., Tu, J., & Zhou, G. (2014). Forecasting the equity risk premium: The role of technical indicators. Management Science, 60(7), 1772–1791. [Google Scholar] [CrossRef]
  33. Pesaran, M. H., & Timmermann, A. (1992). A simple nonparametric test of predictive performance. Journal of Business & Economic Statistics, 10(4), 461–465. [Google Scholar]
  34. Pettenuzzo, D., & Timmermann, A. (2011). Predictability of stock returns and asset allocation under structural breaks. Journal of Econometrics, 164(1), 60–78. [Google Scholar] [CrossRef]
  35. Rapach, D. E., Strauss, J. K., & Zhou, G. (2010). Out-of-sample equity premium prediction: Combination forecasts and links to the real economy. The Review of Financial Studies, 23(2), 821–862. [Google Scholar] [CrossRef]
  36. Rapach, D. E., Strauss, J. K., & Zhou, G. (2013). International stock return predictability: What is the role of the United States? The Journal of Finance, 68(4), 1633–1662. [Google Scholar] [CrossRef]
  37. Sadhwani, A., Giesecke, K., & Sirignano, J. (2021). Deep learning for mortgage risk. Journal of Financial Econometrics, 19(2), 313–368. [Google Scholar] [CrossRef]
  38. Spiegel, M. (2008). Forecasting the equity premium: Where we stand today. The Review of Financial Studies, 21(4), 1453–1454. [Google Scholar] [CrossRef]
  39. Welch, I., & Goyal, A. (2008). A comprehensive look at the empirical performance of equity premium prediction. The Review of Financial Studies, 21(4), 1455–1508. [Google Scholar] [CrossRef]
  40. Xu, X., & Liu, W. H. (2024). Forecasting the equity premium: Can machine learning beat the historical average? Quantitative Finance, 24(10), 1445–1461. [Google Scholar] [CrossRef]
  41. Yao, J., Li, Y., & Tan, C. L. (2000). Option price forecasting using neural networks. Omega, 28(4), 455–466. [Google Scholar] [CrossRef]
Figure 1. Magnificent Seven weight in S&P 500, 2010–2025.
Figure 2. Best out-of-sample R2 by theme.
Figure 3. Performance difference by theme (combined model).
Figure 4. Number of times a model beat/matched the benchmark on directional accuracy.
Figure 5. Number of times a model beat the benchmark on R2 (overall, recession, expansion). # denotes the count of themes in which the model outperformed the benchmark.
Figure 6. Top features that help/hurt each model. Notes: This figure identifies the ten most influential features for each of the six best-performing machine learning models referenced in Table 3. The analysis shows which predictors had the largest impact on model forecasts during the out-of-sample period of 2015:01 to 2023:12.
Table 1. Descriptive statistics of monthly returns for Magnificent Seven’s stocks (2010–2023).
Ticker    N     Mean     Std. Dev.   Min       Median    Max
AAPL      167   0.0236   0.0792     −0.1807    0.0236    0.2163
GOOG      167   0.0155   0.0724     −0.1795    0.0166    0.2175
TSLA      161   0.0466   0.1853     −0.3673    0.0110    0.8107
AMZN      167   0.0223   0.0879     −0.2375    0.0244    0.2706
META      138   0.0237   0.1114     −0.3263    0.0178    0.4791
MSFT      167   0.0188   0.0630     −0.1508    0.0203    0.1964
NVDA      167   0.0368   0.1320     −0.3203    0.0290    0.5533
Notes: This table reports descriptive statistics of monthly log excess returns for each of the Magnificent Seven stocks over the sample period January 2010 to December 2023. N denotes the number of monthly observations. Mean, standard deviation (Std. Dev.), minimum (Min), median, and maximum (Max) are expressed in decimal form.
Table 2. Model performance: in-sample metrics.
Models      In-Sample R2   Success Ratio (%)
OLS          49.14          65.23
PLS          18.89          59.30
PCR          17.97          59.84
LASSO        17.12          59.84
Ridge        17.12          59.84
ENet         17.12          59.84
GBRT         28.51          62.53
RF           30.20          61.73
XGBoost      47.72          70.35
CatBoost     90.16          90.57
SVR          58.22          70.62
KNR          32.96          61.99
NN1          17.08          59.84
NN2          17.07          59.84
NN3          16.74          59.84
NN4          17.08          59.84
NN5         −35.76          39.62
Combined     35.19          64.15
HA            0.00          51.48
Notes: The table summarizes in-sample forecasting results. All models are benchmarked against the historical average (HA). We report the in-sample R2 (percent reduction in mean squared forecast error relative to the HA) and the success ratio (the model’s accuracy in predicting the direction).
Table 3. Out-of-sample performance metrics.
Model   OOS R2    MSFE Adj.   p (MSFE)   Success Ratio (%)   PT Stat   p (PT)    Paired t-Stat   p (Paired t)   DM Stat   p (DM)
CatB     −6.48     −1.30       0.903      55.77               1.87      0.031     −2.96           0.998           2.96     0.998
ENet     −0.13      1.11       0.134      58.99               2.80      0.003     −2.17           0.984           2.17     0.985
GBRT     −7.99      0.21       0.416      60.58               3.82      0.0001     0.27           0.393          −0.27     0.393
KNR     −23.64     −0.93       0.823      48.86              −0.37      0.646     −3.80           0.9999          3.80     0.9999
LASSO    −0.19      1.09       0.138      59.13               2.88      0.0020    −1.99           0.975           1.99     0.977
NN1      −3.28     −0.41       0.658      57.49               2.51      0.0061    −2.52           0.993           2.52     0.994
NN2      −2.15     −0.38       0.646      54.52               1.52      0.065     −3.20           0.9991          3.20     0.9993
NN3      −9.12     −1.10       0.864      54.60               1.50      0.067     −3.02           0.9984          3.02     0.9987
NN4      −0.36      0.36       0.360      60.32               3.15      0.0008
NN5     −11.88     −0.68       0.750      54.76               1.23      0.109     −1.64           0.948           1.64     0.949
OLS     −70.07     −0.26       0.602      53.92               2.36      0.0091    −2.56           0.994           2.56     0.995
PCR      −0.14      0.97       0.166      60.19               3.13      0.0009    −1.00           0.840           1.00     0.841
PLS      −1.38      0.51       0.303      58.99               2.83      0.0023    −1.59           0.943           1.59     0.944
RF       −3.91     −0.44       0.669      59.92               3.14      0.0008    −0.73           0.765           0.73     0.766
Ridge    −0.48      0.72       0.236      58.86               2.79      0.0026    −2.34           0.989           2.34     0.990
SVR      −5.86     −0.51       0.696      56.72               2.21      0.0137    −1.78           0.961           1.78     0.962
XGB      −4.08      0.32       0.376      57.22               2.38      0.0086    −2.07           0.979           2.07     0.981
FF3      −0.02     −0.33       0.629      60.32               2.19      0.014      0.47           0.680
HA        0.00                            60.32
Notes: This table presents the out-of-sample forecasting performance from January 2015 to December 2023, calculated using an expanding estimation window. The historical average (HA) is used as the benchmark model for comparison. The out-of-sample R2 shows the percentage reduction in mean squared forecast error for each model relative to the HA. The success ratio indicates a model’s accuracy in predicting the direction (OOS = out-of-sample; MSFE = mean squared forecast error; PT = Pesaran–Timmermann; DM = Diebold–Mariano).
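The MSFE-adjusted statistic in the table follows Clark and West (2007) for comparing nested forecasts. A minimal numpy sketch, on synthetic forecasts rather than the paper’s data (the benchmark and model below are illustrative assumptions):

```python
import numpy as np

def clark_west_stat(y, pred_bench, pred_model):
    """Clark-West (2007) MSFE-adjusted loss differential for nested models;
    the t-statistic on its mean tests whether the larger model improves on
    the benchmark after correcting for estimation noise."""
    f = ((y - pred_bench) ** 2
         - (y - pred_model) ** 2
         + (pred_bench - pred_model) ** 2)
    return np.sqrt(len(f)) * f.mean() / f.std(ddof=1)

# Synthetic example: benchmark predicts zero, model captures half the return.
rng = np.random.default_rng(42)
y = rng.normal(0.01, 0.08, 108)   # 2015:01 to 2023:12 spans 108 months
bench = np.zeros_like(y)          # stand-in for the HA forecast
model = 0.5 * y                   # an informative (look-ahead) forecast
print(round(clark_west_stat(y, bench, model), 2))
```

A significantly positive statistic favors the model over the benchmark; a negative one indicates the model’s adjusted squared errors are larger.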
Table 4. Out-of-sample conditional R2: recession vs. expansion.
Model       OOS R2 (%)   Recession R2 (%)   Expansion R2 (%)
ENet          −0.13          0.05              −0.14
PCR           −0.14          0.04              −0.15
LASSO         −0.19          0.07              −0.20
NN4           −0.36          0.15              −0.39
Ridge         −0.48         −0.19              −0.49
PLS           −1.38          0.49              −1.48
NN2           −2.15         −1.07              −2.20
Combined      −2.44         −2.72              −2.43
NN1           −3.28         −2.39              −3.32
RF            −3.91          0.61              −4.14
XGBoost       −4.08        −27.07              −2.92
SVR           −5.86         −1.86              −6.07
CatBoost      −6.48         −5.95              −6.50
GBRT          −7.99         −2.35              −8.28
NN3           −9.12         −5.90              −9.29
NN5          −11.88         −5.92             −12.18
KNR          −23.64         −5.44             −24.56
OLS          −70.07         −9.50             −73.13
Notes: This table presents the out-of-sample forecasting performance from January 2015 to December 2023. The second and third columns show the OOS R2 during NBER-dated recessions and expansions.
Table 5. Market timing performance.
Model           Average Return (%)   Volatility (%)   Sharpe Ratio
Buy-and-hold     34.73                26.67            1.25
OLS               6.26                13.38            0.37
PLS              31.77                25.93            1.18
PCR              34.66                26.67            1.25
LASSO            33.23                26.04            1.23
Ridge            31.83                26.21            1.16
ENet             32.48                25.97            1.20
RF               32.46                24.84            1.25
GBRT             33.05                21.56            1.47
XGBoost          19.72                21.42            0.86
CatBoost         16.40                21.44            0.70
KNR               1.90                21.98            0.03
NN1              24.80                23.90            0.98
NN2              20.18                23.93            0.79
NN3              15.63                23.08            0.62
NN4              34.73                26.67            1.25
FF3              32.15                26.68            1.21
HA               34.73                26.67            1.25
Notes: This table reports the annualized performance of market timing portfolios. For each model, a monthly signal is generated from the sign of one-month-ahead return forecasts. The signals are scaled to maintain a constant unit gross exposure. Performance is compared against a passive buy-and-hold benchmark. The table presents the annualized average return, volatility, and the corresponding Sharpe Ratio.
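The timing rule described in the notes can be sketched as follows. We assume a long/flat implementation (long on a positive forecast, cash otherwise); the paper’s exact exposure scaling may differ, and the return series below is illustrative:

```python
import numpy as np

def timing_stats(returns, forecasts, periods=12):
    """Annualized mean, volatility, and Sharpe ratio of a sign-based
    timing strategy: long when the one-month-ahead forecast is positive,
    flat otherwise (a simplifying assumption, unit gross exposure)."""
    position = (forecasts > 0).astype(float)
    strat = position * returns
    mean = strat.mean() * periods
    vol = strat.std(ddof=1) * np.sqrt(periods)
    sharpe = mean / vol if vol > 0 else np.nan
    return mean, vol, sharpe

# Illustrative: alternating monthly returns and a perfectly directional forecast.
rets = np.tile([0.02, -0.01], 24)
fc = rets.copy()
mean, vol, sharpe = timing_stats(rets, fc)
print(round(mean, 3), round(sharpe, 2))
```

With a perfectly directional signal the strategy keeps all positive months and sits out negative ones, so its annualized mean and Sharpe exceed buy-and-hold on this toy series.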
Table 6. Short-term reversal conditional out-of-sample R2: recession vs. expansion.
Model       OOS R2 (%)   R2 REC (%)   R2 EXP (%)   Success Ratio (%)
Combined     0.76 **      −1.63         0.88         60.45
Ridge        0.50 ***     −2.44         0.65         60.58
ENet         0.49 **      −4.08         0.72         60.32
OLS          0.49 ***     −5.44         0.79         58.99
CatBoost     0.49 *        0.33         0.49         59.92
LASSO        0.49 **      −3.96         0.71         60.32
PCR          0.28 **      −1.31         0.36         60.32
PLS          0.18 ***     −3.74         0.38         60.45
NN1         −0.20         −0.04        −0.20         60.45
NN2         −0.31          0.13        −0.33         60.32
NN5         −0.52         −0.06        −0.54         60.32
SVR         −0.60         −6.46        −0.31         58.60
NN4         −0.62          0.14        −0.66         60.32
RF          −0.68         −0.81        −0.67         59.79
NN3         −0.69         −0.12        −0.72         60.32
XGBoost     −0.89         −3.14        −0.78         60.45
GBRT        −1.11         −4.60        −0.94         59.79
KNR        −12.86          3.15       −13.67         55.50
HA           0.00          0.00         0.00         60.32
Notes: This table reports the out-of-sample forecasting performance to evaluate the isolated predictive power of short-term reversal variables. The evaluation period is 2015:01 to 2023:12, and the historical average (HA) serves as the benchmark. The out-of-sample R2 measures the percentage reduction in mean squared forecast error relative to the HA, while the success ratio evaluates directional accuracy. ***, **, and * denote significance at the 1%, 5%, and 10% levels, respectively.
Table 7. Seasonality conditional out-of-sample R2: recession vs. expansion.
Model       OOS R2 (%)   R2 REC (%)   R2 EXP (%)   Success Ratio (%)
NN2          0.61 ***     −0.06         0.64         59.79
ENet         0.38 **      −1.62         0.48         60.58
Combined     0.37 *       −0.27         0.40         59.74
LASSO        0.37 **      −1.64         0.47         60.58
NN5          0.36         −0.56         0.41         60.32
Ridge        0.34 **      −1.64         0.44         60.45
NN1          0.06 *        0.06         0.06         60.21
PCR         −0.21         −1.66        −0.14         60.19
PLS         −0.39         −1.78        −0.32         59.39
NN4         −0.76          0.16        −0.81         60.32
CatBoost    −0.76          3.56        −0.98         59.21
OLS         −0.89         −1.79        −0.84         59.13
RF          −1.46         −1.46        −1.46         58.86
NN3         −1.49          1.33        −1.63         60.32
XGBoost     −1.68          3.36        −1.93         58.28
GBRT        −2.81          0.48        −2.98         56.08
SVR         −6.31         −5.02        −6.37         56.64
KNR        −13.13         −6.00       −13.49         56.69
HA           0.00          0.00         0.00         60.32
Notes: This table reports the out-of-sample forecasting performance to evaluate the isolated predictive power of seasonality variables. The evaluation period is 2015:01 to 2023:12, and the historical average (HA) serves as the benchmark. The out-of-sample R2 measures the percentage reduction in mean squared forecast error relative to the HA, while the success ratio evaluates directional accuracy. ***, **, and * denote significance at the 1%, 5%, and 10% levels, respectively.
Table 8. Momentum conditional out-of-sample R2: recession vs. expansion.
Model       OOS R2 (%)   Recession R2 (%)   Expansion R2 (%)   Success Ratio (%)
Ridge        0.27 *       −0.19               0.29               60.32
ENet         0.19         −0.19               0.20               60.32
LASSO        0.18         −0.19               0.20               60.32
PCR         −0.29         −0.29              −0.28               60.32
Combined    −0.48          1.44              −0.58               60.40
NN3         −0.73         −0.06              −0.76               60.32
NN4         −0.73         −0.12              −0.76               60.32
NN2         −0.92         −0.12              −0.96               60.21
NN1         −0.97          0.45              −1.05               59.83
OLS         −1.09         −0.16              −1.14               59.30
PLS         −1.29         −0.19              −1.34               59.72
RF          −2.11         −2.38              −2.10               59.84
NN5         −2.26          0.40              −2.40               60.32
CatBoost    −2.32          3.41              −2.61               58.90
GBRT        −2.63          6.14              −3.08               57.80
XGBoost     −3.37          6.37              −3.86               58.13
SVR         −4.73         −5.01              −4.72               57.00
KNR        −12.37          6.22             −13.31               54.00
HA           0.00          0.00               0.00               60.32
Notes: This table reports the out-of-sample forecasting performance to evaluate the isolated predictive power of momentum variables. The evaluation period is 2015:01 to 2023:12, and the historical average (HA) serves as the benchmark. The out-of-sample R2 measures the percentage reduction in mean squared forecast error relative to the HA, while the success ratio evaluates directional accuracy. * denotes significance at the 10% level.
Table 9. Profitability conditional out-of-sample R2: recession vs. expansion.
Model       OOS R2 (%)   Recession R2 (%)   Expansion R2 (%)   Success Ratio (%)
NN5          0.37         −0.56               0.42               61.12
LASSO        0.29 **       0.70               0.27               59.43
ENet         0.28 **       0.71               0.26               59.58
PCR          0.28 ***      0.04               0.29               59.58
Ridge        0.19 **       0.30               0.19               59.44
PLS          0.10 **       0.44               0.09               59.57
Combined    −0.26         −0.35              −0.25               59.42
NN1         −0.68          1.98              −0.82               58.77
NN4         −0.75          0.14              −0.80               60.32
OLS         −0.94          1.45              −1.07               56.67
NN2         −1.09          1.25              −1.21               59.02
NN3         −1.85          1.31              −2.01               60.32
RF          −3.55         −4.88              −3.48               57.16
SVR         −5.06        −13.31              −4.64               54.74
GBRT        −5.18         −0.70              −5.41               59.56
XGBoost     −5.27          0.21              −5.54               56.92
CatBoost    −7.48        −19.14              −6.90               57.88
KNR        −14.43         −4.04             −14.95               53.34
HA           0.00          0.00               0.00               60.32
Notes: This table reports the out-of-sample forecasting performance to evaluate the isolated predictive power of profitability variables. The evaluation period is 2015:01 to 2023:12, and the historical average (HA) serves as the benchmark. The out-of-sample R2 measures the percentage reduction in mean squared forecast error relative to the HA, while the success ratio evaluates directional accuracy. *** and ** denote significance at the 1% and 5% levels, respectively.
Table 10. Quality conditional out-of-sample R2: recession vs. expansion.
Model       OOS R2 (%)   R2 REC (%)   R2 EXP (%)   Success Ratio (%)
ENet         0.43 *       −0.36         0.47         60.32
LASSO        0.41 *       −0.33         0.45         60.32
Ridge        0.21         −0.18         0.23         60.32
CatBoost    −0.42          0.38        −0.46         56.75
RF          −0.47          0.99        −0.54         59.66
PLS         −0.56         −0.16        −0.58         60.19
Combined    −0.65          0.66        −0.72         59.26
NN3         −0.66         −0.12        −0.69         60.32
PCR         −0.83         −0.27        −0.86         59.79
NN4         −1.03         −0.06        −1.08         60.32
NN2         −2.09         −0.12        −2.19         58.73
OLS         −2.18         −2.43        −2.16         58.73
NN1         −2.51          0.10        −2.64         59.92
GBRT        −3.23          1.39        −3.46         58.99
XGBoost     −5.17         −4.11        −5.22         57.54
NN5         −5.61          1.04        −5.95         60.32
SVR         −7.49         −2.68        −7.74         54.13
KNR        −21.29         −0.82       −22.32         49.26
HA           0.00          0.00         0.00         60.32
Notes: This table reports the out-of-sample forecasting performance to evaluate the isolated predictive power of quality variables. The evaluation period is 2015:01 to 2023:12, and the historical average (HA) serves as the benchmark. The out-of-sample R2 measures the percentage reduction in mean squared forecast error relative to the HA, while the success ratio evaluates directional accuracy. * denotes significance at the 10% level.
Table 11. Low leverage conditional out-of-sample R2: recession vs. expansion.
Model       OOS R2 (%)   Recession R2 (%)   Expansion R2 (%)   Success Ratio (%)
NN5          0.38         −0.56               0.43               61.11
Ridge        0.29 *       −0.19               0.31               60.32
LASSO        0.28 *       −0.19               0.30               60.32
ENet         0.26 *       −0.19               0.29               60.32
PCR         −0.07         −0.39              −0.05               60.32
NN4         −0.71          0.14              −0.75               60.32
PLS         −0.86         −0.57              −0.88               59.74
NN2         −1.18         −0.12              −1.23               60.19
Combined    −1.75         −3.31              −1.67               58.23
NN3         −1.77          1.33              −1.92               60.32
OLS         −1.96         −0.53              −2.03               58.68
RF          −4.33         −4.53              −4.32               55.45
GBRT        −4.84        −10.48              −4.56               55.58
NN1         −5.94         −2.76              −6.10               54.87
XGBoost     −8.46         −6.60              −8.55               52.80
SVR        −11.71         −3.84             −12.10               52.09
CatBoost   −14.33        −13.21             −14.38               53.41
KNR        −18.80        −20.18             −18.73               52.51
HA           0.00          0.00               0.00               60.32
Notes: This table reports the out-of-sample forecasting performance to evaluate the isolated predictive power of low leverage variables. The evaluation period is 2015:01 to 2023:12, and the historical average (HA) serves as the benchmark. The out-of-sample R2 measures the percentage reduction in mean squared forecast error relative to the HA, while the success ratio evaluates directional accuracy. * denotes significance at the 10% level.
Table 12. Value conditional out-of-sample R2: recession vs. expansion.
Model       OOS R2 (%)   Recession R2 (%)   Expansion R2 (%)   Success Ratio (%)
PCR          0.32 **       0.27               0.32               59.92
PLS          0.12 *        0.56               0.10               60.32
Ridge        0.08         −0.19               0.10               60.32
NN3         −0.46         −0.26              −0.47               60.32
NN4         −0.53          0.19              −0.57               60.32
NN5         −0.71         −0.41              −0.73               60.32
LASSO       −0.99         −0.19              −1.03               61.11
ENet        −1.03         −0.19              −1.07               60.98
Combined    −1.03         −0.76              −1.04               59.52
CatBoost    −1.12         −4.94              −0.93               58.99
XGBoost     −1.44         −0.66              −1.48               58.81
OLS         −1.91         −0.62              −1.97               58.94
NN1         −3.05         −0.70              −3.17               56.32
NN2         −3.48         −1.20              −3.59               55.79
RF          −3.88         −0.73              −4.04               57.94
GBRT        −6.43          1.82              −6.84               56.83
SVR         −8.05         −4.00              −8.26               54.89
KNR        −15.84        −11.37             −16.07               48.92
HA           0.00          0.00               0.00               60.32
Notes: This table reports the out-of-sample forecasting performance to evaluate the isolated predictive power of value variables. The evaluation period is 2015:01 to 2023:12, and the historical average (HA) serves as the benchmark. The out-of-sample R2 measures the percentage reduction in mean squared forecast error relative to the HA, while the success ratio evaluates directional accuracy. ** and * denote significance at the 5% and 10% levels, respectively.
Table 13. Investment conditional out-of-sample R2: recession vs. expansion.
Model       OOS R2 (%)   Recession R2 (%)   Expansion R2 (%)   Success Ratio (%)
NN4         −0.25         −1.20              −0.20               60.32
RF          −0.40          0.38              −0.44               58.28
NN1         −0.54         −0.06              −0.57               57.67
NN3         −0.58          0.13              −0.62               60.32
Ridge       −0.67         −0.19              −0.69               60.05
NN5         −0.88         −0.78              −0.88               60.32
LASSO       −1.02         −0.19              −1.06               60.05
ENet        −1.08         −0.19              −1.12               60.05
Combined    −1.10         −0.07              −1.16               58.54
PCR         −1.41          0.70              −1.52               59.13
PLS         −1.46          1.30              −1.60               58.73
XGBoost     −2.38        −11.27              −1.94               57.17
GBRT        −2.60         −1.77              −2.64               56.16
CatBoost    −2.67         −0.38              −2.79               58.10
OLS         −2.92          1.10              −3.12               57.14
NN2         −4.68         −0.11              −4.91               55.79
SVR        −18.58         −4.75             −19.28               53.76
KNR        −19.22          0.92             −20.24               51.14
HA           0.00          0.00               0.00               60.32
Notes: This table reports the out-of-sample forecasting performance to evaluate the isolated predictive power of investment variables. The evaluation period is 2015:01 to 2023:12, and the historical average (HA) serves as the benchmark. The out-of-sample R2 measures the percentage reduction in mean squared forecast error relative to the HA, while the success ratio evaluates directional accuracy.
Table 14. Low risk conditional out-of-sample R2: recession vs. expansion.
Model       OOS R2 (%)   Recession R2 (%)   Expansion R2 (%)   Success Ratio (%)
Ridge        0.05         −0.19               0.06               60.32
PLS         −0.08         −0.05              −0.08               60.05
NN3         −0.09         −0.18              −0.09               60.32
NN2         −0.46         −0.12              −0.48               60.85
LASSO       −0.51         −0.19              −0.53               60.32
NN4         −0.53          0.18              −0.57               60.32
ENet        −0.56         −0.19              −0.58               60.32
GBRT        −0.84          0.51              −0.91               59.26
NN5         −0.88         −0.41              −0.90               60.32
RF          −0.90          0.78              −0.98               60.45
Combined    −1.10          0.45              −1.18               60.19
OLS         −1.33         −0.48              −1.38               60.05
PCR         −1.38         −0.11              −1.44               60.05
NN1         −2.00         −0.12              −2.10               58.86
XGBoost     −2.06          1.62              −2.24               58.86
CatBoost    −2.79         −0.54              −2.90               55.95
SVR         −7.63         −0.39              −8.00               57.17
KNR        −20.90          2.94             −22.10               50.77
HA           0.00          0.00               0.00               60.32
Notes: This table reports the out-of-sample forecasting performance to evaluate the isolated predictive power of low risk variables. The evaluation period is 2015:01 to 2023:12, and the historical average (HA) serves as the benchmark. The out-of-sample R2 measures the percentage reduction in mean squared forecast error relative to the HA, while the success ratio evaluates directional accuracy.
Table 15. Debt issuance conditional out-of-sample R2: recession vs. expansion.

| Model | OOS R2 (%) | Recession R2 (%) | Expansion R2 (%) | Success Ratio (%) |
|---|---|---|---|---|
| PCR | 0.23 *** | −0.68 | 0.27 | 60.05 |
| PLS | 0.16 ** | −0.68 | 0.20 | 59.92 |
| LASSO | 0.13 ** | −0.67 | 0.17 | 60.19 |
| ENet | 0.11 ** | −0.68 | 0.15 | 60.06 |
| Ridge | 0.10 ** | −0.24 | 0.12 | 59.92 |
| OLS | −0.02 | −0.68 | 0.01 | 59.92 |
| NN3 | −0.50 | 0.10 | −0.53 | 60.32 |
| NN2 | −0.77 | −0.12 | −0.81 | 60.32 |
| Combined | −0.81 | −0.74 | −0.81 | 59.79 |
| RF | −1.08 | −1.36 | −1.06 | 59.66 |
| NN1 | −1.44 | −0.12 | −1.50 | 59.66 |
| XGBoost | −1.80 | −3.01 | −1.74 | 58.99 |
| NN4 | −2.53 | −1.18 | −2.60 | 60.32 |
| SVR | −4.51 | −1.17 | −4.68 | 54.79 |
| NN5 | −5.75 | 1.35 | −6.11 | 60.32 |
| GBRT | −6.01 | 0.47 | −6.34 | 55.71 |
| CatBoost | −6.37 | −2.88 | −6.55 | 56.83 |
| KNR | −11.49 | −7.08 | −11.71 | 53.47 |
| HA | 0.00 | 0.00 | 0.00 | 60.32 |
Notes: This table reports the out-of-sample forecasting performance to evaluate the isolated predictive power of debt issuance variables. The evaluation period is 2015:01 to 2023:12, and the historical average (HA) serves as the benchmark. The out-of-sample R2 measures the percentage reduction in mean squared forecast error relative to the HA, while the success ratio evaluates directional accuracy. *** and ** denote significance at the 1% and 5% levels, respectively.
Table 16. Accruals conditional out-of-sample R2: recession vs. expansion.

| Model | OOS R2 (%) | Recession R2 (%) | Expansion R2 (%) | Success Ratio (%) |
|---|---|---|---|---|
| Ridge | 0.25 * | −0.09 | 0.26 | 60.32 |
| ENet | 0.11 | −0.15 | 0.13 | 60.32 |
| LASSO | 0.10 | −0.18 | 0.12 | 60.32 |
| PCR | −0.14 | 0.27 | −0.16 | 60.32 |
| CatBoost | −0.45 | −1.50 | −0.40 | 59.79 |
| NN3 | −0.53 | −0.12 | −0.55 | 60.32 |
| PLS | −0.65 | 0.33 | −0.70 | 60.05 |
| Combined | −0.93 | −1.75 | −0.88 | 60.32 |
| NN2 | −0.95 | −0.12 | −0.99 | 60.45 |
| NN1 | −1.57 | −1.85 | −1.55 | 58.60 |
| OLS | −1.61 | 0.03 | −1.69 | 58.99 |
| XGBoost | −1.66 | −4.54 | −1.52 | 60.05 |
| NN4 | −2.52 | −1.14 | −2.59 | 60.45 |
| RF | −4.99 | −1.16 | −5.18 | 60.58 |
| SVR | −5.01 | −17.36 | −4.39 | 58.20 |
| GBRT | −5.72 | −10.31 | −5.49 | 59.79 |
| NN5 | −5.75 | 1.35 | −6.11 | 60.32 |
| KNR | −11.67 | −8.65 | −11.83 | 52.86 |
| HA | 0.00 | 0.00 | 0.00 | 60.32 |
Notes: This table reports the out-of-sample forecasting performance to evaluate the isolated predictive power of accruals variables. The evaluation period is 2015:01 to 2023:12, and the historical average (HA) serves as the benchmark. The out-of-sample R2 measures the percentage reduction in mean squared forecast error relative to the HA, while the success ratio evaluates directional accuracy. * denotes significance at the 10% level.
Table 17. Profit growth conditional out-of-sample R2: recession vs. expansion.

| Model | OOS R2 (%) | Recession R2 (%) | Expansion R2 (%) | Success Ratio (%) |
|---|---|---|---|---|
| Ridge | 0.26 ** | −0.19 | 0.28 | 60.24 |
| LASSO | 0.25 ** | −0.21 | 0.28 | 60.32 |
| ENet | 0.19 ** | −0.22 | 0.21 | 60.32 |
| PCR | −0.39 | −0.14 | −0.41 | 60.12 |
| Combined | −0.86 | 0.33 | −0.92 | 59.94 |
| NN3 | −0.88 | 0.21 | −0.93 | 60.11 |
| NN1 | −0.96 | 0.35 | −1.02 | 60.32 |
| PLS | −1.17 | −0.03 | −1.23 | 60.13 |
| OLS | −1.50 | 0.20 | −1.58 | 59.57 |
| NN2 | −1.74 | −0.12 | −1.82 | 60.14 |
| CatBoost | −1.90 | 1.06 | −2.05 | 59.86 |
| RF | −1.99 | −1.14 | −2.03 | 58.78 |
| NN4 | −2.18 | 1.08 | −2.35 | 60.32 |
| GBRT | −2.39 | −0.27 | −2.50 | 59.84 |
| XGBoost | −3.59 | 1.21 | −3.83 | 57.87 |
| SVR | −4.38 | 0.72 | −4.64 | 56.83 |
| KNR | −16.75 | −8.23 | −17.19 | 52.35 |
| NN5 | −17.90 | −3.01 | −18.65 | 52.64 |
| HA | 0.00 | 0.00 | 0.00 | 60.32 |
Notes: This table reports the out-of-sample forecasting performance to evaluate the isolated predictive power of profit growth variables. The evaluation period is 2015:01 to 2023:12, and the historical average (HA) serves as the benchmark. The out-of-sample R2 measures the percentage reduction in mean squared forecast error relative to the HA, while the success ratio evaluates directional accuracy. ** denotes significance at the 5% level.
Table 18. Size conditional out-of-sample R2: recession vs. expansion.

| Model | OOS R2 (%) | Recession R2 (%) | Expansion R2 (%) | Success Ratio (%) |
|---|---|---|---|---|
| CatBoost | 0.25 ** | 3.92 | 0.07 | 60.40 |
| LASSO | 0.20 | 0.86 | 0.16 | 60.19 |
| ENet | 0.19 | 0.85 | 0.16 | 60.19 |
| Ridge | 0.19 | 0.90 | 0.15 | 60.32 |
| PCR | 0.19 * | 0.94 | 0.15 | 60.32 |
| OLS | −0.15 | 0.94 | −0.20 | 59.52 |
| Combined | −0.17 | 1.35 | −0.24 | 60.32 |
| SVR | −0.20 | 0.14 | −0.22 | 59.52 |
| NN3 | −0.57 | −0.12 | −0.59 | 60.32 |
| PLS | −0.71 | −0.32 | −0.73 | 59.52 |
| NN2 | −0.75 | −0.12 | −0.79 | 60.32 |
| GBRT | −1.13 | −0.92 | −1.14 | 57.83 |
| NN1 | −1.53 | −0.12 | −1.60 | 60.19 |
| NN4 | −2.42 | −1.18 | −2.48 | 60.05 |
| RF | −4.67 | −2.33 | −4.79 | 58.07 |
| XGBoost | −5.45 | −0.98 | −5.67 | 58.60 |
| NN5 | −5.75 | 1.35 | −6.11 | 60.32 |
| KNR | −16.19 | 14.47 | −17.73 | 53.47 |
| HA | 0.00 | 0.00 | 0.00 | 60.32 |
Notes: This table reports the out-of-sample forecasting performance to evaluate the isolated predictive power of size variables. The evaluation period is 2015:01 to 2023:12, and the historical average (HA) serves as the benchmark. The out-of-sample R2 measures the percentage reduction in mean squared forecast error relative to the HA, while the success ratio evaluates directional accuracy. ** and * denote significance at the 5% and 10% levels, respectively.
Table 19. Six-year rolling window.

| Model | OOS R2 (%) | R2 REC (%) | R2 EXP (%) | Success Ratio (%) |
|---|---|---|---|---|
| RF | −1.10 | 2.36 | −1.29 | 61.43 |
| PCR | −1.87 | 0.57 | −2.00 | 59.60 |
| LASSO | −2.60 | −0.68 | −2.70 | 59.15 |
| ENet | −2.66 | −0.68 | −2.77 | 59.45 |
| Ridge | −3.73 | −0.04 | −3.93 | 57.62 |
| CatBoost | −5.55 | −1.50 | −5.77 | 58.69 |
| XGBoost | −5.80 | −0.70 | −6.08 | 58.08 |
| PLS | −6.72 | −1.64 | −7.00 | 57.62 |
| NN4 | −8.42 | 1.76 | −8.99 | 55.03 |
| GBRT | −9.23 | −0.79 | −9.70 | 56.71 |
| KNR | −17.45 | −3.44 | −18.23 | 51.98 |
| NN3 | −20.01 | 2.63 | −21.26 | 53.81 |
| NN2 | −20.08 | 1.84 | −21.30 | 52.90 |
| NN1 | −20.99 | 1.56 | −22.24 | 53.05 |
| NN5 | −25.89 | 1.78 | −27.43 | 44.36 |
| OLS | −196.02 | −189.53 | −196.38 | 54.42 |
| HA | 0.00 | 0.00 | 0.00 | 61.40 |
Notes: This table evaluates the out-of-sample forecasting performance of models estimated over rolling windows of fixed 6-year lengths. Each model is benchmarked against the historical average (HA). Performance is assessed using the out-of-sample R2 and the success ratio.
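The fixed-length rolling-window design described in the notes can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: it uses a closed-form ridge regression without an intercept as the example model, re-fit each month on the most recent `window` observations, with the historical-average benchmark computed over the same window. All names and toy data are hypothetical.

```python
import numpy as np

def rolling_ridge_forecasts(X, y, window, alpha=1.0):
    """One-step-ahead ridge forecasts from a fixed-length rolling window.
    Each month the model is re-fit on the most recent `window` months only,
    so old observations drop out (unlike an expanding window)."""
    k = X.shape[1]
    preds, ha = [], []
    for t in range(window, len(y)):
        Xw, yw = X[t - window:t], y[t - window:t]
        # Closed-form ridge solution: (X'X + alpha*I) beta = X'y
        beta = np.linalg.solve(Xw.T @ Xw + alpha * np.eye(k), Xw.T @ yw)
        preds.append(X[t] @ beta)   # model forecast for month t
        ha.append(yw.mean())        # rolling historical-average benchmark
    return np.array(preds), np.array(ha)

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))            # 10 years of monthly predictors (toy)
y = rng.normal(scale=0.05, size=120)     # toy monthly excess returns
preds, ha = rolling_ridge_forecasts(X, y, window=72)   # 6-year window
```

Extending `window` from 72 to 96 months reproduces the move from the six-year to the eight-year design in Table 20, which is where the nonlinear ensembles begin to benefit from the longer training history.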
Table 20. Eight-year rolling window.

| Model | OOS R2 (%) | R2 REC (%) | R2 EXP (%) | Success Ratio (%) |
|---|---|---|---|---|
| XGBoost | 0.86 ** | 3.10 | 0.72 | 57.79 |
| CatBoost | 0.83 ** | 0.69 | 0.84 | 58.61 |
| ENet | 0.66 ** | 0.45 | 0.67 | 59.63 |
| LASSO | 0.66 ** | 0.45 | 0.67 | 59.63 |
| Ridge | 0.59 ** | 0.45 | 0.60 | 59.02 |
| GBRT | 0.27 * | 5.61 | −0.07 | 57.38 |
| RF | −0.37 | 6.86 | −0.83 | 57.79 |
| PCR | −0.80 | −0.54 | −0.81 | 59.63 |
| PLS | −2.78 | 0.33 | −2.98 | 56.97 |
| NN3 | −3.24 | 1.86 | −3.56 | 59.02 |
| NN5 | −3.76 | −3.96 | −3.74 | 44.67 |
| NN2 | −7.22 | 1.38 | −7.77 | 50.20 |
| NN1 | −8.17 | 0.63 | −8.73 | 55.53 |
| KNR | −8.65 | −8.92 | −8.63 | 51.64 |
| NN4 | −10.03 | 1.38 | −10.75 | 53.69 |
| OLS | −54.38 | −23.88 | −56.31 | 53.89 |
| HA | 0.00 | 0.00 | 0.00 | 59.18 |
Notes: This table evaluates the out-of-sample forecasting performance of models estimated over rolling windows of fixed 8-year lengths. Each model is benchmarked against the historical average (HA). Performance is assessed using the out-of-sample R2 and the success ratio. ** and * denote significance at the 5% and 10% levels.
Table 21. Alternative window (2017–2023) model performance.

| Model | OOS R2 | MSFE Adjusted | p-Value (MSFE) | Success Ratio (%) | PT Stat | p-Value (PT) | Paired t-Stat | p-Value (Paired t) | DM Stat | p-Value (DM) |
|---|---|---|---|---|---|---|---|---|---|---|
| OLS | −38.66 | 0.40 | 0.345 | 54.78 | 2.50 | 0.006 | −2.04 | 0.978 | 2.04 | 0.979 |
| PLS | −1.53 | 0.44 | 0.331 | 58.87 | 2.29 | 0.011 | −1.60 | 0.943 | 1.60 | 0.945 |
| PCR | −0.60 | 0.26 | 0.398 | 60.41 | 2.61 | 0.005 | −1.00 | 0.840 | 1.00 | 0.841 |
| LASSO | 0.38 | 1.42 | 0.078 | 59.56 | 2.54 | 0.006 | −1.03 | 0.847 | 1.03 | 0.848 |
| Ridge | 0.51 | 1.59 | 0.056 | 59.56 | 2.45 | 0.007 | −1.93 | 0.971 | 1.93 | 0.973 |
| ENet | 0.48 | 1.47 | 0.070 | 60.24 | 2.77 | 0.003 | −0.36 | 0.641 | 0.36 | 0.642 |
| RF | 1.28 | 1.37 | 0.085 | 62.46 | 3.50 | 0.000 | 1.90 | 0.030 | −1.90 | 0.028 |
| GBRT | 1.41 | 1.76 | 0.039 | 61.26 | 3.04 | 0.001 | 0.43 | 0.333 | −0.43 | 0.332 |
| XGBoost | 2.70 | 2.53 | 0.006 | 59.73 | 2.78 | 0.003 | −0.73 | 0.765 | 0.73 | 0.766 |
| CatBoost | 1.82 | 2.06 | 0.019 | 59.56 | 2.55 | 0.005 | −1.28 | 0.899 | 1.28 | 0.900 |
| KNR | −5.02 | 0.78 | 0.218 | 58.02 | 3.59 | 0.000 | −0.99 | 0.838 | 0.99 | 0.839 |
| NN1 | −4.56 | −1.12 | 0.869 | 56.99 | 1.93 | 0.027 | −2.53 | 0.993 | 2.53 | 0.994 |
| NN2 | −3.23 | −1.22 | 0.890 | 53.24 | 0.91 | 0.182 | −3.25 | 0.999 | 3.25 | 0.999 |
| NN3 | −11.30 | −1.46 | 0.928 | 53.41 | 0.90 | 0.184 | −3.05 | 0.999 | 3.05 | 0.999 |
| NN4 | −1.16 | −1.01 | 0.845 | 60.58 | 2.64 | 0.004 |  |  |  |  |
| NN5 | −5.82 | −0.61 | 0.729 | 55.80 | 1.30 | 0.097 | −1.26 | 0.895 | 1.26 | 0.896 |
| HA |  |  |  | 60.71 |  |  |  |  |  |  |
Notes: This table presents the out-of-sample forecasting performance from January 2017 to December 2023, calculated using an expanding estimation window. The historical average (HA) is used as the benchmark model for comparison. The out-of-sample R2 shows the percentage reduction in mean squared forecast error for each model relative to the HA. The success ratio indicates a model’s accuracy in predicting the direction of returns (OOS = out-of-sample; MSFE = mean squared forecast error; PT = Pesaran–Timmermann; DM = Diebold–Mariano). The FF3 benchmark is omitted from the alternative window analysis for brevity; its performance in the 2017–2023 subsample is qualitatively unchanged from the main results.
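The Diebold–Mariano comparison reported above can be sketched as follows. This is an illustrative implementation only, assuming squared-error loss, a one-step horizon, and a one-sided alternative that the model beats the benchmark; the authors' exact lag-truncation and sidedness conventions are not reproduced here, and the toy error series are hypothetical.

```python
import math
import numpy as np

def diebold_mariano(e_model, e_bench, h=1):
    """DM test on the loss differential d_t = e_bench^2 - e_model^2.
    A positive DM statistic means the model has smaller squared errors
    than the benchmark on average."""
    d = e_bench ** 2 - e_model ** 2
    n = len(d)
    dbar = d.mean()
    # Long-run variance of dbar: variance plus 2x autocovariances up to lag h-1
    var = np.mean((d - dbar) ** 2)
    for lag in range(1, h):
        var += 2 * np.mean((d[lag:] - dbar) * (d[:-lag] - dbar))
    dm = dbar / math.sqrt(var / n)
    p = 0.5 * (1 - math.erf(dm / math.sqrt(2)))  # one-sided normal p-value
    return dm, p

rng = np.random.default_rng(1)
e_bench = rng.normal(scale=0.05, size=84)   # 2017–2023 = 84 monthly errors (toy)
e_model = 0.8 * e_bench                     # model uniformly more accurate (toy)
dm_stat, p_value = diebold_mariano(e_model, e_bench)
```

Under this sign convention, the negative DM statistics for most models in Table 21 (with p-values near one) indicate that they fail to beat the HA, while RF and GBRT obtain positive statistics.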
Table 22. Alternative window (2017–2023) conditional R2 (recession vs. expansion).

| Model | OOS R2 (%) | Recession R2 (%) | Expansion R2 (%) |
|---|---|---|---|
| XGBoost | 2.70 *** | −1.97 | 2.98 |
| CatBoost | 1.82 ** | 6.37 | 1.55 |
| GBRT | 1.41 ** | 6.34 | 1.12 |
| RF | 1.28 * | 0.34 | 1.34 |
| Ridge | 0.51 * | 0.23 | 0.52 |
| ENet | 0.48 * | 0.63 | 0.47 |
| LASSO | 0.38 * | 0.67 | 0.37 |
| PCR | −0.60 | 0.04 | −0.63 |
| NN4 | −1.16 | 0.15 | −1.24 |
| PLS | −1.53 | 0.49 | −1.65 |
| NN2 | −3.23 | −1.07 | −3.36 |
| NN1 | −4.56 | −2.39 | −4.69 |
| KNR | −5.02 | −4.38 | −5.06 |
| NN5 | −5.82 | −5.92 | −5.82 |
| NN3 | −11.30 | −5.90 | −11.62 |
| OLS | −38.66 | −9.50 | −40.38 |
Notes: This table presents the out-of-sample forecasting performance from January 2017 to December 2023. The third and fourth columns show the OOS R2 during NBER-dated recessions and expansions, respectively. ***, **, and * denote significance at the 1%, 5%, and 10% levels.
Table 23. MAE out-of-sample analysis.

| Model | OOS R2 (%) | Recession R2 (%) | Expansion R2 (%) |
|---|---|---|---|
| Ridge | 0.35 ** | −0.01 | 0.36 |
| LASSO | 0.22 * | −0.19 | 0.24 |
| ENet | 0.19 * | −0.19 | 0.21 |
| PLS | 0.03 | 0.49 | 0.01 |
| PCR | −0.29 | 0.04 | −0.30 |
| NN4 | −0.30 | 0.15 | −0.32 |
| Combined | −2.47 | −1.30 | −2.53 |
| RF | −3.98 | 1.64 | −4.26 |
| XGBoost | −4.42 | −7.38 | −4.27 |
| GBRT | −4.91 | −2.35 | −5.04 |
| CatBoost | −5.69 | −6.68 | −5.64 |
| SVR | −6.43 | −1.86 | −6.66 |
| NN2 | −6.70 | −0.11 | −7.03 |
| NN1 | −8.00 | −2.39 | −8.29 |
| NN3 | −8.82 | 0.56 | −9.30 |
| NN5 | −9.92 | −5.09 | −10.16 |
| KNR | −25.08 | −5.20 | −26.08 |
| OLS | −70.07 | −9.50 | −73.13 |
Notes: This table presents the out-of-sample forecasting performance using the median absolute error scoring function on the validation set from January 2015 to December 2023. The third and fourth columns show the OOS R2 during NBER-dated recessions and expansions, respectively. ** and * denote significance at the 5% and 10% levels.
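For this robustness check, squared-error loss is replaced by absolute-error loss. A minimal sketch of the resulting OOS R2 analogue follows; note this is one plausible reading of the robustness check (using mean absolute error of the final forecasts), since the notes describe the alternative scoring function as applied on the validation set, and the toy data are illustrative.

```python
import numpy as np

def oos_r2_mae(actual, forecast, benchmark):
    """OOS R2 analogue under absolute-error loss: percentage reduction in
    mean absolute forecast error relative to the benchmark forecast."""
    mae_model = np.mean(np.abs(actual - forecast))
    mae_bench = np.mean(np.abs(actual - benchmark))
    return 100 * (1 - mae_model / mae_bench)

# Toy returns with a constant historical-average forecast (simplification:
# in the paper the HA is updated recursively each month)
r = np.array([0.02, -0.01, 0.03, 0.01])
ha = np.full_like(r, r.mean())
```

Because absolute-error loss penalizes large misses less severely than squared-error loss, it is less sensitive to outlier months such as the COVID-19 recession, which is why it serves as a robustness check here.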
Jalali, M.; Najand, M.; Cohen, A. Machine Learning, Thematic Feature Grouping, and the Magnificent Seven: A Forecasting Analysis. J. Risk Financial Manag. 2026, 19, 274. https://doi.org/10.3390/jrfm19040274
