1. Introduction
Asset pricing has been widely studied in financial markets; however, forecasting remains particularly challenging. The challenges stem from the turbulent dynamics of financial markets, their inherently noisy and non-stationary nature, and the impact of any uncertainty, regardless of its origin, on the market [
1]. As uncertainty has a greater impact on equity returns, equity investors are exposed to more risk than investors in traditional investment instruments. Therefore, equity investors focus on market analysis and projections of future returns. In order to understand financial markets accurately, it is essential to establish an effective model for market analysis [
2]. Many models have been formulated to estimate the expected returns of assets [
3,
4].
Among the models developed to estimate expected returns, the Fama–French asset pricing model is a prominent example. Fama argued that expectations and forecasts are immediately incorporated into stock prices, forming the basis of the efficient market hypothesis [
5]. The ability of this model to explain excess returns has long been evaluated using econometric methods, while more recently Machine Learning (ML) algorithms have also been employed for this purpose. Artificial intelligence and ML methods have taken a different approach to the hypothesis, which states that expectations and forecasts formed in the markets are immediately reflected in stock prices and do not affect their randomness. These technologies have been described as revolutionary in understanding markets and detecting stock return dynamics [
6]. Indeed, recent studies highlight that such nonstandard statistical techniques can recover complex patterns better than classical econometric approaches thanks to their flexibility in high-dimensional settings [
7].
The traditional portfolio management approach, which aimed to protect investors from risk through diversification, was widely used until the 1950s. The fact that the method relied on naive diversification and ignored other risks in portfolio construction was criticized by researchers [
8]. In 1952, Harry Markowitz [
9] introduced the mean–variance framework in his article ‘Portfolio Selection’, emphasizing diversification, variance–covariance structures, and the efficient frontier as the foundation of modern portfolio theory. This framework demonstrated that investors could achieve higher expected returns for a given level of risk through optimal portfolio construction. Since 1952, the main models have been the Capital Asset Pricing Model (CAPM) formulated by Sharpe [
10] in 1964, Lintner [
11] in 1965 and Mossin [
12] in 1966; the Arbitrage Pricing Theory (APT) introduced by Ross [
13] in 1976; and the family of Fama–French asset pricing models. The CAPM explains the correlation between the expected return and the risk of a security through a single-factor model, assessing whether a stock offers a return commensurate with the risk taken in a competitive market [
14]. The APT is based on the law of one price and price assets according to their sensitivity to systematic risk factors. Following these two frameworks, the Fama–French Asset Pricing Model was first introduced by E. Fama and K.R. French in 1992 with the study “Cross-Section of Expected Stock Returns” [
15]. Drawing conceptual inspiration from the arbitrage pricing theory, they extended the CAPM by incorporating additional risk factors and developed a three-factor model in 1993 [
16]. According to this model, stock returns are influenced not only by the market risk premium but also by firm size and the book-to-market (B/M) ratio. Subsequently, Carhart [
17] augmented the three-factor model with a momentum factor, leading to the four-factor specification. The development of factor models continued with the five-factor specification of Fama and French, incorporating profitability and investment factors [
18]. They further revisited the framework in 2018 and discussed the relevance of different factor specifications, emphasizing the challenge of identifying which factors should be retained in empirical applications [
19].
Following this debate, numerous studies have examined the empirical validity of factor specifications across markets and conditions. Among them, one study that applied the five-factor framework in an emerging market is Ozkan [
20], who analyzed Borsa Istanbul and found that the model was valid, with the value factor continuing to play a critical role in explaining returns. In another study, Huang [
21] investigated the Chinese stock market by comparing the three-, five- and six-factor versions, and reported that while the five-factor model generally outperformed alternative specifications, the profitability and value premia often appeared with different signs than in developed markets. More broadly, Zaremba et al. [
22] examined 23 developed and emerging economies, comparing the three- and five-factor models. Their results indicated that the superiority of the five-factor specification was evident in developed markets, where it typically outperformed the three-factor version, whereas in emerging markets, its advantage proved less consistent. In an industry-level application, the U.S. steel sector was examined during the COVID-19 period, demonstrating that profitability lost significance in the crisis, whereas size and value factors remained robust [
23]. More recently, Korenak and Stakić [
24] examined U.S. small-size mutual funds with the three- and five-factor models and found that factor performance varied across time and fund scale. Complementary evidence from the energy sector, based on the FF3F and FF5F models, likewise indicated that the explanatory role of profitability weakened under the COVID-19 crisis, while size and value factors continued to retain their relevance [
25]. A subsequent study incorporated macroeconomic variables and showed that inflation and interest rates substantially shaped the explanatory strength of standard factor models [
26]. In a further contribution, Li et al. [
27] compared competing specifications under price-impact cost adjustments and concluded that explanatory power was highly sensitive to both factor choice and transaction cost considerations. Extending the discussion to sectoral resilience, Martí-Ballester [
28] showed that healthcare funds preserved robust explanatory strength during the COVID-19 crisis, illustrating that not all industries are equally vulnerable to profitability shocks. On the modeling side of the study, the comparison of the FF3F, FF4F, and FF5F specifications highlighted the relative superiority of the five-factor framework. Viewed across this body of literature, the studies indicate that while classical multi-factor models remain central to empirical asset pricing, their validity and the relevance of individual factors are contingent on market structure, sample characteristics, and methodological design.
The complexity and diversity of stock markets, combined with the inability of current models to adequately explain price fluctuations, have encouraged researchers to seek new approaches [
29]. In this context, the accurate estimation of stock price fluctuations has become important for market participants. Institutional investors and researchers have started to apply artificial intelligence, statistical analysis, and engineering methods to financial markets to explain stock market movements and optimize returns [
30]. The increasing diversification of influences on markets has also led researchers to consider external factors, and the ability to predict stock price changes has become an important area of investigation. Therefore, algorithms that estimate future price movements have made financial markets an important resource for evaluating the potential of artificial intelligence [
31]. Additionally, some financial institutions have begun to explore ways to generate excess returns using high-frequency trading algorithms.
Instant trading in capital markets with developed algorithms has become a new and significant issue, especially for markets without depth, causing dramatic volatility. As a result, company valuation remains crucial for investors and researchers to understand the market and select appropriate companies. In general, most institutional investors use ML methods to estimate future prices and generate excess returns. These methods can be applied to many risky asset pricing models including Fama–French models, which occupy a central place in the literature. While the Fama–French models are primarily designed to explain returns using contemporaneous factor exposures, prior evidence indicates that their explanatory power is not uniform across industries and may vary over time, raising concerns about model generalizability [
32]. Building on this insight, the present study adapts the Fama–French frameworks within a ML setting by using their generalization ability to examine the robustness of these methods on unseen future data. Specifically, the analysis investigates whether patterns derived from past realizations of factor returns can enhance the out-of-sample prediction of excess returns.
The rationale for using ML out-of-sample prediction techniques in this study can be summarized as follows: estimating the risk premium of an asset is inherently a prediction problem. ML methods are particularly suited for this task due to their predictive capabilities. Moreover, traditional forecasting methods become ineffective when the number of predictors approaches the number of observations or when predictors exhibit multicollinearity. ML algorithms, especially those involving variable selection and dimension reduction, offer robust solutions to these challenges. This methodology includes techniques that reduce degrees of freedom and control unnecessary variation among predictors. A range of methods, including generalized linear models, regression trees and neural networks, are well suited to approximate complex nonlinear relationships. Furthermore, the use of parameter penalization and conservative model selection criteria prevents overfitting and false discovery [
33].
The objective of this study is to evaluate the performance of different ML algorithms in predicting excess returns of U.S. industry portfolios, based on the Fama–French three-factor (FF3F), four-factor (FF4F), five-factor (FF5F), and six-factor (FF6F) asset pricing models, using an out-of-sample prediction framework. The scope of the study includes the U.S. industry portfolios compiled by Fama and French consisting of stocks traded on the NYSE, AMEX, and NASDAQ. This study contributes to the empirical literature on the prediction of stock portfolio returns, which has developed along several strands. One line of research models cross-sectional differences in expected returns across U.S. industry portfolios, typically using factor models such as those proposed by Fama and French. To validate the statistical significance of performance differences among these model structures, the present study incorporated the Model Confidence Set (MCS) test as a model confidence testing procedure. Another line of research focuses on forecasting returns using the predictor sets, and related contributions compare the effectiveness of ML algorithms, including Support Vector Regression (SVR), Linear Regression (LR), K-Nearest Neighbor (KNN), and Multilayer Perceptron (MLP). By clarifying how the forecastability of industry portfolios varies across factor structures and algorithms, the analysis is expected to support more resilient investment strategies and inform policy considerations directed at long-term stability and sustainable performance.
The paper consists of five sections. After the introduction, the
Section 2 reviews the literature testing the Fama–French asset pricing model. The
Section 3 explains the data and methodology. The
Section 4 presents and discusses the results. Finally, the
Section 5 provides a general assessment.
2. Literature Review
Numerous studies have examined the predictive potential of ML methods within the framework of the Fama–French asset pricing model and broader stock return modeling. One of the earliest efforts applied neural networks to pricing derivative contracts, signaling the beginning of interest in nonlinear algorithms in financial contexts [
34]. Over the following decades, this interest expanded into stock return prediction, with a growing emphasis on ML techniques.
In the early 2010s, research increasingly turned toward the integration of ML in financial forecasting. Allen et al. provided an early example of ML improving return predictions [
35]. They applied a support vector machine (SVM) to predict the directional movement of Dow Jones Industrial Average stock prices, comparing it against a logistic regression classifier. The support vector machine achieved higher predictive accuracy than logistic regression in forecasting price changes, and its structural risk-minimization principle helped resist overfitting. This demonstrated the potential of nonlinear ML methods to outperform traditional linear models in short-horizon return prediction, even for simple long/short trading decisions. Building on this early stream, Shen et al. [
36] employed Fama–French datasets to develop a doubly regularized portfolio optimization model with L1 and L2 penalization. Their results showed that regularization enhanced the stability and out-of-sample performance of factor-based portfolios, underscoring the usefulness of ML-inspired penalization methods for asset pricing applications. Extending the methodological scope, a subsequent study introduced deep learning into return prediction through the concept of deep portfolios [
37]. This framework employed autoencoder networks to capture nonlinear dependencies in asset returns, illustrating that deep architectures can extract latent factors beyond those identified by linear models and reinforcing the potential of deep learning in asset pricing research. Research activity in this field continued to expand throughout the decade, with subsequent studies exploring more advanced architectures and ensemble techniques to enhance predictive performance. Among these contributions, Zhang and Chen [
38] introduced an AdaBoost ensemble framework to predict annual stock returns, incorporating multiple predictive variables reflecting market, size, valuation, and fundamental characteristics. Their empirical findings showed that the ensemble approach substantially improved forecasting accuracy compared to traditional regression-based methods, highlighting the benefit of combining diverse factor signals within an ML architecture. Reflecting the growing role of deep learning in ML, Nguyen and Yoon [
39] proposed a transfer learning framework built on LSTM models, which pre-trains on related stocks and fine-tunes on the target stock to address data scarcity in short-term movement forecasting. Using one-day returns as inputs, their model outperformed baseline approaches such as support vector machine, Random Forest (RF), and KNN, thereby demonstrating the ability of advanced deep learning techniques to enhance return prediction accuracy.
With the advent of the 2020s, the scope of ML applications in asset pricing continued to broaden, encompassing various domains. Weigand [
40] provided a comprehensive survey of empirical asset pricing studies employing ML, showing how algorithms such as SVR, ensemble methods, and neural networks have been increasingly adopted to improve factor selection, capture nonlinear dependencies, and enhance return predictability beyond traditional linear models. Illustrating this trend, Jan and Ayub [
41], applied artificial neural networks (ANN) to the Fama–French five-factor model in order to generate one-period-ahead return forecasts. Their results indicated nearly perfect alignment in high- and medium-beta portfolios (
), whereas predictive strength weakened considerably in the low-beta group (
). Using rolling 48-month windows and Levenberg–Marquardt training across varying hidden-layer configurations, they showed that five-factor signals can be effectively captured by ANN for short-horizon prediction, although forecast errors increase in low-beta portfolios. In a large-scale empirical application, Gu et al. [
33] conducted a large-scale empirical analysis, comparing linear, tree-based, and neural network methods, and demonstrated that ML-based approaches outperformed classical factor models, including the Fama–French specifications, in predicting U.S. stock returns. Among these studies, one applied ML to European stock markets using RF, gradient boosted regression trees (GBRT), MLP, and support vector machine, employing both regression-based asset pricing and classification-based portfolio construction [
42]. The results indicated that the MLP achieved the highest explanatory power in factor regressions, while support vector machine outperformed in classification tasks for portfolio selection; when transaction costs were considered, MLP again provided superior net performance, with the FF5F model serving as the strongest linear benchmark. In 2023, Diallo et al. [
43] employed a Bayesian optimization SVR to estimate FF3F and FF5F, with the three-factor model outperforming the five-factor version in several industry sectors, achieving correlation coefficients as high as 94% for consumer and industrial sectors and 92% for the high-tech sector. In the study, the five-factor model exhibited a wider range of correlation, from 48% in healthcare to 89% in high-tech. In another attempt to extend the five-factor specification, Li and Teng [
44] incorporated long-memory dynamics via the Hurst exponent together with momentum, thereby increasing explanatory power by nearly 7%. They further compared ordinary least squares regression (OLS) estimates with several ML techniques and found that SVR and RF provided clear improvements, while Lasso and Ridge failed to outperform the regression. ANN generally performed well but underperformed only in small-sample settings. A further contribution was made by Fallahgoul et al. [
45], who employed neural networks to test the statistical significance of factor returns within the asset pricing framework, showing that such nonlinear methods can offer alternative insights compared to traditional regressions. In the same year, Luo [
46] analyzed the predictive performance of the FF3F and FF5F models on a dataset comprising 49 U.S. industry portfolios using XGBoost regression. The results, measured by root mean square error, showed minimal differences in accuracy between the two models. In parallel, Ferrara and Ciano [
47] integrated explainable AI techniques such as SHAP and LIME into ML-based asset pricing, demonstrating that RF and XGBoost not only achieved strong predictive accuracy but also allowed for a clearer interpretation of factor contributions, thereby addressing the black-box criticism often directed at ML in finance.
Despite such advances in ML applications, evidence emerged that nonlinear dependencies exist even within the classical factor model frameworks, suggesting that linear specifications may overlook relevant interactions. In line with this view, Bandurski and Postek [
48] demonstrated that introducing nonlinear terms into the Fama–French three-factor model yielded statistically significant improvements, although the economic gains remained limited. Complementary results from the Hanoi stock market further demonstrated that while the FF3F outperformed the CAPM in explanatory power, explanatory gaps remained, and predictive accuracy improved substantially when the SVR algorithm was embedded within the model, underscoring the importance of nonlinear ML approaches [
49]. At the same time, Kang [
50] emphasized that linear factor models themselves may still suffer from systematic mispricing, highlighting limitations in their explanatory capacity. Extending this line of evidence to the FF6F specification, Goo and Wang [
51] documented nonlinear phenomena in the Fama–French six-factor model, indicating that even comprehensive linear models may overlook relevant dynamics. This further underscores the importance of testing whether ML algorithms can capture such complexities more effectively. In this context, recent research has examined the integration of ML methods with linear factor models, showing that feature-importance rankings for the Fama–French five-factor model can vary substantially across algorithms, thereby offering complementary insights into factor relevance and interpretability [
52]. For ease of comparison,
Table 1 summarizes the key studies reviewed in this section, including their data characteristics, methodological approaches, and principal findings.
Collectively, the literature demonstrates a growing interest in the application of ML to asset pricing, particularly in testing the predictive capacity of the Fama–French models across various markets and time horizons. While earlier studies focused on demonstrating feasibility, recent research increasingly emphasizes performance comparison, hybrid modeling, risk forecasting, and integration with advanced algorithmic techniques. This study differs from prior research by jointly applying multiple Fama–French model specifications; three-, four-, five-, and six-factor versions, across a broad range of U.S. industry portfolios and systematically evaluating their out-of-sample predictive performance using various ML algorithms.
3. Materials and Methods
The main objective of this research was to assess the out-of-sample prediction performance of the ML methods for modeling excess returns under the FF3F, FF4F, FF5F, and FF6F asset pricing models. Each specification was evaluated across U.S. industry portfolios grouped into 5-, 10-, 12-, 17-, 30-, 38-, 48-, and 49-industry classifications, as defined by Fama and French. The ML techniques applied included SVR, LR, KNN, and MLP.
The ML algorithms used in the study were selected based on their widespread use and their representation of diverse learning paradigms in regression tasks. They span a range of modeling capabilities, from linear parametric approaches (LR), to non-parametric methods (KNN), kernel-based models (SVR), and nonlinear function approximation via neural networks (MLP). Prior empirical evidence in financial forecasting has demonstrated the relative success of these algorithms in capturing complex relationships between financial variables [
7]. Moreover, alternative methods such as regression trees, RF, and radial basis function networks were initially considered but excluded because their preliminary predictive accuracy in early exploratory screening on representative subsets of the main data was inferior to that of the four methods retained in the analysis.
All ML analyses were implemented in the WEKA (Waikato Environment for Knowledge Analysis) software platform, version 3.8 [
53]. This setup ensured a consistent computational environment across all model–method combinations. The predictive performance of each ML method was assessed using the Pearson correlation coefficient (
) and the Root Mean Squared Error (RMSE) between the predicted and actual excess returns. Among these,
served as the primary criterion for evaluating model performance, as it summarizes the directional agreement and strength of association between predicted and realized excess returns in a scale-independent manner [
54]. This property makes
particularly suitable for financial return modeling, as it provides a robust measure for assessing predictive alignment under varying market volatility conditions and effectively captures the consistency between predicted and realized trends [
55]. Therefore,
provides a consistent and interpretable benchmark for comparing the predictive coherence of alternative methods across industry portfolios. The use of
as an evaluation metric is well established in the empirical asset pricing literature and has been frequently adopted in ML-based applications [
33,
43]. RMSE was additionally reported as a complementary magnitude-based metric as it summarizes the average size of prediction errors in squared form and allows for deviations between the predicted and realized returns to be evaluated in terms of their scale. Furthermore, the MCS test was employed to statistically validate whether the observed performance differences among the model specifications (FF3F–FF6F) were significant. This approach enhances the robustness of model comparison by verifying that the observed differences are statistically meaningful rather than the result of random variation, providing a reliability check that complements the
- and RMSE-based evaluations.
3.1. Data Description and Model Framework
The study uses industry portfolio groupings provided by Fama and French to construct the FF3F, FF4F, FF5F, and FF6F models. All portfolios consisted of stocks listed on the NYSE, AMEX, and NASDAQ exchanges. These portfolios were organized into eight classification levels, including 5-, 10-, 12-, 17-, 30-, 38-, 48-, and 49-industry groupings, which corresponded to the portfolio structures listed in the
Appendix A. Each classification level was defined according to the sectoral compositions provided in the Kenneth R. French Data Library [
56]. Two sector portfolios, STEAM and WATER, were excluded from the analysis due to limited data continuity over the evaluation period, which prevented the compilation of complete and reliable return–factor datasets. Monthly excess return data for all industry portfolios were also obtained from the same source. Each portfolio’s return was value-weighted based on market capitalization and represents the aggregated performance of stocks within its respective industry category. The dataset covers the period from July 1992 to January 2022, spanning 355 monthly observations.
The Fama–French multi-factor models form the theoretical foundation of the analysis. These models explain industry portfolio excess returns through systematic risk factors representing distinct sources of market variation. The mathematical formulations of the FF3F, FF4F, FF5F, and FF6F models used in this study are presented below [
18].
where
Monthly return on portfolio in period ;
Risk-free rate in period ;
Market return in period The return spread between the monthly returns of stock portfolios with small market capitalization and those with large market capitalization in period ;
The return spread between the monthly returns of stock portfolios with high B/M and those with low B/M in period ;
Intercept term representing the average excess return on portfolio not explained by the model factors;
Sensitivity of the excess return of portfolio to the market risk premium (), measured at time ;
Sensitivity of the excess return of portfolio to the size factor ();
Sensitivity of the excess return of portfolio to the size factor ();
Error term in period .
The return spread between the monthly returns of winner stocks (those with the highest past 12-month returns) and loser stocks (those with the lowest past 12-month returns) in period (Winner Minus Loser factor; captures the momentum effect.);
Sensitivity of the excess return of portfolio to the momentum factor (), measured at time .
= The return spread between the monthly returns of stock portfolios with high profitability and those with low profitability in period ;
= The return spread between the monthly returns of stock portfolios with conservative investment and those with aggressive investment in period ;
= Sensitivity of the excess return of portfolio to the profitability factor ();
Sensitivity of the excess return of portfolio to the investment factor ().
In the predictive framework, these factor structures are used as explanatory variables to model and predict industry portfolio excess returns (), where each model defines a distinct combination of risk factors. The FF3F includes the market excess return (, market risk premium, representing the return on the market over the risk-free rate), the size factor (, Small Minus Big, capturing the return difference between small- and large-cap stocks), and the value factor (, High Minus Low, capturing the return difference between high and low book-to-market stocks). The four-factor model (FF4F) extends this specification by incorporating the momentum factor (, Winners Minus Losers, reflecting the return of past 12-month winners minus losers) in addition to the market, size, and value factors. The FF5F builds on FF3F by adding two profitability and investment-related factors named (Robust Minus Weak, a profitability factor comparing firms with high versus low operating profitability) and (Conservative Minus Aggressive, an investment factor comparing firms with low versus high investment rates). Finally, the six factor model (FF6F) extends the FF5F specification by adding the momentum factor (, Winners Minus Losers), which captures the tendency of past winners to outperform past losers. This factor is often associated with the so called low volatility anomaly, where stocks with lower volatility tend to yield superior risk adjusted returns. This progression allows for evaluating how the expansion of factor information influences the model’s predictive performance. Each set of factor returns (, , , , , , and ) was obtained from the Kenneth R. French Data Library or derived following standard definitions in asset pricing literature. This structure allows the algorithms to learn the mapping between factor exposures and realized portfolio performance within an out-of-sample prediction framework.
3.2. Experimental Design and Application
The experimental procedure consists of five main stages (
Figure 1): data preparation, pilot hyperparameter tuning, cross-validation, method training and out-of-sample testing. These stages were designed to ensure that all portfolio–model–method combinations were analyzed under consistent, replicable, and unbiased experimental conditions.
Data preparation: For each portfolio configuration (5-, 10-, 12-, 17-, 30-, 38-, 48-, and 49-industry groups), all constituent portfolios were matched with the corresponding Fama–French factor variables of the selected model specification (FF3F, FF4F, FF5F, and FF6F). This process resulted in four datasets per portfolio, each corresponding to one factor model and containing the same dependent variable, the portfolio’s excess return, alongside the relevant explanatory factors. Each dataset was then chronologically divided into training and testing sets according to the pre-specified periods: July 1992–December 2012 for training and January 2013–January 2022 for out-of-sample testing. In addition to these datasets, a separate dataset was prepared exclusively for the preliminary hyperparameter tuning experiments. This pilot dataset was constructed in the same structure as the primary dataset, comprising excess returns and the corresponding Fama–French factor variables, and it was split once into training and testing subsets using the same temporal segmentation, following the same temporal segmentation as the main data. It was used to determine suitable hyperparameter settings and value ranges for the machine-learning algorithms applied in the study, aiming to identify robust parameter configurations rather than algorithm-specific optima. This dataset was not included in the main empirical analysis. All machine-learning algorithms were trained using the same input structure, where factor returns corresponding to each Fama–French model (FF3F–FF6F) served as explanatory variables for predicting the contemporaneous portfolio excess returns. No temporal shifting or lag structure was introduced, ensuring that all models operated on identical and fully synchronized factor–return relationships.
Pilot hyperparameter tuning: Preliminary hyperparameter tuning was conducted using a dataset independent from the main analysis to avoid potential data leakage or bias during parameter calibration. For this purpose, the Fama–French 25 Size and Book-to-Market portfolios were selected because they share similar structural and methodological characteristics with the industry portfolio data employed in the main analysis. This approach allowed for parameter calibration to be performed under conditions comparable to those of the main study while remaining completely independent from it. The parameters optimized through the preliminary tuning experiments were fixed across all methods and portfolio groups to maintain methodological consistency and ensure that the results reflect the intrinsic performance of each learning algorithm rather than data-specific tuning effects. Hyperparameter tuning was performed once for each machine-learning algorithm type to determine stable and generalizable parameter configurations. The selected parameters were then fixed across all Fama–French model specifications (FF3F–FF6F) and portfolio groups to ensure that performance differences reflect the predictive capability of the FFVF models and ML algorithms themselves, rather than parameter variation. A unified hyperparameter setting was adopted across portfolios to ensure that the algorithms operated under consistent learning conditions. This approach enhances the generalizability of the results and preserves comparability across Fama–French model specifications and machine-learning methods. Maintaining the fixed configuration across portfolios also reduces the computational burden, as performing tuning individually for more than 150 portfolio–model–method combinations would substantially increase complexity and introduce unnecessary parameter proliferation.
Training and out-of-sample testing: For each portfolio, the datasets corresponding to the Fama–French model specifications (FF3F–FF6F) were used to train the four machine-learning algorithms, SVR, MLP, LR, KNN, and to assess method performance on the corresponding test data. The methods were trained using the fixed hyperparameter settings determined during the preliminary tuning phase, and their predictive performance was evaluated to measure out-of-sample accuracy. During training, a 10-fold cross-validation procedure was applied to enhance method robustness and to prevent the estimation process from being overly influenced by any specific data partition. This validation approach was designed to ensure that the methods captured stable and generalizable patterns within the training data before being applied to the unseen test samples. It reduces sampling variance, provides a more reliable assessment of method consistency, and supports fair comparison among algorithms operating under identical learning conditions. After cross-validation, the finalized methods were applied to the unseen test data to evaluate their predictive ability on new observations. This procedure ensured that all algorithms were assessed under consistent experimental conditions, yielding unbiased out-of-sample performance results. Out-of-sample forecasting was conducted in a chronologically ordered framework, where methods were trained on the earlier period and evaluated on the subsequent period. This non-overlapping out-of-sample evaluation ensures that all forecasts are generated using information available up to each point in time, providing a consistent temporal structure across all portfolio–model–method combinations.
3.3. Machine Learning Methods
ML methods rely on modeling historical data by training algorithms to extract patterns and generate predictions for new observations. These systems optimize performance criteria based on the training data, enabling them to generalize without additional adjustments during inference. By learning the underlying structure of the data, ML models can adapt and transfer learned relationships to subsequent prediction tasks. This section summarizes the ML methods used in the study.
3.3.1. Support Vector Regression
The support vector machine, introduced by Vapnik [
57], is a statistical learning method based on structural risk minimization and can be formulated for both classification and regression problems [
58]. In the regression setting, known as Support Vector Regression (SVR), the method estimates a continuous function that minimizes prediction errors within a tolerance margin. This approach constructs a linear prediction function, expressed as
, where
represents the weight vector and
the bias term, and
is the feature vector. When the relationship between inputs and outputs is nonlinear, kernel functions are applied to project the data into a higher-dimensional space where linear SVR can be performed. Common kernels include linear, polynomial, and radial basis function, each defined by parameters that control the flexibility of the fitted function. To operationalize this framework, SVR formulates the estimation task as an optimization problem that balances model smoothness and tolerance to prediction errors.
The optimization problem is formulated as where the first term controls the smoothness of the estimated function, while penalizes errors via the slack variables , which allow limited tolerance for observations that exceed the acceptable error threshold. The constant balances model complexity and error tolerance. In the framework, a tolerance parameter defines an error-insensitive region around the predicted function Samples satisfying incur no penalty, whereas larger deviations contribute to the loss and are penalized through . The parameters and jointly control the trade-off between flexibility and tolerance to prediction errors and denotes the target variable.
3.3.2. K-Nearest Neighbors
The KNN algorithm, introduced in 1951, was initially described as a non-parametric supervised learning method [
59]. Subsequently, the method was extended to include applications in both classification and regression tasks [
60]. The algorithm classifies new samples by comparing the similarity of their features with those of existing data points. When encountering an unknown sample
, the algorithm identifies the
training samples closest to the sample using a predefined distance metric, most commonly the Euclidean distance given as
, where
and
are two input samples. Once the
-nearest neighbors are determined, the algorithm classifies the unknown sample by majority voting among these neighbors for classification tasks. For regression tasks, the algorithm predicts the output by calculating the average value of the
-nearest neighbors. The simplicity of KNN allows it to be widely used in various domains, although its performance is sensitive to the choice of
and the distance metric used.
3.3.3. Multilayer Perceptron
MLP is a type of feedforward ANN widely used for both classification and regression tasks, capable of modeling nonlinear relationships between input and output variables [
61]. It consists of an input layer, one or more hidden layers, and an output layer, where each neuron in a layer is connected to all neurons in the next layer through weighted links. In this structure, the output of a given layer
l, the output
can be expressed as
, where
and
represent the weight matrix and bias vector, respectively, and
is the activation function, often a nonlinear function such as sigmoid, ReLU, or tanh, enabling the network to approximate complex functional mappings.
Learning in MLP is typically performed using gradient-based optimization methods, with backpropagation being the most common algorithm. In this process, the prediction error is propagated backward through the network, allowing the algorithm to compute the gradient of the loss function with respect to the method parameters (weights and biases). The parameter updates are then performed using an optimization algorithm such as gradient descent according to the rule , where is the learning rate that controls the step size of the updates, and represents the weight connecting -th neuron of the layer to the -th neuron of the previous layer. The loss function is typically the mean squared error when the network is used for regression. This iterative optimization continues until the network’s performance reaches a predefined stopping criterion. The main parameters that govern the learning process in an MLP are the learning rate, the number of hidden neurons, and the number of training iterations (epochs). Appropriate configuration of these parameters is essential to ensure stable convergence and effective generalization.
3.3.4. Linear Regression
Linear Regression is a fundamental statistical learning method used to model the relationship between a dependent variable and one or more independent variables. The model assumes a linear relationship of the form , where is the predicted value, represents the -th feature of observation , are the model coefficients, is the intercept, and denotes the error term, while is the target variable. The model parameters are estimated by minimizing the sum of squared errors between the observed and predicted values. This optimization, which does not involve external hyperparameters, can be solved analytically using the ordinary least squares method or numerically through iterative algorithms such as gradient descent, particularly when dealing with large or multicollinear datasets.
3.4. Model Confidence Test
The MCS test provides a formal statistical framework for comparing the predictive accuracy of multiple models [
62]. Unlike pairwise significance tests, the MCS procedure allows for simultaneous evaluation of several competing models while controlling for Type I errors and the family-wise error rate, identifying the subset of models that cannot be statistically distinguished from the best-performing one.
The MCS procedure begins with an initial set of candidate models and iteratively removes the models with statistically inferior predictive performance until a subset of equally performing models remains. The null hypothesis of Equal Predictive Ability states that the loss differences between any pair of models are, on average, equal to zero, implying no systematic difference across models. The test statistic is computed as , where is the sample mean of the loss differential between model and model , and is a variance estimate obtained through block bootstrap resampling. Models are sequentially removed based on their relative loss rankings until the null hypothesis cannot be rejected at the chosen significance level. The remaining models constitute the confidence set of superior models, representing the specifications that cannot be statistically distinguished from the best-performing model at the selected confidence threshold.
3.5. Parameter Configuration and Evaluation Settings
The hyperparameter configuration used in the empirical analysis was determined through a set of preliminary tuning experiments conducted prior to the out-of-sample forecasting evaluation. These experiments were carried out to identify parameter settings that would be applied consistently across all ML methods used in the study. Based on this tuning stage, the following hyperparameter configurations were adopted for each method. For MLP, the learning rate was first tested in the range of 0.1 to 1.0 with increments of 0.1. Since performance declined as the value increased, the interval was narrowed to 0.01 to 0.1 and tested with 0.01 increments, ultimately selecting 0.01 as optimal. Momentum was not applied, as it did not improve convergence behavior and occasionally led to less stable performance. The next hyperparameter examined was the maximum training time (epoch count), which was assessed across values ranging from 500 to 2500 in steps of 500. As performance plateaued beyond 1000, that value was adopted. The number of neurons in the hidden layer was left at WEKA’s default setting, which is based on a widely accepted formula in the literature: . Since this configuration is commonly used and has produced satisfactory results in prior studies, no further tuning was performed for this parameter. For KNN, the number of neighbors was tested from 5 to 50 in steps of 5, with the best results observed at . For SVR, the penalty parameter was tested from 10 to 100 in increments of 10 and further refined in the 10–20 range using 1-point increments, leading to the selection of . Informed by prior studies, other potential hyperparameters such as kernel type, sigma values, or batch size were retained at WEKA’s default settings as they did not lead to appreciable improvements in performance. These parameter choices were identified through systematic grid-based pilot experiments, in which alternative values were evaluated and the selected settings consistently yielded the most reliable predictive behavior across trials.
Complementing the parameter configurations of the machine-learning methods, the setup for the MCS test was defined using the squared forecast error as the loss function, following standard practice in mean-squared-based comparisons. The procedure was implemented with 1000 bootstrap replications after preliminary experiments with 500 and 1000 replications yielded comparable outcomes and the latter provided more stable estimation. As the block length parameter of the test, ℓ = 2, 4, and 6 were examined, consistent with the values considered in the original MCS study [
62], and ℓ = 6 was selected as it offered a reasonable compromise between dependence preservation and bootstrap efficiency. The significance level was set at
(75% confidence), in line with applications in forecasting studies where limited sample sizes and modest performance differences warrant a more flexible threshold.
4. Results and Discussion
The out-of-sample prediction performance of four ML methods, namely SVR, LR, KNN, and MLP was evaluated in the context of excess return prediction across eight U.S. industry portfolio groupings. These portfolio sets, each comprising a different number of industries, were modeled using the Fama–French three-, four-, five-, and six-factor frameworks (FFVF models). The predictive performance was assessed using the
between the predicted and actual excess returns, while the statistical significance of the observed differences across the Fama–French model specifications was further validated through the MCS test. This section summarizes the best-performing model–method combinations for each portfolio configuration, with detailed tabulated outputs and MCS-based evaluations presented in the following subsections. The complete set of prediction results for all model–method configurations can be accessed through the related study [
63].
4.1. Analysis of 5-Industry Portfolios
The evaluation begins with the 5-industry portfolio classification, which represents the most aggregated sectoral structure considered in this study. This configuration provides a compact setting to examine how the FFVF models perform when cross-sectional heterogeneity is limited, offering a reference point for subsequent portfolio groupings. The predictive performance obtained for this configuration is presented in
Table 2, which provides the out-of-sample results across the FF3F to FF6F specifications. In the FF3F specification, LR yielded the most accurate predictions in three portfolios, followed by SVR in 2. FF4F showed a shift in favor of SVR, which led in three portfolios, while KNN ranked first in 2. Under FF5F, LR again outperformed others in three portfolios, and SVR in 2. Finally, in the FF6F setting, SVR maintained its position with 2 leading results, while the remaining portfolios were split among LR, KNN, and MLP. This performance pattern was further clarified by examining the highest
values achieved by the leading methods under each model. For the CNSMR portfolio, the highest
value (0.925) was obtained under the FF5F model using LR. HITEC exhibited strong predictability with FF4F and SVR, yielding a
of 0.949. In the HLTH portfolio, the best performance was achieved with FF4F using KNN, reaching a
of 0.796. The MANUF portfolio showed its highest accuracy under FF3F with LR (0.948). Finally, OTHER attained the strongest overall result with FF5F and LR, recording a
value of 0.959. In terms of model performance, FF4F and FF5F each produced the highest
values in 2 portfolios, followed by FF3F with 1, while FF6F did not outperform any other specification.
The findings from the 5-industry portfolio group reveal that applying Fama–French factor models through ML techniques can yield highly effective out-of-sample predictions. Except for HLTH, all portfolios recorded values exceeding 0.90, underscoring the strong alignment between factor structures and realized returns in this configuration. Even in HLTH, the best-performing model–method combination achieved a of 0.80, indicating reasonably accurate forecasts. Among the methods, SVR emerged as the most consistently effective, followed by LR, confirming their strength in capturing return dynamics within this portfolio setting.
The MCS results for the 5-industry group reveal that FF3F and, more notably, FF4F were excluded from the superior set in several portfolios, indicating that their loss differentials correspond to
-values
in
Table 2, placing them outside the acceptance region. In contrast, FF5F and FF6F remained within the superior set across all portfolios, showing that their loss profiles do not provide statistical evidence strong enough to reject equal predictive ability. Across the group, this pattern emerged in different forms as HITEC and HLTH displayed selective retention of models, whereas OTHER and MANUF preserved all four specifications within the superior set. In the case of the CNSMR portfolio, the uniformly high
values were accompanied by loss differences that prevented the MCS procedure from identifying a statistically superior model, resulting in an empty superior set. For CNSMR, however, the uniformly high
values were accompanied by loss differences that prevented the MCS procedure from identifying a statistically superior model, leaving the superior set empty. This structure is consistent with the model-level interpretation derived from the
values, where the relative weakness of FF3F and FF4F and the stronger alignment of FF5F were already evident. The accompanying MCS
-values plot further reflects this pattern by displaying a more stable acceptance profile for FF5F and FF6F across portfolios (
Figure 2). Taken together, these results indicate that the weaker standing of FF3F and FF4F is statistically meaningful and coherent with the broader evidence in this configuration.
4.2. Analysis of 10-Industry Portfolios
Following the 5-industry evaluation, the analysis continued with the 10-industry portfolio classification, which represents an intermediate level of sectoral aggregation within the Fama–French framework. This configuration offers a balanced environment for assessing the interaction between the FFVF models and the machine-learning methods under a moderate degree of cross-sectional heterogeneity. Empirical results for this portfolio group are provided in
Table 3. LR outperformed other methods in 16 model-portfolio combinations, followed by SVR (12), MLP (6), and KNN (6), reflecting LR’s relative advantage in this configuration. In FF3F, LR achieved the highest number of best predictions, while KNN and SVR each performed best in 2 portfolios, and MLP in 1. FF4F produced a more evenly distributed outcome, with KNN and SVR each delivering top results in 4 portfolios and LR and MLP leading in 1 portfolio apiece. LR regained dominance in FF5F and FF6F, outperforming competitors in 5 and 4 portfolios, respectively. From a model-level perspective, FF5F and FF6F each yielded the highest
values in 3 portfolios, indicating comparable levels of general effectiveness across different industry contexts. FF3F and FF4F followed, with 2 portfolios each, reflecting a more limited but still meaningful level of predictive alignment within this 10-industry configuration.
Among the analyzed portfolios, the most accurate predictions were achieved in HITEC with MLP under FF4F (0.919), MANUF with MLP under FF6F (0.952), and OTHER with LR under FF5F (0.959), indicating strong predictive alignment between factor-based structures and realized returns. SHOPS with LR under FF5F (0.887) and NODUR with MLP under FF4F (0.838) also produced comparably strong results, yielding values close to the highest recorded accuracy. Meanwhile, TELCM (0.797, SVR, FF6F), HLTH (0.796, KNN, FF4F), ENERGY (0.785, LR, FF3F), and DURBL (0.758, KNN, FF3F) demonstrated reasonably strong performance across models. On the other hand, the UTILS portfolio consistently exhibited weak predictive performance across all model–method combinations, with accuracy levels rarely exceeding 0.52. This recurring underperformance may reflect structural factors unique to utilities, such as regulatory insulation and low responsiveness to market dynamics, which can limit the relevance of risk-based explanatory models. Additionally, recent transformations in the sector, such as technological innovation, renewable energy integration, and digitalization, may have altered return dynamics over time, reducing the ability of models trained on historical data to accurately predict future outcomes.
Overall, the results indicate a generally strong out-of-sample prediction performance across the 10-industry portfolios, with all but 1 portfolio achieving values above 0.75. LR emerged as the most consistently successful method across model–portfolio combinations. Regarding model-based performance, FF4F, FF5F, and FF6F each yielded the highest explanatory power in 3 portfolios, while FF3F proved less effective, leading in only 2.
The MCS results for the 10-industry group indicate that model support varies across portfolios, with no factor specification receiving consistent acceptance throughout the group. Portfolios such as HITEC, SHOPS, and TELCM provided particularly strong signals against FF3F and FF4F, with
-values falling below 0.25 in 4 and 5 cases, respectively (
Table 3). These outcomes suggest that the explanatory reach of the simpler specifications weakens precisely in the sectors where prediction is more challenging. The FF5F and FF6F models, on the other hand, retained admissibility in 6 and 7 portfolios and showed a more durable alignment with the loss based evaluation. This profile complements the
based perspective, which indicated that FF5F often extended its coverage across sectors without establishing complete superiority. As illustrated in
Figure 3, the distribution of MCS
-values reinforces this balance by showing repeated support for FF5F and FF6F, each of which remained admissible in 7 portfolios, whereas FF3F and FF4F were accepted in only five. Within this structure, the SHOPS portfolio stands out as a case where all models are rejected at the chosen threshold, reflecting the sensitivity of the procedure to loss differences despite the relatively high
values reported for this sector. The resulting pattern characterizes the 10-industry group as one in which the richer factor structures sustain broader statistical footing, while the simpler models face persistent limitations.
4.3. Analysis of 12-Industry Portfolios
The subsequent evaluation focuses on the 12-industry portfolio classification, which introduces additional sectoral distinctions relative to the 10-industry grouping. This expanded structure enables a more refined assessment of model–method performance within a more differentiated industry layout. The corresponding out-of-sample prediction results for this configuration are presented in
Table 4.
Building on the findings reported in
Table 4, the evaluation of the 12-industry portfolios group under the FF3F to FF6F models revealed a consistent dominance of the LR method across specifications. In the FF3F setting, LR provided the most accurate predictions in 5 portfolios, KNN in 2, and MLP in 1. Under FF4F, LR and SVR each led in 4 portfolios, KNN in 3, and MLP in 1. The FF5F model showed a similar structure, with LR again achieving the best results in 5 portfolios, MLP in 4, and SVR in 3. In the FF6F specification, LR was the most successful method in 6 portfolios, followed by MLP in 3, SVR in 2, and KNN in 1.
Aggregating results across all models and portfolios, LR achieved the highest number of top-performing predictions (20), followed by SVR (13), MLP (9), and KNN (6). This distribution highlights LR’s broad effectiveness while also underscoring the ability of MLP to perform strongly in specific configurations. When viewed from a model-focused perspective, FF5F demonstrated the broadest applicability, producing the highest values in five portfolios. FF4F followed with 3 top-performing cases, while both FF3F and FF6F yielded the best results in 2 portfolios each. This outcome indicates that the five-factor specification, which incorporates profitability and investment factors, delivered the strongest overall prediction accuracy across the portfolios group.
The most successful predictions were observed in the MONEY portfolio under the FF5F model (0.962, MLP), followed by OTHER under FF5F (0.948, MLP), MANUF under FF4F (0.942, SVR), and BUSEQ with FF4F (0.917, MLP), each delivering values above 0.90. Notably, all portfolios with explanatory power surpassing 90% were predicted most accurately by MLP, underscoring its ability to deliver high explanatory accuracy in suitable configurations. Additional strong results were recorded in CHEMS (0.895, LR, FF5F), SHOPS (0.887, LR, FF5F), and NODUR (0.845, MLP, FF5F). TELCM (0.797, SVR, FF6F), HLTH (0.796, KNN, FF4F), ENRGY (0.785, LR, FF3F), and DURBL (0.758, KNN, FF3F) produced moderately strong outcomes. Once again, the UTILS portfolio appeared at the lower end of predictive alignment, with values consistently falling between 0.482 and 0.522 across all model–method combinations. This continued underperformance reinforces earlier observations suggesting that the sector’s evolving structure and unique regulatory characteristics may undermine the effectiveness of models trained on historical data.
Viewed collectively, these outcomes reaffirm LR’s statistical reliability across a diverse set of industry portfolios, as it consistently outperformed other methods in the majority of cases. Nevertheless, the superior values achieved by MLP in several portfolios, especially those surpassing the 0.90 threshold, highlight the method’s ability to capture complex, nonlinear relationships that may elude linear models. The FF5F specification emerged as the most effective overall, producing the highest number of top-performing predictions, which underscores the value of incorporating profitability and investment factors into return modeling. Most portfolios were predicted with strong accuracy, often exceeding 0.84, with only the UTILS showing persistent underperformance. These findings indicate that the combination of factor-based models and ML methods, particularly when tailored to portfolio structure and complexity, can substantially enhance predictive capacity.
The MCS results for the 12-industry group indicate a broadly consistent pattern across the factor specifications, with FF3F, FF5F, and FF6F each remaining admissible in 9 portfolios and FF4F in 6. This distribution suggests that the FF3F, FF5F, and FF6F structures retained comparable statistical support across much of the group, while the FF4F model showed a more selective presence. The heterogeneity of outcomes becomes clearer in portfolios such as BUSEQ, TELCM, and SHOPS, where only subsets of the models were retained or, in the case of SHOPS, all specifications fell below the acceptance threshold. A comparable pattern was visible in the
results, where FF5F often attained strong performance without uniformly surpassing the alternatives. As reflected in the MCS
-value distribution in
Figure 4, FF3F, FF5F, and FF6F received consistent support across the group, whereas FF4F was removed more often. The complete rejection of all models for SHOPS reflects the sensitivity of the procedure to loss differences rather than an absence of predictive accuracy, given the relatively high
values for this sector. In sum, the broader evidence from the 12-industry group indicates that the richer factor specifications generally maintain a more stable statistical footing, while simpler structures exhibit greater variability across sectors.
4.4. Analysis of 17-Industry Portfolios
The analysis of the 17-industry portfolios provides insights into the variation in predictive performance across ML methods and Fama–French model specifications. The results, summarized in
Table 5, revealed distinct patterns across both algorithms and specifications. In the FF3F model, SVR delivered the most successful predictions in 9 portfolios, with the strongest performances recorded in OTHER (0.981) and FINAN (0.927), followed by KNN in 5, LR in 2, and MLP in 1. A similar trend continued under the FF4F model, with SVR maintaining its lead in 7 portfolios and achieving accuracies above 0.90 in OTHER (0.980) and FINAN (0.926), while KNN, LR, and MLP produced top results in 6, 3, and 1 portfolios, respectively. The distribution changes under the FF5F model, where MLP led in 6 portfolios, including the top-scoring FINAN (0.932), LR and SVR each succeeded in 5, and KNN in 1. Finally, under FF6F, MLP again dominated in 6 portfolios, followed by LR and SVR with 5 each, and KNN with 1. Across all model–portfolio combinations, SVR emerged as the most consistently effective method, securing 26 top predictions overall and ranking as the single best performer in 5 portfolios. LR achieved 15 top results and led in 4 portfolios, while MLP recorded 14 top results with 6 portfolio-level best performances. KNN, despite achieving 13 top predictions, was the leading method in only 2 portfolios.
Among the portfolios, the OTHER portfolio stood out with remarkably high predictive performance across all models, consistently exceeding a value of 0.98. Following closely, the FINAN portfolio also demonstrated robust explanatory strength, particularly under the FF5F specification with MLP, where its predictive power reached 0.932. Other portfolios such as MACHN with SVR (0.891) and FABPR with SVR (0.885) under FF4F, TRANS with SVR (0.890) under FF5F, and CHEMS with MLP (0.883) under FF6F, also exhibited strong forecasting performance, with correlation strength approaching 0.90. Consistent with earlier findings, the UTILS portfolio remained among the least predictable, with values ranging narrowly between 0.482 and 0.522 regardless of model or method. From a model-centric perspective, FF5F yielded the highest number of top-performing predictions, with 7 portfolios showing the best results under this specification. This was followed by FF3F with 4, and FF4F and FF6F with 3 each.
Considered as a whole, the results in
Table 5 underscore the versatility of SVR across varying model complexities and reaffirm MLP’s responsiveness in high-factor environments. Similar to patterns observed in earlier portfolio groups, the FF5F specification emerged as the most effective overall, producing the highest number of top-performing predictions. Most portfolios achieved strong predictive alignment, frequently above 0.84, while UTILS once again exhibited persistent underperformance, consistent with its behavior in other configurations. These outcomes indicate that extending factor structures and aligning model–method choices with portfolio characteristics can meaningfully enhance predictive capacity, particularly in structurally diverse industry groups.
Expanding the analysis to the 17-industry portfolios revealed a more varied pattern in how the factor specifications are supported across sectors. FF3F was retained in 14 portfolios and FF5F in 13, while FF4F and FF6F remained admissible in 10 portfolios each. The distribution of retention decisions differed notably across industries, CNSM and FINAN preserved only the richer specifications, whereas CLTHS and FABPR maintained preference for the simpler structures. DURBL represented the most extreme outcome, as all models fell below the threshold, indicating that the loss differences do not allow for the identification of a statistically non-inferior specification for this portfolio. The sector level contrasts presented in
Figure 5 underscore the breadth of support for FF3F and FF5F, while FF4F and FF6F showed a more selective pattern of retention across industries. This distribution reflects the way factor structures interact with the expanded portfolio set, with simpler and richer specifications each finding support in different parts of the cross section. Rather than pointing to a single dominant model, the evidence suggests that performance varies with sector characteristics, and that the FF3F and FF5F formulations sustain a broader statistical footing relative to the FF4F and FF6F alternatives.
4.5. Analysis of 30-Industry Portfolios
The performance analysis of FF3F to FF6F models across 30 U.S. industry portfolios highlights the continued strength of the SVR method, particularly in the more parsimonious model specifications. The corresponding results are provided in
Table 6. SVR achieved the highest number of top predictions in 14 portfolios under FF3F, and in 12 portfolios under both FF4F and FF5F. As model complexity increased, LR emerged as the most successful method in FF6F, leading in 10 portfolios. MLP remained highly competitive in deeper structures, recording 9 best outcomes under both FF5F and FF6F. KNN, while less prominent overall, showed notable effectiveness in FF4F with 7 leading results. Considering model-level performance regardless of method, FF5F provided the highest number of top-performing predictions with 11, followed by FF6F and FF4F with 8, and FF3F with 3.
A comprehensive count across all model–portfolio combinations revealed that SVR achieved the highest frequency of best-performing predictions with 45 cases, being the leading method in 10 portfolios. LR followed with 34 cases, topping 5 portfolios, while MLP attained 25 cases with 10 such outcomes, and KNN posted 16 cases with 5 portfolio-level best results. This overall distribution is reflected in several standout cases, such as WHLSL (0.936, MLP, FF5F), SERVS (0.932, LR, FF6F) and FIN (0.932, MLP, FF5F), which exhibited remarkably strong forecasting performance. With a few exceptions, most portfolios in the 30-industry configuration achieved values exceeding 0.70, confirming the general robustness of the applied models.
As seen from
Table 6, most portfolios preserved signal strength in return patterns that can be effectively captured by ML algorithms, even in the presence of increased heterogeneity across industries. Notable deviations from this trend were observed in the UTILS, SMOKE, and COAL portfolios. UTILS once again yielded explanatory power in the 0.48–0.52 range, consistent with its previously noted weak explanatory alignment. SMOKE displayed similarly low predictive performance, with
values around 0.56, which may reflect the disruptive impact of tightening public health regulations, increasing excise taxes, and growing legal obligations imposed on tobacco companies. COAL recorded the lowest fit quality levels among all portfolios, with
values falling below 0.46. This pronounced underperformance may be attributed to the sector’s structural decline amid the global transition to cleaner energy sources, diminishing the relevance of past return patterns rooted in fossil-fuel-based dynamics.
The analysis indicates that, as in previous cases, overall predictive performance remained strong, although the highest explanatory strength in this group reached approximately 0.93, slightly below the peaks observed in earlier portfolio groups. SVR once again emerged as the most prominent method in terms of model–portfolio combinations, yet at the portfolio level, MLP maintained a lead in securing the top results. This outcome may underscore the adaptability of MLP in deeper model structures and reaffirm the stability of SVR in more compact configurations. Furthermore, the consistent superiority of the FF5F specification lends strong support to the notion that incorporating profitability and investment factors substantially improves the explanatory strength of asset pricing models, especially when applied to a broad and heterogeneous range of industries.
The 30-industry MCS results revealed a configuration in which model support is neither concentrated nor evenly dispersed, but instead follows a tiered structure. FF5F stood out with 23 retained portfolios, confirming that this specification remained competitive across a wide span of sectors. FF3F and FF4F were each retained in 20 portfolios, showing comparable levels of acceptance despite representing distinct factor compositions. FF6F appeared less frequently, with 16 inclusions, suggesting a more limited scope under the current significance threshold. Certain industries, including BUSEQ, COAL, HSHLD, and TXTLS, did not retain any specification, indicating that none of the tested models offers sufficient statistical proximity to the non-inferiority region in these particular cases, a result consistent with the
-values falling below the 0.25 threshold applied in the MCS procedure. The retention decisions are reflected in
Figure 6, where FF5F appeared most consistently across the portfolios, and FF3F, FF4F, and FF6F followed with progressively narrower ranges. The strong position of FF5F aligns with its
-based performance in this group, where it frequently yielded high correspondence between the predicted and realized excess returns. The coexistence of retained and empty-set portfolios indicates that sector-specific characteristics influence the feasibility of identifying non-inferior models in this expanded setting.
4.6. Analysis of 38-Industry Portfolios
The analysis was extended to encompass 38 U.S. industry portfolios, as detailed in
Table 7, in order to further examine the evolution of predictive capacity across broader and more diversified industry structures. This expansion introduced a higher degree of structural complexity, offering a more rigorous test of each ML method’s adaptability. In this extended configuration, SVR emerged as the leading method overall, producing the highest number of top-performing results in 54 portfolios. Its predictive advantage was particularly evident under the FF3F and FF4F models, where it led in 17 and 15 portfolios, respectively. LR followed with 43 best predictions, most notably dominating under FF5F and FF6F with 13 and 14 top performances, respectively. MLP demonstrated competitive strength in deeper model structures, achieving 11 and 9 first-place results under FF5F and FF6F, respectively, for a total of 31. KNN, in contrast, remained the least frequent top performer with a total of 16 leading outcomes, lacking consistent sectoral alignment across models. When focusing on portfolio-level best performances, MLP ranked first in 13 portfolios, followed by LR in 11, SVR in 8, and KNN in 4, highlighting that the optimal method–portfolio matches vary depending on the interplay between algorithmic strengths and sector-specific features.
From a model-centered perspective, FF5F yielded the most accurate results overall, with 16 portfolios reaching the highest values under this specification. FF3F and FF6F followed, each achieving the best performance in 8 portfolios, while FF4F recorded the highest accuracy in 4. This trend reinforces the value of profitability and investment factors in capturing return behavior across diverse industry settings. The most successful model–method–portfolio combination was observed in the SRVC portfolio under FF6F, where LR achieved an explanatory strength of 0.946, reflecting near-perfect alignment between model structure and realized returns. Additional portfolios demonstrating high explanatory strength include WHLSL (0.936, MLP, FF5F), MONEY (0.932, MLP, FF5F), and MTLPR (0.916, SVR, FF4F), each demonstrating an explanatory power surpassing 0.90. These instances illustrate that top-tier predictive performance is not confined to a single algorithm, but rather emerges from sector-specific compatibilities with particular model architectures.
Across the 38-industry classification, a considerable portion of sectors exhibited
values above 0.70, indicating that meaningful return regularities remain detectable even as industry granularity increases. The detailed values in
Table 7 further show that several portfolios fell outside this general pattern. MINES and PHONE displayed weak explanatory alignment within this classification, with
values mostly below 0.56, while SMOKE and UTILS maintained the subdued performance observed in earlier portfolio groups. Taken together, these results suggest that sector-specific dynamics, ranging from idiosyncratic risk structures to regulatory or market-insensitive characteristics, continue to limit the compatibility of these industries with factor-based return models
The analysis of the configuration shows that, consistent with earlier portfolio groups, SVR retained its position as the most prominent method when evaluated across all model–portfolio combinations. However, when focusing on the highest-achieving methods within individual portfolios, MLP once again outperformed SVR by a clear margin. Unlike previous cases, LR matched MLP in the number of portfolio-level best performances, reflecting a more balanced distribution of method-specific strengths in this configuration. In line with prior results, the FF5F model continued to encompass the largest number of best predictions, reinforcing its robustness in capturing return dynamics. Overall, predictive performance remained high, with the majority of portfolios exceeding 0.75 in accuracy. Nevertheless, this group contained a slightly greater number of portfolios with performance levels around or below 0.60, a pattern that is likely attributable to the characteristics of the additional industries included in the group.
The MCS outcomes for the 38-industry group revealed substantial dispersion in model support. FF5F was retained in 31 portfolios, emerging as the most widely accepted specification in this expanded setting, while FF6F and FF3F followed with 27 and 29 retained portfolios, respectively. FF4F remained admissible in 23 portfolios, marking the lowest level of support among the four models. Several industries, including INSTR, FOOD, GARBG, TXTLS, and WOOD, produced empty superior sets, indicating that none of the tested factor structures satisfied the MCS retention criterion under the chosen threshold for these sectors. This outcome reflects the well-known sensitivity of the MCS procedure to the selected significance level, as stricter thresholds would allow at least one specification to remain admissible in these industries.
Figure 7 summarizes these outcomes by highlighting the broad coverage of FF5F and the sizeable acceptance of FF3F and FF6F across the industry set, while FF4F occupied a comparatively narrower range. The prominent standing of FF5F is in line with the
-based evaluation for this configuration, where the five-factor specification frequently achieved high predictive alignment, and the more moderate retention of FF4F and FF6F mirrors their intermediate roles in the correlation-based assessment. Considered across the full industry spectrum, these results indicate that no single specification dominated, although FF5F maintained the most extensive statistical footing, with the remaining structures exhibiting sector-specific variation in their support.
4.7. Analysis of 48- and 49-Industry Portfolios
The analysis proceeded to the 48- and 49-industry portfolio classifications, which represent the most granular sectoral structures examined. These extended configurations provide a detailed setting to evaluate how increasing cross-sectional heterogeneity influences the interaction between the FFVF models and the ML methods. The out-of-sample results for the 48-industry classification are presented in
Table 8. The results further highlight the sustained strength of the SVR method, which achieved the highest explanatory performance in 74 portfolio–model combinations, underscoring its broad adaptability across diverse settings. Under the FF3F specification, SVR achieved the highest number of leading results with 21 portfolios, clearly ahead of LR, MLP, and KNN, each with 9. In FF4F, SVR maintained its dominance with 17 portfolios, followed by KNN with 13 and LR with 12. The FF5F configuration further reinforced SVR’s leading position, again securing 21 portfolios, with MLP emerging as the closest contender at 16. In FF6F, SVR continued to lead in total winning combinations, with LR and MLP showing competitive performance in several portfolios. Consistent with earlier results, while SVR maintained its dominance in the total count of winning model–portfolio combinations, an assessment based on the best-performing method within each portfolio once again placed MLP at the forefront, leading in 16 portfolios, followed by SVR in 13, KNN in 10, and LR in 9. In terms of models, FF5F delivered the largest number of top results with 15 portfolios, followed by FF4F and FF6F at slightly lower counts.
The overall predictive performance for most portfolios remained above 0.75, as in previous cases. In particular, BANKS, BLDMT, BUSSV, FIN, and WHLSL registered explanatory strength surpassing 0.90, a level attained mainly under the FF5F and FF6F specifications, where the leading method exhibited especially strong alignment with realized returns. Consistent with the patterns observed in earlier portfolio classifications, the SMOKE, UTILS, and COAL sectors continued to exhibit weak explanatory alignment, typically in the 0.50 range, while in this configuration, GUNS also fell into this underperforming group. The limited predictive performance for the portfolio may stem from its distinct risk–return profile, shaped by defense-industry-specific factors such as government procurement cycles, geopolitical events, and regulatory constraints. In addition, GOLD exhibited extremely limited predictive capacity under the tested factor structures, a result likely attributable to its distinctive role as a crisis-hedging and macro-economically independent asset, with return patterns shaped by drivers beyond the scope of the FFVF models.
Viewed in aggregate, the results reaffirm the recurring advantage of MLP at the portfolio level despite SVR’s frequent appearance as the leading method in individual model–portfolio combinations. The continued superiority of the FF5F specification reinforces the earlier conclusion that the inclusion of profitability and investment factors enhances the explanatory strength of asset pricing models, even within an expanded and highly heterogeneous industry classification. Consistent with previous results, the vast majority of portfolios achieved a correlation strength above the 0.75 level, with only one portfolio remaining effectively unpredictable under all tested configurations.
The outcomes for the 49-industry classification, which differed from the 48-portfolio group only in that the COMPS portfolio was disaggregated into HARDW and SOFTW, largely mirror those obtained for the previous portfolio group, with comparable distributions of method success across the FFVF specifications (
Table 9). Across all model–portfolio combinations, SVR recorded four additional first-place outcomes relative to the previous configuration. Within the best-performing category, SVR showed a marginal increase in FF4F and FF5F by 1 portfolio, while results for FF3F and FF6F remained unchanged. Among the models, FF5F again emerged as the most successful, this time with 18 top-performing portfolios, marking a clearer distinction from its counterparts. At the portfolio level, MLP retained its position as the leading method, also with 18 portfolios, thus achieving a more pronounced lead over the other methods. In other respects, the performance patterns also closely aligned with those observed in the 48-industry portfolio group.
When the 48-industry structure was expanded to the 49-industry configuration through the separation of the COMPS portfolio into HARDW and SOFTW, the MCS
-values were as reported in
Table 8 and
Table 9, respectively. These results show that FF3F and FF5F satisfied the non-inferiority requirement in 33 industries, while FF4F and FF6F remained admissible in 32 and 27 industries. Several sectors, including COAL, HARDW, HSHLD and TXTLS, admitted no specification since all associated
-values fell below
, leaving no model that satisfied the conditions necessary to remain in the superior set. The leading performance of FF5F identified in the
-based evaluation gained statistical confirmation under the MCS procedure, because the model was admitted in 33 industries and could not be statistically distinguished from the strongest alternatives within the expanded cross-section. This relationship is also reflected in
Figure 8, where FF5F formed the most concentrated cluster of high MCS
-values across both the 48- and 49-industry settings. This evidence indicates that the strong predictive alignment attributed to FF5F in the
analysis is supported by the non-inferiority outcomes, reinforcing its role as the most consistently viable specification in the broader industry universe.
4.8. Comparative Performance Evaluation
A comparative assessment of predictive performance across all portfolio classifications, spanning the 5- to 49-industry structures, is presented in
Table 10. The cumulative results revealed the consistent dominance of the SVR method, which achieved the highest number of best-performing predictions in 311 out of 828 model–portfolio combinations (‘All’). This dominance became more pronounced in more complex and detailed portfolio classifications, indicating that SVR’s generalization capability is particularly strong under higher structural granularity. LR followed with 221, while MLP and KNN lagged behind with 174 and 122, respectively.
Across portfolio granularities, as given in
Table 10, SVR recorded the highest number of first-place outcomes in both the overall and single-highest categories for the 5-, 17-, 30-, 38-, 48-, and 49-industry groups. LR held an advantage in the 10- and 12-industry groups, particularly in the single-highest category. This pattern suggests that LR is more competitive in less complex portfolio structures, where the model can capture relationships without the noise introduced by broader classifications. MLP’s competitive position strengthened in the most granular configuration of the 48- and 49-industry groups, where it outperformed LR in overall top-ranked results. All these patterns underscore the adaptability of MLP in deeper and more complex portfolio structures, while reaffirming the stability of SVR in more compact configurations.
While SVR maintained clear dominance in the overall count of top-performing results (‘Best’), the ranking shifted in cases where the method was the most accurate predictor for a given portfolio. In this measure, which represents situations where the method was the most accurate predictor for a given portfolio, MLP emerged as the leading approach in 70 portfolios, outperforming all other methods in this regard. This finding is consistent with Gu et al. [
33], who compared regression-based models, tree-based ensembles, and neural networks, and concluded that, among all methods, MLP achieved the highest predictive accuracy. They further emphasized that shallow learning outperformed deep learning, which differs from typical results in various fields because financial return prediction is characterized by limited effective data and a low signal-to-noise ratio in asset pricing problems. In line with this evidence, it is worth emphasizing that MLP retained this first-place position across all portfolio groupings except the 5-industry configuration, underscoring its adaptability and competitive strength under a wide range of structural complexities. Its advantage became even more evident as portfolio structures broadened, pointing to a performance profile that benefits from richer and more diverse sectoral compositions. In comparison, SVR followed with 54 such instances, indicating that although it frequently appeared among the top performers, it was less often the outright best predictor for an entire portfolio. LR achieved this distinction in 47 portfolios, with particular strength in the 10- and 12-industry configurations, whereas KNN reached first place in 36 portfolios, reflecting a comparatively modest record in securing outright predictive leadership.
Portfolios with
values exceeding 0.90, which are summarized in
Table 11, appeared consistently across configurations, demonstrating strong predictability. In the 5- and 10-industry groups, OTHER, HITEC, and MANUF were the most predictable sectors. For the 12-industry portfolios, MONEY and BUSEQ joined this group, while in the 17-industry classification, OTHER and FINAN achieved similarly high accuracy. As the portfolio scope expanded, additional sectors reached this predictive performance. In the 30-industry case, WHLSL, FIN, and SERVS achieved
values above 0.90, followed in the 38-industry set by SRVC, MONEY, MTLPR, and WHLSL. Finally, in the 48-industry classification, WHLSL, BUSSV, FIN, BANKS, and BLDMT consistently surpassed a
value of 0.90, while in the 49-industry classification, SOFTW was additionally included among these sectors.
The results in
Table 11 provide a consolidated view of the predictive performance of the four ML methods across all portfolio groupings under the FF3F to FF6F specifications. In terms of total number of portfolios in which a given method produced more successful predictions than the others within the same model (‘All’), SVR held a clear lead, producing the highest number of best-performing results in 92 portfolios under FF3F and in 79 portfolios under both FF4F and FF5F. Evidence from prior studies also confirms that SVR performs strongly within the three-factor framework [
49]. Under FF6F, LR moved into the lead with 68 portfolios, while SVR remained in second place with a relatively close margin, followed by MLP with 57, which also represented a competitive outcome. In FF3F, LR followed SVR with 49 portfolios, while in FF4F, KNN ranked second with 55. In FF5F, MLP advanced to second place with 64 portfolios, approaching the level of SVR. These results indicate two broad adaptation tendencies across the methods. SVR showed consistently strong overall portfolio coverage across all specifications, with its highest performance observed under FF3F, while FF6F representing its relatively weakest outcome. Consistent with Diallo et al. [
43], these findings suggest that SVR tends to perform more successfully under simpler factor structures. In contrast, MLP and LR performed particularly well under FF5F and FF6F, underscoring their suitability for richer factor structures. KNN, however, remained weaker overall, with its relatively better outcomes confined to the simpler FF3F and FF4F specifications, while its performance declined markedly in FF5F and FF6F, which attests to the well-documented sensitivity of nearest-neighbor methods to higher-dimensional and more heterogeneous settings, although the practical severity of this issue depends on the intrinsic dimensionality of the data [
64].
When the analysis considered the highest-performing predictions for each portfolio (‘Best’), MLP clearly dominated under the FF5F specification with 30 portfolios, well ahead of the other methods. LR and SVR followed both with 19 in FF5F, while KNN diverged from the others by attaining its strongest performance under FF4F with 17. The table further indicates that SVR also performed relatively strongly in FF6F with 18, close to its peak of 19 in FF5F, and MLP recorded its second-best outcome in FF6F with 20, which, although not comparable to its dominance in FF5F, still indicates a robust level of adaptation to the extended factor structure. Similarly, LR’s best performance was observed in FF5F, but its results in FF6F, though still competitive, were more modest. Taken together, the particularly strong showing of SVR, LR, and MLP in FF5F, combined with their relatively competitive results in FF6F, indicates their ability to adapt to broader and more diverse portfolio structures. In contrast, KNN’s isolated success in FF4F reflects a narrower adaptability, illustrating that methods with limited flexibility are better aligned with simpler factor structures but fail to generalize to more complex settings. From the perspective of factor models, FF5F generated the highest overall number of best predictions with 74 across all methods, underscoring its superiority as the most effective factor configuration. FF4F, while recording a markedly lower performance than FF5F, still ranked second with 50 best predictions, followed closely by FF6F with 49.
In light of the
Table 11 results, sectoral differences in forecastability become evident, reflecting the notion that the explanatory strength of Fama–French factor model is not uniform across industries [
32]. Portfolios with
values above 0.90, defined as those reaching this level in their best-performing model–method combination, appeared consistently across specifications, with HITEC and OTHER recurring frequently among the most predictable sectors in smaller portfolio classifications. Other sectors including BUSEQ, MONEY, FINAN, FIN, BUSSV, SERVS, WHLSL, MTLPR, and SRVC also reached similarly high levels of predictive accuracy and BANKS, CNSMR, BLDMT, and SOFTW joined this group in the broader classifications. These results confirm that certain sectors maintain high forecastability regardless of the underlying model specification, although the set of highly predictable sectors expands as the portfolio structure becomes more granular. At the other end of the spectrum, several portfolios displayed limited alignment, with
values falling below 0.60; UTILS, COAL, and SMOKE consistently remained in this range, while GOLD stood out as cases where predictability was almost entirely absent, underscoring the structural or non-market driven dynamics of these industries. For some sectors, this pattern can be interpreted in light of Jan and Ayub [
40], who reported that neural networks achieved predictive accuracy well above 0.98 in high- and medium-beta portfolios but dropped to around the 70s for low-beta portfolios. Reflecting this distinction, sectors commonly identified as low-beta, such as Utilities (UTILS) and Consumer Staples (SMOKE), ranked among the least predictable, whereas high-beta industries including Information Technology (HITEC, BUSEQ, SOFTW, ELCEQ), Financials (FIN, FINAN, BANKS, MONEY), and Industrials (BUSSV, MANUF, MTLPR) frequently achieved
values above 0.90 [
65]. Wholesale Trade (WHLSL) also behaved similarly to cyclical high-beta sectors, often yielding strong predictive accuracy. Beyond these general patterns, some portfolios displayed distinct shifts across groupings. A particularly notable case was MANUF, which consistently achieved
values above 0.90 in smaller portfolio groups but dropped to 0.61 in the 38-industry configuration. In the opposite way, MINES generally recorded accuracy around the 0.50 level but in the larger 48- and 49-industry groups, its performance improved, approaching values near 0.70. These contrasting outcomes indicate that changes in sectoral composition within broader portfolio classifications can materially alter prediction accuracy.
Overall, the results underline a clear association between factor structure complexity and method-specific strengths. FF5F consistently yielded the highest number of best predictions across portfolios, reinforcing its role as the most robust factor configuration, a result also supported by prior evidence showing the superiority of the five-factor model in both developed and emerging markets [
20,
22,
28,
46]. MLP’s ability to exploit this setting, together with its strong performance in FF5F, reflects its adaptability to richer and more heterogeneous portfolio compositions. Related empirical evidence is provided by Drobetz and Otto [
42], who also found that MLP performed best in factor regressions under the FF5F model and retained superior net performance after accounting for transaction costs. While this highlights the strength of MLP within the FF5F setting, SVR, while dominant in overall counts, exhibited slightly weaker outright leadership in highly granular structures. LR delivered stable performance, particularly under FF5F and FF6F, whereas KNN’s isolated success under FF4F highlights that optimal method selection depends on the structural scope and sectoral composition of the portfolios. These performance differences align with well-documented economic characteristics of the respective industries. Sectors with stronger exposure to profitability, investment intensity, and technological innovation such as HITEC, MONEY, and MANUF tend to benefit more from richer factor structures like FF5F and FF6F as well as from flexible nonlinear learning algorithms such as MLP. Meanwhile, SVR, which dominates in overall counts and performs particularly well under more compact factor setups, captures more stable relationships in industries whose return dynamics rely less on nonlinear interactions and more on persistent, broad-based patterns shaped by market-wide or macroeconomic forces. In contrast, more regulated, low-volatility, or structurally rigid industries including UTILS and ENERGY generally exhibit limited responsiveness to both factor-based models and machine-learning methods, reflecting their lower factor sensitivity and weaker transmission of underlying risk premia into realized excess returns.
5. Conclusions
The challenge of accurately forecasting asset returns remains central to both academic research and practical investment management, given its critical implications for risk assessment and portfolio optimization. Addressing this issue is essential for improving market efficiency and supporting more informed investment decisions. Against this backdrop, the present study investigated the predictive performance of four ML methods, specifically SVR, LR, MLP, and KNN, across four distinct Fama–French multifactor models (FF3F to FF6F), where the 3-factor model includes market, size, and value factors; the 4-factor model incorporates momentum; the 5-factor model adds profitability and investment; and the 6-factor model further includes a low-risk factor, and evaluated them over a broad set of U.S. industry portfolio configurations, which consisted of 5-, 10-, 12-, 17-, 30-, 38-, 48-, and 49-sector groupings. The analysis was conducted within a rigorous out-of-sample prediction framework, based on monthly return data from 1992 to 2022 covering firms listed on NYSE, AMEX, and NASDAQ, obtained from the Fama–French data library, obtained from the Fama–French data library, which enabled an extensive assessment of forecasting capabilities across varying combinations of models, methods, and portfolios. To reinforce the robustness of the comparative findings, statistical validation was additionally performed through the MCS test, which examined the significance of performance differences among the Fama–French model specifications.
The findings indicate that both the structure of the factor model and the choice of ML method significantly affect predictive performance. SVR achieved the highest number of top-performing model–portfolio combinations, winning in 311 out of 828 cases, followed by LR with 221 and MLP with 174. However, when evaluated by the number of portfolios in which each method yielded the most accurate predictions, MLP emerged as the leading technique in 70 portfolios, surpassing SVR with 54 and LR with 47. Notably, MLP ranked first across all portfolio groups except the smallest, underscoring its adaptability to more granular and structurally diverse configurations.
According to the model–method alignment, FF5F matched most effectively with MLP, which generated the best predictions in 32 portfolios, followed by LR in 20 and SVR in 19, highlighting the compatibility of the five-factor structure with multiple ML techniques. The FF6F specification also delivered competitive outcomes, where MLP recorded 21 top predictions and SVR 18, indicating that both methods were able to adapt relatively well to the extended six-factor structure. LR yielded stable performance across most settings, particularly under FF5F and FF6F, in line with the other leading methods. KNN, in contrast, performed best under the FF4F specification, indicating a limited capacity to adapt to richer factor environments. Therefore, MLP, SVR, and LR demonstrated high effectiveness in deeper structures with greater complexity, whereas KNN achieved strong results under simpler models. When considering the overall counts of portfolios in which a method outperformed the others within the same model, SVR exhibited the opposite pattern, showing stronger alignment under simpler specifications, most notably FF3F, and also under FF4F and FF5F.
Among the competing model structures, FF5F, which integrates profitability and investment factors, demonstrated the strongest predictive capacity. It achieved the highest number of best predictions across portfolios, totaling 77, and progressively outperformed other models as portfolio complexity increased. In particular, it shared similar performance levels with FF4F in the 5- and 10-industry groups, but expanded its lead substantially in more granular configurations. This dominance is further supported by the MCS results, which showed that FF5F remained non-inferior in the broadest set of industries and therefore passed the elimination steps of the procedure more consistently than its competitors. This persistent admissibility indicates that the specification’s superior performance is not merely sample-specific but reflects statistically robust differences in loss behaviour across the expanded industry universe. As for the other specifications, FF6F generally ranked second, while FF4F often achieved a comparable number of successful predictions and occasionally outperformed FF6F in specific portfolio sets. FF6F generally ranked second among models, while FF4F often achieved a similar number of successful predictions and occasionally outperformed FF6F in certain portfolio sets. These patterns suggest that incorporating profitability and investment dimensions contributes meaningfully to explanatory strength, particularly in the presence of increasing industry heterogeneity.
The analysis revealed substantial variation in sectoral forecastability. Portfolios such as OTHER (in the 5-, 10-, 12-, and 17-industry groups) and WHLSL (in the 30-, 38-, 48-, and 49-industry groups) consistently achieved
values above 0.90 across all ML methods, highlighting their strong compatibility with multifactor structures. Similarly, MANUF (in the 5-, 10-, and 12-industry portfolios) and FIN (in the 30-, 48-, and 49-industry portfolios) surpassed 0.89 in every ML application, while HITEC (in the 5- and 10-industry portfolios), MONEY (in the 12- and 38-industry portfolios), and BANKS, BUSSV, and BLDMT (in the 48- and 49-industry portfolios) recorded values exceeding 0.88. Additional sectors including BUSEQ (in the 12-industry portfolios), FINAN (in the 17-industry portfolios), SERVS (in the 30-industry portfolios), SRVC and MTLPR (in the 38-industry portfolios), and SOFTW (in the 49-industry portfolios) also demonstrated
values above 0.88 across all ML configurations, confirming the robustness of these methods across diverse sectoral structures. In contrast, the industries UTILS, COAL, GOLD, and SMOKE repeatedly underperformed, with
values often remaining below 0.56. These persistent shortcomings may be attributed to structural or regulatory characteristics, such as limited exposure to market-based dynamics, policy insulation, or declining relevance, which constrain the ability of factor-based models to capture return behavior in these sectors. Indeed, studies have shown that heavy regulation increases intra-industry comovement, implying that regulatory shocks rather than market forces often drive price dynamics in such sectors [
66]. Such limitations may not only arise from structural conditions, but also from sector-specific characteristics that shape the alignment between risk exposures and realized returns. Prior evidence indicates that extraordinary events can suppress the relevance of standard factors at the industry level, as profitability lost significance in both the U.S. steel sector and the U.S. energy sector during the COVID-19 period, while other premiums such as size and value remained effective [
23,
25]. On the other hand, crisis effects are not uniform across industries, as certain sectors were adversely affected while others remained resilient, with healthcare funds maintaining strong explanatory alignment during the COVID-19 period [
28]. Broader macroeconomic influences, particularly interest rates and inflation, also exert strong pressures on returns, limiting factor alignment in industries that are heavily regulated or rate-sensitive [
26].
The shortcomings of classical factor models in fully explaining return dynamics underscore the need for more flexible approaches, and ML techniques offer a compelling framework in this regard [
48,
50]. By implementing such methods, the present study demonstrates that ML can achieve superior predictive performance across diverse sectoral structures. These findings reaffirm the practical utility of incorporating ML approaches into asset pricing and return forecasting, particularly when combined with robust and comprehensive factor structures. Furthermore, the results emphasize the critical role of model architecture and algorithm choice in achieving strong explanatory outcomes, especially under conditions of heightened industry heterogeneity, as also highlighted by Kwon [
52], who showed that the relative importance of factor specifications can vary across different ML algorithms. In conclusion, this study provides empirical evidence supporting the complementarity of ML techniques and multifactor models in capturing the return dynamics of a broad range of sectors, providing empirically grounded insights for refining asset pricing models and advancing predictive applications in finance. The evidence also challenges a strict interpretation of the efficient market hypothesis by demonstrating systematic and replicable predictive accuracy across decades of U.S. capital market data. Nevertheless, the results remain contingent upon the choice of ML method, the temporal scope of analysis, and the structural features of each industry group. From a practical standpoint, this study offers actionable insights for financial institutions, portfolio managers, and investors seeking to enhance return estimation and risk assessment through the integration of ML methods into multifactor frameworks. Moreover, understanding the relative performance of alternative factor structures may serve as a valuable decision-support mechanism in both academic research and professional investment contexts. In this regard, the demonstrated improvement in predictive accuracy through ML applications contributes to more reliable investment decisions under uncertainty, supporting the reliability of long-term investment strategies.
Beyond these implications, the study also contributes to the asset pricing literature by combining multifactor pricing theory with algorithmic learning across a wide array of sectoral return data, offering a robust empirical benchmark for future applications of ML in asset pricing. These findings further bear practical implications for asset managers seeking data-driven techniques capable of adapting to varying degrees of industry heterogeneity, thereby supporting more reliable model selection in return prediction tasks and providing insights not only for investors but also for firms aiming to align financial performance with long-term stability and strategic objectives. While the results are encouraging, the study is subject to certain limitations. Notably, the analysis did not incorporate macroeconomic indicators or sentiment-based predictors, which may enhance forecasting capacity when integrated with existing factor structures. Future research may benefit from developing hybrid ML architectures, incorporating macro-financial or sentiment-based variables, and applying advanced deep learning techniques to alternative portfolio configurations. In addition, future studies could extend the current framework by jointly forecasting both expected portfolio returns and the risk-free rate. Such an extension would enable comprehensive risk–return assessment through financial performance measures, such as the Sharpe ratio, within a consistent predictive evaluation framework in portfolio management.