1. Introduction
It is well known that, from a traditional business perspective, accounting-based ratios are the fundamental criteria for measuring financial performance. However, crises of different origins, such as climate risk, social inequalities, corporate scandals, and management failures, reveal that profitability cannot be explained solely by the income statement and balance sheet. Complementary sources of information have therefore become necessary, and businesses have become familiar with the ESG approach, which encompasses environmental, social, and governance dimensions. This approach provides businesses with a framework for assessing their risk profiles, strategic orientations, and the sustainability of their performance.
In this context, a review of the literature reveals that there is no consensus among studies examining the impact of ESG indicators on financial performance. Particularly for performance metrics that directly concern investors, such as Return on Equity (ROE), the impact of ESG indicators has not been clearly established. A likely reason is that existing studies are predominantly based on linear regression models and linear propositions.
The increasing use of machine learning methods in the literature contributes to the modelling of non-linear relationships. Nevertheless, significant methodological limitations are apparent. The majority of studies focus on a single algorithm, and performance differences between models are not statistically tested. As a result, the impact of ESG indicators still cannot be explained clearly.
The primary motivation behind this study is to test, in a generalizable manner, whether ESG indicators can provide information beyond financial ratios for ROE prediction. To this end, two separate models are compared. The first model incorporates only financial ratios, while the second adds sustainability and governance signals to those ratios. In this study, governance signals refer to firm-specific corporate governance disclosures (such as board structure and audit-related information), which are treated separately from aggregated ESG “G” components in order to capture more granular governance effects. The most important feature that distinguishes this study from others is the use of the Bootstrapped Grouped Cross-Validation approach, which allows model performance and performance differences to be reported with statistical uncertainty and thereby addresses a gap in the literature. Furthermore, the combined use of different tree-based algorithms (Random Forest, XGBoost, LightGBM, and CatBoost) enhances methodological robustness by testing the algorithm-sensitivity of the findings. Consequently, this study contributes to the literature by testing the additional information value that sustainability and corporate governance indicators provide for ROE prediction within a multi-algorithmic, information leakage-resistant, and statistically reliable comparison framework. From a conceptual perspective, financial performance reflects not only firms’ operational efficiency but also their governance quality, risk management practices, and strategic sustainability orientation. Accordingly, indicators related to profitability, liquidity, leverage, and corporate governance are expected to capture complementary dimensions of firm performance, making them suitable predictors within a machine learning framework.
Based on the above discussion, this study tests the following hypotheses:
H1. A model incorporating sustainability and corporate governance indicators provides higher out-of-sample predictive performance for ROE than a model based solely on financial ratios.
H2. The improvement in out-of-sample predictive performance obtained by adding sustainability and governance indicators to traditional financial ratios is statistically significant but limited in magnitude.
H3. The contribution of sustainability and governance indicators to ROE prediction is algorithm-sensitive and more pronounced in boosting-based tree models.
3. Literature Review
Table 1 summarises, in chronological order, selected empirical studies examining the relationship between ESG indicators, corporate governance, and financial performance. When current studies are evaluated thematically, the literature is seen to be concentrated on four main axes.
The first strand of the literature consists of studies that directly test the relationship between ESG performance and firm profitability and financial outcomes. A significant portion of these studies suggests that ESG indicators may be positively related to accounting-based performance measures such as ROA, ROE, and EBIT (e.g.,
De Lucia et al., 2020;
Herman et al., 2025;
D’Amato et al., 2021). However, this literature generally examines the impact of ESG on financial performance through a single model or limited set of methods; while partially capturing non-linear structures, it does not assess the statistical reliability of performance differences between models.
The second group of literature focuses on modelling and explaining ESG indicators or ESG scores using machine learning and deep learning methods. Studies in this context show that ESG rating processes can be re-estimated (
Del Vitto et al., 2023), ESG data can be analysed using NLP and deep learning techniques (
Lee et al., 2022), and the determinants underlying ESG scores can be revealed using explainable ML tools (
Kim & Lee, 2025;
Pei, 2025). While these studies offer significant methodological contributions to the ESG measurement process, they do not directly test the marginal contribution of ESG to financial performance prediction and mostly focus on the internal structure of a single model.
The third strand of literature examines the impact of ESG and sustainability disclosures on forecast accuracy, information asymmetry, and prediction errors. Studies on analyst forecasts and financial prediction accuracy show that sustainability disclosures can improve the information environment and reduce forecast errors (
Acheampong & Elshandidy, 2025). Conversely, some studies reveal that ESG’s contribution to financial forecast accuracy may be limited or model-sensitive (
Dincă et al., 2025;
Dossa et al., 2025). A key methodological issue in this literature is the failure to account for firm-level dependence in panel-like data structures and the insufficient consideration of the risk of information leakage that may arise in the training–testing split. Furthermore, supporting performance differences between models with confidence intervals and statistically testing them is largely neglected.
The final group of literature focuses on systematic reviews and bibliometric analyses of studies in the AI–ESG field (
Davidescu et al., 2025;
Ferrari et al., 2025;
Mohsin & Nasim, 2025). These studies show that the ESG and machine learning literature has grown rapidly since 2015, with explainability, risk management, and ESG measurement standards coming to the fore. However, these reviews emphasise that the fundamental problem in the field is the lack of data standardisation and methodological consistency; they offer limited contributions to model comparisons at the empirical level and the statistical reliability of prediction performance.
When the current literature is evaluated overall, it is seen that the majority of studies have structural limitations, such as relying on a single model, not using group-based validation techniques that prevent information leakage, not statistically testing performance differences between models, and not testing the additional information value that ESG provides to financial performance from a generalisability perspective.
This study systematically tests the marginal information value that ESG provides to ROE prediction by comparing, on the same sample, a basic model based solely on financial ratios with an enriched model that adds sustainability/governance indicators to those ratios. Grouped Cross-Validation prevents information leakage arising from firm-level dependencies, and the statistical reliability of model performance differences is assessed through bootstrap-based confidence intervals and matched (paired) tests, which allows systematic information gains to be distinguished from random variation. Furthermore, applying SHAP analysis to all models allows for examining whether variable contributions are consistent across models rather than model-specific. Unlike prior studies, this paper explicitly quantifies the marginal contribution of sustainability and governance indicators by formally testing performance differences between competing models under a grouped cross-validation design. By combining multi-algorithm benchmarking, leakage-aware validation, and bootstrap-based statistical inference, the study moves beyond descriptive model comparisons and provides reproducible evidence on whether ESG-related information adds economically and statistically meaningful predictive value. The economic dimension matters because even modest gains in ROE prediction accuracy may translate into substantial differences in firm valuation, risk assessment, and investment decision-making. Accordingly, the reported improvements are intended to reflect both statistical reliability and economically meaningful effect sizes rather than model-specific artifacts. In these respects, the study offers a unique and robust contribution to the ESG–financial performance literature, both methodologically and empirically.
4. Methodology and Approach
The aim of the study is to test whether sustainability/governance indicators provide additional information value in predicting companies’ financial performance (ROE) beyond financial ratios. To this end, two model families were compared on the same sample: (i) the basic model based solely on financial ratios (Model 1) and (ii) the enriched model that includes financial ratios and sustainability/governance indicators derived from companies’ annual reports (Model 2). Tree-based ensemble models were selected because of their ability to capture nonlinear relationships and higher-order interactions commonly observed in financial data. In particular, Random Forest benefits from variance reduction through bagging, while boosting-based methods (XGBoost, LightGBM, and CatBoost) sequentially improve weak learners by focusing on residual errors, allowing governance and sustainability signals to interact with financial ratios in a more flexible manner. This design enables a direct assessment of whether ESG-related variables provide incremental predictive content beyond traditional financial fundamentals.
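The idea that boosting improves weak learners by fitting residual errors can be illustrated with a minimal sketch. It is not XGBoost, LightGBM, or CatBoost; it assumes squared-error loss, a single synthetic feature, and one-split regression trees ("stumps"). All names (`fit_stump`, `boost`) and the data are hypothetical.

```python
import numpy as np

def fit_stump(x, r):
    """Find the single threshold on x that minimises squared error for targets r."""
    best = None
    for thr in np.unique(x)[:-1]:  # drop the maximum so both sides are non-empty
        left, right = r[x <= thr], r[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    _, thr, lo, hi = best
    return thr, lo, hi

def predict_stump(stump, x):
    thr, lo, hi = stump
    return np.where(x <= thr, lo, hi)

def boost(x, y, n_rounds=50, lr=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    pred = np.full_like(y, y.mean(), dtype=float)
    history = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)            # weak learner on residuals
        pred = pred + lr * predict_stump(stump, x)  # shrunken update
        history.append(float(np.mean((y - pred) ** 2)))
    return pred, history

# Synthetic, illustrative data only (not the study's data).
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = np.sin(2 * x) + rng.normal(scale=0.1, size=200)
pred, mse_history = boost(x, y)
```

The training error recorded in `mse_history` falls as rounds accumulate. The production libraries add regularisation, subsampling, and multi-feature trees on top of this basic loop.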
To establish the forecasting framework and reduce information leakage, the dependent variable is defined as ROE(i,t) (ROE for 2024 in this study), and all independent variables are taken from period t − 1 (2023 in this study). Thus, the 2024 ROE forecast is produced using only financial information disclosed to the public at the end of 2023 and the governance/sustainability disclosures in the 2023 annual report.
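The t − 1 to t alignment described above can be sketched as follows. This is a hedged illustration assuming a long-format firm-year table; the column names `firm`, `year`, and `roe`, the helper `build_lagged_sample`, and every value are hypothetical. Only the predictor labels ICRT and ILEV come from the study's variable list.

```python
import pandas as pd

# Hypothetical firm-year panel; values are illustrative, not from KAP.
panel = pd.DataFrame({
    "firm": ["A", "A", "B", "B", "C"],
    "year": [2023, 2024, 2023, 2024, 2023],
    "roe":  [0.10, 0.12, 0.05, 0.03, 0.20],
    "ICRT": [1.5, 1.6, 0.9, 1.0, 2.1],
    "ILEV": [0.4, 0.5, 0.7, 0.6, 0.3],
})

def build_lagged_sample(df, target_year, features):
    """Pair each firm's ROE in target_year with its own predictors from the prior year."""
    target = (
        df.loc[df["year"] == target_year, ["firm", "roe"]]
        .rename(columns={"roe": "roe_t"})
    )
    lagged = df.loc[df["year"] == target_year - 1, ["firm"] + features]
    # Inner join: firms missing either year are dropped rather than imputed.
    return lagged.merge(target, on="firm", how="inner", validate="one_to_one")

sample = build_lagged_sample(panel, target_year=2024, features=["ICRT", "ILEV"])
```

Firm C has no 2024 observation and is therefore excluded. No predictor column in `sample` contains information dated later than 2023.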
Financial data and annual reports were obtained through the Public Disclosure Platform (KAP) operating in Turkey. KAP is an official public platform where publicly traded companies disclose their financial reports and news to the public in accordance with the legislation. The Model 1 dataset (financial ratios) and Model 2 dataset (financial ratios and governance/sustainability indicators) belong to a total of 428 companies. Banks and insurance companies were not included in the study due to differences in the presentation of financial statements. The sample structure across these different sectors allows us to test whether the findings are specific to a particular industry. Thus, the proposition that “governance/sustainability indicators add value to ROE forecasting” can be tested not only in a single sector but also in a broader corporate environment. In terms of analytical methods, the sample structure of companies within the same sector also produces natural variance compared to the sample structure across different sectors, particularly in financial ratios and reporting practices. This variance expands the information content that machine learning models can “learn.” Consequently, it may be misleading regarding whether the added governance/sustainability indicators are truly discriminative. Furthermore, the sample structure of companies in the same sector may sometimes be captured as an “easy signal” by machine learning methods when making predictions, due to sector-specific circumstances, and show high success. In a multi-sector data set, the model must learn generalizable patterns because it does not rely on the patterns of a single industry. This subjects model performance to a “more difficult but more reliable” test.
The variables of Model 1 and Model 2 designed in the study are provided in Table 2 and Table 3.
The selection of the dependent variable (ROE) and independent variable sets (financial ratios and sustainability/governance indicators) in the study is based on the literature stating that financial performance can be explained through both accounting-based profitability drivers and corporate governance/transparency channels, in line with the “predicting profitability in period t using t − 1 information” approach.
The primary reason for selecting ROE (return on equity) as the dependent variable is that ROE is one of the most common accounting-based performance measures reflecting the company’s periodic value creation from the shareholders’ perspective. The continuity of profitability components over time and their predictive power for the future have also been systematically examined in the financial ratio analysis literature (
Nissim & Penman, 2001). ROE is considered a fundamental “outcome variable,” particularly in valuation and profitability analyses (based on DuPont logic). Furthermore, ROE is a frequently used performance target in both corporate finance and financial reporting research because capital structure, operating success, and margin dynamics can be read simultaneously through ROE (
Selling & Stickney, 1989).
The financial ratios used in Model 1 (INKM, IFMA, IROI, ICRT, ILEV) have economic content that is highlighted in the literature as key determinants of ROE, but do not completely overlap with each other. Net profit margin (INKM) and operating margin (IFMA) represent the “margin/operational efficiency” channel of profitability. The sensitivity of margins to both cross-sector structural differences and the strategic competitive environment has been demonstrated in classical ratio analysis studies (
Selling & Stickney, 1989). The ratio analysis and valuation literature emphasizes that margin and operating performance carry critical information for understanding future levels of profitability (
Nissim & Penman, 2001). The investment return (IROI) variable also aims to capture the “investment efficiency” channel of ROE by representing the firm’s investment/asset utilization efficiency and the capacity of investments to generate profitability. This variable also plays a central role in DuPont-based studies discussing the continuity and risk of profitability components (
Li et al., 2014). The current ratio (ICRT), representing liquidity, and leverage (ILEV), representing leverage and capital structure, are expected to bring the financial policy dimension of ROE into the model. The relationship between liquidity and profitability is addressed in the literature within a “trade-off” framework. While high levels of liquidity provide a safety buffer, excessive liquidity can suppress profitability due to financing costs or idle resources. This relationship has been empirically tested, particularly through liquidity indicators such as the current ratio (
Eljelly, 2004). Similarly, the link between working capital management and profitability points to short-term financing/operating mechanisms affecting profitability (
Deloof, 2003). The choice of leverage (ILEV) relates the impact of capital structure on profitability and shareholder returns to both theoretical and empirical literature. As is well known, agency theory suggests that the use of external sources of funds may create a disciplinary effect on managers, but that these sources may also generate agency costs (
Jensen & Meckling, 1976). Empirically, the relationship between capital structure components and ROE has been directly tested in different samples (
Abor, 2005).
In Model 2, sustainability/governance indicators (IKYI, ISUS, IESG, IRSK, IMNG) were added as independent variables on the assumption that, beyond financial ratios, they carry additional information for predicting ROE regarding “corporate transparency, oversight, and risk management capacity.” This selection has two main theoretical underpinnings. The first is the agency-based approach (
Jensen & Meckling, 1976), which argues that corporate governance and stakeholder oversight may be related to performance by disciplining firm behavior. The second pillar is the disclosure literature, which emphasizes that public disclosure and reporting can reduce information asymmetry, thereby generating economic outcomes through more efficient pricing in capital markets, lower uncertainty, and potentially lower capital costs (
Healy & Palepu, 2001). Thus, the relationship between corporate governance quality and performance has also found empirical support in studies within the literature that analyze the link between shareholder rights and governance mechanisms and valuation and performance indicators (
Gompers et al., 2003;
Bhagat & Bolton, 2008).
The presence of the risk management section (IRSK) among these sustainability/governance indicators in Model 2 is consistent with the risk disclosure literature, which indicates that risk reporting is a measurable dimension of annual reports and may be related to corporate characteristics (
Linsley & Shrives, 2006;
Abraham & Cox, 2007). The sustainability committee (ISUS) was selected because it is consistent with studies discussing that sustainability-focused corporate governance mechanisms may be related to reporting and auditing practices (particularly through mechanisms such as environmental/corporate social responsibility (CSR)/sustainability committees at the board level) (
Peters & Romi, 2015). The reporting of ESG indicators (IESG) and the corporate governance compliance report/information form (IKYI, IMNG) connect with the extensive literature showing that corporate social responsibility/ESG disclosures may have economic consequences. Meta-analytic evidence mostly reports the relationship between CSR/ESG and financial performance as non-negative or positive (Friede et al., 2015; Orlitzky et al., 2003), and CSR reporting has been shown to affect the cost of capital and the information environment (Dhaliwal et al., 2011). Therefore, the indicators in Model 2 enable a literature-based test of the hypothesis that ROE can be better predicted by “financial and governance/sustainability disclosures” rather than “financial ratios alone”.
The analysis is designed to test whether sustainability/governance indicators provide additional information value in ROE prediction beyond financial ratios. Therefore, two model families are constructed on the same sample and compared using the same validation framework. The analysis follows a four-stage comparison logic: (i) the performance of Model 1 by algorithm, (ii) the performance of Model 2 by algorithm, (iii) a comparison of the two models’ performance, and (iv) an interpretability analysis. The findings concern predictive power and additional information value and do not constitute a claim of causality.
In this study, several tree-based machine learning algorithms were used so that the predictive performance of Model 1 and Model 2 does not depend on the structural assumptions of a single algorithm and so that the algorithm-sensitivity of the findings can be tested. For both models, the Random Forest, XGBoost, LightGBM, and CatBoost algorithms are used. Tree-based algorithms rely on decision trees that sequentially split the input (X) space and produce a prediction for the target variable (ROE in this study) in each resulting region. The fundamental statistical/algorithmic framework of decision trees has been systematised in the Classification and Regression Trees (CART) literature (
Breiman et al., 1984). Although a single tree is interpretable, individual trees often exhibit a high tendency for overfitting. Consequently, success in modern applications is largely achieved through ensemble approaches (
Hastie et al., 2009). Therefore, all four methods used in this study are approaches that combine decision trees using an “ensemble” logic.
Random Forest (RF) is an ensemble method that trains multiple decision trees on bootstrap samples and combines their results by averaging (
Breiman, 2001). Its fundamental basis is to reduce high variance using the “bagging (bootstrap aggregating)” approach (
Breiman, 1996). In RF, inter-tree correlation is also reduced by using a random feature subset in each split, thereby strengthening generalisation performance (
Breiman, 2001). XGBoost is an optimised application of Friedman’s gradient boosting idea (
Friedman, 2001) for high efficiency and scalability. XGBoost builds trees sequentially, with each new tree attempting to reduce the errors of the previous one (
Chen & Guestrin, 2016). XGBoost’s distinguishing features include its regularised objective function, sparsity-aware split finding, and various sampling/learning-rate mechanisms that attempt to control overfitting (
Chen & Guestrin, 2016). LightGBM is also in the gradient-boosted trees class. It is designed for scalability and computational efficiency (
Ke et al., 2017). One of its most notable differences from other tree-based methods is that it favours a leaf-wise (best-first) growth strategy over the level-wise (depth-based) approach used by most gradient-boosted tree algorithms. This strategy can enable faster convergence by growing the leaf that reduces the loss the most (
Ke et al., 2017). CatBoost is also a gradient-boosted tree method. Its prominent difference in the literature is its ordered boosting approach, which aims to reduce the “prediction shift” type of bias seen in the boosting process, and its methods developed for categorical variables (
Prokhorenkova et al., 2018). Although the sustainability/governance indicators in the Model 2 dataset of the study are binary (0/1), and thus do not directly experience the “categorical variable explosion” problem, CatBoost’s ordered boosting approach has been added to the analysis with the aim of reducing the risk of boosting-induced leakage/bias (
Prokhorenkova et al., 2018) and providing methodological diversity. Model performance is evaluated for each algorithm using the coefficient of determination (R²) and the RMSE and MAE error metrics.
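For reference, the three evaluation metrics can be written out directly. The sketch below uses the standard textbook definitions with NumPy; the function name `regression_metrics` and the toy numbers are illustrative assumptions, not the study's code or data.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for one set of out-of-sample predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = float(np.sqrt(np.mean(err ** 2)))        # penalises large errors more
    mae = float(np.mean(np.abs(err)))               # average absolute miss
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot                      # can be negative out of sample
    return {"rmse": rmse, "mae": mae, "r2": r2}

# Toy values chosen so the arithmetic is easy to check by hand.
m = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 2.0])
```

Here the single error of 2 gives RMSE = 1.0, MAE = 0.5, and R² = 1 − 4/5 = 0.2. Lower RMSE and MAE and higher R² indicate better performance.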
The objective of this study is to compare the predictive performance of two model families established on the same sample (Model 1 based solely on financial ratios and Model 2 incorporating sustainability/governance indicators in addition to financial ratios) and to test whether the enriched variable set provides “additional information value” from a generalizability perspective. For this reason, beyond algorithm selection, the analysis design requires that performance be measured within a validation framework that is free from information leakage and that reports uncertainty. Accordingly, Random Forest, XGBoost, LightGBM, and CatBoost estimators were set up for both models. For all algorithms, hyperparameters were tuned using grid search within the training folds only, embedded in the grouped cross-validation procedure, in order to avoid any information leakage from the test sets. The same tuning protocol was applied consistently across all models to ensure a fair benchmarking framework. The tuned hyperparameters included tree depth, learning rate, number of estimators, and minimum node size (model-specific), with search ranges defined based on standard practice in the literature and preliminary experimentation. These algorithms were chosen because they are state-of-the-art tree-based ensemble methods widely used for tabular financial data, represent complementary ensemble paradigms (bagging in Random Forest versus gradient boosting in XGBoost, LightGBM, and CatBoost), and differ in their handling of feature interactions, regularization, and categorical information, thereby providing a diverse yet comparable benchmarking set.
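Tuning inside the training folds only can be sketched with scikit-learn as a nested, group-aware loop. This is an illustration under stated assumptions, not the study's pipeline: the data are synthetic, each hypothetical firm contributes two rows so that grouping matters, only Random Forest is shown, and the grid values are placeholders because the study's search ranges are not given in this section.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic data: 30 hypothetical firms, 2 rows each, 4 features.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(30), 2)
X = rng.normal(size=(groups.size, 4))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=groups.size)

param_grid = {"max_depth": [2, 4], "n_estimators": [25, 50]}  # placeholder ranges
outer = GroupKFold(n_splits=5)
fold_rmse, leaks = [], 0

for train_idx, test_idx in outer.split(X, y, groups):
    # Count firms appearing on both sides of the outer split (should be zero).
    leaks += len(set(groups[train_idx]) & set(groups[test_idx]))
    search = GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid,
        cv=GroupKFold(n_splits=3),              # inner folds are grouped as well
        scoring="neg_root_mean_squared_error",
    )
    # The grid search sees training rows only; test rows never influence tuning.
    search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
    pred = search.predict(X[test_idx])
    fold_rmse.append(float(np.sqrt(mean_squared_error(y[test_idx], pred))))
```

Because the search object is refit on each outer training set, the selected hyperparameters may differ between folds. That is expected in a nested design and is part of what the outer scores measure.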
However, the inter-model comparison was performed using the Bootstrapped Grouped Cross-Validation Model Comparison (BG-CVMC) framework rather than relying on a single random data split. This framework aims to (i) prevent information leakage through group-protected splitting in data structures containing firm-level dependencies, and (ii) quantify CV-induced randomness using bootstrap sampling to generate confidence intervals for both model performance and performance differences.
The BG-CVMC application consists of three components. First, the data set was grouped according to company identities, and observations belonging to the same company were prevented from falling into both the training and test sets within the same fold. This step reduces the risk of “information leakage” that could arise from dependencies generated by the panel/hierarchical structure. Thus, observations belonging to the same company are either entirely in the training set or entirely in the test set in any fold, ensuring that the training-test separation remains “clean” at the company level (Roberts et al., 2017). Second, the RMSE, MAE, and R² metrics were calculated for each split of the grouped CV. Because cross-validation outputs from a single run contain random components, reporting only average performance values may be insufficient for model evaluation. Third, the split-level metrics are therefore resampled using the bootstrap to obtain point estimates and 95% confidence intervals for each model and each metric. The bootstrap approximates the distribution of performance estimates through resampling and allows uncertainty to be reported quantitatively (Efron & Tibshirani, 1994; Davison & Hinkley, 1997). To strengthen the statistical basis of the model comparison, BG-CVMC reports not only the average performance of the two models but also the bootstrap distribution of the model differences calculated over the same splits. What cross-validation actually estimates, and how well it does so (particularly given the variance/dependence structure across folds), has been discussed in detail in the literature (
Bates et al., 2021). Furthermore, the theoretical foundations and limitations of approaches for producing confidence intervals for CV-based test errors have been developed, taking into account dependencies within CV (
Bayle et al., 2020). BG-CVMC combines this theoretical background with group-based partitioning and multiple bootstrap repetitions to establish a more robust basis for comparison in data structures containing dependencies at the firm level.
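The first and third components described above (group-protected splitting and bootstrap intervals over split-level scores) can be sketched with NumPy alone. This is a simplified illustration, not the BG-CVMC implementation: the function names are hypothetical, a plain percentile bootstrap is assumed, and the per-split RMSE values are random placeholders standing in for real model scores.

```python
import numpy as np

def repeated_group_kfold(groups, n_splits=5, n_repeats=3, seed=0):
    """Yield (train_idx, test_idx) with every group entirely in train or in test."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    unique = np.unique(groups)
    for _ in range(n_repeats):
        shuffled = rng.permutation(unique)          # new firm ordering per repeat
        for test_groups in np.array_split(shuffled, n_splits):
            test_mask = np.isin(groups, test_groups)
            yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of split-level scores."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    idx = rng.integers(0, values.size, size=(n_boot, values.size))
    means = values[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(values.mean()), float(lo), float(hi)

# 20 hypothetical firms with 3 rows each; 5 folds x 3 repeats = 15 splits.
groups = np.repeat(np.arange(20), 3)
splits = list(repeated_group_kfold(groups))

# Placeholder scores; in practice these come from fitting a model on each split.
split_rmse = np.random.default_rng(1).normal(loc=0.30, scale=0.02, size=len(splits))
point, lo, hi = bootstrap_ci(split_rmse)
```

Scores from overlapping training sets are not independent, so an interval of this kind is an approximation. That is the concern the cross-validation inference literature cited above addresses.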
In this study, to address the uncertainty of CV estimation in a more efficient and computable manner, a framework compatible with the recently proposed accelerated bootstrap-based variance estimation approach has been adopted (
Cai et al., 2025). This enables confidence intervals to be generated not only for each model’s performance but also for the performance difference between Model 2 and Model 1, so that it is possible to assess whether the difference is statistically distinct from zero. Company codes were defined as group labels, and the data were split according to these labels using a repeated K-fold cross-validation (CV) structure. The statistical significance of the differences between models was examined using both parametric and non-parametric tests while maintaining the split-based matched structure: the paired t-test (Student, 1908) and the Wilcoxon signed-rank test (Wilcoxon, 1945) were used to assess whether the observed performance differences stemmed from random fluctuations or systematic superiority.
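The matched comparison can be sketched as follows, assuming both models were scored on identical splits. The per-split RMSE arrays are synthetic placeholders (not the study's results), and the percentile bootstrap is one possible interval choice. `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` are the paired t-test and Wilcoxon signed-rank test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_splits = 25

# Synthetic per-split RMSE for two models evaluated on the same splits.
shared = rng.normal(scale=0.03, size=n_splits)            # split difficulty, common to both
rmse_m1 = 0.30 + shared + rng.normal(scale=0.005, size=n_splits)
rmse_m2 = rmse_m1 - 0.01 + rng.normal(scale=0.005, size=n_splits)

# Paired differences, M1 - M2: positive values mean Model 2 has the lower error.
diff = rmse_m1 - rmse_m2

# Percentile bootstrap interval for the mean paired difference.
boot_idx = rng.integers(0, n_splits, size=(5000, n_splits))
boot_means = diff[boot_idx].mean(axis=1)
ci_lo, ci_hi = np.quantile(boot_means, [0.025, 0.975])

# Parametric and non-parametric paired tests on the same matched structure.
t_res = stats.ttest_rel(rmse_m1, rmse_m2)
w_res = stats.wilcoxon(rmse_m1, rmse_m2)
```

Pairing removes the split-difficulty component shared by both models, which is why a small average difference can still be detected. An interval for `diff` that excludes zero and small p-values from both tests point in the same direction.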
The Bootstrapped Grouped Cross-Validation Model Comparison (BG-CVMC) applied in the previous stage serves to compare the generalizable prediction performance of Model 1 and Model 2 (and the tree-based predictors trained on them), taking into account firm-group dependency, and to report the performance difference along with its uncertainty. Accordingly, performance differences between Model 1 and Model 2 are formally evaluated using matched statistical tests alongside bootstrap-based confidence intervals, allowing systematic information gains to be distinguished from random variation. The findings obtained in this stage revealed which model/algorithm combination was more successful in terms of predictive power. Subsequently, in the final step of the study, an interpretability layer was implemented for the final model with high performance. In this context, the Shapley Additive Explanations (SHAP) analysis is not an alternative to BG-CVMC, which serves to “prove” prediction performance, but rather a complementary method that explains the model’s decision logic (
Lundberg & Lee, 2017). SHAP is an additive explanation approach that decomposes a machine learning model’s prediction for a specific observation into the contribution (attribution) components of each explanatory variable. SHAP’s theoretical basis is rooted in the Shapley value concept from cooperative game theory. In the analysis, variables are considered as “players” and the model prediction as “payoff,” and the marginal contribution of each variable is systematically allocated (
Shapley, 1953;
Lundberg & Lee, 2017). In this study, variable contributions were reported using SHAP (TreeSHAP) because it provides a computable and consistent framework for tree-based methods, enabling the interpretation of global patterns by combining local explanations (
Lundberg & Lee, 2017). In this study, Shapley-based attribution was interpreted as an intra-model contribution rather than a causal effect (
Aas et al., 2021).
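The additive decomposition behind SHAP can be illustrated by computing exact Shapley values through brute-force enumeration of coalitions. This is not TreeSHAP and not the `shap` library: the cost grows exponentially with the number of features, absent features are replaced by rows of a background sample (one possible value function), and the toy model and all names are hypothetical.

```python
import itertools
import math
import numpy as np

def shapley_values(predict, x, background):
    """Exact Shapley attributions of predict(x) relative to the background mean."""
    n = x.size

    def value(coalition):
        # Features in the coalition take x's values; the rest keep background values.
        data = background.copy()
        cols = list(coalition)
        if cols:
            data[:, cols] = x[cols]
        return float(predict(data).mean())

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            for subset in itertools.combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi

def toy_model(data):
    """Hypothetical predictor: one additive term plus one interaction."""
    return 2.0 * data[:, 0] + data[:, 1] * data[:, 2]

rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))
x = np.array([1.0, 0.5, -2.0])
phi = shapley_values(toy_model, x, background)
baseline = float(toy_model(background).mean())
```

The attributions sum to the prediction for `x` minus `baseline`, which is the additivity property that allows local explanations to be aggregated into global patterns. As noted above, these are contributions within the model, not causal effects.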
The results obtained from these procedures are reported in Section 5.
5. Findings
The study forecasts ROE for 2024 (t) using 2023 (t − 1) data; the risk of information leakage is thus minimized because only publicly disclosed historical financial information and annual report disclosures are used in the forecast. The sample consists of a total of 428 companies; banks and insurance companies are excluded due to differences in the presentation of financial statements. The study compares two model families on the same sample: (i) the basic model based solely on financial ratios (Model 1) and (ii) the enriched model that includes financial ratios and sustainability/governance indicators derived from the companies’ annual reports (Model 2).
Comparing the error metrics of the four tree-based algorithms for both models (lower RMSE and MAE and higher R² indicate better performance), Table 4 shows that Random Forest achieves the best overall performance.
As shown in Table 4, Model 2 tends to increase R² across all algorithms and to reduce the error metrics in most cases.
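For reference, the three evaluation metrics can be computed as in the following minimal sketch. The ROE values are made up for illustration and are not taken from the study's data; the function names are likewise only illustrative.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: penalises large errors more heavily."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: share of variance explained by the predictions."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up ROE values in per cent (illustrative only).
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

print(rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```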
The marginal contribution of Model 2 compared to Model 1 is shown in Table 5 on an algorithm-by-algorithm basis.
Table 5 shows that the error reductions in Model 2 are particularly noticeable in boosting-based models (XGBoost, LightGBM). This finding indicates that the governance/sustainability signal can be captured more effectively in interaction with financial ratios in some algorithms. However, rather than relying on a single split/single run for the final decision, an assessment should be made using the BG-CVMC presented below, taking into account the associated uncertainty.
In this study, the Bootstrapped Grouped Cross-Validation Model Comparison (BG-CVMC), which considers firm-level dependency, prevents the mixing of observations from the same firm in training and testing; furthermore, the randomness of performance estimation and CV uncertainty are quantified using bootstrap. This approach is consistent with cross-validation strategies proposed for grouped/dependent structures (e.g., Roberts et al., 2017) and with the literature focusing on bootstrap-based confidence interval production for CV uncertainty (e.g., Bayle et al., 2020; Cai et al., 2025; Efron & Tibshirani, 1994). The models’ BG-CVMC Performance Summary is shown in Table 6.
Table 6 shows the BG-CVMC performance summary (mean and 95% CI), indicating a small but consistent advantage for Model 2.
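The grouping principle behind BG-CVMC, namely that observations from the same firm never appear in both training and testing, can be sketched as follows. This is a simplified illustration under assumed inputs, not the study's implementation: the function grouped_kfold, the firm labels and the number of folds are hypothetical. Libraries such as scikit-learn offer a comparable splitter (GroupKFold).

```python
import random

def grouped_kfold(groups, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs so that no group appears on both sides."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Assign whole groups (firms) to folds, never individual observations.
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for k in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == k]
        train = [i for i, g in enumerate(groups) if fold_of[g] != k]
        yield train, test

# Hypothetical firm identifiers, one per observation.
firms = ["A", "A", "B", "C", "C", "C", "D", "E", "F", "F"]
splits = list(grouped_kfold(firms, n_splits=3, seed=42))

for train, test in splits:
    print("test firms:", sorted({firms[i] for i in test}))
```

Repeating such splits with different seeds, and scoring Model 1 and Model 2 on identical folds, would give the paired per-split metrics on which the comparison rests.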
The BG-CVMC results for the differences between the models are presented in Table 7.
In Table 7, where the model differences are reported, the positive (M1 − M2) differences in RMSE and MAE and the negative difference in R² indicate that Model 2 produces lower errors and has higher explanatory power.
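A percentile bootstrap interval for the mean (M1 − M2) difference can be sketched as below. The per-split RMSE values are hypothetical, and the exact resampling scheme used in the study is not specified here, so this is only one plausible way to obtain such an interval; the function name bootstrap_mean_ci is illustrative.

```python
import random

def bootstrap_mean_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        # Resample the paired differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lo, hi

# Hypothetical per-split RMSE values for the two models (same splits for both).
rmse_m1 = [11.50, 11.42, 11.38, 11.47, 11.44, 11.40]
rmse_m2 = [11.47, 11.41, 11.35, 11.46, 11.41, 11.39]
diffs = [a - b for a, b in zip(rmse_m1, rmse_m2)]

mean_diff, lo, hi = bootstrap_mean_ci(diffs)
print(f"mean difference = {mean_diff:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```

An interval lying entirely above zero corresponds to the case in which Model 2 has the lower error on average.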
As can be seen in Table 7, the BG-CVMC analysis results indicate statistically significant differences between the performance of the two models. Model 2’s average performance is better than Model 1’s in terms of RMSE, MAE, and R², and the 95% confidence intervals for the model differences exclude zero. The statistical significance of these performance differences is further confirmed by split-based paired t-tests and Wilcoxon signed-rank tests, as reported in Table 8: positive mean differences for RMSE and MAE confirm that Model 2 produces lower errors, while the negative difference for R² confirms that Model 2 has higher explanatory power. The p-values being below 0.05 in all tests indicate that the observed differences are unlikely to be due to chance and that Model 2 performs statistically significantly better than Model 1.
Overall, these results support H1, indicating that the model incorporating sustainability and corporate governance indicators consistently outperforms the baseline financial model in out-of-sample ROE prediction.
Consistent with H2, the performance improvements observed are statistically significant but limited in magnitude, as reflected by the small average differences and the associated confidence intervals.
Finally, the algorithm-specific results reported in Table 4 and Table 5 provide support for H3, showing that the incremental contribution of sustainability and governance indicators is more pronounced in boosting-based tree models.
Table 8 reports the results of the paired t-tests and Wilcoxon signed-rank tests, confirming that the performance differences between Model 1 and Model 2 are statistically significant across all evaluation metrics. Overall, the findings of the BG-CVMC analysis indicate that Model 2 provides statistically significant but limited “additional information value” compared to Model 1. This suggests that the contribution of sustainability/governance indicators to ROE prediction is marginal but differs from zero even when group dependency and CV uncertainty are taken into account (Bates et al., 2021; Bayle et al., 2020; Cai et al., 2025).
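The test statistics behind such paired comparisons can be sketched as follows. The sketch computes only the paired t statistic and the Wilcoxon signed-rank sums; p-values require the corresponding reference distributions and are typically obtained from a statistics library (for example scipy.stats.ttest_rel and scipy.stats.wilcoxon). The input values are made up and the function names are illustrative.

```python
from math import sqrt

def paired_t_statistic(a, b):
    """t statistic for the mean of paired differences a - b."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)  # sample variance
    return mean_d / sqrt(var_d / n)

def wilcoxon_signed_rank(a, b):
    """Return (W+, W-): rank sums of positive and negative paired differences."""
    d = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(order):
        j = i
        # Extend j over ties in absolute difference, which share an average rank.
        while j + 1 < len(order) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    w_minus = sum(r for r, x in zip(ranks, d) if x < 0)
    return w_plus, w_minus

# Made-up per-split error values for two models (illustrative only).
errors_m1 = [5.0, 7.0, 9.0, 4.0, 8.0]
errors_m2 = [4.0, 5.0, 6.0, 5.0, 8.0]

t_stat = paired_t_statistic(errors_m1, errors_m2)
w_plus, w_minus = wilcoxon_signed_rank(errors_m1, errors_m2)
print(t_stat, w_plus, w_minus)
```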
When BG-CVMC and key performance metrics (RMSE, MAE, R²) were evaluated together, the best generalization performance for both Model 1 and Model 2 was achieved using the Random Forest algorithm. Therefore, Random Forest was selected as the final model for the SHAP analysis, the results of which are reported in Table 9 and Figure 1.
The Random Forest SHAP results presented in Table 9 and Figure 1 indicate that the most dominant determinant of ROE prediction is Net Profit Margin (INKM) (Mean|SHAP| = 10.9589; positive effect ratio = 0.611). Return on Investment (ROI) ranks second, and the average contribution direction appears to be predominantly negative (Mean SHAP = −0.4997; Mean|SHAP| = 2.3100; negative ratio = 0.709). The negative average contribution of ROI can be interpreted in the context of mean reversion and profitability normalization effects, particularly in emerging markets. Firms exhibiting unusually high ROI in period t − 1 may experience subsequent performance moderation due to competitive pressures, capacity constraints, or transitory gains, leading to a negative marginal association with next-period ROE. Similarly, the negative contribution of the Net Margin growth indicator (IMNG) may reflect adjustment dynamics whereby short-term margin expansions, often driven by temporary cost reductions, pricing anomalies, or one-off operational effects, do not persist into future profitability. In addition, rapid margin growth may coincide with increased risk-taking or earnings volatility, which can weaken the stability of subsequent ROE. These findings are consistent with prior evidence suggesting that extreme profitability signals tend to partially reverse over time, implying that exceptionally strong short-run margins or returns may not translate proportionally into sustainable future performance. Among the financial ratios, the Operating Margin (IFMA) (Mean|SHAP| = 0.8627) and Current Ratio (ICRT) (Mean|SHAP| = 0.6267) follow with positive contributions, while Leverage (ILEV) provides a more limited positive contribution (Mean|SHAP| = 0.4656). The mean absolute SHAP values of the sustainability/governance indicators specific to Model 2 (IKYI, ISUS, IESG, IRSK, IMNG) are lower than those of the financial ratios but are not entirely negligible (e.g., IMNG Mean|SHAP| = 0.2290; IESG Mean|SHAP| = 0.1355).
This pattern, consistent with the “small but systematic performance improvement of Model 2” observed in the BG-CVMC, indicates that the governance/sustainability signal makes a marginal but measurable contribution to ROE prediction.
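The summary statistics reported for each variable (Mean|SHAP|, mean signed SHAP, and the share of positive contributions) can be derived from a matrix of per-observation SHAP values as in the sketch below. The SHAP values are made up, and the definition of the positive effect ratio as the share of observations with a positive SHAP value is an assumption about how the reported ratio is computed.

```python
def shap_summary(shap_matrix, feature_names):
    """Aggregate local SHAP values (rows = observations) into global statistics."""
    n = len(shap_matrix)
    summary = {}
    for j, name in enumerate(feature_names):
        col = [row[j] for row in shap_matrix]
        summary[name] = {
            "mean_abs": sum(abs(v) for v in col) / n,       # global importance
            "mean": sum(col) / n,                           # average direction
            "positive_ratio": sum(v > 0 for v in col) / n,  # assumed definition
        }
    return summary

# Made-up SHAP values for three variables and four firms (illustrative only).
features = ["INKM", "ROI", "IESG"]
shap_matrix = [
    [8.0, -2.0, 0.1],
    [-4.0, -1.0, 0.2],
    [12.0, 3.0, -0.1],
    [6.0, -2.0, 0.0],
]

summary = shap_summary(shap_matrix, features)
ranking = sorted(features, key=lambda f: summary[f]["mean_abs"], reverse=True)
print(ranking)
```

A variable can therefore rank high on Mean|SHAP| while having a negative mean signed SHAP, which is the pattern described for ROI.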
6. Discussion
This study examined whether sustainability and corporate governance indicators provide measurable additional information value in return on equity (ROE) prediction compared to models based solely on financial ratios, using advanced machine learning algorithms and validation techniques that prevent information leakage. The findings reveal that ESG and governance signals statistically significantly improve ROE prediction performance, albeit to a limited extent. From an economic perspective, the magnitude of the observed performance improvement is modest. This finding is consistent with the nature of ESG and governance information, which is not expected to replace core financial fundamentals but rather to complement them. Financial ratios remain the dominant drivers of profitability, while sustainability and governance disclosures function as secondary signals that refine predictive accuracy at the margin. In this sense, the results suggest that ESG-related information should be interpreted as an incremental layer of informational content rather than as a primary determinant of firm-level profitability.
When examining algorithm-based results, it is observed that the highest prediction success in both model families was achieved with the Random Forest algorithm. In Model 1, which is based solely on financial ratios, the R² value for Random Forest was calculated as 0.5497, RMSE as 11.50, and MAE as 7.54. In Model 2, which includes sustainability/governance indicators in addition to financial ratios, the R² value for the same algorithm increased to 0.5525, while the RMSE decreased to 11.47 and the MAE decreased to 7.52. These results indicate that, although the absolute magnitude of the performance improvement is limited, its direction is consistently in favour of Model 2.
A similar pattern is observed more distinctly in boosting-based algorithms. For example, for XGBoost, the R² value increased from 0.5142 in Model 1 to 0.5270 in Model 2; RMSE decreased from 11.95 to 11.78, and MAE decreased from 8.08 to 7.97. LightGBM results also support this trend; the R² value increased from 0.4972 to 0.5062, while a decrease was observed in the error metrics. These findings indicate that sustainability and governance indicators can be more effectively reflected in the model, particularly in boosting-based algorithms that are strong at capturing non-linear interactions.
However, beyond one-off algorithm comparisons, the fundamental contribution of this study is that these differences have been evaluated together with their uncertainties using the Bootstrapped Grouped Cross-Validation Model Comparison (BG-CVMC) framework. The BG-CVMC results show that Model 2 has an average RMSE value of 11.4134 (95% CI: [11.1111, 11.7219]), while Model 1 has a value of 11.4336 (95% CI: [11.1312, 11.7458]). Similarly, the average MAE was calculated as 7.6377 in Model 1 and 7.6208 in Model 2; the average R² increased from 0.5497 to 0.5515.
In the BG-CVMC difference analysis, where model differences are directly evaluated, the RMSE difference (Model 1 − Model 2) averages 0.0202, with a 95% confidence interval of [0.0012, 0.0383]. The mean MAE difference was 0.0170 (95% CI: [0.0007, 0.0332]) and the R² difference was −0.0018 (95% CI: [−0.0032, −0.0004]). The fact that the confidence intervals for all three metrics exclude zero indicates that the contribution of sustainability/governance indicators to ROE prediction is statistically significant. The paired t-test and Wilcoxon tests also support this result; p-values below 0.05 for all metrics indicate that the observed differences are unlikely to be due to chance.
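To put these differences in proportion, the relative improvements can be worked out directly from the BG-CVMC means reported above. The following short calculation uses only those reported figures; the variable names are illustrative.

```python
# BG-CVMC mean values as reported in the text.
m1 = {"RMSE": 11.4336, "MAE": 7.6377, "R2": 0.5497}
m2 = {"RMSE": 11.4134, "MAE": 7.6208, "R2": 0.5515}

# Relative error reduction of Model 2 versus Model 1, in per cent.
rel_rmse = (m1["RMSE"] - m2["RMSE"]) / m1["RMSE"] * 100
rel_mae = (m1["MAE"] - m2["MAE"]) / m1["MAE"] * 100

# Gain in R² expressed in percentage points.
r2_gain_pp = (m2["R2"] - m1["R2"]) * 100

print(f"RMSE reduction: {rel_rmse:.2f}%")
print(f"MAE reduction:  {rel_mae:.2f}%")
print(f"R2 gain:        {r2_gain_pp:.2f} percentage points")
```

The error reductions are thus on the order of 0.2 per cent, and the R² gain is about 0.18 percentage points.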
These quantitative findings are consistent with the “weak but positive” impact patterns frequently reported in ESG literature. Sustainability and governance signals are not dominant determinants replacing financial ratios; they function as complementary layers of information added to financial fundamentals in ROE estimation. This suggests that the role of ESG indicators on financial performance should be assessed without exaggeration, but also without being completely disregarded.
The SHAP analysis results also support this interpretation. For the Random Forest model, the highest mean absolute SHAP value belongs to the Net Profit Margin (INKM) variable (Mean|SHAP| = 10.96). This is followed by Return on Investment (ROI) (Mean|SHAP| = 2.31) and Operating Margin (IFMA) (Mean|SHAP| = 0.86). In contrast, the mean absolute SHAP values for sustainability and governance indicators are lower (e.g., IMNG = 0.229, IESG = 0.136). However, the fact that these values are clearly above zero indicates that these variables are not entirely ineffective within the model; on the contrary, they make marginal contributions consistent with the performance increase observed in BG-CVMC.
These findings are consistent with prior studies reporting a weak but statistically significant association between ESG dimensions and firm performance, suggesting that sustainability-related information primarily operates as a refinement mechanism rather than a substitute for financial fundamentals. From a theoretical perspective, governance and sustainability disclosures may reduce information asymmetry, signal managerial quality, and reflect long-term risk management practices, thereby supporting future profitability in an indirect manner. However, their relatively small effect sizes indicate that such signals are absorbed gradually by markets and materialize mainly through interaction with core financial drivers. Accordingly, the results support an incremental-information view of ESG, where governance and sustainability indicators enhance prediction accuracy at the margin while financial ratios remain the primary determinants of ROE.
7. Conclusions
This study examined whether sustainability and corporate governance indicators provide additional and statistically significant information value in ROE forecasting compared to traditional models based on financial ratios. The results obtained reveal that ESG and governance signals enhance prediction performance; however, this increase is limited in magnitude and exhibits a structure dominated by financial fundamentals.
The BG-CVMC results show that Model 2 reduces the RMSE and MAE by approximately 0.2 per cent compared to Model 1 (mean differences of 0.0202 and 0.0170, respectively) and increases the R² by approximately 0.18 percentage points (from 0.5497 to 0.5515). While this improvement does not represent a significant leap in economic terms, it is statistically reliable and exhibits a consistent pattern across different algorithms. Therefore, sustainability and governance disclosures produce small but systematic signals rather than “noise” in ROE estimation. The robustness of these results is supported by the use of grouped cross-validation to prevent information leakage, bootstrap-based confidence intervals, paired statistical tests, and consistent performance patterns observed across multiple machine learning algorithms. Together, these elements indicate that the reported improvements reflect systematic informational gains rather than random variation or model-specific artifacts.
The findings of the study present an important balance point regarding how ESG data should be positioned in financial performance analyses. ESG indicators are not independent performance determinants that replace financial ratios; however, when considered alongside financial fundamentals, they can statistically significantly improve prediction accuracy. This sheds light on the methodological sources of conflicting results in the academic literature and demonstrates that ESG reporting should be evaluated with realistic expectations from the perspective of both investors and regulatory bodies.
For future studies, examining sustainability and governance indicators not only in binary terms (present/absent) but also in terms of intensity, quality and continuity over time, as well as investigating dynamic effects in long-term panel data structures, could deepen the knowledge base in this field. This study contributes to the ESG–financial performance literature with measured, methodologically sound and numerically supported findings.
Overall, the findings of this study suggest that sustainability and corporate governance disclosures should be viewed neither as substitutes for financial fundamentals nor as negligible sources of information. Instead, they function as complementary signals that provide incremental informational content in profitability prediction. From this perspective, the contribution of ESG-related information is modest in magnitude but statistically reliable, highlighting its role as a refinement mechanism rather than a primary driver of firm-level financial performance.
Despite its contributions, this study is subject to certain limitations, which also point to avenues for future research. First, the sustainability and governance indicators employed in the analysis are based on binary disclosures, which capture the presence of reporting practices rather than their depth, quality, or intensity. Future studies could extend this framework by incorporating more granular ESG measures or textual-based indicators that reflect the qualitative aspects of sustainability reporting. Second, the analysis focuses on a single forecasting horizon using lagged information; exploring dynamic or multi-period prediction settings may provide further insights into the long-term informational role of sustainability and governance disclosures.