1. Introduction
Sustainability reporting is now considered as important as financial statements for decision makers considering investing in global markets. In Turkey, sustainability or integrated reporting is not mandatory for companies listed on the Istanbul Stock Exchange (BIST). Furthermore, this voluntary approach provides an excellent analytical environment for measuring the impact of sustainability indicators across different topics compared with capital markets that mandate the publication of these indicators. Companies present these reports not only to meet potential legal requirements but also because they complement and support their financial information. The interaction between RoE, one of the key indicators of financial performance that investors refer to when assessing the effective use of capital, and sustainability factors in markets has not been clearly established in the literature.
A review of the literature shows that the impact of sustainability reporting on financial performance has mostly been examined using traditional econometric methods and, more specifically, regression-based models. The main focus of these studies is to predict the level of ESG scores. The lack of objectivity and comparability of ESG scores and corporate governance indicators in emerging markets is a significant problem. It also complicates interpretation of the findings.
On the other hand, testing financial performance solely with traditional econometric models is no longer sufficient. This approach, which maps realized market outcomes onto investors’ decision-making mechanisms, is less suited to prediction than artificial intelligence tools (machine learning algorithms). Furthermore, capital market investors attempt to predict a company’s performance when making decisions. Therefore, they require simple, understandable, and effective indicators to distinguish between high and low performance. This situation brings machine learning models to the fore, because they offer a more explicitly prediction-focused framework for classification than traditional econometric models. The literature review conducted for this study also points to this gap: it did not find any research examining the possible effects of governance and sustainability regimes on RoE performance.
This study aims to develop machine learning models that classify the return on equity of 427 companies listed on the BIST as high or low, using sustainability and governance committee and reporting variables as independent variables. The main purpose of using the independent variables in the study’s model is that these reports, audited by internal audit mechanisms and shared with the public, demonstrate companies’ transparency and reporting behavior, and the information they contain is publicly available to all investors in real time. The original aspect of this research is that, rather than measuring the degree of impact of ESG scores, it tests the discriminative power of the relevant independent variables’ binary indicators in classifying companies as low or high based on their RoE. Thus, the aim is to create a predictive mechanism for the power of past governance and reporting indicators to classify companies as low/high based on their RoE in the future.
Although the study relies on simple present/absent (binary) disclosures, these signals are theoretically grounded in both signaling theory and the resource-based view. In emerging markets where information asymmetry is high, binary governance and sustainability indicators function as threshold signals that reveal whether a minimum level of organizational capacity and internal control infrastructure exists. Therefore, even if these signals are coarse, they may still contain limited but non-negligible predictive value for distinguishing firms with stronger future RoE performance. This theoretical link clarifies why binary disclosures are evaluated as potential classification inputs in the present study.
Ultimately, this study does not merely measure the impact of sustainability committees and sustainability reporting on RoE (financial performance) at the level of correlation; it also assesses the extent to which these internal control mechanisms and corporate disclosures provide a distinctive signal for investors/potential investors through classification success. In this respect, the study proposes a new methodological approach for analyzing sustainability committees and reporting, particularly in emerging markets.
This study makes three key contributions to the literature regarding the relationship between financial performance and sustainability reporting. First, it presents an innovative machine learning classification framework in which investors can systematically identify companies with high RoE in the stock market using binary (yes/no) management and reporting signals, rather than traditional econometric analyses using ESG scores. Second, it employs a TreeSHAP-based Explainable Artificial Intelligence (XAI) approach to explain model decisions, ensuring the interpretability and transparency of results. Third, the finding of low prediction performance empirically demonstrates that the mere presence of reporting practices, without regard to their quality, is inadequate, and it provides evidence that can inform future regulation.
3. Methodology
3.1. Model Specification and Hyperparameter Settings
This study employs three tree-based classification algorithms—Random Forest, XGBoost, and LightGBM—to predict whether a firm’s first-quarter RoE for 2025 falls into the high (H) or low (L) category. All models were implemented in Python 3.11 using scikit-learn (version 1.4), XGBoost (version 2.0), and LightGBM (version 4.1). To ensure full reproducibility, all procedures used a fixed random_state = 42, and the same 70/30 stratified train–test split was applied across algorithms.
A consistent model tuning framework was adopted to avoid overfitting and to ensure comparability. Hyperparameters were optimized via a 5-fold stratified cross-validation scheme on the training set. For XGBoost and LightGBM models, early stopping (patience = 50 rounds) was applied during tuning. The grids explored for each algorithm were as follows:
Random Forest:
n_estimators: {200, 400, 600}
max_depth: {None, 5, 10, 20}
min_samples_split: {2, 5, 10}
min_samples_leaf: {1, 2, 4}
max_features: {“sqrt”, “log2”}
XGBoost:
n_estimators: {300, 500, 800}
learning_rate: {0.01, 0.05, 0.10}
max_depth: {3, 4, 6}
subsample: {0.7, 0.9, 1.0}
colsample_bytree: {0.7, 0.9, 1.0}
gamma: {0, 1}
LightGBM:
n_estimators: {300, 500, 800}
learning_rate: {0.01, 0.05, 0.10}
max_depth: {-1, 4, 6}
num_leaves: {15, 31, 63}
feature_fraction: {0.7, 0.9, 1.0}
bagging_fraction: {0.7, 0.9, 1.0}
bagging_freq: {0, 1}
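To make the size of the search space concrete, the grids listed above can be written as plain Python dictionaries and expanded. This is only an illustrative sketch: the assignment of each grid to an algorithm is inferred from the parameter names, the helper `expand` is hypothetical, and the counts assume that each grid is searched exhaustively, which the text does not state.

```python
from itertools import product

# Grids transcribed from the text. Grouping by algorithm is an assumption
# inferred from the parameter names, not something stated explicitly.
GRIDS = {
    "RandomForest": {
        "n_estimators": [200, 400, 600],
        "max_depth": [None, 5, 10, 20],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        "max_features": ["sqrt", "log2"],
    },
    "XGBoost": {
        "n_estimators": [300, 500, 800],
        "learning_rate": [0.01, 0.05, 0.10],
        "max_depth": [3, 4, 6],
        "subsample": [0.7, 0.9, 1.0],
        "colsample_bytree": [0.7, 0.9, 1.0],
        "gamma": [0, 1],
    },
    "LightGBM": {
        "n_estimators": [300, 500, 800],
        "learning_rate": [0.01, 0.05, 0.10],
        "max_depth": [-1, 4, 6],
        "num_leaves": [15, 31, 63],
        "feature_fraction": [0.7, 0.9, 1.0],
        "bagging_fraction": [0.7, 0.9, 1.0],
        "bagging_freq": [0, 1],
    },
}

N_FOLDS = 5  # 5-fold stratified cross-validation, as described in the text


def expand(grid):
    """Return every hyperparameter combination in a grid as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


for name, grid in GRIDS.items():
    n_candidates = len(expand(grid))
    print(f"{name}: {n_candidates} candidates, {n_candidates * N_FOLDS} CV fits")
```

Under an exhaustive search, the LightGBM grid is the largest of the three, which is worth keeping in mind when comparing tuning effort across algorithms.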
Model performance was evaluated using Accuracy, Precision (Macro), Recall (Macro), F1 (Macro), AUC, and Balanced Accuracy. All models were evaluated using a standard decision threshold of θ = 0.50, which classifies a firm into the high (H) class if the predicted probability p(H) ≥ 0.50. The majority-class baseline model was computed using the same test set to provide a transparent reference point. All SHAP analyses were conducted with the shap package (version 0.44), using the TreeExplainer for tree-based models.
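As a minimal sketch of how the threshold rule and the threshold-based metrics fit together, the following pure-Python example maps predicted probabilities to classes at θ = 0.50 and derives Accuracy, macro-averaged Precision/Recall/F1, and Balanced Accuracy from the confusion counts. The function names and the toy inputs are hypothetical, and AUC is left out because it is computed from the ranking of probabilities rather than from a single threshold.

```python
def confusion_counts(y_true, p_high, threshold=0.50):
    """Count TP, TN, FP, FN with High (1) as the positive class."""
    tp = tn = fp = fn = 0
    for y, p in zip(y_true, p_high):
        pred = 1 if p >= threshold else 0  # p(H) >= threshold -> High
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 0 and y == 0:
            tn += 1
        elif pred == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def _ratio(num, den):
    """Safe division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def summary_metrics(tp, tn, fp, fn):
    """Threshold-based metrics for a binary High/Low classification."""
    prec_h, rec_h = _ratio(tp, tp + fp), _ratio(tp, tp + fn)
    prec_l, rec_l = _ratio(tn, tn + fn), _ratio(tn, tn + fp)
    f1_h = _ratio(2 * prec_h * rec_h, prec_h + rec_h)
    f1_l = _ratio(2 * prec_l * rec_l, prec_l + rec_l)
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision_macro": (prec_h + prec_l) / 2,
        "recall_macro": (rec_h + rec_l) / 2,
        "f1_macro": (f1_h + f1_l) / 2,
        # In the binary case, balanced accuracy equals macro-averaged recall.
        "balanced_accuracy": (rec_h + rec_l) / 2,
    }


# Toy, made-up inputs purely for illustration.
y_true = [1, 1, 0, 0]
p_high = [0.90, 0.50, 0.40, 0.60]
metrics = summary_metrics(*confusion_counts(y_true, p_high))
print(metrics)
```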
This specification provides a reproducible and transparent model card for the analysis.
3.2. Data Description, Variable Definitions, and Thresholding Rules
The aim of this analysis is to develop a machine-learning framework grounded in the presence of sustainability and governance indicators to classify the return on equity (RoE) levels of firms listed on BIST in an emerging capital market setting into high/low (H/L) categories and to quantify the discriminative power of these features. By comparing alternative machine learning algorithms used for classification, the study seeks to obtain generalizable and time-robust classification performance and, through explainability analyses, to identify which compliance elements systematically separate RoE classes.
Whereas the sustainability and financial performance literature has largely focused on regression and ESG score-based tests of association, the predictive classification perspective and the separating capacity of binary governance signals have been relatively underexamined. Yet, regulatory pressure and investors’ need for rapid, replicable screening make models that operate with low-cost, transparent input sets particularly valuable.
Within this study, the RoE of 427 BIST-listed firms is modelled using sustainability reporting and governance indicators as features. The choice to define the sample as 427 firms across multiple industries reflects several methodological considerations: the pursuit of external validity and generalizability, regulatory and accounting comparability, data accessibility and disclosure quality, and the desire to increase empirical power. All data were obtained from the Public Disclosure Platform (KAP) of the Capital Markets Board of Türkiye (SPK).
The dependent variable, RoE, is measured as of Q1 2025 (t). The independent variables are firms’ sustainability disclosures and governance practices for the 2024 fiscal period (t − 1). Two considerations motivate this design. The first is temporal precedence and the avoidance of simultaneity bias: using explanatory variables from t − 1 to account for financial performance at t helps limit reverse causality and contemporaneous common shocks (e.g., market volatility, regulatory changes). Second, the inherent reporting lag in sustainability and annual reports means that investors, when assessing next-period financial outcomes, effectively rely on prior-period corporate practices. Accordingly, the information set used here aligns with investors’ realistic information access.
Although quarterly profitability measures may exhibit seasonal or sector-driven volatility, Q1 RoE was selected because it represents the most recent financial performance available to investors at the time of prediction. Using the immediately subsequent reporting period (t) is consistent with a forward-looking classification design, where sustainability disclosures from year t − 1 are used to predict early-period performance in year t. Nevertheless, the study acknowledges that future research may incorporate annual RoE, sector-adjusted profitability ratios, or multiple-period averages to assess the robustness of the observed patterns.
From a machine learning standpoint, this design also constrains prediction to the information set available to a real-world researcher/decision maker and, crucially, mitigates artificially inflated accuracies due to data leakage. Put differently, the temporal separation between 2024 sustainability indicators (features) and Q1-2025 RoE (target) prevents target-period information from inadvertently entering the model, thereby strengthening the reproducibility of findings. Numerous recent cases across scientific fields have shown that leakage can systematically overstate the results of ML-based studies. For this reason, a “lagged” information structure for prediction is methodologically recommended [18]. The model for this study is presented in Table 1.
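The lagged information structure described above can be sketched as a simple join in which every row pairs a firm's period t − 1 indicators with its period t target. The firm codes, indicator values, RoE figures and the helper `build_lagged_dataset` below are all hypothetical and serve only to illustrate the design.

```python
# Hypothetical 2024 (t - 1) indicators; firm codes and values are made up.
features_2024 = {
    "FIRM_A": {"ISUS": 1, "ICOM": 0, "IESG": 1, "IREP": 1, "IMNG": 1, "ISDK": 1},
    "FIRM_B": {"ISUS": 0, "ICOM": 0, "IESG": 0, "IREP": 1, "IMNG": 1, "ISDK": 1},
    "FIRM_C": {"ISUS": 1, "ICOM": 1, "IESG": 1, "IREP": 0, "IMNG": 1, "ISDK": 0},
}

# Hypothetical Q1 2025 (t) RoE values; FIRM_B has no target and is dropped.
roe_q1_2025 = {"FIRM_A": 0.12, "FIRM_C": -0.03, "FIRM_D": 0.08}


def build_lagged_dataset(features_prev, target_next):
    """Pair each firm's t - 1 features with its period-t target.

    Only firms present in both periods are kept, so no target-period
    information is ever used as a feature.
    """
    rows = []
    for firm in sorted(features_prev.keys() & target_next.keys()):
        rows.append((firm, features_prev[firm], target_next[firm]))
    return rows


dataset = build_lagged_dataset(features_2024, roe_q1_2025)
for firm, x_prev, y_next in dataset:
    print(firm, x_prev, y_next)
```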
Note for ISDK variable: In the Turkish corporate governance framework, this indicator reflects the presence of a board-level Audit Committee (Denetimden Sorumlu Komite) established in accordance with the Capital Markets Board Communiqué on Corporate Governance (II-17.1) and the annexed Corporate Governance Principles (Corporate Governance Principles Section 4.5, Principle 4.5.9). The Audit Committee is responsible for overseeing the integrity of financial reporting, internal control and compliance and, in practice, contributes to the oversight of non-financial and sustainability-related reporting and risks in coordination with other board committees. ISDK does not denote a separate, stand-alone “sustainability audit committee”.
As shown in Table 1, the median split was preferred because it is distribution-robust and less sensitive to extreme sector-specific profitability differences. Alternative thresholds such as quartile-based splits, industry-adjusted medians, or continuous modelling were also considered. However, given the cross-industry nature of the sample, a median-based rule provides the most interpretable and sector-neutral benchmark. Future robustness checks may incorporate alternative cut-offs to confirm that classification performance does not materially depend on the threshold definition.
Building on this thresholding structure, the binary rating of RoE relative to the median (H = 1/L = 0) and the coding of the sustainability indicators as Yes = 1/No = 0 are motivated by several considerations. First, median-based thresholding for RoE provides a classification that is less sensitive to extreme observations than the mean, thereby limiting the extent to which measurement error and volatility stemming from accounting policies degrade model performance. In addition, the sample spans firms from heterogeneous industries whose capital intensity and cyclicality structurally differentiate RoE levels. Defining “high/low” relative to the median yields a reference that is independent of sectoral composition and supports a directly interpretable output for policymakers and investors, e.g., “a firm exceeding the median exhibits relatively strong performance”.
Coding the independent variables on a present/absent basis furnishes an objective criterion grounded in regulation and publicly disclosed reports, reducing coder subjectivity and measurement error compared with graded rating scales. These indicators also function as threshold signals of whether a minimum governance infrastructure is in place. Accordingly, the binary (present/absent) scheme is designed to distinguish between firms that surpass the minimum corporate capacity threshold and those that do not.
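The two coding rules described above can be illustrated with a short sketch. It assumes that a firm is labelled High only when its RoE strictly exceeds the sample median (the handling of ties is not specified in the text), and both the function names and the RoE values are hypothetical.

```python
from statistics import median


def code_indicator(answer):
    """Map a Yes/No disclosure to 1/0; anything else is rejected."""
    normalized = answer.strip().lower()
    if normalized == "yes":
        return 1
    if normalized == "no":
        return 0
    raise ValueError(f"Unexpected indicator value: {answer!r}")


def median_split(roe_values):
    """Return the median cut-off and H = 1 / L = 0 labels.

    Assumption: firms exactly at the median are coded Low (strict '>').
    """
    cut = median(roe_values)
    return cut, [1 if value > cut else 0 for value in roe_values]


# Made-up RoE values for five hypothetical firms.
roe = [0.02, 0.10, -0.05, 0.30, 0.07]
cut, labels = median_split(roe)
print(cut, labels)
print([code_indicator(a) for a in ["Yes", "No", "yes"]])
```

Because the cut-off is the sample median, a single extreme RoE value (such as the 0.30 above) leaves the labels of the other firms unchanged, which is the robustness property the text appeals to.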
4. Analyses and Empirical Results
This section presents the empirical results in four steps. We begin with descriptive counts by variable to characterize the sample and class distributions. We then assess marginal (bivariate) associations between the high/low RoE classes and each binary sustainability/governance indicator using the Chi-square test and Cramér’s V. Next, we compare the out-of-sample performance of three tree-based classifiers (Random Forest, XGBoost and LightGBM) under a common evaluation protocol. Finally, we provide tree-specific explanations via TreeSHAP to identify which disclosures and governance features contribute most to the classification decision, thereby linking predictive evidence to interpretable signals relevant for investors and policymakers.
Table 2 reports the number of firms for the RoE class distribution and for the binary governance/sustainability indicators. For the dependent variable, firm counts are presented under the Low–High RoE classes; for the independent variables, counts are presented under the No–Yes codes, thereby documenting the binary coding scheme used in the analysis.
In the second step of the analysis, to evaluate the marginal (bivariate) association between RoE (H/L) and the binary sustainability indicators (ISUS, ICOM, IESG, IREP, IMNG, ISDK), we employ the Chi-square test of independence and Cramér’s V as an effect-size measure. The Chi-square test addresses whether the RoE class and each binary indicator are independent, whereas Cramér’s V quantifies the strength (intensity) of any association. In short, the Chi-square p-value tests “Is there an association?”, while Cramér’s V answers “How strong is the association?”. Table 3 summarizes the results of this analysis.
Bivariate associations between the binary indicators and the RoE high/low class were assessed using Pearson’s Chi-squared tests. All χ² statistics for 2 × 2 tables were computed without Yates’ continuity correction, and the corresponding Cramér’s V is reported as the effect size. Tests were performed in SPSS (version 20).
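For readers who wish to reproduce this step outside SPSS, the following sketch computes the uncorrected Pearson χ² statistic, its p-value and Cramér’s V for a single 2 × 2 table using only the standard library. The function name and the example counts are hypothetical, not the study’s data, and the p-value uses the fact that, for one degree of freedom, the χ² tail probability equals erfc(√(χ²/2)).

```python
import math


def pearson_chi2_2x2(table):
    """Pearson chi-square (no Yates correction), p-value and Cramér's V.

    `table` is [[a, b], [c, d]]; all row and column totals must be positive.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Survival function of the chi-square distribution with df = 1.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    # For a 2 x 2 table min(r - 1, c - 1) = 1, so V = sqrt(chi2 / n).
    cramers_v = math.sqrt(chi2 / n)
    return chi2, p_value, cramers_v


# Made-up counts: rows = Low/High RoE, columns = indicator No/Yes.
example_table = [[120, 94], [100, 113]]
chi2, p_value, cramers_v = pearson_chi2_2x2(example_table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, V = {cramers_v:.3f}")
```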
Table 3 indicates weak statistical evidence only for ISUS at the p ≤ 0.10 level, with a small effect size (Cramér’s V, approximately 0.08). For the other variables, no marginal relationship was observed in terms of either significance or effect size. This result indicates that bivariate, surface-level relationships are limited and therefore motivates the use of multivariate machine learning algorithms that can capture interactions.
This study addresses a binary classification task, where firms are classified into High RoE (1) versus Low RoE (0) using binary (0/1) sustainability and governance indicators for a sample of 427 firms. In this context, tree-based ensemble methods, namely Random Forest (bagging) and XGBoost and LightGBM (boosting), are selected because, for medium-scale tabular data, they offer a well-established balance of predictive performance, flexibility, and interpretability. These approaches explicitly capture non-linear patterns and feature interactions, require no scaling/standardization, handle binary/categorical attributes natively, and yield explainability outputs via feature importance [19, 34, 36]. Accordingly, they are well suited to uncover the non-linear and interactional relations between financial indicators and sustainability disclosures.
Random Forest is a bagging-based ensemble classifier that aggregates predictions from many feature-subsampled decision trees via voting/averaging, reducing inter-tree correlation and thereby lowering generalization error [34]. XGBoost is a scalable, regularized implementation of gradient-boosted trees; with sparsity-aware splitting, weighted-quantile sketch and system-level optimizations, it achieves high accuracy and speed on large tabular datasets [35]. LightGBM accelerates histogram-based gradient boosting using GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling) and, with a leaf-wise growth strategy that favours deeper trees, provides an efficient and memory-friendly boosting library [37].
The principal rationale for employing this trio is that bagging (Random Forest) and boosting (XGBoost/LightGBM) exhibit different bias–variance profiles; comparing all three on the same dataset allows us to assess robustness and reduce the risk that findings are algorithm-specific [38, 39].
All three models were trained using a five-fold cross-validation procedure on the training set to reduce overfitting and ensure generalizable performance. Hyperparameters were tuned using a restricted grid search focusing on commonly influential parameters such as maximum tree depth, number of estimators, learning rate, and subsampling ratios. These optimization steps follow standard machine-learning practice and help ensure that the reported model performance is not driven by arbitrary parameter choices.
In this study, the classification pipeline ingests firm-level binary governance/sustainability indicators from period t − 1 (Yes = 1, No = 0) as inputs and aims to predict the RoE class (high vs. low) at period t. Random Forest aggregates parallel trees by majority voting/averaging, whereas XGBoost and LightGBM build trees sequentially via additive updates to minimize a regularized loss. Each model produces the class probability P(High RoE), which, unless otherwise stated, is mapped to a predicted class using a decision threshold θ = 0.50. Performance is evaluated on a held-out test set (30%) using Accuracy, macro-Precision/Recall, and macro-F1. The process is depicted in Figure 1.
The analysis findings are presented in Table 4.
As shown in Table 4, all models yield extremely limited discrimination power, with ROC(AUC) values ranging from 0.48 to 0.53. In practical terms, these AUC levels are almost indistinguishable from random classification (AUC = 0.50). XGBoost performs only marginally above the majority-class baseline (AUC = 0.526 vs. 0.500), while LightGBM and Random Forest remain at or below random-classification levels. Accuracy, balanced accuracy, and macro-averaged F1 scores are all tightly clustered around the majority-class prevalence, indicating that the models do not provide a practically usable screening tool for distinguishing high-RoE from low-RoE firms. Instead, the results should be interpreted as evidence that, under current reporting practices, binary governance/sustainability indicators carry at most a very weak and unstable predictive signal for future RoE.
Beyond these supplementary indicators, Table 4 also reports the primary metrics used in the study. According to the results reported in Table 4, XGBoost ranks first regarding accuracy (0.5116). Random Forest (RF) attains an accuracy of 0.4884, and LightGBM yields 0.4731. Macro-averaged F1 scores confirm the same ordering (XGBoost 0.5100; RF 0.4869; LightGBM 0.4670). Macro-averaged Precision/Recall are similar across models (LightGBM: 0.4708/0.4711; XGBoost: 0.5120/0.5120; RF: 0.4879/0.4884), indicating that class-imbalance weighting does not distort the metrics.
For benchmarking, model performance is compared against the majority-class baseline. This reference corresponds to the accuracy obtained by predicting the most frequent class in the test set for all observations, a widely used and recommended practice to make the impact of class distribution explicit, especially when reporting accuracy [40]. As noted, the data are split into 70% training/30% test. With a total of 427 observations, the test set contains 427 × 0.30 = 128.1, which rounds up to 129 observations. The accuracy of the majority-class baseline equals the prevalence of the majority class in the test set: when every observation is predicted as that class, the proportion correctly classified coincides with that prevalence. Given the definition of accuracy, Accuracy = (TP + TN)/(TP + TN + FP + FN) [41, 42], the baseline accuracy for a reference model that assigns all cases to the majority class is max(π, 1 − π) in the binary case, where π denotes the share of High RoE firms in the test set; that is, it equals the majority-class prevalence. Given the test-set prevalence, the majority-class baseline is approximately 0.504. Accordingly, XGBoost performs slightly above the majority-class reference, whereas Random Forest and LightGBM are at or below that level. This finding indicates that, for this sample, separating RoE (H/L) based on sustainability indicators is a difficult task and that the underlying signal is weak. Nevertheless, the modest improvement in binary classification is consistent with the small effect sizes observed in the earlier Chi-square/Cramér’s V analyses. Taken individually, the binary sustainability indicators appear to have limited discriminative power for RoE (H/L). Even so, because tree-based methods can capture feature interactions and non-linear patterns, XGBoost is able to surpass the majority-class baseline.
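The baseline arithmetic can be checked with a few lines of Python. This is a sketch under stated assumptions: the test size is rounded up, which matches the 129 observations quoted in the text, and the counts of 65, 66 and 63 firms are inferred here from the reported figures (0.504, 0.5116 and 0.4884) rather than taken from the paper.

```python
import math

N_FIRMS = 427
TEST_FRACTION = 0.30

# 427 * 0.30 = 128.1; rounding up gives the 129 test observations in the text.
n_test = math.ceil(N_FIRMS * TEST_FRACTION)


def majority_baseline_accuracy(n_high, n_total):
    """Accuracy of always predicting the more frequent class: max(pi, 1 - pi)."""
    pi = n_high / n_total
    return max(pi, 1.0 - pi)


# Assumption: a baseline of about 0.504 on 129 firms corresponds to a
# 65 / 64 split between the two classes (inferred, not reported).
baseline = majority_baseline_accuracy(65, n_test)
print(n_test, round(baseline, 3))

# The reported accuracies are consistent with whole numbers of correct
# predictions out of 129 (again an inference from the rounded figures).
print(round(66 / n_test, 4), round(63 / n_test, 4))
```

Read this way, the gap between the best model and the baseline amounts to roughly one additional correctly classified firm in the test set, which underlines how small the reported improvement is.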
Overall, the findings do not reveal a robust relationship between the binary sustainability indicators and RoE classes. Any predictive signal, if present at all, appears to be extremely weak, fragile, and of negligible practical importance for out-of-sample classification. The error pattern is consistent with the class balance provided by stratified partitioning and implies that no model overfits to a single class, with errors distributed across both classes.
To enable readers to directly examine class-wise prediction errors, Table 4 also presents the confusion matrices (Panels A–B) for the best-performing model (XGBoost) and for the majority-class baseline on the held-out test set. These matrices provide a transparent view of false positives and false negatives for each RoE class and complement the summary metrics reported in the same table.
Step 3 reveals a notable finding: XGBoost is the only model that outperforms the majority-class reference model, albeit marginally. Therefore, in Step 4, Shapley Additive Explanations (SHAP) analysis is applied to the XGBoost model to see which variables influence the classification decision and to what extent, and to examine the marginal contributions of these variables. In short, Step 4 analyzes the model’s decision by breaking it down “feature by feature”.
SHAP is an explainability framework based on game theory that decomposes a single prediction produced by any machine learning model into feature-level marginal contributions, i.e., Shapley values, using an additive decomposition structure. It is the unique additive attribution method that simultaneously satisfies desirable properties such as local accuracy, consistency, and missingness. Thanks to these qualities, it serves as an umbrella approach that brings different explanatory techniques together on common ground [43, 44].
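The additive decomposition can be made concrete with a brute-force Shapley computation on a toy model. The sketch below is not TreeSHAP and does not use the fitted XGBoost model: the scoring function `toy_model`, its coefficients and the all-zero baseline are hypothetical choices made only to show that the attributions sum to the difference between the prediction and the baseline output (local accuracy).

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    """Hypothetical score with one interaction term (features 0 and 2)."""
    return 0.5 + 0.1 * x[0] - 0.2 * x[1] + 0.3 * x[0] * x[2]


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating all feature coalitions.

    Features outside a coalition are set to their baseline value.
    Returns the attributions and the baseline (empty-coalition) output.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi, value(set())


x = [1, 1, 1]         # a hypothetical firm with all three indicators present
baseline = [0, 0, 0]  # reference point: all indicators absent
phi, base_value = shapley_values(toy_model, x, baseline)
print(phi, base_value, toy_model(x))
```

In this toy case the interaction term is shared equally between features 0 and 2, which illustrates how a binary indicator can receive credit that depends on the presence of another indicator.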
It is important to note that SHAP values are model-based, game-theoretic attributions rather than test statistics. In this study, SHAP is used purely as a descriptive explainability tool to summarize the magnitude and direction of feature contributions to the predicted probability of High RoE. We therefore do not attach p-values, confidence intervals, or formal notions of ‘statistical significance’ to SHAP values; inferential statements remain restricted to the classical tests reported in Step 2.
Because XGBoost is tree-based, we employ TreeSHAP, which yields exact, model-consistent attributions for tree ensembles. In addition to overall predictive accuracy, we thus identify which sustainability indicators increase or decrease the probability of High RoE, both at the global level (via the ranking of mean |SHAP|) and at the local level (firm-specific explanations). This dual perspective enhances transparency and supports interpretation and policy inference in accounting and finance.
Figure 2 presents TreeSHAP’s mechanics, exemplified by the IMNG variable.
The results obtained from the SHAP analysis are provided in Table 5.
Table 5 presents the SHAP analysis, which summarizes the relative contributions of the six corporate-governance indicators to the High RoE predictions. Ranking variables by their mean |SHAP| values shows that IMNG and IREP are the most influential features in the XGBoost model, followed by ISUS and ISDK, whereas IESG has the lowest SHAP importance. Thus, IMNG and IREP account for a large share of the model-based variation in the predicted probability of belonging to the High RoE class, even though their average directional effects differ.
The signed mean SHAP values complement this ranking by indicating the predominant direction of each variable’s contribution. For IMNG, the signed mean SHAP value is negative and only about 31% of its local SHAP contributions are positive; in approximately 69% of the cases IMNG reduces the predicted probability that a firm belongs to the High RoE group. In other words, IMNG is a strong but predominantly negative governance signal for current High RoE outcomes in our sample. By contrast, IREP tends to produce positive SHAP contributions for a larger share of observations, which is consistent with stronger internal reporting practices being associated with a higher probability of High RoE.
The signed mean SHAP value for ISUS is positive (approximately 0.0187) and its positive SHAP ratio is above 50% (52.35%), indicating a small but predominantly positive contribution to the predicted probability of High RoE. By contrast, IESG has a slightly negative signed mean SHAP value (approximately −0.0015), while its positive SHAP ratio remains relatively high (62.08%). This combination suggests a mixed pattern in which IESG contributes positively in a majority of observations, but a smaller subset of firms exhibits negative contributions that are large enough to pull the overall mean contribution marginally below zero. In contrast, although the signed mean SHAP value of ISDK is negative (−0.0181), its positive SHAP ratio is quite high (79.87%), suggesting that ISDK increases the model’s predicted probability of High RoE in most observations, while the overall average becomes slightly negative due to comparatively larger negative contributions in a smaller subset of cases.
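The three summary quantities discussed above (mean |SHAP|, signed mean SHAP and the positive SHAP ratio) can be computed from a vector of local SHAP values as in the following sketch. The helper name and the ten example values are made up; they are chosen only to reproduce the qualitative pattern described for ISDK, in which most contributions are positive but a few larger negative ones pull the mean below zero.

```python
def summarize_shap(values):
    """Global summaries of the local SHAP values for one feature."""
    n = len(values)
    return {
        "mean_abs": sum(abs(v) for v in values) / n,             # global importance
        "signed_mean": sum(values) / n,                          # average direction
        "positive_ratio": sum(1 for v in values if v > 0) / n,   # share of positive values
    }


# Made-up local SHAP values: eight small positive, two larger negative.
example_values = [0.01] * 8 + [-0.10] * 2
summary = summarize_shap(example_values)
print(summary)
```

Reporting the positive ratio alongside the signed mean therefore guards against reading a slightly negative average as evidence that a feature lowers the predicted probability for most firms.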
Figure 3 shows the SHAP beeswarm plot and dependency plots for IMNG, IREP and ICOM.
Figure 3 displays the SHAP beeswarm plot used to explain the model alongside the dependency plots for the ICOM, IMNG, and IREP variables. In the beeswarm plot at the top, each point represents an observation, the x-axis denotes the SHAP value (marginal contribution to the model output), and the colour scale indicates the level of the corresponding variable at the observation level. The spread of the points along the horizontal axis reveals that the effect of the variables on the prediction is distinctly heterogeneous in both direction and magnitude. For example, for IMNG and IREP, high values (red dots) are mostly associated with negative and positive SHAP values, respectively, whereas in some observations the same colour tones also have SHAP values with opposite signs. Similarly, when the ICOM variable is high, it produces both positive and negative SHAP values, meaning that the same variable level can affect the prediction in a positive or negative manner in different observations. The dispersion of points of the same level (similar colour) to both the right and left suggests that the effect of the variable in question arises in interaction with other explanatory variables, which implies that the assumption of a univariate linear effect is insufficient to capture the true data structure.
In the dependency graphs presented below, the SHAP distributions are distinctly separated for each level of the binary-coded ICOM, IMNG and IREP variables, but there is substantial dispersion within these distributions. For example, in the case of IREP = 1, although SHAP values are mostly concentrated in the positive region, there are also non-trivial negative contributions within the same group. Similarly, for IMNG and ICOM, it is noteworthy that positive and negative contributions coexist within each category. This structure implies that the multivariate contributions of these variables are relatively small on average, yet systematically non-zero at the level of individual observations. When such heterogeneous effects with opposite signs are aggregated, classical marginal analyses (e.g., univariate regression coefficients or group mean-difference tests) may report the average effect as ‘non-existent’ or negligible, because they simply average out local effects that operate in different directions across subgroups and interaction regimes. In contrast, the SHAP-based multivariate explainability approach disaggregates local contributions for each observation, thereby revealing heterogeneity and interaction patterns and showing that relationships which appear weak in marginal analyses can in fact be complex, non-linear and context-sensitive.
6. Conclusions
This study presents a prediction framework that classifies the 2025 first-quarter RoE values of 427 companies listed on the BIST as “high/low” using sustainability/governance statements (indicated by yes/no codes) for the 2024 period. The XGBoost, LightGBM, and Random Forest models were compared on the same 70/30 stratified test sample, and the results regarding accuracy metrics were centered around the majority class-based reference value for all three algorithms. SHAP explanations were analyzed using the XGBoost model, which yielded the highest accuracy rate among the three algorithms used in the analysis phase. These analysis findings revealed that the indicators “Corporate Governance Report”, “Integrated Report” and “Sustainability Committee” rank higher in terms of SHAP-based global importance in the High RoE classification model. However, the associated average SHAP effects remain small, and the directions of their contributions are heterogeneously distributed across firms. This overall picture shows that the binary (yes/no) explanations of the independent variables used were not sufficient to strongly distinguish RoE classes on their own but instead produced weak but context-sensitive signals.
The findings also reveal the relative strengths and weaknesses of sustainability tools. While corporate reporting indicators generate some signals, these effects appear to be small and heterogeneous, and the overall predictive power remains very limited. Based on these results, we do not interpret the current set of sustainability and governance indicators as a practically useful basis for RoE prediction. A corporate sustainability architecture that is more closely aligned with financial performance targets would require substantial improvements in the depth of content, governance quality, and data quality of sustainability practices. Only under such conditions might these tools eventually become more effective in supporting value creation and responding to stakeholder demands, and this possibility should be viewed as conditional and speculative rather than as a direct implication of our empirical results.
From a stakeholder perspective, the findings also offer several practical implications. For policy makers, the results highlight the need for more standardized, content-rich, and verifiable sustainability reporting frameworks. For corporate managers, the study underscores that the mere existence of committees or reports is insufficient unless supported by substantive governance quality and deeper disclosure practices. For investors, the weak but interpretable signals identified through SHAP suggest that report quality and underlying governance capacity should be evaluated beyond simple disclosure presence. For civil society organizations, the findings reinforce the importance of transparency, accountability, and independent assurance in strengthening the informational value of sustainability reporting. Finally, for the academic community, the study identifies promising avenues for integrating textual ESG analytics, hybrid modelling, and cross-market comparisons into future research.
Finally, studies in the literature have reached similar findings [14,15,16,17,20], and it is understood that sustainability indicators alone do not provide generalizable discriminatory power for all sectors; the effects are largely sensitive to sector/country conditions, the existence of corporate committees on this issue, and the quality of reports.
Using a broad sample, this study contributes to the discussion in the context of Turkey, a developing country, regarding methodology and application through comparative tree-based models and explainability tools. The findings indicate that the potential of sustainability committees and sustainability reports to enhance value and accountability capacity must be realized not only through formal existence, but primarily through substantive depth and governance quality. However, our results also suggest that these practices can have modest and, in some cases, negative short-term impacts on financial performance, which is consistent with the predominantly negative SHAP contributions of the most influential governance indicator to the High RoE class in our sample.
When all findings are evaluated together, the main result of this study is that, in their current form, binary sustainability and corporate governance signals have limited power to reliably distinguish RoE classes. However, these results do not overshadow the methodological strength of our study and its contribution to the literature. On the contrary, our framework, based on objective signals and a prediction-based classification method, suggests that current reporting practices may not always provide sufficiently rich or reliable market signals, particularly when disclosures are limited to simple presence/absence indicators. Furthermore, the use of SHAP analysis transparently shows which governance signals the model relies on most in predicting financial outcomes, providing an interpretable perspective for investor decision-making processes. Taken together, these findings demonstrate that binary governance and sustainability signals possess limited standalone explanatory power, yet remain theoretically meaningful within signaling and resource-based perspectives. The results therefore clarify how minimal disclosure structures function in emerging markets and highlight where more substantive reporting becomes necessary for disclosures to carry predictive value. Overall, the study contributes empirical evidence that helps bridge theoretical expectations with the practical performance of simple corporate disclosure mechanisms.
Limitations and Directions for Future Research
This study is subject to several limitations that provide opportunities for future research. First, the sustainability and governance indicators are coded in a binary (yes/no) format, which restricts the granularity of information captured from corporate reports. As a result, subtle differences in reporting depth, content quality, or governance practices are not reflected in the models. Future studies may incorporate richer textual or quantitative ESG data, including report length, narrative tone, disclosure specificity, or externally assured metrics. Second, model performance remains close to the majority-class benchmark across all algorithms, suggesting that the predictive signal contained in binary disclosures is inherently weak. This highlights the value of exploring alternative modelling strategies such as textual embeddings, topic modelling, or hybrid quantitative–qualitative approaches. Third, the use of quarterly RoE as the outcome variable may introduce noise due to seasonality and short-term fluctuations; future research can evaluate whether annual profitability metrics, multi-period averages, or sector-adjusted returns yield improved discriminative power. Fourth, the empirical design focuses on firms listed on a single emerging market exchange (BIST), which may limit generalizability. Comparative analyses involving multiple countries or broader ESG regimes would help assess the extent to which the findings generalize across institutional settings. Finally, future research could complement the present tree-based analysis with more traditional linear benchmarks, such as parsimonious or penalised logistic regression models, to further assess the robustness of the results.
Overall, the magnitude of these effects remains modest and heterogeneous, and the predictive improvement over a naïve majority-class baseline is slight.