A Design Science Approach to Predicting ESG Performance Using Ensemble Machine Learning

Ibrahim, Yara; Hussainey, Khaled; Moawad, Taghred Mokhtar Sayed

doi:10.3390/ijfs14050133


by Yara Ibrahim ¹, Khaled Hussainey ²,* and Taghred Mokhtar Sayed Moawad ²

¹ Investment and Finance Department, Faculty of International Business and Humanities, Egypt–Japan University of Science and Technology, Alexandria 21934, Egypt
² The Albert Gubay Business School, Bangor University, Bangor LL57 2DG, UK
* Author to whom correspondence should be addressed.

Int. J. Financial Stud. 2026, 14(5), 133; https://doi.org/10.3390/ijfs14050133 (registering DOI)

Submission received: 12 April 2026 / Revised: 3 May 2026 / Accepted: 12 May 2026 / Published: 19 May 2026


Abstract

Environmental, Social, and Governance (ESG) metrics have become a cornerstone of sustainable finance, yet their measurement and predictability remain constrained by data heterogeneity, methodological divergence, and disclosure bias. This study develops a comprehensive ESG prediction framework grounded in the Design Science Research paradigm, integrating advanced machine learning techniques with rigorous data preprocessing, feature selection, and temporal validation. Using firm-level data from Refinitiv and Bloomberg, the analysis distinguishes between ESG composite performance and disclosure-based robustness, addressing a critical gap in the literature. Ensemble learning models, including Random Forest and XGBoost, are evaluated alongside deep learning architectures using multiple sampling strategies and rolling-window validation. The results demonstrate that ESG performance is moderately forecastable, with ensemble methods consistently outperforming neural networks in structured datasets. In contrast, disclosure robustness exhibits lower predictability, reflecting its dependence on discretionary strategic reporting and institutional factors. The findings highlight the importance of data quality, model selection, and validation design in ESG analytics, while emphasizing the limitations of deep learning in tabular financial contexts. The integration of explainable artificial intelligence further enhances interpretability by identifying key predictors of ESG outcomes. Overall, the study contributes to the literature by providing a robust, interpretable, and methodologically rigorous framework for ESG prediction, with implications for investors, regulators, and corporate decision-making.

Keywords: sustainable finance; machine learning; ESG prediction; design science research; interpretability

1. Introduction

Embedding ESG into financial products and corporate strategy is now critical to how society perceives and derives value from economic opportunities. ESG is a multidimensional index that describes a firm’s environmental, social, and governance performance (Khamis et al., 2025; Pedersen et al., 2021). These constructs are usually captured through composite scores created by third-party data providers. This concept builds on prior research, such as socially responsible investing and corporate social responsibility (CSR), which emerged in the second half of the twentieth century (Eccles et al., 2014). Since then, ESG has been increasingly regarded as an investment decision-making tool and has been incorporated into risk management frameworks for financial systems and the regulatory context.

ESG assessment remains complex, as many of its subcomponents emerge from multiple socio-economic measures and variations in their measurement (Berg et al., 2022). In addition, issues with data quality make it difficult to assess ESG risks accurately.

Research has documented a positive correlation between strong ESG ratings and better operational performance (Friede et al., 2015; Khan et al., 2016), and firms with high sustainability ratings have been found to experience lower capital costs (Eccles et al., 2014; El Ghoul et al., 2011) as well as greater resilience during market shocks (Herawati et al., 2024; Zikriani et al., 2025). At the same time, regulations and policies are being developed to align sustainable economic growth goals with ESG factors.

However, uncertainty remains whether ESG scores are comparable and useful; methodological differences, data sourcing, or even the assumptions behind estimates might lead to significant discrepancies (Berg et al., 2022; D. M. Christensen et al., 2021; H. B. Christensen et al., 2021). These discrepancies raise fundamental questions about the mechanisms investors use to interpret and apply environmental, social, and governance (ESG) ratings in their investment decisions. Consequently, the research literature has increasingly adopted advanced analytical methodologies to enhance the accuracy of ESG measurement. These methods are significant for their superior ability to process high-dimensional data and for their effectiveness in modeling the complex, nonlinear relationships inherent in this data (Dossa et al., 2025; Patel et al., 2026; Shin et al., 2024).

Accordingly, this study provides a comprehensive and rigorous framework for ESG prediction. The analysis is conducted at the firm level and incorporates both financial and non-financial variables. It also addresses key methodological issues such as target leakage and multicollinearity.

This study addresses three research questions. (1) To what extent can ML models successfully predict various levels of ESG performance and disclosure? (2) Do predictive performance and robustness differ across datasets when comparing different sampling approaches? (3) What are the main drivers of ESG performance?

This research makes several contributions. The first is the use of advanced machine learning methods, combined with large-scale data preprocessing and validation, to enhance the accuracy and generalizability of ESG prediction. It also draws an important distinction between ESG composite scores and disclosure-based indicators. It brings interpretability methods to explain the drivers of ESG performance and connects the findings to economic theories. These contributions highlight the need to develop and verify the quality and transparency of ESG analytics in a short time period.

The novelty of this approach lies in integrating these elements into a unified research framework. While multiple machine learning approaches to this problem have been studied in the literature, to the best of our knowledge, no prior studies have specifically investigated ESG prediction using feature selection, hyperparameter tuning, temporal validation, and interpretability analysis. Meanwhile, the strength of ESG disclosure is also a pressing issue, as it is associated with information reliability (Yu et al., 2020). However, the robustness of ESG disclosure has received limited attention.

This study systematically integrates standard components, such as ensemble learning and SHAP, into a cohesive ESG prediction pipeline. The pipeline applies current analytical methods across its components while guarding against data leakage and temporal dependence, and it maintains a clear distinction between ESG performance and disclosure.

The remainder of this paper is organized as follows. Section 2 reviews studies on ESG ratings and machine learning in sustainable investing. Section 3 presents the empirical methodology, including data sources and preprocessing, model design, and validation. The results section presents the main findings. The paper concludes by summarizing the main findings, limitations, and significance of the results, followed by recommendations for future research.

2. Literature Review

2.1. ESG Measurement, Data Quality, and Theoretical Foundations

Companies often employ ESG scores as proxies for non-financial performance and long-term value creation; however, these are not objective measures. Instead, these indicators are affected by choices regarding methodology, data availability, and qualitative rating ranges (Berg et al., 2022; D. M. Christensen et al., 2021). This raises concerns about the validity of using these metrics as target variables in predictive models and about the accuracy of data quality measures.

Cross-provider divergence is a central topic in the literature. Berg et al. (2022) found that large discrepancies in ESG ratings arise from differing scopes, weights, and aggregation mechanisms, as ESG scores are constructed from inherently noisy and biased measures.

These limitations are compounded by the fact that corporate disclosures are based on self-reporting, which can be strategically influenced by managerial incentives. H. B. Christensen et al. (2021) and Yu et al. (2020) show that ESG ratings include information about firm performance, disclosure strategy, and greenwashing behavior. Hence, ESG indicators can be interpreted through signaling theory, agency theory, and institutional theory. All these frameworks suggest that ESG ratings are endogenous signals whose materiality is influenced by both information asymmetry and regulatory pressures (Dimaggio & Powell, 2021; Jensen & Meckling, 1976; Spence, 1973). Consequently, intrinsic properties of the underlying signal limit ESG predictability.

2.2. Predictors, Persistence, and Predictability of ESG Performance

The previous studies examining the predictors of ESG performance use firm-level linear models with explanatory variables including reporting size, profitability, governance, and industry structure (Eccles et al., 2014; Khan et al., 2016). These techniques rely on assumptions that introduce model specification bias and, as such, do not account for nonlinear or interaction effects.

A recent line of research highlights the temporal dimension of ESG scores, as sustainability investments take several years or even decades to yield results (Lins et al., 2015). Prediction in static settings limits the ability to capture dynamic relationships in the data. However, many studies rely on random sampling, an estimation approach that is subject to sample bias and is associated with overfitting (Gu et al., 2020; Ibrahim et al., 2026).

A primary emphasis of the literature is that ESG data are neither static nor homogeneous. Given the intrinsically high-dimensional nature and inherent collinearity of ESG metrics, these indicators often exhibit significant internal volatility, especially when subjected to sudden regulatory shifts (Berg et al., 2022). Consequently, achieving precise forecasting becomes a formidable task, necessitating the deployment of dynamic modeling frameworks that account for temporal dependencies.

2.3. Machine Learning in ESG Prediction: Comparative Evidence and Methodological Challenges

This section explores the different perspectives on machine learning methodologies that could prove beneficial in predicting ESG with limited data. Ensemble methods (e.g., Random Forests, gradient boosting) dominate these approaches because they can model high-dimensional nonlinear relationships and account for interaction effects (Breiman, 2001; T. Chen & Guestrin, 2016). Empirical evidence suggests that when such models are applied to prediction tasks, they generally outperform regression-based methods in nearly all settings.

Nonetheless, this advantage should be viewed with caution. Ensemble methods are built on individual base models, and those models can suffer from poor data quality; sparse, noisy, or imbalanced ESG data require careful selection of hyperparameters. Moreover, the cross-validation schemes used across studies often do not reflect a real-world forecasting problem. As a result, performance estimates are often inflated or biased (F. H. Chen et al., 2014; Hajek et al., 2023).

Recently, several novel approaches, such as hybrid and explainable modeling strategies, have been suggested to mitigate these issues. Studies such as Patel et al. (2026) show that ensembles of interpretable models (e.g., XGBoost) can achieve near-optimal predictive performance while maintaining high transparency. Furthermore, prior research emphasizes the need for temporal modeling, as static assumptions do not incorporate time dependencies and are therefore not appropriate for dynamic tasks (Goodfellow et al., 2016; Hastie et al., 2009).

2.4. Deep Learning, Model Complexity, and Data Suitability

The use of deep learning models for ESG prediction reflects a continuing shift toward higher model complexity in financial analytics. Neural networks offer a wide range of approaches for approximating complex functional relationships (Goodfellow et al., 2016). However, empirical results remain mixed: there is evidence that deep learning models may perform marginally worse than tree-based approaches on tabular datasets, especially when sample sizes are small (Shwartz-Ziv & Armon, 2022).

At the same time, recent research finds that deep learning models yield better performance on larger datasets or when temporal dynamics are explicitly modeled (Gunnarsson et al., 2021; Krauss et al., 2017; Sirignano & Cont, 2018). This comes at the cost of higher computational complexity and less interpretability. It also reveals a problem within the methodological literature overall, which tends to prioritize sophisticated models at the expense of appropriate data usage. In summary, comparing models that are poorly matched to the data characteristics will only mislead researchers, as ESG prediction requires selecting suitable models rather than merely following the trend toward increasingly complex algorithms.

2.5. Data Engineering, Feature Selection, and Model Robustness

ESG datasets are fundamentally plagued by missing values, measurement error, and high dimensionality, which can harm model performance (Berg et al., 2022). Lasso regression and optimization-based feature selection have been shown to improve model performance by reducing redundancy and multicollinearity, resulting in more stable and interpretable model outputs (Lahmiri, 2016; Papíková & Papík, 2022; Snoek et al., 2012; Tibshirani, 1996).

Existing advances notwithstanding, critical methodological risks remain poorly addressed. Specifically, in high-dimensional datasets, the existence of target leakage can give the illusion of high-quality metrics. The lack of time-aware validation approaches also hampers the generalizability of empirical results (Gu et al., 2020).

2.6. Interpretability, Transparency, and Model Accountability

With the widespread use of machine learning in ESG analytics, a strong demand for interpretability and accountability has emerged. Given significant financial and regulatory risks arising from the results of such an ESG assessment, stakeholders require accuracy and transparency. Within the explainable artificial intelligence paradigm, SHAP is among the techniques that decompose predictions into feature-level contributions, thereby enhancing interpretability (Lundberg & Lee, 2017; Molnar, 2019).

Nonetheless, the literature is still highly fragmented, and several questions remain unresolved. In particular, many studies conflate ESG performance (the outcome) with ESG disclosure (the action), even though the mechanisms are fundamentally different. Moreover, existing research often addresses individual methodological components in isolation (Berg et al., 2022).

Considering these limitations, this work offers a unified framework for data preprocessing, feature selection, leakage mitigation, temporal validation, and interpretability. This framework aims to provide a more theoretically grounded contribution to the ESG prediction literature by explicitly incorporating the fact that ESG metrics are constructed and endogenous.

For comparative remarks, previous studies generally discuss ESG prediction and dimensionality reduction separately, as presented in Table 1.

3. Empirical Methodology

3.1. Research Design and Data Sources

This research is situated within the Design Science Research (DSR) paradigm, chosen because it combines theoretical grounding with the construction of new analytical artifacts. Whereas conventional research methods are explanatory or predictive, DSR is explicitly prescriptive: it proposes a solution based on theory and then confirms its validity through empirical observation (Hevner et al., 2004; Peffers et al., 2007). This paradigm is well suited to ESG analytics, where a central objective is to build robust predictive systems that account for uncertainty, nonlinear relationships between variables, data heterogeneity, and measurement error.

Measuring and forecasting ESG performance is a significant challenge, which is one of the main reasons for adopting DSR. ESG constructs tend to be multidimensional and subject to cross-provider divergence, reporting sensitivity arising from limited corporate disclosure, and biases that remain unobserved (Berg et al., 2022; H. B. Christensen et al., 2021). These features are well-documented challenges for traditional empirical methods and call for a formal analytical artifact that integrates disparate data, filters out noise, and extracts higher-order, nonlinear relationships among environmental, social, and governance factors.

Within this framework, we develop an ESG prediction system as the main contribution of the work, combining data preprocessing, leakage mitigation, and feature selection in a modular multi-stage architecture. This is in line with the canonical DSR process of problem identification, artifact design, demonstration, and evaluation (Peffers et al., 2007). Every stage of the pipeline is designed to tackle well-known issues in ESG analytics such as high dimensionality, multicollinearity, and overfitting. A mainstay of this approach is repeated evaluation of the artifact in a realistic setup, using temporal cross-validation and robustness analysis to assess generalization across different time horizons. In addition, model interpretability methods link predictions to economically meaningful predictors that are useful for ESG decisions.

The empirical analysis draws on a dataset constructed from firm-level data obtained from the Refinitiv ESG database and firm-level scores from Bloomberg. Both sources provide broad global coverage of firms and have become standard datasets in the sustainable finance literature (Berg et al., 2022; H. B. Christensen et al., 2021).

The data form a longitudinal panel in which temporal dynamics and the trajectory of firm characteristics are observed over time. The dependent variables comprise both ESG composite scores and disclosure-based robustness indicators, which capture sustainability performance as well as reporting credibility. Measuring both dimensions is consistent with previous literature on the substantial difference between a firm’s actual operational impact and how disclosure strategy can be used to position the firm through signaling (Eccles et al., 2014; Yu et al., 2020).

To provide a clear and structured overview of the proposed ESG prediction framework, the full methodological pipeline is summarized in Algorithm 1. This algorithm outlines the sequential steps of data preprocessing, feature selection, model training, and evaluation within the design science research framework.

Algorithm 1: ESG Data Processing and Prediction Pipeline

Input: Raw ESG dataset
Output: Optimized predictive model and evaluation metrics
Procedure:
BEGIN
  1. Data Acquisition
     1.1 Load the ESG dataset from the specified data source
     1.2 Inspect dataset structure (dimensions, data types, summary statistics)
  2. Data Cleaning
     2.1 Remove duplicate records
     2.2 Handle missing values:
         IF proportion of missing values in a column is high THEN
             Remove the column
         ELSE
             IF column is numerical THEN
                 Impute missing values using mean or median
             ELSE
                 Impute missing values using mode
             END IF
         END IF
     2.3 Remove irrelevant or redundant features
  3. Feature Engineering
     3.1 Encode categorical variables using label encoding or one-hot encoding
     3.2 Transform variables where necessary
     3.3 Construct additional features if applicable
  4. Data Preprocessing
     4.1 Normalize or standardize numerical features
     4.2 Define feature matrix X and target variable Y
  5. Dataset Splitting
     5.1 Partition dataset into training and testing subsets
         (e.g., 80% training, 20% testing)
  6. Model Development
     6.1 Initialize candidate models:
         - Linear Regression
         - Random Forest
         - Other relevant algorithms (optional)
  7. Model Training
     FOR each model DO
         Train model using training dataset
     END FOR
  8. Model Evaluation
     FOR each trained model DO
         Generate predictions on test dataset
         Compute evaluation metrics:
         - Mean Squared Error (MSE)
         - Root Mean Squared Error (RMSE)
         - Coefficient of Determination (R²)
     END FOR
  9. Model Optimization
     9.1 Perform hyperparameter tuning (e.g., Grid Search or manual tuning)
     9.2 Retrain optimized model
     9.3 Re-evaluate performance
  10. Results and Visualization
     10.1 Compare model performances
     10.2 Visualize predictions versus actual values
     10.3 Analyze feature importance (if applicable)
  11. Output
     11.1 Select best-performing model
     11.2 Save final model and results (optional)
END
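The core of Algorithm 1 (steps 2 to 8 and 11.1) can be sketched in Python with scikit-learn. This is a minimal illustration, not the authors' implementation: the toy data, the column names (`size`, `leverage`, `sector`, `esg_score`), and the model settings are all assumptions made for the example. It follows the random 80/20 split of step 5 as written in the algorithm.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Step 1: hypothetical toy data standing in for the raw ESG dataset.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "size": rng.normal(size=n),
    "leverage": rng.normal(size=n),
    "sector": rng.choice(["energy", "tech", "retail"], size=n),
})
df["esg_score"] = 50 + 5 * df["size"] - 3 * df["leverage"] + rng.normal(size=n)
df.loc[rng.choice(n, 30, replace=False), "leverage"] = np.nan  # inject missing values

df = df.drop_duplicates()                                  # step 2.1
X, y = df.drop(columns="esg_score"), df["esg_score"]       # step 4.2

# Steps 2.2, 3.1 and 4.1: median imputation, scaling, one-hot encoding.
numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
prep = ColumnTransformer([
    ("num", numeric, ["size", "leverage"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sector"]),
])

# Step 5: 80/20 split, as stated in Algorithm 1.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Steps 6 to 8: train each candidate model and compute MSE, RMSE and R².
models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
results = {}
for name, model in models.items():
    pipe = Pipeline([("prep", prep), ("model", model)]).fit(X_tr, y_tr)
    pred = pipe.predict(X_te)
    mse = mean_squared_error(y_te, pred)
    results[name] = {"MSE": mse, "RMSE": float(np.sqrt(mse)), "R2": r2_score(y_te, pred)}

best = max(results, key=lambda k: results[k]["R2"])        # step 11.1
```

Wrapping imputation and scaling inside the pipeline means their statistics are estimated on the training subset only, which is in the spirit of the leakage controls the paper describes later.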


3.2. Data Preprocessing and Leakage Mitigation

One important aspect of the preprocessing stage is preventing target leakage. This is particularly critical in ESG research, where a majority of the variables are tightly linked to the construction of the ESG score itself. Here, leakage is defined as including variables that either are direct components of the ESG score or indirectly capture the same underlying information (Kaufman et al., 2011).

To address this, all variables clearly linked to ESG construction are removed before any modeling takes place. This includes composite ESG measures, pillar scores, and related indicators; environmental indicators such as emissions and energy use; and policy and disclosure-related variables. The initial filtering is based on keyword identification, but it is complemented by a manual review of the variables to ensure that no problematic features remain. After this step, the dataset is reduced to 496 predictors, which mainly reflect financial characteristics, market information, and structural governance features.

Multiple diagnostics are performed to ensure that no residual leakage remains after this procedure. First, correlations are computed for each predictor with the dependent variables. These correlations are modest, with values of ∼0.37 for ESG scores and 0.42 for ESG disclosure, nowhere near high enough to raise leakage concerns. Importantly, the variables typically associated with ESG construction do not rank among the strongest correlations. Second, a shuffled-target test is carried out: the model’s predictive performance collapses when the target variable is randomly permuted, which lends further confidence that the findings are not attributable to hidden leakage.
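The two diagnostics can be illustrated with a short Python sketch. The data are synthetic, and the model and sample sizes are assumptions chosen for the example; the paper does not report which implementation was used. The idea is that a model trained on a permuted target should lose its out-of-sample fit if its performance comes from genuine predictor and target structure.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in data: two informative predictors, eight noise predictors.
rng = np.random.default_rng(0)
n, p = 400, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

split = int(0.8 * n)
X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]

# Diagnostic 1: absolute correlation of each predictor with the target.
abs_corr = np.array([abs(np.corrcoef(X_tr[:, j], y_tr)[0, 1]) for j in range(p)])

def out_of_sample_r2(y_train):
    """Fit on the given training target and score on the untouched test target."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_tr, y_train)
    return r2_score(y_te, model.predict(X_te))

# Diagnostic 2: shuffled-target test.
r2_real = out_of_sample_r2(y_tr)
r2_shuffled = out_of_sample_r2(rng.permutation(y_tr))  # breaks the X-y link
```

A large gap between `r2_real` and `r2_shuffled` is the pattern the text describes; a shuffled-target score that stays high would point to a problem in the evaluation setup.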

This stage also deals with multicollinearity. Most financial datasets have multiple representations of similar concepts, such as different types of sales measures or accounting identities that connect assets, liabilities, and equity. Such variables are removed to prevent redundancies, and further validation ensures that no duplicates or perfectly collinear variables exist.

A key point is the temporal structure of data preprocessing. We only use predictors observed before time t to predict outcomes at time t. Lags are introduced where necessary to ensure that the predictors always precede the target variable. This prevents the model from exploiting future information and mirrors real-life forecasting, in which predictions must be made with the information available at the time.
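A minimal pandas sketch of lagging predictors within each firm is shown below. The column names (`firm`, `year`, `roa`, `esg_score`) and the one-period lag are assumptions for illustration; the paper does not state the lag length used.

```python
import pandas as pd

# Hypothetical firm-year panel; values are made up for illustration.
panel = pd.DataFrame({
    "firm": ["A", "A", "A", "B", "B", "B"],
    "year": [2019, 2020, 2021, 2019, 2020, 2021],
    "roa": [0.05, 0.06, 0.04, 0.10, 0.09, 0.11],
    "esg_score": [55.0, 57.0, 60.0, 40.0, 42.0, 41.0],
}).sort_values(["firm", "year"])

# Shift within each firm so that the predictor in row (firm, t) is the value
# observed at t-1. Grouping prevents values leaking across firms.
panel["roa_lag1"] = panel.groupby("firm")["roa"].shift(1)

# The first year of each firm has no lagged value and is dropped.
model_data = panel.dropna(subset=["roa_lag1"])
```

Sorting by firm and year before shifting matters: `shift` operates on row order, so an unsorted panel would silently pair a target with the wrong period.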

We perform all preprocessing on a per-training-window basis to prevent information from leaking from the test data, a common risk when dealing with missing observations. For our main specification, missing values are imputed using the median of the training sample only, and the same fitted transformation is then applied to the test set to maintain temporal consistency. Because this choice could potentially introduce bias, we performed a sensitivity analysis of the baseline against several alternative techniques (mean imputation, k-nearest neighbors [KNN], and iterative multivariate methods).

The pattern of results remained stable across all specifications, which gives us confidence in the robustness of the framework. The baseline median imputation had an R² of 0.679, and both KNN and the more complex iterative methods produced almost equivalent scores (0.681 and 0.680, respectively). The simplest method, mean imputation, gave a very close score of 0.677. This small difference of <0.004 in predictive accuracy suggests the model is largely insensitive to the imputation strategy used for missing inputs. That stability indicates that the predictive power lies in the ESG and financial data rather than in the mechanics of any specific imputation process, which strengthens the methodological basis.
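An imputation sensitivity check of this kind might be set up as follows. This is a sketch under stated assumptions: the data are synthetic, the Random Forest settings are arbitrary, and the scikit-learn imputers are one plausible choice rather than the tools the authors confirm using. Each imputer is fitted on the training rows only and then applied unchanged to the test rows.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.metrics import r2_score

# Synthetic data; the target is built before values are masked as missing.
rng = np.random.default_rng(0)
n, p = 400, 6
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
X[rng.random(size=X.shape) < 0.1] = np.nan  # roughly 10% missing at random

split = int(0.8 * n)
X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]

imputers = {
    "median": SimpleImputer(strategy="median"),
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(random_state=0),
}

scores = {}
for name, imputer in imputers.items():
    X_tr_filled = imputer.fit_transform(X_tr)  # statistics come from training rows only
    X_te_filled = imputer.transform(X_te)      # test rows reuse the fitted statistics
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_tr_filled, y_tr)
    scores[name] = r2_score(y_te, model.predict(X_te_filled))
```

Comparing the entries of `scores` gives the same kind of side-by-side R² comparison reported in the text.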

3.3. Feature Selection and Dimensionality Reduction

Given the high dimensionality of ESG datasets, feature selection is conducted using Lasso regularization to reduce redundancy and improve model stability. The optimization problem is defined as:

\min_{β} \left\{ \frac{1}{n} \sum_{i=1}^{n} (Y_{i} - X_{i} β)^{2} + λ \sum_{j=1}^{p} |β_{j}| \right\}

where the L_{1} penalty induces sparsity by shrinking less informative coefficients toward zero (Tibshirani, 1996). This is particularly helpful in ESG contexts, where many indicators are highly correlated or redundant (Berg et al., 2022). Lasso mitigates multicollinearity and adds stability to subsequent model estimation by reducing the effective number of predictors (Gu et al., 2020).

Since feature selection may itself introduce data leakage, Lasso is run inside each training window of the temporal validation framework. In particular, for every estimation period, the model is fitted with training data only, and the selected features are then applied to the respective test set. Feature selection is never performed on the whole dataset.

The regularization parameter λ is selected via cross-validation within the training data, on validation (not test) folds, taking the value that minimizes the prediction error. The degree of sparsity is thus data-driven while remaining consistent with the out-of-sample evaluation strategy.
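The window-level selection step can be sketched with scikit-learn's `LassoCV`. The data, the five-fold setting, and the single train/test window are assumptions for illustration. Note that scikit-learn scales the squared-error term by 1/(2n) rather than 1/n, so its `alpha` corresponds to λ/2 in the objective above.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic window: 40 candidate predictors, only the first two are informative.
rng = np.random.default_rng(0)
n, p = 300, 40
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

split = int(0.8 * n)
X_tr, X_te, y_tr = X[:split], X[split:], y[:split]

# Standardize using training statistics only, then choose the penalty by
# cross-validation inside the training window.
scaler = StandardScaler().fit(X_tr)
lasso = LassoCV(cv=5, random_state=0).fit(scaler.transform(X_tr), y_tr)

# Keep predictors with non-zero coefficients and carry the same column
# subset over to the test window without refitting anything.
selected = np.flatnonzero(lasso.coef_)
X_tr_sel = scaler.transform(X_tr)[:, selected]
X_te_sel = scaler.transform(X_te)[:, selected]
```

In a rolling or expanding design this block would be repeated for every training window, so the selected set may differ from one window to the next.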

3.4. Model Specification and Ensemble Learning

To model ESG outcomes within a predictive framework, the analysis combines both traditional econometric approaches and machine learning methods. This allows for a structured comparison between linear specifications and more flexible nonlinear models and helps assess the incremental predictive value of machine learning techniques.

3.4.1. Baseline Econometric Models

As a benchmark, standard econometric models are estimated. The baseline specification is given by:

Y_{it} = α + X_{it} β + ε_{it}

where Y_{it} denotes the ESG outcome for firm i at time t, and X_{it} represents the set of predictors. In addition, panel specifications with firm-level fixed effects are considered:

Y_{it} = α_{i} + X_{it} β + ε_{it}

to account for unobserved heterogeneity across firms. These models provide a reference point for evaluating whether more complex machine learning approaches offer meaningful improvements in predictive performance.
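To make the difference between the two baseline specifications concrete, the sketch below estimates a pooled OLS slope and a fixed-effects slope (via the within transformation, i.e., demeaning by firm) on simulated data. Everything here is an assumption for illustration: a single predictor, a true slope of 2, and a firm effect that is deliberately correlated with the predictor.

```python
import numpy as np
import pandas as pd

# Simulated balanced panel: 50 firms observed for 8 periods.
rng = np.random.default_rng(0)
firms, periods = 50, 8
firm = np.repeat(np.arange(firms), periods)
alpha_i = rng.normal(scale=3.0, size=firms)[firm]          # firm effect α_i
x = rng.normal(size=firms * periods) + 0.5 * alpha_i        # predictor correlated with α_i
y = alpha_i + 2.0 * x + rng.normal(size=firms * periods)    # true β = 2
df = pd.DataFrame({"firm": firm, "x": x, "y": y})

# Pooled OLS: Y_it = α + X_it β + ε_it (ignores the firm effect).
X_pooled = np.column_stack([np.ones(len(df)), df["x"].to_numpy()])
beta_pooled = np.linalg.lstsq(X_pooled, df["y"].to_numpy(), rcond=None)[0][1]

# Fixed effects: subtract firm means so that α_i drops out, then regress.
demeaned = df[["x", "y"]] - df.groupby("firm")[["x", "y"]].transform("mean")
beta_fe = float((demeaned["x"] * demeaned["y"]).sum() / (demeaned["x"] ** 2).sum())
```

In this simulation the pooled slope absorbs part of the firm effect and overshoots, while the within estimator stays close to the true value, which is the motivation for including the fixed-effects benchmark.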

3.4.2. Machine Learning and Ensemble Models

To capture nonlinear relationships and interaction effects, the predictive framework incorporates ensemble learning methods. The general prediction function can be expressed as:

\hat{Y}_{it} = \sum_{m=1}^{M} ω_{m} f_{m}(X_{it})

where f_{m} denotes individual base learners and ω_{m} represents their associated weights. Random Forest and Extreme Gradient Boosting are selected due to their strong performance in structured datasets and their ability to model complex dependencies (Breiman, 2001; T. Chen & Guestrin, 2016).

The Random Forest model constructs an ensemble of decision trees using bootstrap aggregation:

\hat{f}(X) = \frac{1}{B} \sum_{b=1}^{B} T_{b}(X)

where T_{b} denotes individual trees trained on bootstrap samples. This approach reduces variance and improves generalization (Breiman, 2001).

XGBoost, by contrast, follows a gradient boosting framework in which the predictive function is iteratively updated:

\hat{Y}^{(t)} = \hat{Y}^{(t-1)} + f_{t}(X)

where each new function f_{t} is fitted to minimize the residual error from the previous iteration (here t indexes the boosting iteration rather than time). This sequential learning process enhances predictive accuracy while incorporating regularization (T. Chen & Guestrin, 2016).
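The two ensemble equations can be checked numerically. The sketch below first confirms that a scikit-learn Random Forest prediction is the average of its trees, and then hand-rolls a plain gradient boosting loop for squared error. The boosting loop is a simplified stand-in, not XGBoost: it omits XGBoost's regularization terms, and the learning rate of 0.1 is an added assumption that does not appear in the update equation above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

# Random Forest: the forest prediction equals (1/B) * sum of tree predictions.
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
tree_average = np.mean([tree.predict(X) for tree in rf.estimators_], axis=0)

# Gradient boosting for squared error: each tree is fitted to the current
# residuals and added to the running prediction.
learning_rate = 0.1  # shrinkage factor, an assumption for this sketch
prediction = np.full_like(y, y.mean())
mse_path = [np.mean((y - prediction) ** 2)]
for _ in range(50):
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residual)
    prediction = prediction + learning_rate * tree.predict(X)
    mse_path.append(np.mean((y - prediction) ** 2))
```

The training error in `mse_path` falls at every iteration, which shows why boosting needs regularization or early stopping: in-sample fit keeps improving regardless of out-of-sample performance.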

3.4.3. Deep Learning Benchmark

A Multi-Layer Perceptron (MLP) is also included as a benchmark model, with predictions defined as:

\hat{Y} = σ(W^{(L)} \dots σ(W^{(1)} X + b^{(1)}) \dots + b^{(L)})

where σ(\cdot) denotes nonlinear activation functions (Goodfellow et al., 2016). Although in theory a neural network can approximate highly complex functions, in practice the scale and dimensionality of the data make a substantial difference to its performance. Neural networks tend to overfit and have less stable optimization in settings like ESG analysis, where the sample size is small but the feature space is high-dimensional. For these reasons, the MLP is used only for benchmarking, to assess how well deep learning models are suited to this context.

3.5. Hyperparameter Optimization

Hyperparameters are tuned using a randomized search over the parameter space Θ, with the aim of minimizing out-of-sample prediction error. Formally, we select the set of parameters θ^{*} that minimizes the expected loss function:

θ^{*} = \arg\min_{θ \in Θ} \mathbb{E}[L(Y, f(X; θ))]

where L(\cdot) denotes the prediction error and θ represents the hyperparameter configuration (Bergstra & Bengio, 2012).

As the data are time-dependent, the tuning procedure is implemented within the same rolling-window framework used for model evaluation to avoid any risk of data leakage. For each estimation window, the available training data are further divided into an inner training and validation subset. Hyperparameters are selected based solely on performance in this validation subset, and the chosen configuration is then applied to the subsequent out-of-sample test period. At no point is information from the test set used during tuning.

The above procedure is performed across all rolling windows, which enables hyperparameters to respond to a changing environment, rather than being fixed in one single, global specification process. Such a design is aligned with the time-aware structure of the analysis and reflects realistic forecasting premises.
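The window-by-window tuning procedure can be sketched as follows. This is a simplified illustration under stated assumptions: ridge regression with a single penalty `lam` stands in for the actual models, the last training year is used as the inner validation subset, the model is refit on the full training window after tuning, and the data are synthetic. None of these choices is confirmed by the source.

```python
import numpy as np

def ridge_fit(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def mse(X, y, beta):
    return float(np.mean((y - X @ beta) ** 2))

def tune_in_window(X_train, y_train, years_train, rng, n_draws=20):
    """Randomized search that only ever sees the training window."""
    last_year = years_train.max()
    inner_train = years_train < last_year      # inner training subset
    inner_val = ~inner_train                   # inner validation subset
    best_lam, best_err = None, np.inf
    for lam in 10 ** rng.uniform(-3, 3, size=n_draws):   # log-uniform random draws
        beta = ridge_fit(X_train[inner_train], y_train[inner_train], lam)
        err = mse(X_train[inner_val], y_train[inner_val], beta)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam

# Synthetic panel: 40 observations per year, 2016 to 2023 (hypothetical data).
rng = np.random.default_rng(0)
years = np.repeat(np.arange(2016, 2024), 40)
X = rng.normal(size=(years.size, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=years.size)

results = []
for test_year in range(2019, 2024):
    train, test = years < test_year, years == test_year
    lam = tune_in_window(X[train], y[train], years[train], rng)
    beta = ridge_fit(X[train], y[train], lam)  # refit on the full window (assumption)
    results.append((test_year, lam, mse(X[test], y[test], beta)))
```

Because `tune_in_window` receives only the training rows, the test year cannot influence the selected configuration, and the chosen `lam` is free to differ from one window to the next.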

Randomized search is used rather than exhaustive grid search because it explores a greater volume of the parameter space at lower computational cost (Bergstra & Bengio, 2012). Tuning focuses on the most influential hyperparameters of each model. For ensemble schemes, these are the number of trees, tree depth, and minimum sample splits; for neural networks, hidden layer sizes, learning rate, and regularization strength. The ranges for parameters are selected based on well-established practice in the literature on both machine learning and finance (Hastie et al., 2009; Zou & Hastie, 2005).

Because the dataset is small and high-dimensional, great care is taken in regulating model complexity. The tuning strategy is therefore designed to balance flexibility and generalization, so that overfitting is avoided while predictive performance remains stable, as presented in Table 2.

3.6. Temporal Validation and Sampling Strategy

Model validation is conducted using a rolling (expanding) window approach to ensure that predictions are based strictly on information available at the time of forecasting. The training and testing structure is defined as:

{\hat{Y}}_{t + 1} = f (X_{1 : t}; θ_{t})

where the model is trained on observations up to time t and evaluated on the subsequent period t + 1. This setup reflects realistic forecasting conditions and prevents any use of future information, thereby eliminating look-ahead bias (Gu et al., 2020; Hastie et al., 2009).

To further protect the validation process, all preprocessing steps are performed independently within each training window. The fitted transformations are then applied to the corresponding test period, so that the test data never influence model estimation at any stage.

Although the rolling window provides a realistic evaluation framework, a single validation scheme may not fully reflect model sensitivity to temporal variation. Supplementary validation checks are therefore performed, comprising alternative temporal splits and an evaluation of model performance across segments of the ESG distribution. These checks assess stability over time and across firms with low and high ESG scores.
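A minimal sketch of the expanding-window split with train-only preprocessing is given below. The helper `expanding_window_splits`, the choice of standardization as the preprocessing step, and the synthetic data are assumptions for illustration; the year range simply mirrors the 2016 to 2023 span mentioned in the text.

```python
import numpy as np

def expanding_window_splits(years, first_test_year):
    """Yield (train_idx, test_idx) with all training years strictly before the test year."""
    for test_year in np.unique(years):
        if test_year < first_test_year:
            continue
        yield np.flatnonzero(years < test_year), np.flatnonzero(years == test_year)

# Synthetic panel: 10 observations per year (hypothetical data).
rng = np.random.default_rng(0)
years = np.repeat(np.arange(2016, 2024), 10)
X = rng.normal(loc=5.0, scale=2.0, size=(years.size, 3))

checks = []
for train_idx, test_idx in expanding_window_splits(years, first_test_year=2018):
    # Fit the transformation on the training window only ...
    mean = X[train_idx].mean(axis=0)
    std = X[train_idx].std(axis=0)
    X_train = (X[train_idx] - mean) / std
    # ... and apply the same fitted parameters to the test period.
    X_test = (X[test_idx] - mean) / std
    checks.append((int(years[train_idx].max()), int(years[test_idx].min()), len(train_idx)))
```

Each tuple in `checks` records the last training year, the test year, and the training size, so it is easy to confirm that the training window always ends before the test period and grows with every step.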

3.7. Interpretability and Model Transparency

To ensure interpretability, SHAP values are used to decompose predictions into additive feature contributions:

f (X) = ϕ_{0} + \sum_{j = 1}^{p} ϕ_{j}

where ϕ_{0} is the baseline (expected) model output and ϕ_{j} represents the marginal contribution of feature j to the prediction (Lundberg & Lee, 2017). This method provides a consistent and theoretically grounded way to interpret complex machine learning models.
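The additive decomposition can be checked by hand in the one case where SHAP values have a simple closed form: for a linear model with independent features, the contribution of feature j is its coefficient times the feature's deviation from its mean (Lundberg & Lee, 2017). The sketch below uses that special case with synthetic data and hypothetical coefficients; tree ensembles require a dedicated algorithm, which is not shown here.

```python
import numpy as np

# Hypothetical linear model f(X) = intercept + X @ beta on synthetic features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
beta = np.array([2.0, -1.0, 0.5])
intercept = 1.0

def f(X):
    return intercept + X @ beta

# Baseline phi_0: the average model output over the data.
phi_0 = f(X).mean()

# Linear-model SHAP values: phi_j = beta_j * (x_j - mean(x_j)).
phi = beta * (X - X.mean(axis=0))

# Additivity: the baseline plus the contributions reproduces every prediction.
reconstructed = phi_0 + phi.sum(axis=1)

# A common global importance summary: mean absolute contribution per feature.
importance = np.abs(phi).mean(axis=0)
```

Here `reconstructed` matches `f(X)` row by row, and `importance` ranks the features by their average absolute contribution, which is the quantity a SHAP summary plot orders features by.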

This paper uses SHAP values to evaluate variable importance and to identify predictors that are common across model specifications. The economic meaning of these contributions is given in the results section. Separating the computation of feature contributions from their interpretation keeps the modeling stage distinct from the empirical conclusions.

4. Empirical Results and Model Performance

4.1. Empirical Data Diagnostics and Temporal Validation

A temporal diagnostic was conducted on the longitudinal panel to ensure that the predictive models were grounded in a realistic forecasting environment. The validation process confirmed that look-ahead bias has been removed, as shown by the passed temporal logic check, in which firm-level predictors from period t − 1 are aligned with ESG outcomes at period t. One notable result of this diagnostic is the very high target persistence of 0.8824 for the BESG ESG Score, implying that sustainability performance is highly sticky: historical performance remains the strongest, though by itself insufficient, predictor of future scores.
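The source does not state how the persistence statistic is computed. One plausible reading, sketched below, is the correlation between each firm's score and its own one-year lag, built so that the lag never crosses firm boundaries. The toy panel, the column names, and the assumption of consecutive yearly observations per firm are all hypothetical.

```python
import pandas as pd

# Toy unbalanced panel (hypothetical values, not the study's data).
panel = pd.DataFrame({
    "firm": ["A"] * 4 + ["B"] * 4 + ["C"] * 3,
    "year": [2019, 2020, 2021, 2022] * 2 + [2020, 2021, 2022],
    "esg": [1.0, 1.2, 1.5, 1.6, 2.0, 2.1, 2.4, 2.6, 3.0, 3.1, 3.3],
})

# Lag within firm only, so the predictor at t-1 is aligned with the outcome at t.
panel = panel.sort_values(["firm", "year"])
panel["esg_lag1"] = panel.groupby("firm")["esg"].shift(1)

# The first observation of each firm has no lag and is dropped from modeling.
model_data = panel.dropna(subset=["esg_lag1"])

# Persistence as the correlation between the score and its lag (assumed definition).
persistence = model_data["esg"].corr(model_data["esg_lag1"])
```

Shifting within `groupby("firm")` is what prevents one firm's last score from being used as another firm's lagged predictor, which is the kind of misalignment a temporal logic check is meant to rule out.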

Besides temporal alignment, the data were filtered to improve the signal-to-noise ratio of the feature set and to achieve the level of stationarity required for ensemble learning. The refinement identified thirty-four near-zero variance variables, including specific amortization and discontinued operations metrics, which were pruned so that the final-stage learners received only variables with meaningful information signals. Additionally, multicollinearity was a concern among the environmental indicators: a redundancy check revealed collinearity between GHG Scope 1 and Scope 2 emissions, with a correlation of 0.9037. This strong overlap indicates that Lasso regularization was an appropriate methodological choice for the next step, as it is a robust way of reducing redundancy in high-dimensional data.
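The two screening steps described above can be sketched as follows. The near-zero variance rule used here (the most frequent value accounts for more than 95% of observations), the 0.9 correlation cutoff, the feature names, and the synthetic data are assumptions; the source does not report its thresholds.

```python
import numpy as np

# Synthetic features (hypothetical): two overlapping emission series,
# one almost-constant accounting item, and one unrelated variable.
rng = np.random.default_rng(0)
n = 300
scope1 = rng.normal(size=n)
features = {
    "ghg_scope1": scope1,
    "ghg_scope2": scope1 + 0.3 * rng.normal(size=n),
    "discontinued_ops": np.where(np.arange(n) < 3, 1.0, 0.0),
    "revenue": rng.normal(size=n),
}
names = list(features)
X = np.column_stack([features[k] for k in names])

def dominant_share(col):
    """Share of observations taken by the single most frequent value."""
    _, counts = np.unique(col, return_counts=True)
    return counts.max() / col.size

# Step 1: flag near-zero variance columns.
near_zero_var = [k for k, col in zip(names, X.T) if dominant_share(col) > 0.95]
kept = [k for k in names if k not in near_zero_var]

# Step 2: flag highly correlated pairs among the remaining columns.
corr = np.corrcoef(X[:, [names.index(k) for k in kept]], rowvar=False)
redundant_pairs = [
    (kept[i], kept[j], round(float(corr[i, j]), 3))
    for i in range(len(kept))
    for j in range(i + 1, len(kept))
    if abs(corr[i, j]) > 0.9
]
```

In this toy setup the almost-constant column is pruned and the two emission series are flagged as a redundant pair, mirroring the pattern reported for the Scope 1 and Scope 2 indicators.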

To examine the longitudinal structure, a rolling-window simulation was run to track how the available training sample changes over time. The panel is substantially unbalanced, as the average firm has a trading history of less than four years, but the data still support temporal validation because thirty-nine firms have an unbroken trading history of seven years. The expanding-window plan in the table above conveys this: the training sample starts at a size of 8 in 2016 and reaches the full dataset of 533 observations for the final forecast in 2023. This expanding structure is accompanied by a systematic increase in the target value. Annual average scores rose from 1.29 to 2.77 (shown in Figure 1), an increase that plausibly reflects dynamic corporate adjustment that is closely aligned with global compliance policies in the long run.

4.2. Feature Selection Dynamics and Stability Analysis

Rolling-window Lasso selection showed that the number of retained features changed markedly with the size of the training sample. In the first window (2018), when the training sample contained only 48 observations, Lasso was lenient and retained 61 features. As the sample expanded to 533 observations by 2023, the selection became more parsimonious, narrowing the predictive set to seven key features. On average, the method reduced a 496-dimensional space to about 19 predictors per window, thereby reducing noise and mitigating the “curse of dimensionality”.

Stability analysis of these features over the study period highlights the structural factors driving ESG performance. The selection of the lagged target variable (BESG_Lag1) is the most notable result: it appeared in all six windows, corroborating that sustainability performance is fundamentally path-dependent and persistent. Apart from historical scores, the model systematically identified a combination of governance and financial health indicators as stable predictors. In particular, the presence of an Equal Opportunity Policy was a frequently selected non-financial driver, and financial metrics such as Treasury Stock, Short-Term Borrowing trends, and Lease Commitments were among the most common selections. This shows that the selection step captures both formal ESG policy and underlying financial strength well enough for these features to serve as a reliable base in the final ensemble learning stage.
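The stability count described above (how many windows select each feature) can be sketched with a small coordinate-descent Lasso. Everything specific here is an assumption for illustration: the fixed penalty `alpha`, the synthetic data in which column 0 plays the role of a strongly persistent lagged target, and the six test years. The source does not state how the penalty was chosen.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=100):
    """Coordinate descent for (1/2n) * ||y - X b||^2 + alpha * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual / n
            # Soft-thresholding sets weak coefficients exactly to zero.
            beta[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
    return beta

# Synthetic panel: 30 observations per year, 20 candidate features (hypothetical).
rng = np.random.default_rng(0)
years = np.repeat(np.arange(2016, 2024), 30)
p = 20
X = rng.normal(size=(years.size, p))
y = 0.9 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=years.size)

selection_counts = np.zeros(p, dtype=int)
n_selected = []
for test_year in range(2018, 2024):
    train = years < test_year
    # Standardize inside the training window only.
    mean, std = X[train].mean(axis=0), X[train].std(axis=0)
    X_train = (X[train] - mean) / std
    y_train = y[train] - y[train].mean()
    selected = lasso_cd(X_train, y_train, alpha=0.1) != 0
    selection_counts += selected
    n_selected.append(int(selected.sum()))
```

`selection_counts` gives the number of windows in which each feature survives the penalty, and `n_selected` tracks how the size of the selected set changes as the training window expands.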

4.3. Main Model Performance

The primary empirical findings are reported in Table 3, which presents the predictive performance of the three model classes under alternative sampling strategies for ESG score prediction.

Among the models in Table 3, Random Forest is the best-performing ESG forecasting model. It performs best when applied to the data in their original distribution, with R-squared = 0.679 and MAPE = 0.372. The R-squared value indicates substantial explanatory power for annual ESG performance: approximately 68% of the variation is predictable from a lagged feature set. These results are aligned with previous findings in the finance literature, which show that tree-based models handle the nonlinearities and complex interaction effects frequently present in sustainability datasets. XGBoost ranked second with an R-squared of 0.664, further confirming that the boosting mechanism is suitable for this structured panel data.

The results demonstrate that noise in the data leads to a decline in predictive accuracy. This decline was evident across all models under both under- and over-sampling. For Random Forest, R-squared declines to 0.642 when resampling removes observations that might be treated as outliers, which reduces the data density needed to identify subtle variations in firm behavior. These results align with the literature cautioning that aggressive resampling of distributions can be misleading.

Another important result is the low performance of the Multi-Layer Perceptron benchmark. It yields an R-squared of 0.000 on the hold-out set at baseline and deteriorates into negative territory under the resampling settings. This supports an emerging view in the field that, although deep learning has revolutionized the analysis of unstructured data such as images or text, it often fails to outperform classical ensemble methods on tabular financial data, which have smaller sample sizes and low signal-to-noise ratios. In conclusion, among the specifications implemented in this study, Random Forest, with its moderate structural complexity, emerges as the most reliable approach for ESG performance forecasting.

4.4. Robustness Analysis: ESG Disclosure Prediction

To evaluate the stability of the predictive framework, Table 4 reports results for ESG disclosure robustness, which captures the quality and consistency of ESG reporting.

Table 4 shows the robustness analysis, in which we further investigate the predictive ability of the framework for ESG Disclosure Scores. The results show that the models remain highly stable, and Random Forest remains the best model. For the original sampling distribution, it generated an R-squared of 0.710, indicating that disclosure transparency might be slightly more predictable than the composite performance scores produced by rating providers. This implies that organizational reporting habits and levels of transparency follow a well-defined trajectory, which ensemble models can use to predict future behavior with high accuracy.

Consistent with the main model results, XGBoost also performed strongly, with an R-squared of 0.697. Sensitivity to resampling was likewise consistent with the prior results. Prediction performance did not improve under either over-sampling or under-sampling, and error rates increased when observations were removed (RMSE increased from 8.446 to 9.825). This confirms that the original data distribution is the most robust foundation on which to train sustainability models for robustness checks. The loss of precision under under-sampling indicates that each observation contributes meaningfully when fitting a model of the reporting landscape.

The Multi-Layer Perceptron again performed consistently poorly. With the original distribution it yielded a low R-squared (0.075), which deteriorated substantially under under-sampling to a negative R-squared (−0.535) and a very large RMSE (19.069). This volatility relative to the ensemble methods supports our hypothesis that the neural network was unable to generalize the patterns learned from the disclosure data. Overall, the robustness analysis demonstrates that tree-based ensemble methods offer an accurate and robust approach to predicting multiple dimensions of ESG reporting.

4.5. Econometric Model Results

Table 5 and Table 6 report the pooled OLS and fixed effects results for the composite ESG score. The lagged ESG score (BESG_Lag1) is highly significant in both models, which further confirms the strong persistence of ESG performance over time. Although the coefficient is larger in absolute size under pooled OLS, it remains positive and robust under the fixed effects model, indicating that past ESG performance influences current outcomes even after controlling for firm-specific effects.

The Equal Opportunity Policy variable is also positive and significant in both models, implying that firms adopting Equal Opportunity Policies tend to have higher ESG scores. Interestingly, under the fixed effects specification, its effect increases in strength and significance, suggesting that policy adoption at the within-firm level is particularly important for positive ESG performance.

Among the financial variables, Treasury Stock is significant only in the fixed effects model, indicating that firm-specific dynamics feature prominently in this relationship. Short-term borrowings consistently exhibit weak or no significance in all six models, revealing little to no impact on ESG outcomes. Lease commitments are significant in the pooled OLS model but become insignificant once firm fixed effects are included, which suggests that part of their effect is due to cross-sectional differences rather than intra-firm changes.
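The contrast between the two estimators can be illustrated with a small simulation. In this sketch, which uses synthetic data and a hypothetical data-generating process, an unobserved firm effect is correlated with the regressor, so pooled OLS mixes cross-sectional differences into the slope, while the within (fixed effects) transformation removes them by demeaning each firm's data.

```python
import numpy as np

# Synthetic panel: 100 firms observed for 6 years (hypothetical data).
rng = np.random.default_rng(0)
n_firms, n_years, true_slope = 100, 6, 1.0
firm = np.repeat(np.arange(n_firms), n_years)
firm_effect = rng.normal(size=n_firms)[firm]

# The regressor is correlated with the unobserved firm effect.
x = firm_effect + rng.normal(size=firm.size)
y = firm_effect + true_slope * x + rng.normal(scale=0.5, size=firm.size)

def ols_slope(x, y):
    """Slope of a one-regressor OLS fit with an intercept."""
    x_c, y_c = x - x.mean(), y - y.mean()
    return float(x_c @ y_c / (x_c @ x_c))

def demean_by_group(values, groups):
    """Within transformation: subtract each firm's own mean."""
    group_means = np.bincount(groups, weights=values) / np.bincount(groups)
    return values - group_means[groups]

pooled_slope = ols_slope(x, y)
fe_slope = ols_slope(demean_by_group(x, firm), demean_by_group(y, firm))
```

In this setup `pooled_slope` is pulled well above the true value of 1.0 by between-firm differences, whereas `fe_slope` stays close to it. A variable whose apparent effect disappears under fixed effects is therefore one whose pooled coefficient was driven largely by differences across firms.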

4.6. Model Fidelity and Predictive Accuracy

As shown in Figure 2, predicted values correlate strongly with observed values and cluster around the 45-degree benchmark line. Some dispersion is evident, especially for firms with low ESG scores, but the clustering suggests that the model captures some of the fundamental structure behind ESG performance. This agrees with the empirical literature on machine learning-based asset pricing and ESG prediction, where nonlinear models typically fit the data well overall but show greater dispersion in the lower quantiles. Visual inspection suggests that the model's predictive consistency holds across a wide range of firm performance.

4.7. Temporal Stability and Generalization

Figure 3 shows that the model’s R² score ranges from 0.46 to 0.87 across different test years, with an average of 0.68. Although some fluctuations are observed, the model generally maintains stable predictive capability over time. This temporal consistency is particularly important in ESG applications, where data characteristics and regulatory conditions may evolve across periods. The results indicate that the framework is capable of adapting to changing conditions while preserving reliable forecasting performance. Overall, the model demonstrates robustness in long-term prediction tasks without exhibiting substantial deterioration in accuracy over time.

Figure 4 shows the residuals of the ESG prediction model. The distribution is concentrated around zero (slightly above 0), which indicates that the model's predictions are not substantially biased. Furthermore, the nearly symmetrical distribution of residuals implies that there are no systematic errors, which supports the reliability and robustness of the estimation procedure. The residuals show a slight rightward skew with a few extreme positive values, suggesting that the model has slightly higher prediction errors for high ESG scores. This is consistent with earlier findings that ESG scores and financial outcomes are more difficult to estimate accurately for heterogeneous and high-performing firms.

4.8. Interpretability and Feature Contributions

The SHAP summary plot addresses explainability at the level of feature impact and helps identify the most direct drivers of model predictions (Figure 5). As discussed in Section 3.7, Shapley values indicate not only the importance of each variable but also the direction and magnitude of its effect on the predicted ESG score.

The lagged ESG score at time (t − 1) shows a strong positive effect, indicating that it is an important predictor. It also suggests that corporate sustainability performance is highly stable and changes gradually rather than abruptly. In addition to historical scores, Total Liabilities and Revenue emerge as important indicators of financial health in the model. This reflects the natural linkage between sustainability metrics, for which firms obtain credit through disclosure, and firm performance and capital structure.

Moreover, the plot shows nonlinear effects of governance and operational indicators. Among these, the existence of an Equal Opportunity Policy and the management of lease obligations stand out as stable predictors. These features are horizontally dispersed in the plot, meaning that they affect firms differently: the model indicates that the influence of these variables depends on broader corporate dynamics. The SHAP analysis therefore shows that the factors characterizing ESG performance are not well described by simple linear coefficients and are instead multifactorial in nature.

5. Discussion

This study examines the predictability of ESG performance from both an econometric and a machine learning perspective. The findings show that ESG outcomes are structured and partly predictable rather than random. Random Forest and XGBoost are moderately effective predictive models, with R² scores of roughly 0.65 to 0.70 between predicted and actual values. This suggests that a substantial portion of ESG variation can be explained by firm-level heterogeneity.

One of the key findings is that ESG performance is highly persistent over time. The lagged ESG variable carries substantial importance across all models. ESG performance evolves at low frequency, and firms cannot change their metrics overnight. From an economic perspective, this reflects the fact that sustainability investments pay off over long horizons, as improvements in governance, environmental, and social policy take place. For investors, this persistence means that prior ESG performance can be predictive of future outcomes.

In addition to persistence, governance variables, especially whether the firm has an equal opportunity policy, explain a significant proportion of the variability in future ESG performance. This suggests that even soft corporate policies matter: formal commitments translate into better ESG performance in practice. Financial variables are also predictors, reflecting the organization’s economic capacity and the flexibility that its financial structure allows. ESG performance, then, is the product of both institutional commitments and firm fundamentals.

A comparison of econometric and machine learning approaches shows that pooled OLS and fixed effects models identify significant associations and are highly interpretable but have lower predictive performance. Machine learning models, by contrast, learn nonlinear relationships and variable interactions, so they achieve better predictive accuracy than traditional regression. This illustrates the added power that machine learning can bring to ESG analysis by fitting less restrictive models to the high-dimensional data often used in ESG measurement.

A key finding of this analysis is that the deep learning model performs poorly, reaching negative R² values in some settings (e.g., the ESG disclosure prediction task). This result is not surprising and confirms what a growing number of publications have shown: that for structured, tabular datasets, deep neural networks perform worse than tree-based ensemble methods (Shwartz-Ziv & Armon, 2022). In settings with high dimensionality, low sample size, and heavy noise, all of which are common features of ESG data, neural networks have a limited ability to generalize.

In addition, the negative R² values that we observed in this study highlight that the deep learning model performs worse than a basic benchmark, indicating severe overfitting or that no stable signal could be extracted from the data. This is especially true in the case of ESG disclosure measures, which are fundamentally noisy and subject to subjective reporting practices and rating methodologies (Berg et al., 2022; H. B. Christensen et al., 2021). Highly flexible models such as neural networks tend to fit idiosyncratic patterns in the training data rather than capturing structural relationships between variables, fostering out-of-sample failure in such environments.
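A short worked example makes the meaning of a negative R² concrete. It uses the standard definition, one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the evaluated sample, with made-up numbers rather than the study's data.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

# Always predicting the sample mean gives exactly zero.
r2_mean = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])    # 0.0

# Predictions that run against the data are worse than the mean.
r2_bad = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])     # -3.0

# Predictions close to the data approach one.
r2_good = r_squared(y_true, [1.1, 1.9, 3.2, 3.8])
```

An R² of zero therefore corresponds to a model that does no better than predicting the average, and any negative value means the squared errors exceed those of that constant benchmark.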

In contrast, ensemble methods like Random Forest and XGBoost are well suited to capturing the nonlinearities and interaction effects of tabular financial data through implicit regularization (Breiman, 2001; T. Chen & Guestrin, 2016). Thus, the results provide further support for a key methodological lesson from the broader literature that is highly relevant to ESG analytics: increasing model complexity does not necessarily improve predictive accuracy, and it is crucial to select models that fit the data structure well.

The SHAP analysis provides additional insight into the drivers of ESG performance. As shown in the results, the key predictors are consistent across models, especially lagged ESG scores and certain governance and financial variables. This consistency raises confidence in the results and indicates that the identified relationships do not depend on the model employed. Economically, the results suggest that ESG outcomes reflect a mixture of past behavior and institutional capabilities.

Most of our empirical results are consistent with the literature on ESG prediction and financial machine learning. Ensemble methods such as Random Forest and XGBoost show better performance on structured, tabular datasets, reflecting their ability to capture nonlinear relationships and interaction effects among ESG-related variables, consistent with previous literature (D’Amato et al., 2021; Dipierro et al., 2025; Patel et al., 2026). At the same time, the results are consistent with recent literature indicating that deep learning models tend to be outperformed when the data contain even moderate levels of noise, or when heterogeneity and limited sample sizes arise in the application (Shwartz-Ziv & Armon, 2022). Moreover, the lower predictability of disclosure-based ESG measures is in line with literature highlighting the limiting role of reporting discretion, measurement error, and rating divergence in ESG data (Berg et al., 2022; D. M. Christensen et al., 2021).

6. Policy and Practical Implications

The findings have implications for both investors and policymakers. For investors, they suggest that ESG integration can become a forward-looking process rather than a static screening exercise. Because ESG performance is partly predictable, predicted ESG scores can be integrated into both portfolio construction and risk management. Investors can also adjust and increase their holdings in firms with a better ESG trajectory. Lastly, from an investment perspective, the persistence of ESG scores offers an explanation for why momentum-type strategies may develop in this area.

For policymakers, the findings emphasize the need for more consistent and standardized ESG reporting. The fact that scores can be predicted from observable characteristics indicates that they are not arbitrary; instead, they are driven by economic and institutional fundamentals. This illustrates the necessity of standardized disclosure frameworks that make results comparable across providers, which will enhance the integrity of ESG data.

7. Limitations and Future Research

Although this study advances the existing literature, some limitations should be noted. First, the assessment is based on data from two leading global ESG providers. Despite their broad coverage and wide use, ESG ratings differ from provider to provider because of different methodologies and metrics. This limits the generalization of the results to other datasets or measures of performance.

Second, the panel's time dimension is too short to estimate more complex temporal models (such as recurrent neural networks). Future studies should therefore examine these models in longer time series settings. Third, although the study evaluates prediction performance, it does not implement a complete portfolio management strategy or simulate policy or regulatory impacts. Adding portfolio optimization or policy evaluation to ESG prediction models would therefore help clarify their practical importance.

Future research may examine transferability across regions and providers, as well as the effects of variations in ESG definitions. This would provide a better understanding of the generalizability and practical usefulness of machine learning-based ESG analysis.

8. Conclusions

This study constructs a single framework that combines econometric and machine learning methods for forecasting ESG outcomes. The results suggest that ESG outcomes are predictable from observable firm-level data, with ensemble models performing best. The findings show that persistence in sustainability performance, governance policies, and financial attributes all matter.

The model comparisons demonstrate that, while machine learning is helpful for modeling complex relationships, it does not by itself offer the interpretability associated with a traditional econometric framework. The results also show that a complex model such as a neural network is not the best approach in every setting.

In summary, the framework contributes to the literature by providing consistent and interpretable estimates of ESG outcomes. It also underlines the centrality of data quality, model selection, and interpretability when machine learning is applied to sustainability information. The results have implications for ESG prediction in investment decision-making and policy design, although generalizability across datasets and external validity are limited.

Author Contributions

Conceptualization, T.M.S.M. and Y.I.; methodology, Y.I.; software, Y.I.; validation, T.M.S.M.; formal analysis, Y.I.; investigation, T.M.S.M.; resources, Y.I.; data curation, Y.I., K.H. and T.M.S.M.; writing—original draft preparation, Y.I., K.H. and T.M.S.M.; writing—review and editing, Y.I., K.H. and T.M.S.M.; visualization, Y.I.; supervision, K.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available from the corresponding author upon reasonable request.

Acknowledgments

During the preparation of this manuscript, the authors utilized Grammarly and ChatGPT (GPT-5.5) for proofreading purposes. The authors have thoroughly reviewed and edited the output and assume full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ESG	Environmental, Social, and Governance
CSR	Corporate Social Responsibility
ML	Machine Learning
AI	Artificial Intelligence
XGBoost	Extreme Gradient Boosting
MLP	Multi-Layer Perceptron
SHAP	SHapley Additive exPlanations
OLS	Ordinary Least Squares
FE	Fixed Effects
2SLS	Two-Stage Least Squares
VIF	Variance Inflation Factor
Lasso	Least Absolute Shrinkage and Selection Operator
MSE	Mean Squared Error
RMSE	Root Mean Squared Error
MAE	Mean Absolute Error
R²	Coefficient of Determination
KNN	K-Nearest Neighbors
DSR	Design Science Research
BESG	Bloomberg ESG Score (context-specific variable name)
X	Feature Matrix (input variables)
Y	Target Variable (output)
β (beta)	Model Coefficients
λ (lambda)	Regularization Parameter

References

Berg, F., Kölbel, J. F., & Rigobon, R. (2022). Aggregate confusion: The divergence of ESG ratings. Review of Finance, 26(6), 1315–1344.
Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(10), 281–305.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Chen, F. H., Chi, D. J., & Zhu, J. Y. (2014). Application of random forest, rough set theory, decision tree and neural network to detect financial statement fraud—Taking corporate governance into consideration. In Intelligent computing theory (Vol. 8588 LNCS, pp. 221–234). Springer.
Chen, T., & Guestrin, C. (2016, August 13–17). XGBoost: A scalable tree boosting system [Conference session]. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794), San Francisco, CA, USA.
Chowdhury, M. A. F., Abdullah, M., Azad, M. A. K., Sulong, Z., & Islam, M. N. (2023). Environmental, social and governance (ESG) rating prediction using machine learning approaches. Annals of Operations Research, 2023, 1–25.
Christensen, D. M., Serafeim, G., & Sikochi, A. (2021). Why is corporate virtue in the eye of the beholder? The case of ESG ratings. Accounting Review, 97(1), 147–175.
Christensen, H. B., Hail, L., & Leuz, C. (2021). Mandatory CSR and sustainability reporting: Economic analysis and literature review. Review of Accounting Studies, 26(3), 1176–1248.
D’Amato, V., D’Ecclesia, R., & Levantesi, S. (2021). Fundamental ratios as predictors of ESG scores: A machine learning approach. Decisions in Economics and Finance, 44(2), 1087–1110.
D’Amato, V., D’Ecclesia, R., & Levantesi, S. (2024). Firms’ profitability and ESG score: A machine learning approach. Applied Stochastic Models in Business and Industry, 40(2), 243–261.
Del Vitto, A., Marazzina, D., & Stocco, D. (2023). ESG ratings explainability through machine learning techniques. Annals of Operations Research, 2023, 1–30.
DiMaggio, P. J., & Powell, W. W. (2021). The iron cage revisited: Institutional isomorphism and collective rationality in organizational fields. The New Economic Sociology: A Reader, 48, 111–134.
Dipierro, A. R., Barrionuevo, F. J., & Toma, P. (2025). Predicting ESG controversies in banks using machine learning techniques. Corporate Social Responsibility and Environmental Management, 32(3), 3525–3544.
Dossa, J. V., Ukwuoma, C. C., Thomas, D., Dossa, J. M., & Gopang, A. A. (2025). Prediction of nexus among ESG disclosure and firm performance: Applicability, explainability and implications. Innovation and Green Development, 4(4), 100261.
Eccles, R., Ioannou, I., & Serafeim, G. (2014). The impact of corporate sustainability on organizational processes and performance. Management Science, 60(11), 2835–2857.
El Ghoul, S., Guedhami, O., Kwok, C. C. Y., & Mishra, D. R. (2011). Does corporate social responsibility affect the cost of capital? Journal of Banking & Finance, 35(9), 2388–2406.
Friede, G., Busch, T., & Bassen, A. (2015). ESG and financial performance: Aggregated evidence from more than 2000 empirical studies. Journal of Sustainable Finance and Investment, 5(4), 210–233.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. Available online: https://www.deeplearningbook.org (accessed on 1 March 2026).
Gu, S., Kelly, B., & Xiu, D. (2020). Empirical asset pricing via machine learning. The Review of Financial Studies, 33(5), 2223–2273.
Gunnarsson, B. R., vanden Broucke, S., Baesens, B., Óskarsdóttir, M., & Lemahieu, W. (2021). Deep learning for credit scoring: Do or don’t? European Journal of Operational Research, 295(1), 292–305.
Hajek, P., Abedin, M. Z., & Sivarajah, U. (2023). Fraud detection in mobile payment systems using an XGBoost-based framework. Information Systems Frontiers, 25(5), 1985–2003.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. Springer Series in Statistics. Springer.
Herawati, E., Agustin, F., Subranta, A., Solissa, F., & Wiriatmaja, N. U. (2024). Sustainable financial strategies: Analyzing the role of ESG in corporate financial performance and risk management. The Journal of Academic Science, 1(6), 813–820.
Hevner, A. R., March, S. T., Park, J., & Ram, S. (2004). Design science in information systems research. MIS Quarterly: Management Information Systems, 28(1), 75–105.
Ibrahim, Y., Moubarak, H., & Badawy, H. (2026). Explainable neural algorithms for corporate sustainability forecasting: A layered predictive model anchored in executive awareness, green finance, and digital innovation. Innovation and Green Development, 5(1), 100335.
Jensen, M. C., & Meckling, W. H. (1976). Theory of the firm: Managerial behavior, agency costs and ownership structure. Journal of Financial Economics, 3(4), 305–360.
Kaufman, S., Rosset, S., & Perlich, C. (2011). Leakage in data mining: Formulation, detection, and avoidance. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 4, 556–563.
Khamis, S., Ibrahim, Y., & Moubarak, H. (2025). Sustainability practices and financial performance: A meta-analysis approach. Journal of Financial Reporting and Accounting, 24, 1347–1371.
Khan, M., Serafeim, G., & Yoon, A. (2016). Corporate sustainability: First evidence on materiality. The Accounting Review, 91(6), 1697–1724.
Krauss, C., Do, X. A., & Huck, N. (2017). Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500. European Journal of Operational Research, 259(2), 689–702.
Lahmiri, S. (2016). Features selection, data mining and finacial risk classification: A comparative study. Intelligent Systems in Accounting, Finance and Management, 23(4), 265–275.
Lin, H.-Y., & Hsu, B.-W. (2023). Empirical study of ESG score prediction through machine learning—A case of non-financial companies in Taiwan. Sustainability, 15(19), 14106.
Lins, K. V., Servaes, H., & Tamayo, A. (2015). Social capital, trust, and firm performance during the financial crisis. Available online: https://papers.ssrn.com/abstract=2562924 (accessed on 14 February 2026).
Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 4766–4775.
Molnar, C. (2019). Interpretable machine learning: A guide for making black box models explainable. Lean Publishing. Available online: http://leanpub.com/interpretable-machine-learning (accessed on 14 February 2026).
Papíková, L., & Papík, M. (2022). Effects of classification, feature selection, and resampling methods on bankruptcy prediction of small and medium-sized enterprises. Intelligent Systems in Accounting, Finance and Management, 29(4), 254–281.
Patel, S., Nath, A., & Desai, P. (2026). Predicting ESG scores using machine learning for data-driven sustainable investment. Analytics, 5(1), 7.
Pedersen, L. H., Fitzgibbons, S., & Pomorski, L. (2021). Responsible investing: The ESG-efficient frontier. Journal of Financial Economics, 142(2), 572–597.
Peffers, K., Tuunanen, T., Rothenberger, M. A., & Chatterjee, S. (2007). A design science research methodology for information systems research. Journal of Management Information Systems, 24(3), 45–77.
Shin, J. J., Kwak, Y. H., & Park, C. (2024). Using machine learning to analyze ESG performance in project-driven sectors. Procedia Computer Science, 239, 840–845.
Shwartz-Ziv, R., & Armon, A. (2022). Tabular data: Deep learning is not all you need. Information Fusion, 81, 84–90.
Sirignano, J., & Cont, R. (2018). Universal features of price formation in financial markets: Perspectives from deep learning. Quantitative Finance, 19(9), 1449–1459.
Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems.
Spence, M. (1973). Job market signaling. The Quarterly Journal of Economics, 87(3), 355–374.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1), 267–288.
Yu, E. P. y., Van Luu, B., & Chen, C. H. (2020). Greenwashing in environmental, social and governance disclosures. Research in International Business and Finance, 52, 101192.
Zikriani, H., Fatwa, N., Fatiyurrobbany, F., Nasution, L. Z., & Rini, N. (2025). The impact of ESG principles implementation on risk management effectiveness in the banking sector. International Journal of Integrative Research, 3(2), 137–150.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(2), 301–320.

Figure 1. Target stability and drift (2017–2023).

Figure 2. Predicted and actual ESG scores.

Figure 3. Temporal stability R² across rolling validation windows.

Figure 4. Model fidelity (regression fit).

Figure 5. SHAP-based feature importance analysis.

Table 1. Comparative evidence on ESG prediction and dimensionality reduction approaches.

Study	Objective	Methodological Approach	ESG Reduction/Feature Handling	Treatment of ESG Noise and Divergence	Key Findings	Limitation
(Del Vitto et al., 2023)	Explain and replicate ESG rating construction and improve interpretability	Machine learning (tree-based models + explainability techniques)	Uses full ESG feature space; no explicit dimensionality reduction	Explicitly identifies measurement noise and inconsistencies in ESG ratings	ML models can replicate ESG scores and improve transparency through explainability tools	ESG ratings contain irreducible noise; limited ability to fully capture underlying constructs
(D’Amato et al., 2024)	Examine the relationship between ESG scores and firm profitability	Machine learning models with interpretability analysis	Implicit feature selection via model-based importance	Does not explicitly address ESG divergence	ESG scores are significant predictors of firm profitability; ML improves predictive accuracy	Focuses on financial outcomes rather than ESG measurement challenges
(Lin et al., 2023)	Analyze the relationship between ESG indicators and firm performance/sustainability outcomes	Empirical statistical analysis with ESG indicators	Variable selection based on ESG metrics	Limited discussion of ESG divergence or measurement error	ESG factors are associated with firm-level outcomes	Results are context-specific; limited generalizability
(Chowdhury et al., 2023)	Apply machine learning to ESG prediction or evaluation	Machine learning (predictive modeling, likely ensemble-based)	Includes feature engineering and selection	Limited explicit treatment of ESG divergence	ML methods enhance ESG prediction performance	Limited interpretability and insufficient treatment of ESG data noise
(D’Amato et al., 2021)	Examine ESG or sustainability impact on firm outcomes	Econometric/statistical modeling	No feature reduction	Does not explicitly address ESG noise or divergence	ESG dimensions influence firm-level outcomes	Static modeling; does not account for temporal dynamics
(Dipierro et al., 2025)	Investigate ESG/CSR impact on firm performance	Empirical panel regression analysis	No feature reduction	Limited treatment of ESG divergence	ESG/CSR engagement is associated with firm performance	Potential endogeneity and disclosure bias

Table 2. Hyperparameter tuning strategy for implemented ESG prediction models.

Model	Key Hyperparameters Tuned	Typical Search Range	Purpose
Random Forest Regressor	Number of estimators, maximum depth, minimum samples split, minimum samples leaf	Estimators: 100–500; Depth: 5–30; Split: 2–10; Leaf: 1–5	Controls tree complexity and ensemble diversity, reducing variance and improving generalization.
XGBoost Regressor	Learning rate, number of estimators, maximum depth	Learning rate: 0.01–0.1; Estimators: 100–500; Depth: 3–12	Regulates boosting process, balancing model bias and variance while improving predictive accuracy.
Multi-Layer Perceptron (MLP)	Hidden layer size, learning rate, regularization strength	Neurons: 32–128; Learning rate: 0.0001–0.01; Regularization: 0.0001–0.01	Controls network capacity and convergence stability in nonlinear approximation.
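
As an illustration of how the ranges in Table 2 could be turned into a search space, the following is a minimal sketch that draws candidate configurations at random. It makes several assumptions: the table does not state which search procedure was applied, so random sampling is used here only as one plausible option; the dictionary keys (`n_estimators`, `hidden_units`, `l2_strength`, and so on) are hypothetical labels rather than identifiers from the authors' code; and the log-uniform draw for learning rates and regularization strength is a common convention, not something the table specifies.

```python
import math
import random

# Ranges copied from Table 2. Key names are hypothetical labels chosen for
# this sketch, not identifiers taken from the authors' implementation.
SEARCH_SPACE = {
    "random_forest": {
        "n_estimators": ("int", 100, 500),
        "max_depth": ("int", 5, 30),
        "min_samples_split": ("int", 2, 10),
        "min_samples_leaf": ("int", 1, 5),
    },
    "xgboost": {
        "learning_rate": ("log", 0.01, 0.1),
        "n_estimators": ("int", 100, 500),
        "max_depth": ("int", 3, 12),
    },
    "mlp": {
        "hidden_units": ("int", 32, 128),
        "learning_rate": ("log", 0.0001, 0.01),
        "l2_strength": ("log", 0.0001, 0.01),
    },
}


def sample_config(model, rng):
    """Draw one candidate configuration for `model` from SEARCH_SPACE."""
    config = {}
    for name, (kind, low, high) in SEARCH_SPACE[model].items():
        if kind == "int":
            # randint is inclusive on both ends.
            config[name] = rng.randint(low, high)
        else:
            # Log-uniform draw (an assumption): samples are spread evenly
            # across orders of magnitude instead of clustering near `high`.
            config[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
    return config


rng = random.Random(0)  # fixed seed so the draws are reproducible
for _ in range(3):
    print(sample_config("xgboost", rng))
```

Each sampled dictionary would then be passed to the corresponding estimator and scored on a validation window; that step is omitted here because it depends on the modelling library and data, neither of which this sketch assumes.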

Table 3. Evaluation metrics of main model.

Algorithm	Sampling	R²	MAE	RMSE
Random Forest	Original	0.679	0.372	0.533
	Over-sampling	0.649	0.389	0.557
	Under-sampling	0.642	0.398	0.561
XGBoost	Original	0.664	0.384	0.547
	Over-sampling	0.598	0.410	0.587
	Under-sampling	0.591	0.421	0.601
MLP (Deep Learning)	Original	0.000	0.548	0.755
	Over-sampling	<0	0.637	0.861
	Under-sampling	<0	0.670	0.872

Table 4. Evaluation metrics of robustness analysis.

Algorithm	Sampling	R²	MAE	RMSE
Random Forest	Original	0.710	6.112	8.446
	Over-sampling	0.684	6.370	8.817
	Under-sampling	0.659	6.817	9.240
XGBoost	Original	0.697	6.274	8.693
	Over-sampling	0.650	6.648	9.237
	Under-sampling	0.617	7.046	9.825
MLP (Deep Learning)	Original	0.075	11.789	14.853
	Over-sampling	0.056	11.536	14.594
	Under-sampling	−0.535	15.514	19.069
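
To make the R², MAE, and RMSE columns in Tables 3 and 4 easier to read, the sketch below computes the three metrics from their standard definitions on a toy example. The numbers are invented for illustration and are not the paper's data. The point is that a model which always predicts the sample mean of the target scores R² = 0, and a model whose squared error exceeds that of the mean predictor scores R² < 0, which is how the near-zero and negative MLP entries can be interpreted.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy target values (hypothetical, not taken from the study).
y = [1.0, 2.0, 3.0, 4.0]

mean_pred = [2.5, 2.5, 2.5, 2.5]  # always predict the sample mean
bad_pred = [4.0, 3.0, 2.0, 1.0]   # systematically worse than the mean

print(r2(y, mean_pred), mae(y, mean_pred), rmse(y, mean_pred))
print(r2(y, bad_pred), mae(y, bad_pred), rmse(y, bad_pred))
```

Because MAE and RMSE are expressed in the units of the target, their magnitudes are only comparable within a table, not between Tables 3 and 4, whereas R² is unit-free.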

Table 5. Pooled OLS and fixed effects regression results.

Variable	Pooled OLS Coef.	FE Coef.	Pooled OLS p-Value	FE p-Value
Const	0.2302	0.7237	0.000	0.0000
BESG_Lag1	0.9058	0.6533	0.000	0.0000
Equal Opportunity Policy_Lag1	0.1449	0.2887	0.004	0.0071
Treasury Stock_Lag1	2.696 × 10⁻¹⁰	8.804 × 10⁻¹⁰	0.105	0.0191
Increase/Decrease in Short-Term Borrowings_Lag1	2.477 × 10⁻¹⁰	1.201 × 10⁻¹⁰	0.059	0.4192
Lease Commitments Year 1_Lag1	4.586 × 10⁻¹⁰	8.29 × 10⁻¹¹	0.005	0.6910

Table 6. Pooled OLS and fixed effects statistics.

Metric	Pooled OLS	Fixed Effects
R-squared	0.7860	0.5006 (Within)
Observations	713	713
Entities		186
F-statistic		104.65
Prob (F-statistic)		0.0000
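
The difference between the pooled OLS and fixed effects columns in Tables 5 and 6 can be illustrated with a small sketch of the within transformation, in which each variable is demeaned by entity before the slope is estimated. This is a one-regressor toy example with two hypothetical firms, not a reproduction of the reported models; it only shows how pooled OLS can pick up persistent level differences between firms that the fixed effects estimator removes.

```python
from collections import defaultdict


def ols_slope(x, y):
    """Slope of a one-regressor OLS fit with an intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx


def within_transform(entities, values):
    """Subtract each entity's own mean (the fixed effects 'within' step)."""
    groups = defaultdict(list)
    for entity, value in zip(entities, values):
        groups[entity].append(value)
    means = {entity: sum(vs) / len(vs) for entity, vs in groups.items()}
    return [value - means[entity] for entity, value in zip(entities, values)]


# Toy panel (hypothetical): inside each firm, y rises by 0.5 per unit of x,
# but firm B sits at a much higher level of both variables than firm A.
firm = ["A", "A", "A", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
y = [2.0, 2.5, 3.0, 12.0, 12.5, 13.0]

pooled_slope = ols_slope(x, y)
fe_slope = ols_slope(within_transform(firm, x), within_transform(firm, y))

print("pooled OLS slope:", pooled_slope)
print("fixed effects (within) slope:", fe_slope)
```

In this toy panel the pooled slope is close to 1 because it mixes between-firm and within-firm variation, while the within slope recovers the 0.5 relationship inside each firm. The within R-squared reported in Table 6 is likewise computed on demeaned data, so it is not directly comparable to the pooled R-squared.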

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ibrahim, Y.; Hussainey, K.; Moawad, T.M.S. A Design Science Approach to Predicting ESG Performance Using Ensemble Machine Learning. Int. J. Financial Stud. 2026, 14, 133. https://doi.org/10.3390/ijfs14050133

AMA Style

Ibrahim Y, Hussainey K, Moawad TMS. A Design Science Approach to Predicting ESG Performance Using Ensemble Machine Learning. International Journal of Financial Studies. 2026; 14(5):133. https://doi.org/10.3390/ijfs14050133

Chicago/Turabian Style

Ibrahim, Yara, Khaled Hussainey, and Taghred Mokhtar Sayed Moawad. 2026. "A Design Science Approach to Predicting ESG Performance Using Ensemble Machine Learning" International Journal of Financial Studies 14, no. 5: 133. https://doi.org/10.3390/ijfs14050133

APA Style

Ibrahim, Y., Hussainey, K., & Moawad, T. M. S. (2026). A Design Science Approach to Predicting ESG Performance Using Ensemble Machine Learning. International Journal of Financial Studies, 14(5), 133. https://doi.org/10.3390/ijfs14050133

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Article metric data becomes available approximately 24 hours after publication online.
