A Data-Driven Machine Learning Framework for Multi-Criteria ESG Evaluation

Wang, Zhiyuan; Lim, Tristan; Teng, Yun; Xia, Chongwu

doi:10.3390/bdcc10050130


¹ School of Business, Singapore University of Social Sciences, Singapore 599494, Singapore

² SUSS Academy, Singapore University of Social Sciences, Singapore 599494, Singapore

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Big Data Cogn. Comput. 2026, 10(5), 130; https://doi.org/10.3390/bdcc10050130

Submission received: 1 March 2026 / Revised: 12 April 2026 / Accepted: 20 April 2026 / Published: 22 April 2026

(This article belongs to the Section Data Mining and Machine Learning)


Abstract

This study proposes a novel data-driven machine learning (ML) framework for multi-criteria environmental, social, and governance (ESG) evaluation. The framework aims to address the transparency, consistency, and subjectivity limitations of existing ESG evaluation systems by employing a fully data-driven, modular, and ML-supported architecture. It comprises three main modules: (i) ESG data preprocessing with missing-data imputation by the MissForest algorithm; (ii) a three-plane ESG feature selection workflow that integrates clustering, feature importance, and classification algorithms to identify representative ESG indicators; and (iii) a hybrid weighting and ranking procedure that combines unsupervised principal component analysis (PCA), criteria importance through inter-criteria correlation (CRITIC), and technique for order preference by similarity to ideal solution (TOPSIS) methods. A recent 2024 real-world application involving 57 listed Chinese pharmaceutical and biotechnology companies and 70 ESG indicators demonstrates the framework’s practical utility in producing transparent and objective ESG rankings. The main contributions of this work are fourfold: (1) the development of an end-to-end, entirely data-driven ML framework for ESG evaluation; (2) the introduction of an innovative three-plane ESG feature selection workflow within the framework; (3) the first study for designing a hybrid PCA-CRITIC-TOPSIS approach in ESG weighting and ranking; (4) the validation of the framework through a real-world industry application using recent and authentic ESG data.

Keywords: ESG; financial decision-making; machine learning; sustainable finance; multi-criteria decision-making

1. Introduction

The concept of environmental, social, and governance (ESG) represents a comprehensive framework for evaluating corporate sustainability and ethical performance across the three interrelated dimensions [1]. Recent years have witnessed a significant surge in ESG investment activity [2], indicating a growing belief that sustainability-oriented practices are not merely ethical imperatives but strategic considerations. In line with this perspective, prior research suggested that incorporating a social responsibility constraint into a firm’s decision framework could lead to higher levels of profitability [3]. Historically, the term ESG was first formally introduced in the 2004 United Nations (UN) report entitled “Who Cares Wins” [4], a collaborative initiative between the UN Global Compact and leading financial institutions. Although ESG’s conceptual roots trace back to earlier movements in corporate social responsibility (CSR) and socially responsible investing (SRI) during the 1960s and 1970s [5], the formalization of ESG marked a pivotal shift toward measurable, data-driven sustainability assessment. This shift has also encouraged the use of econometric and machine learning (ML) methods to translate multi-dimensional ESG data into meaningful, decision-ready insights.

In parallel, ESG considerations have become increasingly central to corporate strategy, investment analysis, and policy development [6]. This growing importance stems from the recognition that environmental stewardship, social responsibility, and sound governance practices are not only moral imperatives but also critical determinants of long-term financial performance and risk management [7,8]. Fatemi and Fooladi [9] highlighted that rising societal awareness and global information flows increasingly reward firms demonstrating responsible ESG practices. As global markets face mounting challenges from climate change, resource scarcity, social inequality, and regulatory pressure, ESG performance has emerged as a key indicator of a company’s resilience and adaptability [10], and many sectors (including banking and finance) are increasingly turning to artificial intelligence (AI) and ML to strengthen their sustainability performance [11]. In view of this, institutional investors, rating agencies, and regulators now demand greater transparency and accountability in ESG disclosures to evaluate a firm’s long-term sustainability and risk profile [12]. Empirical studies have shown that firms with strong ESG practices tend to experience lower operational risks, enhanced innovation capacity, and improved access to financing, reinforcing the financial relevance of sustainable management [13]. From the prior studies, Zhang et al. [14] observed a positive relationship between ESG report disclosure and the amount of capital raised, reflecting that investors tend to favor firms demonstrating stronger ESG performance. Consequently, ESG integration has evolved from a voluntary disclosure or reporting exercise into a strategic necessity, underpinning corporate competitiveness and long-term value creation in an increasingly sustainability-driven global economy, and ESG ratings have become central tools in assessing corporate sustainability.

Our review of the literature indicates that current ESG rating systems exhibit several notable limitations, particularly regarding transparency and consistency (as further discussed in the Literature Review in Section 2). Many rating systems incorporate considerable subjectivity in their evaluation process, leading to inconsistent and sometimes non-reproducible outcomes. These shortcomings create challenges for regulators and stakeholders seeking greater accountability and reliability. Such limitations highlight the pressing need for an objective, transparent, and data-driven ESG rating framework that minimizes subjectivity, properly accounts for multiple criteria (indicators), and ensures methodological rigor.

To address these shortcomings, this study proposes a novel, objective, end-to-end ML framework for multi-criteria ESG evaluation and rating. The framework is designed to enhance transparency, consistency, and methodological rigor throughout the ESG evaluation process. It consists of three major modules (see Figure 1 for a graphical illustration). First, in the data-processing stage, missing data are handled using the MissForest ML algorithm, which is based on random forest (RF). MissForest can effectively impute both numerical and categorical variables while preserving the underlying data structure. Second, a novel three-plane ESG feature selection workflow is developed to identify the most representative and non-redundant indicators. This workflow employs a portfolio of ML techniques, including clustering algorithms such as Gaussian mixture model (GMM), k-means, and spectral clustering; feature importance estimation using minimum-redundancy-maximum-relevance (mRMR); and supervised learning algorithms such as support vector machines (SVM), neighborhood components analysis (NCA), linear discriminant analysis (LDA), and RF, together with SHAP (Shapley additive explanations), for validation and interpretability. Third, to apply multi-criteria decision-making (MCDM) to company ESG rating, a hybrid weighting and ranking procedure is introduced, in which the unsupervised ML algorithm principal component analysis (PCA) is combined with the CRITIC method to derive data-driven indicator weights. These weights are subsequently incorporated into TOPSIS to calculate composite ESG performance scores and rank companies objectively. To the best of our knowledge, this study is the first to apply the comprehensive PCA-CRITIC-TOPSIS MCDM approach in ESG evaluation, although some simpler MCDM methods have been increasingly adopted across recent ESG research (as shown in the Literature Review in Section 2). Together, these components form an objective and fully ML-supported multi-criteria ESG evaluation framework.

The key contributions of this work can be summarized in four main points: (1) it proposes a holistic end-to-end ML framework for multi-criteria ESG evaluation, integrating data preprocessing, feature selection, indicator weighting, and company ranking into a unified, fully data-driven pipeline; (2) within the framework, it introduces a novel three-plane ESG feature selection workflow that systematically combines unsupervised ML discovery, supervised ML validation, and explainable attribution to ensure the stability, interpretability, and robustness of selected ESG indicators; (3) a hybrid PCA-CRITIC-TOPSIS approach is innovatively designed as the weighting and ranking scheme (the first of its kind in ESG study), which merges unsupervised ML with MCDM principles to objectively derive indicator weights and generate ESG rankings; and (4) it demonstrates the practical applicability of the proposed framework using authentic and recent 2024 ESG data of 57 listed Chinese pharmaceutical and biotechnology companies with 70 ESG indicators, providing a replicable case study of ESG evaluation.

It is important to distinguish the proposed framework from existing hybrid approaches in the ESG MCDM literature (as Section 2 shall further elaborate). Recent studies have applied various MCDM methods to ESG evaluation. While these studies make valuable contributions, they typically focus on the weighting and ranking stages of ESG evaluation, without addressing the preceding pipeline stages. Specifically, three key gaps remain in the existing literature. First, most existing frameworks either rely on subjective expert-derived weights or use a single objective weighting method, without exploiting the complementary strengths of multiple data-driven weighting approaches. Second, none of the existing ESG MCDM studies incorporates a systematic, ML-driven feature selection procedure to identify the most representative ESG indicators prior to weighting and ranking; instead, they typically adopt pre-defined indicator sets without formal selection or validation. Third, the critical issue of missing data in ESG datasets (this can significantly distort evaluation outcomes) is rarely addressed. The present study fills these gaps by proposing a holistic end-to-end ML framework that integrates a non-parametric missing-data imputation via MissForest, a novel three-plane ML-aided feature selection workflow with leakage prevention and silo-balanced interpretability, and a hybrid PCA-CRITIC-TOPSIS weighting and ranking scheme that is entirely data-driven and requires no subjective input. To the best of our knowledge, no existing ESG study offers this degree of pipeline coverage, from raw data to final ranking, within a unified and fully objective ML architecture.

The remainder of this article is organized as follows. Section 2 reviews the relevant literature on ESG, with emphasis on ESG ratings, ESG feature selection, and the application of MCDM in ESG evaluation. Section 3 provides a detailed description of the proposed end-to-end ML framework for multi-criteria ESG evaluation, including its data-processing, feature selection, weighting, and ranking components. Section 4 applies the proposed framework to a real-world case study involving 57 listed Chinese pharmaceutical and biotechnology companies and 70 ESG indicators, demonstrating its practicality and effectiveness. Section 5 presents the discussion and limitations of the present study. Finally, Section 6 concludes the paper and offers recommendations for future research directions.

2. Literature Review

2.1. ESG and Corporate Strategy

ESG has become increasingly important in corporate strategy over the past decade, a trend that is also observed in emerging markets globally [15]. ESG serves not only compliance or reputational purposes, but also acts as a vital driver of firms' expansion performance [16]. A company with a high ESG standard tends to have higher risk resilience and international competitiveness [17]. A number of recent studies show that ESG performance has positive effects on companies’ trade. Li [18] found that Chinese manufacturing companies with higher ESG scores tend to export more, owing to the lower operating costs and eased financing constraints associated with ESG compliance. Similarly, Wu et al. [19] analyzed data collected from Chinese listed companies and found that companies with better ESG performance tend to exhibit more favorable trading activity, an effect mainly driven by innovation and lower financial constraints. Investors and governments tend to place more trust in companies with better ESG performance, so these companies enjoy lower capital costs and easier access to credit [16]. These results were supported by Cai and Hao [20], who found that stronger financial outcomes are linked to superior ESG performance, especially for companies investing in green technology innovation. Jiao and Liu [21] investigated investment portfolio optimization problems that integrate multiple ESG factors across different categories of investors. Tan et al. [22] noted that an increasing number of listed firms are participating in ESG activities, including green transformation, social responsibility programs, and governance innovation.

2.2. ESG Ratings: Providers, Divergence, and Challenges

To better evaluate firms’ sustainability efforts, ESG ratings have become important tools in assessing corporate sustainability and responsible investment performance. The growing demand for companies with strong ESG performance reflects investors’ increasing attention to sustainability [23]. As institutional investors and regulators push for greater transparency, ESG ratings serve as a key mechanism to bridge sustainability disclosure and financial market decision-making [24]. Currently, prominent ESG data providers such as London Stock Exchange Group (LSEG) Data and Analytics (formerly Refinitiv), Morgan Stanley Capital International (MSCI), Sustainalytics, and Bloomberg supply detailed datasets for decision-making. However, these agencies differ in scope and rely on distinct, often proprietary, methodologies. LSEG Data and Analytics evaluates about 12,000 public and private firms across 76 countries using 630 ESG measures grouped into ten categories and three pillars, with scores ranging from 0 to 100 that reflect ESG performance and disclosure quality. MSCI rates over 8500 firms globally on their exposure and response to ESG risks, using a scale from leader (AAA, AA) to laggard (B, CCC). Sustainalytics covers more than 16,000 firms, assessing over 70 industry-specific indicators to produce numeric risk ratings, where lower scores indicate lower ESG risk. Bloomberg also assesses the extent of companies’ disclosure of their ESG activities and then estimates disclosure scores ranging between 0.1 and 100 [25].

As a result of differences in scope and methodology, research documents persistent variation (often termed ESG rating divergence or disagreement), reflecting a lack of consensus across providers [26]. Berg et al. [27] reported correlations ranging from 0.38 to 0.71 across six raters and 924 firms, while Billio et al. [28] found correlations of 0.43 to 0.69 among four major providers. Zhu et al. [29] documented pairwise correlations ranging from 0.057 to 0.736 for 195 Chinese firms, averaging 0.411, and Capizzi et al. [30] reported values from 0.03 to 0.64 for Italian firms. Kimbrough et al. [31] quantified average disagreement levels of 26.69 between LSEG and MSCI, and 19.21 when including Moody’s, indicating substantial dispersion in firm assessments. Notably, Charlin et al. [32] observed that ESG ratings are even less consistent than subjective domains like wine tasting. Disagreement is greater for individual pillars than for aggregate ESG scores. Dimson et al. [33] reported correlations of 0.45 for overall ESG, but only 0.42 (E), 0.30 (S), and 0.07 (G). Capizzi et al. [30] also revealed the weakest governance correlations (0.06 to 0.09), compared to environmental (0.28 to 0.35) and social (0.35 to 0.43) scores, while Brandon et al. [34] observed similar patterns among S&P 500 firms. These studies highlighted the global inconsistency in ESG evaluations.

The divergence in ESG ratings can be attributed to four main dimensions. First, what is being rated matters. Definitions of what constitutes good or bad practice differ across the environmental, social, and governance pillars, as do the weighting schemes. The emphasis placed on specific dimensions (e.g., board diversity vs. shareholder rights) reflects distinct conceptual frameworks and normative assumptions. These definitional variations account for a substantial portion of the observed rating discrepancies [27,35]. Second, who conducts the rating matters. Each rater’s philosophy and bias introduce systematic divergence. Geographic clustering also influences alignment, with stronger correlations found among raters operating within the same region [27,36]. Third, firm-level factors play a role. The quality, consistency, and comprehensiveness of a firm’s ESG disclosures shape how raters interpret available information. High-quality and consistent disclosures reduce information asymmetry, whereas excessive or strategically curated reporting can create information overload [31,37], and even greenwashing [38], the practice of overstating or misrepresenting environmental or social performance to appear more sustainable than reality. Firm size, operational complexity, industry characteristics, and geographic context further affect consistency, as larger or more diversified firms are inherently harder to evaluate [34]. Finally, methodological differences are also a source of disagreement [27]. These include differences in measurement design, benchmarking standards, data sources, missing-data treatment, and retrospective score revisions. Taken together, these conceptual, rater-specific, firm-level, and methodological factors help explain why ESG assessments often diverge markedly across providers, undermining the comparability and reliability of ESG ratings worldwide.

The inconsistency and divergence in ESG ratings create significant frictions in financial markets. In general, rating inconsistency leads to valuation distortions, as reflected in wider analyst forecast dispersion and larger bid-ask spreads [31,34]. It also weakens the predictive power of consensus ESG ratings and dampens market reactions to ESG-related news [35]. In short, our extensive literature review suggests that while ESG ratings serve as vital tools for evaluating firms’ sustainability performance, they still exhibit notable shortcomings. Developing a more objective, transparent, and data-driven ESG rating framework is therefore essential, as inconsistent and unreliable ratings can distort market perceptions, create inefficiencies in financial markets, and hinder firms’ efforts to enhance their ESG performance. As mentioned, this work proposes a novel end-to-end ML framework for multi-criteria ESG evaluation and rating.

2.3. ESG Feature Selection Methodologies

Constructing an ESG rating first requires an intermediate step: finding a relatively small, reliable set of ESG indicators that can separate or rank firms in meaningful ways, while showing clearly why those indicators matter. However, ESG data present a difficult learning regime [39]. Challenges include varied scales, strong within-silo redundancy, weak cross-silo alignment, small samples relative to their dimensionality, and labels that are either scarce or disputed. Accordingly, an effective ESG feature selection workflow should aim to: (i) ingest and normalize heterogeneous, high-dimensional E/S/G inputs; (ii) reduce redundancy while preserving balanced coverage across silos after projection; (iii) estimate predictive relationships under leak-aware evaluation for downstream targets such as returns, risk, or controversies; and (iv) return explanations that map cleanly to concrete, auditable indicators to support oversight and rebalancing [37,40].

Classical internal geometry indices (e.g., silhouette score, gap statistic) and likelihood criteria (e.g., Akaike/Bayesian information criterion) are widely used for model-order selection; however, in heterogeneous, low-$n$, high-$p$ ($n < p$) datasets, these criteria can be unstable across preprocessing configurations, initializations, and subsamples [41]. To mitigate this instability, consensus (ensemble) clustering aggregates multiple base partitions, varying algorithms, resolutions, and seeds, often with row subsampling and feature bagging, into a co-association matrix that encodes how frequently pairs co-cluster across perturbations; clustering a dissimilarity derived from this matrix operationalizes evidence accumulation and reduces variance relative to any single method [42]. Within this consensus space, stability-based model-order selection uses resampling diagnostics to prioritize resolutions that produce decisive “almost always together/almost never together” pairwise assignments, with parsimony and minimum-mass constraints to preserve downstream evaluability [43,44].
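The evidence-accumulation idea described above can be illustrated with a minimal sketch. It assumes k-means as the only base clusterer, row subsampling without feature bagging, and average-linkage clustering of the dissimilarity 1 − C; the function names `coassociation_matrix` and `consensus_labels` and all parameter values are hypothetical and are not taken from the reviewed studies.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans


def coassociation_matrix(X, ks=(2, 3, 4), n_runs=10, row_frac=0.8, seed=0):
    """Fraction of runs in which each pair of rows lands in the same cluster.

    Assumption: base partitions come from k-means over several k values and
    random row subsamples; counts are normalised by how often a pair was
    sampled together.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    together = np.zeros((n, n))
    sampled = np.zeros((n, n))
    for k in ks:
        for _ in range(n_runs):
            idx = rng.choice(n, size=int(row_frac * n), replace=False)
            km = KMeans(n_clusters=k, n_init=5,
                        random_state=int(rng.integers(10**6)))
            labels = km.fit_predict(X[idx])
            same = (labels[:, None] == labels[None, :]).astype(float)
            together[np.ix_(idx, idx)] += same
            sampled[np.ix_(idx, idx)] += 1.0
    return np.divide(together, sampled,
                     out=np.zeros_like(together), where=sampled > 0)


def consensus_labels(C, n_clusters):
    """Cluster the dissimilarity 1 - C with average linkage."""
    D = 1.0 - C
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")


# Toy data: two well-separated groups of 20 points each.
toy_rng = np.random.default_rng(1)
X_toy = np.vstack([toy_rng.normal(0.0, 1.0, size=(20, 3)),
                   toy_rng.normal(10.0, 1.0, size=(20, 3))])
C_toy = coassociation_matrix(X_toy)
final = consensus_labels(C_toy, n_clusters=2)
```

Entries of `C_toy` close to 0 or 1 correspond to the decisive "almost never together/almost always together" pairs that stability-based model-order selection favors.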

As ESG indicators are organized into semantically distinct blocks, multiblock analysis advocates a compress-then-fuse representation involving within-silo standardization, small per-silo PCA to collapse redundancies, concatenation of block scores, and optional global projection to ensure balanced, interpretable fusion across E/S/G dimensions [45]. Leak-aware evaluation is another critical consideration, as data-dependent operations performed on the full dataset can induce optimistic bias, particularly in small-sample settings [46,47]. Nested protocols, which restrict all transformations and label inferences to training folds and score only on held-out data, are essential yet often under-specified in unsupervised ESG-related workflows [48]. Interpretability further plays a central role in ESG screening and index construction, where model explanations must remain faithful, stable, and auditable [49]. Stability selection enhances reproducibility under resampling [50], while post hoc tools such as SHAP provide additive, instance-level attributions that can be aggregated for feature and silo interpretability [51]. Collectively, these methodological considerations (multi-view compression, leak-aware nesting, and fold-clean interpretability) can directly address the statistical and governance challenges inherent in ESG feature selection analytics, as Section 3.2 further elaborates.
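As a rough illustration of the compress-then-fuse representation, the sketch below standardizes each silo, applies a small per-silo PCA, and concatenates the block scores. The silo-to-column mapping, the number of components, and the function name `compress_then_fuse` are assumptions made for the example. For brevity the transforms are fitted on the full matrix; under a leak-aware protocol they would be fitted on training folds only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def compress_then_fuse(X, silos, n_components=2):
    """Within-silo standardisation, per-silo PCA, then concatenation.

    `silos` maps a silo name to the list of column indices it owns
    (a hypothetical layout chosen for this example).
    """
    blocks = []
    for _, cols in silos.items():
        Z = StandardScaler().fit_transform(X[:, cols])
        k = min(n_components, len(cols))
        blocks.append(PCA(n_components=k).fit_transform(Z))
    return np.hstack(blocks)


# Toy matrix: 57 firms, 12 indicators split evenly across E, S, and G.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(57, 12))
silos_toy = {"E": [0, 1, 2, 3], "S": [4, 5, 6, 7], "G": [8, 9, 10, 11]}
fused = compress_then_fuse(X_toy, silos_toy, n_components=2)
```

Because each silo contributes the same number of block scores, no single pillar dominates the fused representation merely by having more raw indicators.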

2.4. Multi-Criteria Decision-Making in ESG Evaluation

MCDM methods have been attracting attention in ESG evaluation and rating due to their inherent capability to reconcile diverse, often conflicting, indicators (criteria) into a coherent ranking and/or performance score. ESG performance assessment inherently involves multiple dimensions. As a result, methods like the technique for order preference by similarity to ideal solution (TOPSIS) and analytic hierarchy process (AHP) are becoming popular in the ESG literature, enabling firms, investors, and regulators to systematically evaluate ESG performance across multiple criteria. For instance, Wanke et al. [52] adopted a two-stage fuzzy TOPSIS method to examine banking efficiency in the BRICS (Brazil, Russia, India, China, and South Africa) countries, considering several social dimensions. Reig-Mullor et al. [53] employed the TOPSIS method to evaluate and rank corporate ESG performance, addressing uncertainty and subjectivity in sustainability assessment through a hybrid fuzzy framework applied to firms in the oil and gas sector. Sood et al. [54] utilized the fuzzy AHP MCDM method to investigate ESG factors influencing individual equity investors’ decision-making in India. Meng and Shaikh [55] combined the AHP and weighted aggregated sum product assessment (WASPAS) methods to prioritize green finance investment strategies, identifying environmental factors as most critical for sustainable and ethical investment development. Yu et al. [56] developed an integrated MCDM framework for determining ESG criteria weights and ranking firms, evaluating the ESG sustainable performance of listed companies using the combined compromise solution (CoCoSo) method. Rathi et al. [57] applied TOPSIS to rate companies’ ESG performance in the electric utilities and independent power producers industry, identifying the most sustainable firms based on multiple ESG criteria. Assefa et al. [58] used the PROMETHEE (preference ranking organization method for enrichment evaluation) MCDM method to analyze the relationship between companies’ ESG performance and financial performance across S&P 500 firms between the years 2010 and 2023. Sklavos et al. [59] proposed a two-stage integrated MCDM framework combining data envelopment analysis (DEA), criteria importance through inter-criteria correlation (CRITIC), and TOPSIS to evaluate the ESG-driven eco-efficiency and green accounting practices of European financial institutions. The key findings from the reviewed literature are summarized in Table 1.

3. Methodology

This section presents the proposed end-to-end ML framework for multi-criteria ESG evaluation in detail. The framework comprises four sequential stages. Section 3.1 describes the data preprocessing module, which handles data cleaning and missing-value imputation using the MissForest algorithm. Section 3.2 presents the novel three-plane ESG feature selection workflow, consisting of unsupervised discovery, supervised prediction with nested evaluation, and fold-local explanation. Section 3.3 introduces the hybrid PCA-CRITIC weighting scheme used to determine objective indicator weights. Finally, Section 3.4 describes the TOPSIS method employed for multi-criteria ranking of companies. Figure 1 provides a graphical illustration of the proposed framework.

To clarify the originality of each component within the proposed framework, the data preprocessing stage (Section 3.1) adopts the established MissForest algorithm for ESG data imputation, chosen for its proven effectiveness with mixed-type data and non-linear interactions; this is an application of an existing technique to the ESG context. The three-plane ESG feature selection workflow (Section 3.2) is an original contribution of this study, designed specifically to address the challenges of high-dimensional, multi-silo ESG data. While individual components within this workflow (e.g., PCA, k-means, GMM, spectral clustering, SHAP) are well-established techniques, their systematic orchestration into a three-plane architecture (comprising evidence accumulation with auto-k, nested cross-validation with label-free evaluation guards, and fold-local silo-balanced SHAP attribution) is novel and has not been previously proposed in the literature. The PCA and CRITIC weighting methods are individually well-documented techniques from the MCDM literature; however, their multiplicative hybridization into a PCA-CRITIC hybrid weighting scheme (Section 3.3) and the subsequent integration with TOPSIS for ESG ranking (Section 3.4) constitute a novel integration of complementary weighting perspectives within the ESG domain. We note that the individual methods (PCA, CRITIC, and TOPSIS) are well-established in the MCDM literature; the contribution lies in their systematic integration into an end-to-end ML framework that specifically addresses the challenges of ESG data, including within-silo redundancy, mixed optimization directions, and the need for fully objective weighting. To the best of our knowledge, this particular combination has not been previously applied in the ESG evaluation context.
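To make the multiplicative PCA-CRITIC hybridization and the subsequent TOPSIS ranking more concrete, the following is a minimal sketch under stated assumptions. The CRITIC and TOPSIS steps follow their standard textbook forms. The PCA-based weight (absolute loadings weighted by explained-variance ratio over the components covering 80% of variance) is only one common convention and is an assumption here, not necessarily the definition used in this study. All function names and the toy data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA


def critic_weights(X):
    """Standard CRITIC: contrast intensity (std) times conflict (1 - r)."""
    Z = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    sigma = Z.std(axis=0, ddof=1)
    R = np.corrcoef(Z, rowvar=False)
    info = sigma * (1.0 - R).sum(axis=0)
    return info / info.sum()


def pca_weights(X, var_kept=0.8):
    """Assumed convention: variance-weighted absolute loadings."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    pca = PCA().fit(Z)
    ratio = pca.explained_variance_ratio_
    k = min(int(np.searchsorted(np.cumsum(ratio), var_kept) + 1), len(ratio))
    raw = np.abs(pca.components_[:k]).T @ ratio[:k]
    return raw / raw.sum()


def topsis(X, w, benefit):
    """Closeness to the ideal solution; `benefit` flags higher-is-better."""
    V = X / np.linalg.norm(X, axis=0) * w
    best = np.where(benefit, V.max(axis=0), V.min(axis=0))
    worst = np.where(benefit, V.min(axis=0), V.max(axis=0))
    d_best = np.linalg.norm(V - best, axis=1)
    d_worst = np.linalg.norm(V - worst, axis=1)
    return d_worst / (d_best + d_worst)


# Toy data: 8 firms, 4 indicators; the last indicator is a cost criterion.
rng = np.random.default_rng(0)
X_toy = rng.uniform(1.0, 10.0, size=(8, 4))
benefit = np.array([True, True, True, False])
X_toy[0] = np.where(benefit, 11.0, 0.5)   # firm 0 is best on every criterion
X_toy[1] = np.where(benefit, 0.5, 11.0)   # firm 1 is worst on every criterion

w = pca_weights(X_toy) * critic_weights(X_toy)   # multiplicative hybrid
w = w / w.sum()
closeness = topsis(X_toy, w, benefit)
ranking = np.argsort(-closeness)                 # best firm first
```

In this toy example the firm that dominates every criterion receives a closeness of 1 and the dominated firm a closeness of 0, whatever the weights, which is a quick sanity check on the handling of mixed optimization directions.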

3.1. Data Preprocessing

The raw ESG dataset typically consists of many indicators collected from heterogeneous sources (e.g., annual reports, sustainability disclosures, third-party databases) and is plagued by significant missing data, resulting in inconsistencies in ESG ratings [60]. In the present study, the dataset first undergoes three structured preprocessing steps: (i) data cleaning, (ii) missing-value imputation, and (iii) quality validation. Data cleaning begins with duplicate resolution, where records referring to the same firm–indicator pair are matched through deterministic keys (e.g., company name and indicator label), and only the instance from the most authoritative source is retained. Next, units and fields are harmonized by standardizing measurement units (e.g., energy in MWh, emissions in tCO₂e) and discarding non-analytical metadata (e.g., disclosure year, reporting source, and free-text labels). Indicators exhibiting excessively high missingness (more than 90% missing values) are then removed. This threshold is chosen because, with a sample of $n = 57$ companies, an indicator with more than 90% missing values has fewer than six observed entries, which is insufficient for reliable random forest-based imputation; imputing such a variable would introduce more uncertainty than informational value. In practice, lowering this threshold to 80% or 70% would remove additional indicators from the candidate pool; however, the downstream three-plane feature selection workflow independently identifies the most informative indicators from whatever set remains, providing a secondary robustness mechanism against threshold variation. Finally, the cleaned records are reshaped into an $n \times p$ matrix, where $n$ is the number of firms and $p$ is the number of indicators.
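The missingness filter can be sketched in a few lines of pandas. The column names and the helper `drop_sparse_indicators` are hypothetical; only the 90% threshold and the sample size of 57 firms come from the text.

```python
import numpy as np
import pandas as pd


def drop_sparse_indicators(df: pd.DataFrame, max_missing: float = 0.90) -> pd.DataFrame:
    """Keep indicators whose share of missing values does not exceed max_missing."""
    missing_share = df.isna().mean(axis=0)
    return df.loc[:, missing_share <= max_missing]


# Toy firm-by-indicator table with 57 firms and three made-up indicators.
rng = np.random.default_rng(0)
n_firms = 57
toy = pd.DataFrame({
    "energy_mwh": rng.normal(size=n_firms),
    "board_independence": rng.normal(size=n_firms),
    "rare_disclosure": np.nan,
})
toy.loc[:4, "rare_disclosure"] = 1.0          # 5 observed, 52/57 (about 91%) missing
toy.loc[:19, "board_independence"] = np.nan   # 20/57 (about 35%) missing

filtered = drop_sparse_indicators(toy)
```

Here `rare_disclosure` has only five observed entries out of 57, so it exceeds the 90% threshold and is dropped, while the other two indicators are retained for imputation.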

To address incomplete (i.e., missing) entries, the study adopts MissForest [61], a non-parametric iterative imputation algorithm based on the random forest (RF) algorithm. RF predictive models are combinations of decision trees and bootstrap aggregation [62]. Unlike parametric methods (e.g., mean imputation, multivariate normal models), MissForest can handle mixed-type data (continuous and categorical), capture non-linear interactions, and does not assume any distributional form. The pseudocode of the MissForest algorithm is provided in Appendix A Algorithm A1; readers may also refer to Stekhoven and Bühlmann [61] for further details. As the MissForest pseudocode illustrates, it first makes an initial guess for the missing values of each feature (i.e., indicator) $X_j$, such as imputing them with mean/median/mode values as provisional placeholders. It then iteratively treats each feature with missing values as a prediction target in turn. Specifically, for a given $X_j$, the rows where $X_j$ is observed are used to train an RF model (i.e., an RF regressor for a numeric variable and an RF classifier for a categorical variable). Here, the observed part of column $X_j$ serves as the target (or output, denoted as $y_{obs}^{(j)}$ in the pseudocode), while all remaining columns in those rows are treated as predictors (or inputs, denoted as $x_{obs}^{(j)}$ in the pseudocode). Subsequently, the missing values in $X_j$ are predicted by applying the trained RF model to the predictors of the rows where $X_j$ is missing (i.e., $x_{mis}^{(j)}$), yielding $y_{mis}^{(j)}$. These predictions replace the initial placeholders, and the algorithm iterates through all incomplete features until convergence (or until the stopping criterion is met). The iterative refinement allows the imputations to improve gradually as better estimates feed back into the RF models.
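The iterative scheme described above can be sketched for the numeric-only case as follows. This is a simplified illustration rather than the implementation used in the study: it assumes every column has at least one observed value, handles only continuous variables, and uses the stopping rule of Stekhoven and Bühlmann [61] (stop when the change between successive imputations first increases, and return the previous imputation). The function name `missforest_numeric` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def missforest_numeric(X, max_iter=10, n_estimators=50, random_state=0):
    """MissForest-style imputation for a numeric matrix with NaNs."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    X_imp = X.copy()
    if not mask.any():
        return X_imp
    # Initial guess: column means as provisional placeholders.
    col_means = np.nanmean(X, axis=0)
    X_imp[mask] = np.take(col_means, np.where(mask)[1])
    # Visit incomplete columns in order of increasing missingness.
    order = [j for j in np.argsort(mask.sum(axis=0)) if mask[:, j].any()]
    prev_diff = np.inf
    for _ in range(max_iter):
        X_old = X_imp.copy()
        for j in order:
            obs, mis = ~mask[:, j], mask[:, j]
            others = np.delete(np.arange(X.shape[1]), j)
            rf = RandomForestRegressor(n_estimators=n_estimators,
                                       random_state=random_state)
            rf.fit(X_imp[obs][:, others], X_imp[obs, j])        # x_obs -> y_obs
            X_imp[mis, j] = rf.predict(X_imp[mis][:, others])   # x_mis -> y_mis
        diff = np.sum((X_imp[mask] - X_old[mask]) ** 2) / np.sum(X_imp[mask] ** 2)
        if diff > prev_diff:
            return X_old   # change increased: keep the previous imputation
        prev_diff = diff
    return X_imp


# Toy data: 57 firms, 4 correlated indicators, roughly 10% of entries missing.
rng = np.random.default_rng(0)
latent = rng.normal(size=(57, 1))
X_full = latent + 0.3 * rng.normal(size=(57, 4))
mask_toy = rng.random(X_full.shape) < 0.10
X_toy = X_full.copy()
X_toy[mask_toy] = np.nan

X_hat = missforest_numeric(X_toy)
```

Observed entries are never overwritten; only the positions flagged in the missingness mask are updated from one iteration to the next.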

MissForest is selected in this study because of its proven performance in imputations for environmental and healthcare data. The results from relevant studies show that MissForest imputation performs better than other common algorithms, such as K-Nearest Neighbour (KNN) and Multivariate Imputation By Chained Equations (MICE) [63,64], especially when dealing with datasets with different types of variables [61].

To validate the imputation accuracy, the imputation is repeated on the same dataset with 5% of the originally observed entries artificially masked to create additional missing values. The imputed values of the masked entries are then compared with their true values to test the reliability and accuracy of the imputation model. Multiple metrics are employed to evaluate the imputation performance, including the coefficient of determination (R²) and the normalized root mean squared error (NRMSE). An R² value close to 1 and a low NRMSE value indicate good imputation accuracy. Through these cleaning, imputation, and validation steps, the ESG dataset becomes complete and ready for the next stage of analysis.
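The masking-based validation can be illustrated with a short sketch. The helper names `mask_observed` and `r2_and_nrmse` are hypothetical, and the NRMSE is normalized here by the standard deviation of the true values, which is an assumption because the text does not state the normalization; other conventions (e.g., dividing by the range) exist and yield different values.

```python
import math
import random


def r2_and_nrmse(true_vals, imputed_vals):
    """Return (R^2, NRMSE) for imputed values against the masked ground truth.

    Assumption: NRMSE = RMSE / std(true values); the source does not specify
    which normalization it uses.
    """
    n = len(true_vals)
    mean_true = sum(true_vals) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals))
    ss_tot = sum((t - mean_true) ** 2 for t in true_vals)
    r2 = 1.0 - ss_res / ss_tot
    nrmse = math.sqrt(ss_res / n) / math.sqrt(ss_tot / n)
    return r2, nrmse


def mask_observed(cells, fraction=0.05, seed=0):
    """Pick a random subset of observed (row, col) cells to hide."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(cells)))
    return rng.sample(cells, k)


# Sanity check: perfect imputations give R^2 = 1 and NRMSE = 0.
truth = [1.0, 2.0, 3.0, 4.0]
print(r2_and_nrmse(truth, truth))
```

In a full run, the cells returned by `mask_observed` would be set to missing, the imputer re-run, and the two metrics computed on those cells only.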

3.2. ESG Feature Selection

To derive a robust set of input features for ESG evaluation, this research develops a three-plane feature selection workflow (Figure 2) for small-n, high-p ESG data that separates (i) unsupervised discovery, (ii) supervised prediction with strict held-out evaluation, and (iii) fold-local explanation computed on held-out test data. The design targets three requirements: extracting structure without labels, inducing and using labels strictly within training folds, and attributing model behavior only to held-out data. Built-in feasibility guards ensure that evaluation proceeds even under degenerate clustering, without compromising the separation between training and test data.

Figure 2 summarizes the three-plane feature selection and attribution process. The discovery plane performs data pruning, stability screening, and redundancy control to generate a consolidated, silo-balanced top-15 feature set. The prediction plane ensures leakage-free evaluation through quantile-stratified splits, fold-local training, and model validation using compact learners (e.g., PCA–LDA, NCA–LDA). The explanation plane conducts fold-local SHAP analysis on held-out test data and aggregates results across folds to derive global feature importance and E/S/G contribution profiles.

Data model and notation: Let X \in \mathbb{R}^{n \times p} denote the ESG feature matrix with variables partitioned into silos g \in \{E, S, G\} and index sets I_{g}. All learned transforms used for prediction are re-fit inside training folds; the single global pass outside cross-validation (CV) is the label-free pruning that defines the feature universe.

3.2.1. Plane I: Discovery

Hygiene and pruning: ESG indicators exhibit plateaus, heterogeneous scales, and redundancy within silos. We first remove near-constants by dropping columns with fewer than u = 3 distinct values. To reduce redundancy without imposing a parametric form, we apply within-silo rank-based de-duplication: for any pair (a, b) with absolute Spearman correlation |ρ_{s}(a, b)| \geq τ (with τ = 0.92), we retain the member with the larger robust spread, measured by mean absolute deviation (MAD), and drop the other. Ties are resolved deterministically by column order. This step is label-free and runs once to produce X_{clean} and an updated silo map; it lowers downstream variance and stabilizes explanations while preserving weak but distinct signals. The threshold τ = 0.92 is set conservatively to remove only near-duplicate indicators (pairs sharing more than 84.6% of rank variance, since 0.92² ≈ 0.846) while preserving all moderately correlated indicators that may carry distinct information. In the feature selection literature, correlation thresholds for redundancy removal typically range from 0.7 to 0.95. The conservative choice of τ = 0.92 errs on the side of retention, and the downstream three-plane feature selection further ensures that only the most informative indicators are selected for the final evaluation.
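The within-silo de-duplication rule can be sketched in plain Python. The function names and the toy silo are hypothetical, and the tie-breaking direction (keeping the earlier column when the two spreads are equal) is an assumption about how "column order" is applied. Near-constant columns are assumed to have been removed already, so the rank variances are non-zero.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mean_abs_dev(x):
    m = sum(x) / len(x)
    return sum(abs(a - m) for a in x) / len(x)


def dedup_silo(columns, tau=0.92):
    """columns: dict name -> list of values (insertion order = column order)."""
    names = list(columns)
    dropped = set()
    for a_idx, a in enumerate(names):
        for b in names[a_idx + 1:]:
            if a in dropped or b in dropped:
                continue
            # Spearman correlation = Pearson correlation of the ranks.
            rho = pearson(average_ranks(columns[a]), average_ranks(columns[b]))
            if abs(rho) >= tau:
                # Keep the larger spread; on equal spread keep the earlier column.
                keep_a = mean_abs_dev(columns[a]) >= mean_abs_dev(columns[b])
                dropped.add(b if keep_a else a)
    return [name for name in names if name not in dropped]


silo = {
    "co2": [1, 2, 3, 4, 5],
    "co2_scaled": [10, 20, 30, 40, 50],   # rank-identical to "co2"
    "water": [3, 1, 4, 1, 5],
}
print(dedup_silo(silo))  # ['co2_scaled', 'water']
```

In this toy silo, the two emission columns are rank-identical, so only the one with the larger mean absolute deviation survives, while the weakly correlated water column is retained.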

Balanced multi-view embedding: To prevent any silo from dominating representation learning, we build a two-stage, label-free multi-view embedding. First, within each silo, we Z-standardize X_{g} and fit a “small” PCA retaining

d_{g} = \min(d_{per-view}, \operatorname{rank}(X_{g}), |I_{g}|)

(1)

components, with d_{per-view} = 2 by default, yielding scores H_{g} \in \mathbb{R}^{n \times d_{g}}. We concatenate the three score matrices, H = [H_{E} H_{S} H_{G}], re-standardize columns to equalize variance across views, and apply a second PCA to obtain a consensus latent space Z \in \mathbb{R}^{n \times k} with k \in [6, 10] capped by \min(d_{Σ}, n - 1). This sequential compression (per silo, then global) mitigates collinearity and aligns cross-silo covariation while remaining label-agnostic. If a silo becomes empty post-pruning, the procedure continues with the remaining views.

Evidence accumulation with Auto-K: We estimate stable regimes by aggregating assignments across algorithms, resolutions, and perturbations in Z. For each resolution K \in \{2, 3, 4\} and each draw, we subsample rows with fraction ϕ \in (0.7, 0.8), bag a subset of features with a small absolute floor (≥12), and run a portfolio of base clustering algorithms (k-means, GMM with covariance regularization fallbacks, agglomerative clustering, and spectral clustering with a k-NN graph using n_{b} \approx \sqrt{n} and clamping). Each run updates a co-association matrix; averaging across runs yields \overline{C} \in [0, 1]^{n \times n}, the empirical probability that pairs co-cluster under perturbations. We select the resolution by minimizing the Proportion of Ambiguous Clustering (PAC):

\mathrm{PAC}(\overline{C}; l, h) = \frac{1}{n (n - 1)} \sum_{i \neq j} \mathbf{1}\{l < \overline{C}_{i j} < h\}

(2)

with band (l, h) = (0.1, 0.9). We then cluster the dissimilarity D = 1 - \overline{C} (average linkage) to obtain labels and apply a conservative minimum-class-size repair: while any class has size < m_{min}, the smallest class is merged into its nearest-centroid neighbor in Z, preserving at least one non-trivial split. The outputs of discovery are \overline{C}, the chosen K^{*}, and labels suitable for descriptive analysis and feasibility diagnostics.
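Equation (2) and the Auto-K rule can be sketched directly from their definitions. The function names `pac` and `select_resolution` and the two toy co-association matrices are hypothetical; in the actual workflow, one averaged matrix per resolution K would come from the perturbed clustering runs.

```python
def pac(cbar, low=0.1, high=0.9):
    """Proportion of Ambiguous Clustering (Eq. 2) for a co-association matrix."""
    n = len(cbar)
    ambiguous = sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and low < cbar[i][j] < high
    )
    return ambiguous / (n * (n - 1))


def select_resolution(cbar_by_k):
    """Return the K whose averaged co-association matrix has the lowest PAC."""
    return min(cbar_by_k, key=lambda k: pac(cbar_by_k[k]))


# Toy matrices: "crisp" pairs always or never co-cluster; "fuzzy" pairs do so half the time.
crisp = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
fuzzy = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
print(pac(crisp), pac(fuzzy))                    # 0.0 1.0
print(select_resolution({2: crisp, 3: fuzzy}))   # 2
```

A PAC of 0 means every pair of firms is assigned consistently across perturbations, whereas a PAC of 1 means every pair falls in the ambiguous band.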

3.2.2. Plane II: Prediction

Evaluation guard: Unsupervised discovery can collapse to a single class in small-

n

settings, which breaks stratified CV if used naively. We therefore construct label-free base splits stratified by the dominant unsupervised axis: standardize

X

, compute PC1, bin into

q

equal-frequency bands, and sample train/test folds stratified by these bins. For each candidate split, we perform train-only label induction (below), assign test labels via a fixed mapping (nearest centroid in the train-fit latent space), and keep the split only if both partitions contain at least two classes. If the number of feasible splits falls short of a target budget, a top-up sampler repeats this procedure until the budget is met. The guard preserves test informativeness without using any test statistic to induce labels or to tune transforms.

Nested evaluation per split: For each feasible split

(I_{train}, I_{test})

, we standardize on

X_{train}

, fit a PCA (≤10 components) to obtain

Z_{train}

, and map

X_{test}

using the train-fit scaler and loadings to obtain

Z_{test}

. We induce labels on

Z_{train}

via the consensus procedure using a compact

K

grid; if it collapses, we fall back to agglomerative

K = 2

, and if needed, to a balanced PC1 split, guaranteeing two groups without fabricating test labels. We compute class centroids in

Z_{train}

and assign each test point in

Z_{test}

to its nearest centroid, which fixes test labels by a rule determined entirely on training data. With labels fixed, we train candidate pipelines on

(X_{train}, y_{train})

and score on

(X_{test}, y_{test})

. Pipelines comprise dimensionality reduction plus classifier variants suited to small-

n

regimes (e.g., PCA-LDA, NCA-LDA, and simple baseline 1-NN). We report accuracy, balanced accuracy, macro-F1, and area under the curve (AUC) where applicable. Throughout, no statistic from the test set participates in discovery, label induction, feature screening, model selection, or attribution.
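The fixed train-to-test label mapping and the feasibility guard can be sketched as below. This is an illustrative sketch only: the function names, the two-dimensional latent coordinates, and the labels are hypothetical stand-ins for the train-fit PCA scores and the labels induced on the training fold.

```python
def centroids(points, labels):
    """Mean vector of the training points in each class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for d, v in enumerate(p):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def assign_nearest(points, cents):
    """Label each point by its closest centroid (squared Euclidean distance)."""
    def dist2(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [min(cents, key=lambda y: dist2(p, cents[y])) for p in points]


z_train = [[0.0, 0.0], [0.2, 0.1], [3.0, 3.1], [2.9, 3.0]]
y_train = [0, 0, 1, 1]            # labels induced on the training fold only
z_test = [[0.1, -0.1], [3.2, 2.8]]

cents = centroids(z_train, y_train)       # learned from training data alone
y_test = assign_nearest(z_test, cents)    # rule applied unchanged to test points
print(y_test)  # [0, 1]

# Feasibility guard: keep the split only if both partitions have >= 2 classes.
feasible = len(set(y_train)) >= 2 and len(set(y_test)) >= 2
print(feasible)  # True
```

Because the centroids depend only on training rows, the test labels are a deterministic function of a train-fit rule, which is the property the evaluation guard relies on.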

3.2.3. Plane III: Explanation

Stability-balanced feature selection: Explanations are produced fold-locally with strict hygiene. Within each training fold and within each silo, we repeatedly subsample rows (~70%) over

R

iterations and fit group-sparse models (group lasso when groupings are available; otherwise per-silo

L_{1}

logistic with lightweight inner tuning). We compute per-feature selection frequencies with Wilson intervals and lock

m

stable “anchor” features per silo above a conservative threshold (e.g.,

p \geq 0.10

). From the remaining pool, we expand to exactly five features per silo using an ensemble of mRMR runs that maximize mutual information with the induced labels while penalizing redundancy by

∣ ρ ∣

. Concatenating lists yields a top-15 (5E/5S/5G) per fold.
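The Wilson interval used for the selection frequencies can be computed as follows. The function name, the number of resampling iterations (100), and the rule of comparing the lower bound with the 0.10 threshold are assumptions made for illustration; the text does not state which bound is compared against the threshold.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials ** 2)
    ) / denom
    return center - half, center + half


# A feature selected in 30 of 100 subsampled fits (hypothetical counts).
lower, upper = wilson_interval(30, 100)
print(round(lower, 3), round(upper, 3))

# Assumed anchor rule: the lower bound must clear the conservative threshold.
is_anchor = lower >= 0.10
print(is_anchor)
```

Unlike the naive frequency, the Wilson lower bound shrinks toward zero when the number of resamples is small, which makes the anchor rule more conservative in short runs.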

Held-out attribution: We fit a compact, transparent model (e.g., RF) on training data restricted to the top-15 and compute SHAP values only on held-out test points. We aggregate mean absolute SHAP per feature across splits and summarize at the silo level, producing leakage-free explanations that are directly comparable across folds and balanced across E/S/G.

3.2.4. Defaults, Complexity, and Reproducibility

Hyperparameters are conservative and specified for replication: u = 3, τ = 0.92, d_{per-view} = 2, k \in [6, 10] capped by \min(d_{Σ}, n - 1), K \in \{2, 3, 4\}, ϕ \in (0.7, 0.8), feature-bag floor ≥12, PAC band (0.1, 0.9), PC1-quantile stratification for base splits, and deterministic tie-breaking by column order. All random procedures use fixed seeds per split. Pruning costs O(\sum_{g} |I_{g}|^{2}) rank correlations; PCA steps scale with O(n p \min(n, p)) but are bounded by small d_{g} and k; consensus draws are parallelizable; stability selection is linear in R and sample size. Every transform used in evaluation is re-fit inside training folds to eliminate leakage, and every mapping from train to test is fixed (nearest centroid) before any model sees test labels or features.

This ESG feature selection provides: (i) stability through evidence accumulation and PAC-based Auto-K with conservative minimum-size repair; (ii) validity via strict separation of discovery, label induction, mapping, training, and testing, supported by a label-free evaluation guard; and (iii) interpretability via fold-local, silo-balanced stability selection and test-only SHAP. The workflow, comprising per-silo then global multi-view embedding, consensus clustering with ambiguity minimization, a PC1-stratified guard with fixed nearest-centroid assignment, and top-15 balanced explanations, yields a coherent pipeline that is replicable, leak-free, and well-conditioned for small-n ESG analytics.

To summarize the anti-leakage safeguards embedded in the proposed feature selection workflow: (a) the initial global pruning pass, comprising near-constant removal and within-silo Spearman correlation de-duplication, is entirely label-free and does not involve any target, outcome, or clustering-derived labels; it defines only the feature universe available for subsequent selection; (b) label induction via consensus clustering is performed exclusively within training folds during the prediction plane, and labels are never induced on or using test-fold data; (c) test-fold label assignment uses a fixed nearest-centroid mapping learned from training-fold data, ensuring that no test-set statistic participates in label generation or model selection; (d) all feature transformations, including standardization, PCA embedding, and NCA projection, are fitted on training-fold data and applied to test-fold data using the training-fit parameters; (e) stability selection and mRMR-based feature expansion are conducted within training folds only; and (f) SHAP attributions are computed exclusively on held-out test points using models trained on training-fold data. The workflow does not use any externally provided labels (such as commercial ESG ratings or financial performance measures) at any stage; the evaluation labels are entirely unsupervised, derived from consensus clustering, and induced fresh within each training fold.

3.3. Determination of Indicator Weights

PCA and CRITIC capture fundamentally different (and complementary) facets of indicator importance, and their hybridization addresses a specific challenge inherent in ESG data. PCA derives weights from the variance structure of the dataset through eigenvalue decomposition, assigning higher weights to indicators that contribute more strongly to the principal components capturing the greatest data variability. However, PCA-based weights do not explicitly penalize inter-criteria redundancy: two highly correlated indicators within the same ESG silo can each receive high PCA-weights despite providing largely overlapping information. This is particularly problematic in ESG datasets, where strong within-silo correlations are prevalent. CRITIC, by contrast, jointly considers both the standard deviation (discriminatory power) and the pairwise correlations (redundancy) of criteria through the information content measure. CRITIC thus penalizes redundancy by assigning lower weights to indicators that are highly correlated with others, which is a property that PCA-weights lack. However, CRITIC operates on pairwise correlations and standard deviations of the normalized data and does not capture the multivariate covariance structure that PCA models through its principal components. By multiplicatively combining PCA and CRITIC-weights, the hybrid approach ensures that an indicator receives a high final weight only if it satisfies both conditions: (i) it contributes substantially to the overall variance of the dataset (as captured by PCA) and (ii) it provides informationally distinct content with high discriminatory power (as captured by CRITIC). Neither PCA nor CRITIC alone can simultaneously guarantee both properties. The quantitative comparison of rankings produced by PCA-only, CRITIC-only, equal-weight, and the proposed hybrid PCA-CRITIC weighting methods is presented in Section 4.2, demonstrating the robustness and stability of the proposed approach.

3.3.1. Principal Component Analysis (PCA) for Weighting

PCA is an unsupervised ML technique widely used for dimensionality reduction and feature extraction. It transforms a set of potentially correlated features (i.e., criteria or indicators) into a smaller number of uncorrelated principal components that capture the maximum possible variance in the dataset. Specifically, the first principal component accounts for the largest share of the total variance, followed by subsequent components with progressively smaller shares.

When applied for determining criteria weights in an alternative-criteria matrix (ACM), PCA deems that criteria contributing more strongly to the overall variance of the dataset contain greater informational value and should therefore receive larger weights. By transforming the original criteria into a set of orthogonal principal components and analyzing their corresponding eigenvalues and correlations (also known as loadings in this context), PCA provides an objective, data-driven basis for calculating criterion weights, without relying on subjective judgment.

Step 1: Standardize the ACM to eliminate the influence of differing measurement scales across criteria. For each criterion j \in \{1, 2, \dots, n\}, the mean μ_{j} and standard deviation σ_{j} are first computed. The standardized value Z_{i j} is then obtained by applying the Z-score normalization, ensuring that all criteria have zero mean and unit standard deviation. Following MCDM conventions, the row index i \in \{1, 2, \dots, m\} denotes alternatives and the column index j \in \{1, 2, \dots, n\} denotes criteria in all the steps of PCA; note that n here is the number of criteria, not the number of firms as in Section 3.2.

μ_{j} = \frac{1}{m} \sum_{i = 1}^{m} f_{i j}

(3)

σ_{j} = \sqrt{\frac{1}{m} \sum_{i = 1}^{m} {(f_{i j} - μ_{j})}^{2}}

(4)

Z_{i j} = \frac{f_{i j} - μ_{j}}{σ_{j}}

(5)

After standard scaling is performed, the resulting standardized ACM is denoted as

Z \in R^{m \times n}

.

Step 2: Compute the covariance matrix

R

of the standardized matrix

Z

using the following equation, which quantifies the pairwise linear relationships among the criteria.

R = \frac{1}{m - 1} Z^{⊤} Z, \quad \text{where } R \in \mathbb{R}^{n \times n}

(6)

Step 3: Perform eigen-decomposition on the covariance matrix

R

to obtain the eigenvalues and corresponding eigenvectors.

R v_{j} = λ_{j} v_{j}

(7)

where v_{j} is the j^{th} eigenvector (i.e., principal component), a column vector with v_{j} \in \mathbb{R}^{n \times 1}, and λ_{j} is the associated eigenvalue, with the eigenvalues ordered such that λ_{1} \geq λ_{2} \geq \dots \geq λ_{n}.

Next, the explained variance ratio of the j^{th} principal component, β_{j}, is calculated as follows:

β_{j} = \frac{λ_{j}}{\sum_{k = 1}^{n} λ_{k}}

(8)

Step 4: Determine the number of principal components to be employed and project the standardized ACM,

Z

, onto these principal components.

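Pick the smallest α such that the cumulative explained variance ratio of the first α principal components is greater than or equal to the threshold value τ (by default, τ = 0.8; this τ is distinct from the de-duplication threshold in Section 3.2.1).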

\sum_{j = 1}^{α} β_{j} \geq τ

(9)

Then, the selected principal components form the matrix

V_{α}

as follows:

V_{α} = [v_{1}, v_{2}, \dots, v_{α}], \quad \text{where } V_{α} \in \mathbb{R}^{n \times α}

(10)

Project

Z

onto

V_{α}

to obtain the principal component score matrix (

S

):

S = Z V_{α}, \quad \text{where } S \in \mathbb{R}^{m \times α}

(11)

Note that since the projection is a linear transformation of the mean-centered matrix Z, it does not affect the data centering; hence, each column of S still has a mean of zero.

Step 5: Compute the correlation between the j^{th} criterion (i.e., the j^{th} column in Z) and the k^{th} principal component score vector (i.e., the k^{th} column in S):

ρ_{j k} = \frac{\mathrm{Cov}(Z_{\cdot j}, S_{\cdot k})}{σ(Z_{\cdot j}) σ(S_{\cdot k})} = \frac{\mathrm{Cov}(Z_{\cdot j}, S_{\cdot k})}{\sqrt{λ_{k}}}, \quad j \in \{1, 2, \dots, n\} \text{ and } k \in \{1, 2, \dots, α\}

(12)

where the covariance between

Z

and

S

can be computed as follows:

\mathrm{Cov}(Z, S) = \frac{1}{m - 1} Z^{⊤} S, \quad \mathrm{Cov}(Z, S) \in \mathbb{R}^{n \times α}

(13)

It is noteworthy that σ(Z_{\cdot j}) is equal to 1 because all the columns in Z are already standardized (i.e., a mean of zero and a standard deviation of one). Meanwhile, σ(S_{\cdot k}) is simply the square root of the k^{th} eigenvalue λ_{k}, since λ_{k} represents the variance explained by the k^{th} principal component.

In essence, ρ_{j k} tells us how much the j^{th} criterion loads onto (i.e., contributes to) the k^{th} principal component.

Step 6: Calculate the PCA-weight for each criterion (w_{j, PCA}) as follows:

h_{j}^{2} = \sum_{k = 1}^{α} ρ_{j k}^{2}

(14)

where h_{j}^{2} (also known as the communality) reflects how much the j^{th} criterion collectively loads onto all the retained principal components.

w_{j, PCA} = \frac{h_{j}^{2}}{\sum_{l = 1}^{n} h_{l}^{2}}

(15)

In essence, the PCA-weight of a criterion is determined by how strongly it is correlated with the principal components that capture the major variance structure of the ACM. A large h_{j}^{2} value indicates that the j^{th} criterion is highly aligned with the retained principal components, so it obtains a larger weight. Conversely, a criterion that has weak correlations across the principal components (i.e., a small h_{j}^{2} value) is assigned a smaller weight.

3.3.2. Criteria Importance Through Inter-Criteria Correlation (CRITIC) for Weighting

Diakoulaki et al. [65] first introduced the CRITIC method, which determines weights of criteria by collectively considering the variability of each criterion (via its standard deviation) and the degree of redundancy (or dependency) between criteria, measured through their pairwise correlations. The underlying principle of CRITIC is that a criterion would receive a higher weight when it exhibits both greater dispersion (i.e., spread) across alternatives and low redundancy with respect to other criteria. The method comprises the following three key steps.

Step 1: Normalize the ACM with m rows (alternatives) and n columns (criteria) using Min-Max normalization. As in the PCA steps, the row and column indices (i.e., i \in \{1, 2, \dots, m\} and j \in \{1, 2, \dots, n\}, respectively) apply throughout all steps of the CRITIC method.

F_{i j} = \frac{f_{i j} - \min_{k \in \{1, 2, \dots, m\}} f_{k j}}{\max_{k \in \{1, 2, \dots, m\}} f_{k j} - \min_{k \in \{1, 2, \dots, m\}} f_{k j}}, \quad \text{for a maximization criterion}

(16)

F_{i j} = \frac{\max_{k \in \{1, 2, \dots, m\}} f_{k j} - f_{i j}}{\max_{k \in \{1, 2, \dots, m\}} f_{k j} - \min_{k \in \{1, 2, \dots, m\}} f_{k j}}, \quad \text{for a minimization criterion}

(17)

Step 2: Compute the pairwise correlation matrix by evaluating the degree of linear dependency between every two criteria.

ρ_{j k} = \frac{\sum_{i = 1}^{m} (F_{i j} - \bar{F}_{j}) (F_{i k} - \bar{F}_{k})}{\sqrt{\sum_{i = 1}^{m} (F_{i j} - \bar{F}_{j})^{2}} \sqrt{\sum_{i = 1}^{m} (F_{i k} - \bar{F}_{k})^{2}}}

(18)

where \bar{F}_{j} = \frac{1}{m} \sum_{i = 1}^{m} F_{i j} and \bar{F}_{k} = \frac{1}{m} \sum_{i = 1}^{m} F_{i k} in this step.

The correlation coefficient ρ_{j k} takes a value of +1 when j = k, and lies within the interval [-1, +1] when j \neq k. The resulting correlation matrix is symmetric [66], that is, ρ_{j k} = ρ_{k j}, with the diagonal elements all equal to +1. A value of ρ_{j k} = +1 indicates perfect positive linear correlation, meaning the two criteria vary in exactly the same direction. A value of ρ_{j k} = 0 implies no linear relationship between the two criteria, although it does not by itself establish that they are statistically independent. At the other end of the spectrum, ρ_{j k} = -1 denotes perfect negative linear correlation, meaning one criterion increases exactly as the other decreases, reflecting complete inverse dependence.

Step 3: Calculate the standard deviation σ_{j} for each criterion, aggregate the standard deviation and the correlations to form c_{j}, and derive the CRITIC-weights, w_{j, CRITIC}, as follows.

σ_{j} = \sqrt{\frac{1}{m} \sum_{i = 1}^{m} {(F_{i j} - {\bar{F}}_{j})}^{2}}

(19)

c_{j} = σ_{j} \sum_{k = 1}^{n} (1 - ρ_{j k})

(20)

w_{j, CRITIC} = \frac{c_{j}}{\sum_{k = 1}^{n} c_{k}}

(21)

As can be seen, the CRITIC-weight assigned to each criterion is jointly determined by its standard deviation and its correlations with the other criteria.

The standard deviation σ_{j} reflects the discriminatory power of a criterion. A higher σ_{j} indicates that the criterion shows larger differences among the alternatives, meaning it can better differentiate them (that is, it indicates more clearly which alternative is better and which is worse), and it therefore receives a larger weight. In contrast, a lower σ_{j} (e.g., in the extreme case where all alternatives have identical values under criterion j, and hence σ_{j} = 0) provides little discriminatory power, and the criterion thus receives a smaller weight.

The correlation term ρ_{j k}, on the other hand, measures the degree of redundancy between criteria. A criterion that is highly positively correlated with many others provides information that is already present elsewhere, so its weight is penalized for this redundancy. On the contrary, a criterion that is weakly correlated (or even negatively correlated) with others is considered to offer new, distinct (or contrasting) information, and is therefore rewarded with a larger weight.
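The three CRITIC steps can be sketched in plain Python. The function name `critic_weights`, the `benefit` flags, and the toy matrix are hypothetical; the sketch assumes that no criterion is constant, since the Min-Max denominator would otherwise be zero.

```python
import math


def critic_weights(acm, benefit):
    """CRITIC weights (Eqs. 16-21); benefit[j] is True for maximization criteria."""
    m, n = len(acm), len(acm[0])
    cols = []
    for j in range(n):
        col = [row[j] for row in acm]
        lo, hi = min(col), max(col)
        if benefit[j]:
            cols.append([(v - lo) / (hi - lo) for v in col])    # Eq. (16)
        else:
            cols.append([(hi - v) / (hi - lo) for v in col])    # Eq. (17)
    means = [sum(c) / m for c in cols]
    sigma = [
        math.sqrt(sum((v - mu) ** 2 for v in c) / m)            # Eq. (19)
        for c, mu in zip(cols, means)
    ]

    def corr(j, k):                                              # Eq. (18)
        num = sum(
            (a - means[j]) * (b - means[k]) for a, b in zip(cols[j], cols[k])
        )
        return num / (m * sigma[j] * sigma[k])

    c = [
        sigma[j] * sum(1 - corr(j, k) for k in range(n))         # Eq. (20)
        for j in range(n)
    ]
    total = sum(c)
    return [cj / total for cj in c]                              # Eq. (21)


# Criteria 0 and 1 are perfectly correlated; criterion 2 carries distinct information.
acm = [[1, 10, 5], [2, 20, 3], [3, 30, 4], [4, 40, 1]]
w = critic_weights(acm, benefit=[True, True, True])
print([round(x, 3) for x in w])
```

In this toy example, the two redundant criteria share the same, smaller weight, while the third criterion, which is negatively correlated with both, receives the largest weight.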

3.3.3. Hybrid PCA-CRITIC for Weighting

To exploit both the variance-based properties of PCA and the deviation-redundancy structure captured by CRITIC, the two sets of weights are hybridized using a multiplicative synthesis, producing the final weight of each criterion, w_{j}, as follows.

w_{j} = \frac{w_{j, PCA} \cdot w_{j, CRITIC}}{\sum_{k = 1}^{n} (w_{k, PCA} \cdot w_{k, CRITIC})}

(22)
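A small worked example makes the effect of Eq. (22) concrete. The function name and the two weight vectors below are hypothetical values chosen for illustration, not results from the case study.

```python
def hybrid_weights(w_pca, w_critic):
    """Multiplicative PCA-CRITIC synthesis followed by renormalization (Eq. 22)."""
    products = [a * b for a, b in zip(w_pca, w_critic)]
    total = sum(products)
    return [p / total for p in products]


w_pca = [0.5, 0.3, 0.2]      # hypothetical PCA-weights
w_critic = [0.2, 0.3, 0.5]   # hypothetical CRITIC-weights
w = hybrid_weights(w_pca, w_critic)
print([round(x, 3) for x in w])  # [0.345, 0.31, 0.345]
```

The first criterion is favored by PCA but penalized by CRITIC, and the third shows the opposite pattern, so both end with the same hybrid weight; a criterion obtains a dominant final weight only when both methods rate it highly.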

3.4. Multi-Criteria Decision-Making (MCDM) for Ranking

Technique for Order Preference by Similarity to Ideal Solution (TOPSIS)

The TOPSIS method, first introduced by Hwang and Yoon [67], is a well-established MCDM method used to rank alternatives evaluated under multiple, and often competing, criteria. It is among the most extensively applied MCDM techniques in the literature [68]. The fundamental idea behind TOPSIS is that the top-ranked alternative should have the shortest distance from the positive ideal solution (PIS), and concurrently the greatest distance from the negative ideal solution (NIS) [69]. The PIS corresponds to the most desirable performance level for each criterion (i.e., largest values for maximization criteria and smallest values for minimization criteria), whereas the NIS consists of the least desirable performance levels (i.e., smallest values for maximization criteria and largest values for minimization criteria), as the following Step 3 shows.

The essential TOPSIS steps are succinctly outlined as follows; readers can refer to Hwang and Yoon [67] and Wang and Rangaiah [70] for more methodological details and comparative discussions if interested.

Step 1: Normalize the original ACM with m rows (i.e., alternatives) and n columns (i.e., criteria) using vector normalization. The row and column indices satisfy i \in \{1, 2, \dots, m\} and j \in \{1, 2, \dots, n\} in all steps of TOPSIS.

F_{i j} = \frac{f_{i j}}{\sqrt{\sum_{k = 1}^{m} f_{k j}^{2}}}

(23)

where f_{i j} and f_{k j} denote the values of the i^{th} and k^{th} alternatives under the j^{th} criterion in the original ACM, and F_{i j} represents the corresponding normalized value in the normalized ACM.

Step 2: Generate the weighted normalized ACM by multiplying each normalized criterion value

F_{i j}

by its assigned weight

w_{j}

. In this work, the weights are obtained using the aforementioned hybrid PCA-CRITIC weighting approach.

v_{i j} = F_{i j} \times w_{j}

(24)

Step 3: Using the weighted normalized ACM, identify the PIS (i.e., A^{+}) and NIS (i.e., A^{-}) as follows.

\begin{aligned} A^{+} &= \{(\max_{i} v_{i j} \mid j \in J_{max}), (\min_{i} v_{i j} \mid j \in J_{min})\} \\ &= \{v_{1}^{+}, v_{2}^{+}, \dots, v_{j}^{+}, \dots, v_{n}^{+}\} \end{aligned}

(25)

\begin{aligned} A^{-} &= \{(\min_{i} v_{i j} \mid j \in J_{max}), (\max_{i} v_{i j} \mid j \in J_{min})\} \\ &= \{v_{1}^{-}, v_{2}^{-}, \dots, v_{j}^{-}, \dots, v_{n}^{-}\} \end{aligned}

(26)

where J_{max} denotes the set of criteria for maximization, and J_{min} represents the set of criteria for minimization.

Step 4: Determine the Euclidean distances of each alternative from the PIS and NIS, denoted as S_{i +} and S_{i -}, respectively.

S_{i +} = \sqrt{\sum_{j = 1}^{n} {(v_{i j} - v_{j}^{+})}^{2}}

(27)

S_{i -} = \sqrt{\sum_{j = 1}^{n} {(v_{i j} - v_{j}^{-})}^{2}}

(28)

Step 5: Compute the performance score (P_{i}) for each alternative; the alternative with the largest P_{i} is ranked first and considered the most preferred option.

P_{i} = \frac{S_{i -}}{S_{i -} + S_{i +}}

(29)

where S_{i -} \geq 0, S_{i +} \geq 0, and P_{i} \in [0, 1]. There are two extreme cases: when S_{i +} = 0, the i^{th} alternative lies exactly on the PIS, resulting in the highest performance score P_{i} = 1; conversely, when S_{i -} = 0, the i^{th} alternative coincides with the NIS, yielding the lowest performance score P_{i} = 0. The surface plot shown in Figure 3 illustrates how P_{i} varies as a function of S_{i +} and S_{i -}. As observed, P_{i} increases when the alternative becomes closer to the PIS (i.e., smaller S_{i +}) and simultaneously farther from the NIS (i.e., larger S_{i -}). The surface also highlights the non-linear nature of this relationship: high performance scores are typically achieved only when these two conditions are satisfied together, reinforcing the dual-objective structure of the TOPSIS ranking principle [71].
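The five TOPSIS steps can be sketched compactly in plain Python. The function name `topsis`, the `benefit` flags, and the three-firm toy matrix are hypothetical; the sketch assumes that the alternatives are not all identical, since Eq. (29) would otherwise divide by zero.

```python
import math


def topsis(acm, weights, benefit):
    """TOPSIS performance scores (Eqs. 23-29); benefit[j] is True for maximization."""
    m, n = len(acm), len(acm[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in acm)) for j in range(n)]
    # Eqs. (23)-(24): vector normalization followed by weighting.
    v = [[row[j] / norms[j] * weights[j] for j in range(n)] for row in acm]
    cols = [[v[i][j] for i in range(m)] for j in range(n)]
    pis = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]  # Eq. (25)
    nis = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]  # Eq. (26)
    scores = []
    for i in range(m):
        s_plus = math.sqrt(sum((v[i][j] - pis[j]) ** 2 for j in range(n)))   # Eq. (27)
        s_minus = math.sqrt(sum((v[i][j] - nis[j]) ** 2 for j in range(n)))  # Eq. (28)
        scores.append(s_minus / (s_minus + s_plus))                          # Eq. (29)
    return scores


# Criterion 0 is maximized and criterion 1 is minimized (toy values).
acm = [[9, 1], [5, 5], [1, 9]]
scores = topsis(acm, weights=[0.5, 0.5], benefit=[True, False])
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)  # [1.0, 0.5, 0.0] [0, 1, 2]
```

The first firm coincides with the PIS and the third with the NIS, reproducing the two extreme cases P_{i} = 1 and P_{i} = 0 described above, while the middle firm is equidistant from both ideals.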

The selection of each technique in the proposed end-to-end ML framework for multi-criteria ESG evaluation is driven by its specific suitability for the corresponding pipeline stage. MissForest is chosen for data imputation because it is a non-parametric, iterative algorithm based on random forests that handles mixed-type data (both continuous and categorical), captures non-linear interactions, and operates without distributional assumptions; these properties are particularly well-suited for ESG data comprising heterogeneous metrics such as emissions in tCO₂e and board member counts. For feature selection, a clustering ensemble via evidence accumulation is employed because single clustering algorithms are inherently unstable in the small-n, high-p settings typical of ESG datasets; evidence accumulation aggregates multiple base partitions (varying algorithms, resolutions, random seeds, and feature subsets) into a co-association matrix, substantially reducing variance and yielding more robust cluster assignments, with PAC-based Auto-K selection further removing the need for subjective specification of the number of clusters. PCA is selected for indicator weighting because it provides a fully data-driven, unsupervised basis for computing weights through eigenvalue decomposition, requiring no expert input and ensuring full objectivity and reproducibility. CRITIC complements PCA by combining two dimensions of indicator importance, namely, discriminatory power (standard deviation) and information uniqueness (inter-criteria correlation), thereby penalizing redundancy among highly correlated within-silo indicators, a property that PCA-weights lack. Finally, TOPSIS is selected for ranking because it evaluates alternatives based on simultaneous proximity to a positive ideal solution and distance from a negative ideal solution, handles mixed optimization directions (benefit and cost criteria), and is computationally scalable and widely validated in the MCDM literature.

4. Application

To demonstrate the practical utility of the proposed holistic ML pipeline for multi-criteria evaluation of corporate ESG performance, a case study is conducted in this section, using real and recent ESG data from the year 2024. The dataset encompasses 57 listed Chinese pharmaceutical and biotechnology companies and includes 70 ESG indicators. This case study illustrates how the proposed pipeline organically integrates data preprocessing, feature selection, weighting, and ranking analysis to objectively assess and compare the ESG performance of firms within the same industry context. The English names of the 57 listed Chinese pharmaceutical and biotechnology companies are presented in Table 2, and their corresponding Chinese names are provided in the Supplementary Materials.

4.1. Detailed Calculations

The initial inspection of the dataset indicates a few instances of duplicate records originating from different sources for the same company. To address these duplications, a systematic screening process is conducted to identify and accurately remove redundant entries. Records are matched based on company name and ESG indicators. When duplicate records are identified, one is retained, and the others are removed. Such duplication primarily arises from the same parameter being captured by multiple data sources. In cases where duplication occurs between the same indicator obtained from ESG or annual reports and other external sources, preference is given to values reported in the ESG or annual reports, as these sources are generally considered more reliable. When both ESG and annual reports disclose the same indicator for a company (which is environmental-related data in this case), the value from the ESG report is selected, given that ESG reports typically provide more comprehensive coverage of environmental information. Additionally, all measurement units are verified to ensure consistency across the dataset. Common or redundant fields, such as disclosure year, currency, and text-based descriptors, are removed to streamline the dataset. After cleaning, the number of valid data points is reduced to 3090. The cleaned dataset is then structured into a 57 × 70 matrix to facilitate the identification of missing values and the subsequent imputation step using the MissForest algorithm.

To ensure the stability of the model, indicators with a missing rate greater than 90% are removed, since the uncertainty arising from imputing such variables would surpass their potential informational contribution. The remaining indicators are then processed using the MissForest algorithm to impute missing values. As mentioned above, MissForest is a non-parametric and iterative imputation algorithm based on RF, which is particularly effective in handling datasets with mixed data types and non-linear relationships. It captures complex interactions among variables (ESG indicators in this case) and can model both continuous and categorical features without assuming any specific data distribution.

To validate the imputation performance, 5% of the observed data are randomly masked and treated as missing, and the imputation accuracy is evaluated on these masked entries. Results show that the overall R² reaches approximately 0.986, with an NRMSE value of 0.0085, indicating excellent imputation quality. These results indicate that MissForest effectively preserves the statistical properties and internal relationships within the ESG dataset and ensures reliable data completeness.
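The masking-based validation can be sketched as follows. Two points are assumptions rather than details confirmed by the text: NRMSE is normalized here by the range of the true masked values (other normalizations exist), and a simple column-mean imputer stands in for MissForest so that the sketch runs without external libraries.

```python
import random
from math import sqrt

def r_squared(truth, pred):
    """Coefficient of determination computed on the masked cells."""
    mean_t = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for t, p in zip(truth, pred))
    ss_tot = sum((t - mean_t) ** 2 for t in truth)
    return float("nan") if ss_tot == 0 else 1.0 - ss_res / ss_tot

def nrmse(truth, pred):
    """RMSE divided by the range of the true values (assumed normalization)."""
    rmse = sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))
    spread = max(truth) - min(truth)
    return float("nan") if spread == 0 else rmse / spread

def column_mean_impute(matrix):
    """Stand-in imputer (not MissForest): fill None with the column mean."""
    filled = [list(row) for row in matrix]
    for j in range(len(matrix[0])):
        seen = [row[j] for row in matrix if row[j] is not None]
        mean_j = sum(seen) / len(seen)
        for row in filled:
            if row[j] is None:
                row[j] = mean_j
    return filled

def mask_and_score(matrix, impute, frac=0.05, seed=0):
    """Hide a fraction of observed cells, impute them, and score the hidden cells."""
    rng = random.Random(seed)
    observed = [(i, j) for i, row in enumerate(matrix)
                for j, v in enumerate(row) if v is not None]
    hidden = rng.sample(observed, max(1, round(frac * len(observed))))
    masked = [list(row) for row in matrix]
    for i, j in hidden:
        masked[i][j] = None
    filled = impute(masked)
    truth = [matrix[i][j] for i, j in hidden]
    pred = [filled[i][j] for i, j in hidden]
    return r_squared(truth, pred), nrmse(truth, pred)

# Toy 8 x 3 matrix; a larger masking fraction is used because the matrix is tiny.
toy = [[float(i), float(2 * i + 1), float(i * i)] for i in range(8)]
r2, err = mask_and_score(toy, column_mean_impute, frac=0.25)
```

Any imputer with the same call signature, including a MissForest implementation, could be passed in place of `column_mean_impute`.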

After all the data preprocessing steps, the resulting 57 × 61 matrix (i.e., 57 companies, 61 ESG indicators) is partitioned into three canonical silos (E = 15, S = 25, G = 21). E variables capture emissions, energy, water, and resource efficiency; S variables cover workforce, welfare/insurance, and research and development (R&D) intensity; G variables encode board/ownership structure and turnover. Dispersion patterns differ by silo: E exhibits very high scale dispersion (average MAD ≈ $1.05 \times 10^{6}$), S is moderate (average MAD ≈ $4.5 \times 10^{3}$), and G is relatively discrete (median unique values = 16). Pairwise Spearman correlations reveal strong intra-silo redundancy (frequent $|\rho| > 0.9$) and weak cross-silo association, motivating label-free within-silo de-duplication and a multi-view treatment that prevents any single silo from dominating.

Prior to any learning, the study applies a single global, label-free pass that drops near-constant indicators (<3 distinct values) and de-duplicates within each silo by absolute Spearman correlation with threshold $\tau = 0.92$, retaining the more variable member (higher MAD) with deterministic tie-breaking. This defines the fixed selection universe used inside folds.
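A minimal sketch of this label-free pass for a single silo is given below. Several details are assumptions: MAD is taken to be the median absolute deviation, a pair counts as redundant when its absolute Spearman correlation is strictly above the threshold, and ties in MAD are broken alphabetically by feature name. The feature names and values are illustrative only.

```python
from statistics import median

def average_ranks(values):
    """Ranks starting at 1, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def mad(values):
    """Median absolute deviation (assumed meaning of MAD)."""
    m = median(values)
    return median(abs(v - m) for v in values)

def screen_silo(columns, tau=0.92, min_distinct=3):
    """Drop near-constant features, then greedily de-duplicate by |Spearman| > tau."""
    candidates = {n: c for n, c in columns.items() if len(set(c)) >= min_distinct}
    # Visit features from most to least variable so the higher-MAD member survives.
    ordered = sorted(candidates, key=lambda n: (-mad(candidates[n]), n))
    kept = []
    for name in ordered:
        if all(abs(spearman(candidates[name], candidates[k])) <= tau for k in kept):
            kept.append(name)
    return kept

silo_e = {
    "energy": [1, 2, 3, 4, 5, 6],
    "energy_x10": [10, 20, 30, 40, 50, 65],  # monotone in "energy", larger MAD
    "water": [3, 1, 6, 2, 5, 4],
    "flag": [0, 0, 0, 1, 1, 1],              # only two distinct values
}
kept = screen_silo(silo_e)
```

In this toy silo, `flag` is removed as near-constant and `energy` is removed as a rank-duplicate of the more variable `energy_x10`, leaving `energy_x10` and `water`.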

Selection is performed within each cross-validation split and within each silo. First, bootstrap-based stability selection (~70% row subsamples, $R$ repetitions) with group-sparse estimators (group lasso or $L_{1}$ logistic) identifies anchor features that persist across resamples. Anchors are retained using a conservative stability threshold (≥10%) with Wilson intervals. Second, an ensemble mRMR expands anchors to exactly five features per silo, maximizing mutual information with train-only labels while penalizing redundancy ($|\rho|$). This yields a fold-local top-15 that is balanced (5E/5S/5G) by construction, simplifying later index weighting and ensuring interpretability across silos.
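To make the anchor-retention step concrete, the sketch below computes a Wilson score interval for each feature's selection frequency across resamples. How exactly the interval enters the retention rule is not specified in the text; here, as an assumption, a feature is kept when its observed frequency meets the 10% threshold and the interval is reported alongside it. The selection counts and the number of repetitions are made up for illustration.

```python
from math import sqrt

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half

def select_anchors(counts, repetitions, threshold=0.10):
    """Return {feature: (frequency, (low, high))} for features meeting the threshold."""
    anchors = {}
    for name, hits in counts.items():
        freq = hits / repetitions
        if freq >= threshold:
            anchors[name] = (freq, wilson_interval(hits, repetitions))
    return anchors

# Hypothetical selection counts over 50 resamples.
counts = {"rnd_expense_growth_pct": 40, "env_tax": 3, "board_female_pct": 12}
anchors = select_anchors(counts, repetitions=50)
```

With these counts, `env_tax` (selected in 6% of resamples) falls below the threshold, while the other two features are retained with their intervals.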

To prevent invalid stratification when unsupervised discovery collapses, the study generates label-free, PC1-quantile-stratified splits from standardized inputs. We retain only those base splits that remain bi-class after train-only label induction (consensus with conservative fallbacks) and nearest-centroid assignment of test items in the train-fit latent space. Of the 20 candidate splits, six (i.e., 30%) are feasible; the remainder fail because of train imbalance or a single-class test set after assignment. All subsequent selection and verification use only these guard-validated folds.
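The split-and-guard logic can be sketched in simplified form. The sketch assumes that a one-dimensional score per company (standing in for the PC1 score) is already available, uses four quantile bins and a 25% test fraction, and requires at least two training items per class; all of these settings are hypothetical, and the label-induction step itself is not reproduced.

```python
import random

def quantile_stratified_split(scores, n_bins=4, test_frac=0.25, seed=0):
    """Split row indices so that each score quantile bin contributes to the test set."""
    rng = random.Random(seed)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bin_size = len(order) / n_bins
    train, test = [], []
    for b in range(n_bins):
        members = order[round(b * bin_size):round((b + 1) * bin_size)]
        rng.shuffle(members)
        n_test = max(1, round(test_frac * len(members)))
        test += members[:n_test]
        train += members[n_test:]
    return sorted(train), sorted(test)

def is_feasible(train_labels, test_labels, min_per_class=2):
    """Guard: both classes adequately present in train, and both present in test."""
    train_ok = all(train_labels.count(c) >= min_per_class for c in (0, 1))
    test_ok = len(set(test_labels)) == 2
    return train_ok and test_ok

# Toy scores for 12 items (illustrative stand-in for PC1 scores).
scores = [0.3, -1.2, 2.1, 0.8, -0.4, 1.5, -2.0, 0.1, 1.1, -0.9, 0.6, -1.6]
train_idx, test_idx = quantile_stratified_split(scores)
```

A candidate split would then be kept only if `is_feasible` returns `True` for the labels induced on the training rows and assigned to the test rows.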

Using guard-validated folds, we train compact discriminants on the top-15 to confirm that the selected subset preserves the underlying structure required for index design. Table 3 reports results from 20 base splits, of which six meet the bi-class feasibility criterion under the evaluation guard. Metrics include mean accuracy (MeanAcc), standard deviation of accuracy (StdAcc), mean balanced accuracy (MeanBAcc), mean macro-F1 (MeanF1m), and mean AUC. Results show that low-dimensional linear discriminants (e.g., PCA-LDA, NCA-LDA) consistently outperform more complex ensemble and kernel-based models in both accuracy and class balance. Specifically, across the six feasible folds, low-dimensional linear pipelines dominate under strict nesting:

NCA-LDA (2D): Accuracy $= 0.9889 \pm 0.0272$, Balanced Accuracy $= 0.9167$, AUC $= 0.9643$.
PCA-LDA (2D): Accuracy $= 0.9889 \pm 0.0272$, Balanced Accuracy $= 0.9167$, AUC $= 0.9762$.
MultiView-LDA (2D): Accuracy $= 0.9778 \pm 0.0344$, Balanced Accuracy $= 0.8333$, AUC $= 0.9405$.

A 1-NN reference in standardized space attains Accuracy $= 0.933 \pm 0.038$. More complex alternatives (bagging, SVM, graph-augmented embeddings) achieve comparable accuracy but consistently lower balanced accuracy (≈0.50–0.66), indicating inferior class-balance generalization. These results validate that capacity-controlled, linear embeddings on the top-15 retain the separative information needed for robust index formation.
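The gap between accuracy and balanced accuracy noted above can be illustrated with a small worked example. Balanced accuracy is computed here as the mean of per-class recall, which is the usual definition; the labels are toy values chosen to mimic an imbalanced test fold.

```python
def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Imbalanced toy fold: nine majority-class items and one minority-class item.
y_true = [0] * 9 + [1]
y_pred = [0] * 10  # a classifier that always predicts the majority class

acc = accuracy(y_true, y_pred)            # 0.9
bacc = balanced_accuracy(y_true, y_pred)  # 0.5
```

A model that ignores the minority class still reaches 90% accuracy in this example, but its balanced accuracy is 0.5, which is why values near 0.50 signal poor class-balance generalization even when raw accuracy looks comparable.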

The study repeats discovery-to-selection under multiple seeds (varying clustering initializations, $\phi = 0.8$ row subsampling, feature-bagging ≥12 or 70%). As expected in small-$n$, high-$p$ settings, raw consensus partitions show low clustering stability, with adjusted Rand index (ARI) ≈ $0.09 \pm 0.18$ and normalized mutual information (NMI) ≈ $0.06 \pm 0.08$. More crucially, predictive verification remains stable: between-draw standard deviations of held-out accuracy are ≤0.034, and average within-draw standard deviations are ≤0.059. For interpretability, the study fits a compact RF surrogate on the train-only top-15 features and computes SHAP values on the held-out test set. After aggregating importances across folds, the group-level contributions rank as S > E > G (namely, S = 0.446, E = 0.379, G = 0.175). Recurrent feature-level drivers include R&D intensity/growth and welfare/insurance (S); energy intensity and environmental tax (E); and ownership concentration and board structure (G).
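For readers unfamiliar with the stability metric, the adjusted Rand index between two partitions can be computed from pair counts as sketched below. This is the standard pair-counting formula; the label vectors are toy examples and not the study's partitions.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items."""
    n = len(labels_a)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:               # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])     # identical up to relabeling
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # maximally mismatched
```

Values close to zero, such as the ARI of about 0.09 reported above, indicate agreement between clustering runs that is only slightly better than chance.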

The resulting complete top-15 ESG indicators identified by this ML-driven feature selection pipeline are reported in Table 4. From an optimization and MCDM perspective, these indicators differ in terms of optimization direction. For example, some indicators are smaller-the-better and therefore treated as minimization criteria, such as electricity consumption per revenue (MWh per million RMB; abbreviated as electricity_per_rev) and Scope 3 GHG emissions per revenue (tons CO₂e per million RMB; abbreviated as ghg_scope3_per_rev). Conversely, some other indicators are larger-the-better and treated as maximization criteria, including R&D expense growth rate (%; abbreviated as rnd_expense_growth_pct) and the proportion of female board directors (%; abbreviated as board_female_pct).

Based on these companies and ESG indicators, a 57 × 15 ACM is constructed, where each row corresponds to one company and each column corresponds to one selected ESG indicator. The detailed ACM dataset is provided in the Supplementary Materials.

To determine the weights of the 15 indicators, which are required for the subsequent TOPSIS ranking procedure, the hybrid PCA-CRITIC weighting method is applied. The resulting indicator weights are presented in Figure 4 and Table 5.
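A minimal sketch of the CRITIC step and of a multiplicative combination with PCA-based weights is given below. It follows the standard CRITIC formulation (contrast intensity times conflict) with direction-aware min-max normalization and the population standard deviation; the PCA weights are treated as given inputs, and the normalized-product form of the hybrid is an assumption about the multiplicative hybridization rather than a restatement of the study's exact formula. All numbers are toy values.

```python
from statistics import pstdev

def min_max(column, benefit=True):
    """Rescale to [0, 1]; cost criteria are reversed so that larger is better."""
    lo, hi = min(column), max(column)
    return [((v - lo) if benefit else (hi - v)) / (hi - lo) for v in column]

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def critic_weights(columns, benefit_flags):
    """CRITIC: information C_j = sigma_j * sum_k (1 - r_jk), normalized to sum to 1."""
    norm = [min_max(c, b) for c, b in zip(columns, benefit_flags)]
    info = []
    for cj in norm:
        conflict = sum(1 - pearson(cj, ck) for ck in norm)
        info.append(pstdev(cj) * conflict)
    total = sum(info)
    return [c / total for c in info]

def hybrid_weights(w_pca, w_critic):
    """Assumed hybrid: element-wise product of the two weight vectors, renormalized."""
    prod = [a * b for a, b in zip(w_pca, w_critic)]
    total = sum(prod)
    return [p / total for p in prod]

# Toy data: three criteria (one cost, two benefit) observed for five companies.
columns = [
    [12.0, 30.0, 18.0, 25.0, 9.0],    # e.g. an intensity metric (cost)
    [0.2, 0.5, 0.1, 0.4, 0.3],        # benefit
    [40.0, 10.0, 35.0, 20.0, 30.0],   # benefit
]
flags = [False, True, True]
w_critic = critic_weights(columns, flags)
w_pca = [0.30, 0.35, 0.35]            # hypothetical PCA-based weights
w_hybrid = hybrid_weights(w_pca, w_critic)
```

Under this form, an indicator receives a high hybrid weight only when both constituent methods rate it highly, and uniform PCA weights leave the CRITIC weights unchanged.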

Using the PCA-CRITIC hybrid weights, we then apply the TOPSIS method to compute the raw ESG performance score for each company. TOPSIS is selected because it is among the most extensively applied and theoretically grounded MCDM methods [72], offering a dual-reference-point evaluation (distances to both the positive and negative ideal solutions), computational efficiency, and the ability to handle mixed optimization directions. The recent successful application of CRITIC-TOPSIS in ESG evaluation by Sklavos et al. [59] further supports the suitability of this combination.

The TOPSIS scores are subsequently normalized by dividing each raw value by the maximum score (among the 57 companies), such that the top-ranked company receives a score of 100.0, while the remaining scores are expressed as proportional percentages of this benchmark. The distribution of ESG scores and rankings is displayed in Figure 5. As shown, BeOne Medicines attains the highest ESG performance score of 100.0, thereby ranking first among the 57 companies evaluated. WuXi AppTec follows in second place with a score of 92.4, while Sichuan Biokin ranks third with a score of 90.8. The complete numerical scores and rankings are provided in the Supplementary Materials.
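The ranking and rescaling steps can be sketched as follows. The sketch uses the textbook TOPSIS formulation with vector normalization, which is assumed here rather than confirmed for the study, and then divides by the maximum closeness score so that the best company receives 100. The three-company matrix and the equal weights are illustrative only.

```python
from math import sqrt

def topsis(matrix, weights, benefit_flags):
    """Return closeness scores in [0, 1]; rows are companies, columns are criteria."""
    n_crit = len(weights)
    # Vector normalization of each criterion, followed by weighting.
    norms = [sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    cols = list(zip(*v))
    # Positive (PIS) and negative (NIS) ideal solutions respect each direction.
    pis = [max(c) if b else min(c) for c, b in zip(cols, benefit_flags)]
    nis = [min(c) if b else max(c) for c, b in zip(cols, benefit_flags)]
    scores = []
    for row in v:
        d_pos = sqrt(sum((a - b) ** 2 for a, b in zip(row, pis)))
        d_neg = sqrt(sum((a - b) ** 2 for a, b in zip(row, nis)))
        scores.append(d_neg / (d_pos + d_neg))
    return scores

def scale_to_100(scores):
    """Express each score as a percentage of the best score."""
    top = max(scores)
    return [100.0 * s / top for s in scores]

# Columns: electricity_per_rev (cost), board_female_pct (benefit); toy values.
matrix = [[10.0, 40.0], [20.0, 30.0], [30.0, 10.0]]
raw = topsis(matrix, weights=[0.5, 0.5], benefit_flags=[False, True])
final = scale_to_100(raw)
```

In this toy case the first company is best on both criteria and therefore coincides with the PIS, giving a final score of 100.0, while the third company coincides with the NIS and scores 0.0.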

Building on the score distribution shown in Figure 5, the ranking outcomes reveal several notable structural patterns in the ESG performance landscape of the 57 listed Chinese pharmaceutical and biotechnology companies. The clear separation of the top-performing firms (led by BeOne Medicines, WuXi AppTec, and Sichuan Biokin) from the rest reflects their ability to simultaneously outperform peers across multiple high-impact ESG dimensions. For example, all of the top three companies score exceptionally well on governance indicators such as indep_director_pct, board_phd_count, and board_female_pct. In addition, they distinguish themselves through strong social contributions, including substantial employee welfare investment and high levels of external donations. Their environmental profiles also tend to feature low energy intensity or low GHG emissions, which collectively reduce their distance to the PIS in the TOPSIS ranking procedure. By contrast, the middle-ranked group exhibits noticeably more uneven ESG structures. Many firms in this tier demonstrate solid performance in isolated silos (e.g., reasonable R&D expenditure intensity, moderate governance transparency, or manageable environmental footprints) but show weaker or inconsistent results in the others. For example, several mid-tier companies have robust social expenditure but relatively concentrated ownership structures, or they maintain environmentally efficient operations but score poorly on the board diversity metrics. Such mixed profiles produce moderate ranking scores, as strengths in one pillar of ESG are insufficient to offset deficiencies in another. The lower-ranked companies typically display systemic underperformance across multiple indicators or possess specific ESG weaknesses that exert a substantial downward impact on their final scores. Common patterns include low indep_director_pct and low rnd_expense_pct_rev. Some companies additionally show weak social investment (e.g., low med_insurance_per_capita or salary_bonus_total), or heightened environmental intensity (e.g., high electricity_per_rev or ghg_scope3_per_rev) relative to revenue, placing them farther from the ideal ESG performance frontier. Overall, the ranking results demonstrate the ability of the proposed framework to differentiate ESG profiles with both transparency and granularity.

4.2. Robustness Analysis: Comparison of Rankings Under Alternative Weighting Schemes

To assess the robustness of the proposed framework, a sensitivity analysis is conducted by comparing the company rankings obtained under the proposed hybrid PCA-CRITIC weighting scheme with those produced by three alternative weighting approaches: (1) PCA-only weights, (2) CRITIC-only weights, and (3) equal weights (i.e., each indicator receives a weight of 1/15). All four weight vectors are applied within the same TOPSIS ranking procedure to isolate the effect of the weighting method on the final rankings. As presented in Table 5, while PCA-weights are relatively uniform (ranging from 0.057 to 0.076), reflecting the variance contribution of each indicator to the principal components, CRITIC-weights show greater differentiation, assigning notably higher weights to board_female_pct (0.097), top10_shareholding_pct (0.094), and env_tax (0.088). The hybrid PCA-CRITIC-weights inherit the discriminatory power of CRITIC while being moderated by the variance-based PCA-weights, yielding a balanced weight distribution that rewards indicators exhibiting both high variance and low inter-criteria redundancy.

The Spearman rank correlation coefficients between the rankings produced by each pair of weighting methods all exceed 0.94, indicating a high degree of agreement across the four approaches. The proposed PCA-CRITIC hybrid method shows particularly strong agreement with CRITIC-only (0.979) and PCA-only (0.962), confirming that the multiplicative hybridization effectively preserves the informational content of both constituent methods. The Kendall rank correlation coefficients, which account for pairwise concordance and are generally more conservative than Spearman correlations, range from 0.808 to 0.907, further corroborating the robustness of the rankings.

Additionally, BeOne Medicines consistently ranks first across all four weighting methods, demonstrating its superior ESG performance. The same five companies (BeOne Medicines, WuXi AppTec, Sichuan Biokin, Jiangsu Sinopep-Allsino, and Inner Mongolia Furui Medical Science) appear in the top five under three of the four methods (PCA-CRITIC, CRITIC-only, and equal weights), with Pharmaron Beijing replacing Inner Mongolia Furui under PCA-only. Their relative ordering shows some variation; for instance, WuXi AppTec ranks 2nd under PCA-CRITIC, PCA-only, and equal weights, but 5th under CRITIC-only. More pronounced rank shifts occur in the middle tier: Intco Medical Technology ranks 7th under the proposed method but 18th under CRITIC-only, reflecting the sensitivity of firms with mixed ESG profiles to the weighting scheme. The average absolute rank difference between the proposed PCA-CRITIC method and the equal-weight method is 3.72 positions, while the average difference relative to CRITIC-only is 2.07 positions. These results collectively demonstrate that the rankings generated by the PCA-CRITIC-TOPSIS are robust, lending confidence to the reliability and stability of the proposed approach.
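The three agreement measures used in this comparison can be sketched for two tie-free rankings as follows. The Spearman coefficient uses the closed form for untied ranks, and the Kendall coefficient is the tau-a variant; both choices assume that the rankings contain no ties. The two example rankings are toy values.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman correlation for two rankings without ties."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(rank_a, rank_b):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(rank_a)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

def mean_abs_rank_diff(rank_a, rank_b):
    """Average number of positions by which each item's rank changes."""
    return sum(abs(a - b) for a, b in zip(rank_a, rank_b)) / len(rank_a)

# Ranks of five companies under two hypothetical weighting schemes.
ranks_hybrid = [1, 2, 3, 4, 5]
ranks_equal = [1, 3, 2, 5, 4]

rho = spearman_rho(ranks_hybrid, ranks_equal)           # 0.8
tau = kendall_tau(ranks_hybrid, ranks_equal)            # 0.6
shift = mean_abs_rank_diff(ranks_hybrid, ranks_equal)   # 0.8
```

The example also shows why Kendall values are typically lower than Spearman values for the same pair of rankings: two swapped neighbouring pairs reduce tau to 0.6 while rho remains at 0.8.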

5. Discussion and Limitations

The proposed end-to-end ML framework for constructing multi-criteria ESG ranking advances both methodological innovation and practical application in ESG evaluation. By embedding ML algorithms across the full analytical pipeline (i.e., from data preprocessing and feature selection to indicator weighting and ranking), the framework establishes a fully data-driven, objective, and reproducible approach to ESG assessment. This end-to-end design reduces human bias and enhances methodological transparency, directly addressing the persistent opacity and subjectivity that characterize many existing ESG rating systems. Additionally, the integration of the three-plane ESG feature selection workflow with the hybrid PCA-CRITIC-TOPSIS weighting and ranking procedure further demonstrates how combining ML and MCDM techniques can yield interpretable, stable, and financially relevant ESG performance scores. The empirical application to 57 listed Chinese pharmaceutical and biotechnology firms illustrates the framework’s ability to uncover nuanced ESG performance structures within a sector that plays a critical role in sustainable development and investor interest. The proposed framework can also assist in risk exposure management and portfolio optimization by integrating ESG ratings into asset selection.

To contextualize the empirical findings of this study, we compare and contrast our results with those of relevant studies that employ similar MCDM-based ESG evaluation methods. First, regarding pillar-level importance, our SHAP-based attribution analysis reveals that social indicators contribute the most to firm differentiation (S = 0.446), followed by environmental (E = 0.379) and governance (G = 0.175). This ordering is consistent with the characteristics of the pharmaceutical and biotechnology industry, where workforce quality, R&D investment, and employee welfare are critical competitive differentiators. Interestingly, the relatively lower contribution of governance aligns with the broader empirical observation in the ESG rating literature that governance metrics tend to show the weakest cross-rater consistency. Dimson et al. [33] reported pillar-level correlations of 0.42 (E), 0.30 (S), and only 0.07 (G) across ESG providers, while Capizzi et al. [30] found governance correlations as low as 0.06 to 0.09, suggesting that governance indicators are inherently more difficult to differentiate across firms, which may explain their lower discriminative contribution in our framework.

Second, regarding the ranking methodology, our PCA-CRITIC-TOPSIS approach can be compared with the DEA-CRITIC-TOPSIS framework proposed by Sklavos et al. [59] for evaluating European financial institutions. Both frameworks employ CRITIC for objective weighting and TOPSIS for ranking, but differ in their treatment of efficiency and data preprocessing. Sklavos et al. [59] incorporated DEA to assess eco-efficiency prior to CRITIC weighting, whereas our framework replaces DEA with PCA-based variance weighting, which is more suitable for contexts where efficiency frontier estimation is not the primary objective. Additionally, our framework extends the pipeline upstream with systematic data imputation and ML-driven feature selection, which are absent in Sklavos et al. [59]. Rathi et al. [57] applied standalone TOPSIS with expert-assigned weights to the electric utilities industry, reporting that environmental indicators dominated the rankings, a finding that contrasts with our social-dominant result and reflects the expected industry-specific materiality differences. Yu et al. [56] employed the CoCoSo method with integrated entropy-expert weights, finding that governance and environmental criteria were most influential for their sample of listed companies; the difference from our findings further underscores that ESG pillar importance is context-dependent and reinforces the value of the data-driven (rather than pre-specified) weighting used in our proposed framework.

Nevertheless, this study is not without limitations. The primary limitation is that the proposed framework has been demonstrated using a single dataset comprising 57 listed Chinese pharmaceutical and biotechnology companies. While this application successfully illustrates the framework’s practical utility and methodological rigor, the use of a single industry and geographic context constrains the generalizability of the empirical findings to other sectors, markets, or regulatory environments. Different industries exhibit distinct ESG materiality profiles, which may influence the relative importance of ESG indicators and the structure of the resulting rankings. Furthermore, the sample size of 57 companies, although sufficient for the methodological demonstration, represents a small-n setting that may not fully capture the distributional characteristics of larger and more diverse ESG datasets. Importantly, however, the methodological architecture of the proposed framework is inherently transferable across domains and scalable. Future research will therefore extend the application to multiple industries, cross-sectoral datasets, and larger firm samples to further validate the framework’s robustness and generalizability across diverse ESG evaluation contexts. Moreover, future work will seek to enrich the framework with complementary data sources, including textual ESG disclosures, regulatory filings, and real-time sustainability indicators, to capture a more holistic view of corporate performance. As ESG continues to evolve as a quantifiable dimension within financial modeling, this study lays the groundwork for a transparent, reproducible, and data-driven multi-criteria ESG evaluation paradigm.

We acknowledge that external validation against established commercial ESG ratings (e.g., MSCI, LSEG, Sustainalytics) and assessment of external validity through correlation with financial performance measures (e.g., stock returns, Tobin’s Q, or market-to-book ratios) or market recognition indicators would further strengthen the framework’s credibility. However, direct comparison with commercial ratings is currently infeasible due to their proprietary methodologies, differing indicator sets and scoring scales, and licensing restrictions for the specific Chinese pharmaceutical firms in our sample. Assessing external validity through correlation with financial performance constitutes a valuable and important direction for future research that would complement the internal validity demonstrated through the robustness analysis in Section 4.2. The present study focuses on establishing the methodological rigor and internal consistency of the proposed framework, which is a necessary prerequisite before external validation can be meaningfully conducted.

While the current application focuses on the pharmaceutical and biotechnology industry, the proposed framework is designed with a modular and domain-agnostic architecture that facilitates adaptation to other sectors. Extending the framework to different domains would primarily involve adjustments in three areas. First, the ESG indicator set must be redefined to reflect domain-specific material ESG factors. For instance, the energy and natural resource sectors would place greater emphasis on environmental indicators [72], such as carbon emissions intensity, water withdrawal, and biodiversity impact; whereas the financial services sector [73] would prioritize governance indicators, such as regulatory compliance, risk management practices, and executive compensation structures. The identification of domain-relevant indicators can be guided by established materiality frameworks, such as those provided by the Sustainability Accounting Standards Board (SASB) or the Global Reporting Initiative (GRI). Second, the optimization direction (maximization or minimization) and measurement units of certain indicators may differ across industries; for example, revenue-intensity metrics that are meaningful in manufacturing may need to be replaced with per-asset or per-client metrics in services-oriented sectors. Third, the data preprocessing parameters may require recalibration; sectors with higher or lower data availability (e.g., heavily regulated industries such as banking versus less regulated industries such as retail) may necessitate adjustments to the missingness thresholds and imputation configurations. Importantly, the core algorithmic components of the framework (i.e., MissForest imputation, the three-plane feature selection workflow, the hybrid PCA-CRITIC weighting, and the TOPSIS ranking) are inherently general-purpose and require no structural modification, only the re-specification of the input data and indicator definitions. This modular design ensures that the framework can be readily deployed across diverse industry contexts with appropriate domain-specific customization of the ESG indicator set and evaluation parameters.

Furthermore, the proposed framework carries several important ESG-related implications for diverse stakeholders. For investors and asset managers, the framework provides a transparent, objective, and reproducible ESG scoring mechanism that can inform portfolio construction, ESG integration strategies, and risk assessment. Unlike proprietary ESG ratings from commercial providers, which suffer from well-documented divergence and opacity, the proposed framework produces scores that are fully traceable to the underlying data and methodology, enabling investors to understand precisely why a company receives a particular ESG rank and which indicators drive the assessment. For corporate managers and sustainability officers, the indicator-level weights and SHAP-based attributions reveal which specific ESG dimensions most strongly differentiate high-performing from low-performing firms within their industry, thereby guiding targeted improvement efforts. For example, in the pharmaceutical sector, the framework identifies R&D investment intensity, employee welfare expenditure, and board diversity as particularly influential indicators, providing actionable guidance for strategic ESG enhancement. For regulators and policymakers, the framework offers a standardized and replicable methodology for benchmarking corporate ESG performance within specific industries, which could support the development of industry-specific ESG disclosure requirements and performance standards. The data-driven nature of the framework directly addresses the regulatory concern that existing ESG ratings are insufficiently transparent and consistent. For the ESG rating industry, this study demonstrates a viable alternative paradigm: one that replaces opaque, proprietary scoring with a fully transparent, ML-supported pipeline where every methodological choice is documented and reproducible, potentially contributing to greater convergence and standardization in ESG evaluation practices.

6. Conclusions

In conclusion, this study developed an innovative end-to-end ML framework for multi-criteria ESG evaluation, aiming to overcome the subjectivity, inconsistency, and opacity that often characterize existing ESG evaluation and rating systems. The framework systematically integrates data preprocessing, feature selection, and MCDM into a unified, fully data-driven architecture. It leverages the MissForest ML algorithm for robust imputation of missing data, introduces a three-plane ML-aided ESG feature selection workflow to identify stable and representative indicators, and integrates PCA, CRITIC, and TOPSIS methods for objective ESG indicator weighting and ranking for the first time. Applied to real-world 2024 ESG data from 57 listed Chinese pharmaceutical and biotechnology companies across 70 ESG indicators, the framework successfully produced objective, transparent, and reproducible ESG scores and rankings. The empirical results demonstrated the framework’s effectiveness in capturing complex multi-dimensional relationships among ESG indicators, producing consistent and credible evaluations. In addition, the proposed framework enables scalable, explainable, and computationally efficient ESG performance measurement, offering valuable insights for investors, regulators, and corporate decision-makers. Future research will extend the application of this framework to larger and cross-sectoral ESG datasets, as well as explore dynamic ESG indicators incorporating real-time and textual data.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/bdcc10050130/s1, Table S1: Cleaned ESG dataset with 57 pharmaceutical and biotechnology companies (both English and Chinese names) and 15 selected indicators; Table S2: ESG performance scores and rankings of the 57 pharmaceutical and biotechnology companies.

Author Contributions

Z.W.: Writing—original draft, Software, Methodology, Investigation, Formal analysis, Conceptualization. T.L.: Writing—original draft, Software, Methodology, Investigation, Formal analysis, Conceptualization. Y.T.: Writing—original draft, Software, Methodology, Investigation, Formal analysis, Conceptualization. C.X.: Writing—original draft, Methodology, Investigation, Formal analysis, Conceptualization. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

We thank Ann Guo from Jiran Think Tank for input on initial data preprocessing, and Qian Zhong Chao from Qinglv for providing access to the real dataset used in this study.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Algorithm A1. Pseudocode of the MissForest algorithm for missing-data imputation.

Require: $X$, an $n \times p$ matrix; stopping criterion $\gamma$

1. Make initial guess for missing values;
2. $k$ ← vector of sorted indices of columns in $X$ w.r.t. increasing amount of missing values;
3. while not $\gamma$ do
4.     $X_{old}^{imp}$ ← store previously imputed matrix;
5.     for $j$ in $k$ do
6.         Fit a random forest: $y_{obs}^{(j)} \sim x_{obs}^{(j)}$;
7.         Predict $y_{mis}^{(j)}$ using $x_{mis}^{(j)}$;
8.         $X_{new}^{imp}$ ← update imputed matrix, using predicted $y_{mis}^{(j)}$;
9.     end for
10.    update $\gamma$.
11. end while
12. return the imputed matrix $X^{imp}$

References

Kocmanová, A.; Šimberová, I. Determination of environmental, social and corporate governance indicators: Framework in the measurement of sustainable performance. J. Bus. Econ. Manag. 2014, 15, 1017–1033.
He, F.; Ding, C.; Yue, W.; Liu, G. ESG performance and corporate risk-taking: Evidence from China. Int. Rev. Financ. Anal. 2023, 87, 102550.
Fatemi, A.; Fooladi, I.J.; Wheeler, D. The relative valuation of socially responsible firms: An exploratory study. In Finance for a Better World: The Shift Toward Sustainability; Springer: Berlin/Heidelberg, Germany, 2009; pp. 140–167.
Bowley, T.; Hill, J.G. The global ESG stewardship ecosystem. Eur. Bus. Organ. Law Rev. 2024, 25, 229–268.
Pathak, N. The Genesis of ESG: Tracing the Historical Roots. In Building a Sustainable Future: Roadmap for India’s Progress & Prosperity; Shree Vinayak Publication Agra: Agra, India, 2024; p. 236.
Weston, P.; Nnadi, M. Evaluation of strategic and financial variables of corporate sustainability and ESG policies on corporate finance performance. J. Sustain. Financ. Investig. 2023, 13, 1058–1074.
Chen, S.; Song, Y.; Gao, P. Environmental, social, and governance (ESG) performance and financial outcomes: Analyzing the impact of ESG on financial performance. J. Environ. Manag. 2023, 345, 118829.
Ayton, J.; Krasnikova, N.; Malki, I. Corporate social performance and financial risk: Further empirical evidence using higher frequency data. Int. Rev. Financ. Anal. 2022, 80, 102030.
Fatemi, A.M.; Fooladi, I.J. Sustainable finance: A new paradigm. Glob. Financ. J. 2013, 24, 101–113.
Wang, H.; Jiao, S.; Ma, C. The impact of ESG responsibility performance on corporate resilience. Int. Rev. Econ. Financ. 2024, 93, 1115–1129.
Siddik, A.B.; Yong, L.; Du, A.M.; Vigne, S.A.; Sharif, A. Harnessing Artificial Intelligence for Enhanced Environmental Sustainability in China’s Banking Sector: A Mixed-Methods Approach. Br. J. Manag. 2025, 36, 1256–1273.
Xu, J.; Wu, W.; Feng, X. The impact of ESG performances on analyst report readability: Evidence from China. Int. Rev. Financ. Anal. 2025, 102, 104056.
Lee, J.; Koh, K. ESG performance and firm risk in the US financial firms. Rev. Financ. Econ. 2024, 42, 328–344.
Zhang, D.; Wang, C.; He, Y.; Vigne, S.A. Does FinTech efficiently hamper manipulating ESG data behavior? Br. Account. Rev. 2024, 58, 101494.
Adardour, Z.; Ed-Dafali, S.; Mohiuddin, M.; El Mortagi, O.; Sbai, H.; Bouzahir, B. Exploring the drivers of environmental, social, and governance (ESG) disclosure in an emerging market context using a mixed methods approach. Future Bus. J. 2025, 11, 107.
Ma, D.; Xie, Y.; Huang, H.; Qiu, J. Does corporate ESG performance promote export resilience? New insights from risk resistance and resilience. J. Environ. Manag. 2024, 371, 122881.
Wu, H.; Zhang, K.; Li, R. ESG score, analyst coverage and corporate resilience. Financ. Res. Lett. 2024, 62, 105248.
Li, Q. Impact of ESG performance of manufacturing companies on their export trade. J. Educ. Humanit. Soc. Sci. 2024, 35, 259–272.
Wu, Q.; Chen, G.; Han, J.; Wu, L. Does corporate ESG performance improve export intensity? Evidence from Chinese listed firms. Sustainability 2022, 14, 12981.
Cai, T.; Hao, J. The influence of ESG responsibility performance on enterprises’ export performance. Int. Rev. Econ. Financ. 2025, 98, 103917.
Jiao, Y.; Liu, H. Optimal portfolio choice with ESG considerations and asymmetric information. Quant. Financ. 2025, 25, 1163–1176.
Tan, W.; Liu, Y.; Teng, M. When ESG news talks: How media sentiment shapes corporate financial behavior in China. Glob. Financ. J. 2025, 67, 101161.
Yang, R.; Caporin, M.; Jiménez-Martin, J.-A. ESG risk exposure: A tale of two tails. Quant. Financ. 2024, 24, 827–849.
Bayat, A.; Qu, R.; Rahmani, Z. ESG Rating Uncertainty: Causes, Consequences and Potential Remedies. Conseq. Potential Remedies 2025.
Yu, E.P.-y.; Van Luu, B. International variations in ESG disclosure–do cross-listed companies care more? Int. Rev. Financ. Anal. 2021, 75, 101731.
Avramov, D.; Cheng, S.; Lioui, A.; Tarelli, A. Sustainable investing with ESG rating uncertainty. J. Financ. Econ. 2022, 145, 642–664.
Berg, F.; Kölbel, J.F.; Rigobon, R. Aggregate confusion: The divergence of ESG ratings. Rev. Financ. 2022, 26, 1315–1344.
Billio, M.; Costola, M.; Hristova, I.; Latino, C.; Pelizzon, L. Inside the ESG ratings: (Dis)agreement and performance. Corp. Soc. Responsib. Environ. Manag. 2021, 28, 1426–1445.
Zhu, Y.; Yang, H.; Zhong, M. Do ESG ratings of Chinese firms converge or diverge? A comparative analysis based on multiple domestic and international ratings. Sustainability 2023, 15, 12573. [Google Scholar] [CrossRef]
Capizzi, V.; Gioia, E.; Giudici, G.; Tenca, F. The divergence of ESG ratings: An analysis of Italian listed companies. J. Financ. Manag. Mark. Inst. 2021, 9, 2150006. [Google Scholar] [CrossRef]
Kimbrough, M.D.; Wang, X.; Wei, S.; Zhang, J. Does voluntary ESG reporting resolve disagreement among ESG rating agencies? Eur. Account. Rev. 2024, 33, 15–47. [Google Scholar] [CrossRef]
Charlin, V.; Cifuentes, A.; Alfaro, J. ESG ratings: An industry in need of a major overhaul. J. Sustain. Financ. Investig. 2024, 14, 1037–1055. [Google Scholar] [CrossRef]
Dimson, E.; Marsh, P.; Staunton, M. Practical Applications of Divergent ESG Ratings. Pract. Appl. 2021, 9, 1–7. [Google Scholar] [CrossRef]
Brandon, R.G.; Krueger, P.; Schmidt, P.S. ESG rating disagreement and stock returns. Financ. Anal. J. 2021, 77, 104–127. [Google Scholar] [CrossRef]
Serafeim, G.; Yoon, A. Stock price reactions to ESG news: The role of ESG ratings and disagreement. Rev. Account. Stud. 2023, 28, 1500–1530. [Google Scholar] [CrossRef]
Chatterji, A.K.; Durand, R.; Levine, D.I.; Touboul, S. Do ratings of firms converge? Implications for managers, investors and strategy researchers. Strateg. Manag. J. 2016, 37, 1597–1614. [Google Scholar] [CrossRef]
Kotsantonis, S.; Serafeim, G. Four things no one will tell you about ESG data. J. Appl. Corp. Financ. 2019, 31, 50–58. [Google Scholar] [CrossRef]
Tian, L.; Song, X.; Du, M.; Xu, B. The disciplinary impact of capital market internationalization on corporate ESG greenwashing: A study of A-shares’ inclusion in the MSCI index. Int. Rev. Financ. Anal. 2025, 103, 104202. [Google Scholar] [CrossRef]
Chopra, S.S.; Senadheera, S.S.; Dissanayake, P.D.; Withana, P.A.; Chib, R.; Rhee, J.H.; Ok, Y.S. Navigating the challenges of environmental, social, and governance (ESG) reporting: The path to broader sustainable development. Sustainability 2024, 16, 606. [Google Scholar] [CrossRef]
Demartini, P.; Pagliei, C. Can we trust ESG ratings? Some insights based on a bibliometric analysis of ESG data quality and rating reliability. Manag. Control. Spec. 2023. [Google Scholar]
Tibshirani, R.; Walther, G.; Hastie, T. Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B Stat. Methodol. 2001, 63, 411–423. [Google Scholar] [CrossRef]
Șenbabaoğlu, Y.; Michailidis, G.; Li, J.Z. Critical limitations of consensus clustering in class discovery. Sci. Rep. 2014, 4, 6207. [Google Scholar] [CrossRef]
Hao, Z.; Lu, Z.; Li, G.; Nie, F.; Wang, R.; Li, X. Ensemble clustering with attentional representation. IEEE Trans. Knowl. Data Eng. 2023, 36, 581–593. [Google Scholar] [CrossRef]
Chen, Y.; Yang, Y. The one standard error rule for model selection: Does it work? Stats 2021, 4, 868–892. [Google Scholar] [CrossRef]
Tao, Z.; Liu, H.; Li, S.; Ding, Z.; Fu, Y. From ensemble clustering to multi-view clustering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017. [Google Scholar]
Guignard, F.; Ginsbourger, D.; Levy Häner, L.; Herrera, J.M. Some combinatorics of data leakage induced by clusters. Stoch. Environ. Res. Risk Assess. 2024, 38, 2815–2828. [Google Scholar] [CrossRef]
Varoquaux, G. Cross-validation failure: Small sample sizes lead to large error bars. Neuroimage 2018, 180, 68–77. [Google Scholar] [CrossRef]
Kapoor, S.; Cantrell, E.M.; Peng, K.; Pham, T.H.; Bail, C.A.; Gundersen, O.E.; Hofman, J.M.; Hullman, J.; Lones, M.A.; Malik, M.M. REFORMS: Consensus-based Recommendations for Machine-learning-based Science. Sci. Adv. 2024, 10, eadk3452. [Google Scholar] [CrossRef]
Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef]
Hofner, B.; Boccuto, L.; Göker, M. Controlling false discoveries in high-dimensional situations: Boosting with stability selection. BMC Bioinform. 2015, 16, 144. [Google Scholar] [CrossRef]
Covert, I.; Lundberg, S.; Lee, S.-I. Explaining by removing: A unified framework for model explanation. J. Mach. Learn. Res. 2021, 22, 9477–9566. [Google Scholar]
Wanke, P.; Azad, A.K.; Emrouznejad, A. Efficiency in BRICS banking under data vagueness: A two-stage fuzzy approach. Glob. Financ. J. 2018, 35, 58–71. [Google Scholar] [CrossRef]
Reig-Mullor, J.; Garcia-Bernabeu, A.; Pla-Santamaria, D.; Vercher-Ferrandiz, M. Evaluating ESG corporate performance using a new neutrosophic AHP-TOPSIS based approach. Technol. Econ. Dev. Econ. 2022, 28, 1242–1266. [Google Scholar] [CrossRef]
Sood, K.; Pathak, P.; Jain, J.; Gupta, S. How does an investor prioritize ESG factors in India? An assessment based on fuzzy AHP. Manag. Financ. 2023, 49, 66–87. [Google Scholar] [CrossRef]
Meng, X.; Shaikh, G.M. Evaluating environmental, social, and governance criteria and green finance investment strategies using fuzzy AHP and fuzzy WASPAS. Sustainability 2023, 15, 6786. [Google Scholar] [CrossRef]
Yu, K.; Wu, Q.; Chen, X.; Wang, W.; Mardani, A. An integrated MCDM framework for evaluating the environmental, social, and governance (ESG) sustainable business performance. Ann. Oper. Res. 2024, 342, 987–1018. [Google Scholar] [CrossRef]
Rathi, P.; Nyati, A.; Singhi, R.; Srivastava, A. Assessing Firm’s ESG Performance Using the TOPSIS. In Responsible Firms: CSR, ESG, and Global Sustainability; Emerald Publishing Limited: Manchester, UK, 2024; pp. 119–136. [Google Scholar]
Assefa, D.Z.; Ishizaka, A.; La Torre, D. Exploring the ESG-finance relationship through PROMETHEE ranking. Ann. Oper. Res. 2025, 1–47. [Google Scholar] [CrossRef]
Sklavos, G.; Zournatzidou, G.; Ragazou, K.; Sariannidis, N. Green accounting and ESG-driven eco-efficiency in European financial institutions: A two-stage DEA–CRITIC-TOPSIS evaluation. PLoS ONE 2025, 20, e0334882. [Google Scholar] [CrossRef]
Caprioli, S.; Foschi, J.; Crupi, R.; Sabatino, A. Denoising ESG: Uncertainty-aware scoring through probabilistic imputation of missing data. Quant. Financ. 2025, 1–10. [Google Scholar] [CrossRef]
Stekhoven, D.J.; Bühlmann, P. MissForest—Non-parametric missing value imputation for mixed-type data. Bioinformatics 2012, 28, 112–118. [Google Scholar] [CrossRef]
Wang, Z.; Li, J.; Rangaiah, G.P.; Wu, Z. Machine learning aided multi-objective optimization and multi-criteria decision making: Framework and two applications in chemical engineering. Comput. Chem. Eng. 2022, 165, 107945. [Google Scholar] [CrossRef]
Dixneuf, P.; Errico, F.; Glaus, M. A computational study on imputation methods for missing environmental data. arXiv 2021, arXiv:2108.09500. [Google Scholar] [CrossRef]
Joel, L.O.; Doorsamy, W.; Paul, B.S. On the performance of imputation techniques for missing values on healthcare datasets. arXiv 2024, arXiv:2403.14687. [Google Scholar] [CrossRef]
Diakoulaki, D.; Mavrotas, G.; Papayannakis, L. Determining objective weights in multiple criteria problems: The critic method. Comput. Oper. Res. 1995, 22, 763–770. [Google Scholar] [CrossRef]
Wang, Y.; Wang, Z.; Wu, Z. Multi-objective optimal control of nonlinear processes using reinforcement learning with adaptive weighting. Comput. Chem. Eng. 2025, 201, 109206. [Google Scholar] [CrossRef]
Hwang, C.-L.; Yoon, K. Multiple Attribute Decision Making: Methods and Applications A State-of-the-Art Survey; Springer: Berlin/Heidelberg, Germany, 1981. [Google Scholar] [CrossRef]
Ding, D.; Li, Y.; Neo, P.L.; Wang, Z.; Xia, C. Subjective-objective median-based importance technique (SOMIT) to aid multi-criteria renewable energy evaluation. Appl. Energy 2025, 402, 126872. [Google Scholar] [CrossRef]
Nabavi, S.R.; Wang, Z.; Rodríguez, M.L. Multi-Objective Optimization and Multi-Criteria Decision-Making Approach to Design a Multi-Tubular Packed-Bed Membrane Reactor in Oxidative Dehydrogenation of Ethane. Energy Fuels 2024, 39, 491–503. [Google Scholar] [CrossRef]
Wang, Z.; Rangaiah, G.P. Multi-Criteria Decision-Making: Principles, Methods and Programs; CRC Press: Boca Raton, FL, USA, 2026. [Google Scholar] [CrossRef]
Wang, Z.; Nabavi, S.R.; Rangaiah, G.P. Selected Multi-criteria Decision-Making Methods and Their Applications to Product and System Design. In Optimization Methods for Product and System Design; Springer: Berlin/Heidelberg, Germany, 2023; pp. 107–138. [Google Scholar]
Wang, Z.; Nabavi, S.R.; Rangaiah, G.P. Multi-criteria decision making in chemical and process engineering: Methods, progress, and potential. Processes 2024, 12, 2532. [Google Scholar] [CrossRef]
Turgay, S.; Erdoğan, S.; Stević, Ž.; Elma, O.E.; Eren, T.; Wang, Z.; Baydaş, M. Risk-Aware Financial Forecasting Enhanced by Machine Learning and Intuitionistic Fuzzy Multi-Criteria Decision-Making. arXiv 2025, arXiv:2512.17936. [Google Scholar]

Figure 1. Graphical illustration of the proposed end-to-end ML framework for multi-criteria ESG evaluation.

Figure 2. ESG feature selection and attribution workflow.

Figure 3. Surface plot of the TOPSIS performance score ($P_{i}$) as a function of the distances to the PIS ($S_{i+}$) and NIS ($S_{i-}$).

Figure 4. Weight distributions of the 15 ESG indicators using the hybrid PCA-CRITIC weighting method.

Figure 5. ESG performance scores and rankings of the 57 Chinese pharmaceutical and biotechnology companies.

Table 1. Summary of key findings from the literature review.

Theme	Key Findings
ESG and Corporate Performance	Higher ESG scores in Chinese manufacturing firms are positively associated with export volume through lower operating costs.
	Better ESG performance is linked to more favorable trading activity, driven by innovation and reduced financial constraints.
	ESG drives expansion performance and lowers capital costs; investors tend to trust companies with better ESG performance more.
	Stronger financial outcomes linked to superior ESG, especially for firms investing in green technology innovation.
ESG Rating Divergence	Correlations of 0.38–0.71 across six ESG raters and 924 firms; divergence attributed to scope, measurement, and weight differences.
	Correlations of 0.43–0.69 among four major ESG providers.
	Pairwise correlations of 0.057–0.736 for 195 Chinese firms; average correlation of 0.411.
	ESG ratings are even less consistent than subjective domains such as wine tasting.
	Pillar-level correlations lower than aggregate ESG (E: 0.42, S: 0.30, G: 0.07 vs. overall: 0.45).
ESG Feature Selection	ESG data present a difficult learning regime due to varied scales, within-silo redundancy, and small samples.
	ESG data quality issues and information overload hinder reliable evaluation.
	Leak-aware evaluation is critical but often under-specified in ESG-related workflows.
MCDM in ESG	Fuzzy AHP-TOPSIS applied to ESG evaluation in the oil and gas sector.
	Integrated MCDM framework using CoCoSo for ESG sustainable performance evaluation.
	TOPSIS applied to rate ESG performance in the electric utilities industry.
	Two-stage DEA-CRITIC-TOPSIS framework for ESG-driven eco-efficiency of European financial institutions.
	PROMETHEE used to analyze ESG-finance relationships across S&P 500 firms (2010–2023).

Table 2. Names of the 57 Chinese pharmaceutical and biotechnology companies in this study.

Name	Name
Mehow Innovative Ltd.	Hangzhou AllTest Biotech Co., Ltd.
Huizhou Jinghao Medical Technology Co., Ltd.	Contec Medical Systems Co., Ltd.
Anhui Hongyu Wuzhou Medical Manufacturer Co., Ltd.	Nanjing King-Friend Biochemical Pharmaceutical Co., Ltd.
Jiangxi Synergy Pharmaceutical Co., Ltd.	Aidite (Qinhuangdao) Technology Co., Ltd.
Guangdong Transtek Medical Electronics Co., Ltd.	Hangzhou Biotest Biotech Co., Ltd.
Intco Medical Technology Co., Ltd.	Honsun (Nantong) Co., Ltd.
Guangzhou Jet Bio-Filtration Co., Ltd.	Edan Instruments, Inc.
Qianjiang Yongan Pharmaceutical Co., Ltd.	HitGen Inc.
Hybio Pharmaceutical Co., Ltd.	Kingchem (Liaoning) Life Science Co., Ltd.
Blue Sail Medical Co., Ltd.	Zhejiang Orient Gene Biotech Co., Ltd.
Zhejiang Haisen Pharmaceutical Co., Ltd.	Sichuan Biokin Pharmaceutical Co., Ltd.
Zhende Medical Co., Ltd.	Pharmablock Sciences (Nanjing), Inc.
Allmed Medical Products Co., Ltd.	BMC Medical Co., Ltd.
Jianerkang Medical Technology Co., Ltd.	Hangzhou AGS MedTech Co., Ltd.
Ningbo Menovo Pharmaceutical Co., Ltd.	Shenzhen Hepalink Pharmaceutical Group Co., Ltd.
Caina Technology Co., Ltd.	Asymchem Laboratories (Tianjin) Co., Ltd.
Zhejiang Ausun Pharmaceutical Co., Ltd.	PharmaResources (Shanghai) Co., Ltd.
Zhonghong Pulin Medical Products Co., Ltd.	Jenkem Technology Co., Ltd.
Zhejiang Hisoar Pharmaceutical Co., Ltd.	Porton Pharma Solutions Ltd.
Well Lead Medical Co., Ltd.	Sino Biological Inc.
Jiangsu Sinopep-Allsino Biopharmaceutical Co., Ltd.	WuXi AppTec Co., Ltd.
Assure Tech (Hangzhou) Co., Ltd.	Pharmaron Beijing Co., Ltd.
Zhejiang Gongdong Medical Technology Co., Ltd.	Bide Pharmatech Co., Ltd.
Shenzhen Glory Medical Co., Ltd.	Chempartner Pharmatech Co., Ltd.
Shantou Institute of Ultrasonic Instruments Co., Ltd.	Acrobiosystems Co., Ltd.
Zhejiang Jiuzhou Pharmaceutical Co., Ltd.	Inner Mongolia Furui Medical Science Co., Ltd.
Aurisco Pharmaceutical Co., Ltd.	BeOne Medicines Ltd.
Zhejiang Tianyu Pharmaceutical Co., Ltd.	Andon Health Co., Ltd.
Chison Medical Technologies Co., Ltd.

Table 3. Nested cross-validation performance across model variants for ESG feature selection.

Design	Mean Accuracy	Std. Accuracy	Mean Balanced Accuracy	Mean Macro-F1	Mean AUC
NCA-LDA	0.988889	2.72 × 10⁻²	0.916667	0.913793	0.964286
PCA-LDA	0.988889	2.72 × 10⁻²	0.916667	0.913793	0.97619
MultiView-LDA	0.977778	3.44 × 10⁻²	0.833333	0.827586	0.940476
MultiView-Bagging	0.955556	3.44 × 10⁻²	0.666667	0.655172	0.690476
MultiView-SVM	0.944444	2.72 × 10⁻²	0.583333	0.568966	0.833333
NCA-Graph-LDA	0.944444	2.72 × 10⁻²	0.583333	0.568966	0.97619
PCA-Graph-LDA	0.944444	2.72 × 10⁻²	0.583333	0.568966	0.964286
NCA-Bagging	0.933333	4.22 × 10⁻²	0.577381	0.565887	0.988095
NCA-Graph-SVM	0.933333	1.22 × 10⁻¹⁶	0.500000	0.482759	0.904762
NCA-SVM	0.933333	1.22 × 10⁻¹⁶	0.500000	0.482759	0.904762
PCA-Graph-SVM	0.933333	1.22 × 10⁻¹⁶	0.500000	0.482759	0.940476
PCA-SVM	0.933333	1.22 × 10⁻¹⁶	0.500000	0.482759	0.940476
NCA-Graph-Bagging	0.922222	2.72 × 10⁻²	0.494048	0.47968	0.97619
PCA-Bagging	0.922222	6.55 × 10⁻²	0.571429	0.56258	0.869048
PCA-Graph-Bagging	0.922222	2.72 × 10⁻²	0.494048	0.47968	0.839286
LDA-Bagging	0.911111	1.56 × 10⁻¹	0.720238	0.72342	0.720238
LDA-Graph-Bagging	0.911111	1.56 × 10⁻¹	0.720238	0.72342	0.750000
LDA-LDA	0.911111	1.56 × 10⁻¹	0.720238	0.72342	0.714286
LDA-Graph-LDA	0.888889	1.44 × 10⁻¹	0.553571	0.551006	0.553571
LDA-SVM	0.888889	1.44 × 10⁻¹	0.553571	0.551006	0.857143
LDA-Graph-SVM	0.877778	1.36 × 10⁻¹	0.470238	0.464799	0.833333
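
Several rows in Table 3 pair a high accuracy (0.933333) with a balanced accuracy of exactly 0.500000 and a macro-averaged F1 of 0.482759. A minimal sketch of how that pattern can arise: assuming a hypothetical evaluation set with a 14:1 class ratio (an illustrative assumption, not a figure taken from the study) and a classifier that always predicts the majority class, the three metrics reproduce those values. The helper name `imbalance_metrics` is made up for this illustration.

```python
def imbalance_metrics(y_true, y_pred):
    """Return (accuracy, balanced accuracy, macro-F1) for two label lists."""
    pairs = list(zip(y_true, y_pred))
    classes = sorted(set(y_true))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    recalls, f1_scores = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        # tp + fn > 0 because every class in `classes` occurs in y_true.
        recalls.append(tp / (tp + fn))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)

    balanced_accuracy = sum(recalls) / len(classes)
    macro_f1 = sum(f1_scores) / len(classes)
    return accuracy, balanced_accuracy, macro_f1


# Hypothetical 14:1 imbalance; the "model" always predicts the majority class 0.
y_true = [0] * 14 + [1]
y_pred = [0] * 15

acc, bacc, f1m = imbalance_metrics(y_true, y_pred)
print(f"{acc:.6f} {bacc:.6f} {f1m:.6f}")  # prints 0.933333 0.500000 0.482759
```

Under this reading, accuracy alone would overstate performance on the minority class, so the balanced accuracy and macro-F1 columns are likely the more informative basis for comparing the designs.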

Table 4. List of the 15 ESG indicators/criteria with their measurement units, abbreviations and optimization types (maximization or minimization).

ESG Indicators/Criteria (Unit)	Abbreviations (Type)
Environmental Protection Tax (10,000 RMB)	env_tax (Min)
Thermal Energy Use (tons of standard coal)	thermal_energy_use (Min)
GHG Emissions–Scope 3 (tCO₂e)	ghg_scope3 (Min)
Electricity Consumption per Revenue (MWh per million RMB)	electricity_per_rev (Min)
GHG Emissions–Scope 3 per Revenue (tCO₂e per million RMB)	ghg_scope3_per_rev (Min)
Per Capita Medical Insurance Expense (10,000 RMB/person)	med_insurance_per_capita (Max)
External Donations (10,000 RMB)	external_donations (Max)
Salaries, Bonuses, Allowances, and Subsidies (10,000 RMB)	salary_bonus_total (Max)
R&D Expense Growth Rate (%)	rnd_expense_growth_pct (Max)
R&D Expenses as % of Revenue (%)	rnd_expense_pct_rev (Max)
Number of Shareholders (persons)	shareholder_count (Max)
Proportion of Independent Directors (%)	indep_director_pct (Max)
Number of Board Directors with PhDs	board_phd_count (Max)
Shareholding Ratio of Top 10 Shareholders (%)	top10_shareholding_pct (Min)
Proportion of Female Board Directors (%)	board_female_pct (Max)
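
The Max/Min types in Table 4 determine which end of each indicator counts as ideal when the companies are ranked. The following is a minimal sketch of a standard TOPSIS calculation (the Hwang and Yoon formulation with vector normalization, assumed here rather than copied from the study's implementation). The function name `topsis_scores`, the three firms, their indicator values, and the equal weights are all hypothetical; only the two indicator abbreviations and their types come from Table 4.

```python
import math


def topsis_scores(matrix, weights, types):
    """Return the TOPSIS performance score P_i for each row of `matrix`.

    matrix  : list of rows, one per alternative, one column per criterion
    weights : criterion weights (assumed to sum to 1)
    types   : "Max" for benefit criteria, "Min" for cost criteria
    """
    n_crit = len(weights)
    # Vector normalization, then weighting (assumed normalization scheme).
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]

    # Positive ideal solution (PIS) and negative ideal solution (NIS).
    col = lambda j: [row[j] for row in v]
    pis = [max(col(j)) if types[j] == "Max" else min(col(j)) for j in range(n_crit)]
    nis = [min(col(j)) if types[j] == "Max" else max(col(j)) for j in range(n_crit)]

    scores = []
    for row in v:
        s_plus = math.sqrt(sum((row[j] - pis[j]) ** 2 for j in range(n_crit)))
        s_minus = math.sqrt(sum((row[j] - nis[j]) ** 2 for j in range(n_crit)))
        # P_i = S_i- / (S_i+ + S_i-); assumes the alternatives are not all identical.
        scores.append(s_minus / (s_plus + s_minus))
    return scores


# Hypothetical data: columns are env_tax (Min) and board_female_pct (Max).
firms = ["Firm A", "Firm B", "Firm C"]
data = [[10.0, 40.0], [20.0, 25.0], [30.0, 10.0]]
scores = topsis_scores(data, weights=[0.5, 0.5], types=["Min", "Max"])

for name, p in sorted(zip(firms, scores), key=lambda x: -x[1]):
    print(f"{name}: {p:.3f}")
```

In this toy case Firm A is best on both indicators, so it coincides with the PIS and scores 1, while Firm C coincides with the NIS and scores 0. Flipping a type from Min to Max would reverse which firm is treated as ideal on that indicator.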

Table 5. ESG indicator weights determined using the hybrid PCA-CRITIC weighting method.

ESG Indicator	PCA-Weights	CRITIC-Weights	PCA-CRITIC Hybrid Weights
env_tax	0.059872	0.087983	0.078599
thermal_energy_use	0.065053	0.049906	0.048440
ghg_scope3	0.073081	0.053834	0.058702
electricity_per_rev	0.057960	0.043104	0.037276
ghg_scope3_per_rev	0.066052	0.052562	0.051803
med_insurance_per_capita	0.056863	0.055636	0.047204
external_donations	0.071555	0.083095	0.088718
salary_bonus_total	0.073465	0.061538	0.067455
rnd_expense_growth_pct	0.071358	0.073622	0.078387
rnd_expense_pct_rev	0.062755	0.059325	0.055549
shareholder_count	0.073128	0.053015	0.057846
indep_director_pct	0.063074	0.053413	0.050268
board_phd_count	0.068408	0.082675	0.084387
top10_shareholding_pct	0.061784	0.093490	0.086185
board_female_pct	0.075591	0.096802	0.109181
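
The hybrid column in Table 5 is numerically consistent with multiplying each indicator's PCA weight by its CRITIC weight and renormalizing so that the results sum to one. The sketch below checks that reading against the tabulated values. The combination rule is inferred from the numbers in the table and should be treated as an assumption rather than a restatement of the paper's formula; the function name `hybrid_weights` is illustrative.

```python
# Weights copied from Table 5, in row order (env_tax ... board_female_pct).
pca = [0.059872, 0.065053, 0.073081, 0.057960, 0.066052,
       0.056863, 0.071555, 0.073465, 0.071358, 0.062755,
       0.073128, 0.063074, 0.068408, 0.061784, 0.075591]
critic = [0.087983, 0.049906, 0.053834, 0.043104, 0.052562,
          0.055636, 0.083095, 0.061538, 0.073622, 0.059325,
          0.053015, 0.053413, 0.082675, 0.093490, 0.096802]
reported = [0.078599, 0.048440, 0.058702, 0.037276, 0.051803,
            0.047204, 0.088718, 0.067455, 0.078387, 0.055549,
            0.057846, 0.050268, 0.084387, 0.086185, 0.109181]


def hybrid_weights(w_a, w_b):
    """Element-wise product of two weight vectors, renormalized to sum to 1."""
    products = [a * b for a, b in zip(w_a, w_b)]
    total = sum(products)
    return [p / total for p in products]


hybrid = hybrid_weights(pca, critic)

# Largest absolute gap between the recomputed and the tabulated hybrid weights.
max_dev = max(abs(h - r) for h, r in zip(hybrid, reported))
print(f"max deviation from Table 5: {max_dev:.2e}")
```

A multiplicative rule of this kind rewards indicators that both methods rate highly: board_female_pct has the largest PCA and CRITIC weights and therefore the largest hybrid weight, while electricity_per_rev is low on both and ends up lowest.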

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, Z.; Lim, T.; Teng, Y.; Xia, C. A Data-Driven Machine Learning Framework for Multi-Criteria ESG Evaluation. Big Data Cogn. Comput. 2026, 10, 130. https://doi.org/10.3390/bdcc10050130
