ESG-Graph: Hierarchical Residual Graph Attention Network with Analyst-Defined ESG Taxonomy

Elouargui, Yasser; Sassioui, Abdellatif; Chergui, Meriyem; Benouini, Rachid; El Kamili, Mohamed; Benyoussef, Elmehdi; Ouzzif, Mohammed

doi:10.3390/technologies14050258


by Yasser Elouargui ¹,²,*, Abdellatif Sassioui ¹,², Meriyem Chergui ¹, Rachid Benouini ², Mohamed El Kamili ¹, Elmehdi Benyoussef ² and Mohammed Ouzzif ¹

¹ C3S Laboratory, Higher School of Technology, Hassan II University, Casablanca 20010, Morocco
² Leyton, Casablanca 20190, Morocco
* Author to whom correspondence should be addressed.

Technologies 2026, 14(5), 258; https://doi.org/10.3390/technologies14050258

Submission received: 9 February 2026 / Revised: 11 April 2026 / Accepted: 14 April 2026 / Published: 25 April 2026

(This article belongs to the Section Information and Communication Technologies)


Abstract

Environmental, Social, and Governance (ESG) text classification is important for applications in sustainable finance. However, it remains a challenging task due to domain terminology and regulatory constraints. While transformer-based models achieve strong predictive performance, they often incur high energy costs and provide limited interpretability. To address these limitations, we introduce ESG-Graph, a lightweight and interpretable graph-based framework for modeling ESG disclosures. In our approach, each sentence is represented as a token-level dependency graph augmented with virtual nodes initialized from a European Sustainability Reporting Standards (ESRS)-based taxonomy, enabling the addition of new ESG concepts without retraining. A multi-layer Graph Attention Network is used instead of transformer encoders, allowing grammatical structure and domain semantics to be modeled jointly. Experiments on three ESG benchmark datasets show that ESG-Graph achieves performance comparable to efficient transformer baselines while consuming up to 60× less energy and using 10× fewer parameters. Additional attribution and ablation studies support the method’s policy alignment, interpretability, and robustness.

Keywords:

Graph Neural Network (GNN); ESG text classification; token-level dependency graph; European Sustainability Reporting Standards (ESRS); environmental, social, and governance (ESG); explainable AI (XAI); sustainability reporting

1. Introduction

The accelerating demand for sustainability transparency has elevated Environmental, Social, and Governance (ESG) reporting from a voluntary practice to a regulatory mandate. The European Union’s Sustainable Finance Disclosure Regulation (SFDR) and the Corporate Sustainability Reporting Directive (CSRD) alone now generate tens of millions of narrative words each year, describing corporate climate policies, labor practices, and governance frameworks [1,2]. These disclosures, although rich in qualitative information, are produced at a scale that exceeds human analytical capacity. At the same time, the annotation budget available for training automated classifiers remains limited in practice: domain-specific annotation requires expert knowledge and incurs significant cost, even when assisted by large language models [3,4,5]. This constraint motivates the low-data evaluation regime adopted in this work.

As a result, investment managers, auditors, and regulators must rely on automated text classification systems to process ESG disclosures efficiently under strict computational, operational, and regulatory constraints.

In response to this need, current state-of-the-art systems are predominantly based on large pretrained transformer models. Architectures such as BERT [6] and RoBERTa have become central to modern natural language processing, and their success has extended to ESG applications through domain-specific variants such as ESG-BERT [4]. By leveraging contextual representations, these models capture domain-relevant semantic patterns and establish strong empirical baselines for ESG classification tasks.

However, despite their strong performance, transformer-based approaches present several limitations in ESG contexts. First, their quadratic computational complexity with respect to sequence length [7] makes them unsuitable for long disclosures or real-time deployment on limited hardware. Second, they operate as black-box models, limiting interpretability and auditability [8]. Third, their substantial energy consumption raises concerns, as it contradicts the sustainability objectives that ESG analysis aims to support [9,10]. Finally, these models treat text as linear token sequences, overlooking the inherent syntactic dependencies and structured domain knowledge embedded in ESG disclosures.

To mitigate these limitations, prior work has explored more efficient transformer variants. Knowledge distillation methods such as DistilBERT [11] and TinyBERT [12] reduce model size while preserving predictive performance. In parallel, architectures such as Longformer [13] and FNet [14] improve computational efficiency through sparse attention and alternative transformations.

Nevertheless, these approaches remain grounded in sequential token processing and do not explicitly model linguistic structure or domain knowledge, limiting their interpretability and alignment with regulatory requirements.

This limitation motivates alternative representations. Graph Neural Networks (GNNs) provide a flexible framework for modeling relational data. In NLP, text can be represented as a graph where nodes correspond to linguistic units and edges encode syntactic or semantic relationships [15,16]. Through message passing, GNNs capture long-range dependencies and compositional structure beyond positional encoding.

However, despite these advantages, existing graph-based approaches have seen limited adoption in ESG applications, particularly in integrating structured regulatory knowledge.

Beyond structural modeling, integrating external domain knowledge can enhance robustness and interpretability. In this setting, symbolic knowledge refers to structured representations such as ontologies, knowledge graphs, or curated lexicons that encode relationships between domain concepts. In ESG analysis, these representations formalize regulatory notions (e.g., environmental risk, governance practices, or social impact) into machine-readable structures aligned with reporting standards.

Prior work shows that such knowledge can ground predictions in analyst-defined frameworks [17,18]. This perspective aligns with neuro-symbolic AI, where statistical models are guided by structured priors to improve interpretability and consistency with expert reasoning.

In ESG, such knowledge is encoded through evolving taxonomies aligned with standards such as the European Sustainability Reporting Standards (ESRS).

However, most existing models, both transformer-based and graph-based, remain weakly connected to these structured knowledge sources, limiting their ability to produce interpretable and policy-aligned predictions.

To identify the research gap, we adopt a problem-driven analytical decomposition of ESG text classification. Rather than organizing prior work solely by model families, we analyze existing approaches through the fundamental requirements imposed by ESG applications.

Specifically, we consider three key dimensions: (i) computational efficiency, driven by large-scale disclosures and deployment constraints, (ii) structural modeling, required to capture syntactic and relational dependencies in text, and (iii) domain knowledge integration, necessary for alignment with regulatory frameworks and analyst-defined concepts.

We examine representative approaches, including transformer-based models, efficient architectures, graph-based methods, and knowledge-enhanced systems through these dimensions. This unified perspective enables a structured comparison of otherwise heterogeneous methods and provides a transparent basis for identifying the research gap, as summarized in Table 1.

As shown in Table 1, no existing approach simultaneously achieves efficiency, structural expressiveness, and domain alignment. This fragmented landscape reveals a clear research gap: the absence of a unified, lightweight framework capable of integrating relational structure and regulatory knowledge while maintaining computational efficiency.

This work is guided by the following research question: how can ESG text classification be designed to simultaneously achieve computational efficiency, interpretability, and alignment with evolving regulatory frameworks?

To address this gap, we argue that ESG text is inherently graph-structured. For example, phrases such as “reducing carbon emissions under board supervision” encode explicit dependencies between environmental and governance concepts.

Building on this observation, we propose ESG-Graph, a lightweight and interpretable graph neural architecture that models ESG disclosures as token-level dependency graphs augmented with ESRS-aligned taxonomy nodes. This design enables the joint modeling of syntactic structure and regulatory semantics, resulting in a system that is efficient, interpretable, and aligned with ESG reporting requirements. Importantly, ESG-Graph does not introduce a fundamentally new learning paradigm; rather, it integrates graph-based modeling and domain-specific knowledge within a unified architecture tailored to ESG constraints.

From a theoretical perspective, this work contributes to three complementary research streams. First, it advances neuro-symbolic learning by embedding domain-specific regulatory knowledge directly into graph-based neural architectures, enabling structured reasoning over ESG concepts. Second, it contributes to explainable AI (XAI) in financial text analysis by providing transparent, node-level representations aligned with analyst-defined taxonomies. Third, it extends representation learning for structured text by demonstrating how syntactic dependencies and domain knowledge can be jointly encoded as inductive biases within graph neural networks.

Together, these contributions frame ESG text classification as a structured reasoning problem over hybrid symbolic–neural representations, where regulatory concepts act as explicit constraints guiding model behavior.

Our contributions are as follows:

We introduce ESG-Graph, a graph neural architecture that jointly models syntactic dependencies and ESG domain concepts.
We propose an evolutive ESG taxonomy, allowing dynamic integration of new regulatory concepts without retraining.
We demonstrate competitive performance with efficient transformer baselines while using significantly fewer parameters and up to 60× less energy.
We provide transparent, policy-aligned explanations through gradient-based attribution mechanisms.

Building on these contributions, the remainder of this paper is structured as follows. Section 2 presents the materials and methods. Section 3 reports the results and discussion. Finally, Section 4 concludes the paper.

2. Materials and Methods

This section describes the proposed ESG-Graph framework in detail, including the overall architecture, the construction of ESG-aware graph representations, and the graph neural processing used for classification. The presentation follows the full pipeline from raw text input to final ESG relevance prediction.

2.1. Architecture Overview

Figure 1 provides a visual summary of the proposed ESG-Graph architecture, illustrating the interaction between linguistic preprocessing, ESG-aware graph construction, and graph neural message passing. All architectural components presented in the figure are explained in the following subsections.

Our approach converts ESG-related text segments into structured token-level graphs enhanced with domain knowledge provided by an analyst-defined ESG taxonomy aligned with the European Sustainability Reporting Standards (ESRS).

The model integrates four components:

(i): Syntactic dependency relations capturing the grammatical structure of the sentence.
(ii): ESG subtopic anchor nodes encoding high-level thematic categories curated by domain analysts and aligned with ESRS disclosure scopes.
(iii): A sentence-level ESG node aggregating all ESG-relevant features detected in the text.
(iv): A multilayer Graph Attention Network (GAT) performing message passing to produce an ESG relevance prediction.

2.2. Analyst-Defined ESG Taxonomy for Domain-Aware Subtopic Annotation

To inject ESG knowledge into the graph, an ESG taxonomy is curated by in-house analysts to reflect regulatory concepts aligned with ESRS reporting scopes.

The ESRS framework provides a conceptual structure for the Environmental, Social, and Governance pillars, including scopes such as Climate Change (E1), Pollution (E2), Own Workforce (S1), Consumers and End users (S4), and Governance, Risk Management and Internal Control (G1). Analysts integrate these scopes by associating each subtopic with defined keyword sets to represent regulatory terminology and ESG concepts.

Each subtopic is associated with specific terms: for example, emission, carbon, and renewable for Climate Change; diversity, training, and safety for Own Workforce; and audit, compliance, and oversight for Governance. Representative examples appear in Table 2, and the complete taxonomy is provided in Appendix A.

During graph construction, tokens matching a subtopic keyword are connected to their corresponding subtopic anchor nodes, which act as local aggregators. All subtopic nodes are then linked to a global ESG node summarizing the sustainability content of the entire document. This design enables the model to integrate ESRS-aligned thematic structure.

The analyst-defined ESG taxonomy has three main functions. First, it injects domain knowledge aligned with regulatory frameworks while remaining flexible to evolving disclosure practices. Second, it enhances interpretability, as downstream analyses such as node importance can be expressed directly in terms of ESG subtopics rather than uninterpretable embedding dimensions. Third, it provides weak supervision signals that improve the model’s generalization ability. Importantly, the taxonomy is evolutive by design, allowing new keywords and emerging ESG themes to be incorporated without modifying the underlying model architecture.

The keyword sets were constructed through a team-based process in which analysts independently proposed keywords for each subtopic without prior consultation. A keyword was retained in the final taxonomy only if it appeared in the lists of a majority of analysts; keywords proposed by a single analyst were systematically rejected unless unanimously reinstated upon group review. This intersection-based acceptance criterion approximates a consensus-based annotation protocol, ensuring that the final vocabulary reflects shared professional judgment rather than any individual interpretation, thereby reducing idiosyncratic bias.
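The majority-based retention rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `consensus_keywords`, the lowercase normalization, and the example proposals are hypothetical, and the group-review step that can reinstate a single-analyst keyword is not modeled.

```python
from collections import Counter

def consensus_keywords(proposals):
    """Keep keywords proposed by a strict majority of analysts.

    `proposals` holds one keyword collection per analyst (hypothetical input).
    """
    n_analysts = len(proposals)
    # Count each keyword at most once per analyst, after lowercasing.
    counts = Counter(
        kw for analyst in proposals for kw in {k.lower() for k in analyst}
    )
    return {kw for kw, c in counts.items() if c > n_analysts / 2}

# Made-up proposals from three analysts for one subtopic.
proposals = [
    {"emission", "carbon", "renewable"},
    {"emission", "carbon", "weather"},
    {"Emission", "renewable", "offset"},
]
print(sorted(consensus_keywords(proposals)))
```

In this toy input, "weather" and "offset" are each proposed by a single analyst and are therefore dropped, while the other three keywords reach a two-out-of-three majority.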

Adapting the taxonomy to alternative reporting frameworks such as SASB or TCFD requires updating the keyword sets and subtopic labels, without requiring retraining of the core GNN architecture. In practice, a major framework transition involves three steps: (i) aligning existing ESRS subtopics with semantically equivalent concepts in the target framework, leveraging the substantial conceptual overlap across ESG standards; (ii) updating keyword sets for subtopics with no direct equivalent; and (iii) rerunning graph construction with the updated vocabulary, a fully automated step. Steps (i) and (ii) are estimated to require approximately 2 to 4 person-days of analyst effort for a complete framework transition.

2.3. Processing Pipeline

Given an input sentence, the model pipeline proceeds in three steps:

Preprocessing: Tokens are extracted and mapped to pretrained embeddings.
Graph Construction: A syntactic dependency graph is generated and enriched with ESRS subtopic nodes and a global ESG node, producing a hybrid graph that combines linguistic structure with domain knowledge.
Graph Neural Processing: Multilayer attention-based message passing propagates information across token, subtopic, and global nodes. Pooled graph embeddings are passed to a feed-forward classifier for binary ESG relevance prediction.

This methodology combines grammatical structure with ESG domain knowledge to provide a domain-aware text classifier.

2.4. Preprocessing

Each input sentence is tokenized into non-stopword units

$$ s = (w_{1}, \dots, w_{n}), \tag{1} $$

and each token $w_{i}$ is embedded via a pretrained GloVe vector [19]

$$ g_{i} \in \mathbb{R}^{d_{g}}. \tag{2} $$

This produces the initial feature matrix

$$ X_{0} = [g_{1}; \dots; g_{n}] \in \mathbb{R}^{n \times d_{g}}. \tag{3} $$

All nodes in the graph, including tokens, subtopic anchors, and the sentence-level ESG anchor, share this same dimensionality for architectural homogeneity.
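As a rough illustration of Equations (1)-(3), the sketch below builds the feature matrix from a toy embedding table. The stopword list, the 4-dimensional vectors, whitespace tokenization, and the zero vector for out-of-vocabulary tokens are all assumptions made for the example; the text does not specify how these cases are handled, and a real pipeline would load pretrained GloVe vectors.

```python
# Hypothetical stopword list and toy vectors standing in for GloVe.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}
D_G = 4  # toy embedding size standing in for d_g
TOY_VECTORS = {
    "carbon": [0.2, -0.1, 0.4, 0.0],
    "emissions": [0.3, 0.1, 0.5, -0.2],
    "board": [-0.4, 0.2, 0.0, 0.1],
}

def preprocess(sentence):
    """Return the non-stopword tokens and their feature matrix X0 (n x d_g)."""
    tokens = [w for w in sentence.lower().split() if w not in STOPWORDS]
    # Assumption: out-of-vocabulary tokens receive a zero vector.
    x0 = [list(TOY_VECTORS.get(w, [0.0] * D_G)) for w in tokens]
    return tokens, x0

tokens, x0 = preprocess("Reducing the carbon emissions")
print(tokens)
print(len(x0), len(x0[0]))  # number of rows n and number of columns d_g
```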

2.5. Graph Construction

2.5.1. Syntactic Dependency Graph

We parse the sentence using a dependency parser [20] and create bidirectional edges for each head-dependent pair

$$ E_{dep} = \{(h, d), (d, h) \mid (h, d) \in T(s)\}, \tag{4} $$

where $T(s)$ denotes the set of head-dependent arcs in the parse of $s$. If the parser fails, we fall back to a bidirectional chain graph

$$ E_{chain} = \{(i, i + 1), (i + 1, i)\}_{i = 1}^{n - 1}. \tag{5} $$
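A minimal sketch of Equations (4) and (5) follows, using 0-based token indices instead of the 1-based indices of the text. It takes a plain list of head indices rather than calling a parser, so that it stays self-contained. The function names are hypothetical, and treating a missing or wrongly sized head list as a parser failure is an assumption, since the text does not define what counts as a failure.

```python
def dependency_edges(heads):
    """Bidirectional edges from a head array, as in Eq. (4).

    heads[d] is the index of the head of token d. The root either points to
    itself (the convention used by spaCy) or is marked with -1.
    """
    edges = set()
    for d, h in enumerate(heads):
        if h == d or h < 0:
            continue  # the root has no incoming arc
        edges.add((h, d))
        edges.add((d, h))
    return edges

def chain_edges(n):
    """Bidirectional chain over n tokens, as in Eq. (5)."""
    edges = set()
    for i in range(n - 1):
        edges.add((i, i + 1))
        edges.add((i + 1, i))
    return edges

def syntactic_edges(n, heads=None):
    """Use dependency edges when a usable parse exists, else the chain."""
    if heads is None or len(heads) != n:  # assumed failure condition
        return chain_edges(n)
    return dependency_edges(heads)

# Three tokens whose root is token 1.
print(sorted(syntactic_edges(3, heads=[1, 1, 1])))
# No parse available: fall back to the chain graph.
print(sorted(syntactic_edges(3)))
```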

2.5.2. ESRS Subtopic Anchor Nodes

Each ESG subtopic $t$ has a curated lexicon $L_{t}$. For each subtopic, if any token matches its lexicon, we create a subtopic anchor node $v_{t}$ and connect it bidirectionally to all matched tokens

$$ E_{t} = \{(v_{t}, i), (i, v_{t}) \mid w_{i} \in L_{t}\}. \tag{6} $$

The embedding of the anchor node is computed as the mean vector of its matched tokens

$$ x_{v_{t}} = \frac{1}{|\{i : w_{i} \in L_{t}\}|} \sum_{i : w_{i} \in L_{t}} g_{i}. \tag{7} $$
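The anchor construction of Equations (6) and (7) could be sketched as follows. The helper names, the 2-dimensional toy embeddings, and the example lexicons are hypothetical. Matching by exact string equality is also an assumption, since the text does not say whether tokens are lemmatized before lookup.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def add_subtopic_anchors(tokens, x, lexicons):
    """Create one anchor per matched subtopic.

    Returns (anchors, features, edges): anchors maps a subtopic to its node
    index, features lists anchor embeddings in creation order, and edges
    holds the bidirectional anchor-token links.
    """
    n = len(tokens)
    anchors, features, edges = {}, [], set()
    for topic, lexicon in lexicons.items():
        matched = [i for i, w in enumerate(tokens) if w in lexicon]
        if not matched:
            continue  # no anchor node for a subtopic without matches
        node = n + len(anchors)  # anchors are numbered after the tokens
        anchors[topic] = node
        features.append(mean_vector([x[i] for i in matched]))
        for i in matched:
            edges.add((node, i))
            edges.add((i, node))
    return anchors, features, edges

tokens = ["carbon", "emissions", "board", "oversight"]
x = [[1.0, 0.0], [3.0, 2.0], [0.0, 1.0], [0.0, 3.0]]
lexicons = {  # made-up excerpts, not the taxonomy of Appendix A
    "E1": {"carbon", "emissions"},
    "G1": {"oversight", "audit"},
    "S1": {"diversity"},
}
anchors, features, edges = add_subtopic_anchors(tokens, x, lexicons)
print(anchors, features)
```

In this example only E1 and G1 receive anchors; S1 has no matching token and therefore adds no node.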

2.5.3. Sentence Level ESG Anchor Node

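To aggregate sustainability cues, we introduce a sentence-level ESG anchor node $v^{ESG}$ with embedding

$$ x_{v^{ESG}} = \frac{1}{|I|} \sum_{i \in I} g_{i}, \qquad I = \{i : w_{i} \in \cup_{t} L_{t}\}. \tag{8} $$

This node is connected to all tokens

$$ E_{ESG} = \{(v^{ESG}, i), (i, v^{ESG}) \mid i = 1, \dots, n\}. \tag{9} $$

The final augmented graph is

$$ \tilde{G} = (\tilde{V}, \tilde{E}), \qquad \tilde{V} = V \cup \{v_{t}\}_{t} \cup \{v^{ESG}\}, \tag{10} $$

where $\tilde{E} = E_{dep} \cup \bigcup_{t} E_{t} \cup E_{ESG}$ collects the edge sets of Equations (4), (6), and (9), with $E_{chain}$ replacing $E_{dep}$ when the fallback is used.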
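Equations (8) and (9) can be illustrated with the short sketch below. The function name and argument layout are hypothetical. Returning `None` when no token matches any lexicon follows the guard in Algorithm 1 (line 27), which leaves the mean of Equation (8) undefined for an empty index set.

```python
def add_esg_anchor(n_tokens, x, matched, node_id):
    """Create the sentence-level ESG anchor.

    matched: indices of tokens found in any subtopic lexicon (the set I).
    node_id: index to assign to the new anchor node.
    Returns (node_id, embedding, edges), or None when I is empty.
    """
    if not matched:
        return None
    idx = sorted(matched)
    dim = len(x[0])
    # Eq. (8): mean embedding over the matched tokens only.
    embedding = [sum(x[i][k] for i in idx) / len(idx) for k in range(dim)]
    # Eq. (9): the anchor is linked to every token, matched or not.
    edges = set()
    for i in range(n_tokens):
        edges.add((node_id, i))
        edges.add((i, node_id))
    return node_id, embedding, edges

x = [[1.0, 0.0], [3.0, 2.0], [0.0, 1.0]]
print(add_esg_anchor(3, x, {0, 1}, node_id=5))
print(add_esg_anchor(3, x, set(), node_id=5))
```

Note the asymmetry the equations imply: the embedding averages only ESG-matched tokens, while the edges reach all tokens in the sentence.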

2.6. Graph Neural Processing

At each layer $l$, node $i$ aggregates information from its neighbors using a graph attention mechanism [21]

$$ h_{i}^{(l)} = \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} W^{(l)} h_{j}^{(l - 1)}. \tag{11} $$
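Equation (11) leaves the attention coefficients implicit. Assuming the standard single-head formulation of the graph attention network in [21], they would be computed as below. This is an assumption: the text does not state the number of heads, the attention variant, or the nonlinearity applied after aggregation. The symbol $\mathbf{a}^{(l)}$ (a learned attention vector) and the concatenation operator $\Vert$ do not appear in the original.

```latex
e_{ij}^{(l)} = \mathrm{LeakyReLU}\!\left( \mathbf{a}^{(l)\top}
  \left[ W^{(l)} h_{i}^{(l-1)} \,\Vert\, W^{(l)} h_{j}^{(l-1)} \right] \right),
\qquad
\alpha_{ij}^{(l)} = \frac{\exp\!\big(e_{ij}^{(l)}\big)}
  {\sum_{k \in \mathcal{N}(i)} \exp\!\big(e_{ik}^{(l)}\big)}
```

Under this formulation the coefficients of each node sum to one over its neighborhood, so Equation (11) is a learned weighted average of the transformed neighbor states.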

After $L$ layers, the final graph representation is obtained by mean pooling

$$ z = \frac{1}{|\tilde{V}|} \sum_{v \in \tilde{V}} h_{v}^{(L)}. \tag{12} $$

The classifier outputs a binary prediction

$$ \hat{y} = \sigma(W_{c} z + b_{c}). \tag{13} $$
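The readout and classifier of Equations (12) and (13) reduce to a mean over node states followed by a logistic unit. The sketch below uses plain Python lists and made-up weights, and treats $W_{c}$ as a single weight vector because the output is one probability. It does not reproduce the trained model or the feed-forward classifier mentioned in Section 2.3.

```python
import math

def mean_pool(h):
    """Eq. (12): average the final node states over all nodes."""
    n = len(h)
    return [sum(col) / n for col in zip(*h)]

def predict(h, w_c, b_c):
    """Eq. (13): sigmoid of a linear map applied to the pooled vector."""
    z = mean_pool(h)
    logit = sum(w * v for w, v in zip(w_c, z)) + b_c
    return 1.0 / (1.0 + math.exp(-logit))

# Two nodes with 2-dimensional final states (toy values).
h_final = [[1.0, 2.0], [3.0, 4.0]]
print(mean_pool(h_final))
print(predict(h_final, w_c=[1.0, -1.0], b_c=1.0))
```

With these toy values the pooled vector is [2.0, 3.0] and the logit is 2 - 3 + 1 = 0, so the predicted probability is 0.5.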

2.7. Algorithm Summary

The full processing pipeline for converting a sentence into an ESG-structured graph and producing a classification output is summarized in Algorithm 1.

Algorithm 1 Token-level ESG graph construction and classification

Require: Sentence $s = (w_{1}, \dots, w_{n})$; ESRS lexicons $\{L_{t}\}$
Ensure: Predicted ESG label $\hat{y}$
1: Token Embedding:
2: for each token $w_{i}$ in $s$ do
3:     $g_{i} \leftarrow \mathrm{GloVe}(w_{i})$
4: end for
5: Syntactic Graph Structure:
6: Build dependency edges $E_{dep}(s)$
7: if parser fails then
8:     $E_{dep} \leftarrow E_{chain}$
9: end if
10: Initialize Graph:
11: $V \leftarrow \{1, \dots, n\}$ token nodes
12: $E \leftarrow E_{dep}$
13: Subtopic Anchor Nodes:
14: for each ESRS subtopic $t$ do
15:     $I_{t} \leftarrow \{i \mid w_{i} \in L_{t}\}$
16:     if $|I_{t}| > 0$ then
17:         create subtopic anchor node $v_{t}$
18:         $x_{v_{t}} \leftarrow \frac{1}{|I_{t}|} \sum_{i \in I_{t}} g_{i}$
19:         for each $i \in I_{t}$ do
20:             add edges $(v_{t}, i)$ and $(i, v_{t})$ to $E$
21:         end for
22:         $V \leftarrow V \cup \{v_{t}\}$
23:     end if
24: end for
25: Sentence-Level ESG Anchor Node:
26: $I \leftarrow \bigcup_{t} I_{t}$
27: if $|I| > 0$ then
28:     create sentence-level ESG anchor node $v^{ESG}$
29:     $x_{v^{ESG}} \leftarrow \frac{1}{|I|} \sum_{i \in I} g_{i}$
30:     for $i = 1$ to $n$ do
31:         add edges $(v^{ESG}, i)$ and $(i, v^{ESG})$ to $E$
32:     end for
33:     $V \leftarrow V \cup \{v^{ESG}\}$
34: end if
35: Graph Neural Network:
36: $H \leftarrow \mathrm{GAT}_{L}(V, E)$ {L-layer GAT}
37: Graph Readout:
38: $z \leftarrow \frac{1}{|V|} \sum_{v \in V} h_{v}^{(L)}$
39: Prediction:
40: $\hat{y} \leftarrow \sigma(W_{c} z + b_{c})$

2.8. Experimental Setup

This section describes the datasets, hardware environment, and training protocol used to evaluate ESG-Graph.

2.8.1. Datasets

We evaluate our models on three equally sized subsets of the ESG-BERT benchmark corpus [4]: environment_2k, social_2k, and governance_2k. Each subset contains 2000 labeled short text segments corresponding to one of the three ESG dimensions: Environmental (E), Social (S), or Governance (G). For each subset, the task is formulated as a binary classification problem, where the model predicts whether a given text segment is related to the target ESG pillar.

The datasets consist of short documents extracted from corporate reports and regulatory disclosures, including policy statements, sustainability sections, and governance-related mentions. Table 3 summarizes the token length statistics.

2.8.2. Graph Construction Robustness

Across all three benchmark datasets (6000 documents), the dependency parser successfully processed 99.85% of sentences. The chain-graph fallback was triggered for only 9 documents (0.15%), corresponding to short or structurally malformed fragments in the validation splits. These cases were distributed as follows: 4 in Environmental (0.20%), 3 in Governance (0.15%), and 2 in Social (0.10%).
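The reported fallback rates follow directly from the counts, as the short arithmetic check below shows. It uses only the figures stated above (2000 documents per dataset and 4, 3, and 2 fallback cases).

```python
DOCS_PER_DATASET = 2000
fallbacks = {"Environmental": 4, "Governance": 3, "Social": 2}

# Per-dataset fallback rate in percent.
for name, count in fallbacks.items():
    print(f"{name}: {100 * count / DOCS_PER_DATASET:.2f}%")

# Overall rate across the three datasets.
total = sum(fallbacks.values())
overall = 100 * total / (DOCS_PER_DATASET * len(fallbacks))
print(f"Overall: {total} fallbacks, {overall:.2f}% of all documents")
print(f"Parsed successfully: {100 - overall:.2f}%")
```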

To assess whether fallback usage impacts predictive performance, we compared model outputs on fallback cases against standard parsed sentences. No systematic degradation in classification performance was observed, and the contribution of fallback samples to the overall evaluation metrics was negligible due to their extremely low frequency.

This negligible fallback rate, combined with the absence of measurable performance degradation, confirms that the dependency parser operates reliably on ESG disclosure text and that the fallback mechanism serves as a rarely-invoked safety net rather than a primary processing path.

2.8.3. Scalability Considerations

While experiments are conducted on 2000-document subsets, ESG-Graph operates at the sentence level and processes each sentence as an independent graph. This design allows linear scaling to larger corpora, as the number of nodes and edges grows proportionally with input size without increasing model complexity.

In practice, large-scale ESG corpora can be processed efficiently through parallelized graph construction and batched inference, suggesting that the framework can be extended to collections containing millions of disclosures.

2.8.4. Hardware and Computational Environment

All experiments were conducted on a single laptop equipped with an NVIDIA RTX 4070 Laptop GPU (8 GB VRAM) and 60 GB of system memory. By limiting training to an 8 GB GPU, we demonstrate that ESG text classification can be achieved on cost-efficient hardware.

All experiments were implemented in Python (v3.10) using PyTorch (v2.1) and PyTorch Geometric (v2.4). Dependency parsing was performed using spaCy (v3.7). The CodeCarbon framework (v3.0.7) was used to measure energy consumption and carbon emissions.

2.8.5. Training Configuration

We used the same training pipeline for ESG-Graph and all transformer baselines to ensure a fair comparison between models. Training was performed using mini-batches of 16 documents over 15 epochs. Optimization was carried out with the AdamW optimizer [22], using an initial learning rate of $2 \times 10^{-4}$ and a weight decay of $1 \times 10^{-4}$ to improve convergence and regularization. A linear learning rate decay schedule was applied throughout training to further improve optimization stability.

Model performance was monitored using the validation F1 score, and early stopping was triggered as soon as one epoch passed without improvement (a patience of one epoch). To account for randomness, each experiment was repeated three times using different random seeds (11, 17, and 23), and the reported results are averaged across these runs.
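The schedule and stopping rule can be sketched independently of any deep learning framework. The two helpers below are hypothetical. Decaying linearly to zero with no warmup is an assumption, since the text only states that a linear decay was used. The stopping rule halts after a single epoch whose validation F1 fails to improve on the best value so far.

```python
def linear_decay_lr(step, total_steps, base_lr=2e-4):
    """Learning rate decayed linearly from base_lr to zero (assumed endpoint)."""
    return base_lr * max(0.0, 1.0 - step / total_steps)

def early_stopping_epoch(val_f1_per_epoch, max_epochs=15, patience=1):
    """Return (best_epoch, best_f1) for a sequence of validation F1 scores."""
    best_f1, best_epoch, stale = float("-inf"), -1, 0
    for epoch, f1 in enumerate(val_f1_per_epoch[:max_epochs]):
        if f1 > best_f1:
            best_f1, best_epoch, stale = f1, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # stop after `patience` epochs without improvement
    return best_epoch, best_f1

print(linear_decay_lr(0, 100), linear_decay_lr(50, 100), linear_decay_lr(100, 100))
# Made-up validation curve: training stops at the first non-improving epoch.
print(early_stopping_epoch([0.80, 0.85, 0.84, 0.90]))
```

With the made-up curve above, training stops at the third epoch and the second epoch is kept as the best checkpoint, so the later value of 0.90 is never reached. A patience of one makes the procedure sensitive to a single noisy validation score.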

3. Results and Discussion

This section presents a comprehensive empirical evaluation of ESG-Graph, analyzing its predictive performance, computational efficiency, and robustness in comparison with transformer-based baselines and large language models.

We begin by evaluating predictive performance across ESG domains.

3.1. Dataset-Wise Benchmarks

We compare ESG-Graph with lightweight Transformer baselines (TinyBERT, DistilBERT, FNet, and Longformer) and zero-shot large language models (GPT-4o and Gemini-2.5) across the three ESG datasets to assess both predictive performance and model efficiency. All models are evaluated under identical conditions on balanced 2000-document splits to ensure fair comparison. Reported metrics include Accuracy and F1-score, together with the approximate number of trainable parameters and effective sequence length for each architecture. Training efficiency is analyzed separately in terms of wall-clock epoch time and peak GPU memory usage.

3.1.1. Environmental Dataset

The Environmental dataset focuses on climate-related disclosures, energy usage, emissions, and sustainability practices. Results are summarized in Table 4. ESG-Graph attains an accuracy of 93.3% with an F1-score of 92.6%, while requiring only ∼1.65 M parameters. While Longformer achieves the highest overall accuracy and F1-score, ESG-Graph offers a strong balance between predictive performance and model efficiency.

This result indicates that ESG-Graph achieves near state-of-the-art performance while operating under significantly lower model complexity, confirming the effectiveness of structural and domain-aware representations.

3.1.2. Social Dataset

The Social dataset includes disclosures on workforce, diversity, community engagement, and social impact. Results are summarized in Table 5. ESG-Graph achieves an accuracy of 88.5% with an F1-score of 88.1%, while using only ∼1.65 M parameters. Although larger Transformer-based models attain slightly higher peak F1 score, ESG-Graph provides a favorable trade-off between predictive performance and training cost.

The lower performance observed across all models on this dataset reflects the inherently ambiguous and context-dependent nature of social disclosures, which are less lexically standardized than environmental or governance reporting.

3.1.3. Governance Dataset

The Governance dataset focuses on compliance, transparency, and ethical governance disclosures. Results are summarized in Table 6. ESG-Graph achieves an accuracy of 87.0% and an F1-score of 83.2%, while requiring only ∼1.65 M parameters, outperforming all baseline methods in this dataset.

This strong performance can be attributed to the more formalized and consistent vocabulary used in governance disclosures, which aligns closely with the G1 taxonomy and enables precise subtopic anchoring.

Across all three datasets, a consistent pattern emerges: ESG-Graph maintains competitive or superior performance relative to transformer baselines while requiring orders of magnitude fewer parameters.

Zero-shot LLMs exhibit variable performance across datasets, reflecting limited robustness in low-data ESG classification settings. In contrast, ESG-Graph delivers consistently competitive Accuracy and F1-scores while maintaining an extremely lightweight architecture.

The performance differential across ESG pillars reflects domain-specific linguistic properties. Governance disclosures rely on a compact, unambiguous regulatory vocabulary that aligns precisely with the G1 taxonomy lexicon, enabling highly discriminative anchor node matching and explaining why ESG-Graph outperforms all baselines on this pillar. Social disclosures, by contrast, are expressed in more varied narrative forms: keywords such as employee or community frequently appear in non-ESG corporate communication, reducing anchor node precision. This lexical ambiguity is intrinsic to the Social domain and explains the consistently lower F1 scores observed across all models on this pillar, not only ESG-Graph.

In sum, ESG-Graph achieves competitive ESG classification performance with orders of magnitude fewer parameters than standard Transformer models, while removing architectural sequence-length constraints and substantially reducing computational cost.

3.2. Training Efficiency

While predictive performance is critical, practical ESG deployment also requires strong computational efficiency. We therefore evaluate ESG-Graph against Transformer-based baselines in terms of wall-clock training time per epoch and peak GPU memory consumption.

Results are summarized in Table 7.

ESG-Graph achieves the lowest training time and memory footprint among all evaluated models, requiring approximately 4.0× less memory than TinyBERT (0.18 GB vs. 0.72 GB) and approximately 15.6× less than Longformer (0.18 GB vs. 2.81 GB), while also providing an order-of-magnitude speedup in epoch time. These results indicate that ESG-Graph offers an efficient alternative for ESG document classification, particularly in resource-limited or real-time settings.

At inference time, ESG-Graph processes approximately 220 sentence-level graphs per second on the evaluation hardware, enabling a 500-page annual report (approximately 15,000 sentences) to be fully classified in under 70 s. This throughput is achievable on consumer hardware without server-grade GPU infrastructure, making the system accessible to smaller regulatory bodies and asset managers operating under computational constraints. Furthermore, the $O(n)$ message-passing complexity of the GNN, in contrast to the $O(n^{2})$ self-attention of transformer models, ensures that inference latency scales gracefully with sequence length.
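The memory ratios and the report-level estimate quoted in this subsection can be reproduced from the stated figures. The snippet below is a plain arithmetic check using only numbers from the text. It assumes throughput stays at the reported average for the whole report.

```python
# Peak GPU memory in GB, as quoted in the text.
peak_memory_gb = {"ESG-Graph": 0.18, "TinyBERT": 0.72, "Longformer": 2.81}
memory_ratio = {
    name: gb / peak_memory_gb["ESG-Graph"]
    for name, gb in peak_memory_gb.items()
    if name != "ESG-Graph"
}
for name, ratio in memory_ratio.items():
    print(f"{name}: {ratio:.1f}x the peak memory of ESG-Graph")

# Time to classify a report of about 15,000 sentences at 220 graphs/s.
graphs_per_second = 220
report_sentences = 15_000
seconds = report_sentences / graphs_per_second
print(f"Estimated time for the report: {seconds:.1f} s")
```

The estimate comes to roughly 68 s, consistent with the "under 70 s" figure, and the ratios round to 4.0 and 15.6.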

3.3. Long-Sequence Scalability Analysis

To assess the scalability of ESG-Graph on long ESG disclosures, we evaluated the model on five extended paragraphs constructed from real ESG report sections, with token lengths exceeding 1000 tokens (range: 1048–1352). These inputs approximate long-form disclosure segments commonly encountered in corporate sustainability and governance reporting.

For each input, we measured per-graph inference latency, peak GPU memory consumption, and node representation variance, with the latter serving as an indicator of potential over-smoothing effects. Results are reported in Table 8, alongside a baseline measured on standard-length validation samples.

Three observations can be drawn from these results. First, peak GPU memory usage remains nearly constant, increasing by approximately 0.44 MB (1.7%) despite the substantial increase in input length. This suggests that memory requirements remain largely stable under longer input conditions within the evaluated range.

Second, inference latency increases by a factor of 4.07× relative to the baseline, where the baseline corresponds to standard sentence-level inputs in the evaluation datasets (average length ∼30–40 tokens) and is computed under the same experimental pipeline. This increase reflects the larger graph sizes induced by longer sequences. However, the absolute latency remains moderate (maximum 24.67 ms per graph), indicating that extended disclosures can be processed efficiently in practical settings.

Third, node representation variance increases for longer inputs (0.328 vs. 0.220), indicating that node representations remain well differentiated across the graph. This pattern suggests that the model does not exhibit pronounced over-smoothing behavior as graph size increases. This behavior is consistent with the use of residual connections and normalization layers in the ESG-Graph architecture, which help stabilize message passing across larger graphs.

3.4. Ablation Study: Contribution of ESG Structural Nodes

To quantify the contribution of each architectural component introduced in ESG-Graph, we conduct an ablation study on the role of the ESRS subtopic anchor nodes and the Global ESG node.

The objective of this analysis is to isolate the effect of each component and verify that performance gains arise from architectural design.

We evaluate four variants of the model:

Full ESG-Graph: the complete architecture described in Section 2.
Without Subtopic Nodes: ESRS subtopic anchor nodes are removed; only token nodes and syntactic dependency edges are retained.
Without Global ESG Node: the sentence-level ESG aggregation node is removed.
Without Subtopic & Global Nodes: both ESG-specific node types are removed.
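The four variants can be pictured as two switches applied during graph construction. The following is a minimal sketch, not the authors' implementation: the function `build_graph`, the toy sentence, the toy lexicon, and in particular the edge wiring of the virtual nodes (anchor to matching tokens, global node to every other node) are assumptions made for illustration.

```python
def build_graph(tokens, dep_edges, lexicon, use_subtopic=True, use_global=True):
    """Return (nodes, edges) for one sentence.

    tokens    : list of token strings
    dep_edges : list of (head_index, child_index) dependency pairs
    lexicon   : dict mapping subtopic code -> set of lowercase keywords
    The wiring of the virtual nodes below is assumed, not taken from the paper.
    """
    nodes = [("token", t) for t in tokens]
    edges = list(dep_edges)

    if use_subtopic:
        for code, keywords in lexicon.items():
            matches = [i for i, t in enumerate(tokens) if t.lower() in keywords]
            if matches:  # add an anchor only when at least one keyword matches
                anchor = len(nodes)
                nodes.append(("subtopic", code))
                edges.extend((anchor, i) for i in matches)

    if use_global:
        global_id = len(nodes)
        nodes.append(("global", "ESG"))
        # Assumed: the global node connects to every other node.
        edges.extend((global_id, i) for i in range(global_id))

    return nodes, edges

# Hypothetical sentence, dependency edges, and two-entry lexicon.
tokens = ["The", "board", "oversees", "audit", "quality"]
dep_edges = [(2, 1), (1, 0), (2, 4), (4, 3)]
lexicon = {"G1": {"board", "audit"}, "E1": {"carbon"}}

variants = {
    "full": build_graph(tokens, dep_edges, lexicon),
    "no_subtopic": build_graph(tokens, dep_edges, lexicon, use_subtopic=False),
    "no_global": build_graph(tokens, dep_edges, lexicon, use_global=False),
    "syntax_only": build_graph(tokens, dep_edges, lexicon, False, False),
}
for name, (nodes, edges) in variants.items():
    print(name, len(nodes), len(edges))
```

Keeping the token nodes and dependency edges identical across variants means that any difference in F1 can be attributed to the virtual nodes alone.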

All experiments were conducted under identical training configurations, graph neural layers, and hyperparameters. Results are summarized in Table 9 and reported as mean F1-score over three random seeds.

Discussion

Removing ESRS subtopic anchor nodes leads to a reduction in F1-score across all datasets. The effect is most visible on the Social dataset, plausibly because social disclosures are closer to everyday language and often describe human-related topics such as working conditions, diversity, or community engagement. Unlike Governance or Environmental reporting, which rely on more technical terminology, social topics are expressed in more varied narrative forms, so explicit subtopic anchors supply a signal that the vocabulary alone does not.

Removing the Global ESG node leads to a smaller decrease in performance, indicating that global aggregation provides complementary signals beyond local features. The larger impact observed for Governance disclosures suggests that classification in this category relies more on global document patterns.

The strongest drop is observed when both ESG-specific node types are removed, reducing the architecture to a syntactic dependency graph. While the resulting architecture remains competitive, the performance gap suggests that the ESG-aware components contribute positively to discriminative capacity. Beyond the observed F1 improvement, the hierarchical anchor nodes serve a function that cannot be captured by accuracy metrics alone: they produce named, subtopic-level attribution that is directly legible to ESG practitioners, enabling users to understand not just the classification outcome but which regulatory theme drove that assignment. The added architectural complexity is therefore justified on interpretability grounds independently of its predictive contribution.

3.5. Ablation Study: Effect of Word Embedding Models

In addition to structural inductive biases, we investigate the sensitivity of ESG-Graph to the choice of underlying word embedding model. While the proposed architecture is embedding-agnostic, different static representations may capture ESG-relevant semantics with varying granularity, particularly for domain-specific terminology.

We compare two lightweight pretrained embedding models: FastText and GloVe.

FastText incorporates subword information, which is expected to be beneficial for handling rare terms, morphological variations, and domain-specific vocabulary. GloVe, on the other hand, captures global co-occurrence statistics and has been shown to produce stable semantic representations across diverse text domains.

For this ablation, all architectural components of ESG-Graph are kept identical, including graph construction, attention layers, and training hyperparameters. The only difference lies in the initialization of token node embeddings. Results are reported as mean F1-score over three random seeds.

Discussion

Across all three ESG pillars, GloVe embeddings yield consistently higher performance than FastText, although the observed margin remains modest. This pattern suggests that ESG-Graph is not strongly dependent on a specific embedding choice and that its performance is primarily driven by the graph-based inductive biases introduced by the architecture.

The slight advantage observed for GloVe may be related to its ability to capture global semantic regularities. At the same time, the competitive performance of FastText indicates that subword-level information provides limited additional benefit once the ESG-specific taxonomy is defined.

Beyond the GloVe vs. FastText comparison, we note that the choice of static embeddings over domain-adapted or contextualized alternatives such as ESG-BERT was deliberate on three grounds. First, it ensures that performance differences across model variants are attributable to graph structure rather than encoder capacity, preserving the architectural isolation necessary to interpret ablation results. Second, static embeddings introduce zero additional trainable parameters, keeping the model’s lightweight profile intact. Third, stacking transformer-derived contextual features on top of a graph attention network operating in a shallow layer regime risks representational redundancy and over-smoothing. The results in Table 10 empirically support this design choice: even the simpler FastText variant remains competitive, confirming that performance is primarily driven by the graph structure rather than the embedding layer.

3.6. Keyword Sensitivity Analysis

To assess the model’s robustness to variations in the analyst-defined keyword sets, we conducted a systematic perturbation study in which keywords were randomly removed from each subtopic lexicon at three removal rates: 10%, 20%, and 30%. For each rate, five independent random draws were performed and the mean F1-score was recorded across all three domains. Results are reported in Table 11.

Across all three domains and perturbation levels, the maximum observed F1 drop is 0.38 points (Environmental at −30%), indicating limited performance degradation under perturbation. This suggests that the model does not critically depend on any specific keyword subset and generalizes robustly across reasonable keyword set variations. The intersection-based consensus process therefore does not need to achieve perfect keyword coverage: the model’s graph-based representations compensate for minor vocabulary gaps.
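The perturbation protocol can be sketched as follows. This is an illustrative reconstruction under stated assumptions: removal is applied independently to each subtopic list, the number of removed keywords is rounded to the nearest integer, and the helper `perturb_lexicon` and the fixed seed are hypothetical. The keyword excerpt is the first ten G1 entries from Appendix A.

```python
import random

def perturb_lexicon(lexicon, removal_rate, rng):
    """Randomly drop a fraction of keywords from each subtopic list."""
    perturbed = {}
    for code, keywords in lexicon.items():
        n_remove = round(len(keywords) * removal_rate)
        removed = set(rng.sample(keywords, n_remove))
        perturbed[code] = [k for k in keywords if k not in removed]
    return perturbed

# Excerpt of the G1 keyword set from Appendix A (first ten entries).
lexicon = {"G1": ["board", "governance", "audit", "compliance", "ethics",
                  "bribery", "whistleblowing", "transparency", "risk",
                  "management"]}

rng = random.Random(0)  # fixed seed so the draws are repeatable
draws = {
    rate: [perturb_lexicon(lexicon, rate, rng) for _ in range(5)]
    for rate in (0.10, 0.20, 0.30)
}
for rate, samples in draws.items():
    print(rate, [len(s["G1"]) for s in samples])
```

Each perturbed lexicon would then be used to rebuild the graphs and score the model, with the five F1 values per rate averaged; whether the model is retrained or only re-evaluated at that step is not specified in the text.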

3.7. Energy, Emissions, and Comparative Efficiency Landscape

To quantify the environmental and computational footprint of all architectures, we measure energy usage using the CodeCarbon v3.0.7 framework (https://github.com/mlco2/codecarbon, accessed on 13 April 2026).

The system samples GPU and CPU power draw at 1 Hz and computes electricity consumption (kWh) and CO₂ emissions following [23].

All experiments were executed under identical hardware conditions (NVIDIA RTX 4070 Laptop GPU, Intel i9-13900H, 60 GB RAM).
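The accounting behind such measurements can be summarized in a few lines. The sketch below is not CodeCarbon code; it is a simplified view of sample-and-integrate energy accounting, in which power readings taken at a fixed interval are summed into energy and then multiplied by a grid carbon intensity. The power trace and the intensity value are placeholders, not measured quantities.

```python
def energy_kwh(power_samples_w, interval_s=1.0):
    """Integrate power samples (watts) taken every `interval_s` seconds."""
    joules = sum(power_samples_w) * interval_s
    return joules / 3.6e6  # 1 kWh = 3.6e6 J

def emissions_kg(energy_in_kwh, intensity_kg_per_kwh):
    """Convert electricity use to CO2 using a grid carbon intensity."""
    return energy_in_kwh * intensity_kg_per_kwh

# Hypothetical trace: 60 s at a combined CPU+GPU draw of 90 W, sampled at 1 Hz.
samples = [90.0] * 60
e = energy_kwh(samples)
co2 = emissions_kg(e, intensity_kg_per_kwh=0.5)  # placeholder intensity
print(f"{e:.6f} kWh, {co2:.6f} kg CO2")
```

Because emissions are a linear function of energy under this scheme, relative comparisons between models are the same whether expressed in kWh or in kg CO₂, provided all runs use the same intensity.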

3.7.1. Energy and Emissions

Distilled transformer models are the most energy-efficient baselines, consuming between 1.37 and 1.99 Wh per run. In comparison, FNet requires a moderate amount of energy (8.94 Wh), placing it between distilled models and larger architectures. Longformer is the most energy-intensive model (19.53 Wh), mainly due to its larger memory footprint.

By comparison, ESG-Graph exhibits a very low computational footprint, requiring 0.000312 kWh (0.312 Wh) per run and emitting 0.000148 kg CO₂.

Relative to Transformer baselines, ESG-Graph operates in a lower energy regime, consuming approximately:

4.4× less energy than TinyBERT,
6.4× less energy than DistilBERT,
28.7× less energy than FNet,
62.6× less energy than Longformer.
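The ratios listed above follow directly from the reported per-run energies, as the short check below shows. The assignment of 1.37 Wh to TinyBERT and 1.99 Wh to DistilBERT is inferred from the reported range and the ordering of the ratios; it is not stated explicitly in this section.

```python
ESG_GRAPH_WH = 0.312  # reported energy per run for ESG-Graph

# Baseline energies in Wh. The TinyBERT/DistilBERT assignment is inferred
# from the reported 1.37-1.99 Wh range and the ordering of the ratios.
baselines_wh = {
    "TinyBERT": 1.37,
    "DistilBERT": 1.99,
    "FNet": 8.94,
    "Longformer": 19.53,
}

ratios = {name: round(wh / ESG_GRAPH_WH, 1) for name, wh in baselines_wh.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```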

These results indicate that ESG-Graph achieves competitive classification performance while operating at a substantially lower energy cost than Transformer baselines, demonstrating the efficiency benefits of sparse, graph-based computation.

Results are summarized in Table 12.

3.7.2. Performance-Energy Trade-Off Curve

To jointly assess predictive performance and computational cost, Figure 2 plots the average F1 score across ESG datasets against the average energy consumption per run (kWh). A logarithmic scale is used on the energy axis to accommodate the wide dynamic range in consumption across models and to improve visual comparability between low and high energy regimes.

TinyBERT operates in a low-energy regime but attains a lower average F1 score. DistilBERT provides moderate performance gains at increased energy cost, while FNet exhibits higher consumption without proportional improvements in F1. Longformer achieves strong predictive performance but requires substantially more energy per run.

ESG-Graph combines very low energy usage with the highest average F1 score among the evaluated supervised models. Compared to Transformer baselines, ESG-Graph achieves favorable performance while operating in a substantially lower energy regime.

3.7.3. Comparison with Large Language Models

In addition to supervised baselines, we evaluate four zero-shot LLMs: GPT-4o, GPT-4o mini, Gemini-2.5 Flash, and Gemini-2.5 Pro.

These models demonstrate strong generalization capabilities, particularly on the Environmental dataset, where Gemini-2.5 Flash attains the highest zero-shot F1 (91.95%). However, they operate at a fundamentally different computational scale and therefore cannot be directly placed on the performance-energy curve used for supervised models.

We did not include fine-tuned or few-shot LLM baselines (e.g., Llama-3-8B) due to their significantly higher computational and memory requirements. Fine-tuning such models typically requires GPU memory well beyond 8 GB, along with substantially higher computational cost and longer training times, making them inconsistent with the efficiency-focused evaluation setting considered in this study. The comparison is therefore restricted to zero-shot LLMs.

Nonetheless, the supervised ESG-Graph remains competitive with these LLMs, achieving:

Higher accuracy and F1 score than all evaluated LLMs on the Governance dataset;
Higher F1 score than all evaluated LLMs on the Environmental dataset;
Substantially higher F1 score and accuracy than all evaluated LLMs on the Social dataset;
Vastly lower energy cost, by several orders of magnitude.

Given that none of the LLMs undergo task-specific fine-tuning and their energy usage is several orders of magnitude larger, as reported in public model cards and efficiency analyses [24,25], these results suggest that compact, task-aligned architectures can offer competitive performance under realistic computational budgets [16,26].

3.7.4. Discussion

Overall, the findings present a coherent picture across efficiency and performance. Distilled transformers offer strong baseline efficiency, while full transformer architectures such as Longformer achieve high predictive accuracy at the cost of substantially increased energy usage. In contrast, ESG-Graph exhibits a favorable balance, combining competitive supervised accuracy with very low energy consumption and a strong position on the performance-energy trade-off curve.

3.8. Attention Dynamics Across ESG Pillars

To understand how ESG-Graph processes textual information across ESG dimensions, we analyze the layer-wise evolution of attention weights on the Environmental, Social, and Governance datasets. As reported in Table 13, the mean attention remains approximately constant at 0.22 across all layers. However, entropy consistently decreases while skewness increases with depth, which indicates that the model shifts progressively from general language patterns to more concentrated focus on domain-relevant tokens.
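To make the two statistics concrete, the sketch below computes Shannon entropy and skewness for two attention distributions over the same five neighbours. The exact estimators used in the paper are not specified, so the natural-log entropy and the Fisher-Pearson moment coefficient used here are assumptions, and the weight vectors are hypothetical.

```python
import math

def entropy(weights):
    """Shannon entropy (nats) of a normalized attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0)

def skewness(values):
    """Fisher-Pearson moment coefficient of skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

# Hypothetical attention over five neighbours of one node.
shallow = [0.22, 0.20, 0.20, 0.19, 0.19]  # near-uniform, early layer
deep = [0.60, 0.10, 0.10, 0.10, 0.10]     # concentrated, late layer

print(entropy(shallow), skewness(shallow))
print(entropy(deep), skewness(deep))
```

Both vectors have the same mean (0.2) because each sums to one over the same number of neighbours, yet the concentrated one has lower entropy and higher skewness. If the weights are softmax-normalized per node, as in a standard GAT, this would explain how the mean can stay flat across layers while the other two statistics move.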

A full visualization of the attention entropy curves and layer-wise weight distributions is provided in Appendix B, where Figure A1 and Figure A2 illustrate how the model transitions from general language patterns (Layers 1–2) to domain-specific semantics (Layers 3–4).

3.9. Interpretability Analysis

Interpretability is crucial in ESG text classification, where models must provide transparent justifications for their decisions, particularly in regulatory contexts [8]. To analyze how ESG-Graph forms its predictions, we employ gradient-based attribution methods, which estimate the sensitivity of the model output to changes in node representations [27,28].

Unlike attention mechanisms, which explain information routing, gradient-based attributions focus on estimating how variations in node representations influence the model output [29,30].

Given the classifier output $f(G)$ and the embedding $h_v$ of a node $v$, the attribution score is computed as:

$$\mathrm{Importance}(v) = \left\lVert \frac{\partial f}{\partial h_v} \odot h_v \right\rVert \tag{14}$$

Because gradients propagate through the graph attention layers, these scores capture both the lexical meaning of individual tokens and the structural influence of the dependency edges connecting them, as shown in earlier work on graph-level explainability [31].
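A minimal numerical sketch of Equation (14) is given below. It is not the authors' code: the classifier is replaced by a toy sigmoid readout over summed node embeddings, the gradient is estimated by central finite differences rather than backpropagation, and the norm is assumed to be the L2 norm. All names and values are hypothetical.

```python
import math

def classifier(node_embeddings, weights):
    """Toy graph-level score: sigmoid of a weighted sum over all nodes.

    Stand-in for f(G); the real model is a multi-layer GAT.
    """
    z = sum(w * x for h in node_embeddings for w, x in zip(weights, h))
    return 1.0 / (1.0 + math.exp(-z))

def node_importance(node_embeddings, weights, eps=1e-6):
    """L2 norm of (df/dh_v times h_v, elementwise), via central differences."""
    scores = []
    for v, h in enumerate(node_embeddings):
        contrib = []
        for d in range(len(h)):
            plus = [list(row) for row in node_embeddings]
            minus = [list(row) for row in node_embeddings]
            plus[v][d] += eps
            minus[v][d] -= eps
            grad = (classifier(plus, weights) - classifier(minus, weights)) / (2 * eps)
            contrib.append(grad * h[d])  # gradient times input, per dimension
        scores.append(math.sqrt(sum(c * c for c in contrib)))
    return scores

# Hypothetical 2-d embeddings for three nodes and a fixed readout vector.
embeddings = [[2.0, 1.0], [0.1, 0.0], [0.0, 0.0]]
readout = [0.5, -0.25]
importance = node_importance(embeddings, readout)
print(importance)
```

The element-wise product with $h_v$ means that a node with a zero embedding receives zero importance even when the output is sensitive to it, which is a known property of gradient-times-input attributions.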

The gradient-based node importance results, averaged across documents, are shown in Figure 3, Figure 4 and Figure 5.

For Governance documents, tokens such as governance, compliance, audit, and regulatory receive the highest attribution scores. These terms highlight that the model bases its predictions on meaningful corporate-governance content.

For Environmental documents, influential tokens include environmental, emissions, climate, carbon, and renewable. The spread of attribution across multiple environmental themes suggests that the model captures the multiple dimensions of environmental disclosures.

For Social documents, tokens such as gender, diversity, employees, rights, and communities show high importance. These reflect key aspects of social impact and show that the model captures relevant patterns.

Case Study: Governance Classification

As a concrete illustration, consider the following excerpt drawn from a governance disclosure aligned with corporate governance standards:

“The UK Corporate Governance Code recommends that the Board should include a balance of executive and non-executive Directors (and in particular independent non-executive Directors) such that no individual or small group of individuals can dominate the Board’s decision making.”

The dependency parser identifies recommends as the root, linking key entities such as Board, Directors, and decision making through syntactic relations (e.g., nsubj, dobj, and clausal modifiers). Several tokens, including Board, Directors, and governance, match entries in the G1 subtopic lexicon, leading to the activation of a Governance anchor node.

Gradient-based attribution highlights the anchor node and its directly connected syntactic neighbors as the most influential components in the classification decision. These elements correspond to core governance concepts such as board composition, independence, and decision-making structure. This behavior provides a transparent and policy-aligned explanation of the model output, illustrating how subtopic anchor nodes bridge token-level representations and higher-level regulatory concepts.

To assess the practical relevance of these explanations, a subset of attribution outputs was qualitatively examined by ESG analysts. The identified anchor nodes and salient tokens were generally consistent with expert assessments of governance materiality, particularly in cases where disclosure language follows established regulatory frameworks. While a formal user study is beyond the scope of this work, these observations suggest that ESG-Graph produces explanations that align with professional analytical reasoning.

3.10. Discussion

Beyond the empirical findings reported above, it is essential to examine how the results of this study relate to and extend established research directions. In the Introduction, we identified four families of approaches for ESG text classification and analyzed their respective limitations along three dimensions: efficiency, structural expressiveness, and domain knowledge integration (Table 1). The following discussion revisits each model family in light of the experimental evidence presented in this study.

3.10.1. Relation to Transformer-Based Models

To begin with, large pretrained transformers such as BERT [6] and its ESG-adapted variant ESG-BERT [4] have established strong empirical baselines for text classification by leveraging deep contextual representations. However, as noted in Table 1, these models score low across all three dimensions due to their quadratic computational complexity [7], sequential token processing, and absence of structured domain knowledge. The results presented in Section 3 and Section 3.7 confirm that ESG-Graph achieves competitive or superior classification performance, particularly on the Governance dataset where it outperforms all transformer baselines, while requiring orders of magnitude fewer parameters and up to 60× less energy. These findings corroborate the broader concerns raised by Strubell et al. [9] and Patterson et al. [25] regarding the environmental costs of large-scale language models, and provide evidence that task-specific architectures grounded in linguistic structure constitute a viable alternative in domain-constrained settings.

3.10.2. Relation to Efficient Transformer Architectures

Building on this comparison, efficient transformer variants such as DistilBERT [11], TinyBERT [12], FNet [14], and Longformer [13] have convincingly demonstrated that model compression and architectural simplification can substantially reduce computational cost without proportional accuracy degradation. ESG-Graph extends this line of inquiry by establishing that competitive ESG classification can be attained without relying on transformer encoders altogether, suggesting that graph-based message passing over linguistically grounded structures constitutes a complementary efficiency pathway. Moreover, while efficient transformers address the efficiency limitation identified in Table 1, they remain grounded in sequential token processing and do not incorporate structured domain knowledge, leaving the structure and knowledge dimensions unresolved. The training efficiency results reported in Table 7 and the energy measurements in Table 12 confirm that ESG-Graph operates in a substantially lower computational regime while simultaneously addressing the structural and knowledge gaps that efficient transformers leave open. This contributes to the emerging literature on sustainable AI [10] and provides evidence that architectural design choices, not only compression or distillation strategies, constitute a viable lever for reducing the carbon footprint of NLP systems, a consideration that carries particular weight when the application domain itself concerns environmental sustainability.

3.10.3. Relation to Graph-Based Methods

Furthermore, the foundational work of Yao et al. [15] demonstrated that graph convolutional networks can effectively model text for classification through corpus-level word co-occurrence graphs, achieving strong structural expressiveness as reflected in Table 1. ESG-Graph refines this approach in two important respects. First, it operates at the sentence level and leverages syntactic dependency relations rather than statistical co-occurrence, enabling the model to capture grammatical compositionality and thematic relevance within individual disclosures. This shift from corpus-level to sentence-level graph construction extends the applicability of graph-based NLP to structured regulatory text domains where document-level granularity is insufficient. Second, while existing graph-based methods score low on domain knowledge integration (Table 1), ESG-Graph augments syntactic graphs with ESRS-aligned taxonomy anchor nodes, directly addressing this limitation. The ablation results in Section 3.4 confirm that these domain-specific components contribute measurably to classification performance, and the interpretability analysis in Section 3.9 demonstrates that gradient-based attribution over named taxonomy nodes extends prior work on GNN explainability [31] to the ESG disclosure domain, responding to the concerns raised by Lipton [8] regarding the opacity of neural models in high-stakes applications.

3.10.4. Relation to Knowledge-Enhanced Approaches

Closely related to the knowledge integration challenge, Peters et al. [18] proposed enriching contextual word representations with external knowledge bases through additional pretraining or embedding alignment. While such approaches achieve high domain alignment as reflected in Table 1, they remain grounded in sequential architectures with limited structural expressiveness. ESG-Graph adopts a more modular strategy by encoding domain knowledge through taxonomy-based virtual nodes embedded directly within the graph topology, thereby easing the structure–knowledge trade-off that characterizes existing knowledge-enhanced methods. This extensible design is particularly well suited to the ESG domain, where regulatory frameworks such as ESRS undergo frequent revisions, and it suggests that lightweight, structure-based knowledge injection can serve as a practical alternative to knowledge-intensive pretraining. In a broader theoretical context, Garcez and Lamb [17] have advocated for the integration of symbolic reasoning with neural computation as a pathway toward more robust and interpretable AI systems. ESG-Graph provides a concrete, domain-specific instantiation of this paradigm: the ESRS-aligned taxonomy nodes function as symbolic priors grounding neural predictions in analyst-defined regulatory concepts. The ablation results presented in Section 3.4 offer empirical support for the neuro-symbolic hypothesis in the context of financial text analysis.

3.10.5. Unified Positioning

Taken together, these connections indicate that ESG-Graph addresses the specific limitations identified for each model family in Table 1: it matches the predictive strength of transformer-based models without their computational burden, extends the efficiency gains of compressed architectures by eliminating the transformer dependency entirely, augments the structural advantages of graph-based methods with domain-specific knowledge integration, and provides a more modular and extensible alternative to knowledge-enhanced approaches while simultaneously achieving structural expressiveness. The empirical evidence presented in this study suggests that these properties, previously achieved only in isolation by separate families of approaches, can be unified within a single lightweight framework, providing a coherent response to the research question posed in the Introduction.

4. Conclusions

This paper introduced ESG-Graph, a lightweight and interpretable graph neural framework for ESG text classification that addresses the computational, environmental, and transparency limitations of transformer-based approaches in sustainability analytics.

Our findings demonstrate that ESG-Graph achieves competitive classification performance across Environmental, Social, and Governance benchmarks while using only 1.65 million parameters.

Moreover, ESG-Graph reduces energy consumption by a factor of roughly 4× to 60× relative to efficient transformer baselines, while eliminating sequence length constraints.

Limitations and Future Work: While ESG-Graph demonstrates strong performance on benchmark datasets, its evaluation on other jurisdictional ESG disclosures remains future work. Moreover, the current taxonomy requires significant manual curation. Future work could explore the integration of a relevance gating mechanism. This would enable adaptive token selection based on context, without introducing additional components such as large language models. Future work should also evaluate few-shot prompting of smaller open-source LLMs (e.g., Llama-3-8B) as an additional baseline, acknowledging that such models were excluded from the current study due to their incompatibility with the single-GPU hardware constraint and their lack of native interpretability. Furthermore, a formal practitioner user study evaluating the alignment between model attributions and expert materiality assessments constitutes an important direction for future validation and is left for subsequent work.

In conclusion, ESG-Graph represents a strong alternative to large language models for ESG text classification. By encoding domain knowledge into graph topology, it offers financial institutions and regulators a practical tool for scalable ESG disclosure analysis.

Author Contributions

Conceptualization and methodology: Y.E. and A.S.; software and formal analysis: Y.E.; data curation and investigation: Y.E., M.C. and R.B.; validation: M.O. and E.B.; writing (original draft preparation): Y.E.; writing (review and editing): M.C., R.B. and M.E.K.; visualization: Y.E.; supervision: M.C. and R.B.; project administration: R.B. and E.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed in this study are derived from publicly available ESG benchmark corpora. Processed data and code used to reproduce the experiments are available from the corresponding author upon reasonable request.

Acknowledgments

AI or AI-assisted tools were not used in drafting this manuscript.

Conflicts of Interest

Authors Yasser Elouargui, Abdellatif Sassioui, Rachid Benouini and Elmehdi Benyoussef were employed by Leyton. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ESG	Environmental, Social, and Governance
GNN	Graph Neural Network
GAT	Graph Attention Network
ESRS	European Sustainability Reporting Standards
LLM	Large Language Model

Appendix A. Full ESRS Lexicon

This appendix provides the complete ESRS-aligned lexicon used for token-to-subtopic mapping in the graph construction stage. The lexicon follows the structure of the European Sustainability Reporting Standards (ESRS) and includes all keyword sets associated with Environmental (E1–E5), Social (S1–S4), and Governance (G1) thematic areas.

Appendix A.1. Environmental (E) Lexicon

Table A1. Environmental subtopics (E1–E5) and their full keyword sets.

Code	Label	Keywords
E1	Climate Change	climate, emission, co2, carbon, ghg, greenhouse, renewable, energy, decarbonization, net, zero, adaptation, mitigation, transition, offset, neutrality, temperature, paris, target, fuel, electricity, efficiency, scope, low-carbon, resilience, sustainability
E2	Pollution	pollution, hazardous, chemical, waste, air, water, soil, toxic, pesticide, fertiliser, microplastic, spill, discharge, contamination, depollution, remediation, particulate, nitrogen, sulfur, ozone
E3	Water and Marine Resources	water, marine, ocean, river, lake, groundwater, desalination, scarcity, reuse, recycling, effluent, eutrophication, catchment, aquatic, wetland, watershed, hydrology
E4	Biodiversity and Ecosystems	biodiversity, ecosystem, habitat, species, deforestation, conservation, reforestation, afforestation, restoration, natural, capital, flora, fauna, soil, degradation, pollination, forest, management
E5	Circular Economy	circular, economy, recycling, reuse, remanufacturing, waste, reduction, lifecycle, packaging, material, recovery, composting, upcycling, sustainable, durability, take-back, zero, waste

Appendix A.2. Social (S) Lexicon

Table A2. Social subtopics (S1–S4) and their full keyword sets.

Code	Label	Keywords
S1	Own Workforce	employee, workforce, labour, union, diversity, inclusion, gender, training, safety, wellbeing, wage, salary, harassment, equality, parental, leave, remuneration, collective, bargaining, employment
S2	Workers in the Value Chain	supplier, subcontractor, value, chain, due, diligence, human, rights, child, labour, forced, migrant, worker, audit, sourcing, procurement, traceability, ethical, vendor
S3	Affected Communities	community, local, indigenous, stakeholder, consultation, grievance, land, rights, resettlement, philanthropy, education, healthcare, sanitation, vulnerable, infrastructure
S4	Consumers and End-users	consumer, customer, end, user, product, safety, privacy, data, security, accessibility, affordability, complaint, marketing, information, trust, loyalty, quality, warranty

Appendix A.3. Governance (G) Lexicon

Table A3. Governance subtopic (G1) and its full keyword set.

Code	Label	Keywords
G1	Governance, Risk Management and Internal Control	board, governance, audit, compliance, ethics, bribery, whistleblowing, transparency, risk, management, policy, stakeholder, shareholder, remuneration, independence, diversity, committee, oversight, conflict, interest, lobbying, fraud, tax, integrity, accountability

Appendix B. Detailed Attention Visualizations

Appendix B.1. Evolution of Attention Statistics

Figure A1 shows how entropy and variance evolve across the four GNN layers for each ESG dataset.

Figure A1. Evolution of attention statistics across ESG pillars.

Appendix B.2. Distribution of Attention Weights

Figure A2 presents the attention-weight distributions across layers for each ESG dataset.

Figure A2. Distribution of attention weights across layers for each ESG dataset.

References

European Union. Regulation (EU) 2019/2088 on Sustainability-Related Disclosures in the Financial Services Sector (SFDR). Official Journal of the European Union. 2019. Available online: https://climate-laws.org/document/regulation-eu-2019-2088-on-sustainability-related-disclosures-in-the-financial-services-sector-and-amending-regulation-2020-852-taxonomy_eb3b (accessed on 9 April 2026).
European Union. Directive (EU) 2022/2464 on Corporate Sustainability Reporting (CSRD). Official Journal of the European Union. 2022. Available online: https://climate-laws.org/document/directive-eu-2022-2464-of-the-european-parliament-and-of-the-council-of-14-december-2022-amending-regulation-eu-no-537-2014-directive-2004-109-ec-directive-2006-43-ec-and-directive-2013-34-eu-as-regards-corporate-sustainability-reporting-corporate-sustainability-reporting-directive_3a34 (accessed on 9 April 2026).
Araci, D. FinBERT: Financial Sentiment Analysis with Pre-trained Language Models. arXiv 2019, arXiv:1908.10063.
Gupta, R.; Rane, S.; Chakraborty, T. ESG-BERT: A Domain-Specific Language Model for ESG Text Classification. In Proceedings of the ICML Workshop on Climate Change, Virtually, 26–30 April 2020.
Aguda, T.D.; Siddagangappa, S.; Kochkina, E.; Kaur, S.; Wang, D.; Smiley, C. Large Language Models as Financial Data Annotators: A Study on Effectiveness and Efficiency. In Proceedings of the Joint International Conference on Computational Linguistics and Language Resources and Evaluation (LREC-COLING), Torino, Italy, 20–24 May 2024; pp. 10124–10145.
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the NeurIPS, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
Lipton, Z.C. The Mythos of Model Interpretability. Commun. ACM 2018, 61, 36–43.
Strubell, E.; Ganesh, A.; McCallum, A. Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of the ACL, Florence, Italy, 28 July–2 August 2019; pp. 3645–3650.
Henderson, P.; Hu, J.; Romoff, J.; Brunskill, E.; Jurafsky, D.; Pineau, J. Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning. J. Mach. Learn. Res. 2020, 21, 1–43.
Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT: A Distilled Version of BERT. In Proceedings of the NeurIPS Workshop, Vancouver, BC, Canada, 13 December 2019.
Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for Natural Language Understanding. In Proceedings of the EMNLP, Virtually, 16–20 November 2020; pp. 4163–4174.
Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. In Proceedings of the ACL, Seattle, WA, USA, 5–10 July 2020; pp. 8449–8461.
Lee, T.; Kim, H.; Tokmakov, P.; Cho, K. FNet: Mixing Tokens with Fourier Transforms. In Proceedings of the NAACL-HLT, Mexico City, Mexico, 6–11 June 2021; pp. 4485–4494.
Yao, L.; Mao, C.; Luo, Y. Graph Convolutional Networks for Text Classification. In Proceedings of the AAAI, Honolulu, HI, USA, 27 January–1 February 2019; pp. 7370–7377.
Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A Comprehensive Survey on Graph Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4–24.
Garcez, A.; Lamb, L.C. Neurosymbolic AI: The 3rd Wave. Artif. Intell. Rev. 2023, 56, 12387–12406.
Peters, M.E.; Ruder, S.; Smith, N.A.; Lee, K.; Beltagy, I. Knowledge Enhanced Contextual Word Representations. In Proceedings of the EMNLP-IJCNLP, Hong Kong, China, 3–7 November 2019; pp. 43–54.
Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. In Proceedings of the EMNLP, Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
Dozat, T.; Manning, C.D. Deep Biaffine Attention for Neural Dependency Parsing. In Proceedings of the ICLR, Toulon, France, 24–26 April 2017.
Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. In Proceedings of the ICLR, Vancouver, BC, Canada, 30 April–3 May 2018.
Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the ICLR, New Orleans, LA, USA, 6–9 May 2019.
Lacoste, A.; Luccioni, A.; Schmidt, V.; Dandres, T. Quantifying the Carbon Emissions of Machine Learning. arXiv 2020, arXiv:1910.09700.
Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the Opportunities and Risks of Foundation Models. arXiv 2021, arXiv:2108.07258.
Patterson, D.; Gonzalez, J.; Le, Q.; Liang, C.; Munguia, L.; Rothchild, D.; So, D.; Texier, M.; Dean, J. Carbon Emissions and Large Neural Network Training. arXiv 2021, arXiv:2104.10350.
Tay, Y.; Dehghani, M.; Abnar, S.; Shen, Y.; Bahri, D.; Metzler, D. Efficient Transformers: A Survey. ACM Comput. Surv. 2022, 55, 109.
Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic Attribution for Deep Networks. In Proceedings of the ICML, Sydney, Australia, 6–11 August 2017; pp. 3319–3328.
Shrikumar, A.; Greenside, P.; Kundaje, A. Learning Important Features through Propagating Activation Differences. In Proceedings of the ICML, Sydney, Australia, 6–11 August 2017; pp. 3145–3153.
Jain, S.; Wallace, B.C. Attention Is Not Explanation. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 3543–3556.
Wiegreffe, S.; Pinter, Y. Attention Is Not Not Explanation. In Proceedings of the EMNLP, Hong Kong, China, 3–7 November 2019; pp. 11–20.
Ying, R.; Bourgeois, D.; You, J.; Zitnik, M.; Leskovec, J. GNNExplainer: Generating Explanations for Graph Neural Networks. In Proceedings of the NeurIPS, Vancouver, BC, Canada, 8–14 December 2019.

Figure 1. Overview of the ESG-Graph architecture.

Figure 2. Average F1 score versus average energy consumption (kWh, log scale) across supervised architectures.

Figure 3. Node importance in Governance documents.

Figure 4. Node importance in Environmental documents.

Figure 5. Node importance in Social documents.

Table 1. Systematic comparison of existing approaches for ESG text classification. Representative works include BERT [6], ESG-BERT [4], DistilBERT [11], Longformer [13], FNet [14], TextGCN [15], and knowledge-integrated models [18].

Approach	Efficiency	Structure	Knowledge
Transformers	Low	Low	Low
Efficient Transformers	Medium	Low	Low
Graph Neural Networks	High	High	Low
Knowledge-enhanced NLP	Medium	Low	High

Table 2. Representative ESRS subtopics and example keywords used for token-to-subtopic mapping.

Code	Label	Example Keywords
E1	Climate Change	emission, carbon, renewable, mitigation
E2	Pollution	pollution, hazardous, chemical, contamination
E3	Water and Marine Resources	water, marine, groundwater, effluent
E4	Biodiversity and Ecosystems	biodiversity, habitat, conservation, species
E5	Circular Economy	circular, recycling, material, recovery
S1	Own Workforce	employee, diversity, training, safety
S2	Value Chain Workers	supplier, due diligence, human rights
S3	Affected Communities	community, indigenous, grievance
S4	Consumers and End-users	consumer, privacy, product safety
G1	Governance and Internal Control	audit, compliance, ethics, oversight
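
The token-to-subtopic mapping summarized in Table 2 can be illustrated with a minimal sketch. It assumes case-insensitive exact matching of single-word and multi-word keywords against a pre-tokenized sentence, with no lemmatization; the function name `map_tokens_to_subtopics` and the example sentence are hypothetical, and the lexicon is limited to the example keywords listed in the table rather than the full lexicon.

```python
from collections import defaultdict

# Example keywords copied from Table 2 (a subset of the full lexicon).
ESRS_KEYWORDS = {
    "E1": ["emission", "carbon", "renewable", "mitigation"],
    "E2": ["pollution", "hazardous", "chemical", "contamination"],
    "E3": ["water", "marine", "groundwater", "effluent"],
    "E4": ["biodiversity", "habitat", "conservation", "species"],
    "E5": ["circular", "recycling", "material", "recovery"],
    "S1": ["employee", "diversity", "training", "safety"],
    "S2": ["supplier", "due diligence", "human rights"],
    "S3": ["community", "indigenous", "grievance"],
    "S4": ["consumer", "privacy", "product safety"],
    "G1": ["audit", "compliance", "ethics", "oversight"],
}


def map_tokens_to_subtopics(tokens):
    """Return {subtopic code: sorted indices of tokens matched by a keyword}.

    Assumption: exact, case-insensitive matching; multi-word keywords are
    matched as contiguous token n-grams.
    """
    lowered = [t.lower() for t in tokens]
    matches = defaultdict(set)
    for code, keywords in ESRS_KEYWORDS.items():
        for keyword in keywords:
            parts = keyword.split()
            n = len(parts)
            for start in range(len(lowered) - n + 1):
                if lowered[start:start + n] == parts:
                    matches[code].update(range(start, start + n))
    return {code: sorted(indices) for code, indices in matches.items()}


# Hypothetical example sentence, already tokenized.
tokens = ["The", "supplier", "due", "diligence", "audit",
          "covers", "carbon", "emission", "targets"]
links = map_tokens_to_subtopics(tokens)
print(links)
```

Each returned entry lists the token positions that would be linked to the corresponding subtopic node; subtopics with no matching token are simply absent from the result.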

Table 3. Token statistics per document across ESG datasets.

Corpus	Mean	Median	p95	Max
Environmental	32.2	29	62	202
Social	32.7	29	64	202
Governance	32.4	29	63	202

Table 4. Environmental dataset performance and scalability comparison.

Model	Params	Seq Length	Accuracy	F1-Score
TinyBERT	15 M	512	92.50	89.90
DistilBERT	66 M	512	94.50	92.40
FNet	82 M	512	94.30	92.00
Longformer	149 M	4096	95.80	94.20
ESG-Graph (ours)	1.65 M	No restrictions *	93.30	92.60
GPT-4o	–	–	90.50	86.84
GPT-4o-mini	–	–	82.17	78.38
Gemini-2.5-Pro	–	–	93.83	91.38
Gemini-2.5-Flash	–	–	94.17	91.95

* ESG-Graph does not impose any architectural constraint on sequence length.

Table 5. Social dataset performance and scalability comparison.

Model	Params	Seq Length	Accuracy	F1-Score
TinyBERT	15 M	512	91.75	91.50
DistilBERT	66 M	512	92.50	92.60
FNet	82 M	512	86.75	89.20
Longformer	149 M	4096	93.00	91.30
ESG-Graph (ours)	1.65 M	No restrictions *	88.50	88.10
GPT-4o	–	–	77.50	73.68
GPT-4o-mini	–	–	74.33	74.92
Gemini-2.5-Pro	–	–	73.00	74.04
Gemini-2.5-Flash	–	–	71.50	73.57

* ESG-Graph does not impose any architectural constraint on sequence length.

Table 6. Governance dataset performance and scalability comparison.

Model	Params	Seq Length	Accuracy	F1-Score
TinyBERT	15 M	512	77.75	69.40
DistilBERT	66 M	512	84.50	70.80
FNet	82 M	512	83.00	65.30
Longformer	149 M	4096	86.30	73.90
ESG-Graph (ours)	1.65 M	No restrictions *	87.00	83.20
GPT-4o	–	–	74.83	84.11
GPT-4o-mini	–	–	69.67	52.85
Gemini-2.5-Flash	–	–	46.50	46.05
Gemini-2.5-Pro	–	–	43.00	45.19

* ESG-Graph does not impose any architectural constraint on sequence length.

Table 7. Training time and peak GPU memory comparison across models.

Model	Epoch Time (s)	Peak GPU (GB)
TinyBERT	0.9	0.72
DistilBERT	23.2	1.90
FNet	171.2	1.56
Longformer	415.2	2.81
ESG-Graph (ours)	0.7	0.18

Peak GPU memory ratio: ESG-Graph (0.18 GB) requires 4.0× less memory than TinyBERT (0.72 GB) and 15.6× less than Longformer (2.81 GB).
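
The ratios quoted in the note follow directly from the columns of Table 7. The short sketch below recomputes them, together with the corresponding epoch-time ratios; the dictionary and function names are illustrative and do not come from the article.

```python
# Values copied from Table 7.
epoch_time_s = {
    "TinyBERT": 0.9,
    "DistilBERT": 23.2,
    "FNet": 171.2,
    "Longformer": 415.2,
    "ESG-Graph (ours)": 0.7,
}
peak_gpu_gb = {
    "TinyBERT": 0.72,
    "DistilBERT": 1.90,
    "FNet": 1.56,
    "Longformer": 2.81,
    "ESG-Graph (ours)": 0.18,
}


def ratios_vs(table, reference="ESG-Graph (ours)"):
    """Divide every other model's value by the reference model's value."""
    ref = table[reference]
    return {
        model: round(value / ref, 1)
        for model, value in table.items()
        if model != reference
    }


memory_ratios = ratios_vs(peak_gpu_gb)
time_ratios = ratios_vs(epoch_time_s)
print(memory_ratios)
print(time_ratios)
```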

Table 8. Long-sequence evaluation across ESG domains.

Sample (Domain)	Tokens	Latency (ms)	Peak Mem. (MB)	Node Var.
Environmental-1	1048	18.72	26.72	0.329
Social-1	1125	22.15	26.73	0.361
Governance-1	1180	20.41	26.74	0.268
Environmental-2	1215	23.08	26.75	0.334
Social-2	1352	24.67	26.76	0.348
Long mean	1184	21.81	26.74	0.328
Normal baseline	∼32	5.35	26.30	0.220
Ratio (long/normal)	—	4.07×	1.02×	1.49×

Table 9. Ablation results measuring the contribution of ESG nodes (F1-score).

Model Variant	Environmental	Social	Governance
Full ESG-Graph	92.6	88.1	83.2
Subtopic Nodes	91.9	85.6	82.4
Global ESG Node	92.2	86.1	82.9
Subtopic & Global Nodes	91.1	84.8	81.3

Table 10. Ablation results measuring the impact of word embedding models.

Embedding Model	Environmental	Social	Governance
FastText	92.1	87.6	82.7
GloVe	92.6	88.1	83.2

Table 11. F1-score under keyword removal perturbation (mean over five random draws per rate). Absolute drop from baseline shown in parentheses.

Domain	Baseline	−10%	−20%	−30%
Environmental	92.60	92.41 (−0.19)	92.28 (−0.32)	92.22 (−0.38)
Social	88.10	87.93 (−0.17)	87.79 (−0.31)	87.71 (−0.39)
Governance	83.20	83.04 (−0.16)	82.91 (−0.29)	82.83 (−0.37)
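
The perturbation protocol behind Table 11 (remove a fraction of lexicon keywords, repeat for five random draws, average the F1-score) can be sketched as follows. This is a minimal illustration under stated assumptions: removal is sampled uniformly over all (subtopic, keyword) pairs, the seeds are simply 0 to 4, and `evaluate_f1` is a hypothetical callback that would rebuild the graphs with the reduced lexicon and return an F1-score. None of these details is specified by the table itself.

```python
import random


def remove_keywords(lexicon, rate, seed):
    """Drop a fraction `rate` of all keywords, sampled without replacement.

    Assumption: sampling is uniform over every (subtopic, keyword) pair.
    """
    rng = random.Random(seed)
    pairs = [(code, kw) for code, kws in lexicon.items() for kw in kws]
    n_drop = round(rate * len(pairs))
    dropped = set(rng.sample(pairs, n_drop))
    return {
        code: [kw for kw in kws if (code, kw) not in dropped]
        for code, kws in lexicon.items()
    }


def mean_f1_under_removal(lexicon, rate, evaluate_f1, n_draws=5):
    """Average the score returned by `evaluate_f1` over `n_draws` random draws."""
    scores = [
        evaluate_f1(remove_keywords(lexicon, rate, seed))
        for seed in range(n_draws)
    ]
    return sum(scores) / len(scores)


# Toy lexicon and a stand-in scorer (keyword count), used only to show the call.
toy_lexicon = {
    "E1": ["emission", "carbon", "renewable", "mitigation", "climate"],
    "G1": ["audit", "compliance", "ethics", "oversight", "board"],
}
reduced = remove_keywords(toy_lexicon, rate=0.2, seed=0)
remaining = sum(len(kws) for kws in reduced.values())
print(remaining)
```

In an actual run, the stand-in scorer would be replaced by a full evaluation of the trained model on graphs built from the reduced lexicon.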

Table 12. Average energy consumption and carbon emissions across the three ESG datasets. CO₂ eq values are computed using a global average grid intensity of 0.475 kgCO₂/kWh.

Model	Avg. Energy (kWh)	Avg. CO₂ (kg)
TinyBERT	0.00137	0.00097
DistilBERT	0.00199	0.00128
FNet	0.00894	0.00609
Longformer	0.01953	0.01231
ESG-Graph (ours)	0.000312	0.000148
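
As a worked check of the quantities in Table 12, the sketch below converts energy to CO₂ equivalent using the grid intensity stated in the caption and computes each baseline's energy relative to ESG-Graph. The variable and function names are illustrative. Only the ESG-Graph emission value is checked, because applying the same single-factor product to the baseline rows yields smaller values than those printed in the table.

```python
GRID_INTENSITY = 0.475  # kgCO2/kWh, as stated in the Table 12 caption

# Average energy values copied from Table 12 (kWh).
avg_energy_kwh = {
    "TinyBERT": 0.00137,
    "DistilBERT": 0.00199,
    "FNet": 0.00894,
    "Longformer": 0.01953,
    "ESG-Graph (ours)": 0.000312,
}


def co2_kg(energy_kwh, intensity=GRID_INTENSITY):
    """Convert energy in kWh to kg of CO2 equivalent."""
    return energy_kwh * intensity


def energy_ratio(model, reference="ESG-Graph (ours)"):
    """Energy of `model` divided by the energy of the reference model."""
    return avg_energy_kwh[model] / avg_energy_kwh[reference]


esg_graph_co2 = round(co2_kg(avg_energy_kwh["ESG-Graph (ours)"]), 6)
ratios = {
    model: round(energy_ratio(model), 1)
    for model in avg_energy_kwh
    if model != "ESG-Graph (ours)"
}
print(esg_graph_co2)
print(ratios)
```

The largest ratio, for Longformer, is of the same order as the energy reduction of up to 60× quoted in the abstract.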

Table 13. Evolution of attention statistics across ESG datasets.

Dataset	Layer	Mean	Std.	Entropy	Skewness
Environmental	1	0.2203	0.1241	0.295	0.271
	2	0.2203	0.1570	0.276	0.866
	3	0.2203	0.1847	0.259	1.280
	4	0.2203	0.1996	0.248	1.297
Social	1	0.2204	0.1240	0.295	0.273
	2	0.2204	0.1568	0.276	0.869
	3	0.2204	0.1851	0.259	1.292
	4	0.2204	0.1996	0.248	1.300
Governance	1	0.2201	0.1213	0.296	0.170
	2	0.2201	0.1376	0.287	0.573
	3	0.2201	0.1554	0.277	0.875
	4	0.2201	0.1545	0.278	0.858

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Elouargui, Y.; Sassioui, A.; Chergui, M.; Benouini, R.; El Kamili, M.; Benyoussef, E.; Ouzzif, M. ESG-Graph: Hierarchical Residual Graph Attention Network with Analyst-Defined ESG Taxonomy. Technologies 2026, 14, 258. https://doi.org/10.3390/technologies14050258

