1. Introduction
The accelerating demand for sustainability transparency has elevated Environmental, Social, and Governance (ESG) reporting from a voluntary practice to a regulatory mandate. The European Union’s Sustainable Finance Disclosure Regulation (SFDR) and the Corporate Sustainability Reporting Directive (CSRD) alone now generate tens of millions of narrative words each year, describing corporate climate policies, labor practices, and governance frameworks [1, 2]. These disclosures, although rich in qualitative information, are produced at a scale that exceeds human analytical capacity. At the same time, the labeled annotation budget available for training automated classifiers remains limited in practice: domain-specific annotation requires expert knowledge and incurs significant cost, even when assisted by large language models [3, 4, 5]. This constraint motivates the low-data evaluation regime adopted in this work.
As a result, investment managers, auditors, and regulators must rely on automated text classification systems to process ESG disclosures efficiently under strict computational, operational, and regulatory constraints.
In response to this need, current state-of-the-art systems are predominantly based on large pretrained transformer models. Architectures such as BERT [6] and RoBERTa have become central to modern natural language processing, and their success has extended to ESG applications through domain-specific variants such as ESG-BERT [4]. By leveraging contextual representations, these models capture domain-relevant semantic patterns and establish strong empirical baselines for ESG classification tasks.
However, despite their strong performance, transformer-based approaches present several limitations in ESG contexts. First, their quadratic computational complexity with respect to sequence length [7] makes them unsuitable for long disclosures or real-time deployment on limited hardware. Second, they operate as black-box models, limiting interpretability and auditability [8]. Third, their substantial energy consumption raises concerns, as it contradicts the sustainability objectives that ESG analysis aims to support [9, 10]. Finally, these models treat text as linear token sequences, overlooking the inherent syntactic dependencies and structured domain knowledge embedded in ESG disclosures.
To mitigate these limitations, prior work has explored more efficient transformer variants. Knowledge distillation methods such as DistilBERT [11] and TinyBERT [12] reduce model size while preserving predictive performance. In parallel, architectures such as Longformer [13] and FNet [14] improve computational efficiency through sparse attention and alternative transformations.
Nevertheless, these approaches remain grounded in sequential token processing and do not explicitly model linguistic structure or domain knowledge, limiting their interpretability and alignment with regulatory requirements.
This limitation motivates alternative representations. Graph Neural Networks (GNNs) provide a flexible framework for modeling relational data. In NLP, text can be represented as a graph where nodes correspond to linguistic units and edges encode syntactic or semantic relationships [15, 16]. Through message passing, GNNs capture long-range dependencies and compositional structure beyond positional encoding.
However, despite these advantages, existing graph-based approaches have seen limited adoption in ESG applications, particularly in integrating structured regulatory knowledge.
Beyond structural modeling, integrating external domain knowledge can enhance robustness and interpretability. In this setting, symbolic knowledge refers to structured representations such as ontologies, knowledge graphs, or curated lexicons that encode relationships between domain concepts. In ESG analysis, these representations formalize regulatory notions (e.g., environmental risk, governance practices, or social impact) into machine-readable structures aligned with reporting standards.
Prior work shows that such knowledge can ground predictions in analyst-defined frameworks [17, 18]. This perspective aligns with neuro-symbolic AI, where statistical models are guided by structured priors to improve interpretability and consistency with expert reasoning.
In ESG, such knowledge is encoded through evolving taxonomies aligned with standards such as the European Sustainability Reporting Standards (ESRS).
However, most existing models—both transformer-based and graph-based—remain weakly connected to these structured knowledge sources, limiting their ability to produce interpretable and policy-aligned predictions.
To identify the research gap, we adopt a problem-driven analytical decomposition of ESG text classification. Rather than organizing prior work solely by model families, we analyze existing approaches through the fundamental requirements imposed by ESG applications.
Specifically, we consider three key dimensions: (i) computational efficiency, driven by large-scale disclosures and deployment constraints, (ii) structural modeling, required to capture syntactic and relational dependencies in text, and (iii) domain knowledge integration, necessary for alignment with regulatory frameworks and analyst-defined concepts.
We examine representative approaches, including transformer-based models, efficient architectures, graph-based methods, and knowledge-enhanced systems, through these dimensions. This unified perspective enables a structured comparison of otherwise heterogeneous methods and provides a transparent basis for identifying the research gap, as summarized in Table 1.
As shown in Table 1, no existing approach simultaneously achieves efficiency, structural expressiveness, and domain alignment. This fragmented landscape reveals a clear research gap that follows directly from the above analytical decomposition: the absence of a unified, lightweight framework capable of integrating relational structure and regulatory knowledge while maintaining computational efficiency.
This work is guided by the following research question: How can ESG text classification be designed to simultaneously achieve computational efficiency, interpretability, and alignment with evolving regulatory frameworks?
To address this gap, we argue that ESG text is inherently graph-structured. For example, phrases such as “reducing carbon emissions under board supervision” encode explicit dependencies between environmental and governance concepts.
Building on this observation, we propose ESG-Graph, a lightweight and interpretable graph neural architecture that models ESG disclosures as token-level dependency graphs augmented with ESRS-aligned taxonomy nodes. This design enables the joint modeling of syntactic structure and regulatory semantics, resulting in a system that is efficient, interpretable, and aligned with ESG reporting requirements. Importantly, ESG-Graph does not introduce a fundamentally new learning paradigm; rather, it integrates graph-based modeling and domain-specific knowledge within a unified architecture tailored to ESG constraints.
From a theoretical perspective, this work contributes to three complementary research streams. First, it advances neuro-symbolic learning by embedding domain-specific regulatory knowledge directly into graph-based neural architectures, enabling structured reasoning over ESG concepts. Second, it contributes to explainable AI (XAI) in financial text analysis by providing transparent, node-level representations aligned with analyst-defined taxonomies. Third, it extends representation learning for structured text by demonstrating how syntactic dependencies and domain knowledge can be jointly encoded as inductive biases within graph neural networks.
Together, these contributions frame ESG text classification as a structured reasoning problem over hybrid symbolic–neural representations, where regulatory concepts act as explicit constraints guiding model behavior.
Our contributions are as follows:
We introduce ESG-Graph, a graph neural architecture that jointly models syntactic dependencies and ESG domain concepts.
We propose an evolutive ESG taxonomy, allowing dynamic integration of new regulatory concepts without retraining.
We demonstrate performance competitive with efficient transformer baselines while using significantly fewer parameters and up to 60× less energy.
We provide transparent, policy-aligned explanations through gradient-based attribution mechanisms.
Building on these contributions, the remainder of this paper is structured as follows. Section 2 presents the materials and methods. Section 3 reports the results and discussion. Finally, Section 4 concludes the paper.
2. Materials and Methods
This section describes the proposed ESG-Graph framework in detail, including the overall architecture, the construction of ESG aware graph representations, and the graph neural processing used for classification. The presentation follows the full pipeline from raw text input to final ESG relevance prediction.
2.1. Architecture Overview
Figure 1 provides a visual summary of the proposed ESG-Graph architecture, illustrating the interaction between linguistic preprocessing, ESG aware graph construction, and graph neural message passing. All architectural components presented in the figure are explained in the following subsections.
Our approach converts ESG-related text segments into structured token-level graphs enhanced with domain knowledge provided by an analyst-defined ESG taxonomy aligned with the European Sustainability Reporting Standards (ESRS).
The model integrates four components:
- (i) Syntactic dependency relations capturing the grammatical structure of the sentence.
- (ii) ESG subtopic anchor nodes encoding high-level thematic categories curated by domain analysts and aligned with ESRS disclosure scopes.
- (iii) A sentence-level ESG node aggregating all ESG-relevant features detected in the text.
- (iv) A multilayer Graph Attention Network (GAT) that performs message passing to produce an ESG relevance prediction.
2.2. Analyst Defined ESG Taxonomy for Domain Aware Subtopic Annotation
To inject ESG knowledge into the graph, an ESG taxonomy is curated by in-house analysts to reflect regulatory concepts aligned with ESRS reporting scopes.
The ESRS framework provides a conceptual structure for the Environmental, Social, and Governance pillars, including scopes such as Climate Change (E1), Pollution (E2), Own Workforce (S1), Consumers and End users (S4), and Governance, Risk Management and Internal Control (G1). Analysts integrate these scopes by associating each subtopic with defined keyword sets to represent regulatory terminology and ESG concepts.
Each subtopic is associated with specific terms: for example, "emission", "carbon", and "renewable" for Climate Change; "diversity", "training", and "safety" for Own Workforce; and "audit", "compliance", and "oversight" for Governance. Representative examples appear in Table 2, and the complete taxonomy is provided in Appendix A.
During graph construction, tokens matching a subtopic keyword are connected to their corresponding subtopic anchor nodes, which act as local aggregators. All subtopic nodes are then linked to a global ESG node summarizing the sustainability content of the entire document. This design enables the model to integrate ESRS-aligned thematic structure.
The analyst defined ESG taxonomy has three main functions. First, it injects domain knowledge aligned with regulatory frameworks while remaining flexible to evolving disclosure practices. Second, it enhances interpretability, as downstream analyses such as node importance can be expressed directly in terms of ESG subtopics rather than uninterpretable embedding dimensions. Third, it provides weak supervision signals that improve the model’s generalization ability. Importantly, the taxonomy is evolutive by design, allowing new keywords and emerging ESG themes to be incorporated without modifying the underlying model architecture.
The keyword sets were constructed through a team-based process in which analysts independently proposed keywords for each subtopic without prior consultation. A keyword was retained in the final taxonomy only if it appeared in the lists of a majority of analysts; keywords proposed by a single analyst were systematically rejected unless unanimously reinstated upon group review. This intersection-based acceptance criterion approximates a consensus-based annotation protocol, ensuring that the final vocabulary reflects shared professional judgment rather than any individual interpretation, thereby reducing idiosyncratic bias.
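The acceptance rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the input format (one keyword set per analyst), the function name `consensus_keywords`, and the `reinstated` argument are hypothetical, and "majority" is read as a strict majority of analysts.

```python
from collections import Counter


def consensus_keywords(proposals, reinstated=frozenset()):
    """Keep keywords proposed by a strict majority of analysts.

    proposals:  list of keyword sets, one per analyst (hypothetical format).
    reinstated: keywords unanimously reinstated during group review.
    """
    n_analysts = len(proposals)
    # Count in how many analysts' lists each keyword appears.
    counts = Counter(kw for keywords in proposals for kw in set(keywords))
    majority = {kw for kw, c in counts.items() if c > n_analysts / 2}
    # Assumption: only keywords that were actually proposed can be reinstated.
    return majority | (set(reinstated) & set(counts))


# Made-up proposals from three analysts for one subtopic.
proposals = [
    {"emission", "carbon", "renewable"},
    {"emission", "carbon", "offset"},
    {"emission", "renewable", "footprint"},
]
accepted = consensus_keywords(proposals)
```

In this toy case, keywords proposed by a single analyst are dropped unless they are passed explicitly through `reinstated`.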
Adapting the taxonomy to alternative reporting frameworks such as SASB or TCFD requires updating the keyword sets and subtopic labels, without requiring retraining of the core GNN architecture. In practice, a major framework transition involves three steps: (i) aligning existing ESRS subtopics with semantically equivalent concepts in the target framework, leveraging the substantial conceptual overlap across ESG standards; (ii) updating keyword sets for subtopics with no direct equivalent; and (iii) rerunning graph construction with the updated vocabulary, a fully automated step. Steps (i) and (ii) are estimated to require approximately 2 to 4 person-days of analyst effort for a complete framework transition.
2.3. Processing Pipeline
Given an input sentence, the model pipeline proceeds in three steps:
Preprocessing: Tokens are extracted and mapped to pretrained embeddings.
Graph Construction: A syntactic dependency graph is generated and enriched with ESRS subtopic nodes and a global ESG node, producing a hybrid linguistic domain graph structure.
Graph Neural Processing: Multilayer attention-based message passing propagates information across token, subtopic, and global nodes. Pooled graph embeddings are passed to a feed forward classifier for binary ESG relevance prediction.
This methodology combines grammatical structure with ESG domain knowledge to provide a domain-aware text classifier.
2.4. Preprocessing
Each input sentence $s$ is tokenized into $n$ non-stopword units $w_1, \dots, w_n$, and each token $w_i$ is embedded via a pretrained GloVe vector [19], $x_i \in \mathbb{R}^{d}$. This produces the initial feature matrix $X = [x_1; \dots; x_n] \in \mathbb{R}^{n \times d}$. All nodes in the graph, including tokens, subtopic anchors, and the sentence-level ESG anchor, share this same dimensionality $d$ for architectural homogeneity.
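As a rough illustration of this preprocessing step, the sketch below filters stopwords and looks up one vector per remaining token. The stopword list, the three-dimensional toy vectors, the whitespace tokenizer, and the zero vector for out-of-vocabulary tokens are all assumptions made for the example; they are not taken from the paper.

```python
# Illustrative stopword list; the list actually used is not specified here.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to", "under"}

# Hypothetical 3-dimensional vectors standing in for pretrained GloVe embeddings.
TOY_EMBEDDINGS = {
    "reducing": [0.1, 0.3, 0.0],
    "carbon": [0.9, 0.1, 0.2],
    "emissions": [0.8, 0.2, 0.1],
    "board": [0.0, 0.7, 0.6],
}


def preprocess(sentence, embeddings, dim=3):
    """Return the non-stopword tokens and one feature row per token."""
    tokens = [w for w in sentence.lower().split() if w not in STOPWORDS]
    # Assumption: out-of-vocabulary tokens receive a zero vector.
    features = [list(embeddings.get(w, [0.0] * dim)) for w in tokens]
    return tokens, features


tokens, features = preprocess(
    "Reducing carbon emissions under board supervision", TOY_EMBEDDINGS
)
```

The resulting `features` list plays the role of the initial feature matrix, with one row per token node.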
2.5. Graph Construction
2.5.1. Syntactic Dependency Graph
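We parse the sentence using a dependency parser [20] and create bidirectional edges for each head-dependent pair, i.e., both $(h, c)$ and $(c, h)$ whenever token $h$ is the syntactic head of token $c$. If the parser fails, we fall back to a bidirectional chain graph that links each token $i$ to its successor $i + 1$ in both directions.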
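The edge construction and its fallback can be sketched as follows. To keep the example free of parser dependencies, the parse is passed in as a list of head indices in which the root points to itself (the convention spaCy uses for `token.head`). Treating a missing or length-mismatched head list as "parser failure" is an assumption of this sketch, as are the function names.

```python
def dependency_edges(heads):
    """Bidirectional edges for every head-dependent pair.

    heads[i] is the index of token i's syntactic head; the root points to itself.
    """
    edges = set()
    for child, head in enumerate(heads):
        if head != child:  # skip the root's self-reference
            edges.add((head, child))
            edges.add((child, head))
    return edges


def chain_edges(n):
    """Bidirectional chain linking each token to its successor."""
    forward = {(i, i + 1) for i in range(n - 1)}
    backward = {(j, i) for i, j in forward}
    return forward | backward


def syntactic_edges(n, heads=None):
    """Use the parse when available; otherwise fall back to the chain graph."""
    if heads is None or len(heads) != n:  # assumed definition of "parser fails"
        return chain_edges(n)
    return dependency_edges(heads)
```

For a three-token sentence whose middle token is the root, `syntactic_edges(3, [1, 1, 1])` and the chain fallback `syntactic_edges(3)` happen to produce the same four edges.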
2.5.2. ESRS Subtopic Anchor Nodes
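Each ESG subtopic $t$ has a curated lexicon $\mathcal{L}_t$. For each subtopic, if any token matches its lexicon, we create a subtopic anchor node $a_t$ and connect it bidirectionally to all matched tokens. The embedding of the anchor node is computed as the mean vector of its matched tokens, $x_{a_t} = \frac{1}{|M_t|} \sum_{i \in M_t} x_i$, where $M_t$ is the set of tokens matching $\mathcal{L}_t$ and $x_i$ is the embedding of token $i$.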
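A minimal sketch of the anchor-node step is given below. Exact string matching against the lexicon, the node-indexing scheme (anchors numbered after the tokens), and the function name are assumptions for illustration; the keyword examples are the ones quoted in Section 2.2.

```python
def add_subtopic_anchors(tokens, features, lexicons):
    """Create one anchor per subtopic that has at least one matched token.

    Returns (anchor_names, anchor_features, edges); anchor k is assigned
    node index len(tokens) + k (an indexing convention assumed here).
    """
    anchors, anchor_features, edges = [], [], set()
    dim = len(features[0]) if features else 0
    for subtopic, lexicon in lexicons.items():
        # Assumption: a token matches when it appears verbatim in the lexicon.
        matched = [i for i, tok in enumerate(tokens) if tok in lexicon]
        if not matched:
            continue
        anchor_id = len(tokens) + len(anchors)
        # Anchor embedding = mean vector of the matched token embeddings.
        mean = [sum(features[i][k] for i in matched) / len(matched) for k in range(dim)]
        anchors.append(subtopic)
        anchor_features.append(mean)
        for i in matched:
            edges.add((i, anchor_id))
            edges.add((anchor_id, i))
    return anchors, anchor_features, edges


tokens = ["carbon", "emission", "targets", "audit"]
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, 2.0]]  # toy 2-d vectors
lexicons = {
    "Climate Change": {"emission", "carbon", "renewable"},
    "Own Workforce": {"diversity", "training", "safety"},
    "Governance": {"audit", "compliance", "oversight"},
}
anchors, anchor_features, anchor_edges = add_subtopic_anchors(tokens, features, lexicons)
```

Subtopics with no matched token, such as Own Workforce in this toy sentence, produce no anchor node at all.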
2.5.3. Sentence Level ESG Anchor Node
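To aggregate sustainability cues, we introduce a sentence-level ESG anchor node whose embedding has the same dimensionality as the token embeddings. This node is connected to all tokens. The final augmented graph thus comprises the token nodes, the subtopic anchor nodes, and the sentence-level ESG anchor, linked by the dependency (or fallback chain) edges and the anchor edges.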
2.6. Graph Neural Processing
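At each layer $l$, node $i$ aggregates information from its neighbors using a graph attention mechanism [21], in which learned attention coefficients weight the contribution of each neighbor. After $L$ layers, the final graph representation is obtained by mean pooling over all nodes, $h_G = \frac{1}{|V|} \sum_{i \in V} h_i^{(L)}$, where $V$ is the node set and $h_i^{(L)}$ is the representation of node $i$ after the last layer. The classifier maps $h_G$ to a binary prediction.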
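For readers who want the update rule spelled out, the standard single-head graph attention layer from the GAT literature is given below as a reference sketch, followed by mean pooling and a sigmoid output. Whether ESG-Graph uses exactly this form is an assumption: the number of attention heads, any residual connections or normalization, and the output nonlinearity may differ, and the symbols $W^{(l)}$, $a^{(l)}$, $\sigma$, and $f_\theta$ are introduced here purely for illustration.

```latex
\begin{align}
e_{ij}^{(l)} &= \mathrm{LeakyReLU}\!\left( a^{(l)\top} \left[ W^{(l)} h_i^{(l)} \,\Vert\, W^{(l)} h_j^{(l)} \right] \right), \\
\alpha_{ij}^{(l)} &= \frac{\exp\!\left(e_{ij}^{(l)}\right)}{\sum_{k \in \mathcal{N}(i)} \exp\!\left(e_{ik}^{(l)}\right)}, \\
h_i^{(l+1)} &= \sigma\!\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} \, W^{(l)} h_j^{(l)} \right), \\
h_G &= \frac{1}{|V|} \sum_{i \in V} h_i^{(L)}, \qquad
\hat{y} = \mathrm{sigmoid}\!\left( f_\theta(h_G) \right).
\end{align}
```

Here $\mathcal{N}(i)$ denotes the neighbors of node $i$ (including $i$ itself in the original GAT formulation), $\Vert$ is vector concatenation, and $h_i^{(0)}$ is the initial node embedding.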
2.7. Algorithm Summary
The full processing pipeline for converting a sentence into an ESG-structured graph and producing a classification output is summarized in Algorithm 1.
| Algorithm 1 Token-level ESG graph construction and classification |
Require: Sentence s; ESRS lexicons
Ensure: Predicted ESG label
Token Embedding:
    for each token in s do
        map the token to its pretrained embedding
    end for
Syntactic Graph Structure:
    build dependency edges
    if parser fails then
        use bidirectional chain edges instead
    end if
Initialize Graph:
    add the token nodes and the syntactic edges
Subtopic Anchor Nodes:
    for each ESRS subtopic t do
        collect the tokens matching the lexicon of t
        if at least one token matches then
            create subtopic anchor node
            set its embedding to the mean of the matched token embeddings
            for each matched token do
                add edges (token, anchor) and (anchor, token) to the graph
            end for
        end if
    end for
Sentence-Level ESG Anchor Node:
    if … then
        create sentence-level ESG anchor node
        for i = 1 to n do
            add edges (token i, anchor) and (anchor, token i) to the graph
        end for
    end if
Graph Neural Network:
    apply message passing {L-layer GAT}
Graph Readout:
    mean-pool the node representations
Prediction:
    classify the pooled representation
2.8. Experimental Setup
This section describes the datasets, hardware environment, and training protocol used to evaluate ESG-Graph.
2.8.1. Datasets
We evaluate our models on three equally sized subsets of the ESG-BERT benchmark corpus [4]: environment_2k, social_2k, and governance_2k. Each subset contains 2000 labeled short text segments corresponding to one of the three ESG dimensions: Environmental (E), Social (S), or Governance (G). For each subset, the task is formulated as a binary classification problem, where the model predicts whether a given text segment is related to the target ESG pillar.
The datasets consist of short documents extracted from corporate reports and regulatory disclosures, including policy statements, sustainability sections, and governance-related mentions.
Table 3 summarizes the token length statistics.
2.8.2. Graph Construction Robustness
Across all three benchmark datasets (6000 documents), the dependency parser successfully processed 99.85% of sentences. The chain-graph fallback was triggered for only 9 documents (0.15%), corresponding to short or structurally malformed fragments in the validation splits. These cases were distributed as follows: 4 in Environmental (0.20%), 3 in Governance (0.15%), and 2 in Social (0.10%).
To assess whether fallback usage impacts predictive performance, we compared model outputs on fallback cases against standard parsed sentences. No systematic degradation in classification performance was observed, and the contribution of fallback samples to the overall evaluation metrics was negligible due to their extremely low frequency.
This negligible fallback rate, combined with the absence of measurable performance degradation, confirms that the dependency parser operates reliably on ESG disclosure text and that the fallback mechanism serves as a rarely-invoked safety net rather than a primary processing path.
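The fallback percentages quoted above follow directly from the reported counts, as the short check below shows. It only re-derives the figures stated in this subsection; the variable names are illustrative.

```python
# Fallback counts reported per dataset (2000 documents each).
fallbacks = {"Environmental": 4, "Governance": 3, "Social": 2}
docs_per_dataset = 2000

total_fallbacks = sum(fallbacks.values())
total_docs = docs_per_dataset * len(fallbacks)

overall_rate = 100 * total_fallbacks / total_docs  # percent of all documents
parser_success = 100 - overall_rate                # percent parsed successfully
per_dataset_rate = {
    name: 100 * count / docs_per_dataset for name, count in fallbacks.items()
}
```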
2.8.3. Scalability Considerations
While experiments are conducted on 2000-document subsets, ESG-Graph operates at the sentence level and processes each sentence as an independent graph. This design allows linear scaling to larger corpora, as the number of nodes and edges grows proportionally with input size without increasing model complexity.
In practice, large-scale ESG corpora can be processed efficiently through parallelized graph construction and batched inference, suggesting that the framework can be extended to collections containing millions of disclosures.
2.8.4. Hardware and Computational Environment
All experiments were conducted on a single laptop equipped with an NVIDIA RTX 4070 Laptop GPU (8 GB VRAM) and 60 GB of system memory. By limiting training to an 8 GB GPU, we demonstrate that ESG text classification can be achieved on cost-efficient hardware.
All experiments were implemented in Python (v3.10) using PyTorch (v2.1) and PyTorch Geometric (v2.4). Dependency parsing was performed using spaCy (v3.7). The CodeCarbon framework (v3.0.7) was used to measure energy consumption and carbon emissions.
2.8.5. Training Configuration
We used the same training pipeline for ESG-Graph and all transformer baselines to ensure a fair comparison between models. Training was performed using mini-batches of 16 documents over 15 epochs. Optimization was carried out using the AdamW optimizer [
22], with an initial learning rate of
and a weight decay of
to improve convergence and regularization. A linear learning rate decay schedule was applied throughout training to further improve optimization stability.
Model performance was monitored using the validation F1 score, and early stopping was applied as soon as one epoch passed without improvement (a patience of one epoch). To account for randomness, each experiment was repeated three times using different random seeds (11, 17, and 23), and the reported results are averaged across these runs.
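The early-stopping rule and the averaging over seeds can be sketched as follows. The validation curves are made-up numbers used only to exercise the logic, and the function name is hypothetical; the actual training loop, optimizer settings, and scores are not reproduced here.

```python
from statistics import mean


def best_epoch_with_early_stopping(val_f1_per_epoch, patience=1):
    """Return (best_epoch, best_f1), stopping after `patience` epochs
    without improvement in validation F1."""
    best_f1, best_epoch, stale = float("-inf"), -1, 0
    for epoch, f1 in enumerate(val_f1_per_epoch):
        if f1 > best_f1:
            best_f1, best_epoch, stale = f1, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # later epochs are never evaluated
    return best_epoch, best_f1


# Made-up validation F1 curves for the three seeds, for illustration only.
curves = {
    11: [0.80, 0.85, 0.84, 0.90],
    17: [0.78, 0.83, 0.86, 0.85],
    23: [0.81, 0.80, 0.88],
}
per_seed_best = {seed: best_epoch_with_early_stopping(c)[1] for seed, c in curves.items()}
reported_f1 = mean(per_seed_best.values())
```

With a patience of one, training for seed 11 stops at the first non-improving epoch, so the later value of 0.90 is never reached; this is the trade-off of such a strict stopping criterion.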
3. Results and Discussion
This section presents a comprehensive empirical evaluation of ESG-Graph, analyzing its predictive performance, computational efficiency, and robustness in comparison with transformer-based baselines and large language models.
We begin by evaluating predictive performance across ESG domains.
3.1. Dataset-Wise Benchmarks
We compare ESG-Graph with lightweight Transformer baselines (TinyBERT, DistilBERT, FNet, and Longformer) and zero-shot large language models (GPT-4o and Gemini-2.5) across the three ESG datasets to assess both predictive performance and model efficiency. All models are evaluated under identical conditions on balanced 2000-document splits to ensure fair comparison. Reported metrics include Accuracy and F1-score, together with the approximate number of trainable parameters and effective sequence length for each architecture. Training efficiency is analyzed separately in terms of wall-clock epoch time and peak GPU memory usage.
3.1.1. Environmental Dataset
The Environmental dataset focuses on climate-related disclosures, energy usage, emissions, and sustainability practices. Results are summarized in Table 4. ESG-Graph attains an accuracy of 93.3% with an F1-score of 92.6%, while requiring only ∼1.65 M parameters. Although Longformer achieves the highest overall accuracy and F1-score, ESG-Graph offers a strong balance between predictive performance and model efficiency.
This result indicates that ESG-Graph achieves near state-of-the-art performance while operating under significantly lower model complexity, confirming the effectiveness of structural and domain-aware representations.
3.1.2. Social Dataset
The Social dataset includes disclosures on workforce, diversity, community engagement, and social impact. Results are summarized in Table 5. ESG-Graph achieves an accuracy of 88.5% with an F1-score of 88.1%, while using only ∼1.65 M parameters. Although larger Transformer-based models attain slightly higher peak F1-scores, ESG-Graph provides a favorable trade-off between predictive performance and training cost.
The lower performance observed across all models on this dataset reflects the inherently ambiguous and context-dependent nature of social disclosures, which are less lexically standardized than environmental or governance reporting.
3.1.3. Governance Dataset
The Governance dataset focuses on compliance, transparency, and ethical governance disclosures. Results are summarized in Table 6. ESG-Graph achieves an accuracy of 87.0% and an F1-score of 83.2%, while requiring only ∼1.65 M parameters, outperforming all baseline methods on this dataset.
This strong performance can be attributed to the more formalized and consistent vocabulary used in governance disclosures, which aligns closely with the G1 taxonomy and enables precise subtopic anchoring.
Across all three datasets, a consistent pattern emerges: ESG-Graph maintains competitive or superior performance relative to transformer baselines while requiring orders of magnitude fewer parameters.
Zero-shot LLMs exhibit variable performance across datasets, reflecting limited robustness in low-data ESG classification settings. In contrast, ESG-Graph delivers consistently competitive Accuracy and F1-scores while maintaining an extremely lightweight architecture.
The performance differential across ESG pillars reflects domain-specific linguistic properties. Governance disclosures rely on a compact, unambiguous regulatory vocabulary that aligns precisely with the G1 taxonomy lexicon, enabling highly discriminative anchor node matching; this helps explain why ESG-Graph outperforms all baselines on this pillar. Social disclosures, by contrast, are expressed in more varied narrative forms: keywords such as "employee" or "community" frequently appear in non-ESG corporate communication, reducing anchor node precision. This lexical ambiguity is intrinsic to the Social domain and explains the consistently lower F1-scores observed across all models on this pillar, not only ESG-Graph.
In sum, ESG-Graph achieves competitive ESG classification performance with orders of magnitude fewer parameters than standard Transformer models, while removing architectural sequence-length constraints and substantially reducing computational cost.
3.2. Training Efficiency
While predictive performance is critical, practical ESG deployment also requires strong computational efficiency. We therefore evaluate ESG-Graph against Transformer-based baselines in terms of wall-clock training time per epoch and peak GPU memory consumption.
ESG-Graph achieves the lowest training time and memory footprint among all evaluated models, requiring approximately 4.0× less memory than TinyBERT (0.18 GB vs. 0.72 GB) and roughly 15.6× less than Longformer (0.18 GB vs. 2.81 GB), while also providing an order-of-magnitude speedup in epoch time. These results indicate that ESG-Graph offers an efficient alternative for ESG document classification, particularly in limited-resource or real-time settings.
At inference time, ESG-Graph processes approximately 220 sentence-level graphs per second on the evaluation hardware, enabling a 500-page annual report (approximately 15,000 sentences) to be fully classified in under 70 s. This throughput is achievable on consumer hardware without server-grade GPU infrastructure, making the system accessible to smaller regulatory bodies and asset managers operating under computational constraints. Furthermore, because GNN message passing scales with the number of edges in the graph rather than quadratically with sequence length, as self-attention does, inference latency grows gracefully with input length.
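The headline efficiency figures in this subsection can be re-derived from the quoted measurements with simple arithmetic. The snippet below uses only numbers stated above; the variable names are illustrative.

```python
# Inference throughput: sentences in a 500-page report / graphs per second.
sentences_per_report = 15_000
graphs_per_second = 220
seconds_per_report = sentences_per_report / graphs_per_second  # about 68.2 s

# Peak GPU memory (GB) quoted for each model.
peak_memory_gb = {"ESG-Graph": 0.18, "TinyBERT": 0.72, "Longformer": 2.81}
memory_ratio = {
    name: peak_memory_gb[name] / peak_memory_gb["ESG-Graph"]
    for name in ("TinyBERT", "Longformer")
}
```

The report-level figure assumes that throughput stays at the measured rate for the whole document, which is a simplification.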
3.3. Long-Sequence Scalability Analysis
To assess the scalability of ESG-Graph on long ESG disclosures, we evaluated the model on five extended paragraphs constructed from real ESG report sections, with token lengths exceeding 1000 tokens (range: 1048–1352). These inputs approximate long-form disclosure segments commonly encountered in corporate sustainability and governance reporting.
For each input, we measured per-graph inference latency, peak GPU memory consumption, and node representation variance, with the latter serving as an indicator of potential over-smoothing effects. Results are reported in Table 8, alongside a baseline measured on standard-length validation samples.
Three observations can be drawn from these results. First, peak GPU memory usage remains nearly constant, increasing by approximately 0.44 MB (1.7%) despite the substantial increase in input length. This suggests that memory requirements remain largely stable under longer input conditions within the evaluated range.
Second, inference latency increases by a factor of 4.07× relative to the baseline, where the baseline corresponds to standard sentence-level inputs in the evaluation datasets (average length ∼30–40 tokens) and is computed under the same experimental pipeline. This increase reflects the larger graph sizes induced by longer sequences. However, the absolute latency remains moderate (maximum 24.67 ms per graph), indicating that extended disclosures can be processed efficiently in practical settings.
Third, node representation variance increases for longer inputs (0.328 vs. 0.220), indicating that node representations remain well differentiated across the graph. This pattern suggests that the model does not exhibit pronounced over-smoothing behavior as graph size increases. This behavior is consistent with the use of residual connections and normalization layers in the ESG-Graph architecture, which help stabilize message passing across larger graphs.
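To make the over-smoothing indicator concrete, the sketch below computes one plausible version of "node representation variance": the variance across nodes, averaged over embedding dimensions. The exact definition used for Table 8 is not given in the text, so this formula and the function name are assumptions.

```python
from statistics import pvariance


def node_representation_variance(node_embeddings):
    """Mean over feature dimensions of the (population) variance across nodes.

    A value near zero means all nodes carry almost the same vector,
    which is the signature of over-smoothing.
    """
    dims = list(zip(*node_embeddings))  # one tuple of values per dimension
    return sum(pvariance(d) for d in dims) / len(dims)


# Toy examples: identical node vectors versus well-separated ones.
collapsed = [[1.0, 2.0]] * 4
spread = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
collapsed_var = node_representation_variance(collapsed)
spread_var = node_representation_variance(spread)
```

Under this definition, a higher value on long inputs (as reported above) is read as evidence that node representations have not collapsed toward a common vector.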
3.4. Ablation Study: Contribution of ESG Structural Nodes
To quantify the contribution of each architectural component introduced in ESG-Graph, we conduct an ablation study on the role of the ESRS subtopic anchor nodes and the Global ESG node.
The objective of this analysis is to isolate the effect of each component and verify that performance gains arise from architectural design.
We evaluate four variants of the model:
Full ESG-Graph: the complete architecture described in Section 2.
Without Subtopic Nodes: ESRS subtopic anchor nodes are removed; only token nodes and syntactic dependency edges are retained.
Without Global ESG Node: the sentence-level ESG aggregation node is removed.
Without Subtopic & Global Nodes: both ESG-specific node types are removed.
All experiments were conducted under identical training configurations, graph neural layers, and hyperparameters. Results are summarized in Table 9 and reported as mean F1-score over three random seeds.
Discussion
Removing ESRS subtopic anchor nodes leads to a reduction in F1-score across all datasets. This effect is clearer on the Social dataset, which reflects the fact that social disclosures are closer to everyday language and often describe human-related topics such as working conditions, diversity, or community engagement. Unlike Governance or Environmental reporting, which rely on more technical terminology, social topics are expressed in more varied narrative forms.
Removing the Global ESG node leads to a smaller decrease in performance, indicating that global aggregation provides complementary signals beyond local features. The larger impact observed for Governance disclosures suggests that classification in this category relies more on global document patterns.
The strongest drop is observed when both ESG-specific node types are removed, reducing the architecture to a syntactic dependency graph. While the resulting architecture remains competitive, the performance gap suggests that the ESG-aware components contribute positively to discriminative capacity. Beyond the observed F1 improvement, the hierarchical anchor nodes serve a function that cannot be captured by accuracy metrics alone: they produce named, subtopic-level attribution that is directly legible to ESG practitioners, enabling users to understand not just the classification outcome but which regulatory theme drove that assignment. The added architectural complexity is therefore justified on interpretability grounds independently of its predictive contribution.
3.5. Ablation Study: Effect of Word Embedding Models
In addition to structural inductive biases, we investigate the sensitivity of ESG-Graph to the choice of underlying word embedding model. While the proposed architecture is embedding-agnostic, different static representations may capture ESG-relevant semantics with varying granularity, particularly for domain-specific terminology.
We compare two lightweight pretrained embedding models: FastText and GloVe.
FastText incorporates subword information, which is expected to be beneficial for handling rare terms, morphological variations, and domain-specific vocabulary. GloVe, on the other hand, captures global co-occurrence statistics and has been shown to produce stable semantic representations across diverse text domains.
For this ablation, all architectural components of ESG-Graph are kept identical, including graph construction, attention layers, and training hyperparameters. The only difference lies in the initialization of token node embeddings. Results are reported as mean F1-score over three random seeds.
Discussion
Across all three ESG pillars, GloVe embeddings yield consistently higher performance than FastText, although the observed margin remains modest. This pattern suggests that ESG-Graph is not strongly dependent on a specific embedding choice and that its performance is primarily driven by the graph-based inductive biases introduced by the architecture.
The slight advantage observed for GloVe may be related to its ability to capture global semantic regularities. Conversely, the competitive performance of FastText indicates that subword-level information provides limited additional benefit once ESG-specific taxonomy is defined.
Beyond the GloVe vs. FastText comparison, we note that the choice of static embeddings over domain-adapted or contextualized alternatives such as ESG-BERT was deliberate on three grounds. First, it ensures that performance differences across model variants are attributable to graph structure rather than encoder capacity, preserving the architectural isolation necessary to interpret ablation results. Second, static embeddings introduce zero additional trainable parameters, keeping the model’s lightweight profile intact. Third, stacking transformer-derived contextual features on top of a graph attention network operating in a shallow layer regime risks representational redundancy and over-smoothing. The results in
Table 10 empirically support this design choice: even the simpler FastText variant remains competitive, confirming that performance is primarily driven by the graph structure rather than the embedding layer.
3.6. Keyword Sensitivity Analysis
To assess the model’s robustness to variations in the analyst-defined keyword sets, we conducted a systematic perturbation study in which keywords were randomly removed from each subtopic lexicon at three removal rates: 10%, 20%, and 30%. For each rate, five independent random draws were performed and the mean F1-score was recorded across all three domains. Results are reported in
Table 11.
Across all three domains and perturbation levels, the maximum observed F1 drop is 0.38 points (Environmental at −30%), indicating limited performance degradation under perturbation. This suggests that the model does not critically depend on any specific keyword subset and generalizes robustly across reasonable keyword set variations. The intersection-based consensus process therefore does not need to achieve perfect keyword coverage: the model’s graph-based representations compensate for minor vocabulary gaps.
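The perturbation protocol described above can be sketched as follows. This is a minimal illustration rather than the study's actual code: the toy lexicon, the function name `perturb_lexicon`, the per-subtopic rounding rule, and the use of seeds 0 to 4 are all assumptions, and the retraining and F1 evaluation step is omitted.

```python
import random

def perturb_lexicon(lexicon, removal_rate, rng):
    """Return a copy of `lexicon` with a random fraction of keywords removed per subtopic."""
    perturbed = {}
    for subtopic, keywords in lexicon.items():
        # Assumption: the number of removed keywords is rounded per subtopic.
        n_remove = round(len(keywords) * removal_rate)
        removed = set(rng.sample(sorted(keywords), n_remove))
        perturbed[subtopic] = [kw for kw in keywords if kw not in removed]
    return perturbed

# Hypothetical toy lexicon; the real subtopic lexicons are analyst-defined.
lexicon = {
    "G1": ["board", "directors", "governance", "audit", "compliance",
           "oversight", "independence", "remuneration", "ethics", "shareholder"],
    "E1": ["emissions", "carbon", "climate", "renewable", "energy",
           "waste", "water", "biodiversity", "pollution", "recycling"],
}

RATES = (0.10, 0.20, 0.30)   # removal rates used in the study
N_DRAWS = 5                  # independent random draws per rate

# One perturbed lexicon per (rate, draw); each would feed a separate evaluation run.
draws = {
    rate: [perturb_lexicon(lexicon, rate, random.Random(seed)) for seed in range(N_DRAWS)]
    for rate in RATES
}
```

Seeding each draw separately keeps the perturbed lexicons reproducible, so the mean F1-score per removal rate can be recomputed from the same keyword subsets.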
3.7. Energy, Emissions, and Comparative Efficiency Landscape
To quantify the environmental and computational footprint of all architectures, we measure energy usage with the CodeCarbon v3.0.7 framework (https://github.com/mlco2/codecarbon, accessed on 13 April 2026). The system samples GPU and CPU power draw at 1 Hz and computes electricity consumption (kWh) and CO2 emissions following [23].
All experiments were executed under identical hardware conditions (NVIDIA RTX 4070 Laptop GPU, Intel i9-13900H, 60 GB RAM).
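To make the measurement principle concrete, the sketch below integrates power samples taken at a fixed interval into energy and converts the result to emissions. It is a simplified illustration, not CodeCarbon's implementation: the function names, the constant 90 W power trace, and the 0.4 kg/kWh carbon intensity are hypothetical values chosen for the example.

```python
def energy_kwh(power_watts, interval_s=1.0):
    """Integrate sampled power (W) with a rectangle rule and return energy in kWh."""
    joules = sum(p * interval_s for p in power_watts)
    return joules / 3.6e6  # 1 kWh = 3.6e6 J

def emissions_kg(energy, intensity_kg_per_kwh):
    """Convert energy (kWh) to emissions (kg CO2) for a given grid carbon intensity."""
    return energy * intensity_kg_per_kwh

# Hypothetical trace: 60 one-second samples (1 Hz) of combined GPU + CPU power.
samples = [90.0] * 60
run_energy = energy_kwh(samples)                  # 5400 J, i.e. 0.0015 kWh
run_emissions = emissions_kg(run_energy, 0.4)     # hypothetical intensity
```

Because emissions scale linearly with the assumed carbon intensity, comparisons between models run on the same hardware and grid depend only on their relative energy consumption.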
3.7.1. Energy and Emissions
Distilled transformer models are the most energy-efficient baselines, consuming between 1.37 and 1.99 Wh per run. In comparison, FNet requires a moderate amount of energy (8.94 Wh), placing it between distilled models and larger architectures. Longformer, however, is the most energy-intensive model (19.53 Wh), mainly due to its larger memory footprint.
By comparison, ESG-Graph exhibits a very low computational footprint, requiring 0.000312 kWh (0.312 Wh) per run and emitting 0.000148 kg CO2.
Relative to Transformer baselines, ESG-Graph operates in a lower energy regime, consuming approximately:
4.4× less energy than TinyBERT,
6.4× less energy than DistilBERT,
28.7× less energy than FNet,
62.6× less energy than Longformer.
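The ratios listed above follow directly from the per-run energy figures. The short check below reproduces them; assigning the 1.37 Wh and 1.99 Wh endpoints to TinyBERT and DistilBERT respectively is inferred from the stated ratios, and the implied carbon intensity is derived from the two ESG-Graph figures rather than reported in the text.

```python
# Per-run energy in Wh, as reported in this section.
BASELINE_WH = {
    "TinyBERT": 1.37,     # inferred assignment of the lower distilled endpoint
    "DistilBERT": 1.99,   # inferred assignment of the upper distilled endpoint
    "FNet": 8.94,
    "Longformer": 19.53,
}
ESG_GRAPH_KWH = 0.000312
ESG_GRAPH_KG_CO2 = 0.000148

esg_graph_wh = ESG_GRAPH_KWH * 1000.0  # 0.312 Wh

# Energy ratio of each baseline relative to ESG-Graph, rounded to one decimal.
ratios = {name: round(wh / esg_graph_wh, 1) for name, wh in BASELINE_WH.items()}

# Carbon intensity implied by the reported energy and emissions (kg CO2 per kWh).
implied_intensity = round(ESG_GRAPH_KG_CO2 / ESG_GRAPH_KWH, 3)

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x the energy of ESG-Graph")
```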
These results indicate that ESG-Graph achieves competitive classification performance while operating at a substantially lower energy cost than Transformer baselines, demonstrating the efficiency benefits of sparse, graph-based computation.
3.7.2. Performance-Energy Trade-Off Curve
To jointly assess predictive performance and computational cost,
Figure 2 plots the average F1 score across ESG datasets against the average energy consumption per run (kWh). A logarithmic scale is used on the energy axis to accommodate the wide dynamic range in consumption across models and to improve visual comparability between low and high energy regimes.
TinyBERT operates in a low-energy regime but attains a lower average F1 score. DistilBERT provides moderate performance gains at increased energy cost, while FNet exhibits higher consumption without proportional improvements in F1. Longformer achieves strong predictive performance but requires substantially more energy per run.
ESG-Graph combines very low energy usage with the highest average F1 score among the evaluated supervised models. Compared to Transformer baselines, ESG-Graph achieves favorable performance while operating in a substantially lower energy regime.
3.7.3. Comparison with Large Language Models
In addition to supervised baselines, we evaluate four zero-shot LLMs: GPT-4o, GPT-4o mini, Gemini-2.5 Flash, and Gemini-2.5 Pro.
These models demonstrate strong generalization capabilities, particularly on the Environmental dataset, where Gemini-2.5 Flash attains the highest zero-shot F1 (91.95%). However, they operate at a fundamentally different computational scale and therefore cannot be directly placed on the performance-energy curve used for supervised models.
We did not include fine-tuned or few-shot LLM baselines (e.g., Llama-3-8B) due to their significantly higher computational and memory requirements. Fine-tuning such models typically requires well over 8 GB of GPU memory, along with substantially higher computational cost and longer training times, which makes them inconsistent with the efficiency-focused evaluation setting considered in this study. The comparison is therefore restricted to zero-shot LLMs.
Nonetheless, the supervised ESG-Graph remains competitive with these LLMs, achieving:
Higher accuracy and F1 score than all evaluated LLMs on the Governance dataset;
Higher F1 score than all evaluated LLMs on the Environmental dataset;
Substantially higher F1 score and accuracy than all evaluated LLMs on the Social dataset;
Vastly lower energy cost, by several orders of magnitude.
Given that none of the LLMs undergo task-specific fine-tuning and their energy usage is several orders of magnitude larger, as reported in public model cards and efficiency analyses [
24,
25], these results suggest that compact, task-aligned architectures can offer competitive performance under realistic computational budgets [
16,
26].
3.7.4. Discussion
Overall, the findings present a coherent picture across efficiency and performance. Distilled transformers offer strong baseline efficiency, while full transformer architectures such as Longformer achieve high predictive accuracy at the cost of substantially increased energy usage. In contrast, ESG-Graph exhibits a favorable balance, combining competitive supervised accuracy with very low energy consumption and a strong position on the performance-energy trade-off curve.
3.8. Attention Dynamics Across ESG Pillars
To understand how ESG-Graph processes textual information across ESG dimensions, we analyze the layer-wise evolution of attention weights on the Environmental, Social, and Governance datasets. As reported in
Table 13, the mean attention remains approximately constant at 0.22 across all layers. However, entropy consistently decreases while skewness increases with depth, which indicates that the model shifts progressively from general language patterns to more concentrated focus on domain-relevant tokens.
A full visualization of the attention entropy curves and layer-wise weight distributions is provided in Appendix B, where Figure A1 and Figure A2 illustrate how the model transitions from general language patterns (Layers 1–2) to domain-specific semantics (Layers 3–4).
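The statistics discussed in this section can be illustrated on a single node's attention distribution. The sketch below assumes Shannon entropy in nats and population skewness (the third standardized moment); the two weight vectors are hypothetical and only mimic a diffuse shallow layer and a concentrated deep layer.

```python
import math

def entropy(weights):
    """Shannon entropy (nats) of a normalized attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0)

def skewness(weights):
    """Population skewness (third standardized moment) of the weights."""
    n = len(weights)
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / n
    if var == 0:
        return 0.0
    return sum((w - mean) ** 3 for w in weights) / n / var ** 1.5

# Hypothetical attention weights over the same five neighbors (each sums to 1).
shallow = [0.22, 0.20, 0.20, 0.19, 0.19]   # near-uniform: high entropy
deep = [0.60, 0.15, 0.10, 0.10, 0.05]      # concentrated: low entropy, high skew

for name, w in (("shallow", shallow), ("deep", deep)):
    print(name, round(sum(w) / len(w), 2), round(entropy(w), 3), round(skewness(w), 3))
```

Since normalized weights over a fixed neighborhood always average to one over the number of neighbors, the mean cannot distinguish the two layers. Entropy and skewness capture the concentration that the mean misses.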
3.9. Interpretability Analysis
Interpretability is crucial in ESG text classification, where models must provide transparent justifications for their decisions, particularly in regulatory contexts [
8]. To analyze how ESG-Graph forms its predictions, we employ gradient-based attribution methods, which estimate the sensitivity of the model output to changes in node representations [
27,
28].
Unlike attention mechanisms, which explain information routing, gradient-based attributions focus on estimating how variations in node representations influence the model output [
29,
30].
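Given the classifier output $\hat{y}$ and the embedding $\mathbf{h}_i$ of a node $i$, the attribution score is computed as:
$$a_i = \left\lVert \frac{\partial \hat{y}}{\partial \mathbf{h}_i} \right\rVert_2$$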
Because gradients propagate through the graph attention layers, these scores capture both the lexical meaning of individual tokens and the structural influence of the dependency edges connecting them, as shown in earlier work on graph-level explainability [
31].
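A minimal sketch of gradient-based node attribution is given below, assuming the score is taken as the L2 norm of the output gradient with respect to each node embedding (one common choice). The toy classifier, its weights, the token list, and the use of central finite differences in place of backpropagation through graph attention layers are all hypothetical simplifications.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def toy_classifier(embeddings, pool_weights, w):
    """Toy stand-in for the model: y = sigmoid(sum_i a_i * tanh(w . h_i))."""
    z = sum(a * math.tanh(sum(wk * hk for wk, hk in zip(w, h)))
            for a, h in zip(pool_weights, embeddings))
    return sigmoid(z)

def gradient_attribution(f, embeddings, eps=1e-5):
    """Per-node L2 norm of df/dh_i, estimated with central finite differences."""
    scores = []
    for i, h in enumerate(embeddings):
        grad_sq = 0.0
        for k in range(len(h)):
            plus = [list(e) for e in embeddings]
            minus = [list(e) for e in embeddings]
            plus[i][k] += eps
            minus[i][k] -= eps
            g = (f(plus) - f(minus)) / (2 * eps)
            grad_sq += g * g
        scores.append(math.sqrt(grad_sq))
    return scores

# Hypothetical two-dimensional node embeddings and fixed pooling weights.
tokens = ["board", "the", "audit"]
embeddings = [[0.5, 0.1], [0.0, 0.0], [0.3, 0.4]]
pool_weights = [0.6, 0.1, 0.3]
w = [1.0, -0.5]

scores = gradient_attribution(lambda e: toy_classifier(e, pool_weights, w), embeddings)
ranking = [tok for tok, _ in sorted(zip(tokens, scores), key=lambda t: -t[1])]
```

In a trained model the same quantity would normally be obtained with automatic differentiation; finite differences are used here only to keep the example dependency-free.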
For Governance documents, tokens such as governance, compliance, audit, and regulatory receive the highest attribution scores. These terms highlight that the model bases its predictions on meaningful corporate-governance content.
For Environmental documents, influential tokens include environmental, emissions, climate, carbon, and renewable. The spread of attribution across multiple environmental themes suggests that the model captures the multiple dimensions of environmental disclosures.
For Social documents, tokens such as gender, diversity, employees, rights, and communities show high importance. These reflect key aspects of social impact and show that the model captures relevant patterns.
Case Study: Governance Classification
As a concrete illustration, consider the following excerpt drawn from a governance disclosure aligned with corporate governance standards:
“The UK Corporate Governance Code recommends that the Board should include a balance of executive and non-executive Directors (and in particular independent non-executive Directors) such that no individual or small group of individuals can dominate the Board’s decision making.”
The dependency parser identifies recommends as the root, linking key entities such as Board, Directors, and decision making through syntactic relations (e.g., nsubj, dobj, and clausal modifiers). Several tokens, including Board, Directors, and governance, match entries in the G1 subtopic lexicon, leading to the activation of a Governance anchor node.
Gradient-based attribution highlights the anchor node and its directly connected syntactic neighbors as the most influential components in the classification decision. These elements correspond to core governance concepts such as board composition, independence, and decision-making structure. This behavior provides a transparent and policy-aligned explanation of the model output, illustrating how subtopic anchor nodes bridge token-level representations and higher-level regulatory concepts.
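The anchor-node mechanism in this case study can be sketched as a simple graph augmentation step. The example below is an illustration under stated assumptions: the token fragment and dependency edges are hand-written rather than parser output, the lexicon entries are limited to the terms named above plus hypothetical Environmental terms, and the activation rule (at least one lexicon match) and the `<G1>` node naming are not taken from the paper.

```python
def add_anchor_nodes(tokens, dep_edges, lexicons):
    """Augment a dependency graph with one anchor node per matched subtopic.

    Returns (nodes, edges, anchors), where anchors maps subtopic -> node index.
    """
    nodes = list(tokens)
    edges = list(dep_edges)
    anchors = {}
    for subtopic, keywords in lexicons.items():
        matches = [i for i, tok in enumerate(tokens) if tok.lower() in keywords]
        if matches:  # assumed rule: any lexicon match activates the anchor
            anchor_id = len(nodes)
            nodes.append(f"<{subtopic}>")
            anchors[subtopic] = anchor_id
            edges.extend((anchor_id, i) for i in matches)
    return nodes, edges, anchors

# Simplified fragment of the excerpt with hand-written (head, dependent) edges.
tokens = ["Code", "recommends", "Board", "include", "Directors"]
dep_edges = [(1, 0), (1, 3), (3, 2), (3, 4)]

# Hypothetical lexicons; only the G1 terms are named in the case study.
lexicons = {
    "G1": {"board", "directors", "governance"},
    "E1": {"emissions", "carbon"},
}

nodes, edges, anchors = add_anchor_nodes(tokens, dep_edges, lexicons)
print(nodes[anchors["G1"]], [tokens[j] for i, j in edges if i == anchors["G1"]])
```

Only the Governance anchor is created for this fragment, and it connects to Board and Directors, which is why attribution mass on the anchor can be read as evidence for a named subtopic.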
To assess the practical relevance of these explanations, a subset of attribution outputs was qualitatively examined by ESG analysts. The identified anchor nodes and salient tokens were generally consistent with expert assessments of governance materiality, particularly in cases where disclosure language follows established regulatory frameworks. While a formal user study is beyond the scope of this work, these observations suggest that ESG-Graph produces explanations that align with professional analytical reasoning.
3.10. Discussion
Beyond the empirical findings reported above, it is essential to examine how the results of this study relate to and extend established research directions. In the Introduction, we identified four families of approaches for ESG text classification and analyzed their respective limitations along three dimensions: efficiency, structural expressiveness, and domain knowledge integration (
Table 1). The following discussion revisits each model family in light of the experimental evidence presented in this study.
3.10.1. Relation to Transformer-Based Models
To begin with, large pretrained transformers such as BERT [
6] and its ESG-adapted variant ESG-BERT [
4] have established strong empirical baselines for text classification by leveraging deep contextual representations. However, as noted in
Table 1, these models score low across all three dimensions due to their quadratic computational complexity [
7], sequential token processing, and absence of structured domain knowledge. The results presented in
Section 3 and
Section 3.7 confirm that ESG-Graph achieves competitive or superior classification performance, particularly on the Governance dataset where it outperforms all transformer baselines, while requiring orders of magnitude fewer parameters and up to 62.6× less energy. These findings corroborate the broader concerns raised by Strubell et al. [
9] and Patterson et al. [
25] regarding the environmental costs of large-scale language models, and provide evidence that task-specific architectures grounded in linguistic structure constitute a viable alternative in domain-constrained settings.
3.10.2. Relation to Efficient Transformer Architectures
Building on this comparison, efficient transformer variants such as DistilBERT [
11], TinyBERT [
12], FNet [
14], and Longformer [
13] have convincingly demonstrated that model compression and architectural simplification can substantially reduce computational cost without proportional accuracy degradation. ESG-Graph extends this line of inquiry by establishing that competitive ESG classification can be attained without relying on transformer encoders altogether, suggesting that graph-based message passing over linguistically grounded structures constitutes a complementary efficiency pathway. Moreover, while efficient transformers address the efficiency limitation identified in
Table 1, they remain grounded in sequential token processing and do not incorporate structured domain knowledge, leaving the structure and knowledge dimensions unresolved. The training efficiency results reported in
Table 7 and the energy measurements in
Table 12 confirm that ESG-Graph operates in a substantially lower computational regime while simultaneously addressing the structural and knowledge gaps that efficient transformers leave open. This contributes to the emerging literature on sustainable AI [
10] and provides evidence that architectural design choices, not only compression or distillation strategies, constitute a viable lever for reducing the carbon footprint of NLP systems, a consideration that carries particular weight when the application domain itself concerns environmental sustainability.
3.10.3. Relation to Graph-Based Methods
Furthermore, the foundational work of Yao et al. [
15] demonstrated that graph convolutional networks can effectively model text for classification through corpus-level word co-occurrence graphs, achieving strong structural expressiveness as reflected in
Table 1. ESG-Graph refines this approach in two important respects. First, it operates at the sentence level and leverages syntactic dependency relations rather than statistical co-occurrence, enabling the model to capture grammatical compositionality and thematic relevance within individual disclosures. This shift from corpus-level to sentence-level graph construction extends the applicability of graph-based NLP to structured regulatory text domains where document-level granularity is insufficient. Second, while existing graph-based methods score low on domain knowledge integration (
Table 1), ESG-Graph augments syntactic graphs with ESRS-aligned taxonomy anchor nodes, directly addressing this limitation. The ablation results in
Section 3.4 confirm that these domain-specific components contribute measurably to classification performance, and the interpretability analysis in
Section 3.9 demonstrates that gradient-based attribution over named taxonomy nodes extends prior work on GNN explainability [
31] to the ESG disclosure domain, responding to the concerns raised by Lipton [
8] regarding the opacity of neural models in high-stakes applications.
3.10.4. Relation to Knowledge-Enhanced Approaches
Closely related to the knowledge integration challenge, Peters et al. [
18] proposed enriching contextual word representations with external knowledge bases through additional pretraining or embedding alignment. While such approaches achieve high domain alignment as reflected in
Table 1, they remain grounded in sequential architectures with limited structural expressiveness. ESG-Graph adopts a more modular strategy by encoding domain knowledge through taxonomy-based virtual nodes embedded directly within the graph topology, thereby resolving the structure–knowledge trade-off that characterizes existing knowledge-enhanced methods. This adaptable design is particularly well suited to the ESG domain, where regulatory frameworks such as ESRS undergo frequent revisions, and it suggests that lightweight, structure-based knowledge injection can serve as a practical alternative to knowledge-intensive pretraining. In a broader theoretical context, Garcez and Lamb [
17] have advocated for the integration of symbolic reasoning with neural computation as a pathway toward more robust and interpretable AI systems. ESG-Graph provides a concrete, domain-specific instantiation of this paradigm: the ESRS-aligned taxonomy nodes function as symbolic priors grounding neural predictions in analyst-defined regulatory concepts. The ablation results presented in
Section 3.4 offer empirical support for the neuro-symbolic hypothesis in the context of financial text analysis.
3.10.5. Unified Positioning
Taken together, these connections demonstrate that ESG-Graph addresses the specific limitations identified for each model family in
Table 1: it matches the predictive strength of transformer-based models without their computational burden, extends the efficiency gains of compressed architectures by eliminating the transformer dependency entirely, augments the structural advantages of graph-based methods with domain-specific knowledge integration, and provides a more modular and adaptable alternative to knowledge-enhanced approaches while simultaneously achieving structural expressiveness. The empirical evidence presented in this study suggests that these properties, previously achieved only in isolation by separate families of approaches, can be unified within a single lightweight framework, providing a coherent response to the research question posed in the Introduction.
4. Conclusions
This paper introduced ESG-Graph, a lightweight and interpretable graph neural framework for ESG text classification that addresses the computational, environmental, and transparency limitations of transformer-based approaches in sustainability analytics.
Our findings demonstrate that ESG-Graph achieves competitive classification performance across Environmental, Social, and Governance benchmarks while using only 1.65 million parameters.
Moreover, ESG-Graph reduces energy consumption by a factor of approximately 4.4× to 62.6× relative to efficient transformer baselines, while eliminating sequence length constraints.
Limitations and Future Work: While ESG-Graph demonstrates strong performance on benchmark datasets, its evaluation on ESG disclosures from other jurisdictions remains future work. Moreover, the current taxonomy requires significant manual curation. Future work could explore integrating a relevance gating mechanism, which would enable adaptive, context-dependent token selection without introducing additional components such as large language models. Future work should also evaluate few-shot prompting of smaller open-source LLMs (e.g., Llama-3-8B) as an additional baseline; such models were excluded from the current study because they are incompatible with the single-GPU hardware constraint and lack native interpretability. Finally, a formal practitioner user study evaluating the alignment between model attributions and expert materiality assessments is an important direction for future validation.
In conclusion, ESG-Graph represents a strong alternative to large language models for ESG text classification. By encoding domain knowledge into graph topology, it offers financial institutions and regulators a practical tool for scalable ESG disclosure analysis.