ESG-SASB Label Stability: A Curated Benchmark and Reproducible Pipeline for Reusing Sentence-Level Sustainability Disclosure Labels
Abstract
1. Introduction
- RQ1. How closely do automated label sources reproduce released SASB-aligned labels at binary, parent category, and child category levels?
- RQ2. How do retention, agreement, and self-reported ambiguity differ when a fixed LLM annotator labels the same held-out sentences under blind and candidate-visible annotation conditions?
- RQ3. How sensitive are aggregate ESG/Non-ESG category profiles to the sentence-level label source used for aggregation?
2. Literature Review
2.1. Sustainability Reporting and SASB-Aligned Sentence Labels
2.2. Automated ESG Disclosure Analytics and Standards-Based Text Classification
2.3. Annotation Reliability, Ambiguity, and Candidate-Visible LLM Annotation
2.4. From Sentence Labels to Aggregate Profiles
3. Materials and Methods
3.1. Reproducibility Artifacts and Reusable Records
3.2. Corpus, Label Hierarchy, and Split
3.3. Classifier-Based Reproduction of Released Labels
3.4. LLM Annotation Under Blind and Candidate-Visible Conditions
3.5. Aggregation into ESG/Non-ESG Category Profiles
4. Results
4.1. Reuse Validation Across Label Granularities
4.2. Annotation Conditions Change Retention and Ambiguity Signals
4.3. Aggregate Profiles Are Sensitive to Label Source
5. Discussion
5.1. Reusable Annotated Text Labels Need Stability Evidence
5.2. Annotation Condition Should Travel with the Label
5.3. Aggregate Profiles Should Cite Their Label Source
5.4. Usage Notes and Limitations
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Supplementary Results
Appendix A.1. Corpus Split Check
| Quantity | Full Corpus | Held-Out Test Set | Check |
|---|---|---|---|
| Sentences | 6460 | 969 | Final reuse comparisons use the held-out test set |
| Child labels represented | 27 | 27 | All released child labels remain represented |
| Smallest child label count | 6 | 1 | Rare released labels are retained in the task |
| Non-ESG share | 54.89% | 54.90% | 0.01 percentage points |
| Largest ESG/Non-ESG category share difference | — | — | 0.11 percentage points |
| Largest child label share difference | — | — | 0.07 percentage points |
| Mean sentence length | 26.53 | 26.27 | 0.26 words |
| Median sentence length | 23 | 23 | No difference |
| Interquartile range of sentence length | 16–32 | 16–33 | Similar middle range |
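
The share comparisons in this table can be reproduced with a few lines of code. The following is a minimal sketch, assuming each sentence carries one label and that a share difference is the absolute gap between the full-corpus share and the test-set share, in percentage points. The function names and toy labels are hypothetical and are not taken from the released code.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the list as a fraction in [0, 1]."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def largest_share_gap(full_labels, test_labels):
    """Largest absolute share difference across labels, in percentage points."""
    full = label_shares(full_labels)
    test = label_shares(test_labels)
    every_label = set(full) | set(test)
    return 100 * max(abs(full.get(k, 0.0) - test.get(k, 0.0)) for k in every_label)


# Toy labels (hypothetical, not the released corpus).
full_labels = ["Non-ESG"] * 55 + ["GHG Emissions"] * 25 + ["Data Security"] * 20
test_labels = ["Non-ESG"] * 12 + ["GHG Emissions"] * 5 + ["Data Security"] * 3
gap = largest_share_gap(full_labels, test_labels)
```

A label missing from one split is treated as having a share of zero there, so a rare label dropped from the test set would show up as a gap rather than being silently ignored.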
Appendix A.2. Classifier and Annotation Results
| Mode | N | Retention Rate (Child) | Retention Rate (Parent) | Cohen’s κ (Child) | Cohen’s κ (Parent) | Ambiguity Rate |
|---|---|---|---|---|---|---|
| Blind | 969 | 0.5480 | 0.5986 | 0.3950 | 0.4317 | 0.4025 |
| Candidate-visible | 969 | 0.6481 | 0.6997 | 0.5042 | 0.5535 | 0.0000 |
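
The three quantities in this table can be illustrated as follows. This is a sketch under stated assumptions: retention is taken to be the share of sentences where the annotator's label equals the released label, ambiguity rate is the share of sentences flagged `ambiguous`, and Cohen's κ uses the standard two-rater formula. The helper names and toy data are hypothetical.

```python
from collections import Counter


def retention_rate(released, annotated):
    """Share of sentences whose annotated label equals the released label (assumed definition)."""
    return sum(r == a for r, a in zip(released, annotated)) / len(released)


def cohens_kappa(a, b):
    """Cohen's kappa for two label sequences of equal length."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of the two marginal shares, summed over labels.
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (p_observed - p_expected) / (1 - p_expected)


def ambiguity_rate(flags):
    """Share of sentences whose ambiguity_flag equals 'ambiguous'."""
    return sum(f == "ambiguous" for f in flags) / len(flags)


# Toy data (hypothetical).
released = ["A", "A", "B", "B"]
annotated = ["A", "B", "B", "B"]
flags = ["clear", "ambiguous", "clear", "clear"]

retention = retention_rate(released, annotated)
kappa = cohens_kappa(released, annotated)
ambiguity = ambiguity_rate(flags)
```

Because κ discounts agreement expected by chance, it sits below raw retention whenever the label distribution is skewed, which matches the pattern in the table.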
| Prompt | Validation Child Macro F1 | Test Binary Macro F1 | Test Parent Macro F1 | Test Child Macro F1 | Test Child Top-3 Acc. |
|---|---|---|---|---|---|
| Minimal | 0.3790 | 0.6721 | 0.5317 | 0.3461 | 0.6347 |
| Definitions | 0.3823 | 0.7192 | 0.5225 | 0.3604 | 0.7203 |
| Hierarchy | 0.3769 | 0.6755 | 0.5292 | 0.3569 | 0.5841 |
| Method | Binary Macro F1 | Parent Macro F1 | Child Macro F1 | Child Top-3 Acc. |
|---|---|---|---|---|
| TF-IDF + LinearSVC | 0.8919 | 0.7204 | 0.5707 | — |
| TF-IDF + Logistic regression | 0.8764 | 0.7194 | 0.4959 | — |
| SBERT + Logistic regression | 0.8744 | 0.7182 | 0.4612 | — |
| GPT-4o (definitions) | 0.7192 | 0.5225 | 0.3604 | 0.7203 |
Appendix A.3. Profile Distance Results
| Method | Task | N | Accuracy | Macro Precision | Macro Recall | Macro F1 |
|---|---|---|---|---|---|---|
| TF-IDF + LinearSVC | Binary | 969 | 0.893 | 0.891 | 0.893 | 0.892 |
| TF-IDF + LinearSVC | Parent | 437 | 0.739 | 0.731 | 0.713 | 0.720 |
| TF-IDF + LinearSVC | Child | 969 | 0.771 | 0.605 | 0.567 | 0.571 |
| SBERT + LogReg | Binary | 969 | 0.875 | 0.874 | 0.876 | 0.874 |
| SBERT + LogReg | Parent | 437 | 0.730 | 0.712 | 0.727 | 0.718 |
| SBERT + LogReg | Child | 969 | 0.612 | 0.444 | 0.593 | 0.461 |
| GPT-4o (definitions) | Binary | 969 | 0.719 | 0.723 | 0.724 | 0.719 |
| GPT-4o (definitions) | Parent | 437 | 0.563 | 0.590 | 0.484 | 0.522 |
| GPT-4o (definitions) | Child | 969 | 0.554 | 0.349 | 0.448 | 0.360 |
| Source | Mean Parent | P90 Parent | Mean Parent | 95% Interval for |
|---|---|---|---|---|
| TF-IDF | 0.1051 | 0.1488 | 0.0369 | [0.0165, 0.0738] |
| Candidate-visible annotation | 0.1537 | 0.2149 | 0.0580 | [0.0246, 0.1157] |
| GPT-4o (definitions) | 0.2160 | 0.2975 | 0.0869 | [0.0331, 0.1653] |
| Blind annotation | 0.2859 | 0.3802 | 0.1171 | [0.0574, 0.1967] |
| SBERT | 0.4225 | 0.5124 | 0.2074 | [0.1393, 0.2810] |
| Source | Mean Child | P90 Child | Mean Child | 95% Interval for |
|---|---|---|---|---|
| TF-IDF | 0.2057 | 0.2645 | 0.0329 | [0.0165, 0.0579] |
| Candidate-visible annotation | 0.3516 | 0.4298 | 0.0593 | [0.0248, 0.1157] |
| GPT-4o (definitions) | 0.4334 | 0.5289 | 0.0846 | [0.0331, 0.1653] |
| Blind annotation | 0.5022 | 0.6116 | 0.1111 | [0.0492, 0.1967] |
| SBERT | 0.5354 | 0.6281 | 0.2074 | [0.1393, 0.2810] |
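
To make the idea of a profile shift concrete, the sketch below aggregates sentence labels into category shares and compares two label sources. The distance measure used for the tables above is not stated in this section, so total variation distance (half the sum of absolute share differences) is used here purely as an illustrative assumption; the paper's own measure may differ. All function names and data are hypothetical.

```python
from collections import Counter


def profile(labels, categories):
    """Category share profile: fraction of sentences assigned to each category."""
    counts = Counter(labels)
    n = len(labels)
    return {c: counts.get(c, 0) / n for c in categories}


def share_shifts(released, alternative, categories):
    """Per-category share of the alternative source minus the released share."""
    p = profile(released, categories)
    q = profile(alternative, categories)
    return {c: q[c] - p[c] for c in categories}


def total_variation(shifts):
    """Half the sum of absolute shifts (assumed summary distance)."""
    return 0.5 * sum(abs(v) for v in shifts.values())


# Toy labels for five sentences (hypothetical).
categories = ["Environment", "Social Capital", "Non-ESG"]
released = ["Environment", "Environment", "Social Capital", "Non-ESG", "Non-ESG"]
alternative = ["Environment", "Social Capital", "Social Capital", "Non-ESG", "Non-ESG"]

shifts = share_shifts(released, alternative, categories)
distance = total_variation(shifts)
```

Note that two label sources can disagree on many individual sentences and still produce similar profiles if their disagreements cancel out, so profile distance and sentence-level agreement answer different questions.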
Appendix B. Full Prompt Text
| Prompt Family | Task | Information Supplied | Required JSON Fields |
|---|---|---|---|
| Minimal | GPT-4o child classification | Sentence text and the 27 child label names | top1_label, top3_labels, confidence, rationale |
| Definitions | GPT-4o child classification | Sentence text and child label definitions | top1_label, top3_labels, confidence, rationale |
| Hierarchy | GPT-4o hierarchy-aware classification | Sentence text, parent–child mapping, and child label definitions | parent_label, top1_label, top3_labels, confidence, rationale |
| Blind annotation | Claude LLM annotation | Sentence text and child label definitions | child_label, parent_label, confidence, ambiguity_flag, note |
| Candidate-visible annotation | Claude candidate-visible annotation | Sentence text, released label, model candidate, and child label definitions | verdict, recommended_child, recommended_parent, confidence, ambiguity_flag, note |
Appendix B.1. Classification Prompts (GPT-4o)
- Minimal template.
System. “You classify corporate disclosure sentences into one of the provided SASB child labels. Use only the supplied label set and return JSON.”
User template. “Sentence: "{text}". Child labels: {label_list}. Return valid JSON with keys top1_label, top3_labels, confidence, rationale.”
- Definitions template.
System. “You are an ESG disclosure analyst classifying sentences with reference to SASB child category definitions.”
User template. “Sentence: "{text}". Use the following SASB child labels and definitions: {definitions}. Return valid JSON with keys top1_label, top3_labels, confidence, rationale.”
- Hierarchy template.
System. “You classify ESG disclosure sentences using the SASB hierarchy. Think first about broad parent meaning and then choose a child label.”
User template. “Sentence: "{text}". Parent-to-child hierarchy: {hierarchy}. Child label definitions: {definitions}. Return valid JSON with keys parent_label, top1_label, top3_labels, confidence, rationale.”
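
As an illustration of how the Definitions template might be filled and its response checked, the sketch below renders the user message with `str.format` and validates the returned JSON against the required keys. How the definitions are serialized, the placeholder definition texts, and the rule that `top3_labels` holds at most three allowed labels are assumptions made for this example. No API call is made.

```python
import json

USER_TEMPLATE = (
    'Sentence: "{text}". Use the following SASB child labels and definitions: '
    "{definitions}. Return valid JSON with keys top1_label, top3_labels, "
    "confidence, rationale."
)
REQUIRED_KEYS = {"top1_label", "top3_labels", "confidence", "rationale"}


def build_user_prompt(text, definitions):
    """Fill the template; definitions are joined as 'name: text' (assumed format)."""
    rendered = "; ".join(f"{name}: {desc}" for name, desc in definitions.items())
    return USER_TEMPLATE.format(text=text, definitions=rendered)


def parse_classification(raw, allowed_labels):
    """Parse a model response and reject missing keys or labels outside the label set."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response is not a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    top3 = data["top3_labels"]
    if not isinstance(top3, list) or len(top3) > 3:
        raise ValueError("top3_labels must be a list of at most three labels")
    for label in [data["top1_label"], *top3]:
        if label not in allowed_labels:
            raise ValueError(f"label outside the supplied set: {label!r}")
    return data


# Placeholder definitions (hypothetical wording, not the released definitions).
definitions = {
    "GHG Emissions": "placeholder definition",
    "Data Security": "placeholder definition",
    "Non-ESG": "placeholder definition",
}
prompt = build_user_prompt("We cut emissions at two plants.", definitions)
raw_response = json.dumps({
    "top1_label": "GHG Emissions",
    "top3_labels": ["GHG Emissions", "Non-ESG", "Data Security"],
    "confidence": 0.8,
    "rationale": "Mentions emissions.",
})
parsed = parse_classification(raw_response, set(definitions))
```

Rejecting labels outside the supplied set keeps malformed or invented labels out of the metrics instead of letting them count as ordinary misclassifications.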
Appendix B.2. Annotation Prompts (Claude Sonnet 4.6)
- Blind annotation template.
System. “You are an ESG reviewer. Read the sentence and assign the best SASB child label and parent label. If the sentence is too ambiguous for a unique child label, mark ambiguity_flag as ambiguous.”
User template. “Sentence: "{text}". SASB child label definitions: {definitions}. Return valid JSON with keys child_label, parent_label, confidence, ambiguity_flag, note.”
- Candidate-visible template.
System. “You are an ESG reviewer comparing candidate labels for a sentence. Decide whether the original label, the model candidate, both, or neither is best supported.”
User template. “Sentence: "{text}". Original label: {original_label}. Model candidate: {model_candidate}. SASB child label definitions: {definitions}. Return valid JSON with keys verdict, recommended_child, recommended_parent, confidence, ambiguity_flag, note. Allowed verdict values: original, model, both, other.”
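
A response to the candidate-visible template can be checked before it enters any metric. The sketch below validates the required keys and the four allowed verdict values listed in the template. The `keeps_original` helper, which treats a sentence as retained when `recommended_child` equals the released label, is an assumed operationalization and may not match the released pipeline.

```python
import json

ALLOWED_VERDICTS = {"original", "model", "both", "other"}
REQUIRED_KEYS = {
    "verdict",
    "recommended_child",
    "recommended_parent",
    "confidence",
    "ambiguity_flag",
    "note",
}


def parse_review(raw):
    """Parse a candidate-visible annotation and enforce the template's schema."""
    review = json.loads(raw)
    if not isinstance(review, dict):
        raise ValueError("response is not a JSON object")
    missing = REQUIRED_KEYS - review.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if review["verdict"] not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {review['verdict']!r}")
    return review


def keeps_original(review, original_label):
    """True when the recommended child label equals the released label (assumed rule)."""
    return review["recommended_child"] == original_label


# Toy response (hypothetical).
raw = json.dumps({
    "verdict": "original",
    "recommended_child": "Customer Privacy",
    "recommended_parent": "Social Capital",
    "confidence": 0.7,
    "ambiguity_flag": "clear",
    "note": "Contractual data protection.",
})
review = parse_review(raw)
kept = keeps_original(review, "Customer Privacy")
```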
References
- IFRS Foundation. Understanding the SASB Standards, 2026. Guidance Page Describing the SASB Standards as Industry-Based Guidance Used Within the ISSB Framework. Available online: https://www.ifrs.org/issued-standards/sasb-standards/understanding-sasb-standards/ (accessed on 4 May 2026).
- Cahan, S.F.; Chen, L.; Wei, Y. Do sustainability standards improve the information environment? Evidence from the influence of SASB standards on disagreement among ESG rating agencies. Meditari Account. Res. 2025. ahead-of-print.
- Junprung, E. SASB-Aligned ESG Sentences. Kaggle Datasets, 2023. License: Apache 2.0. Available online: https://www.kaggle.com/datasets/edwardjunprung/sasb-aligned-esg-sentences (accessed on 4 May 2026).
- Korca, B.; Costa, E.; Bouten, L. Disentangling the concept of comparability in sustainability reporting. Sustain. Account. Manag. Policy J. 2023, 14, 815–851.
- Jørgensen, S.; Mjøs, A.; Pedersen, L.J.T. Sustainability reporting and approaches to materiality: Tensions and potential resolutions. Sustain. Account. Manag. Policy J. 2022, 13, 341–361.
- León, R.; Salesa, A. Is sustainability reporting disclosing what is relevant? Assessing materiality accuracy in the Spanish telecommunication industry. Environ. Dev. Sustain. 2024, 26, 21433–21460.
- Bochkay, K.; Brown, S.V.; Leone, A.J.; Tucker, J.W. Textual Analysis in Accounting: What’s Next? Contemp. Account. Res. 2023, 40, 765–805.
- Velte, P. Automated text analyses of sustainability & integrated reporting. A literature review of empirical-quantitative research. J. Glob. Responsib. 2023, 14, 530–566.
- Linhares Pontes, E.; Ben Jannet, M.; Moreno, J.G.; Doucet, A. Using Contextual Sentence Analysis Models to Recognize ESG Concepts. In Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP), Abu Dhabi, United Arab Emirates (Hybrid); Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 218–223.
- Schimanski, T.; Reding, A.; Reding, N.; Bingler, J.; Kraus, M.; Leippold, M. Bridging the gap in ESG measurement: Using NLP to quantify environmental, social, and governance communication. Financ. Res. Lett. 2024, 61, 104979.
- Zou, Y.; Shi, M.; Chen, Z.; Deng, Z.; Lei, Z.; Zeng, Z.; Yang, S.; Tong, H.; Xiao, L.; Zhou, W. ESGReveal: An LLM-based approach for extracting structured data from ESG reports. J. Clean. Prod. 2025, 489, 144572.
- Tseng, Y.M.; Chen, C.C.; Huang, H.H.; Chen, H.H. DynamicESG: A Dataset for Dynamically Unearthing ESG Ratings from News Articles. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, United Kingdom; ACM: New York, NY, USA, 2023; pp. 5412–5416.
- He, C.; Zhou, X.; Wu, Y.; Yu, X.; Zhang, Y.; Zhang, L.; Wang, D.; Lyu, S.; Xu, H.; Wang, X.; et al. ESGenius: Benchmarking LLMs on Environmental, Social, and Governance (ESG) and Sustainability Knowledge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 14612–14653.
- Su, X.; Liu, T.; Pang, P.; Luo, Y.T.; Wong, D. How Can Large Language Models Drive Environmental Sustainability? A Systematic Scoping Review. Sustainability 2026, 18, 4327.
- Li, J.; Yang, Y.; Mao, C.; Pang, P.C.I.; Zhu, Q.; Xu, D.; Wang, Y. Revealing Patient Dissatisfaction with Health Care Resource Allocation in Multiple Dimensions Using Large Language Models and the International Classification of Diseases 11th Revision: Aspect-Based Sentiment Analysis. J. Med. Internet Res. 2025, 27, e66344.
- Li, J.; Guo, J.; Pang, P.; Oliveira, H.G.; Ng, B.K.; Tan, T. Multi-task Specialized Expert Model for Hierarchical Aspect-based Sentiment Analysis in Consumer Healthcare. Expert Syst. Appl. 2026, 331, 133419.
- Junprung, E. SASB-ESG Category Classifier. Hugging Face Model Card, 2023. Available online: https://huggingface.co/ejunprung/SASB-ESG-Category-Classifier (accessed on 4 May 2026).
- Adcock, R.; Collier, D. Measurement Validity: A Shared Standard for Qualitative and Quantitative Research. Am. Political Sci. Rev. 2001, 95, 529–546.
- Birkenmaier, L.; Lechner, C.M.; Wagner, C. The Search for Solid Ground in Text as Data: A Systematic Review of Validation Practices and Practical Recommendations for Validation. Commun. Methods Meas. 2024, 18, 249–277.
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3980–3990.
- Gao, T.; Yao, X.; Chen, D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 6894–6910.
- Liu, K. SetCSE: Set Operations using Contrastive Learning of Sentence Embeddings. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
- Fang, Q.; Nguyen, D.; Oberski, D.L. Evaluating the construct validity of text embeddings with application to survey questions. EPJ Data Sci. 2022, 11, 39.
- Hayes, A.F.; Krippendorff, K. Answering the Call for a Standard Reliability Measure for Coding Data. Commun. Methods Meas. 2007, 1, 77–89.
- Dawid, A.P.; Skene, A.M. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Appl. Stat. 1979, 28, 20.
- Aroyo, L.; Welty, C. Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation. AI Mag. 2015, 36, 15–24.
- Fornaciari, T.; Uma, A.; Paun, S.; Plank, B.; Hovy, D.; Poesio, M. Beyond Black & White: Leveraging Annotator Disagreement via Soft-Label Multi-Task Learning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 2591–2597.
- Frénay, B.; Verleysen, M. Classification in the Presence of Label Noise: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2014, 25, 845–869.
- Tan, Z.; Li, D.; Wang, S.; Beigi, A.; Jiang, B.; Bhattacharjee, A.; Karami, M.; Li, J.; Cheng, L.; Liu, H. Large Language Models for Data Annotation and Synthesis: A Survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 930–957.
- Feuston, J.L.; Brubaker, J.R. Putting Tools in Their Place: The Role of Time and Perspective in Human-AI Collaboration for Qualitative Analysis. Proc. ACM Hum.-Comput. Interact. 2021, 5, 1–25.
- Eschrich, J.; Sterman, S. A Framework For Discussing LLMs as Tools for Qualitative Analysis. Version Number: 1. arXiv 2024, arXiv:2407.11198.
- Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.P.; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Version Number: 4. arXiv 2023, arXiv:2306.05685.
- Berg, F.; Kölbel, J.F.; Rigobon, R. Aggregate Confusion: The Divergence of ESG Ratings. Rev. Financ. 2022, 26, 1315–1344.
- Chatterji, A.K.; Durand, R.; Levine, D.I.; Touboul, S. Do ratings of firms converge? Implications for managers, investors and strategy researchers. Strateg. Manag. J. 2016, 37, 1597–1614.
- Henry, E.; Leone, A.J. Measuring Qualitative Information in Capital Markets Research: Comparison of Alternative Methodologies to Measure Disclosure Tone. Account. Rev. 2016, 91, 153–178.
- He, H.; Garcia, E. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
- Silla, C.N.; Freitas, A.A. A survey of hierarchical classification across different application domains. Data Min. Knowl. Discov. 2011, 22, 31–72.
- Kosmopoulos, A.; Partalas, I.; Gaussier, E.; Paliouras, G.; Androutsopoulos, I. Evaluation measures for hierarchical classification: A unified view and novel approaches. Data Min. Knowl. Discov. 2015, 29, 820–865.
- Salton, G.; Buckley, C. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 1988, 24, 513–523.
- Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
- Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437.
- Gebru, T.; Morgenstern, J.; Vecchione, B.; Vaughan, J.W.; Wallach, H.; Daumé III, H.; Crawford, K. Datasheets for datasets. Commun. ACM 2021, 64, 86–92.
- Mitchell, M.; Wu, S.; Zaldivar, A.; Barnes, P.; Vasserman, L.; Hutchinson, B.; Spitzer, E.; Raji, I.D.; Gebru, T. Model Cards for Model Reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency; ACM: New York, NY, USA, 2019; pp. 220–229.
- Wu, S.; Irsoy, O.; Lu, S.; Dabravolski, V.; Dredze, M.; Gehrmann, S.; Kambadur, P.; Rosenberg, D.; Mann, G. BloombergGPT: A Large Language Model for Finance. arXiv 2023, arXiv:2303.17564.




| Category | Role in the Label Scheme | Illustrative Child Labels |
|---|---|---|
| Environment | Environmental impacts and resource use | GHG Emissions; Energy Management; Water & Wastewater Management |
| Social Capital | Relationships with customers, communities, and society | Customer Privacy; Data Security; Product Quality & Safety |
| Human Capital | Workforce-related issues | Labor Practices; Employee Health & Safety; Employee Engagement, Diversity & Inclusion |
| Business Model & Innovation | Sustainability issues embedded in products, supply chains, and business models | Product Design & Lifecycle Management; Supply Chain Management; Business Model Resilience |
| Leadership & Governance | Governance, ethics, risk, and regulatory management | Business Ethics; Competitive Behavior; Critical Incident Risk Management |
| Non-ESG | Separate corpus category used alongside the SASB ESG categories in profile analyses | Sentences outside the SASB ESG child categories |
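
The binary and parent evaluations rely on projecting each child label upward through this hierarchy. The sketch below shows one way to do that, using only the illustrative child labels listed in the table above; the released `label_hierarchy/` files hold the full 27-label mapping, and the dictionary and function names here are hypothetical.

```python
# Partial child-to-parent mapping built from the illustrative labels in the table.
CHILD_TO_PARENT = {
    "GHG Emissions": "Environment",
    "Energy Management": "Environment",
    "Water & Wastewater Management": "Environment",
    "Customer Privacy": "Social Capital",
    "Data Security": "Social Capital",
    "Product Quality & Safety": "Social Capital",
    "Labor Practices": "Human Capital",
    "Employee Health & Safety": "Human Capital",
    "Employee Engagement, Diversity & Inclusion": "Human Capital",
    "Product Design & Lifecycle Management": "Business Model & Innovation",
    "Supply Chain Management": "Business Model & Innovation",
    "Business Model Resilience": "Business Model & Innovation",
    "Business Ethics": "Leadership & Governance",
    "Competitive Behavior": "Leadership & Governance",
    "Critical Incident Risk Management": "Leadership & Governance",
    "Non-ESG": "Non-ESG",
}


def to_parent(child_label):
    """Project a child label to its parent category; unknown labels raise KeyError."""
    return CHILD_TO_PARENT[child_label]


def to_binary(child_label):
    """Project a child label to the ESG / Non-ESG distinction."""
    return "Non-ESG" if to_parent(child_label) == "Non-ESG" else "ESG"


child_predictions = ["Data Security", "Non-ESG", "GHG Emissions"]
parent_predictions = [to_parent(c) for c in child_predictions]
binary_predictions = [to_binary(c) for c in child_predictions]
```

Projecting predictions rather than training separate models means a child-level error that stays within the same parent still counts as correct at the parent and binary levels.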
| Module | Practical Question | What We Compare | Main Output |
|---|---|---|---|
| Classifier reproduction | Can automated tools reproduce the released label scheme? | TF-IDF, SBERT, and GPT-4o label sources | Macro F1, top-3 accuracy |
| LLM annotation | Which label does a fixed LLM annotator assign under blind and candidate-visible conditions? | Blind annotation and candidate-visible LLM annotation | Retention, Cohen’s κ, ambiguity |
| Profile sensitivity | Do label source choices change ESG/Non-ESG profiles? | Released labels and alternative label sources | Category share shifts |
| Record Group | Release Files | Purpose | Redistribution Note |
|---|---|---|---|
| Source manifest | source_corpus_manifest.json | Records source URL, access date, loaded row count, landing page count, and SHA-256 hash | Raw text is omitted; users reconstruct from Kaggle |
| Label hierarchy | label_hierarchy/ | Provides the child-to-parent mapping used in all projections | Derived metadata |
| Split metadata | splits/split_indices.csv | Reconstructs train, validation, and test membership with sentence hashes | Text column removed; hashes support alignment |
| Prompt templates | prompts/ | Documents GPT-4o classification and Claude annotation calls | Author-generated templates |
| Predictions and annotations | predictions/, annotations/ | Supports reproduction of classifier, prompt, and annotation metrics | Public outputs retain labels and diagnostics, not raw sentence text |
| Metrics and profile outputs | metrics/, profile_sensitivity/ | Supports reported results and profile sensitivity summaries | Derived analysis outputs |
| Figure inputs, code, and tests | manuscript_inputs/, code/, tests/ | Recreates manuscript tables, figures, and validation checks | Reproducibility materials |
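
Because the release omits raw text, users need to re-attach sentences downloaded from Kaggle to the split metadata through hashes. The sketch below illustrates that alignment step. The whitespace normalization, the use of SHA-256 at the sentence level, and the column names `split` and `sentence_hash` are assumptions for this example; the released files define the actual scheme.

```python
import hashlib


def sentence_hash(text):
    """SHA-256 hex digest of the sentence after collapsing whitespace (assumed normalization)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def attach_text(split_rows, sentences):
    """Join split metadata to locally reconstructed sentences by hash."""
    by_hash = {sentence_hash(s): s for s in sentences}
    matched, unmatched = [], []
    for row in split_rows:
        text = by_hash.get(row["sentence_hash"])
        if text is None:
            unmatched.append(row)  # sentence missing or normalized differently
        else:
            matched.append({**row, "text": text})
    return matched, unmatched


# Toy inputs (hypothetical).
sentences = ["Customer data are protected   under contract.", "Goodwill was reviewed."]
split_rows = [
    {"split": "test", "sentence_hash": sentence_hash("Customer data are protected under contract.")},
    {"split": "train", "sentence_hash": "0" * 64},
]
matched, unmatched = attach_text(split_rows, sentences)
```

Reporting unmatched rows explicitly is useful here: a nonzero count signals that the reconstructed corpus or the normalization differs from what the release expects.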
| Paraphrased Sentence Description | Released Child Label | Parent Category Label | Comment |
|---|---|---|---|
| A sentence describes projected annual carbon-dioxide-equivalent avoidance from environmental projects. | GHG Emissions | Environment | Direct topical signal |
| A sentence describes a third-party valuation review related to goodwill impairment. | Non-ESG | Non-ESG | Accounting/report fragment |
| A sentence states that customer data are protected according to contractual commitments. | Customer Privacy | Social Capital | Potentially adjacent SASB issue |
| A sentence reports global employee completion of cybersecurity training. | Data Security | Social Capital | Context-dependent issue |
| A sentence describes procedures for responding to non-permitted wastewater discharge. | Water & Wastewater Management | Environment | Direct environmental disclosure |
| Label Source | Sentence Representation | Split Use | Role in This Study |
|---|---|---|---|
| TF-IDF classifiers | scikit-learn lowercasing and tokenization, word unigram–bigram TF-IDF features | Train for fitting; test for final evaluation | Lexical supervised baseline; LinearSVC predictions enter profile aggregation |
| SBERT classifier | all-mpnet-base-v2 tokenizer and frozen 768-dimensional sentence embeddings | Train for fitting; test for final evaluation | Embedding supervised baseline and profile aggregation label source |
| GPT-4o classifier | Prompted classification using the sentence, label resources, and JSON output schema | Validation for prompt family selection; test for final evaluation | Prompted LLM label source and candidate label shown in candidate-visible annotation |
| Component | Settings |
|---|---|
| TF-IDF vectorizer | word unigrams and bigrams; sublinear term frequency scaling; maximum vocabulary of 50,000 features |
| TF-IDF classifier heads | logistic regression with lbfgs, , balanced class weights, and 1000-iteration limit; LinearSVC with , balanced class weights, and 2000-iteration limit |
| SBERT encoder | all-mpnet-base-v2; frozen encoder; 768-dimensional sentence embeddings |
| SBERT classifier head | logistic regression with , balanced class weights, and 1000-iteration limit |
| GPT-4o classifier calls | zero-shot prompted classification; temperature 0; maximum token budget of 900 tokens; OpenRouter API |
| GPT-4o output schema | top1_label, top3_labels, confidence, and rationale |
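
The TF-IDF + LinearSVC row can be approximated in scikit-learn as shown below. This is a sketch, not the released configuration: the regularization strength is left at the scikit-learn default as an assumption, and the four training sentences are hypothetical stand-ins for the training split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_tfidf_svc():
    """TF-IDF (word unigrams and bigrams) followed by a class-balanced LinearSVC."""
    vectorizer = TfidfVectorizer(
        lowercase=True,
        analyzer="word",
        ngram_range=(1, 2),   # word unigrams and bigrams
        sublinear_tf=True,    # sublinear term frequency scaling
        max_features=50_000,  # maximum vocabulary size
    )
    # C is left at the library default; this is an assumption.
    classifier = LinearSVC(class_weight="balanced", max_iter=2000)
    return make_pipeline(vectorizer, classifier)


# Toy training data (hypothetical).
train_texts = [
    "We reduced greenhouse gas emissions at our plants.",
    "Carbon emissions fell after the efficiency project.",
    "Goodwill impairment was reviewed by a third party.",
    "The valuation of goodwill is tested annually.",
]
train_labels = ["GHG Emissions", "GHG Emissions", "Non-ESG", "Non-ESG"]

model = build_tfidf_svc()
model.fit(train_texts, train_labels)
prediction = model.predict(["Our emissions of greenhouse gases declined."])[0]
```

Wrapping the vectorizer and classifier in one pipeline ensures the vocabulary is learned from the training split only, so test sentences never influence the features.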
| Metric | Definition | Use in This Study |
|---|---|---|
| Macro F1 | Macro F1 averages per-category F1 scores, giving rare and common categories equal weight [41] | Main closeness-to-released-labels measure for binary, parent, and child projections. A value closer to 1 indicates closer reproduction of the released labels |
| Validation child Macro F1 | Macro F1 on the validation split for the 27 released child labels | Selects the GPT-4o prompt family carried forward to the test set |
| Top-3 accuracy | Share of test sentences where the released child label appears among GPT-4o’s three returned labels | Reports whether GPT-4o retrieves the released label as a ranked candidate |
| Prompt family flip rate | Share of test sentences where two GPT-4o prompt families assign different child or parent labels | Summarizes prompt sensitivity across Minimal, Definitions, and Hierarchy prompts |
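
The metric definitions in this table can be written out directly. The following is a minimal pure-Python sketch; it averages F1 over every label that appears in either the released labels or the predictions, which is an assumption about how absent labels are handled. Function names and toy data are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


def top3_accuracy(y_true, top3_lists):
    """Share of sentences whose released label appears among the three returned labels."""
    return sum(t in top3 for t, top3 in zip(y_true, top3_lists)) / len(y_true)


def flip_rate(labels_a, labels_b):
    """Share of sentences on which two prompt families assign different labels."""
    return sum(a != b for a, b in zip(labels_a, labels_b)) / len(labels_a)


# Toy data (hypothetical).
y_true = ["A", "A", "B", "B"]
pred_minimal = ["A", "B", "B", "B"]
pred_definitions = ["A", "A", "B", "A"]
top3 = [["A", "B", "C"], ["B", "C", "D"], ["B", "A", "C"], ["A", "C", "D"]]

f1 = macro_f1(y_true, pred_minimal)
top3_acc = top3_accuracy(y_true, top3)
flips = flip_rate(pred_minimal, pred_definitions)
```

Because every label contributes equally to the average, a rare child label with a single test sentence moves Macro F1 as much as the Non-ESG category, which is why child-level scores sit well below accuracy in the results tables.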
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Li, Y.; Chen, T.; Ke, W.; Pang, P. ESG-SASB Label Stability: A Curated Benchmark and Reproducible Pipeline for Reusing Sentence-Level Sustainability Disclosure Labels. Informatics 2026, 13, 106. https://doi.org/10.3390/informatics13070106

