Agentic Generative AI for Methodology-Grounded Modelling from Unstructured Documents: Design and Evaluation of a Multi-Agent Ecosystem Mapping Pipeline
Abstract
1. Introduction
2. Related Work
2.1. Generative Artificial Intelligence for Structured Extraction and Modelling
2.2. Agentic Architectures and Orchestration Frameworks
2.3. Evaluation of Generative and Semantic Outputs
2.4. Closest Prior Work and Differentiation
2.5. Research Gap
3. Modelling as a Methodology-Grounded Transformation Process
3.1. Modelling as Structured Representation Construction
3.2. Ecosystem Modelling as a Structured Schema
3.3. Transformation Stages in Methodology-Grounded Modelling
3.4. Human Oversight and Representational Accountability
4. Agentic Generative AI Architecture for Modelling
4.1. Design Principles
4.2. Multi-Agent Decomposition Aligned with Modelling Stages
4.3. Controlled Model Editing and Governance Mechanisms
4.4. Separation of Planning and Execution
5. Evaluation Framework for Generative Modelling Pipelines
5.1. Challenges in Evaluating Semantic Modelling Outputs
5.2. Hybrid Evaluation Design
5.3. Experimental Setup
5.4. Scoring Procedures
6. Empirical Results
6.1. Document Extraction Performance
6.2. Performance Across Agent Roles
6.3. Model Selection as the Dominant Performance Factor
6.4. Summary of Empirical Patterns
7. Discussion
7.1. Implications for Methodology-Grounded Generative Modelling
7.2. Orchestration and Entity Generation as Emerging Capabilities
7.3. Reliable Automation and Its Structural Limits
7.4. Reference Integration and Incremental Modelling
7.5. Architectural Implications for Hybrid Modelling Frameworks
7.6. Provenance Preservation as a Structural Limitation
7.7. Operational Workflow Implications
7.8. Governance and Accountability
7.9. Limitations and Future Directions
8. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Detailed Per-Model Results
| Model | n | s | Total | Actor | Role | Inter. | Attrib | TEXT | IMAGE |
|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 | 30 | 30 | 0.798 | 0.936 | 0.967 | 0.567 | 0.723 | 0.986 | 0.708 |
| Claude Haiku 4.5 | 113 | 113 | 0.787 | 0.991 | 0.977 | 0.501 | 0.677 | 0.988 | 0.716 |
| GPT-5.2 | 106 | 88 | 0.782 | 0.951 | 1.000 | 0.479 | 0.699 | 0.991 | 0.692 |
| Gemini 2.5 Flash | 109 | 108 | 0.776 | 0.890 | 0.962 | 0.569 | 0.683 | 0.997 | 0.668 |
| Gemini 2.5 Pro | 33 | 32 | 0.769 | 0.909 | 0.896 | 0.571 | 0.700 | 0.983 | 0.641 |
| Claude Sonnet 4.5 | 33 | 33 | 0.756 | 0.889 | 0.926 | 0.551 | 0.659 | 0.995 | 0.632 |
| Qwen3-VL 235B | 30 | 10 | 0.733 | 0.925 | 0.933 | 0.519 | 0.554 | 0.967 | 0.662 |
| GPT-5-mini | 108 | 107 | 0.730 | 0.941 | 0.954 | 0.415 | 0.610 | 0.984 | 0.621 |
| Qwen3-VL 32B | 33 | 24 | 0.704 | 0.837 | 0.889 | 0.577 | 0.512 | 0.927 | 0.625 |
| Gemini 2.5 Flash Lite | 34 | 31 | 0.684 | 0.790 | 0.889 | 0.498 | 0.561 | 0.996 | 0.509 |
| Ministral 8B | 108 | 99 | 0.676 | 0.899 | 0.934 | 0.438 | 0.431 | 0.923 | 0.628 |
| Ministral 14B | 30 | 27 | 0.668 | 0.852 | 0.868 | 0.506 | 0.445 | 0.974 | 0.553 |
| Qwen3-VL 30B | 30 | 17 | 0.637 | 0.775 | 0.791 | 0.482 | 0.499 | 0.944 | 0.448 |
| Grok 4.1 Fast | 30 | 30 | 0.626 | 0.750 | 0.819 | 0.397 | 0.538 | 0.967 | 0.393 |
| Qwen3 235B | 30 | 6 | 0.619 | 0.708 | 0.833 | 0.397 | 0.539 | 0.993 | 0.360 |
| Ministral 3B | 34 | 31 | 0.599 | 0.841 | 0.824 | 0.367 | 0.365 | 0.869 | 0.512 |
| Qwen3-VL 8B | 34 | 4 | 0.594 | 0.771 | 0.778 | 0.429 | 0.400 | 0.969 | 0.392 |
| GPT-5.2 Chat | 30 | 28 | 0.548 | 0.732 | 0.770 | 0.284 | 0.406 | 0.893 | 0.329 |
| GPT-OSS 120B | 30 | 29 | 0.510 | 0.580 | 0.713 | 0.315 | 0.433 | 0.925 | 0.182 |
| Qwen3-Next 80B | 30 | 30 | 0.486 | 0.500 | 0.733 | 0.351 | 0.359 | 0.972 | 0.134 |
| GPT-5-nano | 108 | 108 | 0.467 | 0.654 | 0.698 | 0.245 | 0.273 | 0.833 | 0.266 |
| Kimi K2 | 30 | 26 | 0.457 | 0.500 | 0.671 | 0.299 | 0.358 | 0.925 | 0.088 |
| GPT-4o-mini | 34 | 31 | 0.324 | 0.449 | 0.624 | 0.164 | 0.058 | 0.775 | 0.052 |
| Mean | 1187 | 1042 | 0.640 | 0.786 | 0.846 | 0.431 | 0.499 | 0.947 | 0.470 |
| Model | n | s | Fail% | Completion | Roles (0–5) | Actors (0–3) | Inter. (0–8) | VCs (0–2) | Ørsted=3 | Permit Fixed | Ref. integration (0–6) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.5 | 15 | 15 | 0 | 100.0% | 5.0 | 3.0 | 8.0 | 2.0 | 100% | 100% | 5.0 |
| Claude Opus 4.5 | 15 | 15 | 0 | 100.0% | 5.0 | 3.0 | 8.0 | 2.0 | 100% | 100% | 5.5 |
| Grok 4.1 Fast | 15 | 15 | 0 | 99.3% | 5.0 | 3.0 | 8.0 | 1.9 ± 0.5 | 100% | 100% | 5.2 |
| GPT-5-mini | 15 | 11 | 27 | 96.5% | 5.0 | 3.0 | 8.0 | 1.4 ± 0.9 | 100% | 100% | 5.5 |
| GPT-5.2 Chat | 15 | 15 | 0 | 96.3% | 5.0 | 3.0 | 8.0 | 1.3 ± 0.9 | 100% | 87% | 5.1 |
| Claude Haiku 4.5 | 15 | 15 | 0 | 95.9% | 5.0 | 3.0 | 7.3 ± 1.0 | 2.0 | 100% | 60% | 5.1 |
| Nemotron 3 Nano 30B | 15 | 14 | 7 | 92.9% | 4.8 ± 0.8 | 2.8 ± 0.8 | 7.4 ± 2.1 | 1.7 ± 0.7 | 93% | 79% | 5.3 |
| Ministral 3B | 15 | 9 | 40 | 92.6% | 4.4 ± 1.7 | 3.3 ± 1.6 | 7.6 ± 2.1 | 1.3 ± 0.9 | 89% | 33% | 5.1 |
| GPT-5.2 | 15 | 15 | 0 | 91.1% | 5.0 | 3.0 | 6.4 ± 1.5 | 2.0 | 100% | 100% | 5.9 |
| Ministral 8B | 15 | 11 | 27 | 87.9% | 5.0 | 3.0 | 5.9 ± 1.5 | 1.9 ± 0.3 | 100% | 91% | 5.0 |
| Kimi K2 | 15 | 14 | 7 | 85.7% | 4.3 ± 1.8 | 2.6 ± 1.1 | 6.9 ± 2.9 | 1.7 ± 0.7 | 86% | 79% | 4.1 |
| Grok Code Fast 1 | 15 | 15 | 0 | 83.0% | 4.7 ± 1.3 | 2.4 ± 1.2 | 6.4 ± 2.8 | 1.5 ± 0.9 | 80% | 73% | 3.8 |
| GPT-5-nano | 15 | 12 | 20 | 74.1% | 5.0 | 3.7 ± 1.0 | 3.7 ± 3.5 | 1.0 ± 1.0 | 100% | 92% | 5.8 |
| Gemini 2.5 Flash Lite | 15 | 15 | 0 | 70.7% | 4.0 ± 2.1 | 2.4 ± 1.2 | 5.1 ± 3.8 | 1.2 ± 1.0 | 80% | 67% | 4.8 |
| Ministral 14B | 15 | 7 | 53 | 65.1% | 3.6 ± 2.4 | 2.1 ± 1.5 | 4.9 ± 3.6 | 1.1 ± 1.1 | 71% | 43% | 3.7 |
| Gemini 2.5 Pro | 15 | 13 | 13 | 45.3% | 2.3 ± 2.6 | 1.4 ± 1.6 | 3.5 ± 4.0 | 0.9 ± 1.0 | 46% | 46% | 2.8 |
| Qwen3-Next 80B | 15 | 13 | 13 | 42.3% | 3.5 ± 2.4 | 2.1 ± 1.4 | 1.2 ± 1.5 | 0.9 ± 1.0 | 69% | 69% | 3.5 |
| GPT-4o-mini | 15 | 15 | 0 | 41.9% | 4.3 ± 1.8 | 2.6 ± 1.1 | 0.5 ± 1.1 | 0.1 ± 0.5 | 87% | 13% | 2.7 |
| Qwen3-VL 30B | 15 | 4 | 73 | 29.2% | 2.5 ± 2.9 | 0.8 ± 1.5 | 2.0 ± 4.0 | 0.0 | 25% | 0% | 2.5 |
| Gemini 2.5 Flash | 15 | 14 | 7 | 7.1% | 0.4 ± 1.3 | 0.2 ± 0.8 | 0.6 ± 2.1 | 0.1 ± 0.5 | 7% | 7% | 0.5 |
| GPT-OSS 120B | 15 | 2 | 87 | 0.0% | 0.0 | 0.0 | 0.0 | 0.0 | 0% | 0% | 0.0 |
| Qwen3 8B | 15 | 15 | 0 | 0.0% | 0.0 | 0.0 | 0.0 | 0.0 | 0% | 0% | 1.0 |
| Qwen3 235B | 15 | 0 | 100 | — | — | — | — | — | — | — | — |
| Qwen3-VL 235B | 15 | 0 | 100 | — | — | — | — | — | — | — | — |
| Qwen3-VL 32B | 15 | 0 | 100 | — | — | — | — | — | — | — | — |
| Qwen3-VL 8B | 15 | 0 | 100 | — | — | — | — | — | — | — | — |
| Mean | 390 | 274 | 17 | 68.0% | 3.8 | 2.3 | 5.0 | 1.2 | 74% | 61% | 4.0 |
| Model | n | s | Fail% | Duration (s) | F1 | Quality |
|---|---|---|---|---|---|---|
| Claude Haiku 4.5 | 10 | 10 | 0 | 259 ± 81 | 1.00 | 2.8 ± 0.6 |
| Claude Sonnet 4.5 | 10 | 9 | 10 | 373 ± 194 | 1.00 | 2.3 ± 0.5 |
| o4-mini | 6 | 3 | 50 | 394 ± 115 | 1.00 | 2.7 ± 0.6 |
| Ministral 8B | 10 | 1 | 90 | 739 | 1.00 | 2.0 |
| Claude Opus 4.5 | 10 | 9 | 10 | 291 ± 114 | 0.98 ± 0.06 | 2.1 ± 0.3 |
| GPT-5-mini | 10 | 7 | 30 | 620 ± 161 | 0.98 ± 0.04 | 3.1 ± 0.9 |
| Claude Sonnet 4 | 6 | 6 | 0 | 390 ± 96 | 0.96 ± 0.07 | 2.7 ± 0.5 |
| GPT-5-nano | 10 | 6 | 40 | 762 ± 85 | 0.92 ± 0.14 | 2.2 ± 0.4 |
| GPT-5.2 | 10 | 8 | 20 | 437 ± 114 | 0.91 ± 0.10 | 3.8 ± 0.9 |
| Ministral 3B | 10 | 6 | 40 | 465 ± 276 | 0.86 ± 0.18 | 2.3 ± 0.5 |
| Grok Code Fast 1 | 10 | 6 | 40 | 456 ± 254 | 0.79 ± 0.18 | 2.8 ± 0.8 |
| Grok 4.1 Fast | 10 | 7 | 30 | 412 ± 71 | 0.75 ± 0.35 | 2.1 ± 0.4 |
| GPT-5.1 | 16 | 11 | 31 | 340 ± 185 | 0.71 ± 0.46 | 3.3 ± 1.1 |
| GPT-5.2 Chat | 10 | 10 | 0 | 215 ± 87 | 0.70 ± 0.39 | 3.3 ± 0.7 |
| GPT-4o-mini | 17 | 16 | 6 | 205 ± 91 | 0.69 ± 0.30 | 2.0 |
| Gemini 2.5 Flash | 10 | 6 | 40 | 337 ± 191 | 0.66 ± 0.26 | 2.0 |
| Gemini 2.5 Flash Lite | 10 | 4 | 60 | 116 ± 61 | 0.65 ± 0.44 | 2.2 ± 0.5 |
| Gemini 2.5 Pro | 16 | 13 | 19 | 282 ± 77 | 0.60 ± 0.20 | 2.0 |
| Ministral 14B | 10 | 4 | 60 | 75 ± 103 | 0.25 ± 0.50 | 2.0 |
| GPT-5.1 Chat | 10 | 10 | 0 | 73 ± 53 | 0.18 ± 0.30 | 2.1 ± 0.6 |
| GPT-OSS 20B | 10 | 4 | 60 | 84 ± 88 | 0.14 ± 0.27 | 2.0 |
| Qwen3-Next 80B | 10 | 6 | 40 | 85 ± 142 | 0.11 ± 0.27 | 1.7 ± 0.5 |
| Qwen3 8B | 10 | 7 | 30 | 111 ± 120 | 0.00 | 1.7 ± 0.5 |
| GPT-OSS 120B | 10 | 2 | 80 | 13 | 0.00 | 2.0 |
| Qwen3 235B | 10 | 0 | 100 | — | — | — |
| Mean | 261 | 171 | 34.5 | 314 | 0.66 | |
| Test Case | Models | Runs | Baseline | Mean |
|---|---|---|---|---|
| TC-D: Document extraction | 6 | 300 | T = 0.7 | 0.024 |
| TC-G: Entity generation | 6 | 540 | T = 0.7 | 0.025 |
| TC-S: Search relevance | 7 | 359 | T = 0.7 | <0.01 |
| TC-M: Methodology expert | 7 | 519 | T = 1.0 | 0.028 |
| Test Case | Metric | N | Mean | SD | 95 % CI |
|---|---|---|---|---|---|
| TC-O | Pipeline F1 | 24 | 0.660 | 0.349 | [0.513, 0.794] |
| TC-G | Entity completion (%) | 22 | 68.0 | 34.1 | [52.6, 82.0] |
| TC-D | Total score | 23 | 0.640 | 0.127 | [0.588, 0.690] |
| TC-D | Actor | 23 | 0.786 | 0.157 | [0.722, 0.847] |
| TC-D | Role | 23 | 0.846 | 0.108 | [0.801, 0.889] |
| TC-D | Interaction | 23 | 0.431 | 0.115 | [0.383, 0.477] |
| TC-D | Attribution | 23 | 0.499 | 0.162 | [0.431, 0.562] |
| TC-D | TEXT | 23 | 0.947 | 0.058 | [0.921, 0.968] |
| TC-D | IMAGE | 23 | 0.470 | 0.211 | [0.384, 0.553] |
| TC-M | Weighted quality score | 25 | 0.877 | 0.069 | [0.848, 0.904] |
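
The intervals in the table above are not symmetric around their means, which is what a resampling-based interval would produce. The source does not state how they were computed, so the following is only a minimal sketch under the assumption of a percentile bootstrap over per-model means. The input values are the 24 per-model F1 scores as printed (rounded) in the TC-O table of this appendix; the function name `bootstrap_ci`, the resample count, and the seed are illustrative choices, not taken from the paper.

```python
import random
import statistics

# Per-model pipeline F1 values as printed in the TC-O appendix table (24 models with data).
F1_SCORES = [
    1.00, 1.00, 1.00, 1.00, 0.98, 0.98, 0.96, 0.92,
    0.91, 0.86, 0.79, 0.75, 0.71, 0.70, 0.69, 0.66,
    0.65, 0.60, 0.25, 0.18, 0.14, 0.11, 0.00, 0.00,
]


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean (assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


mean_f1 = statistics.fmean(F1_SCORES)
low, high = bootstrap_ci(F1_SCORES)
print(f"mean={mean_f1:.3f}, 95% CI=[{low:.3f}, {high:.3f}]")
```

Because the inputs are rounded and the resampling is random, the printed bounds should only be expected to land near the reported interval, not to reproduce it exactly.
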
References
- Ma, Z.; Christensen, K.; Jørgensen, B.N. Business ecosystem architecture development: A case study of Electric Vehicle home charging. Energy Inform. 2021, 4, 9.
- Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2023, arXiv:2303.18223.
- Xu, D.; Chen, W.; Peng, W.; Zhang, C.; Xu, T.; Zhao, X.; Wu, X.; Zheng, Y.; Chen, E. Large Language Models for Generative Information Extraction: A Survey. Front. Comput. Sci. 2024, 18, 186357.
- Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 2025, 43, 42.
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the Advances in Neural Information Processing Systems 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 9459–9474.
- Dagdelen, J.; Dunn, A.; Lee, S.; Walker, N.; Rosen, A.S.; Ceder, G.; Persson, K.A.; Jain, A. Structured Information Extraction from Scientific Text with Large Language Models. Nat. Commun. 2024, 15, 1418.
- Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A Survey on Large Language Model based Autonomous Agents. Front. Comput. Sci. 2024, 18, 186345.
- Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the International Conference on Learning Representations (ICLR); ICLR: Appleton, WI, USA, 2023.
- Guo, T.; Chen, X.; Wang, Y.; Chang, R.; Pei, S.; Chawla, N.V.; Wiest, O.; Zhang, X. Large Language Model based Multi-Agents: A Survey of Progress and Challenges. In Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI-24); Survey Track; IJCAI: Menlo Park, CA, USA, 2024; pp. 8048–8057.
- Ma, Z. Business ecosystem modeling—The hybrid of system modeling and ecological modeling: An application of the smart grid. Energy Inform. 2019, 2, 35.
- Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.P.; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Proceedings of the Advances in Neural Information Processing Systems 2023; Curran Associates Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 46595–46623.
- Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. 2023, 55, 1–35.
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems 2022; Curran Associates Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 24824–24837.
- Wei, X.; Cui, X.; Cheng, N.; Wang, X.; Zhang, X.; Huang, S.; Xie, P.; Xu, J.; Chen, Y.; Zhang, M.; et al. ChatIE: Zero-Shot Information Extraction via Chatting with ChatGPT. arXiv 2023, arXiv:2302.10205.
- Luo, Y.; Ru, X.; Liu, K.; Yuan, L.; Sun, M.; Zhang, N.; Liang, L.; Zhang, Z.; Zhou, J.; Wei, L.; et al. OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System. In Companion Proceedings of the ACM on Web Conference 2025; Association for Computing Machinery: New York, NY, USA, 2025.
- Chhetri, T.R.; Chen, Y.; Trivedi, P.; Jarecka, D.; Haobsh, S.; Ray, P.; Ng, L.; Ghosh, S.S. StructSense: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking. arXiv 2025, arXiv:2507.03674.
- Colakoglu, G.; Solmaz, G.; Fürst, J. AgenticIE: An Adaptive Agent for Information Extraction from Complex Regulatory Documents. arXiv 2025, arXiv:2509.11773.
- Luo, J.; Zhang, W.; Yuan, Y.; Zhao, Y.; Yang, J.; Gu, Y.; Wu, B.; Chen, B.; Qiao, Z.; Long, Q.; et al. Large Language Model Agent: A Survey on Methodology, Applications and Challenges. arXiv 2025, arXiv:2503.21460.
- Hong, S.; Zhuge, M.; Chen, J.; Zheng, X.; Cheng, Y.; Zhang, C.; Wang, J.; Wang, Z.; Yau, S.K.S.; Lin, Z.; et al. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. In Proceedings of the International Conference on Learning Representations (ICLR); ICLR: Appleton, WI, USA, 2024.
- Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language Models Can Teach Themselves to Use Tools. In Proceedings of the Advances in Neural Information Processing Systems 2023; Curran Associates Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 68539–68551.
- Qu, C.; Dai, S.; Wei, X.; Cai, H.; Wang, S.; Yin, D.; Xu, J.; Wen, J.R. Tool Learning with Large Language Models: A Survey. Front. Comput. Sci. 2025, 19, 198343.
- Shinn, N.; Cassano, F.; Berman, E.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language Agents with Verbal Reinforcement Learning. In Proceedings of the Advances in Neural Information Processing Systems 2023; Curran Associates Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 8634–8652.
- Bavaresco, A.; Bernardi, R.; Bertolazzi, L.; Elliott, D.; Fernández, R.; Gatt, A.; Ghaleb, E.; Giulianelli, M.; Hanna, M.; Koller, A.; et al. LLMs instead of Human Judges? A Large-Scale Empirical Study across 20 NLP Evaluation Tasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 238–255.
- Calderon, N.; Reichart, R.; Dror, R. The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 16051–16081.
- Lu, Y.; Liu, Y.; Dong, S.; Song, Q.; Zhang, C.; Zhao, Y.; Lu, J. KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment. arXiv 2025, arXiv:2502.06472.
- Lin, L.; Zeng, W.; Luo, B.; Wen, L.; Wang, J. MAO: A Framework for Process Model Generation with Multi-Agent Orchestration. arXiv 2024, arXiv:2408.01916.
- Dijkstra, E.W. On the Role of Scientific Thought. In Selected Writings on Computing: A Personal Perspective; Originally written in 1974 as EWD447; Springer: New York, NY, USA, 1982; pp. 60–66.
- Xi, Y.; Lin, J.; Xiao, Y.; Zhou, Z.; Shan, R.; Gao, T.; Zhu, J.; Liu, W.; Yu, Y.; Zhang, W. A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges. arXiv 2025, arXiv:2508.05668.
- Auer, C.; Lysak, M.; Nassar, A.; Dolfi, M.; Livathinos, N.; Vagenas, P.; Berrospi Ramis, C.; Omenetti, M.; Lindlbauer, F.; Dinkla, K.; et al. Docling Technical Report. arXiv 2024, arXiv:2408.09869.
- Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 1960, 20, 37–46.
- Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159–174.
- Bass, L.; Clements, P.; Kazman, R. Software Architecture in Practice, 4th ed.; SEI Series in Software Engineering; Addison-Wesley: London, UK, 2021.
- Doshi-Velez, F.; Kim, B. Towards A Rigorous Science of Interpretable Machine Learning. arXiv 2017, arXiv:1702.08608.



| Dimension | OneKE | StructSense | AgenticIE | KARMA | MAO | Present Work |
|---|---|---|---|---|---|---|
| Lifecycle coverage | Extraction and error correction only | Extraction, ontology alignment, judging, HITL feedback | Single-document extraction (KIE/QA) | Ingestion through KG integration | Single-text to process model generation | Boundary, retrieval, conversion, extraction, validation, editing |
| Ontology enforcement | Output format specs (Pydantic/JSON); KB-backed schema retrieval | Post-hoc alignment to formal ontologies via vector DB | JSON schema templates (fixed and open) | Schema alignment agent maps to existing KG types | BPMN format constraints injected via prompts | Schema-bound typed output constraints; methodology expert validation |
| Provenance and gating | None; case repository tracks history | Source sentence tracking; HITL via feedback agent | Verification step; no HITL | Cross-agent verification; optional manual review escalation | No provenance; no HITL | Source-linked provenance; staged proposals with human accept/reject |
| Agent decomposition | 3 agents: schema, extract, reflect | 4 agents: extract, align, judge, feedback | Single agent with tool-calling loop | 9 agents by pipeline stage; central controller | 3 roles across 4 phases (generate, refine, review, test) | 5 agents and orchestrator aligned with methodology stages |
| Evaluation | 2 IE benchmarks; ablation study | 3 tasks, 3 models; P/R/F1 and concept alignment | 1 domain; schema adherence and exact match; 3 model variants | 3 domains, 3 models; 1200 articles; LLM-verified correctness | 4 datasets; distance to reference BPMN; ablation study | 4300 runs, 34 models; operational metrics, LLM judge, human validation |
| Construct | Definition | Structural Constraints |
|---|---|---|
| Actor | An identifiable organisational entity or institution participating in the ecosystem | Must have a unique identifier; classified as either an active actor or a passive object, where objects represent entities such as infrastructure components that participate without exercising independent initiative |
| Role | A function or position that an actor assumes within the ecosystem | An actor may hold multiple roles; multiple actors may share the same role |
| Interaction | A structured relationship between two roles, typed as one of five categories, namely monetary value, intangible value, goods, information, or data exchange | Must reference two existing participants; typed and optionally directional; must cite evidential source |
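
The structural constraints in the construct table above can be read as validation rules on a small data model. The following is a minimal sketch under stated assumptions, not the authors' schema: the class and field names (`Ecosystem`, `actor_id`, `evidence`, and so on) are hypothetical, interaction participants are assumed to be roles, and the example actors and source reference are invented for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class InteractionType(Enum):
    """The five interaction categories named in the construct table."""
    MONETARY_VALUE = "monetary value"
    INTANGIBLE_VALUE = "intangible value"
    GOODS = "goods"
    INFORMATION = "information"
    DATA_EXCHANGE = "data exchange"


@dataclass(frozen=True)
class Actor:
    actor_id: str            # must be unique within the ecosystem
    name: str
    is_object: bool = False  # True for passive objects such as infrastructure


@dataclass(frozen=True)
class Interaction:
    source_role: str
    target_role: str
    kind: InteractionType
    evidence: str            # citation of the evidential source
    directed: bool = True    # direction is optional in the schema


@dataclass
class Ecosystem:
    actors: dict = field(default_factory=dict)     # actor_id -> Actor
    roles: set = field(default_factory=set)        # role names
    assignments: set = field(default_factory=set)  # (actor_id, role) pairs
    interactions: list = field(default_factory=list)

    def add_actor(self, actor: Actor) -> None:
        if actor.actor_id in self.actors:
            raise ValueError(f"duplicate actor id: {actor.actor_id}")
        self.actors[actor.actor_id] = actor

    def assign_role(self, actor_id: str, role: str) -> None:
        # Many-to-many: an actor may hold several roles, and roles may be shared.
        if actor_id not in self.actors:
            raise ValueError(f"unknown actor: {actor_id}")
        self.roles.add(role)
        self.assignments.add((actor_id, role))

    def add_interaction(self, interaction: Interaction) -> None:
        # Both participants must already exist, and a source must be cited.
        for role in (interaction.source_role, interaction.target_role):
            if role not in self.roles:
                raise ValueError(f"unknown role: {role}")
        if not interaction.evidence.strip():
            raise ValueError("interaction must cite an evidential source")
        self.interactions.append(interaction)


# Hypothetical usage with invented actors and an invented source reference.
eco = Ecosystem()
eco.add_actor(Actor("a1", "Example Utility"))
eco.add_actor(Actor("a2", "Example Charging Station", is_object=True))
eco.assign_role("a1", "electricity supplier")
eco.assign_role("a2", "charging point")
eco.add_interaction(
    Interaction("electricity supplier", "charging point",
                InteractionType.GOODS, evidence="example-report.pdf, p. 3")
)
```

Rejecting invalid edits at insertion time, rather than cleaning them up afterwards, is one way to keep every stored interaction traceable to a cited source.
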
| Test Case | Agent Tested | Input and Ground Truth | Models/Runs |
|---|---|---|---|
| TC-O: Orchestrator | Orchestrator with fixed sub-agents | User-level task; 6 ground-truth actors in Danish energy domain | 25/261 |
| TC-G: Entity generation | Ecosystem editor | Multi-part instruction with 6 reference documents; 18 ground-truth entities | 26/390 |
| TC-D: Document extraction | Document analyser | Purpose-built PDF with prose text and embedded diagram; 42 ground-truth entities | 23/1187 |
| TC-S: Search relevance | Search agent | Relevant vs. irrelevant web page; binary classification | 30/432 |
| TC-M: Methodology expert | Methodology expert | Boundary definition task with 2 reference documents; rubric-based quality assessment | 25/394 |
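
Several test cases in the table above score a predicted entity set against a fixed ground truth, for example the six ground-truth actors in TC-O. As a minimal sketch of how a set-based F1 of that kind can be computed, assuming entities are matched by case- and whitespace-normalised name (the paper's actual matching rule is not given here), the helper names `normalise` and `set_f1` and the placeholder actor names below are hypothetical.

```python
def normalise(name: str) -> str:
    """Case-fold and collapse whitespace so trivial variants match (assumed rule)."""
    return " ".join(name.casefold().split())


def set_f1(predicted, truth):
    """Return (precision, recall, F1) for two collections of entity names."""
    pred = {normalise(p) for p in predicted}
    gold = {normalise(t) for t in truth}
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Placeholder names: six ground-truth actors, five predictions, four of them correct.
truth = ["Actor A", "Actor B", "Actor C", "Actor D", "Actor E", "Actor F"]
predicted = ["actor a", "Actor  B", "Actor C", "Actor D", "Actor X"]
precision, recall, f1 = set_f1(predicted, truth)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

With four of five predictions correct against six ground-truth actors, precision is 4/5, recall is 4/6, and F1 is 8/11.
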
| Test Case | Task Category | Mean Score | 95 % CI | Range | Models |
|---|---|---|---|---|---|
| TC-O: Orchestrator | Pipeline F1 | 0.66 | [0.513, 0.794] | 0.00 to 1.00 | 25 |
| | Successful completion rate | 65.5% | | | |
| TC-G: Entity generation | Entity completion | 68.0% | [52.6, 82.0] | 0 to 100% | 26 |
| | Reference integration | 4.0/6 | | | |
| TC-D: Document extraction | Total score | 0.640 | [0.588, 0.690] | 0.32 to 0.80 | 23 |
| by entity type | Actor identification | 0.786 | [0.722, 0.847] | 0.45 to 0.99 | |
| | Role extraction | 0.846 | [0.801, 0.889] | 0.62 to 1.00 | |
| | Interaction extraction | 0.431 | [0.383, 0.477] | 0.16 to 0.58 | |
| | Attribution | 0.499 | [0.431, 0.562] | 0.06 to 0.72 | |
| by modality | Text-sourced | 0.947 | [0.921, 0.968] | 0.78 to 1.00 | |
| | Image-sourced | 0.470 | [0.384, 0.553] | 0.05 to 0.72 | |
| TC-S: Search relevance | Classification accuracy | 100%, 27 of 30 | | | 30 |
| TC-M: Methodology expert | Weighted quality score | 0.877 | [0.848, 0.904] | 0.75 to 0.99 | 25 |
| Model | Actor | Role | Inter. | TEXT | IMAGE |
|---|---|---|---|---|---|
| Claude Opus 4.5 | 0.936 | 0.967 | 0.567 | 0.986 | 0.708 |
| Claude Haiku 4.5 | 0.991 | 0.977 | 0.501 | 0.988 | 0.716 |
| GPT-5.2 | 0.951 | 1.000 | 0.479 | 0.991 | 0.692 |
| Gemini 2.5 Flash | 0.890 | 0.962 | 0.569 | 0.997 | 0.668 |
| Gemini 2.5 Pro | 0.909 | 0.896 | 0.571 | 0.983 | 0.641 |
| GPT-5-mini | 0.941 | 0.954 | 0.415 | 0.984 | 0.621 |
| Qwen3-VL 235B | 0.925 | 0.933 | 0.519 | 0.967 | 0.662 |
| Ministral 8B | 0.899 | 0.934 | 0.438 | 0.923 | 0.628 |
| Grok 4.1 Fast | 0.750 | 0.819 | 0.397 | 0.967 | 0.393 |
| GPT-5-nano | 0.654 | 0.698 | 0.245 | 0.833 | 0.266 |
| Kimi K2 | 0.500 | 0.671 | 0.299 | 0.925 | 0.088 |
| GPT-4o-mini | 0.449 | 0.624 | 0.164 | 0.775 | 0.052 |
| Mean | 0.786 | 0.846 | 0.431 | 0.947 | 0.470 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Gärdström, H.F.; Jørgensen, B.N.; Ma, Z.G. Agentic Generative AI for Methodology-Grounded Modelling from Unstructured Documents: Design and Evaluation of a Multi-Agent Ecosystem Mapping Pipeline. Information 2026, 17, 570. https://doi.org/10.3390/info17060570

