Agentic Generative AI for Methodology-Grounded Modelling from Unstructured Documents: Design and Evaluation of a Multi-Agent Ecosystem Mapping Pipeline

Gärdström, Hampus Fink; Jørgensen, Bo Nørregaard; Ma, Zheng Grace

doi:10.3390/info17060570

by Hampus Fink Gärdström, Bo Nørregaard Jørgensen * and Zheng Grace Ma

Mærsk Mc-Kinney Møller Institute, University of Southern Denmark, 5230 Odense, Denmark

* Author to whom correspondence should be addressed.

Information 2026, 17(6), 570; https://doi.org/10.3390/info17060570

Submission received: 1 May 2026 / Revised: 1 June 2026 / Accepted: 3 June 2026 / Published: 9 June 2026

(This article belongs to the Special Issue Modeling in the Era of Generative AI)

Abstract

Modelling constitutes a disciplined transformation process through which heterogeneous, unstructured evidence is translated into structured representations that support reasoning and decision-making. The integration of generative artificial intelligence into such processes introduces new possibilities for automation, yet risks undermining methodological rigour, traceability, and human accountability. This paper proposes a methodology-grounded multi-agent architecture for constructing structured business ecosystem maps from unstructured document collections. The architecture decomposes the modelling lifecycle into specialised agent functions covering boundary specification, source discovery, document analysis, semantic extraction, and controlled model editing, addressing four of the five methodology stages while leaving automated completeness verification outside the current scope. A central orchestrator coordinates agents while enforcing ontological constraints derived from a formal modelling methodology. All proposed modifications are staged for human review before execution, and each map element maintains explicit provenance links to source material. To evaluate the reliability and correctness of generative modelling pipelines, a hybrid evaluation framework integrates operational metrics, semantic assessment using an LLM-based judge, and human agreement validation. Empirical evaluation across 34 generative models and 4382 experimental runs characterises capabilities across modelling tasks. In a controlled single-document extraction task, text-based extraction achieves a mean semantic match score of 0.947, whereas interaction extraction scores 0.431 and visual diagram interpretation scores 0.470, identifying relational reasoning and multimodal interpretation as principal bottlenecks. Model performance varies across agent roles, with task-aligned model selection associated with larger performance changes than hyperparameter tuning; the architecture’s causal contribution is not isolated, and comparison against monolithic or ablated baselines remains future work.

Keywords: generative artificial intelligence; multi-agent systems; business ecosystem modelling; methodology-grounded modelling; LLM evaluation; agentic architecture

1. Introduction

Modelling pervades scientific, engineering, and organisational practice as a systematic transformation process through which heterogeneous, incomplete, and often ambiguous evidence is translated into structured representations that support reasoning, communication, and decision-making. Whether expressed as process models, system architectures, ecosystem maps, or conceptual frameworks, modelling artefacts provide formal structure to complex realities. Traditionally, the construction of such artefacts has relied on manual analysis of documents, expert interviews, and iterative synthesis. This manual character constrains scalability, introduces subjectivity, and limits the frequency with which models can be updated in dynamic environments.

These constraints intensify as the volume and diversity of source material increase. Constructing a single ecosystem map may require systematic analysis of dozens or hundreds of documents spanning policy reports, strategic publications, technical specifications, and organisational diagrams. Analysts must traverse these sources repeatedly to identify entities, reconcile terminology, and verify relational structures. The process depends on the availability of domain experts whose time is limited, and the resulting models reflect the interpretive choices of individual analysts. When source material evolves or new documents become available, updating the map demands re-engagement with the entire evidence base. These characteristics make manual modelling workflows inherently difficult to scale, reproduce, or maintain over time [1].

The rapid advancement of generative artificial intelligence, hereafter GAI, introduces new possibilities for reshaping modelling practices. Large language models and multimodal generative systems demonstrate a capacity to analyse unstructured textual and graphical material, extract entities and relations, and produce structured outputs in natural and formal languages [2,3]. At the same time, these systems are inherently probabilistic, non-deterministic, and susceptible to hallucination and inconsistency [4]. Without methodological constraints and governance mechanisms, these characteristics may compromise the reliability of modelling outputs.

Integrating GAI into modelling processes therefore requires preserving methodological rigour, source attribution, and human accountability while exploiting its capacity for large-scale semantic extraction and synthesis. Several strategies exist along a spectrum of structural complexity. Retrieval-augmented generation grounds model outputs in external documents, reducing hallucination risk but not enforcing ontological constraints across extraction stages [5]. Constrained decoding and schema-guided output ensure format compliance but do not coordinate multi-step reasoning across a modelling lifecycle [6]. These limitations motivate multi-agent architectures that distribute modelling responsibilities across specialised agents under centralised orchestration, with established modelling methodologies providing the conceptual backbone while generative agents augment specific lifecycle stages [7,8,9].

The empirical context of this study is business ecosystem modelling. Ecosystem maps aim to represent actors, roles, and interactions within complex value networks [1,10]. Constructing such maps from policy documents, reports, web resources, and diagrams can be labour-intensive and cognitively demanding [1]. It requires consistent interpretation of organisational roles, value exchanges, and structural relationships across heterogeneous sources. This domain provides three characteristics that make it a suitable context for investigating GAI-assisted modelling. The ontology is formally specified, comprising three core constructs with explicit definitional criteria, which enables precise evaluation against reference models. The source material is heterogeneous, spanning textual descriptions, tabular data, and graphical diagrams, thereby exercising both language understanding and multimodal interpretation capabilities. The modelling task requires not only entity enumeration but relational reasoning, as the structural topology of actor interactions determines the explanatory value of the resulting map. These properties are shared with a broader class of structured knowledge construction problems, lending the findings potential relevance beyond the immediate domain.

To this end, the study designs and evaluates a multi-agent GAI pipeline for transforming unstructured document collections into structured ecosystem maps. The architecture decomposes the modelling process into five specialised agents, each aligned with a distinct transformation stage from boundary specification through controlled model editing. All proposed modifications to the ecosystem map are staged for human review prior to execution, and each map element is linked to its documentary sources through explicit provenance records. The implementation operationalises four of the five methodological stages, while completeness verification remains under human control and is not yet computationally automated.

Generative modelling pipelines produce semantic outputs that cannot be adequately assessed through exact-match metrics alone [11]. The study therefore develops a hybrid evaluation framework that combines operational reliability indicators with semantic assessment using an LLM-based judge, complemented by human agreement validation. This approach enables systematic characterisation of both pipeline robustness and modelling correctness under conditions of semantic variability.

The study is positioned as a design and characterisation study rather than a comparative or ablation evaluation. It develops a methodology-grounded architecture and characterises generative model performance across the modelling stages it operationalises, but does not isolate the causal contribution of individual architectural components through comparison against monolithic or simplified baselines. Its four contributions are framed accordingly. First, the study provides empirical evidence that ontology-constrained generative agents can reach high extraction accuracy for entities and roles within a controlled single-document setting, while relational interaction extraction remains consistently difficult across model configurations. Second, it proposes a modular architecture in which agent responsibilities are aligned with modelling stages, and reports that within this architecture task-aligned model selection produces larger performance changes than hyperparameter tuning; the causal contribution of stage alignment itself is not directly measured. Third, it develops a hybrid evaluation framework for semantic modelling outputs, combining operational metrics with LLM-based judgement calibrated against human agreement. Fourth, it provides an empirical characterisation of model capabilities across 34 models and five agent roles, indicating that no single model dominates all stages and that multimodal diagram interpretation constitutes the primary differentiator of overall pipeline quality.

2. Related Work

The integration of GAI into modelling workflows builds upon three intersecting research streams. The first concerns the use of large language models for structured information extraction and knowledge construction. The second addresses agentic architectures and orchestration frameworks designed to coordinate complex multi-step reasoning processes. The third focuses on the evaluation of semantic outputs generated by probabilistic models.

2.1. Generative Artificial Intelligence for Structured Extraction and Modelling

Large language models have demonstrated growing capabilities in entity recognition, relation extraction, summarisation, and structured output generation from heterogeneous textual sources, though performance on extraction tasks still falls short of fine-tuned specialist models in established extraction benchmarks [2,3]. Prompt-based approaches allow users to instruct models to extract actors, classify roles, or identify relationships within documents [12], and chain-of-thought prompting has been shown to improve multi-step reasoning [13]. More advanced configurations support ontology-constrained generation, where outputs must conform to predefined formats such as JSON structures or domain-specific templates [6]. These developments have encouraged experimentation with automated model construction from text, including the generation of knowledge graphs, process models, and conceptual representations.

However, most existing approaches treat extraction as a single-stage task rather than as part of a broader modelling lifecycle. The emphasis is frequently placed on prompt engineering or fine-tuning, with limited attention to how extracted elements are integrated into a coherent and evolving modelling artefact [14,15]. Moreover, generative extraction is typically evaluated in isolation from the methodological frameworks that govern model structure and interpretation. As a result, while large language models can produce structured fragments, the systematic construction of methodologically consistent models from diverse sources remains insufficiently addressed.

A further limitation concerns relational reasoning. While entity identification often achieves acceptable performance, the extraction of interactions, value flows, or structural dependencies presents greater challenges [16,17]. Relational constructs require the generative model to interpret context, infer implicit connections, and distinguish between actors, activities, and artefacts. In modelling contexts, such distinctions are ontological rather than purely linguistic. Misclassification can propagate structural inconsistencies throughout the ecosystem map.

2.2. Agentic Architectures and Orchestration Frameworks

Recent developments in agentic GAI move beyond single-prompt interactions towards coordinated multi-step workflows [7,18]. In such architectures, specialised agents perform discrete tasks such as search, planning, analysis, and synthesis. An orchestration layer manages task decomposition, information flow, and error handling. This paradigm reflects an emerging recognition that complex reasoning processes are better supported by modular agent collaboration than by monolithic model invocations [9,19].

Agentic systems often incorporate external tools, retrieval mechanisms, and memory components. Retrieval-augmented generation techniques enable models to ground outputs in external documents, reducing hallucination risk [5]. Tool-calling frameworks allow models to invoke structured functions or access databases [20,21]. Memory structures maintain contextual continuity across multi-step tasks. Together, these components facilitate the construction of workflows that implement structured procedural reasoning [8,22].

Nevertheless, most reviewed agentic systems remain technology-centric. The partitioning of tasks is frequently driven by technical convenience, with limited attention to alignment with established modelling stages. As a result, while agentic architectures improve robustness and modularity, they do not automatically ensure that generated outputs conform to domain-specific modelling constraints.

2.3. Evaluation of Generative and Semantic Outputs

Evaluating generative systems presents distinctive methodological challenges. Traditional deterministic metrics such as exact match or token-level accuracy are poorly suited to tasks involving semantic variability and open-ended outputs. Two semantically equivalent modelling elements may differ in wording, granularity, or representation while preserving substantive meaning. Conversely, superficially similar outputs may differ in structural correctness or ontological classification.

To address these issues, recent research explores the use of large language models as evaluators of generative outputs [11]. In this approach, a model assesses the quality or correctness of another model’s output through pairwise comparison or single-answer grading, often without requiring a reference answer. While this strategy offers flexibility and scalability, it introduces questions of reliability and bias [23]. Without validation against human judgement, LLM-based evaluation may reproduce the same weaknesses present in generative outputs.

Hybrid evaluation designs that combine operational metrics with semantic assessment and human validation are therefore increasingly advocated [24]. Operational metrics capture reliability characteristics such as task completion rates and stage failures. Semantic evaluation assesses correctness relative to modelling intent. Human validation provides a grounding reference for interpreting automated assessments.
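As a minimal sketch of how the three layers of such a hybrid design might be computed side by side, the following example derives one operational metric, one semantic metric, and one human-agreement statistic from a list of run records. The record fields (`completed`, `judge_score`), the function names, and the choice of Cohen's kappa as the agreement statistic are illustrative assumptions and are not taken from any of the cited frameworks.

```python
from collections import Counter
from statistics import mean


def completion_rate(runs: list[dict]) -> float:
    """Operational layer: share of runs that finished without a stage failure."""
    return sum(1 for run in runs if run["completed"]) / len(runs)


def mean_semantic_score(runs: list[dict]) -> float:
    """Semantic layer: mean judge score, computed over completed runs only."""
    return mean(run["judge_score"] for run in runs if run["completed"])


def cohen_kappa(judge: list[str], human: list[str]) -> float:
    """Human-validation layer: chance-corrected agreement (assumed statistic)."""
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = judge_counts.keys() | human_counts.keys()
    expected = sum(judge_counts[k] * human_counts[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Keeping the three functions separate mirrors the argument above: a pipeline can complete reliably while scoring poorly on semantic correctness, and a judge score is only interpretable once its agreement with human labels is known.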

2.4. Closest Prior Work and Differentiation

Within the reviewed literature, existing multi-agent systems address individual stages of the modelling lifecycle, from extraction through knowledge graph construction, but do not jointly integrate methodology-anchored decomposition, full lifecycle coverage, and provenance-linked governance within a single architecture. The architecturally nearest systems originate from the agentic information extraction and knowledge graph construction space, which shares three core design elements with the present work, namely multi-agent coordination, schema-guided output, and ontology alignment, yet pursues extraction as an end in itself rather than as one stage within a broader modelling process. The comparison below therefore targets shared architectural patterns rather than shared purpose. Table 1 summarises five such systems across five dimensions that jointly characterise a methodology-grounded modelling pipeline.

OneKE [15] coordinates three agents for schema-guided extraction with reflective error correction, but treats schemas as output format specifications rather than domain ontology constraints and covers only the extraction phase, without extending to upstream scoping or downstream human governance. StructSense [16] is the closest prior system, integrating ontology-guided extraction, human-in-the-loop feedback, and four-agent coordination. However, it applies ontology alignment post hoc rather than enforcing typing during generation, and it addresses the extraction-to-alignment portion of the processing chain without extending to upstream scoping or controlled model integration. AgenticIE [17] extracts structured information from regulatory documents using a planner-executor-responder loop with tool routing, but performs no source discovery and identifies the absence of human-in-the-loop correction as a limitation. KARMA [25] deploys nine agents for knowledge graph enrichment with cross-agent verification, representing the largest reviewed pipeline. Agent decomposition follows technical pipeline stages without alignment to a domain modelling methodology, and while contradictions can be escalated to manual expert review, the system provides no systematic human-in-the-loop governance framework. MAO [26] generates BPMN process models through three agent roles across four orchestrated phases. The phases correspond to a software engineering lifecycle rather than a domain-specific modelling methodology, and MAO operates on a single input text without source discovery or integration with an evolving model.

Relative to these five systems, the distinctive methodological position of the present work lies in its organising principle of methodology anchoring, that is, the decomposition of agent responsibilities according to the stages of a domain-specific modelling methodology rather than technical pipeline stages. Two further properties, which none of the reviewed systems jointly exhibit, follow from that anchoring rather than standing as independent features. Because the agents track the modelling stages, the lifecycle is covered from boundary specification through controlled model editing rather than terminating at extraction; and because every stage operates on the same evolving ecosystem map, provenance-linked governance can gate each modification to it. Table 1 substantiates these differences dimension by dimension. The present study contributes the architectural integration of methodology anchoring together with the lifecycle coverage and provenance-linked governance it entails, and a controlled characterisation of how generative agents perform across the resulting stage decomposition; it does not claim that this architecture has been shown to outperform any of the prior systems.

2.5. Research Gap

Across these research streams and the closest prior systems, a consistent pattern emerges. Multi-agent information extraction and knowledge graph construction systems have developed sophisticated architectural mechanisms, including schema-guided output, ontology alignment, and reflective error correction, yet in every case extraction constitutes the terminal objective. Agent responsibilities are partitioned according to technical pipeline stages rather than the stages of a domain-specific modelling methodology; upstream scoping and boundary specification are either absent or manual, and downstream integration into an evolving ecosystem map under human governance is not addressed. Conversely, agentic orchestration frameworks have demonstrated that modular agent collaboration improves robustness and task decomposition, yet their partitioning criteria remain technology-centric rather than methodology-anchored, and they do not enforce domain modelling constraints across the modelling lifecycle.

Existing systems address subsets of these requirements but do not jointly integrate them. Within the analysed literature, no system simultaneously decomposes agent responsibilities according to the stages of a domain-specific modelling methodology, spans the modelling lifecycle from boundary specification through controlled model editing, and maintains provenance-linked governance over an evolving ecosystem map. The present architecture addresses this gap for four of the five methodological stages, with the completeness verification stage remaining a direction for future computational automation as discussed in Section 3.3. Furthermore, there is limited empirical evidence characterising how generative agents perform across distinct modelling stages and where performance bottlenecks persist within such a lifecycle.

3. Modelling as a Methodology-Grounded Transformation Process

This section establishes the conceptual foundation by defining modelling as a structured transformation process and explicating the ontological commitments that any generative pipeline must respect.

3.1. Modelling as Structured Representation Construction

This study treats modelling as a cognitive activity in which selected aspects of reality are abstracted, categorised, and related according to an explicit or implicit schema. Under this framing, a model embodies ontological commitments concerning what types of entities exist, how they may relate, and which distinctions are considered meaningful. In structured domains such as business ecosystems, these commitments define the boundaries of interpretation and guide the classification of actors, roles, and interactions.

When modelling is performed manually, domain experts interpret textual and graphical evidence through the lens of the chosen methodology. They identify relevant entities, classify them according to predefined categories, infer relationships, and iteratively refine the structure until a coherent representation emerges [1]. This process is interpretive and requires consistent application of methodological rules across heterogeneous sources. The reliability of the resulting model depends on both the expertise of the modeller and the transparency of the transformation from evidence to structure.

Generative artificial intelligence introduces automation into this transformation. However, generative systems do not possess an intrinsic understanding of domain ontologies or modelling constraints, as their outputs are generated based on probabilistic patterns learned from training data [2]. Without explicit guidance, they may conflate actors with activities, misclassify artefacts as organisations, or invent relations that are linguistically plausible but structurally incorrect, as the empirical results in Section 6 confirm.

3.2. Ecosystem Modelling as a Structured Schema

Business ecosystem modelling provides a clear illustration of methodology-grounded modelling. An ecosystem map typically distinguishes between at least three core constructs, namely actors, roles, and interactions [10]. Table 2 formalises these constructs and their structural constraints.

These constructs are not interchangeable. An actor is not equivalent to a role, and an interaction requires more than co-occurrence of names within a document. Each construct must satisfy the definitional criteria in Table 2. The modelling schema defines which categories are permitted, how elements may connect, and what constitutes valid structure.

In manual practice, the modeller applies this schema iteratively, determining which textual evidence qualifies as actors, roles, or interactions through contextual judgement.

When generative agents perform extraction, the schema must be encoded in prompts, constraints, or validation steps. The agent must classify entities according to the modelling ontology, not merely identify them. Integration of new elements must respect representational consistency; orphan entities, such as roles without associated actors or interactions referencing non-existent participants, are flagged during editorial review. These expectations transform modelling from free-text extraction into an ontology-constrained transformation process.
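The orphan check described above can be illustrated with a small sketch. The field names, the assumption that each role references exactly one actor, and the assumption that each interaction connects a source role to a target role are illustrative choices; the authoritative structural constraints are those of Table 2, which this example does not reproduce.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Actor:
    actor_id: str
    label: str


@dataclass(frozen=True)
class Role:
    role_id: str
    label: str
    actor_id: str  # assumed: the actor that performs this role


@dataclass(frozen=True)
class Interaction:
    interaction_id: str
    label: str
    source_role_id: str  # assumed: interactions are directed between roles
    target_role_id: str


@dataclass
class EcosystemMap:
    actors: list[Actor] = field(default_factory=list)
    roles: list[Role] = field(default_factory=list)
    interactions: list[Interaction] = field(default_factory=list)


def find_orphans(ecosystem: EcosystemMap) -> list[str]:
    """Return flags for elements that reference participants missing from the map."""
    actor_ids = {a.actor_id for a in ecosystem.actors}
    role_ids = {r.role_id for r in ecosystem.roles}
    flags = []
    for role in ecosystem.roles:
        if role.actor_id not in actor_ids:
            flags.append(f"role {role.role_id}: unknown actor {role.actor_id}")
    for inter in ecosystem.interactions:
        for end in (inter.source_role_id, inter.target_role_id):
            if end not in role_ids:
                flags.append(f"interaction {inter.interaction_id}: unknown role {end}")
    return flags
```

The function only reports inconsistencies and never repairs them, which is in keeping with the text: orphan entities are flagged for editorial review rather than resolved automatically.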

3.3. Transformation Stages in Methodology-Grounded Modelling

The reference methodology [1,10] structures ecosystem analysis as a five-stage process. The first stage establishes the ecosystem boundary, delineating scope along thematic, geographic, and institutional dimensions. The second and third stages execute in parallel. Actor identification determines which organisational entities participate in the ecosystem, while role and value proposition identification determines what functions each actor performs and what value it contributes. The fourth stage maps interactions between participants, specifying the type, direction, and content of exchanges. The fifth stage verifies completeness through Minimum Viable Ecosystem design and value-flow tracing, returning to earlier stages when gaps are detected. Throughout this process, the analyst interleaves searching for relevant documents, interpreting their content, and constructing the ecosystem map.

When a human analyst performs this methodology, source discovery, document reading, and model construction occur fluidly within each stage. A computational pipeline does not replicate this fluidity and must instead decompose the analytical workflow into discrete, automatable transformation steps.

To embed the methodology within a generative pipeline, the present study identifies five computational stages that collectively operationalise the methodological lifecycle. Figure 1 depicts this transformation as a left-to-right pipeline. On the left, unstructured evidence, comprising policy documents, academic literature, and grey literature, enters the central transformation process. Within this process, the five stages proceed sequentially. Boundary specification translates the first-stage scoping decisions into a machine-readable task definition. Source discovery automates the document search that the manual analyst performs throughout Stages 2 through 4. Document conversion transforms retrieved sources into structured text through layout detection, table parsing, and image extraction, a step that the manual analyst performs implicitly through reading. Semantic extraction identifies candidate actors, roles, and interactions according to the modelling ontology, corresponding to the analytical core of Stages 2 through 4. Controlled integration assembles extracted elements into the ecosystem map under methodological constraints and human review, serving the verification and completeness functions of Stage 5.

Two structural features distinguish this pipeline from a simple sequential chain. First, a dashed iteration path connects controlled integration back to source discovery, reflecting the iterative character of the reference methodology. Second, a constraints arrow from the modelling methodology indicates that admissible outputs at each stage are bounded by the ontological schema rather than left to open-ended generation. In the current implementation, the methodology is operationalised primarily through the ontological schema constraining extraction outputs and the staged proposal mechanism enforcing human review. The completeness verification logic of the fifth stage, including Minimum Viable Ecosystem design and value-flow tracing, is not yet computationally automated and remains a direction for future development.

3.4. Human Oversight and Representational Accountability

Accountability is integral to methodology-grounded modelling. As Figure 1 shows, human oversight is positioned alongside the final integration stage, where a dashed gating arrow indicates that proposed modifications require explicit approval before incorporation into the ecosystem map. The modeller must be able to justify why a particular actor, role, or interaction is included and trace how it was derived from evidence. In generative modelling pipelines, this requirement translates into two architectural obligations. First, each extracted element must be linked to its source references so that classification decisions can be independently verified, corresponding to the provenance traces shown at the bottom of the output structure in Figure 1. Second, generative outputs must be treated as proposals subject to human review rather than as authoritative updates, preserving the modeller’s interpretive authority over the evolving ecosystem map. Section 4.3 details how the proposed architecture operationalises these obligations.
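The two obligations can be sketched as a small staging structure in which a proposal without provenance cannot be queued and nothing reaches the map without a review decision. The class names, the `source_refs` field, and the three status values are hypothetical and serve only to illustrate the gating idea; they do not describe the implementation detailed in Section 4.3.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Proposal:
    element_type: str       # "actor", "role", or "interaction"
    label: str
    source_refs: list[str]  # provenance: identifiers of the supporting sources
    status: Status = Status.PENDING


@dataclass
class StagedMap:
    elements: list[Proposal] = field(default_factory=list)  # approved content
    queue: list[Proposal] = field(default_factory=list)     # awaiting review

    def stage(self, proposal: Proposal) -> None:
        """Queue a proposal; reject it outright if it has no provenance link."""
        if not proposal.source_refs:
            raise ValueError("proposal has no provenance link")
        self.queue.append(proposal)

    def review(self, proposal: Proposal, approve: bool) -> None:
        """Apply the human decision; only approved proposals enter the map."""
        self.queue.remove(proposal)
        if approve:
            proposal.status = Status.APPROVED
            self.elements.append(proposal)
        else:
            proposal.status = Status.REJECTED
```

In this sketch the generative agent can only call `stage`, while `review` represents the human decision, so the map changes exclusively through explicit approval.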

4. Agentic Generative AI Architecture for Modelling

4.1. Design Principles

The architecture is designed to operationalise the staged transformation process outlined in Section 3.3. Its organising principle is methodology anchoring, corresponding to the constraints arrow in Figure 1 through which the modelling methodology bounds admissible outputs at each stage. All generative activities are constrained by the explicit modelling schema defined in Table 2, which specifies valid entity types and their admissible relationships. Because no single generative invocation can enforce a multi-stage schema with sufficient consistency, methodology anchoring necessitates staged decomposition, in which the modelling lifecycle is partitioned into discrete transformation stages aligned with the conceptual framework [27]. Each stage is assigned to a specialised agent responsible for a clearly delimited analytical task, reducing ambiguity and improving transparency. Staged outputs in turn require human oversight and gating, represented in Figure 1 by the dashed gating arrow that interposes human review before integration proceeds. Generative outputs are therefore presented as change proposals that require explicit human approval before incorporation into the model, preserving accountability and mitigating the risk of propagating erroneous or hallucinated content. Finally, informed gating decisions demand provenance and auditability. Every modelling element maintains explicit links to source references and to the analytical trace that produced it, embedding verifiability within the modelling process. The architecture enforces methodological constraints across extraction and integration stages but does not automate completeness verification or guarantee full consistency of relational structures without human intervention.

4.2. Multi-Agent Decomposition Aligned with Modelling Stages

The architecture implements the modelling lifecycle through a coordinated set of specialised agents under a central orchestrator. Each agent corresponds to a transformation stage identified in the conceptual framework.

The orchestrator agent occupies the central coordinating role. It decomposes high-level modelling objectives into structured subtasks, assigns these to appropriate agents, aggregates intermediate outputs, and ensures that each stage satisfies the required preconditions before progression. The orchestrator maintains an internal representation of the current modelling state, including the specified boundary, accumulated sources, and provisional map elements. It determines which transformation stages to invoke, in what sequence, and with what contextual parameters.
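A minimal sketch of such precondition checking is given below, assuming the five computational stages of Section 3.3 and a simple state record. The stage names, state fields, and precondition rules are illustrative assumptions rather than the orchestrator's actual logic.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumed stage order, following the five computational stages of Section 3.3.
STAGES = ["boundary", "discovery", "conversion", "extraction", "integration"]


@dataclass
class ModellingState:
    boundary: Optional[dict] = None                      # specified boundary
    sources: list[str] = field(default_factory=list)     # accumulated sources
    converted: list[str] = field(default_factory=list)   # structured text
    proposals: list[dict] = field(default_factory=list)  # provisional elements


# Hypothetical preconditions: each stage needs the output of the previous one.
PRECONDITIONS = {
    "boundary": lambda s: True,
    "discovery": lambda s: s.boundary is not None,
    "conversion": lambda s: bool(s.sources),
    "extraction": lambda s: bool(s.converted),
    "integration": lambda s: bool(s.proposals),
}


def can_run(stage: str, state: ModellingState) -> bool:
    """Check whether the state satisfies the precondition of a stage."""
    return PRECONDITIONS[stage](state)


def first_blocked_stage(state: ModellingState) -> Optional[str]:
    """Return the earliest stage whose precondition is unmet, or None."""
    for stage in STAGES:
        if not can_run(stage, state):
            return stage
    return None
```

Expressing preconditions as data rather than control flow keeps the stage order inspectable, which suits an orchestrator that must decide which stage to invoke next from the current modelling state.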

The search agent is responsible for source discovery within the defined modelling boundary. Given thematic and contextual constraints, it retrieves relevant documents from web resources or predefined repositories [5,28]. Retrieved sources are scored for relevance against the modelling boundary through an independent assessment step, and only sources exceeding a configurable threshold are forwarded for analysis. This filtering step grounds the modelling process in external evidence and reduces reliance on latent knowledge embedded within the generative model. Source discovery operates iteratively rather than as a single retrieval pass. After initial filtering, the agent evaluates whether accumulated sources satisfy the modelling objective and, where gaps remain, refines search parameters before repeating the cycle. This continues until sufficiency criteria are met or a configurable maximum iteration count is reached.
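The iterative discovery cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's implementation: the `Source` type, the callables `retrieve`, `is_sufficient`, and `refine`, and the default threshold and iteration cap are hypothetical names and values introduced here for clarity.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Source:
    url: str
    relevance: float  # assumed to be a score in [0, 1] from the assessment step


def discover_sources(
    retrieve: Callable[[str], List[Source]],
    is_sufficient: Callable[[List[Source]], bool],
    refine: Callable[[str, List[Source]], str],
    query: str,
    threshold: float = 0.6,   # hypothetical default, configurable
    max_iterations: int = 3,  # hypothetical default, configurable
) -> List[Source]:
    """Retrieve, filter by relevance, and refine the query until sufficient."""
    accepted: List[Source] = []
    seen = set()
    for _ in range(max_iterations):
        for src in retrieve(query):
            # Forward only unseen sources that exceed the relevance threshold.
            if src.url not in seen and src.relevance > threshold:
                seen.add(src.url)
                accepted.append(src)
        if is_sufficient(accepted):
            break
        # Gaps remain: adjust the search parameters and repeat the cycle.
        query = refine(query, accepted)
    return accepted
```

The loop terminates either when the sufficiency check passes or when the iteration cap is reached, mirroring the two stopping conditions named in the text.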

The document analysis agent performs conversion and semantic extraction. It first converts retrieved documents into machine-readable structured text through layout detection, table parsing, and image extraction [29]. The agent then applies ontology-constrained prompting to extract candidate actors, roles, and interactions according to the modelling schema. Extraction constraints enforce structured output formats that mirror the ontological definitions in Table 2. Each proposed entity must include a type classification, a descriptive label, and source references. Where applicable, the agent distinguishes between entity types and explicitly associates proposed interactions with participating roles.

The extraction strategy adapts to document length. When a source falls within practical context limits, the agent processes it in a single pass, preserving cross-section relationships. For longer documents, the agent partitions content by structural boundaries, extracts entities from each segment independently, and deduplicates overlapping proposals. Each extracted entity carries a typed evidence classification distinguishing verbatim quotations from derived inferences and from visual descriptions, enabling downstream reviewers to assess the evidential basis of each proposal.
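A minimal sketch of this length-adaptive strategy is given below. It assumes that the document has already been split at structural boundaries into `sections`, that a character budget stands in for the model's context limit, and that two proposals are duplicates when their type and normalised label coincide. The function name, the `max_chars` default, and the deduplication key are illustrative assumptions.

```python
from typing import Callable, Dict, List

Entity = Dict[str, str]  # e.g. {"type": "actor", "label": "...", "evidence": "quote"}


def extract_adaptive(
    sections: List[str],
    extract: Callable[[str], List[Entity]],
    max_chars: int = 8000,  # hypothetical stand-in for a context limit
) -> List[Entity]:
    """Single pass for short documents, per-section extraction otherwise."""
    full_text = "\n\n".join(sections)
    if len(full_text) <= max_chars:
        # Short document: one pass preserves cross-section relationships.
        return extract(full_text)
    merged: Dict[tuple, Entity] = {}
    for section in sections:
        for entity in extract(section):
            # Assumed duplicate key: entity type plus case-folded label.
            key = (entity["type"], entity["label"].strip().lower())
            merged.setdefault(key, entity)  # keep the first occurrence
    return list(merged.values())
```

Keeping the first occurrence is one of several possible merge policies; a production system might instead merge evidence lists from all duplicates.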

The constraint enforcement mechanism operates at two levels. At the output level, each agent invocation specifies a JSON schema that mirrors the ontological definitions in Table 2. The generative model is constrained to produce output conforming to this schema, and responses that fail schema validation are rejected and regenerated up to a configurable retry limit. At the semantic level, the methodology expert agent performs post-extraction validation, verifying that proposed entities satisfy definitional criteria and that interactions reference existing actors within the ecosystem map. Proposals that pass schema validation but violate ontological rules, such as interactions referencing actors not yet present in the map or role assignments without a parent entity, are flagged for revision rather than silently discarded. This two-level enforcement separates structural correctness from semantic validity, allowing each to be assessed through independent mechanisms.
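The two enforcement levels can be illustrated with the following sketch. It uses hand-written structural checks in place of a full JSON Schema validator, and the field names (`type`, `label`, `sources`, `participants`, `parent`) are assumptions chosen for illustration rather than the schema of Table 2.

```python
import json
from typing import Callable, List, Set, Tuple

ALLOWED_TYPES = {"actor", "role", "interaction"}
REQUIRED_KEYS = {"type", "label", "sources"}  # assumed minimal field set


def is_structurally_valid(raw: str) -> bool:
    """Output-level check: parseable JSON list of well-formed entities."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(items, list):
        return False
    return all(
        isinstance(item, dict)
        and REQUIRED_KEYS.issubset(item)
        and isinstance(item["type"], str)
        and item["type"] in ALLOWED_TYPES
        for item in items
    )


def generate_with_retries(generate: Callable[[], str], max_attempts: int = 3) -> List[dict]:
    """Reject and regenerate until the output passes the structural check."""
    for _ in range(max_attempts):
        raw = generate()
        if is_structurally_valid(raw):
            return json.loads(raw)
    raise ValueError("no schema-conforming output within the attempt limit")


def semantic_check(proposals: List[dict], map_actors: Set[str]) -> Tuple[List[dict], List[dict]]:
    """Semantic-level check: flag rule violations instead of discarding them."""
    accepted, flagged = [], []
    for p in proposals:
        if p["type"] == "interaction":
            participants = set(p.get("participants", []))
            ok = bool(participants) and participants <= map_actors
        elif p["type"] == "role":
            ok = p.get("parent") in map_actors
        else:
            ok = True
        (accepted if ok else flagged).append(p)
    return accepted, flagged
```

Separating the two functions reflects the design point made above: structural failures trigger regeneration, whereas semantic violations are surfaced for revision.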

Inter-agent communication is mediated by the orchestrator rather than performed agent to agent. Each agent exposes a typed input and output contract whose payload shape is defined by the same JSON schema family that governs the ontology in Table 2, so that the boundary form produced by the methodology expert, the source set produced by the search agent, the extraction proposals produced by the document analyser, and the change operations produced by the editor are all encoded as schema-bound typed objects rather than free text. When the orchestrator forwards a payload from one stage to the next, it validates the payload against the receiving agent’s expected input contract and carries the source identifiers attached to each element forward into the receiving agent’s context, preserving provenance across stages. Ontological constraints are checked both within each agent’s own schema-bound output and again when changes are committed to the map.

Source documents frequently contain both textual and graphical content, and the architecture treats these modalities through distinct processing paths within the document analysis agent. Textual content undergoes layout-aware parsing that preserves document structure, including headings, paragraphs, and tabular data [29]. Graphical content, including organisational diagrams, process flowcharts, and ecosystem visualisations, is extracted as image segments and processed through multimodal generative models capable of interpreting visual representations. The extracted entities from both modalities are unified into a common ontological format before forwarding to subsequent pipeline stages. This separation reflects distinct processing requirements, as textual content permits token-level parsing whereas embedded diagrams necessitate visual interpretation through multimodal models, and conflating both modalities within a single extraction pathway would obscure errors whose diagnostic origins differ.

The methodology expert agent functions as a validation and refinement component. Boundary specification operates through an iterative conversational cycle in which the orchestrator relays user instructions to the methodology expert, which produces a structured boundary definition and generates a follow-up question targeting the largest remaining gap in scope or purpose. This cycle continues until the boundary is sufficiently defined for extraction, enabling progressive refinement through incremental human input.

Operating in a two-phase pipeline, the methodology expert agent first produces a structured boundary definition form from user instructions and reference documents, then self-evaluates its output against methodological criteria. Self-evaluation applies five LLM-judged quality dimensions, namely completeness of scope coverage, alignment with the stated modelling purpose, structural consistency with the ontological schema, relevance to the user-specified boundary, and adherence to formatting and token budget constraints.

Alongside these dimensions, the agent performs deterministic checks including verification that scope-defining keywords from user instructions appear in the generated boundary form and that the output length falls within configurable token limits. It evaluates extracted elements against modelling rules, identifies potential classification inconsistencies, and flags proposed additions that conflict with ontological constraints. The effectiveness of this two-phase structure in producing consistent boundary specifications across the model space is evaluated in Section 6. This agent reinforces the primacy of the modelling schema within the generative workflow.

The editor agent is responsible for integrating validated proposals into the structured ecosystem map. It does not commit changes to the ecosystem map autonomously. Instead, it generates structured change proposals that summarise the proposed addition or modification, reference supporting evidence, and indicate structural implications. Each proposal specifies a single ontological operation, whether the addition of an actor, the assignment of a role, the creation of an interaction, or the modification of an existing element. Each operation includes the evidence chain linking the proposal to source material. These proposals are presented to the human modeller for review. Presentation is preceded by an internal evaluation cycle in which the editor assesses whether the original instruction has been satisfied and reviews proposal quality against ontological constraints. Where the evaluation identifies residual gaps or structural inconsistencies, the agent revises its proposals before finalising them for presentation.

The orchestrator enforces sequencing, ensuring that document analysis does not occur without prior boundary specification and source retrieval, and that integration does not proceed without validation. Each agent role can be assigned a different generative model, enabling task-aligned model configuration. This design choice reflects the expectation that distinct analytical responsibilities, from boundary specification to relational extraction, may favour models with different capability profiles. While this staged decomposition is designed to improve transparency and modularity, its causal contribution to performance is interpreted as a structural design hypothesis supported by empirical observations rather than as a demonstrated causal mechanism, in the absence of an architectural ablation study. The resulting architecture is depicted in Figure 2.

4.3. Controlled Model Editing and Governance Mechanisms

The architecture interposes a proposal layer between agent outputs and the ecosystem map, operationalising the governance obligations defined in Section 3.4. Each candidate element includes a description, ontological classification, and source references. The human modeller may accept, reject, or request revision of each proposal; rejected proposals are retained in a decision record for retrospective analysis.
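The proposal layer can be sketched as a small data structure in which nothing reaches the map without an explicit decision. The class and field names are hypothetical, and the sketch records every decision, which is a superset of the rejected-proposal record described above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    REVISE = "revise"


@dataclass
class Proposal:
    element_id: str
    classification: str   # ontological type, e.g. "actor"
    description: str
    sources: List[str]    # provenance links to source material


@dataclass
class EcosystemMap:
    elements: Dict[str, Proposal] = field(default_factory=dict)
    decision_record: List[Tuple[str, Decision]] = field(default_factory=list)
    revision_queue: List[Proposal] = field(default_factory=list)

    def review(self, proposal: Proposal, decision: Decision) -> None:
        """Apply a human decision; only accepted proposals alter the map."""
        self.decision_record.append((proposal.element_id, decision))
        if decision is Decision.ACCEPT:
            self.elements[proposal.element_id] = proposal
        elif decision is Decision.REVISE:
            self.revision_queue.append(proposal)
        # Rejected proposals remain only in the decision record.
```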

In parallel, a provenance layer maintains links from each map element to its documentary sources and the sequence of analytical steps that led to its inclusion. This dual audit structure supports verifiability and documentary coverage assessment. Interpretive authority remains with the human modeller, bearing most directly on interaction extraction where relational constructs require contextual interpretation that generative agents cannot be assumed to resolve independently.

4.4. Separation of Planning and Execution

An additional architectural feature concerns the separation between planning and execution within the orchestrator. Complex modelling tasks often require multi-step reasoning, including iterative retrieval, extraction, and validation. To manage this complexity, the orchestrator distinguishes between a planning phase, in which the sequence of actions is defined, and an execution phase, in which agents are invoked according to the plan [8].

This separation improves robustness and transparency. The plan can be inspected and adjusted prior to execution, and intermediate results can inform subsequent steps. This separation also permits independent assessment of orchestrator quality, since sub-agent outputs can be held constant while varying the orchestrator model. The architectural hypothesis is that high-quality planning can compensate for moderate sub-agent capability and that orchestrator success may depend more on the capacity to maintain coherent multi-step plans than on raw generation quality, though the empirical results reported in Section 6 characterise orchestration performance without isolating these factors.

Concretely, the orchestrator generates plans as ordered sequences of agent invocations annotated with input dependencies and expected output types. During execution, the orchestrator monitors intermediate outputs and may revise the remaining sequence when deviations such as failed retrievals or extraction errors invalidate subsequent preconditions. This dynamic replanning capability distinguishes the architecture from static pipeline designs and accommodates the inherent variability of generative outputs.
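A plan of this kind, together with precondition checks and bounded replanning, can be sketched as below. The `Step` structure, the convention that an agent returns `None` on failure, and the `replan` callback are assumptions made for illustration and are not taken from the implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Step:
    agent: str               # which agent to invoke
    needs: Tuple[str, ...]   # input dependencies (keys in the shared state)
    produces: str            # key under which the output is stored


Agent = Callable[[Dict[str, Any]], Optional[Any]]  # None signals failure (assumed)


def execute(
    plan: List[Step],
    agents: Dict[str, Agent],
    replan: Callable[[Step, List[Step], Dict[str, Any]], List[Step]],
    max_replans: int = 2,  # hypothetical cap
) -> Dict[str, Any]:
    """Run the plan in order, revising the remaining steps after a failure."""
    state: Dict[str, Any] = {}
    queue = list(plan)
    replans = 0
    while queue:
        step = queue.pop(0)
        missing = [name for name in step.needs if name not in state]
        output = None if missing else agents[step.agent](state)
        if output is None:
            if replans >= max_replans:
                raise RuntimeError(f"step {step.agent!r} failed after replanning")
            replans += 1
            queue = replan(step, queue, state)  # revise the remaining sequence
            continue
        state[step.produces] = output
    return state
```

Because the plan is plain data, it can be inspected or edited before `execute` is called, which is the transparency benefit attributed to the planning and execution split.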

5. Evaluation Framework for Generative Modelling Pipelines

5.1. Challenges in Evaluating Semantic Modelling Outputs

Evaluating GAI within modelling processes presents methodological difficulties that differ from conventional system evaluation [11]. Generative systems produce probabilistic and semantically variable outputs in which two independently generated extractions may be substantively equivalent yet differ in wording. In modelling contexts, correctness further encompasses classification accuracy within the modelling schema, relational structure consistency, and alignment with documentary sources, dimensions that token-level metrics fail to capture. Moreover, multi-stage pipelines distribute reliability across interdependent stages, and evaluating only the final artefact obscures where failures originate.

5.2. Hybrid Evaluation Design

The hybrid evaluation framework integrates three complementary components, namely operational reliability metrics, semantic assessment using an LLM-based judge, and human validation for calibration.

Operational metrics quantify the reliability of the pipeline at each stage. These include stage completion rates, frequency of extraction failures, error propagation between agents, and overall pipeline success rates. Such metrics provide insight into robustness and reproducibility. They enable identification of bottlenecks and systematic failure modes within the architecture.

Semantic assessment addresses the correctness of extracted modelling elements relative to reference expectations. Given the limitations of exact-match metrics, an LLM-based judge evaluates whether proposed actors, roles, and interactions are semantically aligned with reference sets [11]. The judge assesses equivalence, partial alignment, or misclassification based on structured comparison criteria derived from the modelling ontology.

However, reliance on a generative model to evaluate generative outputs introduces the risk of shared bias or correlated error [23]. To mitigate this risk, a subset of evaluation cases is independently assessed by human experts. Agreement between the LLM-based judge and human judgement is analysed using Cohen’s kappa coefficient [30] to calibrate the reliability of automated semantic evaluation. The convergence of these three assessment pillars is depicted in Figure 3.
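For reference, unweighted Cohen's kappa compares observed agreement with the agreement expected by chance from each rater's label frequencies. The sketch below implements that standard definition; the example labels are invented for illustration and are not the study's annotation data, and the weighted variant used for ordinal ratings is not shown.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_expected == 1.0:
        return 1.0  # degenerate case: both raters used one identical label
    return (p_observed - p_expected) / (1.0 - p_expected)


# Illustrative labels only (1 = match, 0 = no match).
judge = [1, 1, 0, 0, 1, 0]
human = [1, 1, 0, 1, 1, 0]
kappa = cohens_kappa(judge, human)  # observed 5/6, expected 1/2
```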

5.3. Experimental Setup

The evaluation is designed as a controlled experimental study intended to characterise performance across modelling tasks rather than to establish a generalisable benchmark across domains or architectures. It is conducted across five test cases, each targeting a distinct agent within the multi-agent architecture. All cases operate within the domain of business ecosystem modelling for the Danish energy sector, using documents, policy reports, and diagrams related to offshore wind and energy market participants. The modelling ontology of actors, roles, and interactions remains consistent across cases to enable comparative analysis. Table 3 summarises the design of each test case.

A total of 34 generative models spanning eight providers are evaluated, yielding over 4300 experimental runs across baseline and hyperparameter sensitivity configurations. The baseline configurations shown in Table 3 account for 2664 runs; an additional 1718 hyperparameter sensitivity runs across representative model subsets, testing temperature and top-p variations for TC-G, TC-D, TC-S, and TC-M, bring the total to 4382. Each model configuration is evaluated across a minimum of six independent generations per test case to account for non-determinism, with additional runs for selected models to narrow confidence intervals. Distributional statistics are reported in place of point estimates. Sub-agents are held fixed to isolate the contribution of the component under test; specifically, GPT-4o-mini serves as the fixed sub-agent model across all isolation experiments, providing a known moderate-quality baseline against which orchestrator capability can be measured.

Ground-truth reference sets were constructed by the author through manual analysis of the source documents according to the modelling ontology defined in Table 2. This single-author construction enables consistent annotation standards across all test cases but limits external validity, as the reference sets reflect one analyst’s interpretive judgements. The evaluation should therefore be interpreted as a controlled assessment of system behaviour within a single domain rather than as an inter-subjective benchmark. Replication with multiple annotators constitutes a necessary step for establishing external reliability. For TC-D, the test document was purpose-built to contain a known set of 12 actors, 9 roles, and 21 interactions distributed across prose text and an embedded machine-rendered diagram, enabling controlled decomposition by entity type and source modality. For TC-O and TC-G, reference actors and entities were derived from publicly available documents within the Danish energy ecosystem domain. All reference sets were defined prior to model evaluation and held constant across runs.

The evaluation characterises a performance profile across modelling tasks rather than a comparative benchmark against alternative architectures. A full architectural ablation that progressively removes individual agents or merges pipeline stages would further isolate the contribution of multi-agent decomposition but lies beyond the scope of this study and constitutes a direction for future work. The evaluation therefore focuses on characterising absolute performance levels, relative variation across agent roles and model configurations, and the contribution of architectural components such as ontological constraints and staged review to output quality.

5.4. Scoring Procedures

Because generative outputs exhibit semantic variability, exact-match metrics are insufficient. A fixed evaluator model, GPT-4o-mini, serves as an LLM-based judge, comparing extracted elements against ground-truth definitions through semantic matching. GPT-4o-mini was selected for its combination of low inference cost and sufficient reasoning capability for pairwise semantic comparison, enabling consistent evaluation across over 4300 runs. Because GPT-4o-mini is also one of the 34 evaluated models, a potential conflict of interest arises in which the judge might systematically favour or penalise outputs resembling its own generation patterns. Two observations mitigate this concern. First, GPT-4o-mini ranks last among all evaluated models on TC-D extraction, indicating that if any bias exists it does not manifest as inflated self-evaluation. Second, the human agreement validation for TC-G and TC-M provides independent calibration of judge reliability, with Cohen’s $\kappa$ values of 0.942 and 0.847 respectively confirming that judge assessments align with human ratings regardless of which model produced the evaluated output. Conceptually equivalent outputs receive credit regardless of surface phrasing. Each test case employs a scoring procedure tailored to its task structure.

For TC-O, standard precision, recall, and F1 are computed against the six reference actors. For TC-G, entity completion is measured as the fraction of 18 ground-truth entities semantically matched by the judge, and reference integration counts the number of six provided sources cited in the output. For TC-D, component-ratio scores are computed for each entity type, using reference totals of twelve actors, nine roles, and twenty-one interactions.

$$\text{Actor} = \frac{\text{matched}}{12}, \quad \text{Role} = \frac{\text{matched}}{9}, \quad \text{Interaction} = \frac{\text{matched}}{21} \tag{1}$$

An attribution score captures provenance quality through six binary checks per entity. The total score averages across the four components. TEXT and IMAGE sub-scores each average actor, role, and interaction ratios equally rather than weighting by the number of items per component. TEXT and IMAGE sub-scores decompose performance by source modality, where text-sourced entities represent 19 of 42 ground-truth items and image-sourced entities the remaining 23, enabling analysis of the modality gap. Notably, the test document uses a clean, machine-rendered diagram. Real-world documents with handwritten or photographed diagrams would likely yield lower scores.
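The TC-O and TC-D scoring rules can be made concrete with a short sketch. Two assumptions are labelled in the code: set intersection stands in for the judge's semantic matching, and the four components averaged into the TC-D total are taken to be the actor, role, and interaction ratios plus the attribution score. The input values in the usage lines are invented.

```python
from typing import Dict, Set, Tuple

TCD_REFERENCE = {"actor": 12, "role": 9, "interaction": 21}  # totals from the text


def precision_recall_f1(predicted: Set[str], reference: Set[str]) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1 (exact set match replaces the judge)."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def tcd_total(matched: Dict[str, int], attribution: float) -> float:
    """Equal-weight mean of the three component ratios and attribution.

    Assumption: these are the four components meant by the text.
    """
    ratios = [matched[k] / TCD_REFERENCE[k] for k in ("actor", "role", "interaction")]
    return (sum(ratios) + attribution) / 4


# Illustrative inputs only.
p, r, f1 = precision_recall_f1({"A", "B", "C", "D", "X"}, {"A", "B", "C", "D", "E", "F"})
total = tcd_total({"actor": 9, "role": 9, "interaction": 7}, attribution=0.5)
```

Under equal weighting, the 21 interactions carry the same weight in the total as the 9 roles, so weak interaction extraction lowers the total less than an item-weighted average would.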

For TC-M, a multiplicative formula combines deterministic structural checks with a weighted quality assessment. Let $d_{i} \in \{0, 1\}$ for $i = 1, \dots, 8$ denote eight binary structural criteria, including scope keyword presence, score consistency, and token budget compliance, and let $q_{j}$ denote the LLM-judged quality score for dimension $j$. The TC-M score is computed as

$$S_{\text{TC-M}} = \left(\frac{1}{8} \sum_{i = 1}^{8} d_{i}\right) \times \sum_{j = 1}^{5} w_{j} q_{j} \tag{2}$$

where the quality dimension weights $w_{j}$ are pipeline coherence at 0.20, boundary completeness at 0.25, summary fidelity at 0.25, downstream usability at 0.20, and score calibration at 0.10. These weights were assigned based on downstream task relevance, prioritising boundary completeness and summary fidelity as the dimensions most directly influencing extraction quality. Structurally invalid outputs are penalised through the first factor regardless of judge ratings.
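The TC-M score can be computed directly from the eight structural checks and the five weighted quality scores. The sketch below assumes that each quality score lies in [0, 1], which the text does not state; the dictionary keys are illustrative names for the five dimensions.

```python
from typing import Dict, Sequence

# Weights as stated in the text; they sum to 1.00.
WEIGHTS = {
    "pipeline_coherence": 0.20,
    "boundary_completeness": 0.25,
    "summary_fidelity": 0.25,
    "downstream_usability": 0.20,
    "score_calibration": 0.10,
}


def tcm_score(structural_checks: Sequence[bool], quality: Dict[str, float]) -> float:
    """Fraction of structural checks passed, times the weighted quality sum."""
    if len(structural_checks) != 8:
        raise ValueError("expected exactly eight binary structural criteria")
    structural = sum(bool(d) for d in structural_checks) / 8
    weighted_quality = sum(WEIGHTS[name] * quality[name] for name in WEIGHTS)
    return structural * weighted_quality


# Illustrative inputs: seven of eight checks pass, every dimension rated 0.8.
example = tcm_score([True] * 7 + [False], {name: 0.8 for name in WEIGHTS})
```

Because the two factors are multiplied, an output that fails every structural check scores zero whatever the judge's ratings, which is the penalisation property noted above.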

For TC-S, a binary pass criterion determines whether the relevant page is included and the irrelevant page excluded. This design represents a minimal operational validation of retrieval filtering and does not evaluate discriminative performance under partial topical overlap.

Human agreement validation calibrates the reliability of automated semantic assessment. For entity generation judgements, Cohen’s $\kappa = 0.942$ with 95% bootstrap CI [0.899, 0.978] and 269 of 277 concordant decisions indicates almost perfect agreement [31]. For methodology expert assessments, weighted $\kappa_{w} = 0.847$ with 95% bootstrap CI [0.754, 0.914] across 125 assessments indicates substantial agreement [24,30]. Per-dimension $\kappa_{w}$ ranges from 0.750 for boundary completeness, the strictest dimension, to 0.891 for pipeline coherence, with summary fidelity and downstream usability between these endpoints. The author performed all human validation independently.

Three further qualifications bear on judge reliability. First, calibrated human agreement covers TC-G and TC-M only; TC-O, TC-D, and TC-S are not independently validated, and their scores rely on judge consistency and on the objective ground-truth counts embedded in their scoring procedures, namely six reference actors for TC-O, fixed entity totals for TC-D, and binary inclusion or exclusion for TC-S. Second, three residual bias modes are recognised, namely self-preference, in which the judge favours outputs resembling its own generation patterns; surface-form leniency, in which the judge accepts paraphrases that drift from the modelling ontology; and structural overcrediting, in which the judge counts a near-match as a full match. The TC-D scoring procedure partially shields against self-preference by anchoring scores to counts against a fixed reference set rather than to judge-only adjudication, though the judge still decides whether each extracted element matches a reference entry; the TC-S binary criterion eliminates the surface-form channel. Third, the observed ranking of GPT-4o-mini at the bottom of the TC-D extraction distribution is inconsistent with a self-preference bias of substantive magnitude, although it does not rule out smaller systematic effects. Expanding human calibration to TC-O, TC-D, and TC-S remains a planned direction for future work; the present results should be read in light of this asymmetry.

6. Empirical Results

The operational evaluation indicates stable upstream execution within the present pipeline configuration. Boundary specification, source retrieval, and document conversion complete reliably in all test cases that reach downstream extraction, with only isolated instances of formatting incompatibilities or incomplete text extraction. These upstream stages are evaluated implicitly through their effect on downstream task scores rather than through dedicated test cases.

Headline aggregate scores are reported with 95% confidence intervals in Table 4 and Appendix A Table A5, obtained from a hierarchical bootstrap (10,000 resamples) that resamples the evaluated models and, within each resampled model, its individual runs. The intervals therefore incorporate both between-model variability and per-run sampling variability around the per-model-mean point estimate, and are computed directly from the released per-run records. The human agreement coefficients reported in Section 5.4 are accompanied by 95% confidence intervals obtained from a percentile bootstrap over the annotation pairs.
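A two-level resampling scheme of this kind can be sketched as follows. The sketch assumes a percentile interval and a mean-of-model-means statistic; the function name, the seed handling, and the data layout are illustrative and may differ from the released analysis code.

```python
import random
from statistics import mean
from typing import Dict, List, Tuple


def hierarchical_bootstrap_ci(
    runs_by_model: Dict[str, List[float]],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for the mean of per-model means (assumed statistic)."""
    rng = random.Random(seed)
    models = list(runs_by_model)
    estimates = []
    for _ in range(n_resamples):
        # Level 1: resample models with replacement.
        sampled_models = rng.choices(models, k=len(models))
        # Level 2: within each sampled model, resample its runs with replacement.
        per_model_means = [
            mean(rng.choices(runs_by_model[m], k=len(runs_by_model[m])))
            for m in sampled_models
        ]
        estimates.append(mean(per_model_means))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Resampling at both levels is what lets the interval reflect between-model and per-run variability together; resampling runs alone would understate the spread when models differ widely.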

A further pattern concerns provenance quality. Despite explicit architectural mechanisms for source attribution as described in Section 4.3, the mean attribution score of 0.499 across 23 models is substantially lower than entity extraction performance, indicating that evidential traceability constitutes a distinct challenge that is not resolved by improvements in extraction accuracy alone.

Semantic extraction and integration stages present greater variability. Actor and role extraction tasks are completed successfully in more than 90% of runs, with limited interruption due to malformed structured outputs or classification ambiguities. In contrast, interaction extraction generates a higher frequency of partial failures, with format-related failure rates between 12% and 30% depending on model and test case. These failures include incomplete identification of participating actors, inconsistent directionality specification, and conflation of interaction types.

6.1. Document Extraction Performance

The per-model extraction scores in Table 5 span twelve representative models covering the full performance range. The colour gradient reveals a three-tier structure. The top tier, from Claude Opus 4.5 through Gemini 2.5 Flash, maintains green shading across actor and role columns with scores consistently above 0.89. A mid-tier cluster, from GPT-5-mini through Ministral 8B, shows the first yellow cells as actor and interaction scores decline. Below the mid-tier, scores transition sharply to orange and red, with GPT-4o-mini achieving only 0.449 on actors and 0.164 on interactions. Across all tiers, text-sourced extraction remains at or above 0.775, while the IMAGE column exhibits the widest colour variation, ranging from 0.716 to 0.052. The mean row confirms that role extraction at 0.846 (95% CI [0.80, 0.89]) and actor extraction at 0.786 are substantially more reliable than interaction extraction at 0.431 (95% CI [0.38, 0.48]); the role and interaction intervals do not overlap.

Misclassifications primarily occur in cases where documents describe hybrid constructs, such as initiatives or programmes that function both as coordinating bodies and as funding instruments. In such cases, generative agents occasionally classify artefacts or policy instruments as actors. These errors indicate that ontological ambiguity in source material constitutes a systematic challenge for extraction pipelines, as resolving such ambiguity is precisely the interpretive function that modelling demands.

In contrast to entity extraction, interaction identification exhibits lower semantic alignment with reference models. The mean interaction score across all models and configurations is 0.431. Interactions require the model to infer structured relationships between actors, often based on implicit statements or distributed references across documents. Generative agents demonstrate difficulty in consistently distinguishing between the five interaction types defined by the modelling ontology, namely monetary value, intangible value, goods, information, and data exchange.

Two recurrent error types are observed. First, under-specification, where the agent identifies the existence of a relationship but fails to characterise its type or direction accurately. Second, over-generalisation, where broad statements of collaboration are interpreted as concrete structural interactions without sufficient evidential grounding. In the entity generation task TC-G, interaction counts exhibit the widest variance of any entity type, ranging from 0.5 to 8.0 across model configurations against a reference of eight interactions.

Where source materials include graphical ecosystem representations or schematic diagrams, multimodal interpretation introduces additional complexity. Generative models capable of processing both text and images demonstrate partial capacity to identify actors and high-level connections within diagrams. However, precise interpretation of edge semantics, directional arrows, and interaction types remains inconsistent. The mean image-based extraction score is 0.470 across the 23 models in the documented TC-D extraction panel, within which text-only models score near zero on the image-sourced items.

Performance degradation in multimodal tasks is more pronounced than in purely textual interaction extraction. Text-sourced interactions achieve a mean score of 0.852, whereas image-sourced interactions score only 0.172, representing a 68-percentage-point modality gap. Because image-sourced entities constitute 23 of the 42 ground-truth items in the TC-D test design, variation in image extraction mechanically influences the total score, with a Pearson correlation of $r = 0.96$ ($n = 23$ models, $p < 0.001$) between image scores and total extraction scores. Two additional model-space correlations corroborate the cross-metric structure of the TC-D results. Response latency correlates with format-related failure rate at $r = 0.82$, $p < 0.001$, indicating that slow responses are also more likely to be malformed, and per-token inference price correlates with total extraction score at $r = 0.56$, $p < 0.05$, indicating that more expensive models tend to score higher within this 23-model panel. The correlation between image scores and total scores reflects both the test design and genuine capability differences across models. Misinterpretation of graphical conventions and ambiguity in visual labelling contribute to reduced semantic alignment with reference models [29].
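The model-space correlations above are Pearson coefficients computed over per-model values. A minimal sketch of the computation is shown below; the input vectors are invented for illustration and do not reproduce the per-model scores in the tables.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length, at least two points")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / sqrt(var_x * var_y)


# Illustrative per-model values only (one entry per model).
image_scores = [1.0, 2.0, 3.0]
total_scores = [1.0, 2.0, 4.0]
r_value = pearson_r(image_scores, total_scores)
```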

Provenance quality, measured through a six-check attribution chain covering reference linkage, evidence typing, and source location, achieves a mean score of 0.499 across 23 models with a range of 0.058 to 0.723. Despite the architectural mandate for provenance described in Section 4.3, attribution scores are consistently lower than entity extraction scores for the same models, indicating a gap between architectural intent and empirical realisation. Generative agents identify entities more reliably than they maintain the evidential chains linking those entities to source material. Failures exhibit a cascade pattern in which early omissions, such as missing reference identifiers, propagate through subsequent checks and reduce the overall attribution score to near zero. Complete attribution scores per model appear in Table A1.

The above patterns are not uniformly distributed across the model space; six qualitatively distinct failure modes recur across the test cases and account for most of the variance summarised in Table 4 and Table 5. First, tool-calling failure, in which a model completes its runs without ever effecting a tool-mediated change to the map, is observed for Qwen3 8B in the entity generation task, which returns valid runs yet produces uniform-zero rows in Table A2; several other zero rows in that table instead reflect provider-side API errors, such as endpoints that do not support tool use, rather than model behaviour, a distinction preserved in the released run records. Second, comprehension-without-action, in which a model inspects the ecosystem state across repeated tool-call turns but commits few or no modifications, is exemplified by Gemini 2.5 Flash in TC-G, which completes only 7% of the reference entity set while adding almost no new entities per run. Third, partial completion, in which entity creation succeeds but relational structure is degraded, is exemplified by GPT-4o-mini in TC-G, which satisfies the Ørsted multi-role constraint in 87% of runs while producing only 0.5 ± 1.1 of the eight reference interactions. Fourth, the attribution cascade described above is most visible at the bottom of Table A1, where GPT-4o-mini reaches an attribution score of 0.058 despite a TEXT score of 0.775; once an early reference identifier is missing, the remaining attribution checks tend to fail in sequence. Fifth, multimodal extraction failure, the pattern most directly relevant to diagram interpretation, takes two forms visible in the TC-D records. In the first, a model that extracts textual entities reliably collapses on the same entities when they are sourced from the embedded diagram. GPT-4o-mini, for example, reaches a TEXT-sourced score of 0.775 but an IMAGE-sourced score of 0.052, and Kimi K2 and GPT-5-nano show the same split, with IMAGE scores of 0.088 and 0.266 against TEXT scores above 0.75. In the second, more capable multimodal models such as Claude Sonnet 4.5 and GPT-5.2 Chat recover the diagram’s actor nodes but recover far fewer of its directional edges, so that the arrows between recovered actors are dropped, reversed, or mis-typed; observed instances include a battery ancillary-service arrow re-attributed to a different source node and exchange type, and a congestion-relief arrow whose direction was frequently not recovered. Sixth, graphical-container misclassification, in which a layer or grouping box drawn in the diagram is read as an actor, is a structural error specific to visual sources. Gemini 2.5 Flash, for example, extracted the diagram’s grouping containers, such as a ‘Generation and Consumption’ layer, as actor entities, an error that arises because the containing relationship is conveyed by layout rather than by text.
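The edge-level errors named in the fifth failure mode (dropped, reversed, or mis-typed arrows) can be told apart mechanically once extracted and reference edges share a representation. The following is a minimal sketch, assuming each edge is a (source, target, exchange type) triple. The function name, the triple format, and the example edges are illustrative assumptions and are not taken from the released evaluation scripts or from the TC-D diagram.

```python
from collections import Counter
from typing import Iterable, Tuple

Edge = Tuple[str, str, str]  # (source actor, target actor, exchange type); assumed format


def classify_edges(reference: Iterable[Edge], extracted: Iterable[Edge]) -> Counter:
    """Label each reference edge as matched, mistyped, reversed, or dropped."""
    extracted_set = set(extracted)
    # Node pairs present in the extraction, ignoring the exchange type.
    pairs = {(s, t) for s, t, _ in extracted_set}
    outcome = Counter()
    for s, t, kind in reference:
        if (s, t, kind) in extracted_set:
            outcome["matched"] += 1
        elif (s, t) in pairs:
            outcome["mistyped"] += 1   # right direction, wrong exchange type
        elif (t, s) in pairs:
            outcome["reversed"] += 1   # both nodes recovered, arrow flipped
        else:
            outcome["dropped"] += 1    # no edge between these two nodes
    return outcome


# Hypothetical edges for illustration only.
reference = [
    ("Battery", "TSO", "ancillary service"),
    ("Aggregator", "DSO", "congestion relief"),
    ("Retailer", "Consumer", "electricity"),
    ("Producer", "Retailer", "electricity"),
]
extracted = [
    ("Battery", "TSO", "energy"),                # mis-typed
    ("DSO", "Aggregator", "congestion relief"),  # reversed
    ("Retailer", "Consumer", "electricity"),     # matched
]
result = classify_edges(reference, extracted)
print(dict(result))
```

Under this scheme an arrow re-attributed to a different source node counts as dropped, because no edge remains between the original pair of nodes; a finer-grained taxonomy would need an explicit node-alignment step.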

6.2. Performance Across Agent Roles

The orchestrator test case evaluates the capacity of models to coordinate multi-step modelling workflows by decomposing high-level user instructions into sequences of agent invocations and integrating intermediate results into a coherent pipeline output. Across 25 evaluated models, the mean pipeline F1 is 0.66 with a successful completion rate of 65.5%. However, this mean obscures a polarised distribution in which approximately half of the models achieve F1 scores above 0.70, while the remainder fall below 0.25 or fail entirely. Complete per-model results appear in Table A3.

Models that successfully maintain multi-step plan coherence achieve consistently high F1 scores, whereas those that lose contextual continuity across agent invocations produce outputs with minimal structural alignment to the reference pipeline. In contrast to entity extraction tasks, where performance degrades gradually across the model space, orchestration scores concentrate at the extremes of the distribution, suggesting that multi-step plan coherence functions as a threshold capability.

The entity generation test case evaluates whether generative agents can synthesise across multiple source documents to produce structurally complete ecosystem entities against a reference model of 18 entities. Across 26 evaluated models, including 4 that achieved 0% completion due to format-related failures, mean entity completion reaches 68.0% across the 22 models that produced any valid run, with per-model rates spanning the full range from 0% to 100%. Reference integration averages 4.0 of 6 source documents per model run, and across the 22 models with valid runs entity completion correlates strongly with reference integration at r = 0.95, p < 0.001, indicating that models that synthesise more of the available evidence also complete more of the reference entity set. Complete per-model breakdowns appear in Table A2.

The gap between entity generation performance and the 0.947 text-based extraction score in TC-D reflects a difference in task complexity. TC-D extracts from a single controlled document, whereas TC-G requires models to synthesise across multiple documents and integrate extracted entities with an existing model structure. Constraint satisfaction, measured as adherence to the ontological schema, averages 84.0% across successful models.

The methodology expert test case evaluates the two-phase boundary specification pipeline across 25 models, yielding a mean weighted quality score of 0.877 with a range of 0.747 to 0.989. This comparatively narrow range, of width 0.24 against a full [0, 1] scale, reflects the structured nature of the task. Among the five LLM-judged quality dimensions, pipeline coherence and downstream usability score at or near ceiling, at 0.98 and 1.00 respectively, while boundary completeness, summary fidelity, and score calibration each average 0.84 and account for most of the inter-model variation. Deterministic structural checks pass in 84.7% of runs. Format-related failures exhibit a bimodal distribution, with 14 models achieving a zero-percent failure rate and the remainder ranging from 6.7% to 100%. The highest failure rates occur among models that perform well on other pipeline tasks, indicating that the two-phase structured output format imposes requirements orthogonal to general generative capability.

The search relevance test case evaluates binary classification between a topically relevant and a topically irrelevant page. Across 30 evaluated models, 27 achieve perfect classification accuracy, with the remaining three producing malformed structured outputs that failed to conform to the expected JSON response schema rather than incorrect classifications. These three failures reflect parsing limitations rather than retrieval judgement errors. Among valid outputs, a mean quality score of 0.918 is observed. This near-ceiling result confirms basic retrieval filtering functionality but, as a binary distinction between semantically distant documents, does not establish discriminative capacity under conditions of partial topical overlap.
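The distinction drawn above between malformed structured output and an incorrect classification can be made explicit at scoring time. The sketch below is a minimal illustration, assuming a response schema with a single boolean field named `relevant`; that field name, the function name, and the sample responses are hypothetical and do not reproduce the schema in the released artefacts.

```python
import json


def score_response(raw: str, expected_relevant: bool) -> str:
    """Return 'malformed', 'correct', or 'incorrect' for one model response."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return "malformed"  # not parseable as JSON at all
    # Hypothetical schema: a JSON object with a boolean field named "relevant".
    if not isinstance(payload, dict) or not isinstance(payload.get("relevant"), bool):
        return "malformed"  # parseable, but does not conform to the schema
    return "correct" if payload["relevant"] == expected_relevant else "incorrect"


# Illustrative responses for a page whose expected label is "relevant".
responses = [
    '{"relevant": true}',
    '{"relevant": false}',
    "Sure! The page is relevant.",
    '{"relevant": "yes"}',
]
labels = [score_response(raw, expected_relevant=True) for raw in responses]
print(labels)
```

Keeping the malformed outcome separate from the incorrect one lets accuracy be reported over valid outputs only, with the parsing failure rate reported alongside it.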

6.3. Model Selection as the Dominant Performance Factor

Performance rankings invert across agent roles. Gemini 2.5 Pro achieves the highest methodology expert score at 0.989, yet reaches only 0.60 F1 on orchestration and 45.3% entity completion. Conversely, models that achieve perfect orchestration scores fail entirely on methodology expert format requirements. Hyperparameter sensitivity is systematically assessed by varying temperature and top-p sampling parameters across all test cases, with detailed per-model results reported in Table A4. For models that achieve non-trivial baseline performance, varying temperature from 0.0 to 1.0 and top-p from 0.5 to 1.0 produces a mean absolute performance change below 0.03 across all scored dimensions. In four of six model families tested, the effect of hyperparameter variation is smaller than the standard deviation introduced by non-deterministic generation alone. By contrast, switching between models within the same provider family produces performance differences of 0.20 or more on the same task. These results confirm that model selection dominates configuration tuning as a determinant of output quality, and that optimising hyperparameters without first selecting an appropriate model for each agent role yields negligible returns.

This role-dependent variation implies that selecting a single generative model for all modelling stages may be suboptimal. Instead, task-aligned model configuration emerges as a design principle for agentic modelling architectures [9].
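Task-aligned configuration can be pictured as a per-role lookup over benchmark scores. The sketch below is illustrative only: the score values are those quoted in this section and in Tables A1 and A2 for three models, while the role labels, the function name, and the 0.70 qualification threshold are assumptions introduced here. Scores are compared only within a role, since the four roles use different metrics.

```python
from typing import Dict, Optional

# Values quoted in the text and in Tables A1 (TC-D Total) and A2 (TC-G completion).
# A role is omitted for a model when its score is not quoted in this section.
SCORES: Dict[str, Dict[str, float]] = {
    "Claude Opus 4.5": {"extraction": 0.798, "generation": 1.000},
    "Claude Haiku 4.5": {"extraction": 0.787, "generation": 0.959},
    "Gemini 2.5 Pro": {
        "extraction": 0.769,
        "generation": 0.453,
        "methodology": 0.989,
        "orchestration": 0.60,
    },
}


def assign_models(scores: Dict[str, Dict[str, float]],
                  min_score: float = 0.70) -> Dict[str, Optional[str]]:
    """Pick the best-scoring model per role; None if no candidate clears min_score."""
    roles = sorted({role for per_model in scores.values() for role in per_model})
    assignment: Dict[str, Optional[str]] = {}
    for role in roles:
        candidates = {m: s[role] for m, s in scores.items() if role in s}
        best = max(candidates, key=candidates.get)
        # The threshold is an assumed cut-off, applied per role.
        assignment[role] = best if candidates[best] >= min_score else None
    return assignment


assignment = assign_models(SCORES)
for role, model in assignment.items():
    print(f"{role:<14}{model}")
```

With these inputs no listed model qualifies for orchestration, which mirrors the observation that a strong methodology expert score does not imply orchestration capability.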

An additional empirical pattern concerns reference integration behaviour. In entity generation, models demonstrate a systematic bias towards creating new entities over enriching pre-existing ones. Reference integration rates reach 80 to 100 percent for newly created entities but fall to 13 to 57 percent when the task requires augmenting an entity that already exists in the ecosystem map. Although derived from a single pre-existing entity in the test case and therefore preliminary, this pattern suggests that generative agents may treat each instruction as an independent creation event, with limited capacity for incremental modification of a persistent structure.

6.4. Summary of Empirical Patterns

The metrics across all five test cases are consolidated in Table 4. Two patterns structure the results. First, entity-level extraction is reliable while relational extraction is not, with a gap of more than 0.4 between role scores and interaction scores. Second, performance variance is dominated by model selection rather than by hyperparameter tuning or task design.

7. Discussion

7.1. Implications for Methodology-Grounded Generative Modelling

The empirical results, consolidated in Table 4, suggest that generative artificial intelligence can augment modelling practice, but only when embedded within explicit methodological frameworks and oversight mechanisms. The present study deliberately scopes this claim to business ecosystem modelling, a domain whose ontology of actors, roles, and interactions provides clearly bounded constructs suitable for a first empirical investigation. Whether the observed performance patterns extend to modelling contexts with fundamentally different ontologies, such as process mining, enterprise architecture, or biomedical knowledge representation, remains an open question that warrants cross-domain investigation.

The findings align with, and extend, recent work on ontology-constrained extraction [6,15] and agentic information extraction [16,17]. Whereas those systems focus on extraction as an isolated task, the present study embeds extraction within a broader modelling lifecycle that includes boundary specification, staged review, and provenance tracking, though computational automation of the completeness verification stage remains outstanding. The empirical observation that entity extraction may be amenable to automation while relational reasoning remains difficult provides a concrete boundary condition that complements the general-purpose extraction benchmarks reported in prior work [3].

A further scoping qualification concerns the evidential status of the architectural pattern itself. The empirical design isolates individual agents and varies the model assigned to each, thereby characterising model performance across modelling tasks. It does not compare the multi-agent pipeline against simplified or monolithic baselines, nor does it isolate the causal contribution of staged decomposition, ontological schema enforcement, or the staged proposal mechanism for human review. The architectural commitments summarised in Section 4 are therefore supported indirectly through the performance patterns reported in Section 6, rather than directly through architectural ablation. Systematic ablation of these components remains a direction for future work.

7.2. Orchestration and Entity Generation as Emerging Capabilities

The polarised orchestration distribution (Section 6.2), in which models either maintain coherent multi-step plans or fail entirely, suggests that orchestration functions as a threshold capability rather than a continuum. A mean pipeline F1 of 0.66 with a 65.5% completion rate indicates that roughly one-third of evaluated models cannot sustain multi-step coordination, a failure rate that would preclude unattended production deployment without model pre-qualification. Whether this threshold reflects intrinsic model capacity or an interaction between model capability and architectural demands cannot be determined without comparative evaluation against alternative coordination designs. Production deployment would therefore require model selection protocols that pre-screen orchestrator candidates on representative planning tasks before committing to a pipeline configuration.
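A pre-screening protocol of this kind could take the form of a simple acceptance gate. The sketch below is a hedged illustration rather than the evaluation procedure used in the study: it assumes that pipeline F1 can be approximated as a set-based F1 between invoked and reference pipeline steps, and the step names, function names, and both thresholds are hypothetical.

```python
from statistics import mean
from typing import Sequence, Set


def step_f1(predicted: Set[str], reference: Set[str]) -> float:
    """Set-based F1 between invoked and reference steps (assumed definition)."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def prequalify(runs: Sequence[Set[str]], reference: Set[str],
               min_f1: float = 0.70, min_completion: float = 0.90) -> bool:
    """Accept a candidate only if mean F1 and completion rate clear both thresholds."""
    # A run that invoked no steps is treated as a failed (incomplete) run.
    completion = sum(1 for run in runs if run) / len(runs)
    mean_f1 = mean(step_f1(run, reference) for run in runs)
    return mean_f1 >= min_f1 and completion >= min_completion


# Hypothetical step names for a five-step reference pipeline.
reference = {"boundary", "search", "analyse", "extract", "edit"}
stable_runs = [set(reference), set(reference), reference - {"edit"}]
polarised_runs = [set(reference), set(), set()]

stable_ok = prequalify(stable_runs, reference)
polarised_ok = prequalify(polarised_runs, reference)
print(stable_ok, polarised_ok)
```

Gating on the completion rate as well as on mean F1 reflects the polarised distribution: a candidate that succeeds fully on some runs and fails entirely on others is rejected even if its successful runs are perfect.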

The entity generation gap, at 68% completion relative to the 0.947 text-based extraction score in TC-D, reflects a qualitative shift in task demands. Single-document extraction operates within a bounded context where entities are explicitly stated, whereas multi-document synthesis requires contextual integration across heterogeneous sources and reconciliation of overlapping or contradictory information. The 32% shortfall in entity completion indicates that current models lose coherence when the reasoning horizon extends beyond a single document, a limitation with direct implications for the architecture’s applicability to domains requiring broad evidence synthesis.

The methodology expert’s high and narrowly distributed scores reflect a task whose structured output format constrains the solution space more tightly than extraction or generation. The bimodal failure distribution suggests that format compliance represents an orthogonal capability dimension not captured by general benchmarks. The near-ceiling search relevance results confirm basic filtering functionality but do not establish discriminative capacity under partial topical overlap.

7.3. Reliable Automation and Its Structural Limits

The empirical findings are consistent with the position that GAI can be integrated into modelling processes in a manner that preserves representational consistency, provided that the integration occurs within a constrained pipeline structure such as that depicted in Figure 1. Entity and role extraction tasks achieve mean scores above 0.78 under ontology-constrained conditions, but this observation does not isolate the causal contribution of stage-aligned decomposition from underlying model capability and would require architectural ablation to disentangle. By decomposing the transformation process and constraining outputs through formal schemas, the architecture reduces ambiguity and narrows the solution space available to the generative system [27]. Ontology-constrained generation, as explored in recent extraction systems [6,15], reduces the output space from open-ended text to a finite set of typed entities and admissible relations.

In contrast, the persistent difficulty observed in interaction extraction reveals where these architectural constraints are insufficient. Relational constructs demand interpretation of implicit meaning, directional logic, and contextual nuance across documents, capabilities that ontological constraints alone do not confer. The error patterns reported in Section 6.1 align with findings in agentic information extraction [16,17] and indicate that generative agents struggle to consistently apply domain ontologies when relational reasoning is required [4].

These limitations are amplified in multimodal contexts, where diagram interpretation requires correct mapping of graphical conventions to modelling constructs in addition to textual comprehension. Human oversight therefore bears most heavily on interaction proposals, where automated reliability remains insufficient for unreviewed incorporation.

7.4. Reference Integration and Incremental Modelling

Entity generation results show that models integrate 80 to 100 percent of source documents when creating new entities but only 13 to 57 percent when augmenting a pre-existing entity, suggesting that generative agents treat each instruction as a stateless creation event. If this pattern holds beyond the single entity evaluated, sustained model maintenance may require mechanisms that foreground existing map state within the generative context. The current architecture presents the full map structure alongside each proposal, but targeted strategies for directing agent attention to enrichment rather than creation remain an open design challenge.

7.5. Architectural Implications for Hybrid Modelling Frameworks

The role-dependent variation in model performance has direct architectural implications. The observed inversion of performance rankings across agent roles indicates that no single generative model is well suited to every stage of the modelling pipeline. This pattern is consistent with the design hypothesis that modular architectures permitting task-aligned model selection are advantageous [32], and it aligns with multi-agent coordination research [9,19] reporting that specialised agents can outperform monolithic models on complex compositional tasks, although the present study does not itself compare modular against monolithic configurations. Entity extraction, relational reasoning, and multimodal interpretation may each require distinct model capabilities, and the modular agent decomposition presented here enables such task-aligned selection without destabilising the pipeline.

7.6. Provenance Preservation as a Structural Limitation

The evaluation identifies provenance preservation as a distinct limitation that is not resolved by improvements in extraction accuracy. Despite explicit architectural mechanisms for source attribution, the mean attribution score of 0.499 across 23 models indicates that generative agents maintain complete evidential chains for fewer than half of the entities they extract. This separation between extraction and justification constitutes a systematic failure mode rather than an implementation gap. Generative agents can identify entities with high accuracy while simultaneously failing to maintain verifiable links to source material.

The cascade failure pattern, in which a single missing reference identifier propagates through subsequent checks and reduces the overall attribution score to near zero, reveals a structural fragility in source tracing that prompt-level instruction alone does not resolve. Because ecosystem maps may inform strategic or policy decisions [1], incomplete provenance undermines the very accountability that the governance layer is designed to preserve. Addressing this limitation may require architectural interventions, such as enforced citation slots within the extraction schema or post-generation provenance validation passes, that treat attribution as a first-class output constraint rather than an ancillary annotation.
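One way to picture enforced citation slots combined with a post-generation validation pass is sketched below. It is a minimal illustration under stated assumptions: the `Proposal` fields, the allowed evidence types, and the function name are hypothetical, and only the three aspects named in Section 6.1 (reference linkage, evidence typing, source location) are modelled, not the full six-check chain.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

ALLOWED_EVIDENCE_TYPES = {"TEXT", "IMAGE"}  # assumed vocabulary


@dataclass
class Proposal:
    name: str
    reference_id: Optional[str] = None     # citation slot: which source document
    evidence_type: Optional[str] = None    # citation slot: modality of the evidence
    source_location: Optional[str] = None  # citation slot: where in the source


def validate_provenance(proposal: Proposal, known_references: Set[str]) -> List[str]:
    """Return all provenance problems; an empty list means the proposal passes."""
    problems = []
    if proposal.reference_id not in known_references:
        problems.append("reference linkage: missing or unknown reference_id")
    if proposal.evidence_type not in ALLOWED_EVIDENCE_TYPES:
        problems.append("evidence typing: missing or unsupported evidence_type")
    if not proposal.source_location:
        problems.append("source location: missing source_location")
    return problems


# Hypothetical proposals and reference identifiers.
known = {"doc-1", "doc-2"}
complete = Proposal("Example actor", "doc-1", "TEXT", "section 2, paragraph 3")
empty = Proposal("Another actor")

complete_problems = validate_provenance(complete, known)
empty_problems = validate_provenance(empty, known)
print(complete_problems)
print(empty_problems)
```

The pass reports every failed check rather than stopping at the first, so a missing reference identifier does not hide which of the remaining slots were also left empty.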

7.7. Operational Workflow Implications

The measured per-task score distribution carries direct implications for how human effort would be allocated if the pipeline were deployed in an operational modelling workflow. The mean TEXT-sourced extraction score of 0.947 places textual actor and role proposals within a regime in which routine acceptance is plausible and review effort would concentrate on edge cases such as the hybrid constructs flagged in Section 6.1. The mean interaction score of 0.431 places relational proposals in the opposite regime, in which proposals could not be accepted routinely and would require systematic human adjudication rather than spot-checking, so that quality assurance for the ecosystem map would depend disproportionately on this stage. The mean IMAGE-sourced score of 0.470 implies that proposals derived from embedded diagrams should be treated as preliminary rather than auto-acceptable, and that diagram-heavy source material would shift the modeller’s role from acceptance review to direct re-interpretation. The mean attribution score of 0.499, paired with the cascade failure pattern, suggests that provenance verification cannot be a sampling activity; every proposal needs its evidence chain checked because cascade failures concentrate the missing-citation problem in particular models rather than distributing it uniformly. The mean orchestration F1 of 0.66, combined with the polarised distribution reported in Section 6.2, indicates that operational deployment would require model pre-qualification rather than relying on average-case capability. The corresponding confidence intervals are reported in Table 4 and Appendix A Table A5.
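The allocation of review effort described above can be summarised as a small routing policy. The sketch below is an interpretation of the regimes in this paragraph, not a component of the evaluated system: the labels, class names, and the rule that image sourcing takes precedence over proposal kind are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Review(Enum):
    ROUTINE = "routine acceptance with edge-case review"
    ADJUDICATE = "systematic human adjudication"
    REINTERPRET = "preliminary; re-interpret from the source diagram"


@dataclass(frozen=True)
class ReviewPlan:
    mode: Review
    verify_provenance: bool = True  # every proposal has its evidence chain checked


def plan_review(kind: str, modality: str) -> ReviewPlan:
    """Map a proposal to a review mode.

    `kind` is 'actor', 'role', or 'interaction'; `modality` is 'TEXT' or 'IMAGE'
    (assumed labels).
    """
    if modality == "IMAGE":
        return ReviewPlan(Review.REINTERPRET)  # assumed precedence over kind
    if kind == "interaction":
        return ReviewPlan(Review.ADJUDICATE)
    return ReviewPlan(Review.ROUTINE)          # TEXT-sourced actors and roles


for kind, modality in [("actor", "TEXT"), ("interaction", "TEXT"), ("role", "IMAGE")]:
    plan = plan_review(kind, modality)
    print(f"{kind:<12}{modality:<6}{plan.mode.value}")
```

Provenance verification is set for every plan rather than sampled, following the argument that cascade failures concentrate missing citations in particular models.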

Operational resource demands can be characterised at the run level from the data already reported. Within TC-M, the boundary specification pipeline averages 20,432 tokens per run, with a range of 14,621 to 35,406, and a mean latency of 103.3 s, with a range of 26.4 to 218.4 s depending on whether the assigned model performs explicit reasoning. Within TC-O, complete pipeline runs average 302.1 s with a standard deviation of 209.7 s and require an average of 2.3 ± 2.9 orchestrator calls. These figures characterise per-run cost and latency rather than throughput or speedup against manual modelling, neither of which was measured; quantitative time-saving claims would require a controlled comparison against a human-only baseline and remain a direction for future work.

7.8. Governance and Accountability

The integration of GAI into modelling raises questions of responsibility. Ecosystem maps may inform organisational strategy and policy development [1], and autonomous construction would diffuse accountability and obscure representational decisions [33]. The empirical results reinforce this position. The interaction extraction bottleneck demonstrates that generative outputs are not uniformly trustworthy, and the polarised orchestration distribution indicates that pipeline reliability depends on model selection decisions requiring human judgement [23]. These findings suggest that governance mechanisms are structural necessities for architectures embedding generative agents within consequential modelling processes.

7.9. Limitations and Future Directions

Six boundary conditions qualify the scope and generalisability of the findings reported above.

The evaluation is scoped to business ecosystem modelling within one institutional and thematic context, with an ontology of only three core constructs. Whether the approach scales to richer ontologies with deeper type hierarchies has not been established, and performance may not generalise to domains with fundamentally different representational primitives [3]. The TC-D extraction scores in particular characterise a controlled single-document task with a purpose-built source and should be interpreted as task-specific rather than class-level estimates. Cross-domain transferability, though plausible given the architecture’s parameterised ontology definitions, requires empirical demonstration.

External validity is further limited by the construction of the reference sets. Ground-truth actors, roles, and interactions were defined by a single annotator according to the modelling ontology in Table 2, which enforces internal consistency across the over 4300 runs at the cost of inter-subjective replicability. The two test cases with independent human calibration, TC-G and TC-M, return high agreement coefficients (κ = 0.942 with 95% bootstrap CI [0.899, 0.978]; κ_w = 0.847 with 95% bootstrap CI [0.754, 0.914]), which supports the reliability of the annotation procedure but does not extend to TC-O, TC-D, and TC-S. Replication of the TC-D reference set by a second annotator, together with a second-annotator κ check on at least one of the three uncalibrated test cases, is the principal action that would address this limitation; both are planned future-work items.
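For readers planning such a second-annotator check, the following sketch shows how an unweighted Cohen's κ and a percentile bootstrap interval can be computed with the Python standard library. It is a generic illustration, not the study's calibration script: the labels are invented, the function names are hypothetical, and the weighted variant κ_w reported above is not covered.

```python
import random
from collections import Counter
from typing import Sequence, Tuple


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the product of the raters' marginal label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def bootstrap_ci(a: Sequence[str], b: Sequence[str], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Invented labels for illustration: an LLM judge versus a human annotator.
judge = ["match", "match", "miss", "match", "miss", "miss", "match", "miss"]
human = ["match", "match", "miss", "match", "miss", "match", "match", "miss"]

kappa = cohen_kappa(judge, human)
lower, upper = bootstrap_ci(judge, human)
print(f"kappa = {kappa:.3f}")
```

With eight items the interval is very wide; the example is sized for readability, and a real calibration set would need far more items to give intervals as tight as those reported above.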

The hybrid evaluation framework does not exhaustively capture all dimensions of modelling correctness; fine-grained ontological distinctions may escape detection at broader semantic similarity levels. The judge bias considerations described in Section 5.4, in particular the absence of independent human calibration for TC-O, TC-D, and TC-S, qualify the strength of conclusions that can be drawn from those three cases.

Interaction types are treated at a defined level of abstraction. More granular typologies may clarify whether relational errors stem from linguistic ambiguity, contextual integration gaps, or compositional reasoning limitations. Similarly, multimodal interpretation could benefit from structured diagram parsing pipelines integrating computer vision with ontology-constrained generation [29]. The specific failure modes that underlie this limitation, including the dropping and reversal of directional edges between recovered diagram nodes and the misclassification of grouping containers as actors, are catalogued in Section 6.1.

The study is a design and characterisation study rather than a comparative or ablation evaluation. No monolithic single-agent baseline, no sequential pipeline without the methodology expert agent, and no variant with the staged proposal mechanism disabled were evaluated. Consequently, the empirical results characterise how generative models perform within the proposed architecture but do not isolate the causal contribution of methodology anchoring, staged decomposition, ontology enforcement, or the staged proposal mechanism. Three ablation experiments would address this limitation directly. The first compares the multi-agent pipeline against a single-agent variant that performs boundary specification, retrieval, extraction, and editing within one generative invocation, isolating the contribution of agent decomposition. The second removes the methodology expert agent and compares the resulting outputs against the full pipeline, isolating the contribution of post-extraction semantic validation. The third disables the staged proposal mechanism so that editor outputs are committed directly to the map, isolating the contribution of the human-review gate. Because model selection dominates performance in the present results, each ablation must be repeated across a representative panel of models rather than a single configuration to separate the architectural contribution from model capability, and is left to future work.

The staged proposal mechanism preserves interpretive responsibility but may impose cognitive load. The study does not assess efficiency gains relative to manual modelling; user-centred evaluation of time savings, cognitive burden, and perceived trust constitutes a priority for future work, ideally conducted alongside the ablation experiments described above so that architectural variants can be assessed against both technical and human-factors criteria.

8. Conclusions

This study conceptualises modelling as a methodology-grounded transformation process and proposes a multi-agent architecture aligned with modelling stages, operationalising four of the five methodological stages within the domain of business ecosystem modelling while completeness verification remains under human control. By anchoring generative agents in an explicit ontology, enforcing staged proposal review, and maintaining source-level audit trails, the architecture is designed to preserve representational consistency and traceability of modelling decisions; whether these mechanisms causally improve on simpler alternatives remains open in the absence of architectural ablation.

The empirical evaluation across 34 models and over 4300 runs reveals a task-dependent performance landscape in which entity and role extraction achieve high accuracy under ontology-constrained conditions in a controlled single-document setting, while interaction extraction and multimodal diagram interpretation remain harder. These tasks demand relational reasoning and visual interpretation capabilities that current generative systems do not yet provide with sufficient consistency for autonomous operation. The mean interaction extraction score of 0.431 and provenance attribution score of 0.499 constitute the principal negative results, indicating that relational reasoning and evidential tracing remain unsolved at the level required for unreviewed automation. Because image-sourced entities constitute more than half of the ground-truth items in the controlled extraction task, the high correlation between image extraction and total scores (r = 0.96) is partly structural, though multimodal capability nonetheless emerges as a primary differentiator of overall pipeline quality. Model performance varies across agent roles, and model selection consistently outweighs hyperparameter tuning as a determinant of output quality, suggesting that modular architectures permitting task-aligned model assignment may offer a structural advantage over monolithic configurations, though this comparison has not been tested empirically through architectural ablation.

The hybrid evaluation framework developed here addresses the broader methodological challenge of assessing semantic modelling outputs. By combining operational reliability metrics with LLM-based semantic judgement calibrated against human agreement, the framework provides a reusable assessment instrument applicable to generative modelling pipelines beyond the ecosystem domain studied here. The high inter-rater agreement observed in calibrated test cases supports the viability of automated semantic scoring, while the absence of independent human validation for three of five test cases identifies a clear direction for future replication.

Beyond the immediate findings, the study indicates that the architectural pattern of methodology-anchored agentic decomposition may be applicable to other structured knowledge construction problems where heterogeneous sources must be transformed into ontologically consistent representations, subject to empirical validation in those domains. Process modelling, enterprise architecture, and regulatory knowledge graphs share the structural characteristics that make ecosystem modelling amenable to this approach, although cross-domain validation remains necessary before generalisability can be claimed.

Future research should prioritise three directions. Extending the architecture to additional modelling domains would test the generalisability of both the performance patterns and the governance mechanisms. Refining relational reasoning capabilities, whether through improved prompting strategies, specialised fine-tuning, or hybrid approaches combining visual parsing tools with generative agents, would address the principal performance bottleneck. Incorporating user-centred evaluation of cognitive load and efficiency gains would complement the technical performance characterisation by establishing whether the hybrid governance model delivers net productivity improvements in operational modelling practice.

Author Contributions

Conceptualisation, methodology, software, formal analysis, investigation, data curation, validation, visualisation, writing—original draft: H.F.G.; supervision, project administration, writing—review and editing: B.N.J. and Z.G.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The artefacts supporting this study are openly available at https://gitlab.sdu.dk/hamp/methodology-grounded-ecosystem-mapping (accessed on 1 April 2026). The repository includes the agent prompts, the tool-call and structured-output JSON schemas extracted from the evaluation build, the evaluation and scoring scripts, configuration files, the test-case input document, and the full per-run records underlying the reported results, together with an offline harness that reproduces the main results tables.

Acknowledgments

During the preparation of this manuscript, the authors used Claude Opus 4.6 (Anthropic) for language editing and drafting assistance. The authors reviewed and edited all output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Detailed Per-Model Results

Table A1. Document extraction scores by model for TC-D. n = total runs; s = scored runs producing valid output. Total = mean of Actor, Role, Interaction, and Attribution components. Attrib = provenance chain completeness measured as a six-check mean. TEXT and IMAGE sub-scores decompose performance by source modality, averaging actor, role, and interaction ratios equally within each modality. All 23 models in the documented TC-D extraction panel are shown, sorted by total score descending.

Model	n	s	Total	Actor	Role	Inter.	Attrib	TEXT	IMAGE
Claude Opus 4.5	30	30	0.798	0.936	0.967	0.567	0.723	0.986	0.708
Claude Haiku 4.5	113	113	0.787	0.991	0.977	0.501	0.677	0.988	0.716
GPT-5.2	106	88	0.782	0.951	1.000	0.479	0.699	0.991	0.692
Gemini 2.5 Flash	109	108	0.776	0.890	0.962	0.569	0.683	0.997	0.668
Gemini 2.5 Pro	33	32	0.769	0.909	0.896	0.571	0.700	0.983	0.641
Claude Sonnet 4.5	33	33	0.756	0.889	0.926	0.551	0.659	0.995	0.632
Qwen3-VL 235B	30	10	0.733	0.925	0.933	0.519	0.554	0.967	0.662
GPT-5-mini	108	107	0.730	0.941	0.954	0.415	0.610	0.984	0.621
Qwen3-VL 32B	33	24	0.704	0.837	0.889	0.577	0.512	0.927	0.625
Gemini 2.5 Flash Lite	34	31	0.684	0.790	0.889	0.498	0.561	0.996	0.509
Ministral 8B	108	99	0.676	0.899	0.934	0.438	0.431	0.923	0.628
Ministral 14B	30	27	0.668	0.852	0.868	0.506	0.445	0.974	0.553
Qwen3-VL 30B	30	17	0.637	0.775	0.791	0.482	0.499	0.944	0.448
Grok 4.1 Fast	30	30	0.626	0.750	0.819	0.397	0.538	0.967	0.393
Qwen3 235B	30	6	0.619	0.708	0.833	0.397	0.539	0.993	0.360
Ministral 3B	34	31	0.599	0.841	0.824	0.367	0.365	0.869	0.512
Qwen3-VL 8B	34	4	0.594	0.771	0.778	0.429	0.400	0.969	0.392
GPT-5.2 Chat	30	28	0.548	0.732	0.770	0.284	0.406	0.893	0.329
GPT-OSS 120B	30	29	0.510	0.580	0.713	0.315	0.433	0.925	0.182
Qwen3-Next 80B	30	30	0.486	0.500	0.733	0.351	0.359	0.972	0.134
GPT-5-nano	108	108	0.467	0.654	0.698	0.245	0.273	0.833	0.266
Kimi K2	30	26	0.457	0.500	0.671	0.299	0.358	0.925	0.088
GPT-4o-mini	34	31	0.324	0.449	0.624	0.164	0.058	0.775	0.052
Mean	1187	1042	0.640	0.786	0.846	0.431	0.499	0.947	0.470
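
As an illustrative check of the Total column definition, the following minimal sketch recomputes Total as the unweighted mean of the Actor, Role, Interaction, and Attribution scores for a few rows copied from Table A1. The function name `total_score` and the tolerance are assumptions introduced here and are not taken from the released evaluation scripts; the tolerance only allows for the three-decimal rounding of the published values.

```python
# Rows copied from Table A1: (actor, role, interaction, attribution, reported_total)
ROWS = {
    "Claude Opus 4.5": (0.936, 0.967, 0.567, 0.723, 0.798),
    "GPT-5.2": (0.951, 1.000, 0.479, 0.699, 0.782),
    "GPT-4o-mini": (0.449, 0.624, 0.164, 0.058, 0.324),
    "Mean": (0.786, 0.846, 0.431, 0.499, 0.640),
}

# Assumed tolerance: published inputs and totals are each rounded to 3 decimals.
TOLERANCE = 0.001 + 1e-9


def total_score(actor: float, role: float, interaction: float, attribution: float) -> float:
    """Unweighted mean of the four component scores."""
    return (actor + role + interaction + attribution) / 4.0


for model, (actor, role, inter, attrib, reported) in ROWS.items():
    recomputed = total_score(actor, role, inter, attrib)
    consistent = abs(recomputed - reported) <= TOLERANCE
    print(f"{model}: recomputed={recomputed:.4f} reported={reported:.3f} consistent={consistent}")
```

The TEXT and IMAGE sub-scores cannot be recomputed in the same way from this table, because they average per-modality ratios that are only available in the per-run records.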

Table A2. Entity generation task completion by model for TC-G, with standard configuration at temperature 1.0 and top-p 1.0 across 15 runs per model. n = total runs attempted; s = successful runs; Fail% = failure rate. Comp shows overall completion rate. Roles, Actors, Inter., and VCs show mean semantic matches with standard deviation where variance exists. Ørsted=3 and Permit Fixed show constraint satisfaction rates. Int shows mean reference integration on a scale of 0 to 6. Models with 100% failure are listed at the bottom. Sorted by completion rate descending.

Model	n	s	Fail%	Comp	Roles (0–5)	Actors (0–3)	Inter. (0–8)	VCs (0–2)	Ørsted=3	Permit Fixed	Int (0–6)
Claude Sonnet 4.5	15	15	0	100.0%	5.0	3.0	8.0	2.0	100%	100%	5.0
Claude Opus 4.5	15	15	0	100.0%	5.0	3.0	8.0	2.0	100%	100%	5.5
Grok 4.1 Fast	15	15	0	99.3%	5.0	3.0	8.0	1.9 ± 0.5	100%	100%	5.2
GPT-5-mini	15	11	27	96.5%	5.0	3.0	8.0	1.4 ± 0.9	100%	100%	5.5
GPT-5.2 Chat	15	15	0	96.3%	5.0	3.0	8.0	1.3 ± 0.9	100%	87%	5.1
Claude Haiku 4.5	15	15	0	95.9%	5.0	3.0	7.3 ± 1.0	2.0	100%	60%	5.1
Nemotron 3 Nano 30B	15	14	7	92.9%	4.8 ± 0.8	2.8 ± 0.8	7.4 ± 2.1	1.7 ± 0.7	93%	79%	5.3
Ministral 3B	15	9	40	92.6%	4.4 ± 1.7	3.3 ± 1.6	7.6 ± 2.1	1.3 ± 0.9	89%	33%	5.1
GPT-5.2	15	15	0	91.1%	5.0	3.0	6.4 ± 1.5	2.0	100%	100%	5.9
Ministral 8B	15	11	27	87.9%	5.0	3.0	5.9 ± 1.5	1.9 ± 0.3	100%	91%	5.0
Kimi K2	15	14	7	85.7%	4.3 ± 1.8	2.6 ± 1.1	6.9 ± 2.9	1.7 ± 0.7	86%	79%	4.1
Grok Code Fast 1	15	15	0	83.0%	4.7 ± 1.3	2.4 ± 1.2	6.4 ± 2.8	1.5 ± 0.9	80%	73%	3.8
GPT-5-nano	15	12	20	74.1%	5.0	3.7 ± 1.0	3.7 ± 3.5	1.0 ± 1.0	100%	92%	5.8
Gemini 2.5 Flash Lite	15	15	0	70.7%	4.0 ± 2.1	2.4 ± 1.2	5.1 ± 3.8	1.2 ± 1.0	80%	67%	4.8
Ministral 14B	15	7	53	65.1%	3.6 ± 2.4	2.1 ± 1.5	4.9 ± 3.6	1.1 ± 1.1	71%	43%	3.7
Gemini 2.5 Pro	15	13	13	45.3%	2.3 ± 2.6	1.4 ± 1.6	3.5 ± 4.0	0.9 ± 1.0	46%	46%	2.8
Qwen3-Next 80B	15	13	13	42.3%	3.5 ± 2.4	2.1 ± 1.4	1.2 ± 1.5	0.9 ± 1.0	69%	69%	3.5
GPT-4o-mini	15	15	0	41.9%	4.3 ± 1.8	2.6 ± 1.1	0.5 ± 1.1	0.1 ± 0.5	87%	13%	2.7
Qwen3-VL 30B	15	4	73	29.2%	2.5 ± 2.9	0.8 ± 1.5	2.0 ± 4.0	0.0	25%	0%	2.5
Gemini 2.5 Flash	15	14	7	7.1%	0.4 ± 1.3	0.2 ± 0.8	0.6 ± 2.1	0.1 ± 0.5	7%	7%	0.5
GPT-OSS 120B	15	2	87	0.0%	0.0	0.0	0.0	0.0	0%	0%	0.0
Qwen3 8B	15	15	0	0.0%	0.0	0.0	0.0	0.0	0%	0%	1.0
Qwen3 235B	15	0	100	—	—	—	—	—	—	—	—
Qwen3-VL 235B	15	0	100	—	—	—	—	—	—	—	—
Qwen3-VL 32B	15	0	100	—	—	—	—	—	—	—	—
Qwen3-VL 8B	15	0	100	—	—	—	—	—	—	—	—
Mean, $n = 22$		274	17	68.0%	3.8	2.3	5.0	1.2	74%	61%	4.0
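
The Fail% column can be read as the share of attempted runs that did not succeed. The sketch below recomputes it from the n and s columns for a few rows copied from Table A2, and then pools the 22 models that produced at least one successful run. The helper name `failure_rate` is an assumption introduced for illustration, and the pooled denominator of 330 assumes 15 attempted runs for each of those 22 models, as stated in the caption.

```python
# Rows copied from Table A2: (n attempted, s successful, reported Fail%)
ROWS = {
    "GPT-5-mini": (15, 11, 27),
    "Ministral 3B": (15, 9, 40),
    "Ministral 14B": (15, 7, 53),
    "GPT-OSS 120B": (15, 2, 87),
    "Qwen3 235B": (15, 0, 100),
}


def failure_rate(n_runs: int, n_success: int) -> float:
    """Percentage of attempted runs that did not succeed."""
    if n_runs <= 0 or not 0 <= n_success <= n_runs:
        raise ValueError("need n_runs > 0 and 0 <= n_success <= n_runs")
    return 100.0 * (n_runs - n_success) / n_runs


for model, (n, s, reported) in ROWS.items():
    print(f"{model}: {failure_rate(n, s):.1f}% (table: {reported}%)")

# Pooled over the 22 models with at least one successful run (274 successes in total).
pooled = failure_rate(22 * 15, 274)
print(f"pooled: {pooled:.1f}%")
```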

Table A3. Orchestrator evaluation results by model for TC-O. n = total runs attempted; s = successful runs; Fail% = failure rate. Duration and F1 show mean ± standard deviation of successful runs. Quality is the LLM judge rating on a scale of 1 to 5. Sorted by F1 descending.

Model	n	s	Fail%	Duration (s)	F1	Quality
Claude Haiku 4.5	10	10	0	259 ± 81	1.00	2.8 ± 0.6
Claude Sonnet 4.5	10	9	10	373 ± 194	1.00	2.3 ± 0.5
o4-mini	6	3	50	394 ± 115	1.00	2.7 ± 0.6
Ministral 8B	10	1	90	739	1.00	2.0
Claude Opus 4.5	10	9	10	291 ± 114	0.98 ± 0.06	2.1 ± 0.3
GPT-5-mini	10	7	30	620 ± 161	0.98 ± 0.04	3.1 ± 0.9
Claude Sonnet 4	6	6	0	390 ± 96	0.96 ± 0.07	2.7 ± 0.5
GPT-5-nano	10	6	40	762 ± 85	0.92 ± 0.14	2.2 ± 0.4
GPT-5.2	10	8	20	437 ± 114	0.91 ± 0.10	3.8 ± 0.9
Ministral 3B	10	6	40	465 ± 276	0.86 ± 0.18	2.3 ± 0.5
Grok Code Fast 1	10	6	40	456 ± 254	0.79 ± 0.18	2.8 ± 0.8
Grok 4.1 Fast	10	7	30	412 ± 71	0.75 ± 0.35	2.1 ± 0.4
GPT-5.1	16	11	31	340 ± 185	0.71 ± 0.46	3.3 ± 1.1
GPT-5.2 Chat	10	10	0	215 ± 87	0.70 ± 0.39	3.3 ± 0.7
GPT-4o-mini	17	16	6	205 ± 91	0.69 ± 0.30	2.0
Gemini 2.5 Flash	10	6	40	337 ± 191	0.66 ± 0.26	2.0
Gemini 2.5 Flash Lite	10	4	60	116 ± 61	0.65 ± 0.44	2.2 ± 0.5
Gemini 2.5 Pro	16	13	19	282 ± 77	0.60 ± 0.20	2.0
Ministral 14B	10	4	60	75 ± 103	0.25 ± 0.50	2.0
GPT-5.1 Chat	10	10	0	73 ± 53	0.18 ± 0.30	2.1 ± 0.6
GPT-OSS 20B	10	4	60	84 ± 88	0.14 ± 0.27	2.0
Qwen3-Next 80B	10	6	40	85 ± 142	0.11 ± 0.27	1.7 ± 0.5
Qwen3 8B	10	7	30	111 ± 120	0.00	1.7 ± 0.5
GPT-OSS 120B	10	2	80	13	0.00	2.0
Qwen3 235B	10	0	100	—	—	—
Mean, $n = 25$	261	171	34.5	314	0.66	2.4
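
The aggregate row can be cross-checked against the per-model values. The following sketch, which uses only numbers copied from Table A3, recomputes the per-model-mean F1 and its between-model sample standard deviation, and the pooled failure rate. Qwen3 235B produced no successful run and therefore has no F1 value, which leaves 24 values; that count agrees with N = 24 for Pipeline F1 in Table A5. Treating the SD as a sample standard deviation (n − 1 denominator) is an assumption made here.

```python
from statistics import mean, stdev

# Per-model mean F1 values copied from Table A3 (24 models with at least one successful run).
F1 = [
    1.00, 1.00, 1.00, 1.00, 0.98, 0.98, 0.96, 0.92,
    0.91, 0.86, 0.79, 0.75, 0.71, 0.70, 0.69, 0.66,
    0.65, 0.60, 0.25, 0.18, 0.14, 0.11, 0.00, 0.00,
]

f1_mean = mean(F1)   # per-model-mean point estimate
f1_sd = stdev(F1)    # between-model sample standard deviation (assumed n - 1 denominator)

# Pooled failure rate across all 25 models: 261 attempted runs, 171 successful.
fail_pct = 100.0 * (261 - 171) / 261

print(f"models with F1: {len(F1)}")
print(f"mean F1: {f1_mean:.3f}, SD: {f1_sd:.3f}")
print(f"pooled failure rate: {fail_pct:.1f}%, success rate: {100.0 - fail_pct:.1f}%")
```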

Table A4. Hyperparameter sensitivity across test cases. Baseline configurations vary by test case. Mean $|Δ|$ reports the mean absolute score change when varying temperature from 0.0 to 1.0 and top-p from 0.5 to 1.0 relative to baseline. All models tested are those achieving non-trivial baseline performance.

Test Case	Models	Runs	Baseline	Mean $|Δ|$
TC-D: Document extraction	6	300	T = 0.7	0.024
TC-G: Entity generation	6	540	T = 0.7	0.025
TC-S: Search relevance	7	359	T = 0.7	<0.01
TC-M: Methodology expert	7	519	T = 1.0	0.028
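
To make the sensitivity measure concrete, the sketch below computes a mean absolute score change relative to a baseline configuration. All scores and configuration labels in the example are hypothetical and are not taken from the study's per-run records; the function name `mean_abs_delta` is likewise an assumption, and the released scripts may aggregate across models and configurations differently.

```python
def mean_abs_delta(baseline_score: float, variant_scores: list[float]) -> float:
    """Mean of |variant - baseline| over all non-baseline configurations."""
    if not variant_scores:
        raise ValueError("at least one variant score is required")
    return sum(abs(s - baseline_score) for s in variant_scores) / len(variant_scores)


# Hypothetical mean scores for one model under different sampling settings.
baseline = 0.70  # e.g. the score at the baseline temperature
variants = {
    "T=0.0, top-p=1.0": 0.68,
    "T=1.0, top-p=1.0": 0.73,
    "T=0.7, top-p=0.5": 0.71,
}

delta = mean_abs_delta(baseline, list(variants.values()))
print(f"mean |delta| = {delta:.3f}")
```

A small value, such as those in the table above, indicates that scores move little when the sampling parameters are varied around the baseline.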

Table A5. Uncertainty summary for the headline aggregate scores reported in Section 6. The mean is the per-model-mean point estimate and SD is the between-model standard deviation across the per-model means listed in Table A1, Table A2 and Table A3. The 95% CI is a percentile interval from a hierarchical bootstrap (10,000 resamples) that resamples models and, within each model, its runs, and therefore incorporates both between-model and per-run sampling variability; all intervals are reproducible from the released evaluation scripts. Bootstrap CIs for the human agreement coefficients are reported in Section 5.4 and are not duplicated here.

Test Case	Metric	N	Mean	SD	95% CI
TC-O	Pipeline F1	24	0.660	0.349	[0.513, 0.794]
TC-G	Entity completion (%)	22	68.0	34.1	[52.6, 82.0]
TC-D	Total score	23	0.640	0.127	[0.588, 0.690]
TC-D	Actor	23	0.786	0.157	[0.722, 0.847]
TC-D	Role	23	0.846	0.108	[0.801, 0.889]
TC-D	Interaction	23	0.431	0.115	[0.383, 0.477]
TC-D	Attribution	23	0.499	0.162	[0.431, 0.562]
TC-D	TEXT	23	0.947	0.058	[0.921, 0.968]
TC-D	IMAGE	23	0.470	0.211	[0.384, 0.553]
TC-M	Weighted quality score	25	0.877	0.069	[0.848, 0.904]
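
The hierarchical bootstrap described in the caption of Table A5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function name `hierarchical_bootstrap_ci`, the percentile indexing, and the toy run scores are all hypothetical. Each resample draws models with replacement and then, within each drawn model, draws its runs with replacement; the statistic is the mean of per-model means.

```python
import random


def hierarchical_bootstrap_ci(runs_by_model, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile CI for the mean of per-model means, resampling models and then runs."""
    rng = random.Random(seed)
    models = [m for m, runs in runs_by_model.items() if runs]  # skip models without runs
    if not models:
        raise ValueError("no model has any run")
    estimates = []
    for _ in range(n_resamples):
        # Level 1: resample models with replacement.
        drawn_models = rng.choices(models, k=len(models))
        per_model_means = []
        for model in drawn_models:
            runs = runs_by_model[model]
            # Level 2: resample that model's runs with replacement.
            drawn_runs = rng.choices(runs, k=len(runs))
            per_model_means.append(sum(drawn_runs) / len(drawn_runs))
        estimates.append(sum(per_model_means) / len(per_model_means))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[max(int((1 - alpha / 2) * n_resamples) - 1, 0)]
    return lower, upper


# Hypothetical per-run scores for three models (not data from the study).
toy_runs = {
    "model_a": [1.0, 1.0, 0.9, 1.0],
    "model_b": [0.6, 0.7, 0.5, 0.8],
    "model_c": [0.0, 0.2, 0.1, 0.0],
}

point = sum(sum(r) / len(r) for r in toy_runs.values()) / len(toy_runs)
low, high = hierarchical_bootstrap_ci(toy_runs, n_resamples=2_000, seed=42)
print(f"point estimate {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Resampling at both levels is what lets the interval reflect between-model variability as well as per-run sampling variability; resampling runs alone would understate the former.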

References

1. Ma, Z.; Christensen, K.; Jørgensen, B.N. Business ecosystem architecture development: A case study of Electric Vehicle home charging. Energy Inform. 2021, 4, 9.
2. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2023, arXiv:2303.18223.
3. Xu, D.; Chen, W.; Peng, W.; Zhang, C.; Xu, T.; Zhao, X.; Wu, X.; Zheng, Y.; Chen, E. Large Language Models for Generative Information Extraction: A Survey. Front. Comput. Sci. 2024, 18, 186357.
4. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 2025, 43, 42.
5. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the Advances in Neural Information Processing Systems 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 9459–9474.
6. Dagdelen, J.; Dunn, A.; Lee, S.; Walker, N.; Rosen, A.S.; Ceder, G.; Persson, K.A.; Jain, A. Structured Information Extraction from Scientific Text with Large Language Models. Nat. Commun. 2024, 15, 1418.
7. Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A Survey on Large Language Model based Autonomous Agents. Front. Comput. Sci. 2024, 18, 186345.
8. Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the International Conference on Learning Representations (ICLR); ICLR: Appleton, WI, USA, 2023.
9. Guo, T.; Chen, X.; Wang, Y.; Chang, R.; Pei, S.; Chawla, N.V.; Wiest, O.; Zhang, X. Large Language Model based Multi-Agents: A Survey of Progress and Challenges. In Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI-24); Survey Track; IJCAI: Menlo Park, CA, USA, 2024; pp. 8048–8057.
10. Ma, Z. Business ecosystem modeling—The hybrid of system modeling and ecological modeling: An application of the smart grid. Energy Inform. 2019, 2, 35.
11. Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.P.; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Proceedings of the Advances in Neural Information Processing Systems 2023; Curran Associates Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 46595–46623.
12. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. 2023, 55, 1–35.
13. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems 2022; Curran Associates Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 24824–24837.
14. Wei, X.; Cui, X.; Cheng, N.; Wang, X.; Zhang, X.; Huang, S.; Xie, P.; Xu, J.; Chen, Y.; Zhang, M.; et al. ChatIE: Zero-Shot Information Extraction via Chatting with ChatGPT. arXiv 2023, arXiv:2302.10205.
15. Luo, Y.; Ru, X.; Liu, K.; Yuan, L.; Sun, M.; Zhang, N.; Liang, L.; Zhang, Z.; Zhou, J.; Wei, L.; et al. OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System. In Companion Proceedings of the ACM on Web Conference 2025; Association for Computing Machinery: New York, NY, USA, 2025.
16. Chhetri, T.R.; Chen, Y.; Trivedi, P.; Jarecka, D.; Haobsh, S.; Ray, P.; Ng, L.; Ghosh, S.S. StructSense: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking. arXiv 2025, arXiv:2507.03674.
17. Colakoglu, G.; Solmaz, G.; Fürst, J. AgenticIE: An Adaptive Agent for Information Extraction from Complex Regulatory Documents. arXiv 2025, arXiv:2509.11773.
18. Luo, J.; Zhang, W.; Yuan, Y.; Zhao, Y.; Yang, J.; Gu, Y.; Wu, B.; Chen, B.; Qiao, Z.; Long, Q.; et al. Large Language Model Agent: A Survey on Methodology, Applications and Challenges. arXiv 2025, arXiv:2503.21460.
19. Hong, S.; Zhuge, M.; Chen, J.; Zheng, X.; Cheng, Y.; Zhang, C.; Wang, J.; Wang, Z.; Yau, S.K.S.; Lin, Z.; et al. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. In Proceedings of the International Conference on Learning Representations (ICLR); ICLR: Appleton, WI, USA, 2024.
20. Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language Models Can Teach Themselves to Use Tools. In Proceedings of the Advances in Neural Information Processing Systems 2023; Curran Associates Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 68539–68551.
21. Qu, C.; Dai, S.; Wei, X.; Cai, H.; Wang, S.; Yin, D.; Xu, J.; Wen, J.R. Tool Learning with Large Language Models: A Survey. Front. Comput. Sci. 2025, 19, 198343.
22. Shinn, N.; Cassano, F.; Berman, E.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language Agents with Verbal Reinforcement Learning. In Proceedings of the Advances in Neural Information Processing Systems 2023; Curran Associates Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 8634–8652.
23. Bavaresco, A.; Bernardi, R.; Bertolazzi, L.; Elliott, D.; Fernández, R.; Gatt, A.; Ghaleb, E.; Giulianelli, M.; Hanna, M.; Koller, A.; et al. LLMs instead of Human Judges? A Large-Scale Empirical Study across 20 NLP Evaluation Tasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 238–255.
24. Calderon, N.; Reichart, R.; Dror, R. The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 16051–16081.
25. Lu, Y.; Liu, Y.; Dong, S.; Song, Q.; Zhang, C.; Zhao, Y.; Lu, J. KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment. arXiv 2025, arXiv:2502.06472.
26. Lin, L.; Zeng, W.; Luo, B.; Wen, L.; Wang, J. MAO: A Framework for Process Model Generation with Multi-Agent Orchestration. arXiv 2024, arXiv:2408.01916.
27. Dijkstra, E.W. On the Role of Scientific Thought. In Selected Writings on Computing: A Personal Perspective; Originally written in 1974 as EWD447; Springer: New York, NY, USA, 1982; pp. 60–66.
28. Xi, Y.; Lin, J.; Xiao, Y.; Zhou, Z.; Shan, R.; Gao, T.; Zhu, J.; Liu, W.; Yu, Y.; Zhang, W. A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges. arXiv 2025, arXiv:2508.05668.
29. Auer, C.; Lysak, M.; Nassar, A.; Dolfi, M.; Livathinos, N.; Vagenas, P.; Berrospi Ramis, C.; Omenetti, M.; Lindlbauer, F.; Dinkla, K.; et al. Docling Technical Report. arXiv 2024, arXiv:2408.09869.
30. Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 1960, 20, 37–46.
31. Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159–174.
32. Bass, L.; Clements, P.; Kazman, R. Software Architecture in Practice, 4th ed.; SEI Series in Software Engineering; Addison-Wesley: London, UK, 2021.
33. Doshi-Velez, F.; Kim, B. Towards A Rigorous Science of Interpretable Machine Learning. arXiv 2017, arXiv:1702.08608.

Figure 1. Modelling as a generative transformation process. Unstructured evidence is progressively transformed into a structured ecosystem map through five computational stages that operationalise the five-stage methodological lifecycle [1]. The modelling methodology constrains admissible outputs at each stage, while human oversight governs the integration of proposed model modifications.

Figure 2. Multi-agent architecture aligned with modelling stages. The orchestrator coordinates specialised agents, each responsible for a distinct transformation stage. Change proposals are gated through human review before integration into the structured ecosystem map. Solid arrows denote the primary transformation pipeline; dashed arrows and dashed boxes denote auxiliary context access and provenance or record-keeping flows.

Figure 3. Hybrid evaluation framework for generative modelling pipelines. Three complementary assessment pillars converge to characterise both operational reliability and semantic correctness of modelling outputs.

Table 1. Comparison of the present architecture with the closest prior schema-guided and agentic information extraction systems across five dimensions of a methodology-grounded modelling pipeline ¹.

Dimension	OneKE	StructSense	AgenticIE	KARMA	MAO	Present Work
Lifecycle coverage	Extraction and error correction only	Extraction, ontology alignment, judging, HITL feedback	Single-document extraction (KIE/QA)	Ingestion through KG integration	Single-text to process model generation	Boundary, retrieval, conversion, extraction, validation, editing
Ontology enforcement	Output format specs (Pydantic/JSON); KB-backed schema retrieval	Post-hoc alignment to formal ontologies via vector DB	JSON schema templates (fixed and open)	Schema alignment agent maps to existing KG types	BPMN format constraints injected via prompts	Schema-bound typed output constraints; methodology expert validation
Provenance and gating	None; case repository tracks history	Source sentence tracking; HITL via feedback agent	Verification step; no HITL	Cross-agent verification; optional manual review escalation	No provenance; no HITL	Source-linked provenance; staged proposals with human accept/reject
Agent decomposition	3 agents: schema, extract, reflect	4 agents: extract, align, judge, feedback	Single agent with tool-calling loop	9 agents by pipeline stage; central controller	3 roles across 4 phases (generate, refine, review, test)	5 agents and orchestrator aligned with methodology stages
Evaluation	2 IE benchmarks; ablation study	3 tasks, 3 models; P/R/F1 and concept alignment	1 domain; schema adherence and exact match; 3 model variants	3 domains, 3 models; 1200 articles; LLM-verified correctness	4 datasets; distance to reference BPMN; ablation study	4382 runs, 34 models; operational metrics, LLM judge, human $κ$ validation

¹ HITL = human-in-the-loop; KIE = key information extraction; QA = question answering; P/R/F1 = precision/recall/F1-score; KG = knowledge graph.

Table 2. Formal ontology for ecosystem modelling. Each construct type is defined by its structural constraints and admissible relationships.

Construct	Definition	Structural Constraints
Actor	An identifiable organisational entity or institution participating in the ecosystem	Must have a unique identifier; classified as either an active actor or a passive object, where objects represent entities such as infrastructure components that participate without exercising independent initiative
Role	A function or position that an actor assumes within the ecosystem	An actor may hold multiple roles; multiple actors may share the same role
Interaction	A structured relationship between two roles, typed as one of five categories, namely monetary value, intangible value, goods, information, or data exchange	Must reference two existing participants; typed and optionally directional; must cite evidential source
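
The structural constraints in Table 2 lend themselves to a typed schema. The sketch below is one possible encoding, offered as an illustration rather than as the schema used in the released artefacts: all class, field, and method names (`EcosystemMap`, `add_interaction`, and so on) are hypothetical, and the example roles and evidence string are invented. It enforces unique actor identifiers, many-to-many actor-role assignment, and interactions that are typed, reference two existing roles, and cite a source.

```python
from dataclasses import dataclass, field
from enum import Enum


class InteractionType(Enum):
    MONETARY_VALUE = "monetary value"
    INTANGIBLE_VALUE = "intangible value"
    GOODS = "goods"
    INFORMATION = "information"
    DATA = "data"


@dataclass(frozen=True)
class Actor:
    actor_id: str
    name: str
    is_object: bool = False  # True for passive objects such as infrastructure


@dataclass(frozen=True)
class Role:
    role_id: str
    name: str


@dataclass(frozen=True)
class Interaction:
    source_role: str
    target_role: str
    kind: InteractionType
    evidence: str            # citation of the evidential source
    directional: bool = True


@dataclass
class EcosystemMap:
    actors: dict = field(default_factory=dict)
    roles: dict = field(default_factory=dict)
    assignments: set = field(default_factory=set)  # (actor_id, role_id) pairs
    interactions: list = field(default_factory=list)

    def add_actor(self, actor: Actor) -> None:
        if actor.actor_id in self.actors:
            raise ValueError(f"duplicate actor id: {actor.actor_id}")
        self.actors[actor.actor_id] = actor

    def add_role(self, role: Role) -> None:
        self.roles.setdefault(role.role_id, role)

    def assign(self, actor_id: str, role_id: str) -> None:
        # Many-to-many: an actor may hold several roles and roles may be shared.
        if actor_id not in self.actors or role_id not in self.roles:
            raise ValueError("unknown actor or role")
        self.assignments.add((actor_id, role_id))

    def add_interaction(self, interaction: Interaction) -> None:
        for role_id in (interaction.source_role, interaction.target_role):
            if role_id not in self.roles:
                raise ValueError(f"interaction references unknown role: {role_id}")
        if not interaction.evidence.strip():
            raise ValueError("interaction must cite an evidential source")
        self.interactions.append(interaction)


# Hypothetical usage with invented roles and evidence.
demo = EcosystemMap()
demo.add_actor(Actor("a1", "Example Utility"))
demo.add_role(Role("r1", "Electricity supplier"))
demo.add_role(Role("r2", "Consumer"))
demo.assign("a1", "r1")
demo.add_interaction(Interaction("r1", "r2", InteractionType.GOODS, "example.pdf, p. 2"))
print(len(demo.interactions), "interaction(s) accepted")
```

Rejecting invalid elements at insertion time is one way to keep the map within the admissible constructs; an alternative would be to validate staged proposals in bulk before human review.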

Table 3. Summary of evaluation test cases. Each case isolates a distinct agent role within the multi-agent architecture.

Test Case	Agent Tested	Input and Ground Truth	Models/Runs
TC-O: Orchestrator	Orchestrator with fixed sub-agents	User-level task; 6 ground-truth actors in Danish energy domain	25/261
TC-G: Entity generation	Ecosystem editor	Multi-part instruction with 6 reference documents; 18 ground-truth entities	26/390
TC-D: Document extraction	Document analyser	Purpose-built PDF with prose text and embedded diagram; 42 ground-truth entities	23/1187
TC-S: Search relevance	Search agent	Relevant vs. irrelevant web page; binary classification	30/432
TC-M: Methodology expert	Methodology expert	Boundary definition task with 2 reference documents; rubric-based quality assessment	25/394

Table 4. Summary of empirical performance across evaluation test cases. Scores are per-model means and ranges indicate per-model variation; scoring procedures differ by test case (Section 5.4). The 95% confidence intervals are hierarchical-bootstrap percentile intervals, omitted for TC-S, whose binary outcome is a basic operational check rather than a discriminative benchmark.

Test Case	Task Category	Mean Score	95% CI	Range	Models
TC-O: Orchestrator	Pipeline F1	0.66	$[0.51, 0.79]$	0.00 to 1.00	25
TC-O: Orchestrator	Successful completion rate	65.5%
TC-G: Entity generation	Entity completion	68.0%	$[52.6, 82.0]$	0 to 100%	26
TC-G: Entity generation	Reference integration	4.0/6
TC-D: Document extraction	Total score	0.640	$[0.59, 0.69]$	0.32 to 0.80	23
by entity type	Actor identification	0.786	$[0.72, 0.85]$	0.45 to 0.99
	Role extraction	0.846	$[0.80, 0.89]$	0.62 to 1.00
	Interaction extraction	0.431	$[0.38, 0.48]$	0.16 to 0.58
	Attribution	0.499	$[0.43, 0.56]$	0.06 to 0.72
by modality	Text-sourced	0.947	$[0.92, 0.97]$	0.78 to 1.00
	Image-sourced	0.470	$[0.38, 0.55]$	0.05 to 0.72
TC-S: Search relevance	Classification accuracy	100%, 27 of 30			30
TC-M: Methodology expert	Weighted quality score	0.877	$[0.85, 0.90]$	0.75 to 0.99	25

Table 5. Document extraction performance across entity types and source modalities for TC-D. Twelve models spanning the full performance range are shown; complete results for all 23 models appear in Table A1.

Model	Actor	Role	Inter.	TEXT	IMAGE
Claude Opus 4.5	0.936	0.967	0.567	0.986	0.708
Claude Haiku 4.5	0.991	0.977	0.501	0.988	0.716
GPT-5.2	0.951	1.000	0.479	0.991	0.692
Gemini 2.5 Flash	0.890	0.962	0.569	0.997	0.668
Gemini 2.5 Pro	0.909	0.896	0.571	0.983	0.641
GPT-5-mini	0.941	0.954	0.415	0.984	0.621
Qwen3-VL 235B	0.925	0.933	0.519	0.967	0.662
Ministral 8B	0.899	0.934	0.438	0.923	0.628
Grok 4.1 Fast	0.750	0.819	0.397	0.967	0.393
GPT-5-nano	0.654	0.698	0.245	0.833	0.266
Kimi K2	0.500	0.671	0.299	0.925	0.088
GPT-4o-mini	0.449	0.624	0.164	0.775	0.052
Mean, $n = 23$	0.786	0.846	0.431	0.947	0.470

Cell shading indicates the score band: green, ≥0.90; light green, 0.70–0.89; yellow, 0.50–0.69; orange, 0.30–0.49; red, <0.30.
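
Because the cell shading is not visible in a plain-text rendering of Table 5, the band of any score can be recovered with a small helper. The sketch below is an assumption-laden illustration: the function name `score_band` is hypothetical, and the bands are treated as half-open intervals on the unrounded score, so a value such as 0.896 is assigned to the 0.70–0.89 band. The original shading may instead have been applied to two-decimal rounded values.

```python
def score_band(score: float) -> str:
    """Map a score in [0, 1] to the colour band named in the table note."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score >= 0.90:
        return "green"
    if score >= 0.70:
        return "light green"
    if score >= 0.50:
        return "yellow"
    if score >= 0.30:
        return "orange"
    return "red"


# Mean row of Table 5: Actor, Role, Inter., TEXT, IMAGE.
for label, value in [("Actor", 0.786), ("Role", 0.846), ("Inter.", 0.431),
                     ("TEXT", 0.947), ("IMAGE", 0.470)]:
    print(f"{label}: {value:.3f} -> {score_band(value)}")
```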

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Gärdström, H.F.; Jørgensen, B.N.; Ma, Z.G. Agentic Generative AI for Methodology-Grounded Modelling from Unstructured Documents: Design and Evaluation of a Multi-Agent Ecosystem Mapping Pipeline. Information 2026, 17, 570. https://doi.org/10.3390/info17060570

