Article

Game Knowledge Management System: Schema-Governed LLM Pipeline for Executable Narrative Generation in RPGs

1 Department of Computer Science and Artificial Intelligence, Dongguk University-Seoul, 30 Pildongro 1-gil, Jung-gu, Seoul 04620, Republic of Korea
2 Department of Autonomous Things Intelligence, Dongguk University-Seoul, 30 Pildongro 1-gil, Jung-gu, Seoul 04620, Republic of Korea
3 Department of Computer Science and Artificial Intelligence, College of Advanced Convergence Engineering, Dongguk University-Seoul, 30 Pildongro 1-gil, Jung-gu, Seoul 04620, Republic of Korea
* Author to whom correspondence should be addressed.
Systems 2026, 14(2), 175; https://doi.org/10.3390/systems14020175
Submission received: 3 January 2026 / Revised: 24 January 2026 / Accepted: 3 February 2026 / Published: 5 February 2026

Abstract

Procedural approaches have long been used in game development to reduce authoring costs and increase content diversity; however, traditional rule-based systems struggle to scale narrative complexity, whereas recent large language model (LLM)-based methods often produce outputs that are structurally invalid or incompatible with real-time game engines. This gap reflects a fundamental limitation in current practice: generative models lack systematic mechanisms for managing executable game knowledge rather than merely producing free-form narrative texts. To address this issue, we propose a Game Knowledge Management System (G-KMS) that reformulates LLM-based narrative generation as a structured knowledge management process. The proposed framework integrates knowledge grounding, schema-governed generation, normalization-based repair, engine-aligned knowledge admission, and application within a unified pipeline. The system was evaluated on a compact 2D Unity-based RPG benchmark using automated structural and semantic analyses, engine-level playability probes, and a controlled human player study. The experimental results demonstrated high reliability in knowledge admission, stable procedural structures, controlled expressive diversity, and a strong alignment between system-level metrics and player-perceived narrative quality, indicating that LLMs can function as dependable knowledge-construction components when embedded within a governed management pipeline. Beyond the evaluated RPG setting, this study suggests a practical and reproducible approach that may be extended to other executable systems, such as interactive simulations and training environments.

1. Introduction

Procedural content generation (PCG) has long supported scalable and replayable gaming experiences by reducing manual authoring. Classical PCG approaches—such as rule-based, template-driven, and planning-oriented systems—provide strong structural reliability and designer control but struggle to produce semantically rich quests, coherent character behaviors, and context-aware dialogue required by narrative-driven role-playing games (RPGs).
From a systems perspective, traditional game AI and PCG pipelines can be viewed as implicit knowledge management systems that encode game rules, world states, and behaviors through fixed symbolic representations and schemas. Although this paradigm ensures internal consistency and engine compatibility, it relies heavily on manual knowledge modeling and rigid abstractions, limiting adaptability and reuse as narrative scope expands.
Recent advances in large language models (LLMs) offer an alternative approach that enables data-driven generative knowledge construction. Prior work demonstrates that LLMs can synthesize characters, quests, narratives, and dialogue with high linguistic fluency and diversity, supporting applications ranging from conceptual game design to LLM-driven NPCs and open-ended simulations [1,2,3,4]. However, when integrated into real game production pipelines, LLM-based approaches face persistent challenges; free-form outputs often violate structural constraints, produce inconsistent world knowledge or personas, and fail to satisfy engine-level parsing and execution requirements. Consequently, most LLM-for-game systems lack reproducible mechanisms for engine-level validation and deployment and are primarily evaluated through text-level analysis or subjective judgment rather than verified runtime execution [5,6,7].
Recent LLM-driven systems—such as co-creative narrative tools, template-augmented generation, and simulated societies—have expanded expressive capabilities [2,4,8]. However, generated content is often demonstrated in sandboxed or illustrative settings rather than systematically verified as executable artifacts, leaving a persistent gap between generative narrative knowledge and deployable in-game content.
To address this gap, we introduce a Game Knowledge Management System (G-KMS) that governs the construction, validation, and application of generative narrative output as executable game knowledge. A G-KMS manages artifacts, such as characters, quests, dialogue logic, and interaction rules, throughout the entire lifecycle from generation to runtime execution, explicitly enforcing executability, verifiability, and engine compatibility.
In this paper, we use the term “knowledge management system” to emphasize a lifecycle- and governance-oriented treatment of executable game knowledge artifacts (e.g., creation, standardization, validation/admission, storage, and reuse) within a runtime-constrained production setting. We do not claim to cover the full organizational management-system scope defined in standards such as ISO 30401 [9] (e.g., leadership, culture, and enterprise-wide KM processes); instead, our focus is on the technical and operational governance layer that enables reliable knowledge admission and deployment in an engine-executable form. Human-centered curation and organizational KM processes are important extensions and are discussed as future work.
While many of the individual techniques employed are not algorithmically novel in isolation, our contribution lies in reframing their integration through an explicit knowledge management lens as a governed, lifecycle-oriented system.
Building on this concept, we propose an LLM-chained G-KMS framework, a structured multistage pipeline that transforms unstructured narrative inputs into engine-executable knowledge artifacts for 2D Unity-based RPGs under explicit governance and validation. Rather than introducing a new generative algorithm, this framework formalizes how existing generative and verification practices can be organized into a coherent executable knowledge management process. Extensive experiments and a controlled human player study demonstrate the execution reliability and gameplay quality of the generated artifacts.
The main contributions of this work are threefold:
1. We formalize the concept of a Game Knowledge Management System (G-KMS) and articulate a conceptual and operational lifecycle for executable game knowledge in LLM-based generation pipelines.
2. We propose an LLM-chained schema-governed pipeline that produces structurally valid, semantically grounded, and engine-executable narrative knowledge.
3. We establish a multidimensional evaluation methodology that combines engine-level execution and human player interaction for reproducible system-level validation.
The remainder of this paper is organized as follows: Section 2 reviews related work on knowledge representation in games, narrative PCG, LLM-based generation, and runtime-integrated systems. Section 3 presents the proposed G-KMS framework. Section 4 presents the experimental results and analysis. Finally, Section 5 concludes the paper and discusses future research directions.

2. Related Work

2.1. Knowledge Representation and Management in Game Systems

Early research on intelligent game systems relied heavily on explicit knowledge representations to encode game rules, world states, and agent behaviors in forms suitable for symbolic reasoning and algorithmic control. Rather than learning representations from data, these systems emphasize handcrafted abstractions, domain-specific ontologies, and symbolic models, providing strong guarantees of correctness, playability, and internal consistency at the cost of substantial authoring effort and limited adaptability as game complexity increases.
A representative body of work focuses on symbolic knowledge representations for agent reasoning and decision-making. In the context of real-time strategy (RTS) games, Ontañón et al. [10] survey AI techniques based on explicit state abstractions, hierarchical representations, and symbolic encodings of strategies, goals, and resources. These approaches function as structured knowledge bases that support long-term reasoning over complex game states but become increasingly difficult to maintain and extend as environments scale in size and diversity.
Plan-based and behavior-oriented architectures further illustrate this paradigm. Classical planning systems encode domain knowledge as reusable plans, action schemas, and goal hierarchies, enabling agents to respond coherently to dynamic environments. The architecture proposed by Young et al. [11] demonstrates how plan representations persistently link perceptions, actions, and goal reasoning. Although effective in constrained domains, such systems typically assume fixed ontologies and predefined action spaces, thereby limiting their applicability to narrative-rich or content-diverse game worlds.
Explicit knowledge representation also plays a central role in procedural content generation. Search-based PCG approaches, as surveyed by Togelius et al. [12], formulate content generation as an optimization over explicitly defined representations of levels, rules, and constraints, embedding gameplay knowledge directly into the evaluation functions. Although this enables the systematic exploration of large design spaces, the resulting representations are often tightly coupled to specific mechanics or genres, restricting reuse across contexts.
Mixed-initiative design systems further integrate symbolic representations and constraint solving to manage design knowledge. Tanagra [13], for example, encoded gameplay knowledge through beat-based abstractions, hierarchical geometry patterns, and numerical constraints, forming an internal knowledge management layer that ensures playability while allowing real-time designer control. However, such systems remain largely focused on spatial and mechanical domains, and extending them to higher-level narrative logic or evolving world states remains challenging.

2.2. Symbolic and Template-Based Narrative PCG

Early procedural narrative generation systems were predominantly based on symbolic rule-based and planning-oriented paradigms that relied on handcrafted logic, symbolic operators, and deterministic event transitions to assemble quests. By constructing narratives from predefined symbolic components, these approaches provide strong structural guarantees, including syntactic validity, solvability, and predictable behavior. However, their reliance on manually authored structures constrains expressive diversity and limits their ability to adapt to evolving game states or support varied character behaviors.
Representative examples include mechanics-to-template mapping and planning-based quest-generation methods. Alexander et al. [14] generated quests by embedding domain actions—such as gathering or escorting—into fixed narrative structures, ensuring reliability at the cost of bounded variation. Similarly, Ammanabrolu et al. [15] employed hierarchical goal-tree planning for text-adventure quests to guarantee logical correctness but required manual specification of all branches, limiting scalability as narrative complexity increases. Broader analyses of symbolic PCG pipelines confirm these trade-offs: rule-based RPG systems exhibit high structural robustness but demand substantial authoring efforts to achieve narrative richness [16], and classical PCG techniques remain more effective for geometric or structural generation than for modeling narrative coherence or emotionally grounded interactions [17].
Template-driven narrative systems have emerged to relax the rigidity of purely rule-based approaches while preserving predictable structures. These systems rely on author-defined templates or scene schemas into which variable elements—such as characters, items, or events—are procedurally inserted. Although this strategy maintains consistency and prevents broken quest logic, its expressive range remains constrained by the available template inventory.
Hybrid and human-in-the-loop (HITL) approaches further extend this paradigm by incorporating generative models and designer oversight. For example, the PANGeA system [5] combines handcrafted scene templates with LLM-based rewriting to generate turn-based narrative sequences, using templates as structural anchors to ensure a coherent event flow. HITL pipelines demonstrate that iterative annotation and manual correction can improve narrative quality [18], whereas co-creative systems enable users and models to collaboratively shape the story progression [19]. Similarly, natural-language–driven procedural systems allow for high-level guidance over generations [20]. However, these approaches depend on predefined structures or sustained human involvement, limiting the autonomy and scalability of large-scale narrative generation.

2.3. LLM-Based Narrative Generation and Dialogue Control

LLMs have significantly expanded the expressive capacity of procedural narrative generation, enabling the synthesis of character backstories, quest concepts, branching dialogues, and interactive story progression with high linguistic fluency. Prior work demonstrates that high-level narrative elements—such as characters, settings, and story summaries—can be generated directly from natural-language prompts [1], whereas interactive frameworks further show that LLMs can improvise story beats between characters in real time, supporting an open-ended narrative flow [2].
Open-ended simulation research provides a complementary perspective on LLM generativity. Studies by Park et al. [4] and Li et al. [8] showed that LLM-driven agents can exhibit persistent social behaviors and emergent routines within sandbox environments, interacting through unconstrained natural-language exchanges. These systems highlight the flexibility of LLMs in modeling dynamic social interactions and long-horizon behavior but typically operate without explicit structural representations aligned with game engine execution.
To address the issues of coherence and control, several studies have introduced additional structures into LLM-based generation. The methods proposed by Steven et al. [21] and Li et al. [22] mapped natural-language descriptions onto semantic world models or character personality classes, thereby enabling more consistent entity representations. Dialogue-focused studies further explored techniques such as persona grounding, semantic graph constraints, and iterative rewriting to improve local coherence and conversational stability [3,6,7].
Despite these advances, maintaining long-term persona consistency and narrative control remains a central challenge for LLM-generated dialogue, particularly in RPG-style interactions that require a stable character identity across extended exchanges. Shuster et al. [23] showed that dialogue models may deviate from assigned personas or contradict prior statements in long contexts. To mitigate such issues, post hoc correction frameworks—such as Generate–Delete–Rewrite (GDR) [24]—and training-based strategies, including persona-focused fine-tuning [25,26], have been proposed to improve identity stability. Additional studies have investigated NPC-centric interaction shaping to enhance conversational grounding [27].

2.4. Runtime-Integrated PCG and Engine-Level Systems

Recent advances in LLM-driven agents have shifted PCG from static artifact creation to interactive runtime behavior modeling. Nicholas et al. [28] observed that many behavior-generation systems emphasize high-level reasoning or intent simulation, producing outputs that are conceptually coherent but not expressed in formats directly consumable by commercial game engines. Consequently, such systems demonstrate autonomous decision-making or narrative improvisation while remaining decoupled from engine-native execution structures, including state machines and event-driven logic.
Embodied agent research has further explored runtime generativity. Wang et al. [29] show that LLM-guided agents in Minecraft can perform autonomous exploration, tool construction, and long-horizon goal pursuit. However, these behaviors are mediated through external APIs rather than being integrated into engine-level execution pipelines. Similarly, strategic multi-agent systems—such as those developed by the Meta-Fundamental AI Research Diplomacy Team [30]—exhibit sophisticated coordination and negotiation abilities but operate entirely through symbolic communication channels without producing structured, engine-executable gameplay artifacts.
Additional studies have investigated long-term NPC behavior and continuity. Zheng et al. [31] studied the mechanisms for maintaining persistent NPC memory, enabling narrative reinforcement across interactions. These representations are typically encoded as free-form text or loosely structured graphs rather than as validated engine-level data structures. Complementary research by Sudhakaran et al. [32] demonstrated that LLMs can generate tile-based platformer levels under grammatical constraints, providing effective spatial PCG, but without addressing narrative logic, quest sequencing, or dialogue coordination.

2.5. Summary

The literature reviewed in this section reveals a fragmented landscape of narrative PCG and game AI systems, in which different approaches address isolated aspects of content creation but rarely support the full lifecycle of executable game knowledge. Symbolic and template-based methods emphasize structural reliability and control yet remain constrained by fixed representations and manual authoring. LLM-based narrative and dialogue systems substantially improve expressive capacity but typically operate without formal structures that ensure consistency, state coherence, or compatibility with downstream execution. Runtime-integrated agents and open-ended simulations further demonstrate dynamic behavior and long-term interaction but largely function outside engine-native pipelines and validated data representations.
As summarized in Table 1, existing approaches tend to prioritize one of three dimensions—structural control, narrative expressiveness, or interactive behavior—and seldom integrate these properties within a unified, verifiable workflow. From a knowledge management perspective, most systems can therefore be viewed as partial or implicit solutions; they generate rules, narratives, or behaviors but do not provide an end-to-end mechanism for constructing, validating, and applying game knowledge artifacts in executable form.
This gap motivates the need for a Game Knowledge Management System (G-KMS) that explicitly treats narrative elements—such as characters, quests, dialogue graphs, and interaction rules—as structured knowledge artifacts and manages them across the generation, verification, and runtime application stages. In response, the following section introduces a G-KMS-oriented framework that integrates LLM-based generative flexibility with schema-governed knowledge representation, automated structural repair, and engine-level validation, enabling the reliable transformation of generative outputs into executable game content.

3. Proposed Method

3.1. Architectural Overview

We propose an LLM-chained Game Knowledge Management System (G-KMS) that integrates schema-governed generation, grounded world knowledge, and engine-level validation to produce executable narrative content. This framework treats narrative task generation as a structured pipeline in which narrative elements are constructed, validated, and applied as engine-compatible knowledge artifacts.
From a research methodology perspective, this work follows a design science research (DSR) paradigm, in which a novel artifact is constructed to address a practical problem and systematically evaluated through multiple complementary methods. In our study, the identified problem is the lack of reliable mechanisms for admitting LLM-generated narrative content into real game engines; the proposed G-KMS constitutes the designed artifact; the three-stage pipeline represents its operational realization; and the engine-level integration, automated analyses, and human player study serve as the demonstration and evaluation phases of the DSR lifecycle.
The system is organized as a three-stage pipeline corresponding to the lifecycle of executable game knowledge, ensuring alignment between generation, verification, and execution.
The first stage, Data and World Modeling, establishes a grounded knowledge substrate for generation. Narrative entities are formalized into a world bible, whereas engine assets—including prefabs (predefined reusable game object templates in Unity), scene layouts, and walkable coordinates (precomputed navigable grid positions allowed for character placement and movement in the game map)—are converted into symbolic resources. A unified task schema specifies the formal structure of characters, quests, state transitions, and dialogue graphs, providing a finite and internally consistent representation of the game world.
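To make the notion of a unified task schema concrete, the following Python sketch shows one possible shape for such a schema. All field names (e.g., `prefab_id`, `spawn_pos`, `transitions`) are illustrative assumptions for this sketch, not the paper's actual specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Character:
    """A narrative character bound to enumerated engine resources."""
    char_id: str
    prefab_id: str               # must appear in the prefab whitelist
    spawn_pos: Tuple[int, int]   # must be a precomputed walkable coordinate

@dataclass
class DialogueNode:
    """One node of a branching dialogue graph."""
    node_id: str
    speaker: str
    text: str
    next_ids: List[str] = field(default_factory=list)  # branch targets

@dataclass
class Quest:
    """A quest as a finite state structure with dialogue attached."""
    quest_id: str
    objective: str
    states: List[str]            # finite, enumerated quest states
    transitions: Dict[str, str]  # state -> next state
    dialogue: List[DialogueNode] = field(default_factory=list)
```

In the actual system these types would mirror the JSON task schema consumed by both the LLM prompt and the Unity-side loader, keeping the two ends of the pipeline in sync.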
The second stage, LLM-Constrained Generation, constructs executable knowledge artifacts under explicit structural and semantic constraints. Prompts integrate the schema specifications, enumerated asset identifiers, spatial constraints, and curated exemplars. The LLM produces structured JSON representations that are subsequently normalized and validated to enforce schema compliance, state reachability, and consistency of world references. Valid outputs are retained as executable artifacts, whereas invalid outputs are repaired or discarded.
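State reachability, one of the validation checks mentioned above, can be verified with a simple graph traversal. Representing the transitions as a mapping from each state to a list of successor states is an assumption made for this sketch.

```python
from collections import deque

def reachable_states(transitions, start):
    """Breadth-first traversal over a quest's state-transition graph."""
    seen, frontier = {start}, deque([start])
    while frontier:
        for nxt in transitions.get(frontier.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

def check_reachability(states, transitions, start="start"):
    """Return the states that can never be reached from the start state."""
    return sorted(set(states) - reachable_states(transitions, start))
```

An empty result means every declared state is reachable; any returned state names would indicate a quest graph that should be repaired or discarded.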
The third stage, Evaluation and Execution, applies the validated knowledge artifacts within a Unity runtime environment. Characters, quest graphs, and dialogue flows are instantiated according to the prescribed state logic, whereas an evaluation pipeline performs structural checks, semantic consistency analyses, and engine-level playability tests to verify correct behavior during gameplay.
Together, these stages form an end-to-end pipeline for managing the LLM-generated game knowledge from grounded construction to verified runtime execution, as illustrated in Figure 1.

3.2. Data and Asset Standardization

To enable controllable and engine-compatible knowledge construction, G-KMS establishes a standardized representation that aligns narrative semantics with Unity engine assets. This grounding process, illustrated in Figure 2, integrates semantic extraction from narrative text with asset formalization and schema definition to produce the structured knowledge resources consumed by downstream generation modules.
First, domain-relevant entities are extracted from the narrative source material using prompt-guided LLM processing. Characters, locations, factions (an in-game affiliation group representing a character’s role or allegiance), items, and event descriptors are identified and normalized using alias resolution, deduplication, and attribute completion. These entities are organized into a world bible that encodes semantic concepts as structured entries with consistent categories and attributes, constraining the vocabulary and relationships available during subsequent generation.
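The alias-resolution and deduplication step can be sketched as a canonicalizing merge; the entity records and the alias table below are hypothetical examples, not the paper's actual data.

```python
def resolve_aliases(entities, alias_map):
    """Map raw entity names to canonical forms and merge duplicate entries,
    accumulating attributes from every mention (attribute completion)."""
    merged = {}
    for ent in entities:
        canonical = alias_map.get(ent["name"], ent["name"])
        entry = merged.setdefault(canonical, {"name": canonical, "attrs": {}})
        entry["attrs"].update(ent.get("attrs", {}))
    return list(merged.values())

# Two mentions of the same character under different names collapse into
# one world-bible entry carrying both attributes.
raw = [
    {"name": "Old Miller", "attrs": {"faction": "Village"}},
    {"name": "Miller", "attrs": {"role": "merchant"}},
]
world_bible = resolve_aliases(raw, {"Old Miller": "Miller"})
```

In the full system the alias table itself would be produced by the prompt-guided LLM pass rather than hand-written as here.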
Unity development assets are formalized into symbolic representations, including spatial layouts, permissible character and item templates, and schema-level interaction components that link semantic concepts to engine-executable constructs.
These semantic and asset-level resources are then consolidated into a unified task schema that specifies the formal structure of all generable JSON artifacts, including character definitions, quest objectives, state transitions, item entities, and branching dialogue nodes. The resulting knowledge base—comprising the world bible, walkable position maps, prefab whitelists, and task schema—provides a consistent representational foundation that aligns narrative semantics with engine constraints and supports constrained LLM-based generation.
An example subset of these standardized representations is shown in Figure 3, which illustrates how semantic attributes, faction information, and prefab identifiers are jointly encoded.

3.3. Schema-Constrained LLM Generation Process

To generate narrative tasks that are semantically coherent and engine-executable, the proposed G-KMS employs a schema-constrained LLM generation process in which language models operate under explicit structural and semantic constraints. As illustrated in Figure 4, the process integrates prompt construction, structured decoding, normalization, and validation into a unified generation pipeline.
Generation begins with the construction of a composite prompt that guides the controlled synthesis of the executable game knowledge. The prompt integrates multiple functional layers, including structural constraints, semantic contexts, few-shot exemplars, and task-level instructions. The full prompt specification is provided in Appendix A.
Conditioned by this prompt, the LLM produces a JSON-only structured output that conforms to the predefined task schema. The generated output is then passed through a normalization module that deterministically resolves common inconsistencies, including missing fields, invalid prefab references, out-of-bound coordinates, and misaligned dialogue branch identifiers.
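A minimal sketch of such a deterministic normalization step, assuming a dictionary-based task representation and hypothetical field names (`prefab_id`, `spawn_pos`), might look as follows.

```python
def normalize_task(task, prefab_whitelist, walkable, defaults):
    """Deterministically repair common defects in LLM-generated task JSON."""
    for key, value in defaults.items():
        task.setdefault(key, value)          # fill missing top-level fields
    for char in task.get("characters", []):
        if char.get("prefab_id") not in prefab_whitelist:
            # Replace invalid prefab references with a deterministic fallback.
            char["prefab_id"] = sorted(prefab_whitelist)[0]
        pos = tuple(char.get("spawn_pos") or (0, 0))
        if pos not in walkable:
            # Snap out-of-bound coordinates to the nearest walkable tile
            # (Manhattan distance over the precomputed walkable set).
            pos = min(walkable,
                      key=lambda w: abs(w[0] - pos[0]) + abs(w[1] - pos[1]))
        char["spawn_pos"] = pos
    return task
```

The real normalization module also realigns dialogue branch identifiers; that logic is omitted here for brevity.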
Following normalization, the outputs undergo validation and admission stages, which act as rigorous verification gates. This stage includes a smoke test to verify Unity loadability and a validation test that enforces schema compliance and engine-aligned constraints. Outputs that fail validation are logged and discarded, while passing artifacts are stored as executable task files. The detailed normalization and validation rules are provided in Appendix B.
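The two-gate admission flow described above might be organized as follows; the `required_fields` and `prefab_whitelist` parameters are illustrative stand-ins for the full rule set in Appendix B, and JSON parsing here is only a proxy for Unity loadability.

```python
import json

def smoke_test(raw_text):
    """Gate 1: is the artifact loadable at all (JSON parse as a proxy)?"""
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        return None

def validation_test(task, required_fields, prefab_whitelist):
    """Gate 2: schema compliance plus engine-aligned reference checks."""
    errors = [f"missing field: {f}" for f in required_fields if f not in task]
    for char in task.get("characters", []):
        if char.get("prefab_id") not in prefab_whitelist:
            errors.append(f"invalid prefab: {char.get('prefab_id')}")
    return errors

def admit(raw_text, required_fields, prefab_whitelist):
    """Admit an artifact only if it passes both gates; otherwise report errors."""
    task = smoke_test(raw_text)
    if task is None:
        return None, ["smoke test failed"]
    errors = validation_test(task, required_fields, prefab_whitelist)
    return (task, []) if not errors else (None, errors)
```

Failed artifacts would be logged alongside their error lists, matching the discard-and-log behavior described in the text.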
Representative executable knowledge artifacts produced by this process are shown in Figure 5, Figure 6 and Figure 7, including standardized character definitions, schema-grounded item behaviors, and branching dialogue structures. A visualized example of the executable Quest JSON structure and its state transition logic is provided in Appendix C.

3.4. Evaluation Pipeline and Unity Implementation

To assess whether the executable knowledge artifacts produced by the proposed G-KMS function correctly in a real game environment, Stage 3 provides an integrated evaluation-and-execution layer coupled with a Unity-based implementation (Figure 8).
In the Unity runtime path, the validated task JSON files are loaded with a C# quest loader that mirrors the task schema, instantiating characters, quest graphs, and dialogue flows within existing 2D RPG systems. Gameplay execution produces runtime logs and diagnostic traces for downstream analysis. The tile-map regions used for execution are shown in Figure 9.
In parallel, the offline evaluation path aggregates batch-level statistics and evaluation summaries from the same validated artifacts, whereas the LLM-based self-evaluation and WebGL-based player studies provide complementary qualitative signals. Detailed definitions of each metric and the corresponding experimental results are presented in Section 4.

4. Experimental Results and Analysis

In this section, we evaluate the proposed system across six complementary dimensions: structural validity, textual diversity, task-graph integrity, engine-level compatibility, model-based narrative assessment, and human player experience. The sampling temperature was treated as a bounded perturbation to probe robustness rather than as the primary variable of interest. Together, these evaluations examined whether the system consistently produces executable, coherent, and engaging narrative tasks suitable for deployment in a Unity-based RPG environment. All narrative tasks were generated using GPT-4o as the backbone generator model, while LLM-based self-evaluation was conducted using a lighter, separate evaluator model (GPT-4o-mini) to reduce direct self-consistency bias.

4.1. Structural Validity

Structural validity evaluates whether the generated tasks can be safely admitted to a Unity runtime. Two checks were performed: a smoke test, which verified runtime loadability, and a validation test, which enforced schema compliance and engine-aligned semantic constraints. These checks target common failure modes in LLM-generated structured outputs, including malformed fields, invalid prefab references, and inconsistent quest transitions [33,34].
Table 2 summarizes the structural validity across system iterations. Early versions achieved partial runtime loadability but failed validation, indicating that prompt-based control alone is insufficient. Introducing an explicit normalization layer (VLST3) enabled full structural admissibility by deterministically repairing missing fields, flagging inconsistencies, and addressing invalid references. Subsequent versions retained normalization as a safety layer while relaxing overly rigid post-processing and strengthening prompt guidance, resulting in stable validation performance at scale.
Across all configurations of the Final Generator, the smoke test pass rate remained at 100%, whereas the validation rates varied within a bounded range. This variation reflects a controlled trade-off between conservative generation and increased expressive variance rather than structural instability [35]. No configuration reintroduced catastrophic failure modes once normalization and schema enforcement were applied.
To further isolate the role of normalization under controlled conditions, we evaluated an ablation in which the final generator (T = 0.7) was executed without normalization while keeping the prompt and schema constraints unchanged. The results show that removing normalization leads to a complete loss of validation pass rate despite full runtime loadability, indicating that prompt- and schema-based control alone is insufficient to guarantee engine-aligned semantic consistency.
The error breakdown presented in Table 3 further supports this trend. Schema errors are eliminated early and remain absent throughout later iterations, whereas semantic- and prefab-related errors decrease progressively as normalization and prompt constraints are refined, reaching near-elimination in the final system. In contrast, disabling normalization causes semantic and prefab-related errors to reappear consistently. These results demonstrate that stable structural validity can be maintained under bounded generative variations when schema enforcement and normalization are combined with informed prompt control.

4.2. Textual Diversity and Redundancy

Textual diversity and redundancy evaluate whether the generated tasks maintain expressive variation without collapsing into repetitions or drifting into incoherent stochasticity. We assessed this property using complementary linguistic indicators, including token-level entropy, semantic similarity, text-length statistics, and embedding-based topic structures, following prior studies on controlled neural text generation [35,36].
Lexical diversity was measured using token entropy. As shown in Table 4, the entropy values remain within a bounded range across all configurations, indicating that the system avoids both lexical collapse and uncontrolled randomness. Lower-temperature settings produce more conservative phrasing, whereas higher-temperature configurations expand the stylistic range. Importantly, entropy increases smoothly rather than abruptly, suggesting that variation is regulated by schema constraints and prompt grounding rather than by unstructured noise.
We emphasize that higher entropy does not directly imply better narrative quality; instead, our objective is to identify a bounded regime that avoids both mode collapse and excessive stochastic noise, which is necessary for maintaining coherent and playable narrative content.
Table 4. Dialogue Diversity (Entropy).
Temp | Token Entropy H1 ↑ | Route Entropy ↑ | Interpretation
T0.3 | 1.71 | 0.49 | The dialogue is relatively stable but has the lowest diversity.
T0.7 1 | 2.45 | 0.75 | Diversity increases significantly, while the structure remains stable.
T1.0 | 2.65 | 0.76 | Exhibits the highest diversity but begins to approach the upper limit of randomness.
1 T0.7 achieves the optimal balance between structural stability and expressive diversity; ↑ denotes that higher entropy values indicate greater diversity.
Semantic redundancy was evaluated using pairwise Sentence-BERT similarity analysis [37]. As shown in Table 5, the near-duplicate rates remained low across all configurations, confirming that the system did not generate semantically equivalent tasks repeatedly. Even under higher expressive variance, the generated quests remained aligned with the intended themes and narrative roles.
Table 5. Semantic Redundancy.
Temp | Mean Similarity ↓ | Near-Duplicate Rate (>0.9) ↓ | Interpretation
T0.3 | 0.612 | 4.8% | A few very similar tasks remain.
T0.7 1 | 0.607 | 2.3% | Reduced semantic repetition; more dispersed expression.
T1.0 | 0.598 | 1.1% | Almost no repetition, but some tasks deviate from the theme.
1 T0.7 significantly reduces redundancy while maintaining a focused task theme; ↓ indicates that lower values correspond to reduced semantic redundancy.
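A minimal sketch of this redundancy analysis, assuming sentence embeddings are already available (e.g., from a Sentence-BERT model, which is out of scope here); the `redundancy_stats` helper is an illustrative assumption, with the 0.9 near-duplicate threshold taken from Table 5:

```python
import numpy as np

def redundancy_stats(emb, dup_threshold=0.9):
    """Mean pairwise cosine similarity and near-duplicate rate over task embeddings.

    `emb` is an (n, d) array of sentence embeddings; the embedding model itself
    (e.g., Sentence-BERT) is assumed to have been applied beforehand.
    """
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    iu = np.triu_indices(len(emb), k=1)  # upper triangle: each unordered pair once
    pair_sims = sim[iu]
    return float(pair_sims.mean()), float((pair_sims > dup_threshold).mean())

# Two near-identical vectors and one orthogonal one: 1 of 3 pairs is a near duplicate.
emb = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
mean_sim, dup_rate = redundancy_stats(emb)
assert abs(dup_rate - 1 / 3) < 1e-9
```

In the study's setting, low values of both statistics across temperature configurations are what support the claim that expressive variance does not collapse into semantic repetition.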
Text-length statistics provided complementary evidence. Table 6 shows moderate increases in the average token count with bounded variance, indicating that increased expressiveness primarily enriches dialogue and contextual detail rather than introducing structural noise or off-topic content.
Beyond local statistics, higher-level semantic structure was examined using Sentence-BERT embeddings combined with clustering and dimensionality reduction techniques, namely k-means clustering and t-SNE visualization [37,38,39]. As illustrated in Figure 10, the generated tasks consistently formed coherent thematic clusters across configurations. Medium-temperature settings preserved clear cluster boundaries while supporting internal variation, whereas higher-temperature settings produced more dispersed distributions with reduced thematic sharpness. Taken together, the entropy, semantic similarity, length distribution, and topic dispersion results indicate that the system maintained a stable expressive regime, supporting meaningful narrative variation without degeneration into repetition or semantic drift.
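The clustering and projection step can be sketched with scikit-learn; the cluster count, perplexity, random seed, and synthetic blob data below are illustrative choices, not values reported in the study:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def thematic_clusters(emb, n_clusters=3, seed=0):
    """Cluster task embeddings and project them to 2D for visual inspection.

    `emb` is an (n, d) array of sentence embeddings (e.g., Sentence-BERT);
    n_clusters and seed are illustrative, not the paper's settings.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(emb)
    # t-SNE perplexity must be smaller than the sample count; keep it modest
    # for compact benchmarks like the one evaluated here.
    coords = TSNE(n_components=2, perplexity=min(5, len(emb) - 1),
                  random_state=seed).fit_transform(emb)
    return labels, coords

rng = np.random.default_rng(0)
# Three synthetic "themes": tight Gaussian blobs in an 8-dimensional space.
emb = np.vstack([rng.normal(c, 0.05, size=(10, 8)) for c in (0.0, 1.0, 2.0)])
labels, coords = thematic_clusters(emb)
assert len(set(labels)) == 3 and coords.shape == (30, 2)
```

Plotting `coords` colored by `labels` yields the kind of thematic-cluster view shown in Figure 10.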

4.3. Task-Graph Metrics

Task-graph metrics evaluate whether the generated quests exhibit structurally coherent and navigable logic, independent of surface-level narrative variation. Each task is modeled as a directed acyclic graph composed of conditional transitions, choice nodes, and terminal states, following standard formulations in procedural quest generation research [14,40,41]. We assessed structural completeness using path length, branching ratio, dead ends, and ending counts.
The results summarized in Table 7 show a highly stable structural profile across all validated tasks. The average path length remained fixed at 3.0, reflecting the DSL-enforced progression from initialization through branching to resolution. This invariance indicates that the global quest structure is preserved across configurations.
Branching ratios indicate meaningful player choices without degenerating into linear chains or uncontrolled graph expansions [42]. Although handcrafted golden samples exhibited higher branching ratios, the generated tasks consistently achieved nontrivial branching within schema-defined limits, balancing structural expressiveness and predictability.
No generated quest contained dead ends or unreachable terminal states. As shown in Table 7, all branches terminated correctly, addressing a known challenge in automated narrative generation [14,41]. The generated tasks also produced a stable number of reachable endings (approximately five), supporting multiple outcomes without compromising solvability. Although this number is lower than that of the handcrafted sample, it reflects a deliberate trade-off favoring structural reliability over unconstrained branching complexity. These metrics indicate that the system consistently generates complete, solvable, and structurally well-formed quest graphs under bounded generative variation.
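A sketch of how such task-graph checks can be computed over a quest graph represented as an adjacency map. Node names follow the DSL states mentioned above, while the `graph_metrics` helper and its exact definitions (e.g., branching ratio as mean out-degree over reachable non-terminal nodes) are illustrative assumptions:

```python
def graph_metrics(edges, start, endings):
    """Dead-end, ending-reachability, and branching checks on a quest graph.

    `edges` maps node -> list of successor nodes; `endings` is the set of
    legal terminal states. Field names are illustrative, not the paper's DSL.
    """
    # Depth-first reachability from the start node.
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    # A dead end is a reachable non-ending node with no outgoing transition.
    dead_ends = [n for n in seen if not edges.get(n) and n not in endings]
    reachable_endings = [e for e in endings if e in seen]
    out_degrees = [len(v) for n, v in edges.items() if n in seen and v]
    branching = sum(out_degrees) / len(out_degrees) if out_degrees else 0.0
    return {"dead_ends": dead_ends,
            "reachable_endings": reachable_endings,
            "branching_ratio": branching}

quest = {"START": ["SEARCHING"],
         "SEARCHING": ["RETURNING", "FAIL"],
         "RETURNING": ["COMPLETE"]}
m = graph_metrics(quest, "START", endings={"COMPLETE", "FAIL"})
assert m["dead_ends"] == [] and sorted(m["reachable_endings"]) == ["COMPLETE", "FAIL"]
```

An empty `dead_ends` list and a nonempty `reachable_endings` list are the solvability conditions the study reports as holding for all validated quests.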

4.4. Engine-Level Playability Proxies

Engine-level playability proxies evaluate whether the generated tasks remain compatible with the implicit engine semantics prior to full-runtime execution. In addition to formal schema validation, we assessed engine-aligned behavior using indirect indicators of spatial placement stability and prefab usage, which capture common failure modes related to coordinate legality and asset conventions in PCG pipelines [43].
Spatial placement behavior was analyzed using coordinate heat maps aggregated across all validated tasks. As shown in Figure 11, the generated entities consistently occupied valid walkable regions and avoided non-navigable areas or boundary artifacts. Spatial distributions remained coherent and interpretable across configurations, with placements clustering around semantically meaningful regions rather than random dispersion.
The variations in spatial dispersion reflect controlled exploration rather than instability. Lower-variance configurations produced tightly clustered placements, whereas higher-variance settings broadened the spatial coverage within the same navigable regions. Importantly, no configuration introduced out-of-bound coordinates or illegal placements, indicating that spatial constraints were consistently respected across generation runs. The origin (0,0) corresponds to the fixed player spawn position in the Unity scene.
Prefab usage provides a complementary indicator of engine compatibility. As summarized in Figure 12, prefab references exhibited a stable core distribution corresponding to common quest mechanics alongside a controlled long-tail expansion that increases visual and role diversity. Across all evaluated tasks, no illegal, mistyped, or out-of-schema prefabs appeared, and pathological over-repetition—such as dominance by a single prefab—was avoided [44]. Spatial stability and disciplined prefab usage indicate that the generated tasks remain aligned with engine-level constraints under bounded stochastic variation, thus preserving spatial coherence and asset correctness before runtime execution [45].
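A minimal sketch of such pre-runtime probes, assuming a simplified entity list, a walkable-cell set, and a prefab whitelist; the `admission_probe` helper and its field names are illustrative, not the paper's exact schema:

```python
def admission_probe(task, walkable, prefab_whitelist):
    """Pre-runtime playability proxies: coordinate legality and prefab usage.

    `task` mirrors a simplified slice of a quest JSON; field names are
    illustrative assumptions rather than the study's schema.
    """
    errors = []
    for ent in task["entities"]:
        if tuple(ent["position"]) not in walkable:
            errors.append(f"{ent['id']}: position {ent['position']} not walkable")
        if ent["prefab"] not in prefab_whitelist:
            errors.append(f"{ent['id']}: prefab '{ent['prefab']}' not whitelisted")
    return errors

# A 10x10 walkable grid and a two-entry whitelist, both illustrative.
walkable = {(x, y) for x in range(10) for y in range(10)}
whitelist = {"npc_keeper", "item_wand"}
task = {"entities": [
    {"id": "npc1", "position": [3, 4], "prefab": "npc_keeper"},
    {"id": "item1", "position": [12, 4], "prefab": "item_wand"},  # out of bounds
]}
errors = admission_probe(task, walkable, whitelist)
assert len(errors) == 1 and "not walkable" in errors[0]
```

Aggregating entity positions across many validated tasks yields the coordinate heat maps of Figure 11, and counting `prefab` fields yields the usage distribution of Figure 12.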

4.5. LLM Self-Evaluation

While structural, diversity, and graph-based analyses capture the formal properties of generated tasks, certain narrative qualities—such as character voice, dialogue naturalness, and perceived interestingness—are difficult to assess using rule-based or engine-level metrics alone. To complement these evaluations, we applied an LLM-based self-evaluation protocol as a scalable proxy for player-facing narrative judgments [46], following prior studies on persona stability and narrative believability [23].
We note that LLM-based self-evaluation is inherently biased and cannot be regarded as an objective or independent ground-truth measure; accordingly, we treat these scores as auxiliary indicators rather than primary evidence.
A stratified sample of 60 validated quests was evaluated using a seven-dimensional rubric covering world consistency, narrative logic, solvability, character voice, dialogue quality, interestingness, and overall score. Spatial invalidity was explicitly penalized in accordance with rubric-based evaluation practices from the alignment and RLHF research [47].
All judgments were produced by a separate evaluator model that was different from the generator model to reduce direct self-consistency bias.
In addition to numeric scores, the self-evaluation process produces concise natural-language summaries of each quest’s narrative premise and resolution, providing a compact overview of the story content as an auxiliary output. Appendix D summarizes the evaluation prompt template, scoring dimensions, violation rules, and output schema used for the LLM-based self-evaluation.
The aggregate results are presented in Table 8. Across all evaluated tasks, average scores exceed 4.0 on a five-point scale for most dimensions, indicating strong narrative coherence, well-formed progression logic, and credible character interaction. High scores for world consistency and solvability indicate that increased narrative expressiveness does not compromise logical completeness or grounding. These results should be interpreted as complementary signals that support, but do not replace, human and engine-level evaluations.
Dimension-level trends show that world consistency and narrative logic remain stable across samples, whereas stylistic dimensions such as character voice and dialogue quality exhibit greater variation, as expected in generative narrative systems. Representative examples in Table 9 include both rubric scores and short narrative summaries, illustrating how the generated tasks can be quickly inspected and compared at the story level while preserving structural coherence. The LLM-based self-evaluation indicates that the generated tasks maintain coherent, expressive, and engaging narrative qualities under bounded generative variation, when considered jointly with the automated and human evaluations reported in earlier sections.

4.6. Human Player Study

While automated and model-based evaluations provide scalable indicators of system quality, the human player experience remains the definitive test of whether generated narrative content functions effectively in gameplay. To assess player-facing quality, we conducted a human-subject study in which 15 participants played randomly selected quests drawn from the pool of final-version validated tasks through a Unity WebGL demo and completed a thirteen-item Likert questionnaire. Each participant was asked to complete at least three randomly assigned quests during the session. The survey covered six dimensions: playability and clarity, narrative logic, world consistency, character voice and dialogue quality, interest, and overall satisfaction.
Participants included both general players and technically informed users, most of whom reported prior experience with RPG or story-driven games, with varying levels of game development background ranging from none to intermediate. This composition allowed the study to capture intuitive player perceptions without restricting the evaluation to expert developers.
The results summarized in Table 10 indicate consistently positive player experiences across all dimensions. Playability and clarity achieved a mean score of 4.17, suggesting that the objectives and progression were easy to understand during gameplay. Narrative logic scored 4.23, indicating that players perceived the quest structure and branching as coherent and free from logical discontinuities.
World consistency received one of the highest ratings (4.40), showing that the characters, events, and settings aligned well with the established game world. Character voice and dialogue quality were also rated highly (mean 4.30), with dialogue naturalness achieving the highest individual score (4.60), reflecting stable persona expression and conversational coherence. The interestingness score was 4.07, indicating sustained engagement, whereas overall satisfaction reached 4.57, suggesting strong player acceptance and a willingness to engage with additional generated content. These pilot results provide preliminary evidence that the generated quests deliver clear, coherent, and engaging player experiences, complementing the earlier automated and model-based evaluations.

5. Conclusions and Future Work

This study proposes a Game Knowledge Management System (G-KMS) that reframes LLM-based narrative generation as a structured, executable knowledge management process for RPGs. Rather than treating LLMs as free-form content generators, the proposed framework embeds them within a governed pipeline that explicitly manages the lifecycle of game knowledge—from grounding and schema-constrained generation to normalization, validation, and engine-level execution.
The core contribution of this work is not a new generative model, but a system-level methodology that enables LLM outputs to function as reliable, engine-executable knowledge artifacts through schema governance, deterministic normalization, and engine-aligned admission checks. All experimental validations were conducted within a compact 2D Unity-based RPG benchmark, serving as a controlled testbed for assessing system reliability and executability. The results demonstrate stable structural validity, controlled expressive diversity, coherent quest-graph logic, and strong engine compatibility.
These findings suggest that LLMs can serve as dependable knowledge-construction components when embedded within an explicit management and verification framework, enabling systematic validation, reuse, and deployment within real game engines.
Despite these strengths, this framework has several limitations. Task schemas and world knowledge remain largely static, constraining long-term narrative evolution and cross-task dependencies. Future work will explore richer narrative representations, including multiquest dependencies, persistent world states, and adaptive knowledge updates driven by player behavior, as well as extensions to larger or persistent environments such as sandbox games and long-running narrative worlds.
Additional directions include integrating lightweight planning or reasoning modules, incorporating feedback-driven adaptation, and exploring deployment-oriented LLM configurations to improve runtime efficiency. Beyond the validated RPG scenario, extending the framework toward organization- and human-centered KM dimensions (e.g., designer curation, approval workflows, and continuous knowledge maintenance) represents an important avenue for future research.

Author Contributions

Conceptualization, A.R.; methodology, A.R.; software, A.R.; validation, A.R.; writing—original draft preparation, A.R.; writing—review and editing, A.Y. and K.C.; visualization, A.R.; supervision, K.C.; project administration, K.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP) under an Artificial Intelligence Convergence Innovation Human Resources Development (IITP-2026-RS-2023-00254592) grant funded by the Korea government Ministry of Science and Information, Communications and Technology (MSIT) (30%). This research was also supported by Joycity through research funding in 2024 (Project Name: Freestyle Game Artificial Intelligence Development) (40%). This research was additionally supported by the IITP (Institute of Information & Communications Technology Planning & Evaluation)-ICAN (ICT Challenge and Advanced Network of HRD) grant funded by the Korea government (Ministry of Science and ICT) (IITP-2026-RS-2023-00260248) (30%).

Data Availability Statement

The source code and model checkpoints are not publicly available due to institutional and commercial constraints. Requests for further details may be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AIArtificial Intelligence
APIApplication Programming Interface
GenAIGenerative Artificial Intelligence
G-KMSGame Knowledge Management System
HITLHuman-in-the-Loop
KMSKnowledge Management System
PCGProcedural Content Generation
LLMLarge Language Model
RAGRetrieval-Augmented Generation
RTSReal-Time Strategy
RPGRole-Playing Game
DSRDesign Science Research
JSONJavaScript Object Notation
DSLDomain-Specific Language
NPCNon-Player Character
UIUser Interface
GPTGenerative Pre-trained Transformer
t-SNEt-distributed Stochastic Neighbor Embedding
UMAPUniform Manifold Approximation and Projection
S-BERTSentence-BERT
WebGLWeb Graphics Library

Appendix A

Prompt Templates for Stage-2 LLM-Constrained Generation. To ensure reproducibility while improving conciseness, this appendix summarizes the core system-prompt structure used for schema-constrained narrative generation. The prompt is organized into four coordinated layers together with system-level output rules. Only representative constraints and examples are shown for brevity.
Table A1. LLM-Constrained Generation.
Layer | Purpose | Representative Constraints and Elements
Constraint Layer | Enforce structural validity and engine compatibility
  • Output must be a single JSON object strictly conforming to task.schema.json (no commentary or markdown).
  • All prefab identifiers, portraits, enums, conditions, and actions must match schema definitions.
  • Player/NPC/item positions are initialized at (0.0, 0.0) and reassigned by normalization.
  • Allowed quest states: START, SEARCHING, RETURNING, COMPLETE.
  • Allowed condition types: flagTrue, flagFalse, questStateIn, routeIn, routeUnset, proximity.
  • Allowed actions: setFlag, setObjective, setQuestState, log, lockRoute (required fields only).
Context Layer | Ground narrative generation in world knowledge and style
  • Incorporate semantic constraints from the World Bible (locations, factions, roles, personalities).
  • Maintain the tone of a magical sports festival (Campsite, Forest, Stadium, Tent).
  • No canon names from existing IP.
  • Dialogue must be immersive, personality-consistent, and written in English with appropriate emotional nuance.
Few-Shot Layer | Provide structural and stylistic priors
  • Include one or more validated “golden-sample” quest JSONs as exemplars.
  • Demonstrate correct usage of route locking, flag updates, objective progression, and conditional dialogue branching.
  • Reinforce formatting discipline and field ordering expected by the validator. (Actual few-shot examples omitted for brevity.)
Generation Request Layer | Specify task-level narrative and gameplay requirements
  • Generate exactly one complete quest JSON.
  • Three or more objectives forming a START → SEARCHING → RETURNING → COMPLETE arc.
  • 2–3 NPCs with personality-driven dialogue and conditional branches.
  • One quest-relevant item.
  • 25–40 dialogue lines across branches.
  • Four distinct endings gated by routes and flags.
System-Level Output Rules | Enforce admission-level correctness before execution
  • JSON must contain all required top-level keys (schemaVersion, taskId, title, description, etc.).
  • No additional fields allowed (additionalProperties = false).
  • All outputs must pass normalization, schema validation, semantic validation, and Unity smoke testing before admission.
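The admission-level schema check described above can be sketched with the Python jsonschema package. The stand-in schema below is drastically reduced for illustration; the actual task.schema.json defines many more keys (title, description, NPCs, dialogue, endings, and so on):

```python
from jsonschema import Draft7Validator

# Reduced illustrative stand-in for task.schema.json (not the real schema).
TASK_SCHEMA = {
    "type": "object",
    "properties": {
        "schemaVersion": {"type": "string"},
        "taskId": {"type": "string"},
        "questState": {"enum": ["START", "SEARCHING", "RETURNING", "COMPLETE"]},
    },
    "required": ["schemaVersion", "taskId"],
    "additionalProperties": False,  # reject out-of-schema fields outright
}

def admission_errors(candidate):
    """Collect all Draft-7 validation messages for a generated task object."""
    return [e.message for e in Draft7Validator(TASK_SCHEMA).iter_errors(candidate)]

ok = {"schemaVersion": "1.0", "taskId": "q1", "questState": "START"}
bad = {"schemaVersion": "1.0", "taskId": "q1", "extraField": 1}
assert admission_errors(ok) == [] and len(admission_errors(bad)) == 1
```

With `additionalProperties: false`, any field the generator invents is rejected at admission time rather than surfacing as a runtime failure in the engine.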

Appendix B

To guarantee engine-executable outputs, each generated task undergoes a two-stage post-processing pipeline: (1) normalization and (2) semantic–structural validation. Normalization ensures structural cleanliness and field consistency; validation evaluates schema conformity, prefabs and coordinates correctness, and dialogue-graph integrity.
Table A2. Normalization Rules.
Normalization Rule | Description
Illegal key sanitization | Removes fields not permitted by the schema from actions and conditions (e.g., stray 'id', 'text', 'to', 'states' keys).
Branch balancing | Ensures each dialogue node contains at least one fallback branch ('return_to_main'), preventing dead-end states.
NPC spatial correction | If 'positionZone' is missing, autofills it with "All"; snaps NPC coordinates to the nearest walkable cell (Manhattan metric).
Schema-field alignment | Aligns 'metadata.title' to match the top-level 'title' if inconsistent.
Fix-log recording | All modifications are stored in a 'post_normalize.fixlog' list for transparency and debugging.
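Two of the normalization rules above, illegal-key sanitization and Manhattan-metric coordinate snapping, both recorded in a fix log, can be sketched as follows. The allowed-key set and helper names are illustrative assumptions, not the study's implementation:

```python
ALLOWED_ACTION_KEYS = {"type", "flag", "value", "state"}  # illustrative subset

def normalize_action(action, fixlog):
    """Drop schema-illegal keys from an action, recording each fix (cf. Table A2)."""
    for key in [k for k in action if k not in ALLOWED_ACTION_KEYS]:
        del action[key]
        fixlog.append(f"removed illegal key '{key}'")
    return action

def snap_to_walkable(pos, walkable, fixlog):
    """Snap a coordinate to the nearest walkable cell under the Manhattan metric."""
    if tuple(pos) in walkable:
        return tuple(pos)
    nearest = min(walkable, key=lambda c: abs(c[0] - pos[0]) + abs(c[1] - pos[1]))
    fixlog.append(f"snapped {tuple(pos)} -> {nearest}")
    return nearest

fixlog = []
action = normalize_action({"type": "setFlag", "flag": "hasMap", "id": "stray"}, fixlog)
pos = snap_to_walkable((5, -1), {(5, 0), (4, 0)}, fixlog)
assert "id" not in action and pos == (5, 0) and len(fixlog) == 2
```

Because every mutation appends to the fix log, the repair step stays deterministic and auditable, which is what distinguishes normalization-based repair from silent regeneration.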
Table A3. Semantic–Structural Validation Rules.
Validation Category | Core Checks | Description
Schema Conformity | Draft-7 schema check; required fields; enum validity | Ensures the JSON structure matches the formal task schema. In compatibility mode, schema violations are logged but not blocking.
Prefab and Portrait Validity | Prefab whitelist enforcement; portrait naming rule ('<Faction>_<m/f>_face.png') | Verifies that all character and item prefabs exist in the official whitelist and that portrait filenames match faction/gender patterns.
Position and Zone Consistency | Walkable-cell lookup; zone normalization and aliasing; near-miss tolerance with EPS | Confirms that player/NPC/item positions correspond to valid walkable points within tolerance. Zone names are normalized to avoid mismatches, and near misses are downgraded to warnings.
Dialogue-Graph Integrity | 'option.next' branch mapping; 'endingId' validity; node structure correctness | Ensures all dialogue options reference existing branches, all ending IDs are defined, and no branches create dead references.
Title and Metadata Alignment * | Check 'title == metadata.title' | Prevents inconsistencies between top-level and metadata titles during multi-step generation.
* Auxiliary cross-field consistency check to prevent title drift between the top-level field and metadata during multi-step generation.
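The dialogue-graph integrity category can be sketched as a reference walk over the node map. The simplified option fields echo Table A3 ('next', 'endingId'), but the helper itself and its data shape are illustrative assumptions:

```python
def dialogue_graph_errors(nodes, ending_ids):
    """Check that every option targets an existing node or a defined ending.

    `nodes` maps node id -> list of options, each {"next": ...} or
    {"endingId": ...}; this shape is a simplified illustration.
    """
    errors = []
    for node_id, options in nodes.items():
        for opt in options:
            if "next" in opt and opt["next"] not in nodes:
                errors.append(f"{node_id}: dead reference to '{opt['next']}'")
            if "endingId" in opt and opt["endingId"] not in ending_ids:
                errors.append(f"{node_id}: undefined ending '{opt['endingId']}'")
    return errors

nodes = {"greet": [{"next": "quest_offer"}, {"next": "farewell"}],
         "quest_offer": [{"endingId": "accepted"}],
         "farewell": [{"endingId": "ghost_ending"}]}  # ending not defined below
assert dialogue_graph_errors(nodes, ending_ids={"accepted"}) == \
       ["farewell: undefined ending 'ghost_ending'"]
```

An empty error list from this walk is the condition under which a task's dialogue graph is admitted as free of dead references.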

Appendix C

To ensure reproducibility and transparency, Figure A1 presents the visualized structure of the executable quest JSON used in this study. The diagram illustrates the full progression logic of the Recover the Torn Map task, including state transitions, dialogue branching, conditional flags, objective updates, item-triggered events, and ending conditions. This visualization reflects the underlying schema-constrained narrative flow that is loaded and executed directly in the Unity runtime.
Figure A1. Visual Structure of the Executable Task JSON.

Appendix D

Table A4 presents the prompt template used for LLM-based self-evaluation, including the evaluator role, scoring dimensions, violation rules, and structured JSON output format.
Table A4. Prompt Template Used for LLM Self-Evaluation.
Component | Description/Example
Evaluator Role | "You are a narrative reviewer evaluating the coherence and quality of a generated RPG quest."
Evaluation Scope | Assess narrative consistency, quest logic, character voice, dialogue quality, solvability, and overall interestingness of a single generated task.
Scoring Dimensions | world_consistency, narrative_logic, solvability, character_voice, dialogue_quality, interestingness, and overall_score; each rated on a 1–5 Likert scale.
Violation Flags | Invalid prefab or location references penalize world_consistency; dead-end or unreachable dialogue branches penalize narrative_logic; use of reserved canon names is recorded as a violation flag.
Output Format | Structured JSON conforming to eval.schema.json, containing numeric scores for each dimension and a brief natural-language summary (≤120 characters) of the quest premise and resolution.
Example Output | { "overall": 4.17, "world_consistency": 5, "narrative_logic": 4, "solvability": 4, "character_voice": 4, "dialogue_quality": 4, "interestingness": 4, "summary": "Players resolve a missing wand incident through dialogue at a forest camp." }

References

  1. Hu, C.; Zhao, Y.; Liu, J. Game Generation via Large Language Models. In Proceedings of the 2024 IEEE Conference on Games (CoG), Milan, Italy, 5–8 August 2024; pp. 1–4. [Google Scholar]
  2. Wu, W.; Wu, H.; Jiang, L.; Liu, X.; Zhao, H.; Zhang, M. From Role-Play to Drama-Interaction: An LLM Solution. In Findings of the Association for Computational Linguistics: ACL 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 3271–3290. [Google Scholar]
  3. Liu, X.; Xie, Z.; Jiang, S. Personalized Non-Player Characters: A Framework for Character-Consistent Dialogue Generation. AI 2025, 6, 93. [Google Scholar] [CrossRef]
  4. Park, J.S.; O’Brien, J.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative Agents: Interactive Simulacra of Human Behavior. In UIST’23: Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, San Francisco, CA, USA, 29 October–1 November 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 1–22. [Google Scholar]
  5. Buongiorno, S.; Klinkert, L.; Zhuang, Z.; Chawla, T.; Clark, C. PANGeA: Procedural Artificial Narrative Using Generative AI for Turn-Based, Role-Playing Video Games. Proc. AAAI Conf. Artif. Intell. Interact. Digit. Entertain. 2024, 20, 156–166. [Google Scholar] [CrossRef]
  6. Shao, Y.; Li, L.; Dai, J.; Qiu, X. Character-LLM: A Trainable Agent for Role-Playing. arXiv 2023, arXiv:2310.10158. [Google Scholar] [CrossRef]
  7. Kang, T.; Lin, M.C. Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts. arXiv 2025, arXiv:2505.16819. [Google Scholar]
  8. Li, J.; Li, Y.; Wadhwa, N.; Pritch, Y.; Jacobs, D.E.; Rubinstein, M.; Bansal, M.; Ruiz, N. Unbounded: A Generative Infinite Game of Character Life Simulation. arXiv 2024, arXiv:2410.18975. [Google Scholar] [CrossRef]
  9. ISO 30401:2018; Knowledge Management Systems—Requirements. International Organization for Standardization: Geneva, Switzerland, 2018.
  10. Ontanón, S.; Synnaeve, G.; Uriarte, A.; Richoux, F.; Churchill, D.; Preuss, M. A survey of real-time strategy game AI research and competition in StarCraft. IEEE Trans. Comput. Intell. AI Games 2013, 5, 293–311. [Google Scholar] [CrossRef]
  11. Young, R.M.; Riedl, M.O. An Architecture for Integrating Plan-Based Behavior Generation with Interactive Game Environments. J. Game Dev. 2004, 1, 1–29. [Google Scholar]
  12. Togelius, J.; Yannakakis, G.N.; Stanley, K.O.; Browne, C. Search-Based Procedural Content Generation: A Taxonomy and Survey. IEEE Trans. Comput. Intell. AI Games 2011, 3, 172–186. [Google Scholar] [CrossRef]
  13. Smith, G.; Whitehead, J.; Mateas, M. Tanagra: Reactive Planning and Constraint Solving for Mixed-Initiative Level Design. IEEE Trans. Comput. Intell. AI Games 2011, 3, 201–215. [Google Scholar] [CrossRef]
  14. Alexander, R.; Martens, C. Deriving Quests from Open World Mechanics. In FDG’17: Proceedings of the 12th International Conference on the Foundations of Digital Games, Hyannis, MA, USA, 14–17 August 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 1–7. [Google Scholar]
  15. Ammanabrolu, P.; Broniec, W.; Mueller, A.; Paul, J.; Riedl, M. Toward Automated Quest Generation in Text-Adventure Games. In Proceedings of the 4th Workshop on Computational Creativity in Language Generation, Tokyo, Japan; Burtenshaw, B., Manjavacas, E., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 1–12. [Google Scholar]
  16. da Rocha Franco, A.d.O.; de Carvalho, W.V.; da Silva, J.W.F.; Maia, J.G.R.; de Castro, M.F. Managing and Controlling Digital Role-Playing Game Elements: A Current State of Affairs. Entertain. Comput. 2024, 51, 100708. [Google Scholar] [CrossRef]
  17. Risi, S.; Togelius, J. Increasing Generality in Machine Learning through Procedural Content Generation. Nat. Mach. Intell. 2020, 2, 428–436. [Google Scholar] [CrossRef]
  18. Koppen, L. Integrating a Human Feedback Loop in PCG for Level Design Using LLMs. Bachelor’s Thesis, University of Twente, Enschede, The Netherlands, 2024. [Google Scholar]
  19. Sun, Y.; Li, Z.; Fang, K.; Lee, C.H.; Asadipour, A. Language as Reality: A Co-Creative Storytelling Game Experience in 1001 Nights Using Generative AI. Proc. AAAI Conf. Artif. Intell. Interact. Digit. Entertain. 2023, 19, 425–434. [Google Scholar] [CrossRef]
  20. Kumaran, V.; Carpenter, D.; Rowe, J.; Mott, B.; Lester, J. End-to-End Procedural Level Generation in Educational Games with Natural Language Instruction. In Proceedings of the 2023 IEEE Conference on Games (CoG), Boston, MA, USA, 21–24 August 2023; pp. 1–8. [Google Scholar]
  21. Nasir, M.U.; James, S.; Togelius, J. Word2World: Generating Stories and Worlds through Large Language Models. arXiv 2024, arXiv:2405.06686. [Google Scholar]
  22. Li, W.; Bai, Y.; Lu, J.; Yi, K. Immersive Text Game and Personality Classification. arXiv 2022, arXiv:2203.10621. [Google Scholar] [CrossRef]
  23. Shuster, K.; Urbanek, J.; Szlam, A.; Weston, J. Am I Me or You? State-of-the-Art Dialogue Models Cannot Maintain an Identity. In Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, USA; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 2367–2387. [Google Scholar]
  24. Song, H.; Wang, Y.; Zhang, W.-N.; Liu, X.; Liu, T. Generate, Delete and Rewrite: A Three-Stage Framework for Improving Persona Consistency of Dialogue Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 5821–5831. [Google Scholar]
  25. Ji, K.; Lian, Y.; Li, L.; Gao, J.; Li, W.; Dai, B. Enhancing Persona Consistency for LLMs’ Role-Playing Using Persona-Aware Contrastive Learning. arXiv 2025, arXiv:2503.17662. [Google Scholar]
  26. Takayama, J.; Ohagi, M.; Mizumoto, T.; Yoshikawa, K. Persona-Consistent Dialogue Generation via Pseudo Preference Tuning. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, UAE; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 5507–5514. [Google Scholar]
  27. Zhou, W.; Peng, X.; Riedl, M. Dialogue Shaping: Empowering Agents through NPC Interaction. arXiv 2023, arXiv:2307.15833. [Google Scholar] [CrossRef]
  28. Jennings, N.; Wang, H.; Li, I.; Smith, J.; Hartmann, B. What’s the Game, Then? Opportunities and Challenges for Runtime Behavior Generation. In UIST’24: Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, Pittsburgh, PA, USA, 13–16 October 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 1–13. [Google Scholar]
  29. Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; Anandkumar, A. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv 2023, arXiv:2305.16291. [Google Scholar] [CrossRef]
  30. Meta Fundamental AI Research Diplomacy Team (FAIR); Bakhtin, A.; Brown, N.; Dinan, E.; Farina, G.; Flaherty, C.; Fried, D.; Goff, A.; Gray, J.; Hu, H.; et al. Human-Level Play in the Game of Diplomacy by Combining Language Models with Strategic Reasoning. Science 2022, 378, 1067–1074. [Google Scholar] [CrossRef]
  31. Zheng, S.; He, K.; Yang, L.; Xiong, J. MemoryRepository for AI NPC. IEEE Access 2024, 12, 62581–62596. [Google Scholar] [CrossRef]
  32. Sudhakaran, S.; González-Duque, M.; Freiberger, M.; Glanois, C.; Najarro, E.; Risi, S. MarioGPT: Open-Ended Text2Level Generation through Large Language Models. Adv. Neural Inf. Process. Syst. 2023, 36, 54213–54227. [Google Scholar]
  33. Welleck, S.; Kulikov, I.; Roller, S.; Dinan, E.; Cho, K.; Weston, J. Neural Text Generation with Unlikelihood Training. arXiv 2019, arXiv:1908.04319. [Google Scholar] [CrossRef]
  34. Chang, S.; Wang, J.; Dong, M.; Pan, L.; Zhu, H.; Li, A.H.; Lan, W.; Zhang, S.; Jiang, J.; Lilien, J.; et al. Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness. arXiv 2023, arXiv:2301.08881. [Google Scholar]
  35. Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; Choi, Y. The Curious Case of Neural Text Degeneration. arXiv 2020, arXiv:1904.09751. [Google Scholar] [CrossRef]
  36. Li, J.; Galley, M.; Brockett, C.; Gao, J.; Dolan, B. A Diversity-Promoting Objective Function for Neural Conversation Models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California; Knight, K., Nenkova, A., Rambow, O., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 110–119. [Google Scholar]
  37. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084. [Google Scholar] [CrossRef]
  38. Ahmed, M.; Seraj, R.; Islam, S.M.S. The K-Means Algorithm: A Comprehensive Survey and Performance Evaluation. Electronics 2020, 9, 1295. [Google Scholar] [CrossRef]
  39. van der Maaten, L.; Hinton, G. Visualizing Data Using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  40. McIntyre, N.; Lapata, M. Plot Induction and Evolutionary Search for Story Generation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden; Association for Computational Linguistics: Stroudsburg, PA, USA, 2010; pp. 1562–1572. [Google Scholar]
  41. Porteous, J.; Cavazza, M. Controlling Narrative Generation with Planning Trajectories: The Role of Constraints. In Interactive Storytelling. ICIDS 2009; Iurgel, I.A., Zagalo, N., Petta, P., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 234–245. [Google Scholar]
  42. Riedl, M.O.; Young, R.M. Narrative Planning: Balancing Plot and Character. J. Artif. Intell. Res. 2010, 39, 217–268. [Google Scholar] [CrossRef]
  43. Shaker, N.; Togelius, J.; Nelson, M.J. Procedural Content Generation in Games; Springer: Cham, Switzerland, 2016. [Google Scholar]
  44. Summerville, A.; Snodgrass, S.; Guzdial, M.; Holmgård, C.; Hoover, A.K.; Isaksen, A.; Nealen, A.; Togelius, J. Procedural Content Generation via Machine Learning (PCGML). IEEE Trans. Games 2018, 10, 257–270. [Google Scholar] [CrossRef]
  45. Smith, G.; Othenin-Girard, A.; Whitehead, J.; Wardrip-Fruin, N. PCG-Based Game Design: Creating Endless Web. In Proceedings of the International Conference on the Foundations of Digital Games, Raleigh, NC, USA, 29 May–1 June 2012; Association for Computing Machinery: New York, NY, USA, 2012; pp. 188–195. [Google Scholar]
  46. Zheng, L.; Chiang, W.-L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.P.; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Curran Associates Inc.: Red Hook, NY, USA, 2023; pp. 46595–46623. [Google Scholar]
  47. Bai, Y.; Jones, A.; Ndousse, K.; Askell, A.; Chen, A.; DasSarma, N.; Drain, D.; Fort, S.; Ganguli, D.; Henighan, T.; et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv 2022, arXiv:2204.05862. [Google Scholar] [CrossRef]
Figure 1. LLM-Chained Game Knowledge Management System (G-KMS) Architecture. This figure illustrates the three-stage LLM-chained pipeline framed as a Game Knowledge Management System.
Figure 2. Data Grounding and Prompt Preparation Pipeline. This figure shows how narrative source text is transformed into a structured world bible and aligned with Unity assets and schemas to provide grounded knowledge resources for constrained generation.
Figure 3. Example of Standardized Character Data Representations. This figure shows sample player and NPC objects generated using the standardized schema, including grounded faction, personality attributes, prefab identifiers, and valid spatial coordinates.
Figure 4. Prompt-Constrained Generation Pipeline. This figure depicts how schema-constrained prompts guide JSON-only decoding, followed by normalization and validation; valid tasks are admitted as executable artifacts.
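As a concrete illustration of the admission step depicted in Figure 4, the sketch below shows one way normalization and validation could gate generated task JSON before it reaches the engine. All field names, prefab identifiers, and map bounds here are hypothetical stand-ins, not the paper's actual schema.

```python
# Minimal sketch of normalization-based repair followed by validation.
# REQUIRED_FIELDS, VALID_PREFABS, and MAP_BOUNDS are illustrative assumptions.
REQUIRED_FIELDS = {"task_id", "title", "giver_npc", "prefab", "position", "endings"}
VALID_PREFABS = {"npc_villager", "npc_guard", "item_map"}  # hypothetical asset registry
MAP_BOUNDS = (-20.0, 20.0)                                 # hypothetical walkable range

def normalize(task: dict) -> dict:
    """Repair recoverable defects instead of rejecting the artifact outright."""
    repaired = dict(task)
    repaired.setdefault("endings", ["default_ending"])      # fill a missing field
    x, y = repaired.get("position", (0.0, 0.0))
    lo, hi = MAP_BOUNDS
    repaired["position"] = (min(max(x, lo), hi), min(max(y, lo), hi))  # clamp coords
    return repaired

def validate(task: dict) -> list:
    """Return a list of admission errors; an empty list means the task is admitted."""
    errors = []
    missing = REQUIRED_FIELDS - task.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if task.get("prefab") not in VALID_PREFABS:
        errors.append(f"unknown prefab: {task.get('prefab')}")
    return errors

raw = {"task_id": "t1", "title": "Lost Map", "giver_npc": "Ilya",
       "prefab": "item_map", "position": (35.0, -2.0)}  # out-of-bounds, no endings
task = normalize(raw)
print(validate(task))  # → [] : admitted after repair
```

The design point this illustrates is the paper's ablation result: the same artifact that fails raw validation (out-of-range coordinates, missing endings) is admitted after the repair pass, so normalization acts as a governance layer rather than a filter.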
Figure 5. Standardized character representation. The figure shows player and NPC objects encoded using the unified schema, including grounded semantic attributes, valid prefab identifiers, and engine-compatible spatial placements.
Figure 6. Schema-grounded item representation. This example illustrates how item interactions are encoded to trigger quest-state transitions, flag updates, and player-facing log messages within the task schema.
Figure 7. Branching dialogue structure. The visualization demonstrates schema-aligned dialogue branching, where player choices and conditional transitions correspond to quest states, representing a subset of the full dialogue graph.
Figure 8. Evaluation Pipeline and Unity Implementation. This figure presents the dual-path evaluation combining Unity runtime execution with parallel automated analysis for assessing structural validity, narrative quality, and playability.
Figure 9. A tile map (a grid-based 2D game map composed of discrete tiles in Unity) in the game world, containing divided sub-regions (campsite, forest, stadium, tent).
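To make the sub-region structure of the tile map concrete, the sketch below resolves a generated (x, y) placement to one of the named regions. The rectangle bounds are hypothetical illustrations, not the actual map geometry.

```python
# Illustrative sketch: mapping a coordinate to the smallest enclosing sub-region.
# Region rectangles are hypothetical, not the paper's map data.
REGIONS = {
    "campsite": (0, 0, 10, 10),   # (x_min, y_min, x_max, y_max)
    "forest":   (10, 0, 30, 20),
    "stadium":  (0, 10, 10, 20),
    "tent":     (4, 4, 6, 6),     # nested inside the campsite
}

def locate(x, y):
    """Return the smallest enclosing sub-region name, or None if unwalkable."""
    hits = [(name, box) for name, box in REGIONS.items()
            if box[0] <= x <= box[2] and box[1] <= y <= box[3]]
    if not hits:
        return None
    # prefer the most specific (smallest-area) region, e.g. tent over campsite
    return min(hits, key=lambda h: (h[1][2] - h[1][0]) * (h[1][3] - h[1][1]))[0]

print(locate(5, 5))    # → tent
print(locate(20, 5))   # → forest
print(locate(-3, -3))  # → None (outside all walkable regions)
```

A check of this kind is the sort of spatial constraint the placement heatmaps in Figure 11 probe: generated coordinates should resolve to a walkable region rather than to None.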
Figure 10. t-SNE/UMAP + Clustering Visualization. (a) Tasks generated at low temperatures form several compact, well-defined topic clusters with limited diversity; (b) Medium temperature preserves clear thematic structure while increasing semantic diversity within and across clusters, representing a balance between diversity and consistency; (c) At a temperature of 1.0, task distributions become more dispersed and cluster boundaries blur.
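The diversity analysis behind this figure can be sketched as: embed task descriptions, cluster the embeddings with k-means, and project them to 2-D with t-SNE. In the minimal example below, random vectors stand in for the Sentence-BERT embeddings, and the cluster count k = 5 is illustrative.

```python
# Sketch of the t-SNE + clustering visualization pipeline.
# Random vectors stand in for Sentence-BERT task embeddings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(90, 384))   # 90 tasks x 384-d (SBERT-sized) vectors

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embeddings)
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(embeddings)

print(coords.shape, len(set(labels)))     # 2-D coordinates and up to 5 cluster labels
```

In the actual analysis, running this per sampling temperature and plotting `coords` colored by `labels` yields the panels of Figure 10: tighter clusters at T = 0.3, blurred boundaries at T = 1.0.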
Figure 11. Spatial Placement Heatmaps Across Temperatures (T0.3, T0.7, T1.0). (a) Low-temperature generation produces tightly clustered placements, indicating conservative coordinate sampling with limited spatial variation; (b) Medium temperature expands the distribution while preserving clear structural density, achieving a balance between diversity and spatial coherence; (c) High-temperature sampling yields the most dispersed layouts, but placements remain within walkable regions and retain meaningful spatial structure.
Figure 12. Prefab Usage Distribution Across Temperatures (T0.3, T0.7, T1.0). (a) At low temperature, quest-critical prefabs dominate and the distribution remains compact, reflecting conservative selection; (b) Medium temperature broadens the prefab long tail and introduces greater factional and stylistic variation while preserving stable core categories; (c) High-temperature sampling further extends the long tail and increases rare prefab usage, while remaining schema-constrained and free of invalid assets.
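The prefab-usage distributions in Figure 12 reduce to a frequency tally over the prefab identifiers in the generated task JSON. The sketch below shows the head/long-tail inspection on a toy task list; the prefab names are hypothetical examples.

```python
# Sketch of the prefab-usage tally: count prefab identifiers across tasks
# and inspect the head and long tail of the distribution.
from collections import Counter

tasks = [  # stand-in for parsed task JSON at one sampling temperature
    {"prefab": "npc_villager"}, {"prefab": "npc_villager"},
    {"prefab": "npc_guard"}, {"prefab": "item_map"},
    {"prefab": "npc_villager"}, {"prefab": "npc_merchant"},
]
usage = Counter(t["prefab"] for t in tasks)

for prefab, count in usage.most_common():
    print(f"{prefab}: {count}")   # quest-critical prefabs dominate the head;
                                  # singly-used prefabs form the long tail
```

Comparing such tallies across temperatures reproduces the figure's observation: higher temperature extends the long tail while the schema constraint keeps every identifier within the valid asset set.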
Table 1. Comparative Overview of Narrative PCG Approaches.

| Category | Representative Work | Structural Control | Narrative Expressiveness | Dialogue/Persona Consistency | Engine-Level Integration | Distinguishing Features |
|---|---|---|---|---|---|---|
| Rule-Based/Planning-Based PCG | [14,15] | High (template/logic rules) | Low–Medium | None | Limited/none | Requires heavy authoring; constrained variation; no LLM richness |
| Template-Driven/Hybrid Systems | [5,18] | High (template anchors) | Medium | Limited | Editor-only; not automated | Template-bound; requires human supervision; not scalable |
| LLM-Based Narrative Generation | [1,2,4] | Low (free-form generation) | High | Moderate | None | No schema guarantees; inconsistent world models; text-only evaluation |
| LLM Dialogue/Persona Consistency Models | [3,6,7] | Medium (AMR/fine-tuning) | Medium | High | None | No quest-state control; no quest graph generation; no runtime verification |
| Open-Ended LLM Simulation/Multi-Agent Systems | [8,29] | Low–Medium | High | Moderate | Sandbox only | No schema validation; not engine-ready; no structural repair |
| LLM-Chained PCG Framework (ours) | - | High (JSON schema + graph repair) | High | High (world bible + persona coherence) | Full Unity runtime validation | An end-to-end LLM-to-engine pipeline for scalable, execution-ready JSON artifacts |
Table 2. Structural Validity Across System Configurations and Ablations.

| Version | Tasks | Smoke Pass Rate | Validate Pass Rate | Interpretation |
|---|---|---|---|---|
| Early Baseline (VLST1) | 20 | 50% | 0% | Partially loadable, but severe semantic inconsistencies prevent any task from meeting validation requirements. |
| Prompt-Optimized (VLST2) | 20 | 100% | 0% | Prompt regularization fixes all structural issues but does not resolve deeper semantic constraints such as prefab mismatches and invalid ending references. |
| Narrative-Normalized (VLST3) | 20 | 100% | 100% | Automatic normalization enforces strict structural and semantic constraints, correcting missing fields and flagging inconsistencies and invalid references. |
| Prompt-Optimized (VLST4) | 100 | 100% | 68% | Built upon the normalization framework introduced in VLST3, this version enhances prompt richness and narrative guidance while relaxing overly rigid post-processing rules. |
| Final Generator (without Normalization) | 100 | 100% | 0% | Removing normalization under identical prompt and schema constraints leads to a complete loss of semantic validity, indicating that prompt- and schema-based control alone is insufficient for ensuring runtime-executable content. This confirms normalization as a necessary governance layer for maintaining engine-aligned semantic consistency, rather than a mere engineering convenience. |
| Final Generator (T = 0.3) | 100 | 100% | 78% | The final generator retains normalization as a safety layer but adopts a softer, selective correction strategy to preserve expressive diversity. Validation rates vary with sampling temperature, reflecting a controlled trade-off between structural reliability and narrative variability. Among these, T = 0.7 achieves the strongest balance between semantic validity and expressive richness. |
| Final Generator (T = 0.7) | 100 | 100% | 85% | |
| Final Generator (T = 1.0) | 100 | 100% | 77% | |
Table 3. Error Breakdown Across Versions.

| Stage | Schema Errors | Semantic Errors | Missing Required Fields | Prefab Errors |
|---|---|---|---|---|
| VLST1 | High | High | Frequent | Frequent |
| VLST2 | Eliminated (structure fully stable) | High (deep semantic issues persist) | Eliminated | Partially resolved |
| VLST3 | Eliminated | Fully resolved (after normalization) | Eliminated | Only minor inconsistencies |
| VLST4 | Eliminated | Moderate (improved but not fully fixed) | Eliminated | Rare |
| Final_w/o Norm | Eliminated | High | Eliminated | Frequent |
| Final_T0.7 | Eliminated | Lowest | Eliminated | Nearly eliminated |
Table 6. Token length reflects complexity.

| Temp | Avg Token Count ↑ | Std ↑ | Interpretation |
|---|---|---|---|
| T0.3 | 82.3 | 25.0 | The shortest, with concise dialogue content. |
| T0.7 ¹ | 92.7 | 32.4 | More information, yet maintaining consistency. |
| T1.0 | 100.8 | 42.1 | The longest, but the task content varies greatly. |

¹ T0.7 performs best in terms of "richness-structural stability"; ↑ indicates increasing token length or variability, reflecting changes in expressive richness rather than guaranteed improvements in coherence or quality.
Table 7. Result Summary.

| Version | Num_tasks | Avg_path_len | Branching_ratio | Clustering_coeff | Dead_ends | Reachable_endings | Total_endings |
|---|---|---|---|---|---|---|---|
| T0.3 | 78 | 3.0 | 1.341 | 0.0 | 0 | 5.16 | 5.16 |
| T0.7 | 85 | 3.0 | 1.361 | 0.0 | 0 | 5.38 | 5.38 |
| T1.0 | 77 | 3.0 | 1.362 | 0.0 | 0 | 5.15 | 5.15 |
| Golden_sample | 1 | 3.0 | 1.455 | 0.0 | 0 | 8 | 8 |
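The structural metrics in Table 7 can be computed from the admitted quest graphs directly. The sketch below evaluates branching ratio, dead ends, and reachable endings on a toy quest graph; the graph itself is illustrative, not an actual generated quest.

```python
# Sketch of the Table 7 graph metrics on a toy quest graph.
graph = {                          # node -> successor quest states
    "start":         ["meet_ilya"],
    "meet_ilya":     ["search_forest", "promise_route"],
    "search_forest": ["ending_repair"],
    "promise_route": ["ending_return"],
    "ending_repair": [],           # terminal states (endings)
    "ending_return": [],
}
endings = {"ending_repair", "ending_return"}

# branching ratio: mean out-degree over non-terminal nodes
internal = [n for n in graph if n not in endings]
branching_ratio = sum(len(graph[n]) for n in internal) / len(internal)

# dead ends: terminal nodes that are NOT declared endings
dead_ends = [n for n, succ in graph.items() if not succ and n not in endings]

# reachable endings via depth-first traversal from "start"
seen, stack = set(), ["start"]
while stack:
    node = stack.pop()
    if node not in seen:
        seen.add(node)
        stack.extend(graph[node])
reachable_endings = endings & seen

print(branching_ratio, len(dead_ends), sorted(reachable_endings))
# → 1.25 0 ['ending_repair', 'ending_return']
```

On a well-formed graph, reachable endings equal total endings and the dead-end count is zero, which is the pattern Table 7 reports across temperatures.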
Table 8. LLM Self-Evaluation Summary Across Final Generators (T0.3/T0.7/T1.0).

| Version | Avg. Overall | World Consistency | Narrative Logic | Solvability | Character Voice | Dialogue Quality | Interestingness | Invalid Pos. | Notes |
|---|---|---|---|---|---|---|---|---|---|
| T0.3 | 4.20 | 4.3 | 4.1 | 4.2 | 4.2 | 4.2 | 4.0 | 0 | Logic-stable, conservative creativity |
| T0.7 | 4.35 | 4.5 | 4.3 | 4.4 | 4.4 | 4.3 | 4.3 | 0 | Best balance between coherence and creativity |
| T1.0 | 4.25 | 4.2 | 4.1 | 4.2 | 4.5 | 4.5 | 4.7 | 1 | High creativity, mildly unstable structure |
Table 9. Representative Self-Evaluation Sample.

| Dimension | Scores | Content |
|---|---|---|
| Story Pitch | - | Ilya seeks help at the edge of the campsite. You follow footprints and wind-scattered fragments, recover and repair the map, and return it to ensure safe night travel along the route. |
| World Consistency | 4.5 | No canon names; prefabs OK; positions OK |
| Narrative Logic | 4.5 | Branches closed; zone-event fit |
| Solvability | 4.0 | Objectives unlock and complete |
| Character Voice | 4.5 | Distinct speakers and concise options; personalities indicated |
| Dialogue Quality | 4.0 | No overlong lines |
| Interestingness | 4.0 | Search vs. promise routes add variety; fits campsite-to-forest beat |
Table 10. Human Player Study Summary (N = 15).

| Dimension | Question Items | Mean | SD | Interpretation |
|---|---|---|---|---|
| Playability and Clarity | Q1–Q2 | 4.17 | 0.22 | Players generally felt that the quest instructions were clear and the process was easy to understand. |
| Narrative Logic | Q3–Q4 | 4.23 | 0.33 | The narrative structure is coherent, with no jumps in plot or logical inconsistencies. |
| World Consistency | Q5 | 4.40 | 0.50 | The characters, events, and worldview are highly consistent. |
| Character Voice and Dialogue Quality | Q6–Q8 | 4.30 | 0.41 | The NPCs have distinct personalities, and their dialogue is natural and non-repetitive. |
| Interestingness | Q9 | 4.07 | 0.46 | The overall task is interesting and the process is engaging. |
| Overall Experience | Q10–Q11 | 4.57 | 0.47 | The experience was positive, and most players were willing to try more AI-generated tasks. |
Rahman, A.; Yu, A.; Cho, K. Game Knowledge Management System: Schema-Governed LLM Pipeline for Executable Narrative Generation in RPGs. Systems 2026, 14, 175. https://doi.org/10.3390/systems14020175
