Automated Generation and Evaluation of Interactive-Fiction Serious Games with Open-Weight LLMs
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Review Report
Summary
This study presents SINE, an automated pipeline using open-weight LLMs to generate text-based interactive fiction serious games from structured JSON seeds. The system integrates grammar-guided decoding, deterministic validation, and a "fixer" agent for iterative repair. Evaluated across 240 seeds with models like Qwen 3 and Gemma 3, the pipeline achieves 68–86% success rates for compilation, playability, and learning-goal fidelity. Results highlight that repair iterations are critical for robustness, while grammar masking offers inconsistent benefits over reasoning prompts.
Strengths
- Practical Impact: Addresses the technical barrier for educators by automating game creation from pedagogical inputs.
- Methodological Rigor: Features a robust multi-stage evaluation, transparent metrics, and open-source release of artifacts/code.
- System Design: Effectively combines constrained decoding with a feedback-loop repair agent, demonstrating that open models can produce runnable educational tools locally.
- Nuanced Findings: Provides valuable insights into the trade-offs between prompting strategies and formal constraints.
Weaknesses & Recommendations
- Lack of User Validation: The study relies on automated metrics without assessing narrative quality or learning outcomes with actual users.
  - Recommendation: Include pilot feedback from educators/students or explicitly frame user studies as essential future work.
- Rigid Fidelity Metric: Learning-goal fidelity uses exact text matching, potentially penalizing semantically correct paraphrases.
  - Recommendation: Discuss limitations of strict equality and propose semantic similarity metrics for future iterations.
- Confounded Complexity: Seed complexity couples station and task counts, limiting causal analysis of specific difficulty drivers.
  - Recommendation: Clarify this limitation and suggest factorial designs for future research.
- Missing Efficiency Data: Quantitative data on inference latency and memory usage is sparse.
  - Recommendation: Add a brief report on runtime and resource consumption to aid deployment planning.
Overall Assessment
Accept with Minor Revisions
Author Response
Comments 1: Lack of User Validation: The study relies on automated metrics without assessing narrative quality or learning outcomes with actual users.
Recommendation: Include pilot feedback from educators/students or explicitly frame user studies as essential future work.
Response 1: We appreciate this comment and agree that the present version focused too narrowly on automated structural evaluation. We have therefore revised the discussion to make the user-facing follow-up work more explicit. In the revised manuscript, we now describe a planned follow-up user study with teachers and students that will assess content-level comprehensibility, perceived narrative coherence between story and learning tasks, and engagement. We also separate this from a later step on actual learning outcomes in deployment settings and from a separate later strand on co-creation and human-in-the-loop workflows. This change can be found on page 15, final paragraph of Section 7.1.
Comments 2: Rigid Fidelity Metric: Learning-goal fidelity uses exact text matching, potentially penalizing semantically correct paraphrases.
Recommendation: Discuss limitations of strict equality and propose semantic similarity metrics for future iterations.
Response 2: This is an important point, and we agree with it. The current metric is intentionally strict, but that same strictness can indeed penalize semantically adequate reformulations. We therefore revised the limitations section to explain the trade-off more plainly. The text now states that the equality-based formulation was chosen as a robust automated baseline to reduce silent content distortion, while also making clear that semantically equivalent reformulations and partly correct task transfers currently end up in the same error category. We also added a concrete future-work direction proposing semantic checks for partial matches and meaning-preserving reformulations, for example via embedding-based similarity and task-level verifier judgments. These changes can be found on pages 15-16, in the “Metric limitations” paragraph of Section 7.2.
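The trade-off discussed in this response can be illustrated with a minimal sketch. It is not the SINE implementation: the function names, the 0.8 threshold, and the three verdict labels are hypothetical, and Python's `difflib.SequenceMatcher` is used only as a purely lexical stand-in for the embedding-based similarity mentioned above.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences vanish."""
    return " ".join(text.lower().split())


def exact_fidelity(seed_task: str, game_task: str) -> bool:
    """Strict baseline: the task text must survive generation unchanged."""
    return seed_task == game_task


def graded_fidelity(seed_task: str, game_task: str, threshold: float = 0.8) -> str:
    """Hypothetical three-way verdict: 'exact', 'near', or 'mismatch'."""
    if seed_task == game_task:
        return "exact"
    # Assumption: character-level similarity as a cheap proxy; a real semantic
    # check would need embeddings or a verifier model instead.
    ratio = SequenceMatcher(None, normalize(seed_task), normalize(game_task)).ratio()
    return "near" if ratio >= threshold else "mismatch"
```

A graded verdict of this kind would keep the strict category intact while separating near matches from unrelated text. A lexical ratio cannot recognise true paraphrases, so it only hints at what a semantic check might add.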
Comments 3: Confounded Complexity: Seed complexity couples station and task counts, limiting causal analysis of specific difficulty drivers.
Recommendation: Clarify this limitation and suggest factorial designs for future research.
Response 3: We agree and have clarified this issue in two places. First, in the methodology, we now explain more explicitly how seeds were sampled and why a full factorial design was not used. We note there that this was done to keep the benchmark practically manageable and to avoid extreme station-task combinations that would be atypical for station-based play and didactically implausible. Second, in the discussion, we return to this point and explain that the coupling is not only a statistical limitation, but also affects the balance between narrative framing and assessed learning content. We also explicitly mention factorial seed designs as a direction for future work. These changes can be found on page 6, Section 5.1, and on page 15, middle paragraph of Section 7.1.
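The difference between a coupled and a fully crossed seed design can be sketched as follows. The level values are invented for illustration and are not the benchmark's actual station or task counts; the function names are likewise hypothetical.

```python
from itertools import product

# Hypothetical levels; the manuscript's real counts are not reproduced here.
STATION_LEVELS = [2, 4, 6]
TASK_LEVELS = [4, 8, 12]


def coupled_design():
    """Stations and tasks grow together, so their effects cannot be separated."""
    return list(zip(STATION_LEVELS, TASK_LEVELS))


def factorial_design():
    """Every station level is crossed with every task level."""
    return list(product(STATION_LEVELS, TASK_LEVELS))
```

With these assumed levels, the coupled design yields three cells and the factorial design nine. The crossed design also contains cells such as six stations with only four tasks, which may correspond to the kind of extreme combination the response describes avoiding.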
Comments 4: Missing Efficiency Data: Quantitative data on inference latency and memory usage is sparse.
Recommendation: Add a brief report on runtime and resource consumption to aid deployment planning.
Response 4: We agree that this information should be easier to find in the manuscript. For that reason, we expanded the efficiency reporting in a concise deployment-oriented way. The revised manuscript now reports the comparable local Apple Silicon host classes used for the experiments, explains that complete runs were distributed across comparable local hosts with identical settings, clarifies that the released pipeline also runs unchanged on a single comparable host, and adds aggregated runtime statistics for generation, checking, and fixing. We also briefly explain why memory utilization was not logged as a comparative metric. These changes can be found on page 11, in the “Hardware footprint” paragraph of Section 5.4.
Reviewer 2 Report
Comments and Suggestions for Authors
This paper investigates the use of open-weight large language models for the automated generation of serious games and proposes an automated pipeline named SINE. The framework integrates structured seeds, LLM-based generation, grammar-guided decoding, automated validation, and a repair agent to generate interactive-fiction (IF) serious games in a text-based format. The study evaluates different models and prompting strategies through multiple experimental rounds and reports quantitative results based on compilation success, playability, and learning-goal fidelity. Overall, the study presents a relatively complete generation–validation–repair pipeline and conducts experiments on 240 seeds to assess system performance. The results suggest that, under certain conditions, it is feasible to automatically generate runnable games that preserve predefined learning tasks. The topic is of interest, and the paper makes a systematic effort to design an automated evaluation framework for serious-game generation.
However, several issues remain in the current version of the manuscript. First, while the methodology and experimental pipeline are described in detail, the discussion of the generated game content itself remains limited, especially regarding narrative quality and educational effectiveness. Second, the evaluation metrics mainly focus on structural validity and syntactic correctness, which do not fully capture the pedagogical value or gameplay experience of the generated games. Third, although the paper compares several models and prompting strategies, the interpretation of the differences between them remains relatively brief and could be further elaborated. In addition, certain aspects of the experimental design would benefit from clearer explanation, particularly the construction of seeds and the relationship between seed complexity and generation success. Overall, the work has potential value, but improvements in experimental interpretation and practical implications are needed. Therefore, I recommend major revision before the manuscript can be considered for publication.
- The paper provides a clear overview of the research problem and contributions, but the introduction could further emphasize the broader significance of automatically generating serious games in the context of educational technology. At present, the focus is primarily on the technical pipeline, while the educational implications are discussed only briefly.
- The methodology section describes the SINE pipeline in detail, but the rationale behind several design choices could be further clarified. In particular, the motivation for the four generation strategies (S1–S4) and how they represent different prompting paradigms could be explained more explicitly.
- The experimental design employs 240 seeds across multiple rounds, yet the construction and distribution of these seeds are only briefly described. Providing more information about how the seeds were generated and how their complexity is controlled would improve reproducibility and interpretability of the results.
- The evaluation metrics focus mainly on compilation rate, playability, and learning-goal fidelity. While these structural metrics are useful, they do not directly assess narrative quality or gameplay experience. Given that the study concerns serious games, it would be helpful to discuss how these metrics relate to actual learning outcomes or player engagement.
- The results section reports differences across models and prompting strategies, but the explanations for these differences remain somewhat limited. A deeper analysis linking model characteristics, prompting design, and task complexity to the observed outcomes would strengthen the discussion.
- The manuscript provides very limited visualization or concrete examples of the generated games. Including one or two representative examples or structural diagrams in the main text would help readers better understand the form and quality of the generated outputs.
Author Response
Comments 1: The paper provides a clear overview of the research problem and contributions, but the introduction could further emphasize the broader significance of automatically generating serious games in the context of educational technology.
At present, the focus is primarily on the technical pipeline, while the educational implications are discussed only briefly.
Response 1: We agree that the educational significance should be more visible already in the introduction. We therefore expanded this part to frame the authoring bottleneck not only as a technical issue but also as an educational-access issue. The revised text now states more explicitly that game-based learning formats may remain unavailable when authoring capacity is limited, and that automated generation can help reduce this barrier by turning existing question pools into editable game drafts, while pedagogical impact still requires later user-based validation. This change can be found on page 1, first paragraph of Section 1.
Comments 2: The methodology section describes the SINE pipeline in detail, but the rationale behind several design choices could be further clarified.
In particular, the motivation for the four generation strategies (S1-S4) and how they represent different prompting paradigms could be explained more explicitly.
Response 2: We agree and revised the methodology to explain the rationale of the four prompting strategies more explicitly. In the revised version, S1 is described as a one-shot baseline, S3 as the condition with an explicitly required planning phase, S2 as an intermediate condition that allows models to use planning only when it appears useful, and S4 as a grammar-constrained extension of S3, which is the strongest prompt-only setup for structurally demanding outputs. We also tightened the literature framing so that it stays closer to what prior work actually supports for grammar-guided generation and validation. This change can be found on page 8, first paragraph after the S1-S4 list in Section 5.2.
Comments 3: The experimental design employs 240 seeds across multiple rounds, yet the construction and distribution of these seeds are only briefly described.
Providing more information about how the seeds were generated and how their complexity is controlled would improve reproducibility and interpretability of the results.
Response 3: Thank you for this comment. We have expanded the seed description to improve reproducibility while keeping the presentation abstract rather than implementation-specific. The revised manuscript now explains that seeds are sampled from topic-specific station and multiple-choice task pools, that stable IDs and fixed task order are preserved for downstream validation, and that the round protocol preserves benchmark reproducibility after sampling. We also clarify directly in the methodology why a full factorial design was not used, namely to keep the benchmark manageable and to avoid station-task extremes that would be didactically implausible. We then pick up the implications of this design choice again later in the discussion. These changes can be found on page 6, in Section 5.1, and on page 15, middle paragraph of Section 7.1.
Comments 4: The evaluation metrics focus mainly on compilation rate, playability, and learning-goal fidelity.
While these structural metrics are useful, they do not directly assess narrative quality or gameplay experience.
Given that the study concerns serious games, it would be helpful to discuss how these metrics relate to actual learning outcomes or player engagement.
Response 4: We agree with this comment. We revised the discussion and limitations to position the structural metrics more clearly as prerequisite checks rather than direct evidence of learning outcomes or player experience. The revised text now links the metric choice more explicitly to serious-game design literature that treats the relation between learning objectives and gameplay mechanics as a central concern. We also added a more cautious note on possible proxy use, pointing out that some interactive-narrative work reports correlations between log-based indicators and questionnaire-based experience measures, but that such indicators should here be read primarily as prerequisite checks and, at most, as proxy indicators once combined with user data. In addition, the manuscript now states more concretely that a follow-up user study is planned and that actual learning outcomes are a later step. These changes can be found on page 15, final paragraph of Section 7.1, and on pages 15-16, in the “Metric limitations” paragraph of Section 7.2.
Comments 5: The results section reports differences across models and prompting strategies, but the explanations for these differences remain somewhat limited.
A deeper analysis linking model characteristics, prompting design, and task complexity to the observed outcomes would strengthen the discussion.
Response 5: We agree that this part of the discussion was still too brief. We therefore expanded the discussion in two ways. First, we added a qualitative spot-check of archived finalist outputs to relate prompting design more directly to the visible structure of the generated scripts. In the revised manuscript, optional reasoning is described as more often yielding direct task-centered progression, whereas required reasoning more often separates room changes, transition scenes, and quiz nodes more explicitly. We also state more clearly that these observations support only a tentative link between prompt design and output form and do not justify strong claims about stable model tendencies. Second, we extended the complexity discussion to make clearer that the decline for larger seeds is not only a syntactic issue, but is closely tied to learning-goal fidelity and to the partly coupled station/task structure of the benchmark. These changes can be found on pages 14-15, final and following paragraphs of Section 7.1.
Comments 6: The manuscript provides very limited visualization or concrete examples of the generated games.
Including one or two representative examples or structural diagrams in the main text would help readers better understand the form and quality of the generated outputs.
Response 6: We agree. We therefore strengthened the example presentation by explicitly linking the appendix example output to a reduced beat graph that visualizes the progression structure of the same generated game. The revised manuscript now introduces this graph in the results text and briefly explains that correct answers advance the main path while wrong answers are routed into local retry loops that preserve overall playability. The corresponding graph is included as Figure A4. These additions can be found on page 12 and on page 20 in Figure A4.
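The progression structure described in this response can be sketched as a small directed graph. The node names, the dictionary layout, and the definition of playability used below are hypothetical; they do not reflect the actual format of Figure A4 or of the generated scripts. The sketch only illustrates how wrong answers can loop locally while the ending stays reachable.

```python
from collections import deque

# Hypothetical beat graph: each node maps an outcome label to the next node.
BEATS = {
    "intro": {"next": "quiz_1"},
    "quiz_1": {"correct": "quiz_2", "wrong": "retry_1"},
    "retry_1": {"next": "quiz_1"},  # local retry loop
    "quiz_2": {"correct": "ending", "wrong": "retry_2"},
    "retry_2": {"next": "quiz_2"},  # local retry loop
    "ending": {},
}


def reachable(graph, start):
    """Breadth-first search over all outgoing edges, returning visited nodes."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for target in graph[node].values():
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def is_playable(graph, start="intro", goal="ending"):
    """Assumed criterion: the goal is reachable from every node the player can visit."""
    return all(goal in reachable(graph, node) for node in reachable(graph, start))
```

Under this assumed criterion, a retry node that loops only onto itself would make the graph unplayable, whereas a retry node that returns to its quiz keeps the main path open.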
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
Thanks. My concerns are resolved, and I recommend accepting this paper.
Author Response
Thank you for your time and thoughtful review; we appreciate your recommendation to accept the paper.

