Review

Integrating Cognitive, Symbolic, and Neural Approaches to Story Generation: A Review on the METATRON Framework

by Hiram Calvo *, Brian Herrera-González and Mayte H. Laureano
Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN), Av. Juan de Dios Bátiz, Esq. Miguel Othón de Mendizábal, Col. Nueva Industrial Vallejo, Mexico City 07738, Mexico
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(23), 3885; https://doi.org/10.3390/math13233885
Submission received: 10 November 2025 / Revised: 26 November 2025 / Accepted: 2 December 2025 / Published: 4 December 2025
(This article belongs to the Special Issue Mathematical Foundations in NLP: Applications and Challenges)

Abstract

The human ability to imagine alternative realities has long supported reasoning, communication, and creativity through storytelling. By constructing hypothetical scenarios, people can anticipate outcomes, solve problems, and generate new knowledge. This link between imagination and reasoning has made storytelling an enduring topic in artificial intelligence, leading to the field of automatic story generation. Over the decades, different paradigms—symbolic, neural, and hybrid—have been proposed to address this task. This paper reviews key developments in story generation and identifies elements that can be integrated into a unified framework. Building on this analysis, we introduce the METATRON framework for neuro-symbolic generation of fiction stories. The framework combines a classical taxonomy of dramatic situations, used for symbolic narrative planning, with fine-tuned language models for text generation and coherence filtering. It also incorporates cognitive mechanisms such as episodic memory, emotional modeling, and narrative controllability, and explores multimodal extensions for text–image–audio storytelling. Finally, the paper discusses cognitively grounded evaluation methods, including theory-of-mind and creativity assessments, and outlines directions for future research.

1. Introduction

Stories have captivated people for centuries. Ancient evidence suggests that storytelling was practiced from the earliest times of humanity, primarily through oral transmission across generations. Oral narratives were, for many centuries, the predominant way of disseminating stories and preserving cultural knowledge [1]. The advent of written language marked a great advance for storytelling, allowing tales to be recorded and shared across distances and generations. The development of the printing press further proliferated written stories. In the last century, the rise of computers and the Internet has enabled stories of any origin to reach almost every corner of the earth. However, all these stories have something fundamental in common: they are the product of human imagination. This underscores the central role of creativity and imagination in the storytelling process. The importance of stories lies in the fact that virtually everything we know can be conveyed as a narrative—the story of someone or something [2]. Through narrative, complex ideas and experiences are structured in a form that is easier to comprehend and remember. Given the significance of storytelling in human cognition and communication, it is natural that artificial intelligence (AI) researchers have long been intrigued by the question: Can a computer be creative? One way to explore this question is by attempting tasks that seem to require creativity, such as the generation of fictional stories. If an AI system can produce compelling, novel stories, it could shed light on the capabilities and limits of machine creativity.
Automatic story generation has been an active area of research for several decades. Broadly, two paradigms have been pursued: the symbolic (or knowledge-driven) approach and the connectionist (data-driven neural network) approach. Each has its strengths and weaknesses, and to date there is no single solution or agreed-upon methodology for the story generation task. The challenges of automatic storytelling are multifaceted, involving natural language generation, narrative planning, world modeling, and even elements of cognitive psychology. A major challenge in story generation research is the lack of reliable automatic metrics to evaluate the quality of generated stories. While tasks like machine translation or summarization have well-defined evaluation metrics (e.g., BLEU, ROUGE), storytelling is inherently more subjective. It is difficult to automatically measure qualities like coherence, plot interestingness, or character believability, which often require human judgment. As a result, many works rely on human evaluations to assess story quality, although efforts have been made to propose proxy metrics. In addition, the terminology used to discuss story quality can vary between works, which can cause confusion. For consistency, in this article we will use the following definitions (aligned with recent literature):
  • Coherence: The logical connectivity and relevance between events in the story. A coherent story is one in which each event follows meaningfully from previous events.
  • Cohesion: The local fluidity and fluency of the text, including grammatical correctness and appropriate use of referential links (e.g., pronouns, conjunctions). This is often referred to simply as fluency.
  • Consistency: The adherence to established facts or rules within the story world and the preservation of character attributes or story facts over time. A consistent story does not violate its own established logic (e.g., a character’s behavior remains believable given what we know about them, or a previously dead character does not suddenly reappear without explanation).
  • Novelty: The creativity or originality of the story, often measured in terms of avoiding redundant or clichéd content. In computational terms, novelty is sometimes approximated by the diversity of generated text (e.g., using metrics like the proportion of unique n-grams).
  • Interestingness: The level of engagement or emotional impact the story can evoke in a reader. This is a highly subjective quality, associated with the story’s ability to surprise, entertain, or provoke thought. Few works attempt to explicitly optimize or measure interestingness, as it is difficult to quantify.
Recent advances in large pre-trained language models (PLMs) have greatly improved the cohesion (fluency) of generated stories. Models such as GPT-2, GPT-3, and others can produce grammatically correct and locally fluent text, which was a major challenge in earlier neural generation approaches. Indeed, it is now relatively easy for an AI to produce story-like text that reads like natural language; however, these models still struggle with coherence and consistency on a larger scale: they may introduce events that are irrelevant or contradictory to prior events, or drift off-topic for longer stories.
Maintaining a long-range narrative thread is non-trivial for sequence-predictive models that lack an explicit representation of plot or world state. Notably, while some works have considered coreference coherence (i.e., keeping track of which pronouns refer to which entities) as part of coherence, in this work we focus on the broader narrative coherence and treat coreference resolution as outside our current scope. To date, there is no widely accepted automatic metric that directly measures narrative-level coherence or consistency. Some proxy metrics have been used: for example, perplexity (a language model likelihood measure) correlates with fluency/cohesion and is sometimes taken as an indirect indicator of quality [3]. For diversity/novelty, metrics like Distinct-n [4] (the proportion of unique n-grams in generated text) and Repetition-n (the fraction of texts that contain a repeated n-gram) are used to ensure the model is not simply regurgitating the same phrases. However, high diversity does not automatically equate to a good story—it must be balanced with coherence and consistency.
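To make these proxy metrics concrete, the following minimal Python sketch computes Distinct-n and Repetition-n exactly as defined above. Whitespace tokenization is a simplifying assumption of this sketch; published implementations may tokenize and normalize differently.

```python
from typing import List

def distinct_n(texts: List[str], n: int = 2) -> float:
    """Distinct-n: number of unique n-grams divided by total n-grams."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.split()  # simplifying assumption: whitespace tokenization
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

def repetition_n(texts: List[str], n: int = 4) -> float:
    """Repetition-n: fraction of texts containing at least one repeated n-gram."""
    repeated = 0
    for text in texts:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if len(ngrams) != len(set(ngrams)):
            repeated += 1
    return repeated / len(texts) if texts else 0.0
```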
Measuring interestingness remains particularly elusive. Some research has attempted to incorporate affective or thematic control to indirectly produce more engaging narratives (e.g., ensuring an emotional arc in the story), but there is no standard metric for reader engagement. Ultimately, human evaluation is required to judge qualities like interestingness or overall narrative quality. Common human evaluation criteria include rating a story’s coherence, consistency, and enjoyment on Likert scales, or ranking stories from different systems.
Our hypothesis is that combining symbolic structuring with neural language generation can leverage the strengths of both approaches: the symbolic part can handle high-level planning and consistency (ensuring a coherent sequence of events and respect for narrative constraints), while the neural part can handle low-level language realization (producing fluent and novel text). This hybrid, or neurosymbolic, strategy aims to produce stories that are not only readable but also coherent, consistent, and engaging.
This survey brings together the diverse threads of research on symbolic, neural, and cognitive approaches to story generation, integrating them into a unified perspective that culminates in the proposed framework. The name METATRON encapsulates this integrative ambition on several levels. At a technical level, it derives from the Greek prefix μετά (“beyond” or “after”) and the suffix τρον (“instrument” or “device”), suggesting a meta-system that operates above other narrative generators—coordinating symbolic control and neural language modeling. At a narrative level, the term alludes to Metatron, the celestial scribe in mystical and apocryphal traditions, a mediator between divine abstraction and human language. This metaphor mirrors the framework’s aim of translating structured knowledge and affective cues into coherent stories. Finally, at a cognitive level, the name emphasizes reflective storytelling: aiming towards a system that not only generates narratives but also models, at a meta-level, the reasoning processes underlying creative narration.
In the following sections, we survey the state of the art in story generation (Section 2), describe our proposed framework in detail (Section 3), and then examine three specific dimensions where our approach introduces novel considerations beyond previous works: narrative controllability (Section 4), memory and emotional modeling (Section 5), and cognitive and creative dimensions of AI-generated narratives (Section 6). We discuss architectural challenges and practical considerations (Section 7), and finally we conclude with a discussion of future work (Section 8).

2. State of the Art

Approaches to automatic story generation can be broadly categorized into symbolic (knowledge-based) methods and connectionist (neural network-based) methods. Early work in the field was predominantly symbolic, drawing on techniques from planning, knowledge representation, and cognitive modeling. In the last decade, neural approaches, especially those using deep learning and large language models, have become dominant. Each paradigm addresses the challenges of story generation differently.

2.1. Symbolic Approaches to Story Generation

Symbolic story generation methods typically involve explicit representations of narrative knowledge and often use planning algorithms or heuristics to construct a story. One classic example is the TALE-SPIN system by Meehan [5], which attempted to tell simple fables by simulating characters’ problem-solving behavior. Another early system, UNIVERSE by Lebowitz [6], used predefined plot fragments that could be assembled to create a story in a particular genre (soap operas, in that case). These systems operated in very limited domains with manually crafted knowledge bases.
A particularly influential symbolic approach was the use of case-based reasoning and story analogies. Turner [7] developed the MINSTREL system, which generated new King Arthur stories by adapting old ones. MINSTREL introduced the idea of transformational creativity in storytelling: the system had rules for creatively modifying existing story structures (so-called “story skeletons”) to meet certain authorial goals or constraints. For example, if the system was tasked with creating a story that illustrated a moral or a particular emotional effect, it could retrieve a known story with a similar theme and transform elements (characters, plot points) while preserving the underlying structure. MINSTREL’s successes demonstrated that, given a well-structured knowledge base of narrative archetypes, a computer could recombine pieces in novel ways that still yield a coherent story structure [8].
Another landmark system was MEXICA by Pérez-y-Pérez and Sharples [9]. MEXICA generates short stories about the Mexica (Aztec) civilization, using a cognitive model called Engagement-Reflection. In the engagement phase, the system iteratively adds events to the story, guided by emotional and tension trajectories. In the reflection phase, it evaluates the story so far against narrative knowledge and revises it if necessary to ensure coherence and interest. MEXICA requires a set of predefined story actions and a set of seed stories from which it builds a knowledge base of plot patterns and story-world contexts. A story-world context (SWC) encodes the state of the narrative in terms of relationships between characters (e.g., friendship, enmity) and unresolved goals or tensions. Using these SWCs, MEXICA can determine what story action might naturally follow the current state (for example, if a character’s friend was harmed, a next action might be to seek revenge, which is coherent with a high tension state). MEXICA’s stories, while limited in theme, are remarkably coherent and have a clear dramatic arc, owing to the built-in model of narrative tension.
Both MINSTREL and MEXICA rely on handcrafted knowledge and are restricted to specific domains (Arthurian legend and Aztec stories, respectively); however, they often outperform purely neural approaches in certain narrative qualities: their stories rarely contain logical contradictions, and the high-level plot is clearly structured with a beginning, middle, and end. In terms of the criteria we listed earlier, these symbolic systems excel in coherence and consistency; novelty is limited by the fact that they recombine known elements (though creative transformation rules can introduce some novelty), and interestingness is subjective but arguably improved by their attention to dramatic structure.
Another line of symbolic approaches involves formal planning. Riedl and Young [10] and others introduced narrative planning techniques where story generation is treated as a planning problem: the goal is to find a sequence of actions (events) that transition from an initial state to a goal state (for example, a happy ending) while respecting certain constraints (character believability, causal feasibility). The output of a planner is a plot outline that can then be rendered into natural language. The advantage of planning is that it inherently ensures causal coherence: each action is chosen because it helps achieve some character goal or story goal, making the sequence logically motivated. For instance, the story planner IPOCL (Intent-Driven Partial Order Causal Link planner) [10] explicitly ensures that every character action is motivated by that character’s intent (so characters act believably) and that there are no dangling unresolved plot points. A limitation of planning-based stories is that they can sometimes feel overly logical or mechanical, and the language output (if not carefully constructed) may lack stylistic flair. However, hybridizing planners with learned models (for surface realization) is a promising direction.
Symbolic approaches provide explicit control over narrative structure and are strong at maintaining global coherence and consistency. However, they struggle with language fluency and tend to be domain-specific, requiring labor-intensive knowledge engineering. These limitations opened the door for data-driven neural approaches.

2.2. Neural Approaches to Story Generation

Neural (connectionist) approaches treat story generation as a sequence prediction problem, often training on large datasets of stories to learn to predict the next sentence or next word given the preceding text. Early neural models, based on recurrent neural networks (RNNs) or LSTMs, could generate local sequences of text but often lost coherence after a few sentences. The introduction of the Transformer architecture [11] and the availability of massive pre-training corpora led to a major advance in the quality of generated text. Models like GPT-2 [12] and GPT-3 [13] demonstrated the ability to generate paragraphs of fluent, contextually relevant text, simply by predicting one word at a time. However, as researchers soon discovered, fluency does not guarantee narrative coherence or consistency. A large language model (LLM) like GPT-3, with billions of parameters, has effectively absorbed a tremendous amount of world knowledge and can imitate various writing styles. Yet, without additional guidance, it has no guarantee of sticking to a logical plot. For example, GPT-3 may introduce a new character halfway through the story and then forget about them, or it may change a character’s name or attributes across paragraphs. These failures are due to the model’s probabilistic nature: it produces the most likely continuation at each step, which locally looks plausible, but there is no mechanism to enforce global constraints or remind the model of long-term dependencies beyond its immediate context window. A number of recent works have tried to address these shortcomings of neural models:
Commonsense and Knowledge Augmentation. One line of work augments neural story generators with external knowledge to make them more coherent and logical. Guan et al. [3] introduced a system that pre-trains GPT-2 on sentences derived from commonsense knowledge graphs (ConceptNet and ATOMIC) before fine-tuning on a story dataset. The idea is to inject commonsense relationships (e.g., cause and effect, preconditions for events) into the language model. Their model, Knowledge-Enhanced GPT-2, produced stories that were judged to be more coherent than those from a baseline GPT-2 fine-tuned only on stories. This suggests that one reason for incoherence in neural stories is a lack of real-world knowledge or commonsense; by addressing that, the model is less likely to produce nonsensical event sequences. Other works have used knowledge graphs [14] or inferred script knowledge (sequences of events that commonly happen) to guide story generation.
Two-Stage Generation (Plan then Write). Inspired by the success of symbolic planners, some neural approaches explicitly break the task into two stages—first generating a high-level plan or outline, then generating the full story. Yao et al. [15] proposed a “plan and write” model that generates a sequence of keywords as a plan and then conditions on those keywords to generate the story. Similarly, Fan et al. [16] introduced hierarchical story generation: first produce a short synopsis of the story (a few sentences), and then generate the story in detail conditioned on that synopsis. These methods enforce a form of top-down coherence: by conditioning on an outline or synopsis, the model is guided to stick to the main events or theme. Empirically, stories generated with a plan tend to be more focused and logically connected than those generated in a single pass from an initial prompt. Recent advancements in prompt engineering for LLMs also echo this approach—in practice, one can prompt an LLM like GPT-4 to first output an outline for a story and then elaborate it, which often yields better structure.
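As an illustration of this plan-then-write pattern, the sketch below queries a language model twice: once for an outline and once for the story conditioned on that outline. The generate argument is a hypothetical text-in/text-out helper standing in for any LLM call; it is not a specific vendor API.

```python
def plan_and_write(premise: str, generate) -> str:
    """Two-stage generation: plan first, then write conditioned on the plan.
    `generate` is a hypothetical helper wrapping any LLM text completion call."""
    # Stage 1: ask only for a high-level plot outline.
    outline = generate(
        "Write a numbered five-point plot outline for a short story about: "
        f"{premise}\nOutput only the outline."
    )
    # Stage 2: condition the full story on the outline to enforce
    # top-down coherence.
    return generate(
        f"Premise: {premise}\nOutline:\n{outline}\n\n"
        "Write the full story, following the outline event by event."
    )
```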
Discriminators and Reranking. Another approach incorporates a second model (often a BERT-like classifier) to evaluate or rerank candidate continuations. For example, the INTERPOL model by Wang et al. [17] generated multiple continuations for a given story context using GPT-2 and then used a RoBERTa-based coherence classifier to choose the best continuation (the one that maximizes narrative coherence with the context). This is analogous to a generate-and-test paradigm: the generator proposes and a critic judges. It was found to improve the overall coherence of stories, since blatantly incoherent continuations could be filtered out. Reranking based on learned metrics (e.g., a classifier fine-tuned to predict whether a story sequence is in the right order) is a way to inject some logical constraint without hard-coding rules.
Entity and Memory Tracking. Ensuring consistency, especially with regard to characters and entities, has been tackled by giving the model an explicit memory of what has happened. For instance, Clark et al. [18] used dynamic entity representations that update as the story progresses, and conditioned the next-sentence generation on these representations. If the state of an entity is known (e.g., Character A is in location X and is hungry), the next sentence is more likely to be consistent (Character A might look for food, rather than being described as full). Memory networks and recurrent state tracking have been employed to help the model “remember” previously mentioned facts and conditions [19,20]. More recently, approaches that explicitly manage or compress the evolving narrative state have been explored to mitigate the limited context window of neural language models. In dialogue systems, for instance, history compression and summary-based context management have been used to maintain coherence across long interactions [21]. Likewise, narrative-generation systems with explicit plot-state tracking—such as PlotMachines—model the evolving structure of a story through an external state representation, allowing the system to recover earlier events even when they fall outside the immediate input context [22]. These help mitigate the issue often termed “lost in the middle,” where a model forgets events that occurred more than a few thousand tokens ago in a long narrative [23].
Controlling Style and Content. Researchers have also looked at making generation controllable with respect to certain dimensions. For example, Peng et al. [24] sought controllable story generation by allowing the user to specify certain events or the ending of the story, and the model would fill the rest. Tambwekar et al. [25] used reinforcement learning to shape stories toward achieving a given goal or moral, by defining a reward that encourages sequences of events leading to that goal. Others have incorporated sentiment or emotion trajectories that the story should follow [26,27], as a way to make stories more engaging or fall into a desired genre (e.g., tragedy vs. comedy).
In terms of evaluation, neural story generation has mostly relied on automatic metrics like BLEU (comparing to a reference story, which is quite limited for this creative task), perplexity (lower perplexity indicates the story is more like the training data in terms of fluency), and diversity metrics like Distinct-n. However, as discussed, these do not capture coherence or interestingness. Hence, human evaluation is common: typically, evaluators read a set of stories (presented in random order without knowing which system produced which) and rate them on coherence, fluency, and overall quality, or choose which of two stories is better. This has shown that despite high fluency, pure neural models often lag behind symbolic or hybrid ones in coherence [3,28].
A comparative overview of the main symbolic, neural, and hybrid approaches discussed in this section is presented in Table 1. The state of the art in story generation circa the early 2020s had evolved toward hybridizing neural models with various forms of control or knowledge injection [3,15,16,17,22,25,29,30,31].
More recent multimodal and controllable systems (2023–2025) further advance automatic narrative generation by integrating vision–language models, long-context transformers, and structured control mechanisms. Frameworks such as SEED-Story [38] demonstrate long-form multimodal storytelling with cross-modal grounding; StoryLLaVA [39] enables visually grounded narrative generation through LMMs with stepwise decomposition; and StoryBox [40] provides a modular multimodal sandbox for constrained narrative exploration. Additional progress in large-scale controllable generation, including Megatron-LM infrastructure for guided decoding [41], has facilitated stronger integration between symbolic planning and LLM-based generation. These developments illustrate the field’s movement toward architectures where symbolic structure, multimodal grounding, and neural generation operate jointly—an evolution that directly motivates the neuro-symbolic design underlying our METATRON framework.
As summarized in Table 2, symbolic, neural, and neuro-symbolic approaches differ in crucial aspects such as coherence, consistency, creativity, and scalability—differences that directly motivate the hybrid design of METATRON. Yet, even within this landscape, certain dimensions remain underexplored. For example, most systems presume that coherence and fluency suffice to ensure reader engagement, overlooking qualities like surprise, suspense, or emotional resonance. These facets of interestingness are not easily captured by structural alignment alone. Likewise, the cognitive plausibility of characters—whether their actions reflect their beliefs, goals, and knowledge—remains largely unaddressed by current models. Human authors naturally incorporate theory of mind into narratives, whereas AI systems typically lack such representations. Evaluating whether a story “makes sense” thus often requires deeper modeling of character intentionality and knowledge states, a challenge we examine in Section 6.

3. The METATRON Framework

This section synthesizes prior work into a neuro-symbolic reference architecture for automatic short-story generation and uses METATRON as a framework that instantiates the main components. It is composed of several modules that together handle different aspects of the task: high-level plot structuring, low-level sentence generation, coherence maintenance, and even multimodal enrichment (through images and audio). In this section, we describe each module and then how they interact in the overall pipeline (see Figure 1). The key idea is to use symbolic knowledge to generate a story skeleton—a sequence of major events or situations—and then use neural models to flesh out those events into natural language text, with iterative refinement to ensure that the narrative flows coherently from one event to the next. We also incorporate a cognitive-oriented emotional model to give the story an affective arc, laying the groundwork for adding imagery and sound to the narrative experience.

3.1. General Architecture of the METATRON Framework

Figure 1 summarizes the general architecture of METATRON as a numbered computational pipeline. The process starts at step (0), which initializes the pipeline and passes control to the Outline generation module. In step (1), this module receives a dramatic situation encoded as BCO symbols from the top trapezoid, while step (2) retrieves structured knowledge from the knowledge base (bottom cylinder). Using these two inputs, the outline module produces two complementary symbolic representations: in step (3), an AVM-level outline (cyan trapezoid), and in step (4), a phrase-level outline (red trapezoid). These trapezoidal nodes denote intermediate symbolic structures that are not yet full narrative units.
In the next stage, both outlines are expanded into full narrative layers. Step (5) turns the AVM outline into a Full story AVM (cyan rectangle), while step (6) turns the phrase-level outline into a Full story Id-text (red rectangle). Rectangles thus represent complete textual or semi-textual story layers that are ready for downstream processing.
The Narrative interpolation module (dark red rectangle) receives these two full-story representations via steps (7) and (8), which provide the AVM stream and the text stream, respectively. Inside this module, step (9) interpolates AVMs and step (10) interpolates phrases, producing interpolated AVMs and interpolated phrases, again shown as cyan and red trapezoids to emphasize that they remain structural units rather than final surface text.
Colors encode the two parallel processing tracks and the role of each module: cyan corresponds to the AVM-based symbolic track (attribute–value matrices), red to the phrase-level or Id-text track, purple to the central outline generation module that initiates both tracks, green to the pipeline start, and dark red to the semantic–structural interpolation stage.
Finally, dashed arrows implement iterative feedback loops. In step (11), interpolated AVMs can feed back into the Full story AVM layer, while in step (12) interpolated phrases can refine the Full story Id-text layer. These loops allow the system to correct, regenerate, and realign earlier stages, enforcing structural and semantic coherence across the evolving narrative.

3.2. Module Descriptions

The first module is the Attribute-Value Matrix (AVM) and Beginning-Climax-Outcome (BCO) Generator. This module is responsible for creating an initial story sketch in terms of key situations and their narrative roles (beginning, climax, outcome). We leverage the classical taxonomy of dramatic situations proposed by Georges Polti [42] as a foundation for plot generation. Polti’s book, The Thirty-Six Dramatic Situations, catalogues 36 basic situations that he claimed encompass the majority of dramatic stories. Each situation is defined by certain roles and dynamics (for example, “Supplication” involves a Persecutor, a Suppliant, and a Power in authority whose decision is doubtful). Polti also provided sub-classifications and examples for each situation, often drawn from literature and theater. Mike Figgis [43] revisited Polti’s list from a modern perspective, making updates such as gender-neutral roles and adaptation to film narrative structures, but the essence remains similar.
Polti’s dramatic situations can serve as templates for generating story outlines. A random draw from these 36 situations (and their subtypes) provides a basic dramatic scenario, which can then be instantiated with specific characters, settings, and conflicts. A systematic way to achieve this is through Typed Feature Structures [44,45,46], which represent narrative situations in a structured format.
In this representation, an Attribute–Value Matrix (AVM) encodes attributes such as:
  • SituationType: one of Polti’s 36 dramatic situations (and possible subtype).
  • Characters: roles involved (e.g., Protagonist, Antagonist, Victim).
  • Setting: the environment or location where the situation unfolds.
  • Object: a key element or MacGuffin central to the situation.
  • Action: the core event or interaction (e.g., Revenge, Rescue, Seduction).
  • Why: a causal or motivational link to other situations or backstory elements.
  • When: temporal ordering or dependency (e.g., occurring after another situation).
  • Emotion: the dominant affective tone (e.g., jealousy, betrayal, sacrifice).
  • Outcome: a label indicating whether the AVM represents a beginning, climax, or resolution within the overall story arc.
Figure 2 illustrates a sample AVM instantiated for Polti’s situation 5, “Pursuit” (subtype A: Fugitives from Justice). The structure includes a Pursuer (e.g., a detective) and a Fugitive (e.g., a thief), a Setting such as “a bustling medieval marketplace”, an Action like “chasing through the crowds”, and an Emotion of high-intensity fear and desperation. The Outcome field remains unspecified, as this instance could serve within the story’s opening or middle sections.
The population of these Attribute–Value Matrices (AVMs) can be achieved through a combination of knowledge-based resources and controlled randomness to ensure diversity across generated narratives. The underlying knowledge base typically includes lists of character archetypes (e.g., hero, villain, authority figure), settings organized by genre (e.g., futuristic city, rural village, pirate ship), and other contextual elements. After selecting one of Polti’s thirty-six dramatic situations—either randomly or based on a user-specified theme or genre—the corresponding template information determines which roles are required. For example, Polti’s situation 5 (Pursuit) generally involves two main roles: a Punisher and a Fugitive. These roles can then be instantiated with specific character types drawn from the knowledge base (e.g., Punisher = “royal guard”, Fugitive = “rebellious wizard”, in a fantasy context). Each instantiated AVM can further include an approximate emotional tone and causal links (Why) connecting it to previous or subsequent situations—such as linking the pursuit to a prior crime corresponding to situation 3 (Crime Pursued by Vengeance).
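A minimal sketch of how such an AVM could be represented and instantiated is given below. The field names mirror the attribute list above, while the toy knowledge base and the specific role fillers are illustrative assumptions rather than the framework's actual resources.

```python
import random
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class AVM:
    situation_type: str            # one of Polti's 36 situations (plus subtype)
    characters: Dict[str, str]     # role -> instantiated character type
    setting: str
    action: str
    emotion: str
    obj: Optional[str] = None      # key element or MacGuffin ("Object" attribute)
    why: Optional[str] = None      # causal/motivational link to another situation
    when: Optional[str] = None     # temporal ordering or dependency
    outcome: Optional[str] = None  # beginning / climax / resolution label

# Toy knowledge base for illustration only; the framework assumes richer,
# genre-organized lists of archetypes, settings, and contextual elements.
KB = {
    "Pursuit": {
        "roles": ["Pursuer", "Fugitive"],
        "characters": {"Pursuer": ["royal guard", "detective"],
                       "Fugitive": ["rebellious wizard", "thief"]},
        "settings": ["a bustling medieval marketplace", "a fog-bound harbor"],
        "actions": ["chasing through the crowds"],
        "emotions": ["high-intensity fear and desperation"],
    },
}

def instantiate_avm(situation: str, rng: random.Random) -> AVM:
    """Fill an AVM for a chosen dramatic situation with randomly drawn fillers."""
    entry = KB[situation]
    chars = {role: rng.choice(entry["characters"][role]) for role in entry["roles"]}
    return AVM(situation_type=situation,
               characters=chars,
               setting=rng.choice(entry["settings"]),
               action=rng.choice(entry["actions"]),
               emotion=rng.choice(entry["emotions"]))

avm = instantiate_avm("Pursuit", random.Random(42))
```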
Once an AVM is completed, a situational script can transform it into three concise sentences representing the Beginning, Climax, and Outcome (BCO) of that situation. This BCO structure is inspired by classical narrative theory, where the beginning establishes context and characters, the climax introduces the highest point of tension, and the outcome resolves the conflict. Although Polti’s situations describe central conflicts, each one can be viewed as a self-contained micro-story with its own narrative arc.
Template sentence structures can be pre-defined for each dramatic situation and, where relevant, its subtypes, emphasizing these three pivotal moments. For instance, for situation 5 (Pursuit, subtype A: Fugitives from Justice), the corresponding templates could be:
  • Beginning: “A <Fugitive> is on the run after committing <Crime>, and the <Authority> has dispatched a relentless <Pursuer> to hunt them down.”
  • Climax: “The chase leads them to <Climactic Setting>, where the <Pursuer> corners the <Fugitive> amid rising tension and a crowd of onlookers.”
  • Outcome: “In the end, <Outcome of Pursuit>—the <Fugitive> is brought to justice, and the long pursuit reaches its conclusion.”
These template sentences include placeholders (e.g., <Fugitive>) that can be filled automatically using values from the AVM, such as “the rebellious wizard” or “the Captain of the Guard.” Templates of this type have been devised for each of Polti’s situations, drawing from his original descriptions and subsequent modern reinterpretations [42,43]. Each template aims to encapsulate the essential dramatic tension of its respective situation in a single, self-contained sentence.
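The placeholder-filling step itself reduces to simple slot substitution, as in the sketch below; the template strings paraphrase the Pursuit templates above, and the slot values are illustrative.

```python
PURSUIT_TEMPLATES = {
    "Beginning": "A {Fugitive} is on the run after committing {Crime}, and the "
                 "{Authority} has dispatched a relentless {Pursuer} to hunt them down.",
    "Climax": "The chase leads them to {ClimacticSetting}, where the {Pursuer} "
              "corners the {Fugitive} amid rising tension and a crowd of onlookers.",
    "Outcome": "In the end, {OutcomeOfPursuit}; the {Fugitive} is brought to "
               "justice, and the long pursuit reaches its conclusion.",
}

def realize_bco(templates: dict, slots: dict) -> list:
    """Fill each phase template with AVM-derived slot values."""
    return [templates[phase].format(**slots)
            for phase in ("Beginning", "Climax", "Outcome")]

bco = realize_bco(PURSUIT_TEMPLATES, {
    "Fugitive": "rebellious wizard", "Crime": "the theft of the crown jewels",
    "Authority": "royal court", "Pursuer": "Captain of the Guard",
    "ClimacticSetting": "a bustling medieval marketplace",
    "OutcomeOfPursuit": "the wizard finally surrenders",
})
```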
Figure 3 illustrates the transformation from a structured AVM into its corresponding BCO representation. The AVM encodes the abstract narrative schema, and the situational script maps this information into natural language. The resulting output functions as a story outline—three sentences that define the onset, turning point, and resolution of a potential narrative. At this stage, the output constitutes a skeletal story aligned with a known dramatic pattern. While some micro-narratives or fables could stand on their own in this form, such outlines are generally too sparse and abstract to be fully engaging. Subsequent modules therefore expand and enrich this preliminary outline into a coherent and expressive short story.
The next stage in the narrative pipeline might involve a Fictional and Semi-Coherent Generator, a neural language model based on the Transformer architecture that produces short segments of fictional text with stylistic and local contextual coherence. The term semi-coherent denotes a level of generation where continuations are contextually appropriate and stylistically consistent within a short window of preceding text, but where long-range narrative coherence is not guaranteed. Broader structural and causal consistency across the entire story is typically ensured through additional control mechanisms described in later modules.
Training such generators generally follows a two-phase strategy (illustrated in Figure 4). In the first phase, the model is pre-trained or fine-tuned on a large corpus of narrative fiction, such as novels, short stories, or other narrative forms spanning multiple genres. This extensive exposure enables the model to internalize the general stylistic and structural conventions of storytelling, including narration, dialogue, description, and pacing. Publicly available corpora such as Project Gutenberg or the BookCorpus have often been employed for this purpose, as they provide sufficient coverage of diverse literary forms and registers to instill a recognizable “storytelling voice.”
In the second phase, the model may undergo fine-tuning on a more focused Causal Dataset designed to emphasize temporal and causal coherence between adjacent sentences. This specialized dataset can be either extracted or synthesized from existing narrative resources. One widely used foundation is the ROCStories corpus [47], a collection of five-sentence everyday stories characterized by clear causal and temporal progression. To strengthen the model’s sensitivity to causal relations, training instances may be augmented with explicit connectives such as “because of that,” “as a result,” or “after,” producing examples like: “Event A happens. Therefore, Event B follows.” or “After X, Y.” Additional samples may include discourse markers such as however, meanwhile, or subsequently, which guide the model in learning appropriate discourse transitions.
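A minimal sketch of this connective-based augmentation, assuming each story is available as a list of sentences, might look as follows; the connective inventories shown are examples, not a fixed specification.

```python
import random

CAUSAL = ["Because of that,", "As a result,", "Therefore,"]
TEMPORAL = ["After that,", "Subsequently,", "Meanwhile,"]

def augment_with_connectives(sentences, rng=None):
    """Join adjacent story sentences with explicit causal/temporal connectives
    to create training pairs that emphasize local coherence."""
    rng = rng or random.Random(0)
    pairs = []
    for prev, nxt in zip(sentences, sentences[1:]):
        connective = rng.choice(CAUSAL + TEMPORAL)
        pairs.append(f"{prev} {connective} {nxt[0].lower()}{nxt[1:]}")
    return pairs

story = ["Tom missed the bus.", "He arrived late to work.", "His boss was upset."]
examples = augment_with_connectives(story)
```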
Through this two-phase learning process, the model is enabled to develop a stylistic and structural grounding in fictional prose while acquiring a local sense of logical continuity. The resulting system tends to generate continuations that are both fluent and narratively plausible at the sentence-to-sentence level. Nonetheless, its coherence remains semi-local, limited by the model’s contextual window and memory constraints, meaning that details or causal links extending beyond that window may not be consistently maintained. This controlled limitation is addressed in subsequent components of the architecture, which provide mechanisms for global coherence and narrative-level consistency.
The following module is a Coherence Filter. Its role is to ensure inter-sentence coherence and to filter out any candidate sentences that do not fit well in the narrative context. We implement this as a masked language model (MLM) trained as a discriminator or scoring function. Unlike the autoregressive generator, which only looks at left context when generating the next token, a masked model (like BERT [48] or RoBERTa [49]) can consider both left and right context of a gap. This makes it suitable for judging how well a sentence fits between a preceding and following sentence.
To train the coherence filter, two complementary datasets can be constructed:
  • Positive Dataset: A collection of coherent story triples (previous sentence, candidate sentence, next sentence) drawn from real narratives. For this purpose, corpora such as ROCStories (short commonsense stories) and WritingPrompts (longer, creative stories from Reddit) are often used. Sequences of three consecutive sentences are extracted from these stories, and the middle sentence is treated as a “good” candidate given its neighboring context.
  • Negative Dataset: A set of incoherent triples generated by introducing controlled perturbations into authentic stories, following the approach of Wang et al. [28]. These perturbations include:
    Shuffling: Selecting three consecutive sentences from a story but replacing or swapping the middle one with a sentence from a different location, thus breaking logical continuity.
    Irrelevant Insertion: Inserting a random sentence—taken from another story or context—between two adjacent sentences, typically introducing unrelated entities or topics.
    Repetition: Reusing the preceding sentence or part of it as the middle sentence, modeling incoherence caused by redundancy (a phenomenon often flagged by human evaluators as poor narrative flow).
    Contradiction/Opposition: Introducing a middle sentence that contradicts its context (e.g., if the first sentence states “John was alive,” a contradictory continuation might read “Everyone was mourning John’s death”). Such oppositions produce clearly inconsistent narrative segments.
Using these datasets, a BERT-like model can be fine-tuned to assign a coherence score to a triple of sentences. One practical method concatenates the sequence [previous] [candidate] [next] with separator tokens and trains the model to output a binary label distinguishing coherent from incoherent samples. Alternatively, the model can be trained in a masked language modeling setup, predicting the middle sentence or using its likelihood as a proxy for coherence. A ranking-based approach is often preferred: the model learns to assign higher coherence scores to true middle sentences than to their corrupted counterparts, given identical surrounding context. During story generation, this coherence filter evaluates candidate intermediary sentences between a preceding and following context, determining which candidate provides the most plausible connective link. This mechanism is particularly effective for narrative interpolation, as described in the subsequent section.
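For illustration, a triple-scoring coherence filter of this kind can be built on a standard sequence classifier, as sketched below with the Hugging Face transformers library. The checkpoint path is a placeholder for a model fine-tuned on the positive and negative triples described above, and treating class index 1 as "coherent" is an assumption of this sketch.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder path: a RoBERTa/BERT-style classifier fine-tuned on coherent vs.
# perturbed (shuffled, irrelevant, repeated, contradictory) sentence triples.
CHECKPOINT = "path/to/coherence-filter"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

def coherence_score(prev: str, candidate: str, nxt: str) -> float:
    """Probability that `candidate` coherently bridges `prev` and `nxt`."""
    # Concatenate [previous] [candidate] [next] with separator tokens.
    text = f"{prev} {tokenizer.sep_token} {candidate} {tokenizer.sep_token} {nxt}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # assumed: index 1 = coherent
```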
Figure 5 illustrates the training regime for the masked coherence model. Story contexts are fed with masked spans covering the transition between sentences, and the model is trained to predict the correct connective or to rank the true sentence order above alternatives. Through this process, the model would be able to capture linguistic and narrative cues that signal smooth continuity—such as tense agreement, consistent referential expressions, and coherent causal or temporal connectives.
With the above modules in place (outline generator, neural sentence generator, coherence filter), it is possible to apply narrative interpolation in the Fictional and Coherent Interpolation Module. This is inspired by the work of Wang et al. [17] on narrative interpolation, adapted to our framework. The goal of narrative interpolation is to fill in gaps between key plot points (in our case, between the BCO outline sentences) with additional narrative that connects them smoothly. Given that our outline generator produced something like [B1, C1, O1] (one sentence each for the beginning, climax, and outcome of the story), we want to expand this to a full story: [B1, s2, s3, s4, C1, s5, s6, s7, O1], for example (where the lowercase s2, s3, etc., are the interpolated sentences). We chose to insert three sentences between B and C, and three between C and O in our design, yielding a nine-sentence story (1 + 3 + 1 + 3 + 1 = 9) if we keep B, C, and O themselves. In practice, one could decide on different lengths or even a variable number of interpolations.
The interpolation process works iteratively. Suppose we have two sentences that we want to bridge: an “Alpha” (prior context) and an “Omega” (later context). Initially, Alpha could be the Beginning sentence and Omega the Climax sentence. We want to generate a Beta sequence (one or more sentences) that logically goes between Alpha and Omega. For now, consider generating a single sentence Beta that goes between them. We feed the Alpha sentence into the fictional generator which will propose several candidate next sentences.
Specifically, we can sample or beam-search to get, say, N = 10 candidate continuations. Now, these continuations are unconstrained beyond the context of Alpha, so many of them may not lead toward Omega in a sensible way. This is where the coherence filter comes in: we use the filter to score each candidate Beta in the context of “Alpha … Omega.” Essentially, we are asking: if Beta were the sentence before Omega, how coherent is the sequence Alpha → Beta → Omega? The coherence filter provides a score (or probability of being coherent) for each Alpha–Beta–Omega triple. We then pick the Beta with the highest score. This Beta is considered the “best” way to connect Alpha to Omega among the sampled options, according to our learned coherence model. We accept this Beta as part of the story. Now we have one more sentence in our story: Alpha, Beta, Omega (where Omega was originally the Climax). But since we intend to insert three sentences, we will actually repeat this process to gradually fill in multiple sentences. One approach is shown in Algorithm 1, which performs greedy narrative interpolation. However, this procedure can propagate local errors—if the first inserted sentence is slightly off, subsequent interpolations may drift semantically. To mitigate this, Algorithm 2 introduces a beam-search variant that jointly optimizes local and global coherence scores.
Algorithm 1: Greedy Narrative Interpolation
(Pseudocode shown as an image in the original; see the sketch below.)
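A minimal Python sketch of the greedy procedure, reconstructed from the description above, is shown here; generator and scorer are hypothetical stand-ins for the fictional generator and the coherence filter.

```python
def greedy_interpolate(alpha: str, omega: str, k: int, generator, scorer, n: int = 10):
    """Insert k sentences between alpha and omega, committing greedily at each
    step to the candidate that best bridges the current context toward omega."""
    inserted, context = [], alpha
    for _ in range(k):
        # Propose n unconstrained continuations of the current left context.
        candidates = generator(context, num_candidates=n)
        # Score each (context, beta, omega) triple and keep the best beta.
        best = max(candidates, key=lambda beta: scorer(context, beta, omega))
        inserted.append(best)
        context = f"{context} {best}"  # accepted sentence extends the left context
    return inserted
```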
An alternative interpolation strategy consists of binary splitting: identifying an intermediate point that lies approximately halfway, in narrative terms, between an initial sentence (Alpha) and a target sentence (Omega), and then recursively filling the intervening segments. However, for simplicity, many implementations adopt a greedy left-to-right interpolation procedure similar to that described by Wang et al. [17]. During each interpolation step, the system updates its attribute–value matrices and any associated state-tracking structures to maintain narrative consistency. For instance, when a candidate intermediate sentence (Beta) is proposed, entity references can be examined: does Beta introduce a new character absent from Alpha or Omega? If so, the coherence filter may implicitly penalize this inconsistency, as introducing unmotivated entities typically reduces the coherence score. More explicitly, an entity-tracking penalty could be integrated into the scoring function to discourage abrupt character introductions or discontinuities across scenes.
Algorithm 2: Narrative Interpolation with Beam Search and Re-Scoring
(Pseudocode shown as an image in the original; see the sketch below.)
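The beam-search variant can be sketched analogously; the beam width, candidate count, and global re-scoring heuristic (mean coherence over all adjacent triples) are illustrative choices of this sketch, not prescribed settings.

```python
def beam_interpolate(alpha: str, omega: str, k: int, generator, scorer,
                     n: int = 10, beam_width: int = 3):
    """Keep several partial bridges alive and re-score whole paths globally,
    mitigating the semantic drift of purely greedy interpolation."""
    beams = [([], 0.0)]  # (inserted sentences so far, cumulative local score)
    for _ in range(k):
        expanded = []
        for path, score in beams:
            context = " ".join([alpha] + path)
            for beta in generator(context, num_candidates=n):
                local = scorer(path[-1] if path else alpha, beta, omega)
                expanded.append((path + [beta], score + local))
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]

    def global_score(path):
        # Average coherence of every adjacent triple in the completed bridge.
        seq = [alpha] + path + [omega]
        scores = [scorer(a, b, c) for a, b, c in zip(seq, seq[1:], seq[2:])]
        return sum(scores) / len(scores)

    return max(beams, key=lambda b: global_score(b[0]))[0]
```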
In addition to referential consistency, affective coherence can also be monitored through an Emotion field linked to an external emotion regulation mechanism inspired by the taxonomy of Kandel et al. [50]. In this setting, a target emotional trajectory may be predefined (for example, starting from neutrality, rising to fear at the climax, and concluding with relief). Each generated sentence can be tagged with an estimated emotion using a sentiment or emotion classifier, and the progression can be compared against the desired emotional arc. If deviations occur, candidate sentences may be resampled or regenerated under conditional prompting, incorporating explicit emotion labels as soft constraints. Simpler variants of this approach generate freely and later apply post hoc filtering based on emotional consistency metrics.
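A post hoc check of this kind can be sketched as follows; classify_emotion stands for any sentence-level emotion classifier and is a hypothetical helper.

```python
def follows_emotion_arc(sentences, target_arc, classify_emotion, tolerance=0):
    """Compare per-sentence emotion labels against a target trajectory,
    e.g. ["neutral", "fear", "fear", "relief"]; stories whose labels deviate
    beyond `tolerance` mismatches are flagged for resampling."""
    predicted = [classify_emotion(s) for s in sentences]
    mismatches = sum(p != t for p, t in zip(predicted, target_arc))
    return mismatches <= tolerance, predicted
```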
Figure 6 illustrates a single iteration of the interpolation process. Given two boundary sentences—Alpha and Omega—the generator produces multiple candidate Beta sentences. The coherence filter, implemented via a masked language model, evaluates how well each candidate connects Alpha to Omega, selecting the highest-scoring option. The resulting Beta is inserted between Alpha and Omega, and its content updates the story’s symbolic state (for example, noting temporal progression if Beta implies that time has passed, or registering the discovery of a clue if Beta introduces one). This updated state informs subsequent interpolation steps, preserving continuity across the growing narrative.
After several iterations—such as three interpolations between the beginning (B) and the climax (C)—the sequence evolves as B, s2, s3, s4, C. Subsequent interpolation between the climax and the outcome (O) yields C, s5, s6, s7, O. The resulting narrative therefore follows the progression Beginning → s2 → s3 → s4 → Climax → s5 → s6 → s7 → Outcome. This iterative procedure highlights the neurosymbolic collaboration at the core of the architecture: the symbolic layer provides the structural scaffold (the BCO outline and constraints from the AVM, including the participating characters and emotional targets), while the neural component generates surface text under the guidance of the coherence filter and, when relevant, entity- or emotion-tracking mechanisms.

3.3. Full Story Generation and Multimodal Extension

At the conclusion of the interpolation process, we expect a complete story in textual form. Figure 7 illustrates how the initial single-sentence components representing the Beginning, Climax, and Outcome (B, C, and O) can be expanded into a multi-sentence narrative. In the example depicted, three interpolated sentences are inserted between the beginning and the climax, and another three between the climax and the outcome. These values are not fixed but serve as indicative parameters for narrative density. The underlying framework allows flexible control over the degree of elaboration: longer stories can be produced by increasing the number of interpolations or by subdividing major narrative transitions—such as B to C or C to O—into smaller substeps, potentially introducing multiple climactic or transitional moments. This modular structure supports a spectrum of narrative granularities, from concise summaries to extended plots with finer emotional and structural variation.
At this point, given that we have symbolic storylines, it is possible to consider adding multimodal elements to the story—specifically images and audio—to enhance the storytelling. Although the main focus of this work is textual story generation, we consider the use of state-of-the-art generative models for images (such as latent diffusion models) and audio (such as generative audio models) to create illustrations and soundtracks or sound effects that accompany the text.
For images: since our story is grounded in specific events and characters via the AVM, we could use that metadata to query a text-to-image model for key scenes. For example, for the Beginning, which we know involves certain characters in a certain setting, we can generate an image illustrating that scene. Modern text-to-image models like Stable Diffusion [51] or DALL-E could be used by feeding them a prompt like “A drawing of [character] chasing [character] through [setting]” for our Pursuit scenario. However, a challenge in multi-image storytelling is consistency: ensuring that the character looks similar across images and that the style is consistent. Recent research (e.g., [38]) has started to address image sequence coherence by using specialized multi-modal models that generate sequences of images with persistent characters. Our system could leverage such models or use a prompt engineering approach to carry over descriptions (“the same knight from before now in a marketplace”, etc.).
For audio: we consider two types of audio—speech (narration or dialogue spoken aloud) and non-speech audio (background music or sound effects). If the goal is to create an audiobook-like output, we could simply feed the text to a text-to-speech system with appropriate voices for narration and characters. Going beyond that, we might want to generate a soundtrack that matches the mood of each scene. Bae et al. [52] introduced the Sound of Story (SoS) dataset, which pairs story scenes with background audio. Using such a resource, one could train a model to select or generate audio given a scene description. Generative models like AudioLM, MusicLM [53], or AudioGen [54] allow the creation of sound effects or music from text prompts. For example, for a tense chase scene, an audio model could generate fast-paced, suspenseful music; for a peaceful resolution, a calm melody.
Our system design includes two modules (as placeholders in Figure 1) for image and audio generation. They take as input the evolving story (or the final story) and produce visual and auditory content. In practice, these modules would rely on existing foundation models (pre-trained on image–text or audio–text pairs). We might not train these from scratch but rather use them in a zero-shot or fine-tuned manner; for instance, a CLIP-based model could be used to ensure the selected images align with story semantics. The integration of these modalities must be done carefully to maintain narrative coherence. We would not want the images or sounds to contradict the text (e.g., an image showing a character with a different hair color than described, or the sound of rain in a scene the text says is indoors). To enforce alignment, we use the story’s symbolic representation: the AVM and subsequent state tracking can be converted into prompts or constraints for the multimodal models.
The Characters attribute ensures we mention the right attributes in the image prompt (e.g., “the knight with a red cape” if the text mentioned that), and the Setting attribute helps ground the visuals and audio (e.g., “marketplace noise” vs. “forest ambience”). Finally, we would arrive at a full multimedia story: a sequence of text paragraphs, each possibly accompanied by an image and/or background audio that complements it. While the scope of this paper is primarily the methodology, we include this multimodal aspect as part of our vision for METATRON, reflecting the trend towards more immersive AI-generated narratives.
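As a sketch of this symbolic-to-prompt conversion, the function below assembles a text-to-image prompt from AVM fields (reusing the illustrative AVM dataclass sketched in Section 3.2); fixing a single style string and repeating character descriptors across scenes is one simple device for visual consistency.

```python
def image_prompt_from_avm(avm, scene: str,
                          style: str = "digital storybook illustration") -> str:
    """Build a text-to-image prompt from AVM fields; the shared style token and
    repeated character descriptors promote consistency across scene images."""
    cast = " and ".join(f"the {desc}" for desc in avm.characters.values())
    return (f"{style}: {cast}, {avm.action}, in {avm.setting}, "
            f"mood of {avm.emotion}, {scene} scene")

# e.g. image_prompt_from_avm(avm, "Beginning")
```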
Figure 8 depicts the entire framework of METATRON including this last aspect. We start from a symbolic seed (Polti’s situation), generate a BCO outline, expand it via neural models under symbolic guidance, and optionally add images and sounds. The result is a rich, structured story that benefits from both the knowledge of classic narrative structures and the creativity of modern AI generative models.

3.4. A Neurosymbolic Approach

Our approach is neurosymbolic in the sense that it tightly intertwines symbolic AI techniques with neural networks. The symbolic part (e.g., the AVM representation of situations, the use of Polti’s taxonomy, the logic-based coherence constraints) provides guidance and structure that neural generative models typically lack. The neural part (the transformer-based language models, image and audio models) provides flexibility and learning capacity that symbolic systems struggle with, especially when dealing with the richness of natural language and sensory data.
This neurosymbolic philosophy addresses a known issue: pure symbolic systems can ensure a consistent and logical story but often produce stiff or formulaic text, whereas pure neural systems produce fluent text that can easily go off the rails logically. By combining them, we aim to mitigate each side’s weaknesses. In our framework:
  • The AVM and BCO generator ensure that at a high level, the story follows a meaningful dramatic arc (so we won’t get a nonsensical sequence of events; there’s an underlying human-vetted narrative pattern).
  • The neural generator ensures that the prose of the story is natural and potentially creative or unexpected (thus improving novelty and fluency).
  • The coherence filter acts as a symbolic constraint engine (though learned) to enforce logical transitions, effectively acting like a narrative rule-checker, albeit implemented as a neural network. It embodies symbolic logic in its training data design.
  • The emotion model injects a cognitive dimension: it treats the story as not just a sequence of events but as an affective journey. This draws on psychological theories of emotion and can be seen as a symbolic layer overlaying the text (each event is tagged with emotional significance).
  • The use of Polti’s situations is essentially encoding human literary knowledge (a symbolic knowledge base) into the generative process, which is something a purely neural model pre-trained on text might not explicitly have, or might not reliably use.
Another perspective on why this combination is powerful lies in the concept of controllability and interpretability. In a symbolic system, every decision (why a certain event happens) can be traced to a rule or a template. In a neural system, the decisions are buried in billions of weights. By structuring the problem via symbolic representations (like planning outlines or templates), we make the overall process more interpretable and easier to steer. For instance, if the story outcome is unsatisfactory, one can adjust the symbolic outline and regenerate, rather than hoping a black-box model will produce a better ending by luck. This is crucial for applications where a user or author might want to co-create with the AI, maintaining some control over the narrative direction (see Section 4 on narrative controllability).
Our use of an emotion taxonomy and extrinsic emotion regulation in the story is a nod to cognitive science: human writers often subconsciously ensure that a story has an emotional rhythm (e.g., rising tension, moments of relief, a final catharsis). By modeling emotion explicitly (even if roughly, with just a label for each sentence or scene), we can supervise the neural generator to follow an intended emotional arc. This again is something that pure language modeling would not guarantee, as it has no explicit concept of “current emotional valence” unless we prompt it carefully.
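As a minimal illustration of arc supervision (the keyword lexicon is a toy stand-in for a trained sentence-level emotion classifier), one can measure how closely a generated scene follows an intended emotional trajectory:

# Toy emotion classifier: keyword lookup standing in for a trained model.
def classify_emotion(sentence):
    lexicon = {"storm": "fear", "dark": "fear", "laughed": "joy",
               "safe": "relief", "wept": "sadness"}
    for word, label in lexicon.items():
        if word in sentence.lower():
            return label
    return "neutral"

def arc_adherence(sentences, target_arc):
    """Fraction of sentences whose predicted emotion matches the target arc."""
    hits = sum(classify_emotion(s) == t for s, t in zip(sentences, target_arc))
    return hits / len(target_arc)

scene = ["The storm rolled in at dusk.",
         "They huddled as the dark pressed closer.",
         "By dawn they were safe at last."]
print(arc_adherence(scene, ["fear", "fear", "relief"]))  # -> 1.0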
Finally, the integration of multimodal generation can also be seen through a neurosymbolic lens. Image and audio generation models operate in continuous high-dimensional latent spaces (very much a neural network domain), but the decision of what to illustrate or what sound to make is guided by symbolic reasoning (e.g., “what is the important thing happening now? It’s a confrontation at night, so an image of a moonlit duel is appropriate; sounds of crickets and distant thunder might accentuate the mood”).
In these ways, METATRON is not just stacking neural and symbolic components, but making them work in tandem. The symbolic part sets objectives and checks for the neural part, and the neural part, in turn, can provide feedback (e.g., via learned evaluations) to refine symbolic choices (e.g., if a certain dramatic situation consistently leads to dull stories, the system could learn to adjust its template usage). This synergy is where we see the greatest potential for advancing story generation beyond what either approach could achieve alone.

3.5. Novelty and Distinguishing Features of the Approach

Several aspects distinguish this architecture from previous research in automatic story generation.
  • Integration of classical dramatic theory. Rather than relying solely on loosely defined schemas or crowd-sourced plot fragments, the design explicitly incorporates comprehensive dramaturgical taxonomies such as Polti’s thirty-six situations [42,43], in dialog with structuralist accounts like Propp’s morphology [55]. Leveraging these taxonomies injects long-standing theoretical insight into computational storytelling, offering structural richness and systematic coverage of narrative space (36 situations × 3 phases = 108 templates) while promoting diversity beyond generic adventure templates [8,9,32,56].
  • Cognitive-oriented emotion regulation. Emotion is represented and controlled as an explicit signal (valence, intensity, and basic categories), enabling targeted arcs and pacing at the sentence or scene level; this extends prior work on protagonist emotions and narrative affect [27,57] and is conceptually grounded in cognitive and neural perspectives [50], connecting to broader creativity discussions in AI [58].
  • Neurosymbolic coherence filtering. Local coherence is assessed iteratively during generation using masked-LM style discriminators, advancing beyond pairwise sentence checks and narrative interpolation setups [17,28]. Triple-based scoring captures bidirectional constraints (previous–candidate–next) and integrates tightly with symbolic state tracking and entity representations [18,22], aligning with planning-centric traditions in narrative control [6,10,33].
  • Multimodal narrative generation. The architecture is compatible with visual and audio synthesis, building on visual storytelling [59] and recent diffusion-based imagery [51], and extending toward sound and music generation for scene-setting [52,53,54,60]. This broadens storytelling from purely verbal narration toward scene-direction capabilities.
  • Hybrid use of large language models (LLMs). LLMs are positioned as guided components within a neurosymbolic scaffold rather than free-running narrators, addressing long-context limitations [11,13,23,34,61] and echoing recent perspectives that use LLMs as planners or search guides inside narrative systems [35,36,62]. This stance complements earlier plan-and-write and controllability lines [4,15,21,24,63].
  • Evaluation perspectives. Beyond readability or surface coherence, the evaluation program emphasizes cognitively meaningful tests, including theory-of-mind style probes and consistency under commonsense constraints [47,64,65], as well as creativity-oriented assessments [66]. These criteria situate the approach within established surveys and state-of-the-art mappings [37,67,68,69].
Overall, distinctiveness arises from compositional design rather than optimization of a single component: symbolic planning and dramaturgical structure [7,8,42,55,70] are combined with neural realization and coherence control [3,20,71,72], within a pipeline that admits retrieval augmentation and controllable conditioning [21,62,73]. Future empirical studies could include ablations—e.g., comparing Polti-guided planning to keyword-driven baselines [16] or measuring the impact of explicit emotion regulation [27,57]—and longitudinal tests of global consistency under long-context stressors [23]. Such investigations would refine how structured knowledge and cognitive modeling contribute to high-quality, engaging narrative generation while remaining aligned with classic and contemporary lines of research [5,6,10,37,56].

3.6. Evaluation Methodology

Evaluating story generation typically combines automatic metrics with human judgments, given the richness and open-endedness of the task. The following protocol aligns with common practice while incorporating aspects that target deeper narrative properties.
  • Automatic metrics (initial stage). Current automatic metrics remain limited: perplexity approximates fluency under a reference language model; lexical diversity is commonly measured with Distinct-n for n ∈ {2, 3, 4} [4] (a minimal implementation is sketched after this list); and descriptive indicators such as average story length or the positional distribution of emotionally valenced terms can help diagnose intended affective arcs. Coherence can be approximated by segmenting stories into triples (previous–candidate–next) and scoring them with a trained coherence filter, although no single metric reliably captures global structure. In the early stages of system development, these proxy metrics serve primarily as diagnostics rather than substitutes for human evaluation.
  • Human evaluation (ground truth). Because reliable automatic measures of coherence, creativity, emotional resonance, and theory-of-mind reasoning are still lacking, human judgment remains the ground truth for validating narrative quality. Reader studies typically collect Likert ratings along several dimensions: cohesion/fluency (sentence-level readability), coherence (global sense-making), consistency (continuity of characters and world facts), engagement (subjective interest), and creativity (originality and avoidance of clichés). Pairwise preference tests can compare METATRON outputs with baseline systems (e.g., prompt-only LLMs or symbolic planners). A theory-of-mind consistency test can further check whether characters’ actions align with their knowledge and beliefs across the narrative.
  • Toward reduced human intervention (LLM-as-judge). Over time, human evaluations can be used to calibrate large language models as approximate judges of narrative quality, following approaches in recent work on LLM-based evaluators (e.g., Literary Language Mashup). Once sufficient correlation with human judgments is established, LLM-based evaluation can partially replace human raters for routine benchmarking. In this sense, human judgment is required at the initial validation stage, but the long-term aim is to derive stable, automatically computable proxies that progressively reduce the need for human intervention in large-scale evaluation.
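For reference, Distinct-n admits a compact implementation; this sketch uses naive whitespace tokenization, whereas a real evaluation would use a proper tokenizer:

def distinct_n(text, n):
    """Ratio of unique n-grams to total n-grams (naive whitespace tokens)."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

story = "the knight rode out and the knight rode back"
print([round(distinct_n(story, n), 2) for n in (2, 3, 4)])
# -> [0.75, 0.86, 1.0]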
Beyond these components, the Lovelace 2.0 test [66] provides a conceptual guide for creativity-oriented challenges: prompts can specify thematic requirements and constraints (e.g., “a betrayal that ends in an unexpected act of kindness”), and human judges assess both compliance and inventiveness. A comprehensive evaluation plan thus combines objective indicators (fluency, diversity) with subjective assessments (coherence, engagement), while probing deeper aspects such as consistency and creativity. Comparative analyses against prior systems and ablation studies within the same framework help isolate contributions of specific modules; for instance, differences in interestingness scores between settings with and without emotion regulation would speak to the value of affect-aware control.

4. Controllable Narrative Generation and Symbolic Guidance in LLMs

One of the key challenges in story generation is controlling the narrative to follow a desired trajectory or satisfy certain constraints (be it plot structure, genre conventions, or user-provided prompts). Large language models (LLMs) such as GPT-3 and GPT-4 are powerful but inherently prone to drift: they follow the statistical patterns of language, which means that if a user says “tell me a story about a detective”, the model will start off well, but the longer it continues, the greater the chance it introduces elements that steer the story away from the user’s initial concept (perhaps bringing in an unrelated subplot, or concluding too abruptly). Ensuring narrative controllability is essential, especially if these models are to be useful tools for writers or autonomous creative agents. Our approach addresses controllability by embedding symbolic guidance at multiple levels of the generation process. The use of Polti’s dramatic situations is one form of symbolic control: it fixes the type of conflict or drama that will occur. This is a high-level control on the theme and structure of the story. For instance, if the chosen situation is “Fatal Imprudence” (Polti #17), no matter how the neural generation proceeds, the story should revolve around a character’s fatal mistake and its consequences. This prevents the model from suddenly shifting into a different kind of story (like a random romance subplot) which would undermine coherence. It is a symbolic prior on the space of possible stories. Previous research has explored various methods of controlling narrative generation:
  • Schema and Plot Graphs: Earlier systems like GESTER [70] or STORY GRAMMARS tried to use schemas that define what sequences of plot functions can occur. Vladimir Propp’s morphology of the folktale [55] is a famous example of a story schema (with functions like “villainy”, “donor test”, “hero’s victory”, etc.). These are symbolic structures that, if enforced, guarantee a certain kind of story (e.g., a fairy tale structure). Modern analogues include plot graphs or story domains, where authors define possible event sequences in a planning domain language [37]. Our use of Polti is akin to a high-level schema selection. We could in the future refine it to more detailed schemas (like sequences of Polti situations or sub-situations).
  • Conditional Generation with Outlines: As discussed in the state of the art, systems like that of Fan et al. [16] accept an outline or prompt that lists key events or constraints. The large model then fills in the details. This is a simple form of control: you steer via the input. In our pipeline, we essentially produce such an outline automatically (BCO sentences) and then ensure the generated story adheres to them (through the coherence filter that anchors on the outline points). This means we can achieve controllability similar to providing a human-written outline, but without human effort, because the outline is generated by our Polti-based module.
  • Vector-based Control and Prompt Engineering: With models like GPT-3, one popular approach has been to craft the prompt cleverly to encode constraints. For example, one can instruct the model step by step: “First, list the main points of the story that involve a detective solving a crime. Then narrate the story following those points.” This often yields a more structured output. This insight aligns with our method: by explicitly separating planning (points) from narration, we control the outcome better. Some works have used prefix tuning or PPLM, where an attribute model steers the generation by nudging the hidden states. For instance, Dathathri et al. [73] steered the topic or sentiment of generated text by adding gradients from a classifier. In principle, one could train a “narrative arc” classifier and use PPLM to enforce, say, “make sure this stays a mystery story, don’t turn into comedy”. However, these methods, while elegant, can be brittle and are less interpretable than symbolic control.
  • Hybrid Planner-LLM Systems: Recently, Xiang et al. [35] demonstrated an approach where a classical symbolic planner (like one that generates a sequence of high-level actions using a domain definition and goals) is interleaved with a neural LLM (GPT-3) that turns each plan step into a paragraph of story. They applied this to the old TALE-SPIN domain. The planner ensures logical progression (characters only do things allowed in the domain and needed to achieve goals), while GPT-3 adds richness. This is very analogous to our architecture, except our “planner” is simpler (choose a Polti situation and BCO outline rather than planning each step of a character’s plan). The success of Xiang et al.’s system [35] is promising: they found that the combined system wrote more coherent stories than GPT-3 alone, and often more interesting ones than the planner alone (which was very dry). This validates the neurosymbolic approach. It also underscores that large LMs can be used in a controlled pipeline rather than just free generation. Another hybrid example is Farrell and Ware [36], who used an LLM to guide a search in narrative planning: essentially, the LLM suggests which branches of the story tree are promising (like a heuristic) and the planner ensures validity. This improved the efficiency and creativity of the planner’s output without sacrificing coherence. These works show that symbolic control at macro-level combined with neural fluency at micro-level is a fruitful direction.

Narrative Controllability and Symbolic Guidance

Controllability in story generation encompasses both system-level and user-level mechanisms that shape the resulting narrative. Within this framework, user influence can manifest at several points. A user may specify a desired dramatic situation (for instance, Polti’s Ambition, #30), define particular characters or settings within the Attribute–Value Matrix (AVM), or modify the high-level Beginning–Climax–Outcome (BCO) outline before expansion. Even finer-grained intervention is possible during interpolation—rejecting or revising generated continuations—if the interface exposes such functionality. Symbolic representations facilitate this interaction: modifying structured summaries or attributes is more intuitive than adjusting latent model parameters.
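A minimal sketch of this interaction surface follows; the field names are an illustrative subset rather than the full AVM schema:

from dataclasses import dataclass, field

@dataclass
class AVM:
    # Illustrative subset of Attribute-Value Matrix fields.
    situation: str                       # e.g., Polti's "Ambition" (#30)
    who: list = field(default_factory=list)
    where: str = ""
    when: str = ""
    why: str = ""

avm = AVM(situation="Ambition (#30)",
          who=["Mara, a provincial clerk", "the Governor"],
          where="a coastal capital",
          when="late autumn",
          why="Mara covets the Governor's office")

# User-level control: editing structured attributes before BCO expansion
# is more intuitive than adjusting latent model parameters.
avm.where = "a mountain garrison"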
Maintaining a coherent dramatic arc is another dimension of control. Classical frameworks such as Freytag’s pyramid describe an asymmetric five-part progression—exposition, rising action, climax, falling action, and resolution—whereas the current BCO outline provides a simplified three-phase structure. During interpolation, local coherence is preserved through filtering, but maintaining the global emotional and dramatic contour may require additional guidance. Emotional regulation modules and the intrinsic structure of Polti’s situations already impose a baseline arc, since most situations encode conflict and resolution. However, the system could incorporate explicit constraints, such as ensuring that the main conflict emerges by the midpoint of the story or by adjusting sentence allocation (e.g., more pre-climactic buildup than post-climactic resolution) to emulate classical asymmetry.
Large language models (LLMs) such as GPT-4 face persistent challenges with long-range coherence, partly due to limited context windows and the absence of explicit planning or memory mechanisms. Hierarchical generation mitigates these limitations: an outline or plan constrains generation, enabling consistent narratives even with shorter effective contexts [71]. Within this design, interpolation and symbolic scaffolding function as a form of implicit planning that reduces drift and ensures adherence to narrative goals.
Ensuring that generated text respects symbolic constraints remains an open challenge. If a Polti situation stipulates “Fatal Imprudence” (a mistake leading to death), the generated story should explicitly contain that causal event. Iterative verification could be implemented through classifiers trained to identify which Polti situation a story exemplifies, enabling automatic validation or regeneration when discrepancies arise. Analogously, reinforcement learning could be used for reward shaping [25], rewarding outputs that satisfy symbolic goals; however, reinforcement-based tuning often compromises fluency, and filtering-based control remains a more practical compromise.
Symbolic guidance also enhances interpretability in controllability. For example, specifying “Mistaken Jealousy” (#32) followed by “Remorse” (#35) yields a predictable yet expressive narrative trajectory. Prompting an LLM directly with “a story about jealousy and remorse” may not ensure the inclusion of those plot points, whereas symbolic scaffolding enforces their presence, producing a more structured and purposeful narrative. Such control is particularly valuable in applications requiring narrative fidelity, such as educational or game-based storytelling.
From a broader research perspective, the integration of literary theory with neural text generation suggests a productive convergence between AI and the humanities. Previous work in narrative planning languages [37,74] formalized story control through explicit logic and state tracking; the present approach advances that paradigm by using natural language outlines derived from formal narrative knowledge, thus preserving interpretability while enhancing generative flexibility.
Finally, narrative controllability extends to stylistic modulation. While the current framework focuses on plot, stylistic attributes can also be symbolically parameterized. A stylistic module might adjust vocabulary, rhythm, or syntax to emulate a specific genre (e.g., Victorian prose or gothic narrative). Though LLMs can approximate style from examples, an explicit symbolic layer could reinforce consistency. In a modular architecture, this enables swapping or fine-tuning generators for different stylistic domains (children’s literature, horror, or science fiction) while retaining the same symbolic backbone.
Overall, symbolic guidance provides a structured means of exerting narrative control, improving consistency, and aligning generated stories with desired arcs, styles, and constraints. Beyond increasing reliability, this approach strengthens the theoretical bridge between computational creativity and classical narratology, underscoring the enduring relevance of literary formalisms in contemporary AI storytelling.

5. Episodic Memory and Emotional Modeling in Story Generation

Human storytelling extends far beyond the mere sequencing of events; it involves constructing continuity, emotional depth, and cognitive coherence across time. Characters are not blank agents—they remember, anticipate, and feel. Accordingly, two interdependent challenges in automatic story generation are the management of episodic memory, which maintains consistency and causal recall, and the modeling of emotion, which shapes narrative tension and engagement. Recent research increasingly recognizes that these dimensions are fundamental to human-like narrative reasoning [23,27,57,75].

5.1. Memory as Narrative Structure and Constraint

A recurring issue in long-form neural text generation is the so-called lost in the middle phenomenon, where models lose track of mid-range context over thousands of tokens [23]. In narrative terms, this results in incoherence, forgotten characters, or contradictions. Symbolic systems traditionally handled memory explicitly through state variables or knowledge representations, ensuring that facts introduced early in the story could be referenced later. The Attribute–Value Matrix (AVM) used in the present framework operates in this spirit, acting as a symbolic state tracker that records entities (Who), locations (Where), temporal shifts (When), and causal or motivational links (Why). Each generated sentence updates this evolving representation, effectively constituting a lightweight episodic memory of the story world. This is conceptually akin to the cognitive Event-Indexing Model, which posits that readers maintain mental indices of protagonist, temporality, spatiality, causality, and intentionality [75]. By mirroring this structure, computational models approximate how human readers monitor coherence.
Early narrative systems already anticipated this perspective. For example, the MINSTREL system [8] maintained transformational rules for adapting prior stories—implicitly a memory of creative cases—while MEXICA [9] explicitly encoded states of character relationships and unresolved goals. These symbolic memories guaranteed logical continuity but required intensive manual design. Neural architectures, by contrast, offer emergent memory through large context windows, but their retention is probabilistic rather than structural. Hybrid approaches seek to reconcile both: maintaining an explicit memory representation while using neural components for surface realization.

Mechanisms for Memory Retention

Several strategies have been proposed to address memory limitations in neural story generation. One approach is to introduce recurrent memory or summarization, periodically condensing prior content and feeding summaries back into the model. Transformer-XL [34] exemplifies this idea, maintaining a recurrence mechanism that allows context reuse beyond the nominal sequence length. In practice, summarization or “rolling context” techniques are used in narrative generation to refresh the model’s awareness of prior events. Another line of work introduces episodic memory modules as external stores of narrative facts, retrievable during generation. Retrieval-augmented architectures such as that of Lewis et al. [62] demonstrate that a model can query a nonparametric memory for relevant information. Applied to story generation, such modules could store narrative triples (“Alice saved Bob,” “Bob’s loyalty = uncertain”) and reinstate them when relevant. A related idea is the knowledge graph-based memory proposed by Wang et al. [17], in which evolving states (e.g., “Alice knows location of key = false”) are dynamically updated and validated. These mechanisms, whether symbolic or neural, share the goal of maintaining temporal and causal consistency across long narratives.
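As a toy illustration of such an episodic store (naive token overlap stands in for the dense-embedding retrieval a deployed system would use):

# Sketch of an external episodic memory holding narrative triples.
memory = [
    ("Alice", "saved", "Bob"),
    ("Bob", "loyalty", "uncertain"),
    ("Alice", "knows location of key", "false"),
]

def retrieve(query, store, top_k=2):
    q = set(query.lower().split())
    overlap = lambda t: len(q & set(" ".join(t).lower().split()))
    return sorted(store, key=overlap, reverse=True)[:top_k]

print(retrieve("does Bob remember that Alice saved him", memory))
# -> [('Alice', 'saved', 'Bob'), ('Bob', 'loyalty', 'uncertain')]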
Recent evaluations of large language models reveal their limitations in this regard. Huet et al. [76] found that even advanced models like GPT-4 struggle with multi-event recall and temporal reasoning: after a sequence of related events, models often fail to answer factual questions about earlier occurrences. Such results emphasize the need for architectural memory supports in creative text generation. The interpolation-based approach described in this work mitigates some of these issues by dividing long stories into short bridging steps (Alpha → Beta → Omega), effectively chunking narrative progression into cognitively manageable segments. This mimics human writing strategies, where authors plan in units of scenes or transitions rather than entire plots at once.

5.2. Emotional Modeling and Affective Trajectories

Emotion shapes narrative engagement by providing tension, motivation, and empathy. The challenge for computational systems is to capture emotional dynamics—both at the level of characters and of reader response—without reducing them to static sentiment scores. Drawing on affective neuroscience and psychology, several models represent emotion as a multidimensional space of valence and arousal [50]. Within this framework, each event or situation in the story can be tagged with an emotion label corresponding to its prototypical affective state: anger for “Revolt,” fear for “Pursuit,” sorrow for “Self-Sacrifice.” These annotations provide a scaffold for emotional coherence and intensity regulation.
Different strategies exist to operationalize emotional control. One is conditioning, in which the generator is prompted or fine-tuned on emotion labels (e.g., “[Emotion: Fear] The night was dark…”). Another is post-generation filtering, where an emotion classifier evaluates the produced text, and sentences whose emotion deviates from the target are revised or replaced. A third involves plot-level adjustment, mapping emotions to structural arcs such as Freytag’s pyramid (rising tension, climax, resolution). Empirical analyses of literary corpora have revealed canonical emotional trajectories—“fall-rise,” “rise-fall-rise,” etc.—that recur across genres and cultures [57]. Such trajectories can guide the placement of emotional peaks and recoveries in generated stories. The approach of Brahman and Chaturvedi [27], for instance, explicitly enforced emotion arcs (joy → anger → sadness) through reinforcement learning, though at the cost of some fluency. A symbolic framework can achieve similar control more efficiently by linking emotional profiles to Polti’s 36 situations, which naturally encode affective polarity (e.g., “Deliverance” versus “Fatal Imprudence”).
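A compact sketch of the post-generation filtering strategy (the classifier is again a placeholder; in practice a fine-tuned emotion model would be used):

def pick_by_emotion(candidates, target, classify):
    """Return the first candidate matching the target emotion, falling back
    to the first candidate if none matches (rather than failing outright)."""
    matching = [c for c in candidates if classify(c) == target]
    return matching[0] if matching else candidates[0]

candidates = ["He laughed and waved them on.",
              "The dark corridor swallowed his torchlight."]
toy_classifier = lambda s: "fear" if "dark" in s else "joy"
print(pick_by_emotion(candidates, "fear", toy_classifier))
# -> "The dark corridor swallowed his torchlight."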

Interplay of Emotion and Memory

Episodic memory and emotional modeling are deeply intertwined: emotional salience often determines which events are remembered, while memory recall reactivates emotion. In narrative generation, maintaining this link yields greater depth and realism. Consider a story where Alice rescues Bob early on (“Deliverance”) and Bob later betrays her (“Betrayal among kindred”). The emotional impact of betrayal depends on the model’s ability to recall the earlier act of rescue—without that memory, the scene loses moral weight. Symbolic memory structures (e.g., the AVM) can explicitly store such causal and affective dependencies (“Alice saved Bob → Bob’s betrayal intensifies sorrow”). During generation, coherence filters that rank candidate continuations based on contextual fit can implicitly favor sentences that recall earlier emotional events, thereby reinforcing dramatic continuity. This principle parallels cognitive accounts of emotional memory, where events charged with affect are more likely to be recalled and integrated into future interpretation.
Recent developments in large language models show limited but promising sensitivity to affective context. When supplied with a “story-so-far” recap, models like GPT-4 can maintain emotional continuity over long sequences. This aligns with practices in interactive fiction, where developers maintain a running “world state summary” including character moods and relationships. Retrieval-based memory architectures [62] or long-context models (e.g., LongLLaMA) may further strengthen this coupling by allowing selective recall of emotionally salient passages. In future systems, one could imagine a dual-channel control: a symbolic tracker for event and belief states, and a neural estimator for emotional tone, each influencing the other.
Finally, maintaining belief and world-state consistency—what cognitive science frames as a rudimentary form of theory of mind—remains an open challenge for story generation systems. The joint modeling of memory, emotion, and belief thus advances narrative generation toward greater cognitive plausibility, where consistency, recall, and affect intertwine to shape believable, human-like narratives. These intertwined aspects not only strengthen internal coherence but also open the way for evaluating story generation through cognitive and creative criteria—a perspective further developed in Section 6.

6. Cognitive and Creative Dimensions of AI-Generated Narratives

The broader motivation behind the METATRON framework is to investigate the boundary between algorithmic generation and genuine creativity. Story generation occupies a liminal space between structured reasoning and imaginative expression, making it an ideal domain for studying cognitive aspects of artificial intelligence. Evaluating creativity in AI systems is notoriously complex, as it involves philosophical, psychological, and aesthetic dimensions. Nevertheless, several operational frameworks exist, such as the Lovelace tests for creative autonomy [66] and metrics grounded in the classical triad of novelty, surprise, and value introduced by Boden [58]. This section examines how cognitive and creative dimensions—particularly theory of mind (ToM) reasoning, imaginative constraint satisfaction, and human-like reasoning—may be evaluated in AI-generated narratives.

6.1. Theory of Mind in AI Narratives

Theory of Mind refers to the capacity to attribute beliefs, intentions, and emotions to oneself and to others, recognizing that others possess perspectives distinct from one’s own. In storytelling, ToM underlies suspense (the reader knows more than the protagonist), irony (the audience perceives the misunderstanding), and emotional depth (characters act on partial or mistaken knowledge). A system capable of modeling ToM could, in principle, design such asymmetries deliberately. Current large language models, however, lack explicit representations of individual character beliefs or epistemic states; they operate from a single narrative viewpoint—typically omniscient or third-person limited—unless constrained otherwise. Nonetheless, extensive exposure to human-written fiction allows LLMs to approximate ToM patterns implicitly. Kosinski [64] reported that GPT-4 solves several false-belief tasks (e.g., Sally–Anne scenarios) via prompting, suggesting emergent ToM-like reasoning, though subsequent analyses contend these effects reflect statistical pattern imitation rather than genuine mental-state modeling [65].
Addressing ToM in computational storytelling requires explicit tracking of characters’ knowledge and beliefs. Andreas [77] proposed symbolic belief graphs to represent and update what each agent knows or believes at a given moment. Similarly, Riedl and Harrison [78] explored plan-based story generation where belief reasoning shapes character decisions. Integrating such symbolic structures with neural text generators could enable narrative systems to represent asymmetric knowledge states (“Bob knows X; Alice does not”), thus allowing phenomena like dramatic irony or deception. To maintain coherence at the world level, hybrid architectures may also update factual states dynamically using commonsense resources such as ConceptNet [79] or large-scale commonsense knowledge bases reviewed by Ilievski et al. [80]. Promising advances in this direction include systems that simulate commonsense reasoning during story generation [30] or encode event-level causal graphs for maintaining internal consistency [81].
For evaluation, one can embed controlled ToM scenarios within generated stories: for instance, Character A hides an object unbeknownst to Character B, and the model must later generate B’s behavior. If B acts with ignorance consistent with the story context, ToM coherence is preserved; if B “knows” the hidden object without cause, the system fails the ToM test. Similarly, multi-character dialogues can test whether distinct knowledge boundaries are respected. Post-generation, automated comprehension tests—asking “Does Character B know that X happened?”—can probe implicit ToM consistency, using either the same model in analytic mode or an auxiliary question-answering model. Human raters can complement this by judging whether characters’ actions remain believable given what they know or feel. Improved ToM coherence relative to baseline LLMs would indicate that the structured narrative framework (e.g., symbolic memory, controlled perspectives) enhances cognitive plausibility in storytelling.
Beyond qualitative illustrations, Theory of Mind in computational narratives can also be approached through more concrete and quantifiable evaluation setups. Recent benchmarks for ToM-like reasoning in large language models include structured false-belief tests, second-order belief challenges, mistaken-belief cascades, and multi-agent perspective-taking tasks [64,82,83]. These allow researchers to score models on standardized metrics such as belief-attribution accuracy, consistency under re-prompting, and robustness to minimal narrative perturbations. Complementary datasets targeting commonsense-based ToM—for example, social scenarios requiring reasoning about intentions, deception, or emotional inference—provide quantitative baselines for evaluating how reliably an LLM maintains separate epistemic states for different characters [84,85].
In creative or narrative contexts, ToM can also be operationalized through controlled story probes. For instance, recent creativity and narrative-coherence evaluations use constrained micro-stories in which a model must maintain identities, motivations, and hidden information over multi-paragraph arcs [86]. Metrics such as narrative-state consistency, cross-turn belief preservation, and contradiction rates offer measurable indicators of ToM coherence. These quantitative evaluations complement the qualitative assessments proposed above, grounding ToM in reproducible tests while acknowledging that current LLMs may exhibit pattern-based approximations of mental-state reasoning rather than genuine modeling [65].

6.2. Operationalizing Theory of Mind Within the METATRON Framework

To complement the qualitative discussion of ToM and narrative asymmetries, the METATRON framework can incorporate controlled, quantitative probes directly into its generation pipeline. These probes evaluate whether the system maintains distinct epistemic states for different characters and whether narrative decisions respect those knowledge boundaries.
Example 1.
First-order false-belief micro-stories.
Inspired by benchmarks such as Kosinski [64] and consistency stress-tests from Ullman [83], we embed short diagnostic scenarios into the generated story. For instance:
Alpha: “Lara hides the key in the red box while Marco is outside.” Omega: “Marco returns and searches for the key.”
During interpolation, METATRON must generate an intermediate sentence β that respects epistemic asymmetry. A valid solution might be:
“Unaware of its new location, Marco checks the drawer first.”
A ToM error would be:
“Marco heads straight to the red box, knowing the key is inside.”
Accuracy is computed as the proportion of interpolations in which β conforms to the characters’ knowledge states. This yields a simple quantitative ToM score.
Example 2.
Second-order belief cascades.
More challenging ToM probes, adapted from Rabinowitz et al. [82], introduce nested beliefs:
“Anna believes that Ben thinks the diary is locked in the desk. In fact, only Anna knows it was moved to the attic.”
ToM consistency is measured by whether METATRON preserves nested belief relations after several interpolation steps. Multiple-choice comprehension questions (“Does Ben know the diary is in the attic?”) can be posed to an auxiliary QA model, yielding objective scoring.
Example 3.
Social commonsense ToM.
Using templates inspired by SocialIQA [84] or MindGames [85], METATRON can be tasked with inferring social intentions:
Alpha: “Ruth avoided speaking about the accident at dinner.” Omega: “She later apologized to Eli.”
Here, METATRON should interpolate an inference such as: “She realized Eli was still shaken by the memory.”
Scoring uses forced-choice accuracy: the system must select the correct social inference among distractors.
Example 4.
Narrative-state ToM metrics.
We quantify narrative ToM coherence within full stories:
  • Belief-tag accuracy (BTA): For each character, METATRON maintains a belief-state graph. After story generation, an LLM-based evaluator answers questions of the form “Does character X know Y?”. Ground truth is determined by the symbolic state tracker. BTA is the percentage of questions answered correctly (a toy computation is sketched after this list).
  • Epistemic contradiction rate (ECR): The proportion of sentences in which a character acts on information they do not possess. Lower is better.
  • Perspective drift score (PDS): Measures unintended switches in narrative viewpoint between adjacent sentences (0 = no drift).
  • Coherence under perturbation (CUP): Following Ullman [83], minimal changes are introduced (e.g., swapping object locations); the system is scored on whether character beliefs update consistently.
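To make BTA concrete, the following toy sketch compares a placeholder evaluator against symbolic ground truth; in METATRON, the truth table would come from the belief-state tracker and the evaluator would be conditioned on the generated story:

# Ground truth from the symbolic tracker: (character, fact) -> knows?
ground_truth = {
    ("Marco", "the key is in the red box"): False,
    ("Lara", "the key is in the red box"): True,
}

def qa_evaluator(character, fact):
    # Placeholder for an LLM-based judge reading the story text.
    return character == "Lara"

def belief_tag_accuracy(truth, qa):
    correct = sum(qa(ch, fact) == ans for (ch, fact), ans in truth.items())
    return correct / len(truth)

print(belief_tag_accuracy(ground_truth, qa_evaluator))  # -> 1.0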
Example 5.
Creativity-and-ToM joint assessment.
Narrative creativity benchmarks such as LOT [86] can be adapted so that each prompt defines both a structural constraint (e.g., a dramatic situation) and a ToM constraint (e.g., one character hides critical information). Judges (human or LLM-based) score:
  • creativity of the twist or conflict resolution,
  • maintenance of ToM consistency across long-distance dependencies,
  • degree of novel yet plausible emotional inference.
This integrates ToM reasoning into broader criteria of narrative inventiveness and structural coherence.

Interpretation Within METATRON

Because METATRON maintains symbolic AVMs, BCO outlines, and a dynamic belief-state tracker, the framework can compute ToM metrics without modifying the neural generator. The symbolic layer defines ground truth for what each character knows, while the neural components attempt to generate text consistent with these constraints. The discrepancy between symbolic truth and neural realization produces numerical ToM scores that quantify the neurosymbolic alignment.
This hybrid method supports ablation studies: by disabling the belief-state tracker or coherence filter, we can measure their contribution to ToM coherence, enabling empirical claims about the value of neurosymbolic integration.

6.3. The Lovelace Test and Computational Creativity

The Lovelace 2.0 Test [66] provides a structured approach to assessing machine creativity. It challenges an AI system to produce an artifact that satisfies human-defined constraints yet is not trivially derivable from them. For instance, a human might request: “Tell a story involving a happy dog and a sunken ship, without using the words ‘dog,’ ‘ship,’ or ‘water’.” The degree of inventiveness with which the system meets this challenge reflects creative reasoning: indirect description (“a cheerful creature wagged beside the wreck”) or metaphorical framing would indicate genuine novelty. Symbolic systems are naturally suited to such tests, as constraints can be enforced through filtering or token penalties, while neural generators supply expressive variability. Within METATRON, such constraints can be injected at the AVM level (e.g., forcing certain attributes or emotional arcs) or within situational templates, selecting Polti situations that metaphorically accommodate disparate concepts. Creativity thus emerges as constraint satisfaction under symbolic structure, contrasting with the looser generative freedom of large pretrained LLMs.
Beyond constraint satisfaction, creativity also entails originality and diversity. Neural models risk reproducing patterns seen in training data, leading to “mode collapse” into familiar tropes. By contrast, our use of Polti’s thirty-six dramatic situations ensures systematic variety at the structural level, while stochastic character instantiation and emotional regulation inject local novelty. Creativity metrics may include lexical diversity (Distinct-n), inter-story event variety, and emotional trajectory variance. In particular, emotion regulation tends to enhance interestingness: stories exhibiting fluctuations in affect (e.g., tension and release) are empirically more engaging than those with flat trajectories [57,87].

6.4. Lovelace-Style Probes Tailored to METATRON

Concretely, we can adapt Lovelace 2.0 to METATRON by defining families of constrained prompts and scoring how well the system satisfies them. Example categories include:
  • Lexical avoidance constraints. The system must realize a given dramatic situation (e.g., Polti’s Pursuit) while avoiding a list of taboo words (e.g., names of roles or key objects). Constraint satisfaction is measured as the fraction of outputs with zero taboo-token violations (a minimal checker is sketched after this list). To penalize trivial circumlocutions (e.g., simply omitting core concepts), judges rate whether the constraints were satisfied in a non-degenerate way (e.g., the forbidden concept is still clearly present at the semantic level).
  • Conceptual fusion constraints. Prompts specify two or more disparate concepts (e.g., “a betrayal that ends in an unexpected act of kindness” or “a courtroom drama with fairies and quantum computers”), and require mapping them into a single Polti situation. We measure fusion success as the percentage of stories where human raters agree that all requested elements are meaningfully integrated rather than mentioned in isolation. This follows the spirit of Riedl [66] in emphasizing non-trivial satisfaction of combined constraints.
  • Affective-structural constraints. Here the prompt enforces a target emotional arc (e.g., neutral → fear → relief) in combination with a particular dramatic situation. METATRON’s explicit emotion tags allow computation of an arc adherence score: the correlation between the intended valence trajectory and the predicted trajectory from an emotion classifier applied to the generated text [57]. Creative success requires both structural compliance and an engaging, non-formulaic realization.
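A minimal checker for the hard lexical constraint (simple token matching; the non-triviality judgment is left to human or LLM raters):

import re

def violates_taboo(story, taboo):
    tokens = set(re.findall(r"[a-z']+", story.lower()))
    return bool(tokens & taboo)

stories = ["A cheerful creature wagged beside the wreck.",
           "The dog barked at the sunken ship."]
taboo = {"dog", "ship", "water"}
csr = sum(not violates_taboo(s, taboo) for s in stories) / len(stories)
print(csr)  # constraint satisfaction rate -> 0.5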

6.4.1. Quantifiable Creativity Metrics

To move beyond anecdotal examples, we can define a battery of metrics:
  • Constraint satisfaction rate (CSR). Proportion of generated stories that satisfy all hard constraints (lexical, structural, or affective). This directly captures the Lovelace 2.0 requirement that outputs respect user-imposed conditions.
  • Non-triviality score (NTS). Following Riedl [66], human or LLM-based judges rate on a Likert scale whether a story’s solution to the constraints is “obvious” (e.g., trivial insertion of keywords) or shows a degree of indirectness, metaphor, or inventiveness. Higher scores indicate more genuinely creative responses.
  • Originality and diversity indices. Across a batch of stories generated under the same constraints, we compute Distinct-n, type-token ratios for content words, and clustering of AVM instantiations (e.g., diversity of character roles and settings) [87]. A system that repeatedly falls back to the same combinations of roles or scenes would exhibit low structural diversity even if surface wording varies.
  • Surprise under a baseline model. Given a reference language model or n-gram proxy trained on a background corpus, we approximate how unexpected METATRON’s stories are by computing normalized inverse likelihood or information content (a toy proxy is sketched after this list). Moderately higher surprise (without degeneracy) is taken as a signature of originality, distinguishing creative constraint-satisfying outputs from rote reproductions.
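As a toy proxy for the surprise metric, the following sketch scores mean per-token information content under a smoothed unigram model; a genuine reference language model would replace the toy counts:

import math
from collections import Counter

def mean_information(story, background, vocab_size=10_000):
    """Average per-token information (bits) under an add-one-smoothed
    unigram model; higher values indicate more 'surprising' text."""
    total = sum(background.values())
    bits = [-math.log2((background[w] + 1) / (total + vocab_size))
            for w in story.lower().split()]
    return sum(bits) / len(bits)

background = Counter("the knight rode to the castle the end".split())
print(round(mean_information("the knight rode home", background), 2))
print(round(mean_information("a basilisk annotated the ledger", background), 2))
# The second, less formulaic sentence receives a higher surprise score.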

6.4.2. Integrating Lovelace 2.0 with Neurosymbolic Ablations

Because METATRON factors generation into symbolic planning and neural realization, Lovelace-style tests naturally support ablation studies. For example, one can compare CSR, NTS, and diversity metrics between:
  • A baseline LLM prompted directly with constraints, and
  • The full neurosymbolic pipeline where constraints are encoded at the AVM/BCO levels.
If the neurosymbolic variant achieves higher constraint satisfaction and non-triviality without sacrificing fluency, this would empirically support the claim that symbolic scaffolding enhances computational creativity. Similarly, disabling emotion regulation or Polti-based situation selection should decrease affective-arc adherence and structural variety, respectively. In this way, the Lovelace 2.0 paradigm becomes not only a conceptual test, but a practical, quantifiable protocol for assessing and comparing creative capabilities in METATRON and related systems.

6.5. Human-like Cognitive Abilities in Narratives

Another dimension of cognitive evaluation concerns whether generated stories exhibit human-like reasoning. This includes commonsense causality, temporal consistency, and moral coherence. Early symbolic systems such as TALE-SPIN famously produced illogical outcomes (e.g., a character starves because it cannot plan to obtain food) due to limited world modeling. Contemporary LLMs encode large amounts of commonsense knowledge implicitly, yet they may still yield implausible actions or physical inconsistencies. The coherence filter in our framework mitigates this by penalizing locally illogical continuations, promoting causal continuity. Evaluators can test for commonsense coherence by checking whether the story violates basic physics or social expectations (“it was raining, yet moments later the ground was perfectly dry”), either through human annotation or automated contradiction detection using knowledge bases such as ConceptNet [79] and recent commonsense reasoning frameworks [80].
Moral and social reasoning constitute further layers of human-like cognition. Many stories implicitly convey moral lessons or emotional catharsis. While METATRON does not explicitly encode ethical reasoning, Polti’s situations often embed moral conflicts (“Sacrifice of loved ones,” “Mistaken jealousy,” “Remorse”), providing indirect scaffolds for ethical evaluation. Future expansions could integrate moral appraisal modules or reinforcement signals tied to social acceptability. Similarly, stylistic creativity—the originality of metaphors, imagery, or linguistic rhythm—remains a hallmark of human narrative art. Neural components can generate figurative language spontaneously, but symbolic oversight could ensure stylistic coherence across sentences. Computational creativity studies increasingly employ detectors for metaphor density, analogy frequency, or lexical novelty to approximate expressive originality; such analyses could be applied to evaluate stylistic richness in generated fiction.
At a deeper level, creativity also involves character transformation: the capacity for agents within the narrative to change beliefs or emotions across time. While our system does not explicitly model character growth, some Polti-derived situations naturally imply transformation (e.g., “Remorse” presupposes regret and self-revision). Human readers may nonetheless perceive such arcs if emotional and causal consistency are well maintained. Distinguishing AI-generated from human-authored stories may thus depend less on surface coherence and more on subtleties of internal development, subtext, and moral nuance—areas where current generative models still lag.

6.6. Creativity, Control, and Cognitive Realism

The relationship between creativity and control remains a central tension in computational storytelling. Excessive symbolic constraint risks formulaic results, whereas unconstrained neural generation often leads to incoherence or redundancy. METATRON seeks a balance: symbolic planning provides narrative scaffolding, while neural generation ensures stylistic and lexical fluidity. Evaluating creativity in this context therefore involves assessing both the breadth of variation across stories and the depth of coherence within each one. Human studies could complement automatic metrics by asking readers whether a story “feels new” or derivative, and whether it evokes a sense of imagination comparable to human writing. Clustering analyses comparing AI and human stories in embedding space could further quantify novelty and diversity.
From a cognitive-scientific perspective, hybrid narrative systems serve a dual function: they produce stories and simultaneously model aspects of human creativity. Earlier systems like MINSTREL already positioned themselves as cognitive simulations of creative reasoning via case-based adaptation [8]. In the same spirit, within the METATRON framework, it is possible to operationalize a plausible model of human storytelling as an interaction between structured knowledge (schemas, dramatic archetypes) and generative flexibility (language improvisation). The degree to which such systems reproduce human-like creative patterns—balancing structure and surprise—may inform not only AI evaluation but also cognitive theories of narrative construction.

6.7. Toward Integrative Evaluation of Cognitive and Creative Dimensions

The evaluation of AI-generated stories must extend beyond fluency or grammar to encompass higher-order cognitive attributes: understanding of mental states (ToM), respect for causal and moral coherence, and the capacity for imaginative novelty. The methodologies discussed—constraint-based creativity tests, ToM consistency checks, commonsense reasoning evaluation, and stylistic originality analysis—collectively provide a multifaceted framework for assessing artificial creativity. If hybrid systems such as METATRON achieve human-comparable ratings in coherence while maintaining competitive novelty and engagement, this would mark a significant step toward cognitive realism in artificial storytelling. Conversely, any remaining deficits (e.g., limited subtext or shallow emotional development) can guide future research into richer narrative planning, multi-character belief tracking, and adaptive creativity through feedback. Ultimately, the convergence of symbolic control and neural expressivity may offer not just better stories, but deeper insight into the cognitive architectures underlying the human capacity to imagine.

7. Architectural Challenges and Practical Considerations

The proposed METATRON architecture relies on a series of specialized yet interconnected components: the AVM/BCO generator responsible for abstract story scaffolding, the fictional generator that realizes scenes in natural language, a coherence filter to ensure logical consistency, a narrative interpolation engine for gap-filling, and optional multimodal generators for enhancing audiovisual engagement. This modular design brings clear advantages for interpretability, controllability, and symbolic transparency—core goals of the architecture.
Such complexity introduces architectural and computational trade-offs. This section critically examines three core areas: the potential computational cost of the pipeline, the propagation of errors through its modular stages, and the scalability of the system to longer, multi-scene narratives. To support a concrete assessment of computational feasibility, we propose a simple cost estimation procedure that formalizes how module runtime compounds throughout the generation process. This is detailed in Algorithm 3.
Algorithm 3: Estimate Computational Overhead
(Algorithm 3 is rendered as an image in the published layout; it formalizes how per-module runtimes compound across the pipeline stages, as described in the text.)
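Since the listing is an image in the published version, the following sketch reconstructs the cost model from the terms used in this section (t_bco, t_gen, t_rank, t_multi); the additive form is our reading of the surrounding text, not a verbatim transcription of Algorithm 3:

def estimate_overhead(k_scenes, n_gaps, m_candidates,
                      t_bco, t_gen, t_rank, t_multi, multimodal=True):
    """Estimated wall-clock cost of one generation pass, in seconds:
    one outline call, m candidates generated and ranked per gap, and
    optional per-scene multimodal synthesis."""
    total = t_bco + n_gaps * m_candidates * (t_gen + t_rank)
    if multimodal:
        total += k_scenes * t_multi
    return total

# Values from Section 7.4; only t_bco = 3.0 s is stated in the text
# reproduced here, so the remaining latencies are assumptions.
print(estimate_overhead(k_scenes=6, n_gaps=10, m_candidates=5,
                        t_bco=3.0, t_gen=1.5, t_rank=0.3, t_multi=8.0))
# -> 141.0 seconds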

7.1. Computational Efficiency

One concern is the computational overhead introduced by the iterative “narrative interpolation” process. Filling each gap in the story requires generating multiple candidate continuations and then evaluating each with a separate model (the coherence filter) to select the best fit. This generate-and-rank approach, inspired by the method of Wang et al. [17], can greatly improve local coherence but is computationally intensive. If m candidate sentences are produced for each of n narrative gaps, the generation model and coherence model could be invoked on the order of m × n times in total. Consequently, a story with many scenes (and thus many gaps to fill) will see roughly linear growth in runtime as its length increases, potentially making the approach slow for long narratives or real-time use. Adding multimodal components (e.g., generating an image or audio snippet for each scene) further compounds the computational cost, since vision or audio generation models must run alongside text generation.
A first optimization strategy concerns the coherence filter. In our prototype we assume a BERT-class masked language model as discriminator, which can be overkill for simple coherence decisions. Standard knowledge distillation techniques [88,89] make it feasible to train a lighter-weight student model (for example, a 3–6 layer Transformer) on the same triple-based coherence supervision. At deployment time, the student replaces the teacher as the primary filter, preserving most of the ranking ability at a fraction of the FLOPs. For setups where very high precision is required, a two-stage scheme is possible: the distilled model prunes the candidate set aggressively, and the full-sized teacher is only applied to the top k candidates. This substantially reduces the effective t_rank term in Algorithm 3 without abandoning the generate-and-rank philosophy.
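A sketch of the two-stage scheme (the scorers are toy stand-ins for the distilled student and the full-sized teacher):

def two_stage_rank(candidates, student_score, teacher_score, top_k=2):
    """Cheap student prunes the pool; the expensive teacher rescores
    only the surviving top-k candidates."""
    pruned = sorted(candidates, key=student_score, reverse=True)[:top_k]
    return max(pruned, key=teacher_score)

cands = ["Rain fell.", "He fled north.", "He fled toward the red box."]
best = two_stage_rank(
    cands,
    student_score=lambda c: c.startswith("He"),   # crude relevance proxy
    teacher_score=lambda c: "red box" in c)       # precise but costly check
print(best)  # -> "He fled toward the red box."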
The second lever is memory and context management. As stories grow longer, repeatedly passing the entire history to the generator or the coherence filter becomes prohibitively expensive and unnecessary. Instead, we can adopt dynamic memory mechanisms similar in spirit to segment-level recurrence and memory caching in architectures such as Transformer-XL [34]. In METATRON, this translates into maintaining a symbolic and neural memory of the narrative state that stores compressed summaries of previous scenes (e.g., entities, unresolved goals, affective status) while only feeding a sliding window of recent text plus a short summary into the generator and coherence filter. This reduces input length, stabilizes latency for long stories, and creates a natural interface between the AVM/BCO layer and the neural models. From the perspective of Algorithm 3, this primarily affects the constant factors inside t_gen and t_rank, since sequence length is a dominant driver of cost.
Third, the candidate generation process itself can be made more efficient. Rather than sampling a fixed number m of candidates for every gap, we can employ adaptive candidate sampling. For example, the interpolation module may start with a small batch of candidates, score them, and stop early if one surpasses a coherence threshold, only escalating to larger batches when the initial candidates are weak. Bandit-style or successive-halving strategies can further reduce cost by discarding low-promise candidates after a partial evaluation, rather than fully scoring all of them. On the generative side, carefully tuned sampling schemes such as top-p (nucleus) sampling with small p can reduce both computational and human evaluation cost by avoiding extremely low-probability continuations that are unlikely to be selected anyway [90]. In practice, this allows lowering m without a proportional loss in narrative quality.
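The early-stopping variant of adaptive sampling can be sketched as follows (threshold and batch sizes are illustrative hyperparameters):

import random

def adaptive_fill(generate, score, threshold=0.8, batch_sizes=(2, 4, 8)):
    """Escalating candidate batches with early stopping: return as soon as
    a candidate clears the coherence threshold, otherwise keep the best."""
    best, best_score = None, float("-inf")
    for size in batch_sizes:
        for cand in generate(size):
            s = score(cand)
            if s > best_score:
                best, best_score = cand, s
            if s >= threshold:
                return best   # early exit saves generator and filter calls
    return best

random.seed(0)
toy_generate = lambda n: [f"bridge-{random.random():.2f}" for _ in range(n)]
print(adaptive_fill(toy_generate, score=lambda c: float(c.split("-")[1])))
# With this seed, the first batch already clears the threshold.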
Finally, multimodal components are clear computational bottlenecks: generating high-resolution images with diffusion models or long audio segments with generative audio models can dominate the total runtime, as reflected in the k · t_multi term of Algorithm 3. Several pathways exist for cost reduction. One is to decouple text and multimodal generation temporally: the full story is first generated and validated in text-only form, and only selected key scenes (for example, one per dramatic phase or per user-specified highlight) are passed to image or audio generators. Another is to use lightweight or quantized variants of diffusion and audio models for interactive preview, reserving heavier models or higher resolutions for an optional “final render” stage. Caching and reusing multimodal outputs across similar prompts, and sharing latent representations when multiple views of the same scene are needed, can also amortize costs in multi-story or batch-generation settings.
While a naïve implementation of METATRON may spend minutes of compute for moderately long stories, practical deployments can leverage model distillation, dynamic memory, adaptive candidate sampling, and selective multimodal generation to keep resource usage within realistic bounds. Explicitly modeling these trade-offs clarifies where future engineering efforts should focus when scaling the architecture to richer, longer, or more interactive storytelling scenarios.

7.2. Error Propagation and Robustness

The modular, sequential design also raises the issue of error propagation—mistakes in one module can cascade into others. Each component in the pipeline relies on receiving sensible input from the previous stage; if that input is flawed, the flaw can be amplified downstream. For example, a poorly constructed BCO outline (e.g., one that contains logical contradictions or an uninteresting plot trajectory) provides a weak foundation for the fictional generator, which may then produce text that is coherent at the sentence level but fails to form a satisfying or logical story. Likewise, if the fictional generator introduces an inconsistency or irrelevant event that the coherence filter does not catch (perhaps because the filter only evaluates local coherence between adjacent segments), that inconsistency will persist into the final narrative. Even when the coherence filter does flag low-quality outputs, the system is limited by the alternatives it has: if all candidate continuations for a gap are sub-optimal, the chosen “least bad” option may still lead the story astray. Without mechanisms to correct such issues, small errors can accumulate over multiple interpolation steps, gradually undermining the overall story consistency or believability.
To improve robustness, the system needs strategies to handle module failures or sub-optimal outputs gracefully. For instance, the interpolation module could regenerate candidates if none of the initial ones meet a minimum coherence score, or the coherence filtering step could be expanded to consider a broader story context (to avoid approving locally coherent but globally inconsistent passages). An open design question is how the architecture should handle these failure modes (retry generation, revert to a fallback outline, or simply proceed with the best available option) and what safeguards can prevent a single module’s error from derailing the entire narrative. Acknowledging the “garbage in, garbage out” risk inherent in pipeline architectures underscores the importance of each module’s reliability and motivates methods to monitor and mitigate errors as they propagate through the system.

7.3. Scalability to Longer Narratives

A final consideration is how well the pipeline scales to longer, multi-scene narratives. The current implementation has been demonstrated on short stories and single-scene interpolations, but extending it to a full-length story with many scenes or chapters introduces new challenges. Apart from the linear increase in computation noted above, maintaining coherence across a long narrative requires handling longer-range dependencies that a pairwise interpolation strategy may not capture. Characters, events, and thematic elements must remain consistent throughout the narrative, which is difficult to ensure with only local (adjacent-sentence) coherence checks.
Scaling up may therefore require a more hierarchical generation strategy or global planning. For example, Yang et al. [91] showed that stories on the order of 2000+ words can be generated by breaking the task into a planning stage, a drafting stage, and a revision stage, effectively guiding the narrative with a high-level plan and then refining it. Their approach (called Re3) demonstrated that long-form generation is feasible, but it demanded substantial computational resources and careful coordination between stages. In follow-up work, Yang et al. [92] introduced a technique for producing detailed, hierarchical story outlines and used them to control the generation process, significantly improving plot coherence in multi-thousand-word narratives at the cost of a more complex planning stage.
These examples suggest that the pipeline could be scaled up to handle longer stories, but doing so would likely require analogous innovations, such as generating more detailed multi-level outlines (beyond the simple BCO for a single scene) and incorporating mechanisms for maintaining global coherence, for instance a memory of past events or constraints on characters’ states across scenes. Potential enhancements include hierarchical interpolation, which fills story gaps at multiple levels of granularity (sketched below); dividing a long narrative into smaller sections that are generated sequentially, with consistency checks at the boundaries; and an editing/refinement module that revisits earlier parts of the story once later parts are generated, to adjust foreshadowing or resolve inconsistencies. Practical measures such as parallelizing certain module operations or using more efficient model architectures for some steps could further curb runtime growth. Addressing these architectural challenges and practical considerations yields a more nuanced picture of the system’s capabilities and limitations, and guides future work on making such complex storytelling systems both effective and scalable.
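To make the hierarchical-interpolation idea concrete, the sketch below fills a gap recursively: a coarse bridging sentence is produced first, and the two resulting sub-gaps are then filled at finer granularity. The `interpolate` stub stands in for one generate-and-filter step; the recursion scheme is our illustrative reading of the idea, not an implemented METATRON module.
```python
def interpolate(alpha: str, omega: str) -> str:
    """Hypothetical stub for one generate-and-filter interpolation step."""
    return f"[bridge: {alpha[:12]}... -> {omega[:12]}...]"

def hierarchical_fill(alpha: str, omega: str, depth: int) -> list[str]:
    """Fill a gap recursively: one coarse bridge first, then finer sub-gaps,
    so each interpolation only has to span a short, local distance."""
    if depth == 0:
        return []
    mid = interpolate(alpha, omega)                   # coarse-grained midpoint
    left = hierarchical_fill(alpha, mid, depth - 1)   # refine left sub-gap
    right = hierarchical_fill(mid, omega, depth - 1)  # refine right sub-gap
    return left + [mid] + right                       # in reading order

# Depth d yields 2**d - 1 intermediate sentences per gap; e.g., depth 3
# expands one B->C gap into 7 sentences using only local coherence checks.
middle = hierarchical_fill("Beginning sentence.", "Climax sentence.", depth=3)
```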

7.4. Illustrative Example: Overhead Estimation

To concretize the cost implications of this pipeline, we simulate a typical scenario. Suppose a story is scaffolded into six scenes (k = 6), and the narrative interpolation step identifies ten gaps (n = 10) to be filled. For each gap, the system generates m = 5 candidate continuations, each evaluated by a coherence scoring model.
We assume the following average module latencies (in a GPU-enabled environment):
  • $t_{\text{bco}} = 3.0$ s for generating the full BCO outline with a symbolic prompt to GPT-4.
  • $t_{\text{gen}} = 2.5$ s per candidate continuation (again using a GPT-class model).
  • $t_{\text{rank}} = 0.5$ s per reranking pass using a BERT-based coherence model.
  • $t_{\text{interp}} = 0.3$ s per interpolation formatting and integration.
  • $t_{\text{multi}} = 4.0$ s per image or audio generation using a model such as SDXL or AudioGen.
Using Algorithm 3, the total runtime is approximated as:
$$
\begin{aligned}
T_{\text{total}} &= t_{\text{bco}} + n \cdot \bigl( m \cdot (t_{\text{gen}} + t_{\text{rank}}) + t_{\text{interp}} \bigr) + k \cdot t_{\text{multi}} \\
&= 3.0 + 10 \cdot \bigl( 5 \cdot (2.5 + 0.5) + 0.3 \bigr) + 6 \cdot 4.0 \\
&= 3.0 + 10 \cdot 15.3 + 24.0 \\
&= 180.0 \text{ s}.
\end{aligned}
$$
Thus, under reasonable assumptions, generating a moderately complex story with 10 interpolations and 6 multimodal scenes would require roughly three minutes of compute time, even in a hardware-accelerated environment. This estimate assumes sequential execution; partial parallelization of candidate generation or scoring could reduce this time.
These figures are conservative: scaling to longer narratives with 20+ gaps or deeper multimodal conditioning could push generation time to 10–15 min. This reinforces the importance of controlling interpolation frequency, optimizing candidate sampling strategies, or batching calls to large models to keep overhead manageable in practical settings.
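For transparency, the cost model behind these figures reduces to a one-line Python function; the default latencies below are the assumed values from the bullet list above, not measured benchmarks.
```python
def total_runtime(n: int, m: int, k: int,
                  t_bco: float = 3.0, t_gen: float = 2.5, t_rank: float = 0.5,
                  t_interp: float = 0.3, t_multi: float = 4.0) -> float:
    """Sequential runtime estimate from Algorithm 3, in seconds:
    outline + per-gap candidate generation/ranking + multimodal rendering."""
    return t_bco + n * (m * (t_gen + t_rank) + t_interp) + k * t_multi

print(total_runtime(n=10, m=5, k=6))  # 180.0 s: the three-minute scenario above
print(total_runtime(n=20, m=5, k=6))  # 333.0 s with 20 gaps, all else equal
```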

8. Conclusions and Future Directions

This survey examined the evolving landscape of AI-driven story generation through the dual lenses of symbolic narrative theory and neural text modeling. By reviewing several decades of research—from early logic-based planners and cognitive narrative models to large-scale transformer architectures—we have highlighted the persistent challenges of coherence, consistency, emotional depth, and creative control in machine-generated fiction. The analysis revealed a growing convergence between computational creativity, cognitive science, and literary theory, suggesting that the most promising advances arise where these domains intersect.
From this comprehensive review, we derived METATRON, a conceptual and methodological synthesis that embodies the principles identified across the literature. Rather than a proposal for one particular system, METATRON is a framework that integrates symbolic planning, coherence filtering, and emotion-aware neural generation. Its architecture reflects the broader insight emerging from the survey: meaningful narrative generation requires both explicit structure and adaptive fluency, an alliance of symbolic interpretability with neural expressivity.
In this sense, METATRON functions as a theoretical bridge linking classical narrative paradigms—such as Polti’s dramatic situations, Freytag’s arc, and cognitive models of episodic memory—with contemporary large language models. The proposed modules (story outlining, interpolation, multimodal rendering, and cognitive evaluation) illustrate how long-standing narrative theories can be recontextualized within neural frameworks to address issues of coherence and controllability that purely statistical systems continue to face.
Limitations. While this synthesis advances the discussion, it also inherits the field’s unresolved challenges. Polti’s taxonomy, though comprehensive, still constrains creative variability within established archetypes. Similarly, coherence filtering ensures local continuity but does not yet capture global narrative logic or subtextual meaning. The cognitive and multimodal extensions outlined here remain speculative, and empirical validation through user studies and large-scale benchmarks is still required. Moreover, evaluating creativity, emotional resonance, and theory-of-mind reasoning continues to depend on human judgment, as automatic proxies remain unreliable.
In addition to these conceptual limitations, the scalability and real-world cost of deploying METATRON at production scale remain only partially addressed: the estimates in Section 7 are illustrative rather than empirical. Running the full neuro-symbolic pipeline (long-context LLMs, multimodal generators, and iterative coherence filters) requires significant computational resources, and the cost of inference and fine-tuning depends on model size, hardware availability, and usage frequency, which may limit accessibility for smaller research groups or real-time applications. Future work should examine strategies for lowering computational demands, such as model distillation, caching of symbolic structures, or lightweight variants of the coherence filter.
Future Directions. Several avenues emerge from this analysis. Advances in memory-augmented transformers, retrieval-augmented generation, and world-modeling architectures could yield more stable long-term coherence and richer causal reasoning. Integrating multiple narrative arcs and subplots would move story generation closer to the structural density of human fiction. The intersection with interactive storytelling and game design offers fertile ground for testing symbolic–neural cooperation in dynamic environments. On the evaluative side, cognitive metrics—capturing belief tracking, emotional trajectories, or reader engagement—could complement traditional linguistic measures. Ethical and cultural dimensions also deserve attention, particularly regarding bias, authorship, and the boundaries of machine creativity.
This survey aimed not only to summarize the state of story generation but to articulate a direction for its next phase: a cognitively grounded, multimodal, and symbolically interpretable paradigm. The proposed METATRON framework thus stands as both a product and a projection of this synthesis—a model that exemplifies how narrative theory, cognitive modeling, and neural text generation may converge toward a more human-like understanding of storytelling. In this convergence lies the promise of future systems capable not merely of producing coherent text, but of narrating with intention, emotion, and imagination.

Author Contributions

Conceptualization, H.C. and B.H.-G.; methodology, H.C. and B.H.-G.; validation, M.H.L. and H.C.; formal analysis, H.C. and M.H.L.; investigation, H.C., B.H.-G. and M.H.L.; resources, H.C.; writing—original draft preparation, B.H.-G.; writing—review and editing, H.C. and M.H.L.; supervision, M.H.L. and H.C.; project administration, H.C.; funding acquisition, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Instituto Politécnico Nacional (COFAA, SIP-IPN, Grant SIP 20250015) and the Mexican Government (SECIHTI, SNII).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

We are grateful to our colleagues in the NLP laboratory and Pablo Gervás for their insightful discussions and feedback on early versions of this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
AVM: Attribute–Value Matrix
BCO: Beginning, Climax, and Outcome structure
BERT: Bidirectional Encoder Representations from Transformers
DOME: Dynamic Outline Model for Evaluation
GPT: Generative Pre-trained Transformer
LLM: Large Language Model
NLP: Natural Language Processing
RL: Reinforcement Learning
SWC: Story–World Context
ToM: Theory of Mind

References

  1. Harari, Y.N. Sapiens: A Brief History of Humankind; Harper: New York, NY, USA, 2015. [Google Scholar]
  2. Rukeyser, M. The Collected Poems of Muriel Rukeyser; University of Pittsburgh Press: Pittsburgh, PA, USA, 2005. [Google Scholar]
  3. Guan, J.; Wang, Y.; Huang, S.; Zhao, Z.; Huang, M. A knowledge-enhanced pretraining model for commonsense story generation. Trans. Assoc. Comput. Linguist. 2020, 8, 93–108. [Google Scholar] [CrossRef]
  4. Li, J.; Galley, M.; Brockett, C.; Gao, J.; Dolan, B. A diversity-promoting objective function for neural conversation models. In Proceedings of the NAACL-HLT 2016, San Diego, CA, USA, 12–17 June 2016; pp. 110–119. [Google Scholar]
  5. Meehan, J.R. TALE-SPIN, an interactive program that writes stories. In Proceedings of the 5th International Joint Conference on Artificial Intelligence, Cambridge, MA, USA, 22–25 August 1977; pp. 91–98. [Google Scholar]
  6. Lebowitz, M. Planning stories. In Proceedings of the 9th Annual Conference of the Cognitive Science Society, Seattle, WA, USA, 16–18 July 1987; pp. 234–242. [Google Scholar]
  7. Turner, S.R. MINSTREL: A Computer Model of Creativity in Storytelling. Ph.D. Thesis, University of California, Los Angeles, CA, USA, 1993. [Google Scholar]
  8. Turner, S.R. The Creative Process: A Computer Model of Storytelling and Creativity; Lawrence Erlbaum Associates: Hillsdale, NJ, USA, 1994. [Google Scholar]
  9. Pérez-y-Pérez, R.; Sharples, M. MEXICA: A computer model of a cognitive account of creative writing. J. Exp. Theor. Artif. Intell. 2001, 13, 119–139. [Google Scholar] [CrossRef]
  10. Riedl, M.O.; Young, R.M. Narrative planning: Balancing plot and character. J. Artif. Intell. Res. 2010, 39, 217–268. [Google Scholar] [CrossRef]
  11. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 5998–6008. [Google Scholar]
  12. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Better Language Models and Their Implications. OpenAI Technical Report. 2019. Available online: https://openai.com/blog/better-language-models/ (accessed on 9 November 2025).
  13. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  14. Ammanabrolu, P.; Riedl, M.O. Playing Text-Adventure Games with Graph-Based Deep Reinforcement Learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 3–5 June 2019; pp. 3557–3565. [Google Scholar] [CrossRef]
  15. Yao, L.; Peng, N.; Weischedel, R.; Knight, K.; Zhao, D.; Yan, R. Plan-and-write: Towards better automatic storytelling. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 7378–7385. [Google Scholar]
  16. Fan, A.; Lewis, M.; Dauphin, Y. Strategies for structuring story generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2650–2660. [Google Scholar] [CrossRef]
  17. Wang, S.; Durrett, G.; Erk, K. Narrative interpolation for generating and understanding stories. arXiv 2020, arXiv:2008.07466. [Google Scholar] [CrossRef]
  18. Clark, E.; Ji, Y.; Smith, N.A. Neural text generation in stories using entity representations as context. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; pp. 2250–2260. [Google Scholar]
  19. Jain, P.; Agrawal, P.; Mishra, A. Story generation from sequence of independent short descriptions. In Proceedings of the SIGKDD Workshop on Machine Learning for Creativity (ML4Creativity), Halifax, NS, Canada, 14 August 2017. [Google Scholar]
  20. Ammanabrolu, P.; Tien, E.; Cheung, W.; Luo, Z.; Ma, W.; Martin, L.J.; Riedl, M.O. Story realization: Expanding plot events into sentences. arXiv 2019, arXiv:1909.03480. [Google Scholar] [CrossRef]
  21. Xu, P.; Patwary, M.; Shoeybi, M.; Puri, R.; Fung, P.; Anandkumar, A.; Catanzaro, B. MEGATRON-CNTRL: Controllable story generation with external knowledge using large-scale language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 2831–2845. [Google Scholar]
  22. Rashkin, H.; Celikyilmaz, A.; Choi, Y.; Gao, J. PlotMachines: Outline-Conditioned Generation with Dynamic Plot State Tracking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 16–20 November 2020; pp. 4274–4295. [Google Scholar] [CrossRef]
  23. Liu, N.F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Jia, R.; Liang, P.; Manning, C.D. Lost in the middle: How language models use long contexts. arXiv 2023, arXiv:2307.03172. [Google Scholar] [CrossRef]
  24. Peng, N.; Ghazvininejad, M.; May, J.; Knight, K. Towards controllable story generation. In Proceedings of the 1st Workshop on Storytelling, New Orleans, LA, USA, 5 June 2018; pp. 43–49. [Google Scholar]
  25. Tambwekar, P.; Dhuliawala, M.; Martin, L.; Mehta, A.; Harrison, B.; Riedl, M.O. Controllable neural story plot generation via reward shaping. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 5982–5988. [Google Scholar] [CrossRef]
  26. Luo, F.; Xu, Z.; Liu, T.; Chang, B.; Sui, Z. Learning to control the fine-grained sentiment for story ending generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6020–6026. [Google Scholar]
  27. Brahman, F.; Chaturvedi, S. Modeling protagonist emotions for emotion-aware storytelling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online, 16–20 November 2020; pp. 5277–5294. [Google Scholar]
  28. Wang, W.; Li, P.; Zheng, H. Consistency and coherency enhanced story generation. In Proceedings of the European Conference on Information Retrieval, Virtual Event, 28 March–1 April 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 85–99. [Google Scholar]
  29. Ammanabrolu, P.; Riedl, M.O. Learning Knowledge Graph-based World Models of Textual Environments. In Proceedings of the Advances in Neural Information Processing Systems, NeurIPS 2021, Online, 6–14 December 2021. [Google Scholar]
  30. Bosselut, A.; Rashkin, H.; Sap, M.; Malaviya, C.; Celikyilmaz, A.; Choi, Y. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy, 28 July–2 August 2019; pp. 4762–4779. [Google Scholar] [CrossRef]
  31. Ammanabrolu, P.; Riedl, M.O. Modeling worlds in text. arXiv 2021, arXiv:2106.09578. [Google Scholar] [CrossRef]
  32. y Pérez, R.P.; Sharples, M. Three computer-based models of storytelling: BRUTUS, MINSTREL and MEXICA. Knowl. Based Syst. 2004, 17, 15–29. [Google Scholar] [CrossRef]
  33. Porteous, J.; Cavazza, M. Controlling narrative generation with planning trajectories: The role of constraints. In Proceedings of the ICIDS 2009: Interactive Storytelling, Berlin, Germany, 9–11 December 2009; pp. 234–245. [Google Scholar]
  34. Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.G.; Le, Q.; Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2978–2988. [Google Scholar] [CrossRef]
  35. Xiang, J.; Zhao, Z.; Zhou, M.; McKenzie, M.; Kilayko, A.; Macbeth, J.C.; Carter, S.; Sieck, K.; Klenk, M. Interleaving a symbolic story generator with a neural network-based large language model. In Proceedings of the Advances in Cognitive Systems Conference, Arlington, VA, USA, 19–22 November 2022. [Google Scholar]
  36. Farrell, R.; Ware, S.G. Large Language Models as Narrative Planning Search Guides. IEEE Trans. Games 2025, 17, 419–428. [Google Scholar] [CrossRef]
  37. Ware, S.G.; Young, R.M. Glaive: A State-Space Narrative Planner Supporting Intentionality and Conflict. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, Québec, QC, Canada, 27–31 July 2014; AAAI Press: Washington, DC, USA, 2014; pp. 957–964. [Google Scholar]
  38. Yang, S.; Ge, Y.; Li, Y.; Chen, Y.; Ge, Y.; Shan, Y.; Chen, Y. SEED-Story: Multimodal Long Story Generation with Large Language Model. arXiv 2024, arXiv:2407.08683. [Google Scholar] [CrossRef]
  39. Yang, L.; Xiao, Z.; Huang, W.; Zhong, X. StoryLLaVA: Enhancing Visual Storytelling with Multi-Modal Large Language Models. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, UAE, 19–24 January 2025; Rambow, O., Wanner, L., Apidianaki, M., Al-Khalifa, H., Di Eugenio, B., Schockaert, S., Eds.; International Committee on Computational Linguistics (ICCL). 2025; pp. 3936–3951. [Google Scholar]
  40. Chen, Z.; Pan, R.; Li, H. StoryBox: Collaborative Multi-Agent Simulation for Hybrid Bottom-Up Long-Form Story Generation Using Large Language Models. arXiv 2025, arXiv:2510.11618. [Google Scholar] [CrossRef]
  41. Narayanan, D.; Shoeybi, M.; Casper, J.; LeGresley, P.; Patwary, M.; Korthikanti, V.A.; Vainbrand, D.; Kashinkunti, P.; Bernauer, J.; Catanzaro, B.; et al. Efficient large-scale language model training on GPU clusters using Megatron-LM. arXiv 2021, arXiv:2104.04473. [Google Scholar] [CrossRef]
  42. Polti, G. Les 36 Situations Dramatiques; Mercure de France: Paris, France, 1895. [Google Scholar]
  43. Figgis, M. The Thirty-Six Dramatic Situations; Faber and Faber: London, UK, 2017. [Google Scholar]
  44. Minsky, M. A Framework for Representing Knowledge. In The Psychology of Computer Vision; Winston, P., Ed.; McGraw-Hill: New York, NY, USA, 1975; pp. 211–277. [Google Scholar]
  45. Calvo, H.; Gelbukh, A. Recognizing Situation Patterns from Self-Contained Stories. In Proceedings of the Advances in Natural Language Understanding and Intelligent Access to Textual Information: NLUIATI-2005 Workshop in conjunction with MICAI-2005, Monterrey, Mexico, 14–18 November 2005; Gelbukh, A., Gómez, M.M., Eds.; Research in Computing Science. Center for Computing Research, IPN: Mexico City, Mexico, 2006; pp. 1–10. [Google Scholar]
  46. Gelbukh, A.; Calvo, H. Second Approach: Constituent Grammars. In Automatic Syntactic Analysis Based on Selectional Preferences; Springer: Berlin/Heidelberg, Germany, 2018; pp. 29–44. [Google Scholar]
  47. Mostafazadeh, N.; Chambers, N.; He, X.; Parikh, D.; Batra, D.; Vanderwende, L.; Kohli, P.; Allen, J. A corpus and evaluation framework for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 6 January 2016; pp. 839–849. [Google Scholar] [CrossRef]
  48. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 3–5 June 2019; pp. 4171–4186. [Google Scholar]
  49. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar] [CrossRef]
  50. Kandel, E.R.; Koester, J.D.; Mack, S.H.; Siegelbaum, S.A. Principles of Neural Science, 6th ed.; McGraw-Hill: New York, NY, USA, 2021. [Google Scholar]
  51. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar] [CrossRef]
  52. Bae, J.; Jeong, S.; Kang, S.; Han, N.; Lee, J.Y.; Kim, H.; Kim, T. Sound of story: Multi-modal storytelling with audio. In Proceedings of the Findings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 13465–13479. [Google Scholar]
  53. Agostinelli, A.; Borsos, Z.; Engel, J.; Verzetti, M.; Le, Q.V.; Adi, Y.; Acher, M.; Saharia, C.; Chan, W.; Tagliasacchi, M. MusicLM: Generating music from text. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar] [CrossRef]
  54. Kreuk, F.; Polyak, A.; Ridnik, T.; Sharir, E.; Paiss, R.; Lang, O.; Mosseri, I.; Ayalon, A.; Dorman, G.; Freedman, D. AudioGen: Textually guided audio generation. arXiv 2022, arXiv:2209.15352. [Google Scholar] [CrossRef]
  55. Propp, V. Morphology of the Folktale, 2nd ed.; University of Texas Press: Austin, TX, USA, 1968. [Google Scholar]
  56. Gervás, P. Computational approaches to storytelling and creativity. AI Mag. 2014, 30, 49–62. [Google Scholar] [CrossRef]
  57. Reagan, A.J.; Danforth, C.M.; Tivnan, B.; Williams, J.R.; Dodds, P.S. The emotional arcs of stories are dominated by six basic shapes. EPJ Data Sci. 2016, 5, 31. [Google Scholar] [CrossRef]
  58. Boden, M.A. Creativity and artificial intelligence. Artif. Intell. 1998, 103, 347–356. [Google Scholar] [CrossRef]
  59. Huang, T.H.K.; Ferraro, F.; Mostafazadeh, N.; Misra, I.; Agrawal, A.; Devlin, J.; Girshick, R.; He, X.; Kohli, P.; Batra, D.; et al. Visual storytelling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 6 June 2016; pp. 1233–1239. [Google Scholar] [CrossRef]
  60. Copet, J.L.; Kreuk, F.; Gat, I.; Remez, T.; Kant, D.; Synnaeve, G.; Adi, Y.; Défossez, A. Simple and controllable music generation. arXiv 2023, arXiv:2306.05284. [Google Scholar] [CrossRef]
  61. Shoeybi, M.; Patwary, M.; Puri, R.; LeGresley, P.; Casper, J.; Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv 2019, arXiv:1909.08053. [Google Scholar] [CrossRef]
  62. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Volume 33, pp. 9459–9474. [Google Scholar]
  63. Hashimoto, T.B.; Guu, K.; Oren, Y.; Liang, P. A retrieve-and-edit framework for predicting structured outputs. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; Volume 31, pp. 10073–10083. [Google Scholar]
  64. Kosinski, M. Theory of mind may have spontaneously emerged in large language models. arXiv 2023, arXiv:2302.02083. [Google Scholar] [CrossRef]
  65. Kosinski, M. Evaluating large language models in theory-of-mind tasks. Proc. Natl. Acad. Sci. USA 2024, 121, e2405460121. [Google Scholar] [CrossRef] [PubMed]
  66. Riedl, M.O. The Lovelace 2.0 Test of Artificial Creativity and Intelligence. arXiv 2014, arXiv:1410.6142. [Google Scholar] [CrossRef]
  67. AlHussain, A.I.; Azmi, A.M. Automatic story generation: A survey of approaches. ACM Comput. Surv. 2021, 54, 103:1–103:38. [Google Scholar] [CrossRef]
  68. Alabdulkarim, A.; Li, S.; Peng, X. Automatic story generation: Challenges and attempts. In Proceedings of the 3rd Workshop on Narrative Understanding (NUSE), Virtual, 11 June 2021; pp. 72–83. [Google Scholar]
  69. Kachare, A.H.; Kalla, M.; Gupta, A. A review: Automatic short story generation. Seybold Rep. 2022, 17, 1818–1829. [Google Scholar]
  70. Pemberton, L. A Modular Approach to Story Generation. In Proceedings of the Fourth Conference of the European Chapter of the Association for Computational Linguistics, Manchester, UK, 10–12 April 1989. [Google Scholar]
  71. Fan, A.; Lewis, M.; Dauphin, Y. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; pp. 889–898. [Google Scholar] [CrossRef]
  72. Li, J.; Bing, L.; Qiu, L.; Chen, D.; Zhao, D.; Yan, R. Learning to write stories with thematic consistency and wording novelty. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 1715–1722. [Google Scholar] [CrossRef]
  73. Dathathri, S.; Madotto, A.; Lan, J.; Hung, J.; Frank, E.; Molino, P.; Yosinski, J.; Liu, R. Plug and play language models: A simple approach to controlled text generation. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar] [CrossRef]
  74. Young, R.M.; Ware, S.G.; Cassell, B.A.; Robertson, J. Plans and planning in narrative generation: A review of plan-based approaches to the generation of story, discourse and interactivity in narratives. Sprache Datenverarb. Spec. Issue Form. Comput. Model. Narrat. 2013, 37, 41–64. [Google Scholar]
  75. Cardona-Rivera, R.E.; Cassell, B.A.; Ware, S.G.; Young, R.M. Indexter: A computational model of the event-indexing situation model for characterizing narratives. In Proceedings of the 3rd Workshop on Computational Models of Narrative, Istanbul, Turkey, 26–27 May 2012; pp. 34–43. [Google Scholar]
  76. Huet, A.; Houidi, Z.B.; Rossi, D. Episodic memories generation and evaluation benchmark for large language models. arXiv 2025, arXiv:2501.13121. [Google Scholar] [CrossRef]
  77. Andreas, J. Language Models as Agent Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 5769–5779. [Google Scholar] [CrossRef]
  78. Riedl, M.O.; Harrison, B. Using stories to teach human values to artificial agents. In Proceedings of the 2nd AAAI Conference on Artificial Intelligence, Ethics, and Society, Madrid, Spain, 20–22 October 2016. [Google Scholar]
  79. Speer, R.; Chin, J.; Havasi, C. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4444–4451. [Google Scholar]
  80. Ilievski, F.; Oltramari, A.; Ma, K.; Zhang, B.; McGuinness, D.L.; Szekely, P. Dimensions of commonsense knowledge. Knowl. Based Syst. 2021, 229, 107347. [Google Scholar] [CrossRef]
  81. Ammanabrolu, P.; Cheung, W.; Broniec, W.; Riedl, M.O. Automated storytelling via causal, commonsense plot ordering. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021; Volume 35, pp. 5859–5867. [Google Scholar]
  82. Rabinowitz, N.C.; Perbet, F.; Song, H.F.; Zhang, C.; Eslami, S.M.A.; Botvinick, M. Machine Theory of Mind. In Proceedings of the 35th International Conference on Machine Learning, ICML, Stockholm, Sweden, 10–15 July 2018; pp. 4218–4227. [Google Scholar]
  83. Ullman, T. Large language models fail on trivial alterations to theory-of-mind tasks. arXiv 2023, arXiv:2302.08399. [Google Scholar]
  84. Sap, M.; Rashkin, H.; Chen, D.; LeBras, R.; Choi, Y. SocialIQA: Commonsense reasoning about social interactions. arXiv 2019, arXiv:1904.09728. [Google Scholar] [CrossRef]
  85. Sileo, D.; Lernould, A. Mindgames: Targeting theory of mind in large language models with dynamic epistemic modal logic. arXiv 2023, arXiv:2305.03353. [Google Scholar] [CrossRef]
  86. Guan, J.; Feng, Z.; Chen, Y.; He, R.; Mao, X.; Fan, C.; Huang, M. LOT: A Story-Centric Benchmark for Evaluating Chinese Long Text Understanding and Generation. Trans. Assoc. Comput. Linguist. 2022, 10, 434–451. [Google Scholar] [CrossRef]
  87. Jordanous, A. A Standardised Procedure for Evaluating Creative Systems: Computational Creativity Evaluation Based on What It Is to Be Creative. Cogn. Comput. 2012, 4, 246–279. [Google Scholar] [CrossRef]
  88. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. In Proceedings of the NIPS Deep Learning and Representation Learning Workshop, Montréal, QC, Canada, 11 December 2015. [Google Scholar]
  89. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. In Proceedings of the NeurIPS Workshop on Energy Efficient Machine Learning and Cognitive Computing, Vancouver, BC, Canada, 3 December 2019. [Google Scholar]
  90. Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; Choi, Y. The Curious Case of Neural Text Degeneration. In Proceedings of the ICLR, Virtual, 26 April–1 May 2020. [Google Scholar]
  91. Yang, K.; Tian, Y.; Peng, N.; Klein, D. Re3: Generating longer stories with recursive reprompting and revision. arXiv 2022, arXiv:2210.06774. [Google Scholar] [CrossRef]
  92. Yang, K.; Klein, D.; Peng, N.; Tian, Y. DOC: Improving long story coherence with detailed outline control. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; pp. 3378–3465. [Google Scholar]
Figure 1. General architecture of the METATRON framework. Trapezoids represent intermediate symbolic structures (AVM outlines in cyan and phrase-level outlines in red), rectangles correspond to full narrative layers, and the cylinder denotes the knowledge base. Colors distinguish the AVM track (cyan), the phrase-level track (red), the generative outline module (purple), and the interpolation module (dark red). Dashed arrows indicate iterative feedback loops that refine and realign earlier stages in the pipeline.
Figure 2. Example of an Attribute-Value Matrix (AVM) [45] representing a narrative situation. In this illustration, the situation is Polti’s Pursuit. Attributes like Characters, Setting, and Why (causal link) are filled with example values.
Figure 3. Process of generating Beginning, Climax, and Outcome (BCO) sentences from an AVM. The AVM provides the structured representation of a situation (here, an example of Polti’s “Pursuit” situation), and the situational script converts it into three natural-language sentences outlining the narrative. Blue boxes mark the AVM-related transformation steps, the green node represents the knowledge database used for basic filling, and the red boxes indicate the BCO-specific extraction stages. The hexagon corresponds to the few-shot prompting stage with GPT-4.
Figure 4. Training of the auto-regressive language model for fictional text generation. The model is first trained on a large fiction corpus to learn storytelling style, then fine-tuned on a causal/coherence-focused dataset so that its continuations are logically and temporally connected to the prompt. Green hexagons denote the model and dataset components, the blue box marks the fine-tuning stage, and the red hexagon represents the resulting fictional generator after adaptation.
Figure 5. Training of the masked language model for coherence filtering. The model is trained on both naturally coherent story sequences and perturbed (incoherent) sequences, learning to distinguish which candidate sentence leads to a more coherent overall passage. Green circles and hexagons denote the original datasets and base model, blue boxes correspond to the perturbation and fine-tuning stages, and the red hexagon marks the resulting coherence ranker that evaluates candidate continuations.
Figure 6. Narrative interpolation process (one iteration). Given a preceding sentence (Alpha) and a succeeding sentence (Omega), the system generates multiple candidate Beta sentences. The coherence filter (incorporating the masked language model) evaluates which candidate best connects Alpha to Omega. The highest-scoring Beta is then inserted between Alpha and Omega. This process can repeat to insert multiple sentences. Green labels denote the fixed boundary sentences (Alpha and Omega), blue shapes correspond to the generative and ranking modules involved in producing the Beta candidates, and red boxes indicate the update steps that modify the AVMs and the evolving story.
Figure 7. Full story generation: the system starts with a Beginning (B), Climax (C), and Outcome (O) outline. It then interpolates intermediate sentences (shown as grey boxes) between B and C, and between C and O, resulting in a fully fleshed-out story.
Figure 8. Overall architecture of the METATRON story generation framework. Symbolic components (green) generate the high-level narrative structure, neural components (blue) produce and refine text, and potentially multimodal components (red) add images and audio to the story.
Table 1. Comparative overview of representative approaches to automatic story generation.

| Paradigm/System | Key References | Core Mechanism | Strengths | Limitations/Challenges |
| --- | --- | --- | --- | --- |
| TALE-SPIN | Meehan [5] | Rule-based simulation of characters’ goals and problem-solving | Ensures causal logic; early model of narrative reasoning | Limited domain, rigid grammar, minimal stylistic variety |
| Author | Lebowitz [6] | Plot fragments and rule-based assembly | Genre control; domain-specific coherence | Highly handcrafted; low generalization |
| MINSTREL | Turner [7,8] | Case-based reasoning and transformational creativity | Adaptation of existing stories; explicit author goals | Requires detailed knowledge base; limited language output |
| MEXICA | Pérez-y-Pérez and Sharples [9], y Pérez and Sharples [32] | Engagement–reflection cognitive cycle with emotional tension modeling | Strong internal coherence and affective arcs | Domain-limited; manually designed action set |
| Narrative Planners (IPOCL, etc.) | Riedl and Young [10], Porteous and Cavazza [33] | Partial-order causal link planning with character intentions | Causal consistency and believable actions | Text realization often mechanical; computationally expensive |
| Plan-and-Write/Hierarchical Neural | Yao et al. [15], Fan et al. [16] | Two-stage neural pipeline: outline generation then realization | Better global focus than flat LMs; scalable with LLMs | Still prone to drift within long contexts |
| Knowledge-Augmented LMs | Guan et al. [3], Ammanabrolu and Riedl [14] | Integration of commonsense or script knowledge graphs | Improved logical coherence and causality | Limited by coverage and noise in external knowledge |
| INTERPOL | Wang et al. [17] | Generator–critic setup with coherence reranker (RoBERTa) | Removes incoherent continuations effectively | Requires multiple candidates; increases inference cost |
| Entity-Aware Generators | Clark et al. [18], Ammanabrolu et al. [20], Rashkin et al. [22] | Explicit entity or state representations during decoding | Consistent characters and references; mitigates “lost in the middle” | Additional complexity; entity drift not fully solved |
| Emotion- and Goal-Conditioned Models | Brahman and Chaturvedi [27], Luo et al. [26], Tambwekar et al. [25] | Conditioning via emotion trajectories or reinforcement learning rewards | Genre or mood control; dynamic affective pacing | Limited emotional taxonomies; unstable optimization |
| Large Language Models (GPT-2/3, etc.) | Vaswani et al. [11], Radford et al. [12], Brown et al. [13], Dai et al. [34] | Autoregressive Transformer trained on large-scale corpora | Fluent, stylistically rich text generation | Weak long-range coherence; memory and consistency issues |
| Hybrid/Neurosymbolic | Xiang et al. [35], Farrell and Ware [36], Ware and Young [37] | Combination of symbolic planning and neural generation guided by coherence filters | Balances structure and fluency; explainable; extensible | Integration cost; evaluation frameworks still developing |
Table 2. Comparative characteristics of story generation paradigms.

| Criterion | Symbolic | Neural | Neuro-Symbolic |
| --- | --- | --- | --- |
| Coherence | High: follows explicit plot logic | Medium/Low: lacks structure, drifts | High: guided plots keep global structure |
| Consistency | High: no contradictions; explicit states | Low: frequent contradictions or drift | High: constraints avoid contradictions |
| Creativity | Low: rule-bound, formulaic | High: diverse, data-driven content | Medium: structured yet diverse |
| Prior Knowledge | Extensive: handcrafted rules/templates | Minimal: learned from data | Moderate: needs schemas plus pretraining |
| Scalability | Poor: domain-limited, brittle | High: general but context-limited | Moderate: scalable neural core, adaptable rules |
