TimeWeaver: Orchestrating Narrative Order via Temporal Mixture-of-Experts Integrated Event–Order Bidirectional Pretraining and Multi-Granular Reward Reinforcement Learning

Lu, Zhicong; Jia, Wei; Tian, Changyuan; Jin, Li; Bai, Yang; Xu, Guangluan

doi:10.3390/electronics14193880

Open AccessArticle

TimeWeaver: Orchestrating Narrative Order via Temporal Mixture-of-Experts Integrated Event–Order Bidirectional Pretraining and Multi-Granular Reward Reinforcement Learning

by

Zhicong Lu

^1,2,3,4

,

Wei Jia

^1,2,3,4

,

Changyuan Tian

^1,2,3,4

,

Li Jin

^1,4,*

,

Yang Bai

^1,4

and

Guangluan Xu

^1,4

¹

Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China

²

University of Chinese Academy of Sciences, Beijing 100190, China

³

School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China

⁴

Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(19), 3880; https://doi.org/10.3390/electronics14193880

Submission received: 18 August 2025 / Revised: 21 September 2025 / Accepted: 24 September 2025 / Published: 29 September 2025

(This article belongs to the Special Issue Advances in Generative AI and Computational Linguistics)

Download

Browse Figures

Versions Notes

Abstract

Human storytellers often orchestrate diverse narrative orders (chronological, flashback) for crafting compelling stories. To equip artificial intelligence systems with such capability, existing methods rely on implicitly learning narrative sequential knowledge, or explicitly modeling narrative order through pairwise event temporal order (e.g., take medicine <after> get ill). However, both suffer from imbalanced narrative order distribution bias and inadequate event temporal understanding, hindering generating high-quality events in the story that balance the logic and narrative order. In this paper, we propose a narrative-order-aware framework, TimeWeaver, which presents an event–order bidirectional pretrained model integrated with temporal mixture-of-experts to orchestrate diverse narrative orders. Specifically, to mitigate imbalanced distribution bias, the temporal mixture-of-experts is devised to route events with various narrative orders to corresponding experts, grasping distinct orders of narrative generation. Then, to enhance event temporal understanding, an event sequence narrative-order-aware model is pretrained with bidirectional reasoning between event and order, encoding the event temporal orders and event correlations. At the fine-tuning stage, reinforcement learning with multi-granular optimal transport reward is designed to boost the quality of generated events. Extensive experimental results on automatic and manual evaluations demonstrate the superiority of our framework in orchestrating diverse narrative orders during story generation.

Keywords:

story generation; narrative order; pretraining; mixture-of-experts; reinforcement learning; optimal transport

1. Introduction

Stories, serving as an essential element of numerous literary genres, exhibit a significant role across various domains, such as literature, education, and entertainment. To increase the readability and interest of the story, a proficient storyteller possesses the skillful ability to orchestrate diverse narrative orders (i.e., chronological and flashback) in the storyline. For example, in Figure 1, following a series of chronological narratives of current situations, the literary work ingeniously inserts the past events “Bad luck to dream of mountains…” as a flashback to provide context for current narratives, which explains the protagonist’s tragic plight and enhances the aesthetics and appeal of the story. To develop a more comprehensive artificial intelligence-generated content (AIGC) system, it is crucial to equip artificial intelligence systems with such capability of orchestrating diverse narrative orders of the story.

To achieve this goal, existing work has conducted preliminary explorations in two mainstreams to guide the model in manipulating the narrative order: (1) implicit learning of narrative sequential knowledge from the text corpus; (2) explicit modeling of narrative order through learning the pairwise event temporal order. Concretely, the former [1,2,3] generally adopts the representative next token prediction paradigm in the pretraining or fine-tuning stage, which expects the model to automatically grasp the narrative sequential knowledge entailed in the training corpus, thereby arranging the narrative order reasonably. However, the imbalanced narrative order distribution bias stemmed from the natural corpus inclines the model towards orchestrating a monotonic chronological order, underestimating the creative flashback order. As shown in Figure 2, influenced by the dominant chronological order in the corpus, existing methods (i.e., vanilla) would exacerbate this bias by increasing the chronological order (e.g., from 0.68 to 0.79) and shrinking the flashback order (e.g., from 0.22 to 0.06), thus hindering creative generation. To partially alleviate this bias, MEGATRON [4] introduces external knowledge to lead the event content generation. However, it requires multiple retrievals from the knowledge base, increasing the complexity and limiting the freedom of story generation.

For the latter, existing methods model the narrative order through learning the pairwise event temporal order. As depicted in Figure 3, the chronological narrative order between pairwise events is equivalent to the “before” temporal order (i.e., event

[e 1]

described earlier happens before event

[e 2]

described later), while the flashback narrative order corresponds to the “after” temporal order. Therefore, narrative order can be represented by event temporal order, so an adequate understanding of event temporal order is a prerequisite for orchestrating the narrative order reasonably. To achieve this goal, diverse pretraining schemas oriented on event temporal order [7,8,9,10,11] are adopted to mine the event temporal knowledge. To further enhance the controllability of story generation and actively stimulate the model to orchestrate rich narrative orders, the temporal prompts [5] (i.e., <before>, <after>) are engaged to endow the model with the ability to manipulate the narrative order in place. For instance, in Figure 3, <after> implies the pairwise event temporal order, which induces the model to insert the past event for orchestrating the flashback order (i.e., “felt confused and in pain” described earlier occurs after “had a car accident” described later).

However, these methods are plagued by the difficulty in generating high-quality events in the story that balance the logic and narrative order, leading to deteriorating event logic and inefficient pairwise event temporal order manipulation. We attribute this phenomenon to the following reasons: (i) the setting of orchestrating diverse narrative orders is inherently more challenging when required to generate high-quality events in the story, as it requires considering both event logic and temporal order rather than mere logic in regular story generation; (ii) the imbalanced narrative order distribution from the natural corpus exacerbates the learning difficulty; (iii) the implemented training strategy invariably motivates the model to perform the unidirectional reasoning from temporal order to event, neglecting the inverse reasoning order from pairwise events, leading to the insufficient understanding of event temporal order.

In this paper, we propose a novel narrative-order-aware framework for story generation, which presents an event–order bidirectional pretrained model integrated with temporal mixture-of-experts to enhance the understanding of event temporal order and mitigate the imbalanced narrative order distribution bias. A reinforcement learning algorithm with the innovative multi-granular optimal transport (OT) reward is also designed to further improve the quality of generated events. Specifically, the temporal mixture-of-experts is devised to route events with various narrative orders to corresponding experts so that each type of expert is able to focus on learning certain narrative order orchestration (chronological or flashback) and facilitates the model grasping distinct orders of narrative generation. This hinders overfitting to the dominant narrative order (e.g., chronological) in the corpus by previous methods using a single feed-forward neural network as a whole, thereby mitigating the imbalanced narrative order distribution bias. Then, to adequately understand event temporal order, an event sequence narrative order model is pretrained with bidirectional reasoning between event and order to encode event temporal orders and correlations, which is realized by the joint objectives of temporal event ordering, pairwise order prediction, and event blank infilling. At the fine-tuning stage, the mappings between the generated events and the sentences in the story are captured by OT. On the basis of mappings, we construct an innovative multi-granular reward to effectively measure the generated event quality, further boosting the quality of generated event under the optimization of the reinforcement learning algorithm. Overall, our contributions can be summarized as follows:

We present a novel narrative-order-aware framework, TimeWeaver, for story generation, which proposes an event–order bidirectional pretrained model integrated with temporal mixture-of-experts to enhance the understanding of event temporal order and mitigate the imbalanced narrative order distribution bias.
In TimeWeaver, an event sequence narrative-order-aware model is pretrained with bidirectional reasoning between event and order to encode event temporal orders and event correlations, where the temporal mixture-of-experts routes events with various narrative orders to corresponding experts, facilitating each type of expert grasping distinct orders of narrative generation.
We design a reinforcement learning algorithm with the innovative multi-granular OT reward to further improve the generated event quality at the fine-tuning stage. The multi-granular reward effectively measures the quality of generated events based on the mappings captured by OT.
Extensive automatic and manual experiments validate the effectiveness of our framework in orchestrating diverse narrative orders during story generation.

2. Related Work

2.1. Story Generation

Story generation was first presented by logical and symbolic reasoning, which focuses on generating a chain of events and arranging them into a coherent narrative. Compared to other sequence-to-sequence natural language processing (NLP) tasks such as machine translation [12], it receives relatively limited information (e.g., input sentence, story title), but requires outputting much longer sequences, which is inherently a more challenging task. To achieve better story generation, existing works with deep neural networks primarily focus on improving two aspects of the story: coherence and controllability. For the coherence, to imitate the human writing characteristics, the pioneer works [1,2,3] introduced a two-stage planning-based method, namely planning storyline prior to writing stories. Benefiting from the additional supervised signal from the process of the storyline generation, the coherence of generated stories can be significantly improved. Later, numerous works were devoted to refining the storyline because the quality of the generated storyline determines the final generated story. Fan et al. [13,14] and Chen et al. [15] explored the impact of different storyline compositions (e.g., a sentence, semantic role) on the generated stories, while Tan et al. [16] and Xu et al. [4] concentrated on leveraging the external knowledge bases to enrich the generated storyline. For the controllability, researchers [17,18] inserted special prompts into the input or the generated storyline, guiding the model in generating stories related to the desired attribute, such as emotion, protagonist’s personality, and writing styles, while Feng et al. [19] and Yuan et al. [20] controlled the story generation through multiple rounds of fine-grain with the Large Language Models (LLMs) [21].

Despite the progress, previous works pay little attention to orchestrating diverse narrative orders (i.e., chronological and flashback) in storylines, which is a common writing technique for human storytellers to craft compelling stories. Although a few researchers attempted to equip artificial intelligence system with such capability through implicit learning of narrative sequential knowledge from text corpus [1,2] or explicit modeling of narrative order through learning pairwise event temporal order [5,8], they suffer from imbalanced narrative order distribution bias and insufficient understanding of event temporal order, making it challenging to create high-quality events in the story that balance logic and narrative order. To overcome these issues, we devise the temporal mixture-of-experts to route events with various narrative orders to corresponding experts, hindering overfitting to a certain narrative order in the corpus (i.e., chronological) and mitigating the bias. Then, we present an event sequence narrative-order-aware model that is pretrained with bidirectional reasoning between event and order, enhancing the understanding of event temporal order. Moreover, we design a reinforcement learning algorithm with the novel multi-granular OT reward to further improve the generated event quality.

2.2. Optimal Transport

The original goal of OT is to determine the mappings between two data distributions while minimizing the associated cost [22]. The cost is generally computed by the mappings and the cost function. Formally, considering two discrete probability distributions

u (x)

and

v (y)

on two complete metric spaces X and Y, respectively, where

\sum_{i = 1}^{n} u (x_{i}) = \sum_{j = 1}^{m} v (y_{j}) = 1

, the OT mappings

T^{*}

between

u (x)

and

v (y)

can be derived by solving the following optimization problem as (1)

\begin{matrix} T^{*} = \underset{T}{\arg \min} \sum_{i, j = 1} t_{i j} \cdot f (x_{i}, y_{j}), \\ s . t . & \sum_{j = 1} t_{i j} = u (x_{i}) \forall i \in {1, \dots, n}, \\ \sum_{i = 1} t_{i j} = v (y_{j}) \forall j \in {1, \dots, m}, \\ T \in R_{+}^{n \times m}, \end{matrix}

(1)

where f denotes the cost function. To efficiently solve the above optimization problem, previous works have presented some approximation algorithms, such as Sinkhorn [23] and IPOT [24]. Recently, OT has been widely used in the field of natural language processing. Lee et al. [25] adopted OT to construct a distance metric, achieving the interpretable textual semantic alignment, while Guerreiro et al. [26] employed OT to build a hallucination detector, improving the fidelity of machine translation. In this paper, IPOT [24] is adopted to capture the mappings between the generated events and the sentences in the story. With the mappings, an innovative multi-granular reward is constructed to further improve the quality of generated events under the optimization of the reinforcement learning algorithm.

2.3. Event-Centric Pretraining

Although pretrained language models (PLMs) equipped with a large-scale corpus have made great progress in the field of natural language processing [21,27], their performance on event-centric tasks is far from satisfactory. It is probably because the paradigm of next token prediction emphasizes more token-level concurrence, weakening the understanding of the complete semantic unit “event” [9]. To address this limitation, recent works pay more attention to designing event-centric pretraining to enhance the understanding of events. For instance, Zhou et al. [9,28] separately adopted discriminative event mask reconstruction and contrastive learning to learn event commonsense. Lin et al. [8] and Han et al. [7] explored event trigger and temporal indicators prediction to mine the event temporal knowledge entailed in the training corpus, which shares a similar motivation to our work. Nevertheless, they suffer from the imbalanced narrative order distribution bias, which causes the pretrained model to grasp the dominant chronological narrative order much better than the scarce flashback narrative order. Moreover, they invariably motivate the model to perform the unidirectional reasoning from temporal order to event, neglecting the inverse reasoning order from pairwise events, leading to the insufficient understanding of event temporal order. These hinder the model from orchestrating diverse narrative orders while crafting stories.

To tackle the above problems, in this paper, the temporal mixture-of-experts is devised to route events with various narrative orders to corresponding experts so that each type of expert is able to focus on learning certain narrative order orchestration (chronological or flashback) and facilitate the model grasping distinct orders of narrative generation. This hinders overfitting to the dominant narrative order (e.g., chronological) in the corpus by previous methods using a single feed-forward neural network as a whole, thereby mitigating the imbalanced narrative order distribution bias. Then, to improve the understanding of event temporal order, an event sequence narrative-order-aware model is pretrained with the bidirectional reasoning between event and order to encode event temporal orders and event correlations.

3. Method

In this section, we elaborate on our proposed narrative-order-aware framework, TimeWeaver, for story generation. We first describe the overview and the storyline designed in TimeWeaver, as illustrated at the top of Figure 4. Then, we introduce the three key components in detail: (1) a temporal mixture-of-experts to mitigate the imbalanced narrative order distribution bias by facilitating each type of experts grasping distinct orders of narrative generation, as illustrated at the bottom of Figure 4; (2) an event–order bidirectional pretrained event sequence model to encode the event temporal orders and event correlations, as illustrated in Figure 5; (3) a reinforcement learning algorithm with the innovative multi-granular OT reward to improve the quality of generated events at the fine-tuning stage, as illustrated in Figure 6.

3.1. Overview and Storyline Design

The bottom of Figure 4 shows the overview of our proposed TimeWeaver framework. Following the line of previous works [1,3], we employ the planning-based method to implement TimeWeaver, namely planning storylines prior to writing stories. Specifically, given the input template

x^{t}

, we employ a pretrained event sequence narrative-order-aware storyline model integrated with temporal mixture-of-experts to orchestrate diverse narrative orders, generating the structured storyline s, and subsequently require the story model to unfold the storyline c into the final story y. The apparent advantage of the planning-based method is that it can control and improve the coherence of the generated stories through designing the template of storyline [1,2,3]. Drawing inspiration from this, the storyline consisted of the interweaving of event e and temporal prompt t is designed to achieve orchestrating diverse narrative orders in the generated story. The specific details are as follows.

Formally, let

{\hat{y}}_{i}

=

{s_{i, 1}, s_{i, 2}, \dots, s_{i, k - 1}, s_{i, k}}

denote the i-th golden story with k sentences; we first leverage semantic role labeling (SRL) [29] tools to extract the corresponding golden event

\hat{e} = (\hat{v}, {\hat{a}}^{0}, {\hat{a}}^{1})

for each sentence in

{\hat{y}}_{i}

, where

{\hat{a}}^{0}

,

\hat{v}

,

{\hat{a}}^{1}

denote the golden trigger and its relevant arguments. It should be noted that there may be several results of SRL for a sentence; we only choose the one whose event trigger is the ROOT node in the syntactic analysis as the final event for simplicity. Then, we adopt ECONET [7] to identify the pairwise event order between sentences in

{\hat{y}}_{i}

to obtain the temporal prompts t, where

t \in

{<before>, <after>, <vague>}, and <vague> denote that the event temporal order between pairwise events is not completely certain, which could be an arbitrary order. With

\hat{e}

and

\hat{t}

, the i-th golden storyline with k events can be expressed as

{\hat{c}}_{i}

=

{{\hat{e}}_{i, 1}, {\hat{t}}_{i, 1}, {\hat{e}}_{i, 2}, {\hat{t}}_{i, 2}, \dots, {\hat{e}}_{i, k}, < eoe >}

, where <eoe> refers to event ending. In the fine-tuning stage, all events in

{\hat{c}}_{i}

are masked except for the starting event

{\hat{e}}_{i, 0}

to acquire the input template

x_{i}^{t}

. It should be noted that we do not mask temporal prompts

\hat{t}

because we argue that they facilitate understanding event temporal order and further explicitly modeling narrative order, thereby orchestrating diverse narrative orders. Concretely, the temporal prompts of <before> and <after> denote “before” and “after” temporal orders, which are equivalent to chronological and flashback narrative order. The specific example of the input and storyline with five events is shown at the bottom of Figure 4. Formally, let

α

and

β

represent the parameters of the storyline and story models, respectively, which are served by BART [30]; then, the two models can be fine-tuned as below:

\begin{matrix} L_{α} = - l o g p ({\hat{c}}_{i} | x_{i}^{t}) = - \sum_{k = 1}^{| {\hat{c}}_{i} |} l o g p ({\hat{c}}_{i, k} | x_{i}^{t}, {\hat{c}}_{i, < k}, α), \end{matrix}

(2)

\begin{matrix} L_{β} = - l o g p ({\hat{y}}_{i} | c_{i}) = - \sum_{k = 1}^{| {\hat{y}}_{i} |} l o g p ({\hat{y}}_{i, k} | c_{i}, {\hat{y}}_{i, < k}, β) . \end{matrix}

(3)

Note that in the fine-tuning stage, we adopt the generated storyline

c_{i}

rather than the golden storyline

{\hat{c}}_{i}

to reduce the discrepancy between inference and training.

3.2. Temporal Mixture-of-Experts

With the designed structured storyline incorporated with the temporal prompts, the storyline model is expected to comprehend event temporal order through the explicit hints of temporal prompts and orchestrate the corresponding narrative order in the generated storyline (e.g.,

{\hat{e}}_{i, 1}

<after>

{\hat{e}}_{i, 2}

hints that event

{\hat{e}}_{i, 1}

described earlier happens “after” event

{\hat{e}}_{i, 2}

described later, which is equivalent to the flashback narrative order). However, the imbalanced narrative order distribution bias shown in Figure 2 inclines the model to emphasize the learning of “before” event temporal order, while underestimating the learning of “after” order. This further leads to inefficient <after> prompt and low-quality events. Borrowing the idea from mixture-of-experts (MoE) in multi-task learning [31,32,33], we regard the orchestration of diverse narrative orders as two subtasks: chronological order and flashback order narrative generation. Then, the Temporal Mixture-of-Experts (TMoE) is accordingly devised to route events with various narrative orders to corresponding experts. The details of TMoE are depicted at the top of Figure 4.

Following the line of typical MoE [34,35], we replace the feed-forward neural (FFN) layers in the storyline model (i.e., BART [30]) with TMoE layers in place. Herein, we show a certain layer of TMoE for clarity. Let X =

{h_{e 1}, h_{t_{1}}, \dots, h_{t_{k - 1}}, h_{e_{k}}, h_{e o e}}

\in R^{l \times d}

denote the hidden states of inputs tokens for TMoE, where

e_{k}

,

t_{k}

,

e o e

denote k-th event, k-th temporal prompt and event ending, l and d denote the number of tokens and dimension, respectively. To avoid low expert activation mentioned in [35], we first project and split each token x in X to multiple feature subspaces through multi-head layer projection

W_{h e a d}

as (4), achieving subtokens

X_{m h}

=

{h_{e 1}^{0}, h_{e 1}^{1}, \dots, h_{e 1}^{m - 1}, h_{t_{1}}^{0}, h_{t_{1}}^{1}, h_{t_{1}}^{m - 1}, \dots}

\in R^{m l \times \frac{d}{m}}

, where m denotes the number of heads.

X_{m h} = X \cdot W_{h e a d}^{T} .

(4)

Then, each subtoken

x_{m h}

in

X_{m h}

will be assigned to the corresponding areas according to the temporal prompt. We set two types of areas (i.e., chronological, flashback). Concretely, taking a part of the specific storyline

{\hat{c}}_{i}

= {…, <after>,

{\hat{e}}_{i, 2}

,<before>,

{\hat{e}}_{i, 3}

,…} as an example, all subtokens belong to

{\hat{e}}_{i, 2}

and

{\hat{e}}_{i, 3}

will be assigned to the flashback and chronological areas, respectively. Each area contains the corresponding temporal experts, who are responsible for grasping distinct orders of narrative generation. For example, the area of chronological is composed of “before” experts, while flashback includes “after” experts. Given that the narrative order maybe not completely certain, which could be an arbitrary order, we add shared “vague” experts to handle it. The shared “vague” experts primarily focus on learning general event knowledge, since it does not require grasping the specific order of narrative generation. Hence, they also participate in the routing of “before” and “after” experts to ensure the sharing of knowledge. Later, all subtokens

X_{m h}

are routed to the corresponding experts in their respective areas through the gating function

g (\cdot)

. The gating value of routing a certain subtoken

x_{m h}

into q-th expert can be computed as (5)

g (x_{m h}^{q}) = \frac{e x p (x_{m h} \cdot e_{q})}{\sum_{j = 0}^{M} e x p (x_{m h} \cdot e_{j})},

(5)

where

e_{q}

denote the learnable embedding of q-th expert. We subsequently adopt a top-k routing strategy, where only the experts

σ = t o p_{k} (g (x_{m h}))

with the largest top-k gating value can be activated.

σ

denotes the set of activated experts. Last, the sub-token is handled by the activated experts, followed by merging and projecting to the entire token through

W_{m e r g e} \in R^{d \times d}

as below (6) and (7):

\begin{matrix} x_{m h} = x_{m h} + \sum_{q \in σ} g (x_{m h}^{q} \cdot f_{q}^{F F N} (x_{m h})), \end{matrix}

(6)

\begin{matrix} x = ([x_{m h}^{0}; x_{m h}^{1},; \dots; x_{m h}^{m - 1}]) \cdot W_{m e r g e}^{T}, \end{matrix}

(7)

where

f_{q}^{F F N}

denotes the trainable parameter for q-th experts realized by feed-forward neural layer, and [ ; ] denotes the concatenation operation. To facilitate the expert load balancing, the load balance loss [36] is introduced as below:

\begin{matrix} f_{i} & = \frac{1}{| B |} \sum_{x \in B} 1 {a r g m a x g (x) = i}, \end{matrix}

(8)

\begin{matrix} P_{i} & = \frac{1}{| B |} \sum_{x \in B} g_{i} (x), \end{matrix}

(9)

\begin{matrix} L_{b} & = α \cdot N \cdot \sum_{i = 1}^{N} f_{i} \cdot P_{i}, \end{matrix}

(10)

where

| B |, N

denote the number of subtokens in a batch and experts,

f_{i}

and

P_{i}

denote the number of subtokens allocated to i-th expert and the sum of the gating value for i-th expert. Compared to the general method which adopts a single FFN to model all types of narrative orders, TMoE hinders overfitting to the dominant chronological order through routing events with various narrative orders to corresponding events. It facilitates each type of expert grasping distinct orders of narrative generation, thereby mitigating the imbalanced narrative order distribution bias.

3.3. Event–Order Bidirectional Pretraining

With the temporal mixture-of-experts, we concentrate on achieving an effective storyline model to orchestrate diverse narrative order in a high-quality storyline, as the storyline substantially influences the final generated story. Considering the more challenging setting that requires taking both event logic and temporal order into account, and the inadequate understanding of event temporal order caused by unidirectional reasoning from temporal order to event, an event sequence narrative-order-aware storyline model is presented. It is pretrained with the bidirectional pretraining between event and order, which is realized by the joint objectives of temporal event ordering, pairwise order prediction, and event blank infilling to encode event temporal orders and event correlations. Herein, we take a sample to detail the event–order bidirectional pretraining, which can be divided into three parts, as shown in Figure 5.

Event blank infilling requires the storyline model to infer the blank event from the given pairwise event temporal order hinted at by the temporal prompt. Concretely, given the golden storyline

{\hat{c}}_{i}

, we mask each event with a probability of

0.25

to obtain

c_{i}^{e b i}

, and all temporal prompts

\hat{t}

are remained observable. Note that if no events are chosen for masking, we randomly select one to mask. In this way, the storyline model is forced to generate events that are not only reasonable in event correlations but also conform to a specific event temporal order, which facilitates forward reasoning from temporal order to event.

Pairwise order prediction demands the storyline model to reason the temporal order from the given pairwise events within the generated paradigm. Concretely, given the golden storyline

{\hat{c}}_{i}

, all temporal prompts

\hat{t}

are masked, along with <eoe> to facilitate understanding the ending of story. To avoid the deviation from the storyline generation, an event is additionally masked to obtain

c_{i}^{p o p}

. Note that the corresponding temporal prompt will be restored if an event is chosen to mask. Since we only mask one event, the storyline model reasons event temporal order in most cases, which facilitates inverse reasoning from event to temporal order.

Temporal event ordering requires the storyline model to reason the original orders of shuffled temporal events. Concretely, given the golden storyline

{\hat{c}}_{i}

, we fix the starting event and ending of the story, then scramble the pair composed of adjacent event and temporal prompt (e.g., (

{\hat{t}}_{4}, {\hat{e}}_{5}

), (

{\hat{t}}_{2}, {\hat{e}}_{3}

)) to obtain

c_{i}^{t e o}

. In this way, when the storyline model attempts to reorder

c_{i}^{t e o}

, it is obliged to carefully determine whether the event temporal order of pairwise events (

{\hat{e}}_{1}

,

{\hat{e}}_{5}

) conform to the temporal prompt

{\hat{t}}_{4}

, and if the latter event

{\hat{e}}_{2}

can be inferred from the former event

{\hat{e}}_{3}

given the specific event temporal order

{\hat{t}}_{1}

. This facilitates the bidirectional reasoning between event and temporal order.

To realize the above three event–order pretraining strategies and jointly encode the event temporal orders and event correlations, for each golden storyline, we first obtain

c_{i}^{e b i}

,

c_{i}^{p o p}

, and

c_{i}^{t e o}

through masking and shuffling operations. Then, we execute different objectives to recover the golden storyline

c_{i}^{e b i}

in parallel and their losses are combined in a varying ratio to jointly guide the optimization of the storyline model, which can be formulated as below (11):

\begin{matrix} L p = - \sum_{k = 1}^{| {\hat{c}}_{i} |} & γ_{1} l o g p ({\hat{c}}_{i, k} | c_{i}^{e b i}, {\hat{c}}_{i, < k}) + γ_{2} l o g p ({\hat{c}}_{i, k} | c_{i}^{p o p}, {\hat{c}}_{i, < k}) + γ_{2} l o g p ({\hat{c}}_{i, k} | c_{i}^{t e o}, {\hat{c}}_{i, < k}), \end{matrix}

(11)

where

γ_{1}

and

γ_{2}

are determined by (12).

s t e p

and

t o t a l s t e p

denote the current and total update steps of the pretraining process. With the setting of Equation (12), the influence of pairwise order prediction and temporal event ordering will gradually decrease while the influence of event blank infilling will gradually increase in the pretraining stage. This is because we expect the storyline model to focus more on event temporal order in the early stage, then leveraging the understanding of event order to facilitate generating higher-quality events in event blank infilling.

η_{1} = \frac{s t e p}{t o t a l s t e p}, η_{2} = 1 - \frac{s t e p}{t o t a l s t e p} .

(12)

3.4. Reinforcement Learning Algorithm with Multi-Granular Optimal Transport Reward

With the event–order bidirectional pretrained event sequence storyline model, we focus on adopting the planning-based method to implement TimeWeaver, generating stories with diverse narrative orders. However, there exists a plight that the storyline model cannot adjust with the story model because the token selection in generating storylines is non-differentiable, preventing the gradient backpropagation from the story to the storyline model. This causes the storyline model not knowing how the generated storyline affects the final story and being unable to make corresponding adjustments, limiting the generated event quality in the storyline. To overcome this barrier, a reinforcement learning algorithm with an innovative multi-granular OT reward is designed to realize the joint optimization of the two models in the fine-tuning stage, improving the quality of generated events.

We first briefly describe how to formulate the reinforcement learning in this work. Particularly, the policy gradient algorithm [37] is employed to optimize the storyline model through maximizing the expected reward

E_{α} [R_{i}]

in (13). The corresponding gradient of the storyline model can be computed by the policy gradient theorem and approximated with the sampling technique, which can be expressed as (13).

\begin{matrix} E_{α} [R_{i}] = E [R_{i} \cdot l o g (p ({\hat{c}}_{i} | x_{i}^{t}, α)], \end{matrix}

(13)

\begin{matrix} \nabla J (α) = E [R_{i} \cdot \nabla l o g p ({\hat{c}}_{i} | x_{i}^{t}, α)] . \end{matrix}

(14)

Hence, the main challenge is to construct a valid reward $R_{i}$ based on the feedback of the story model so that the storyline model can adjust itself according to the reward as in Equation (14). Accordingly, the innovative multi-granular OT reward is designed to conquer this challenge. The specific details are illustrated in Figure 6, which can be roughly divided into two parts: event-granular and storyline-granular reward construction.

Given a batch

{(x_{1}^{t}, {\hat{c}}_{1}, {\hat{y}}_{1}), . ., (x_{N}^{t}, {\hat{c}}_{N}, {\hat{y}}_{N})}

with N samples in the fine-tuning stage, we perform the general forward propagation to compute the token loss through the negative log-likelihood estimation in Equations (2) and (3). For each sample with five events and sentences, the representations of events

H_{e}

=

{h_{e}^{1}, \dots, h_{e}^{5}}

and sentences

H_{s}

=

{h_{s}^{1}, \dots, h_{s}^{5}}

are extracted by averaging the hidden states of the included tokens. The hidden states are taken from the last layer of the model. The sentence loss for each story

L_{s}

=

{l_{s}^{1}, \dots, l_{s}^{5}}

is also computed by averaging the loss of the included tokens. Then, the event-granular reward

R^{e g}

is constructed to guide the storyline model optimization based on the event quality.

Specifically, for a specific event in the storyline (e.g.,

e_{i}

), an intuitive approach is to regard the negative sentence loss

- l_{s}^{i}

as the reward of the corresponding event

e_{i}

. Because if the event

e_{i}

in the storyline is low-quality, it is difficult to reconstruct the original sentence

s_{i}

by the story model, resulting in smaller event reward and higher sentence loss

l_{s}^{i}

. Nevertheless, the mappings between each event and sentence modeled in this way are one-to-one because the reward of event

e_{i}

is only determined by

s_{i}

. However, one event may contribute several sentences, so the mappings should be one-to-many. To address this issue, we regard the process of unfolding the storyline into story as the semantic distribution moving from event to sentence, and then employ the OT (i.e., IPOT [24]) to capture the OT mappings between them. Particularly, the cost function

f (\cdot)

is first defined to compute the cost of transferring the semantics of event

e_{i}

to sentence

s_{j}

according to the similarity between

h_{e}^{i}

and

h_{s}^{j}

as (15).

f (e_{i}, s_{j}) = 1 - e x p (- | | h_{e}^{i} - h_{s}^{j} {| |}^{2}) .

(15)

It means the closer the semantics of

h_{e}^{i}

and

h_{s}^{j}

, the easier the transportation between them. Then, we leverage the IPOT algorithm to obtain the OT mappings

T^{*} \in R^{5 \times 5}

and OT loss

L_{o t}

based on

f (\cdot), H_{e}, H_{s}

as (16). Intuitively, we can regard

T_{i j}

as the semantic transportation or contribution from event

e_{i}

to sentence

s_{j}

. With

T^{*}

,

L_{s}

, the event-granular OT reward

R^{e g}

= {

r_{1}^{e g}, \dots, r_{5}^{e g}

} can be constructed by normalizing the weighted summing of different negative sentence losses

l_{s}

, and the weighted coefficients are determined by the value of semantic transportation as

\begin{matrix} T^{*}, L_{o t} & = I P O T (H_{e}, H_{s}, f (\cdot)), \end{matrix}

(16)

\begin{matrix} R^{e g} & = - T^{*} L_{s}, \end{matrix}

(17)

\begin{matrix} r_{i}^{e g} & = \frac{r_{i}^{e g}}{\sum_{j}^{5} r_{j}^{e g}}, \end{matrix}

(18)

where

r_{i}^{e g} \in R^{e g}

denotes the reward for the event

e_{i}

. Note that

r_{i}^{e g}

is decided not only by the loss of sentence

s_{i}

, but also by other sentence losses. The reward intensity of sentence

s_{j}

to event

e_{i}

relies on the semantic contribution

T_{i j}

from

e_{i}

to

s_{j}

captured by OT, akin to the idea of backpropagation. Consequently, the mappings between each event and sentences are one-to-many and

r_{i}^{e g}

comprehensively considers the feedback of all sentences in the story. Furthermore, the storyline model can compare the rewards of different events in a storyline and perceive the generated event quality, thereby adjusting itself accordingly and improving the quality of generated events.

In addition to guiding optimization from the event granularity, we further design storyline-granular reward

R^{s g}

to facilitate the storyline model adjusting itself based on the quality of the storyline. Particularly, for a specific storyline (e.g.,

c_{i}

) in a batch with N samples, the i-th story loss

L_{s}^{i}

is computed by averaging the loss of tokens included in the story. Later, we regard the negative normalized story loss as the storyline-granular reward

r_{i}^{s g}

,

R^{e g} = {r_{1}^{s g}, \dots, r_{N}^{s g}}

:

r_{i}^{s g} = - \frac{L_{s}^{i}}{\sum_{j}^{N} L_{s}^{j}},

(19)

where

r_{i}^{s g} \in R^{s g}

denotes the reward for storyline

c_{i}

. In this way, if the storyline

c_{i}

is low-quality, it will be difficult for the story model to reconstruct the original story

{\hat{y}}_{i}

, leading to higher story loss

L_{s}^{i}

and smaller storyline reward. Moreover, the storyline model can perceive the generated storyline quality through comparing the rewards of different storylines in a batch, thereby adjusting itself accordingly. With

R^{e g}

and

R^{s g}

, the final multi-granular OT reward for i-th event in j sample in a batch can be obtained from multiplying each event-granular OT reward

r_{i}^{e g}

by the corresponding storyline-granular reward

r_{j}^{s g}

as below:

r_{i j} = r_{i}^{e g} \times r_{j}^{s g}, i \in [1, 5], j \in [1, N],

(20)

where i and j denote the number of event in the storyline and sample in a batch. In this way, $r_{i j}$ effectively measures the generated event quality comparing different events in a storyline and different storylines in a batch, realizing the joint optimization of the storyline and story model under the reinforcement learning algorithm in Equations (10) and (11), and improving the generated event quality.

To elaborate on our approach more clearly, we visualize the comparison between the multi-granular OT reward and previous methods in Figure 7. Specifically, the initial method is essentially a storyline-granular reward. It compares storylines in the batch so that different storylines receive different rewards, but all events in the same storyline receive the same reward. However, it cannot measure the quality of events in the same storyline, because there is no difference between their rewards. The advanced method is inherently an event-granular reward. The events in the storyline receive different rewards. The difference between the naive and OT versions is the way they model the mappings between the events in the storyline and sentences in the story. The former is one-to-one, while the latter is one-to-many and more accurate. However, both the naive and OT versions only compare different events in the storyline, without comparing different storylines in the batch. On the contrary, our method solves the issues of the previous methods in a multi-granular way. It compares different storylines in the batch to obtain storyline-granular reward, and also compares the different events in the storyline to obtain event-granular reward.

Overall, to more explicitly illustrate the whole algorithm flow, the detailed pseudo-code for training and optimization is shown in Algorithm 1. Furthermore, we organize the code and open source to clarify the specific implementation https://github.com/lzc-nazarite/Timeweaver (accessed on 19 September 2025).

Algorithm 1: Pseudo-code For training and optimization

4. Experiments

4.1. Datasets

To demonstrate the validity of our method in orchestrating diverse narrative orders in generating both short and long stories, we choose the ROCStories [6] and WritingPrompts [13] datasets. For the event–order bidirectional pretraining data, we adopt the event sequences open-sourced by [8,38] and further process them. The specific details of each dataset are shown as follows:

ROCStroies [6] is a large-scale handwriting story generation dataset, consisting of 98,162 five-sentence short stories. The average length is about 42 words. We adhere to the official split [5], assigning 88,344 for training, 4908 for evaluation, and 4909 for testing. For each sample, the first sentence is utilized to construct the input template

x^{t}

and the rest as reference generation.

WritingPrompts [13] is also a handwriting story generation datasets, containing 30,335 pairs of prompts and stories. The average length of the story is over 700 words. Following the previous works [5], we select stories that are no longer that 500 words, resulting in a total number of 96,488 training and 5784 validation prompt–story pairs. For the test set, the 1000 prompt–story pairs provided by the compared baseline [39] are adopted. For each sample, the prompt is considered as the input template

x^{t}

and the rest as reference generation. The temporal prompts between pairwise events on both datasets are provided by [5].

Event Sequences Corpus [38] originates from the EventNarratives corpus, which contains more than 200,000 narrative-structured documents identified from different sources such as news articles, novel books, and blogs. We directly adopt the filtered version provided by [8] as the pretraining corpus. It contains 10K event sequences, with nearly 70% consisting of three or more events. To obtain the temporal prompts between pairwise events, we further leverage the ECONET [7] with three sets of pretrained weights to infer the event temporal order. We only adopt the results when the event orders judged by the three models are the same. Otherwise, we label the event order as “vague”. It denotes that the temporal order is not completely certain, which could be an arbitrary order.

4.2. Baseline Models

We compare with the following strong baselines.

BART [30]: It receives the input template and directly generates the story without the intermediate storyline generation and temporal prompt.
MEGATRON [4]: It leverages the external knowledge base to fine-tune GPT2 [40,41], facilitating generating more diverse stories on the ROCStories dataset. It also has no intermediate storyline generation and temporal prompt.
BART-PLANNING-A [30]: It adopts the planning-based method, which is similar to the overview shown at the bottom of Figure 4. However, it regards the event arguments as the mask unit in the structured storyline design rather than the whole event and does not include temporal prompts.
BART-PLANNING-E [30]: It regards the event as a mask unit in the structured storyline design. The rest remains unchanged with the baseline BART-PLANNING-A.
FLASHBACK-VANILLA [5]: Based on the baseline BART-PLANNING-A, it introduces the temporal prompt in the structured storyline design.
BPOT-VANILLA [42]: Based on the baseline BART-PLANNING-E, it introduces the temporal prompt in the structured storyline design.
TEMPORALBART [8]: It adopts the temporal event pretraining to mine the event temporal knowledge. We utilize the pretrained weights to initialize the storyline model of baseline BPOT-VANILLA.
CONTENT-PLANNING [39]: Based on the baseline BART-PLANNING-A, it additionally trains some classifiers to refine the storyline on the WritingPrompts.
FLASHBACK [5]: Based on the baseline BPOT-VANILLA, it adopts the autoregressive mask strategy [40] to pretrain the storyline model with the large-scale Book corpus [43], which includes 74 million sentences of 10k books. It also adopts the reinforcement learning algorithm to realize the joint optimization of the storyline and story model.
ONE-SHOT-LLMs [44]: LLMs have shown powerful capability in natural language generation. We select the mainstream open-source LLMs (i.e., Qwen2.5-7b [44], Llama3-8b [45]) to generate the story under the one-shot settings. The specific prompt design for LLMs is shown in Table 1. The generated stories are utilized in manual evaluation and case studies.

4.3. Evaluation Metrics

Consistent with the prior studies [5,42], we employ the following automatic metrics to evaluate the stories from a general perspective. Perplexity (PPL) indicates the model’s perplexity of the generated stories. BLEU-4 (B-3) [46] assesses the fluency and accuracy of generated stories by measuring 3-gram overlaps with the reference stories. ROUGE-L (R-L) [47] focuses on recall, which measures the longest common subsequence between generated and reference text. Repeat-n (R-n) [48] quantifies redundancy as the proportion of stories that contain at least one repeated n-gram. Distinct (Dis) measures diversity as the fraction of unique 4-gram types among all 4-grams. Token Length (Tks) indicates the average length of the generated stories.

Considering that the automatic evaluation metrics have difficulty evaluating open-ended text generation oriented on orchestrating diverse narrative orders, we design narrative order-aware metrics Narrative Order Diversity (NOD) and Narrative Order Accuracy (NOA) in manual evaluation, which focus on assessing whether the model can generate events in the story that conform to the given narrative order. NOD is presented to measure the diversity of narrative order by computing the entropy of narrative order distribution in the generated stories. NOA is presented to measure the accuracy of narrative order by calculating the ratio of the event temporal orders in the generated stories that are consistent with the given narrative orders hinted by the temporal prompts. In addition to NOD and NOA centered on narrative order, we also design the metrics Coherence and Overall in manual evaluation to assess the generated stories from a comprehensive perspective. Coherence is proposed to measure the inter-sentence logic and whether the generated stories deviate from the story input. Overall is proposed to evaluate the overall quality of the generated stories, which is presented as a ranking. Note that in manual evaluation, Coherence is strictly determined by the inter-sentence logic, while Overall is dependent on the preference of the annotator, such as the interest of the story.

Implementation Details

To ensure a fair comparison with the previous works [5,8], we adopt the BART-base [30] as the storyline and story model unless otherwise mentioned. We then add the tokens of temporal prompts (<before>, <after>, <vague>) into the vocabulary of BART, making it understand event temporal order to orchestrate narrative order. The input template is constructed by the first sentence and prompt, as shown in Section 3.1. For the Temporal Mixture-of-Experts (TMoE) to replace the feed-forward neural (FFN) layers in the transformer [49] block every three layers, the number of head and the ratio of the load balance loss are typically set to 4 and

1 \times 10^{- 2}

, which is the same as the previous works [35,36]. The number of <before> experts in chronological area, <after> experts in flashback area, and shared <vague> experts are set to 8, 6, 2, which are roughly consistent with the ratio of narrative order distribution, as shown in Figure 2. The specific details are as follows: (i) the rough ratio of the ROCStories corpus is 7(before):1(vague):2(after). However, we adopt the top-2 sampling technique in the router, so the number of experts in each type should be at least 2. Hence, we multiply the ratio to get 14(before):2(vague):4(after). (ii) It should be noted that the vague experts are shared, which means that the true number of experts in <before> and <after> should be equal to the required number of experts (14(before), 4(after)) minus the number of shared experts (2(vague)). Then, we update the ratio to 12(before):2(vague):2(after). (iii) Since shared experts (vague) can participate in both chronological and flashback narrative order generation, they can access the entire training corpus and learn more basic event knowledge. Therefore, we intuitively expand the number of shared experts to ensure sufficient capacity to store the knowledge. Last, we update the ratio to 8(before):6(vague):2(after).

During the pretraining stage, the storyline model is optimized by Equations (10) and (11). The hyperparameters are learning rate

1 \times 10^{- 5}

; batch size 32. We pretrain for 10 epochs and choose the model with the best evaluation perplexity. The training time for per epoch during the pretraining stage is 2.5 h. During the fine-tuning stage, the storyline model is optimized by Equations (10), (13) and (16). The hyperparameters are learning rate

1.2 \times 10^{- 4}

; batch size 32. The story model is optimized by Equation (3). During the inference stage, greedy decoding with a 4-beam search size is adopted to decode the token step by step. Both training and inference are implemented on 4 A100-40G GPUs. The training time on ROCStories and WritingPrompts per epoch is 1.5 and 12 h, respectively.

4.4. Experimental Results

(1) We adopt the automatic metrics to compare our method with the baselines on the ROCStories and WritingPrompts datasets. The results are reported in Table 2. We can observe that (i) TimeWeaver outperforms other baselines in almost all metrics, indicating its powerful ability to craft more fluent and diverse stories that are more in line with the reference stories. (ii) In terms of the Distinct (Dis) metric on the ROCStories dataset, our method performs worse than the baseline

{MEGATRON}^{*}

. It could be that

{MEGATRON}^{*}

serves GPT2 [27] as the backbone and adopts a top-k sampling scheme with k = 40 and a softmax temperature of 0.7 to decode the token during the inference stage.

{MEGATRON}^{*}

also leverages the external knowledge to insert varied contexts. Both significantly improve the diversity of the generation. Nevertheless, our method still surpasses

{MEGATRON}^{*}

on other metrics by a large margin. (iii) For the Repeat-n (R-n) metric on WritingPrompts, TimeWeaver lags slightly behind the baseline BART, probably due to the much longer generated stories. When compared to the baselines with a similar generated length (e.g., BPOT-VANILLA), our method exhibits superior performance. (iv) For the baseline CONTENT-PLANNING, which adopts BART-LARGE as the backbone, they show better performance on WritingPrompts due to the more filtered training data. Concretely, they filter the sample without prompt, and their training data is about

\frac{3}{2}

of ours. Moreover, they design classifiers to refine the storyline.

(v) On both ROCStories and WritingPrompts datasets, the baselines realized by the planning-based method outperform the one-stage generation method (i.e., BART), which is consistent with the insight that the planning-based method can improve the generated story quality. The planning-based baselines whose storyline contains the temporal prompts (e.g., BPOT-VANILLA) perform better than those without (e.g., BART-PLANNING-A/E). It is probably because the temporal prompt hints at the pairwise event temporal order. Moreover, the planning-based baselines whose storyline serves the whole event as the mask unit (e.g., BPOT-VANILLA) outperform the storyline with the event argument as the mask unit (e.g., FLASHBACK-VANILLA), especially for the Token Length (Tks) metric on WritingPrompts. It could be serving the whole event because the mask unit makes it easier to understand event correlations, thereby generating longer stories. All of them demonstrate the effectiveness of our storyline design. (vi) Regarding the strongest baseline FLASHBACK, it pretrains on a larger corpus (74 million sentences of 10k books) and shares a similar idea of adopting reinforcement learning to realize the joint optimization. Meanwhile, our method, pretained on 10k event sequences, surpasses them.

(2) In addition to the automatic evaluation, we conduct the manual evaluation on ROCSories to evaluate the generated stories from the perspective of narrative order and comprehensive quality. To compare with the mainstream open-source LLMs, we adopt the designed prompt shown in Table 1 to orchestrate diverse narrative orders in story generation. Concretely, we define the role of LLMs, detail the content of the task, followed by a specific demonstration. Note that we require the LLMs to generate stories no more than 48 words, which is slightly more than the average length of the ROCStories (42 words) dataset. This is because if the length is not constrained, the stories generated by LLMs are much longer than other baselines, which is adverse for a fair comparison.

Then, drawing on the human evaluation setting from previous research, we randomly sample 100 test samples from the ROCStories dataset. Note that we do not choose the WritingPrompts dataset as its stories are too long and comprise dialogues, modal particles, and short phrases without events at all. This greatly increases the difficulty of human annotations. For each test sample, we extract its first sentence as the input and get seven versions which are generated by various models, including LLMs (e.g., Qwen2.5-7b, Llama3-8b) and specialized systems (e.g., TEMPORALBART). It should be noted that the stories generated by different models only contain five sentences and have no significant length gap for a fair comparison.

We later select five postgraduates majoring in natural language processing from the University of Chinese Academy of Sciences to perform human annotations, as they have enough basic knowledge of event commonsense (i.e., event logic and temporal order) and natural language processing. Five electronic questionnaires are constructed based on the test samples, where each contains the same 100 evaluation samples. It is worth noting that the order of the evaluation sample in each questionnaire is randomly shuffled, and the options in each evaluation sample are also randomly shuffled. For each evaluation sample in the questionnaire, we require the annotator to complete the annotations of three tasks, and the results are finally stored in a spreadsheet. It should be noted that the annotators complete the questionnaire independently and are blind in the preparation of the questionnaire, subsequent result storage, and statistics. The specific details of the three tasks are as follows:

(1) Narrative Order Judgment: Given a five-sentence story, the annotator is required to determine the event temporal order of adjacent pairwise events, then choose a correct label from the sets {before, after, vague}. The element in the sets corresponds to the chronological, flashback, and the narrative order is not completely certain, respectively. Thus, for each story with five sentences, four labels are obtained from an annotator. Then, if the annotated results from various annotators are not completely contradictory (e.g., {before and after appear simultaneously), we adopt the majority voting to determine the final result. Otherwise, the result is directly set to vague.

(2) Coherence Score: Given a five-sentence story, the annotator is required to assess its coherence and choose a reasonable value from 1 to 5. The coherence of the story is reflected on the logical correlation of inter-sentence and whether the generated story deviates from the input template. Thus, for each story with five sentences, one score is obtained from an annotator. Then, we average the scores from various annotators to obtain the final coherence score.

(3) Overall Rank: Given a set of five-sentence stories generated by various baselines, the annotator is required to rank stories based on their overall quality. The overall quality is determined by the preference of the annotator after comprehensively considering the fluency, coherence, and interest of the story. Similar to the coherence score, we average the rankings from various annotators to obtain the final ranking.

After completing the above three tasks, we obtain the Narrative Order Diversity (NOD), Narrative Order Accuracy (NOA), Coherence, and Overall metrics. Note that we do not encourage the model to generate stories with the uncertain narrative order. Hence, we merge the number of <vague> and <after> during computing NOD. When computing NOA, <vague> is regarded as arbitrary narrative orders, which can be both <before> and <after>. The evaluation results and detailed statistics on the annotated results of narrative order are presented in Table 3 and Figure 8, respectively.

It can be discerned that (i) chronological (“before”) is the dominant narrative order of generated stories, which is consistent with the findings in [50] and human writing habits. (ii) However, TimeWeaver outperforms other baselines in NOD, indicating its powerful ability to orchestrate the creative flashback narrative order in the generated stories (highest “after” ratio 25.25%). (iii) In terms of the NOA metric, our method surpasses other baselines, reflecting its exceptional ability to follow the temporal prompt, weaving the corresponding narrative orders in the stories (highest “after” accuracy 93.33%). (iv) For the Coherence and Overall metrics, although LLMs perform slightly better because of their huge number of parameters and the large-scale training corpus, it is difficult for them to break the dominant chronological narrative order and follow the temporal prompt <after> to orchestrate the flashback narrative order in the stories, which is reflected in the poor performance of NOD (0.549, 0.588, 0.627), lower “after” ratio (13.00%, 14.25%, 15.75%), and lower “after” accuracy (25.39%, 30.95%, 32.25%), respectively. When compared to the baselines with a similar parameter scale (e.g., FLASHBACK), our method exhibits superior performance in all metrics.

In a word, compared to previous works, the superior performance of the above automatic and manual evaluations demonstrates the effectiveness of our method. This can be attributed to three factors: (1) temporal mixture-of-experts enhances the learning of distinct orders of narrative generation; (2) bidirectional pretraining between event and order facilitates encoding event correlations and event temporal order; (3) multi-granular OT reward effectively measures the quality of generated events, improving the joint optimization.

4.5. Ablation Study

To further validate the effectiveness of our proposed framework, TimeWeaver, we introduce the following variants to conduct ablation studies on the ROCStories dataset. (1) Vanilla: This is the naked version of TimeWeaver, which adopts the storyline model to plan the structured storyline with the temporal prompts and then unfold the storyline into the final story through the story model, which is equivalent to the baseline BPOT-VANILLA. (2) +MRRL: Based on (1), it utilizes the reinforcement learning algorithm with the innovative multi-granular OT reward to realize the joint optimization of the storyline and story model, further boosting the generated event quality. (3) +TMoE: Based on (2), it replaces the feed-forward neural (FFN) layers of the storyline model with temporal mixture-of-experts every three layers, facilitating each type of expert grasping distinct orders of narrative generation. (4) +EOBP: Based on (3), it adopts the event–order bidirectional pretraining on the 10K-event-sequence corpus, encoding the event correlations and event temporal order. The results of the ablation studies on automatic and manual evaluations are detailed in Table 4 and Table 5.

Specifically, we can draw the following inferences based on the observation of the results. (i) +MRRL significantly improves the metric of coherence. This is probably because it enables the storyline model to adjust itself with the multi-granular reward constructed by the feedback of the story model, which realizes the joint optimization and improves the generated event quality. (ii) +TMoE greatly boosts the metric of NOD and NOA. It could be that each type of expert grasps distinct orders of narrative generation and hinders overfitting to the dominant chronological order, mitigating the imbalanced narrative order distribution bias. (iii) +EOBP performs bidirectional pretraining between event and order, encoding the event correlations and event temporal order, further improving the performance in all metrics. Overall, these inferences validate the effectiveness of our method.

To explicitly illustrate the effectiveness of the core design in TimeWeaver, we conduct two comparative experiments on the ROCStories dataset.

(1) Effectiveness of event–order bidirectional pretraining. To improve the understanding of event temporal order, we design the bidirectional pretraining between event and order to encode event temporal orders and event correlations. Here, we design three variants to validate its effectiveness. (i) Autoregressive: It adopts the standard next token prediction paradigm in [27] to pretrain the storyline model. (2) Event Blank Infilling: It requires the storyline model to perform forward reasoning from temporal order to event, which reasons the blank event from the given pairwise event temporal order. (3) Pairwise Order Prediction: It requires the storyline model to perform inverse reasoning from event to temporal order, which reasons the temporal order from the given pairwise events. (4) Temporal Event Ordering: It requires the storyline model to perdorm bidirectional reasoning between event and temporal order, which reasons the original orders of shuffled temporal events. The experimental results are shown in Figure 9. We can observe that (i) serving event as mask unit (i.e., event blank infilling) outperforms the standard autoregressive in this task. This is because it is more in line with the pretraining strategy of BART [30]. (ii) The variant of temporal event ordering performs worst, because it deviates from the target of storyline generation. (iii) Event–order bidirectional pretraining achieves the best, proving that bidirectional pretraining is superior to unidirectional pretraining and autoregressive.

(2) Effectiveness of Multi-granular OT reward. To make the storyline model adjust itself with the reasonable feedback from the story model, we design the event- and storyline-granular reward to effectively measure the quality of the generated events. Here, we design two variants to verify its effectiveness. (1) Event-Granular Naive Reward (EGNR): It directly regards the negative sentence loss as the reward of the corresponding event, which models the mappings between each event and sentences in a one-to-one way. (2) Event-Granular OT Reward (EGOTR): It adopts the OT to capture the one-to-many mappings between each event and sentences, followed by constructing the event-granular reward by the OT mappings. (3) Storyline-Granular Reward (SGR): It regards the negative normalized storyline loss as the reward. The experimental results are shown in Table 6. We can discern that (i) EGOTR performs better than EGNR. It is probably because OT reward comprehensively considers the feedback of all sentences, which effectively measures the quality of generated events. (ii) Multi-granular OT reward performs best, demonstrating the effectiveness of our design. We believe it is because multi-granularity facilitates comparison of storylines within a batch and events within a sample, thereby more comprehensively measuring the generated event quality, making the storyline model more accurately adjust itself.

4.6. Case Study

To explicitly demonstrate the difference between the baselines and our method in orchestrating diverse narrative orders during story generation, we conduct the case studies on the ROCStories dataset. Concretely, given the input template composed of the first sentence and temporal prompts, we require different models to generate stories and judge whether they can follow the temporal prompts to weave the corresponding narrative order in the generated stories (i.e., the temporal prompt of <before>, <after>, <vague> hint at the chronological, flashback, arbitrary narrative order, respectively). The results are presented in Table 7.

We can find that LLMs (e.g., Qwen2.5-7b, GPT4o) can almost create stories with reasonable event logic. However, it is difficult for them to follow the <after> and orchestrate the narrative order in place. Regarding the strongest baseline FLASHBACK, it is also fine-tuned on the relevant corpus, and our method exhibits superior performance. Particularly, for the simple cases with only one <after>, FLASHBACK sometimes generates correct flashback narrative order, but it faces the problem of deteriorating event logic. For instance, in the first case, “terry rushed to the hospital” follows “it was a car accident”, but the accident is not related to Terry, so there is no need for him to rush to the hospital. Furthermore, in the third and fourth cases, FLASHBACK wrongly associates “had a burning sensation in her bladder” with “had flushed the toilet” and “didn’t have much money” with “spent a lot of money”. Instead, our method maintains the reasonable event logic and orchestrates the correct flashback order in place. Besides, for the complex cases with multiple <after> as shown in the second case, FLASHBACK fails to orchestrate the flashback narrative order with the second <after> and the conflicting relationship with “that man” has shifted from “my friend” to “me”. In contrast, TimeWeaver is capable of dealing with such complex situations, orchestrating two correct flashback narrative orders (“they had been dating him for two years” occurs in the past and complement the context, “she told me that he cheated on her with another woman” occurs in the past and explains “tom was depressed”). Both these cases validate the effectiveness of our method in generating high-quality events in the story that balance the logic and narrative order when orchestrating diverse narrative orders for crafting compelling stories.

5. Conclusions and Future Works

In this paper, we present a novel narrative-order-aware framework for story generation, which proposes an event–order bidirectional pretrained model integrated with temporal mixture-of-experts to enhance the understanding of event temporal order and mitigate the imbalanced narrative order distribution bias. A reinforcement learning algorithm is also designed with the innovative multi-granular OT reward to further improve the generated event quality. Specifically, the temporal mixture-of-experts is designed to route events with various narrative orders to corresponding experts, facilitating each type of expert grasping distinct orders of narrative generation. This hinders overfitting to the dominant narrative order in the corpus, thereby mitigating the imbalanced narrative order distribution bias. Then, we present an event sequence narrative-order-aware model to adequately understand event temporal order. The model is pretrained with bidirectional reasoning between event and order to encode event temporal orders and event correlations. At the fine-tuning stage, reinforcement learning with multi-granular OT reward is proposed to further boost the quality of generated events.

We conduct extensive experiments on two publicly available benchmark datasets. The newly achieved state-of-the-art performance on automatic and manual evaluations and further case studies demonstrate the superiority of our method. In the future, we will explore how to control the narrative order of long texts (paragraphs) or other narrative modalities (video).

6. Limitations

Despite the impressive results of our work in orchestrating diverse narrative orders (chronological, flashback) for crafting compelling stories, we have to admit that our work has the following limitations: (1) To make a fair comparison with the previous baselines, we adopt the semantic role labeling (SRL) tools to extract corresponding golden event arguments, and serve the BART-base as our main backbone. However, this prevents the exploration of performance limits as SRL may struggle to deal with complex sentences, and the capacity of BART-base is insufficient. In the future, we will adopt more powerful event extraction methods and larger models. (2) To address the insufficient capacity of BART-base and improve the generated quality and controllability, we adopt the planning-based method. However, it may restrict the creativity of story generation. In the future, we will adopt large models to support end-to-end generation and improve the creativity of generation. (3) The proposed Temporal Mixture-of-Experts and multi-granular reward reinforcement learning indeed increase the computational costs, but they are not the main costs. In our specific implementation, they added nearly 7 min of training time per epoch in ROCStories (from 83 min to 90 min). This is because the operations in Temporal Mixture-of-Experts have already supported parallel computing by GPUs, and the most time-consuming part of the calculation of multi-granular rewards has been greatly optimized by the IPOT algorithm. Moreover, our method would not increase the computational costs in the evaluation stage. Therefore, the computational costs caused by our method itself are not high, but the performance gain is considerable.

(4) We evaluate our method on two representative datasets, ROCStories (average 42 words) and WritingPrompts (average 700 words). The effectiveness of scaling narrative length (42 to 700 words) is underexplored. Moreover, in this work, we orchestrate narrative orders at the sentence-level, and in the future, we can develop into paragraph-level; (5) We mainly carry out our work on text modality. However, in real application, the technique of orchestrating diverse narrative orders has been widely applied in other modalities, such as visual storytelling and creative video generation. Hence, we will develop current methods into more modalities.

Author Contributions

Writing—original draft preparation, Z.L.; writing—review and editing, Z.L., W.J. and C.T.; methodology, Z.L. and W.J.; Conceptualization, Z.L. and W.J.; software, Z.L.; formal analysis, Z.L.; validation, W.J., C.T., L.J. and Y.B.; data curation, Z.L.; supervision, L.J., Y.B. and G.X. All the authors have made meaningful and valuable contributions in revising and proofreading the resulting manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China (No. 62206267).

Data Availability Statement

The data supporting the findings of this research are publicly available. For the fine-tuning training data and testing data during the experiment, this paper uses the publicly available dataset ROCStories https://doi.org/10.18653/v1/N16-1098 (accessed on 1 June 2016) and WritingPrompts https://doi.org/10.18653/v1/P18-1082 (accessed on 1 July 2018) for training and testing during experiments. For the pretraining data, this paper adopts the event sequences corpus provided by https://doi.org/10.18653/v1/P18-1050 (accessed on 1 July 2018).

Acknowledgments

During the preparation of this manuscript, the author(s) used GPT4o for the purposes of serving as the powerful baseline in orchestrating diverse narrative orders during crafting stories. The specific details are shown in Section 4.2.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AIGC	Artificial Intelligence-Generated Content
NLP	Natural Language Processing
OT	Optimal Transport
PLMs	Pretrained Language Models
SRL	Semantic Role Labeling
MoE	Mixture-of-Experts
FFN	Feed-Forward Neural
NOD	Narrative Order Diversity
NOA	Narrative Order Accuracy
EGNR	Event-Granular Naive Reward
EGOTR	Event-Granular Optimal Transport Reward
SGR	Storyline-Granular Reward

References

Martin, L.J.; Ammanabrolu, P.; Wang, X.; Hancock, W.; Singh, S.; Harrison, B.; Riedl, M.O. Event Representations for Automated Story Generation with Deep Neural Nets; Association for the Advancement of Artificial Intelligence (AAAI): Menlo Park, CA, USA, 2018; Volume 32. [Google Scholar]
Xu, J.; Ren, X.; Zhang, Y.; Zeng, Q.; Cai, X.; Sun, X. A Skeleton-Based Model for Promoting Coherence Among Sentences in Narrative Story Generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4306–4315. [Google Scholar] [CrossRef]
Yao, L.; Peng, N.; Weischedel, R.; Knight, K.; Zhao, D.; Yan, R. Plan-and-Write: Towards Better Automatic Storytelling. Proc. AAAI Conf. Artif. Intell. 2019, 33, 7378–7385. [Google Scholar] [CrossRef]
Xu, P.; Patwary, M.; Shoeybi, M.; Puri, R.; Fung, P.; Anandkumar, A.; Catanzaro, B. MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 2831–2845. [Google Scholar] [CrossRef]
Han, R.; Chen, H.; Tian, Y.; Peng, N. Go Back in Time: Generating Flashbacks in Stories with Event Temporal Prompts. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 1450–1470. [Google Scholar] [CrossRef]
Mostafazadeh, N.; Chambers, N.; He, X.; Parikh, D.; Batra, D.; Vanderwende, L.; Kohli, P.; Allen, J. A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 839–849. [Google Scholar] [CrossRef]
Han, R.; Ren, X.; Peng, N. ECONET: Effective Continual Pretraining of Language Models for Event Temporal Reasoning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 7–11 November 2021; pp. 5367–5380. [Google Scholar] [CrossRef]
Lin, S.T.; Chambers, N.; Durrett, G. Conditional Generation of Temporally-ordered Event Sequences. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers); Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 7142–7157. [Google Scholar] [CrossRef]
Zhou, B.; Chen, Y.; Liu, K.; Zhao, J.; Xu, J.; Jiang, X.; Li, Q. Generating Temporally-ordered Event Sequences via Event Optimal Transport. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 1875–1884. [Google Scholar]
Yuan, C.; Xie, Q.; Ananiadou, S. Temporal relation extraction with contrastive prototypical sampling. Knowl.-Based Syst. 2024, 286, 111410. [Google Scholar] [CrossRef]
Han, P.; Zhou, S.; Yu, J.; Xu, Z.; Chen, L.; Shang, S. Personalized Re-ranking for Recommendation with Mask Pretraining. Data Sci. Eng. 2023, 8, 357–367. [Google Scholar] [CrossRef]
Babhulgaonkar, A.R.; Bharad, S.V. Statistical machine translation. In Proceedings of the 2017 1st International Conference on Intelligent Systems and Information Management (ICISIM), Aurangabad, India, 5–6 October 2017; pp. 62–67. [Google Scholar] [CrossRef]
Fan, A.; Lewis, M.; Dauphin, Y. Hierarchical Neural Story Generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Gurevych, I., Miyao, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 889–898. [Google Scholar] [CrossRef]
Fan, A.; Lewis, M.; Dauphin, Y. Strategies for Structuring Story Generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2650–2660. [Google Scholar] [CrossRef]
Chen, G.; Liu, Y.; Luan, H.; Zhang, M.; Liu, Q.; Sun, M. Learning to Generate Explainable Plots for Neural Story Generation. IEEE/ACM Trans. Audio Speech Lang. Proc. 2020, 29, 585–593. [Google Scholar] [CrossRef]
Tan, B.; Yang, Z.; Al-Shedivat, M.; Xing, E.; Hu, Z. Progressive Generation of Long Text with Pretrained Language Models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 4313–4324. [Google Scholar] [CrossRef]
Kong, X.; Huang, J.; Tung, Z.; Guan, J.; Huang, M. Stylized Story Generation with Style-Guided Planning. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 2430–2436. [Google Scholar] [CrossRef]
Xie, Y.; Hu, Y.; Li, Y.; Bi, G.; Xing, L.; Peng, W. Psychology-guided Controllable Story Generation. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 6480–6492. [Google Scholar]
Feng, Y.; Song, M.; Wang, J.; Chen, Z.; Bi, G.; Huang, M.; Jing, L.; Yu, J. SS-GEN: A Social Story Generation Framework with Large Language Models. Proc. AAAI Conf. Artif. Intell. 2025, 39, 1300–1308. [Google Scholar] [CrossRef]
Yuan, A.; Coenen, A.; Reif, E.; Ippolito, D. Wordcraft: Story Writing with Large Language Models. In Proceedings of the 27th International Conference on Intelligent User Interfaces, New York, NY, USA, 22–25 March 2022; pp. 841–852. [Google Scholar] [CrossRef]
Wang, X.; Chen, Z.; Wang, H.; U, L.H.; Li, Z.; Guo, W. Large Language Model Enhanced Knowledge Representation Learning: A Survey. arXiv 2024, arXiv:2407.00936. [Google Scholar] [CrossRef]
Kantorovich, L. On a problem of Monge. J. Math. Sci. 2004, 133, 15–16. [Google Scholar] [CrossRef]
Cuturi, M. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. In Proceedings of the Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, NV, USA, 5–8 December 2013; pp. 2292–2300. [Google Scholar]
Xie, Y.; Wang, X.; Wang, R.; Zha, H. A Fast Proximal Point Method for Computing Exact Wasserstein Distance. In Proceedings of the 35th Uncertainty in Artificial Intelligence Conference, Tel Aviv, Israel, 22–25 July 2019; Volume 115, pp. 433–453. [Google Scholar]
Lee, S.; Lee, D.; Jang, S.; Yu, H. Toward Interpretable Semantic Textual Similarity via Optimal Transport-based Contrastive Sentence Learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 5969–5979. [Google Scholar] [CrossRef]
Guerreiro, N.M.; Colombo, P.; Piantanida, P.; Martins, A. Optimal Transport for Unsupervised Hallucination Detection in Neural Machine Translation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 13766–13784. [Google Scholar] [CrossRef]
Radford, A.; Narasimhan, K. Improving Language Understanding by Generative Pre-Training. OpenAI Blog 2018. Available online: https://api.semanticscholar.org/CorpusID:49313245 (accessed on 19 September 2025).
Zhou, Y.; Geng, X.; Shen, T.; Long, G.; Jiang, D. EventBERT: A Pre-Trained Model for Event Correlation Reasoning. In Proceedings of the ACM Web Conference 2022, New York, NY, USA, 25–29 April 2022; pp. 850–859. [Google Scholar] [CrossRef]
Gardner, M.; Grus, J.; Neumann, M.; Tafjord, O.; Dasigi, P.; Liu, N.F.; Peters, M.; Schmitz, M.; Zettlemoyer, L. AllenNLP: A Deep Semantic Natural Language Processing Platform. In Proceedings of the Workshop for NLP Open Source Software (NLP-OSS), Melbourne, Australia, 20 July 2018; pp. 1–6. [Google Scholar] [CrossRef]
Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar] [CrossRef]
Ye, H.; Xu, D. TaskExpert: Dynamically Assembling Multi-Task Representations with Memorial Mixture-of-Experts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, 1–6 October 2023; pp. 21771–21780. [Google Scholar] [CrossRef]
Chen, T.; Chen, X.; Du, X.; Rashwan, A.; Yang, F.; Chen, H.; Wang, Z.; Li, Y. AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of-Experts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, 1–6 October 2023; pp. 17300–17311. [Google Scholar] [CrossRef]
Ma, J.; Zhao, Z.; Yi, X.; Chen, J.; Hong, L.; Chi, E.H. Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, 19–23 August 2018; pp. 1930–1939. [Google Scholar] [CrossRef]
Chen, C.; Cai, F.; Chen, W.; Zheng, J.; Zhang, X.; Luo, A. BP-MoE: Behavior Pattern-aware Mixture-of-Experts for Temporal Graph Representation Learning. Knowl. Based Syst. 2024, 299, 112056. [Google Scholar] [CrossRef]
Wu, X.; Huang, S.; Wang, W.; Ma, S.; Dong, L.; Wei, F. Multi-Head Mixture-of-Experts. In Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
Lepikhin, D.; Lee, H.; Xu, Y.; Chen, D.; Firat, O.; Huang, Y.; Krikun, M.; Shazeer, N.; Chen, Z. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv 2020, arXiv:2006.16668. [Google Scholar] [CrossRef]
Sutton, R.S.; McAllester, D.A.; Singh, S.; Mansour, Y. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Proceedings of the Advances in Neural Information Processing Systems 12, Denver, CO, USA, 29 November–4 December 1999; pp. 1057–1063. [Google Scholar]
Yao, W.; Huang, R. Temporal Event Knowledge Acquisition via Identifying Narratives. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Gurevych, I., Miyao, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 537–547. [Google Scholar] [CrossRef]
Goldfarb-Tarrant, S.; Chakrabarty, T.; Weischedel, R.; Peng, N. Content Planning for Neural Story Generation with Aristotelian Rescoring. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 4319–4338. [Google Scholar] [CrossRef]
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
Zhou, X.; Sun, Z.; Li, G. DB-GPT: Large Language Model Meets Database. Data Sci. Eng. 2024, 9, 102–111. [Google Scholar] [CrossRef]
Lu, Z.; Jin, L.; Xu, G.; Hu, L.; Liu, N.; Li, X.; Sun, X.; Zhang, Z.; Wei, K. Narrative Order Aware Story Generation via Bidirectional Pretraining Model with Optimal Transport Reward. In Findings of the Association for Computational Linguistics: EMNLP 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 6274–6287. [Google Scholar] [CrossRef]
Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; Fidler, S. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 19–27. [Google Scholar] [CrossRef]
Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. Qwen2.5 Technical Report. arXiv 2024, arXiv:2412.15115. [Google Scholar] [CrossRef]
Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef]
Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
Shao, Z.; Huang, M.; Wen, J.; Xu, W.; Zhu, X. Long and Diverse Text Generation with Planning-based Hierarchical Variational Model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Hong Kong, China, 2019; pp. 3257–3268. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
Ning, Q.; Wu, H.; Peng, H.; Roth, D. Improving Temporal Relation Extraction with a Globally Acquired Statistical Resource. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; Walker, M.A., Ji, H., Stent, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 841–851. [Google Scholar] [CrossRef]

Figure 1. A real example of orchestrating the narrative order in literary work. The black and brown texts represent the chronological and flashback (describe past events to increase the readability and interest) narratives.

Figure 2. The statistical distribution [5] of narrative order in the ROCStories [6] corpus and the stories generated by different models (i.e., BART-PLANNING-A [5], MEGATRON [4]). <vague> represents that the narrative order is not completely certain.

Figure 3. Two examples of leveraging the various event temporal orders to model diverse narrative orders and the comparison of unidirectional and bidirectional reasoning between event and order. <before> and <after> represent the temporal prompts, which hint at the pairwise event temporal order.

[e_{i}]

represent the i-th event, where “

[e 1]

happens before event

[e 2]

” expresses the chronological order, and “

[e 2]

happens after

[e 3]

” expresses the flashback order. The black and underlined brown texts represent the chronological and flashback events, respectively.

Figure 3. Two examples of leveraging the various event temporal orders to model diverse narrative orders and the comparison of unidirectional and bidirectional reasoning between event and order. <before> and <after> represent the temporal prompts, which hint at the pairwise event temporal order.

[e_{i}]

represent the i-th event, where “

[e 1]

happens before event

[e 2]

” expresses the chronological order, and “

[e 2]

happens after

[e 3]

” expresses the flashback order. The black and underlined brown texts represent the chronological and flashback events, respectively.

Figure 4. An illustration of our framework TimeWeaver. TimeWeaver comprises an event–order bidirectional pretrained storyline model and a story model. We illustrate with a five-event-and-sentence story. Given an input template, the bottom shows the pipeline generation, namely planning a storyline c prior to generating a story y. The top shows how temporal mixture-of-experts handles each token with different narrative orders, where the events with various narrative orders are routed (before, after) in different areas (chronological, flashback). The vague expert serves as the shared expert across both areas. At the fine-tuning stage, two models of TimeWeaver are jointly optimized by reinforcement learning algorithm with a multi-granular OT reward.

Figure 5. An illustration of the event–order bidirectional pretraining, which is realized by the joint learning objectives of event blank infilling, pairwise order prediction, and temporal event ordering. The left is the overview of the whole method, while the right shows a specific example of three events.

Figure 6. An illustration of multi-granular OT reward construction. We take a batch of samples as example, where each sample contains five events and sentences. The bottom shows the details of applying the sentence loss

L_{s}

to construct the storyline-granular reward

R^{s g}

, while the top shows the details of employing the event semantics

H_{e}

and sentence semantics

H_{s}

to construct the event-granular reward

R^{e g}

. With

R^{s g}

and

R^{e g}

, we construct the multi-granular OT reward.

Figure 6. An illustration of multi-granular OT reward construction. We take a batch of samples as example, where each sample contains five events and sentences. The bottom shows the details of applying the sentence loss

L_{s}

to construct the storyline-granular reward

R^{s g}

, while the top shows the details of employing the event semantics

H_{e}

and sentence semantics

H_{s}

to construct the event-granular reward

R^{e g}

. With

R^{s g}

and

R^{e g}

, we construct the multi-granular OT reward.

Figure 7. A visualization of the comparison between the multi-granular reward and previous works (storyline- and event-granular reward). We take a set of samples in a batch as an example.

e_{i}

and

s_{i}

represent the event in the storyline and the sentence in the story, respectively.

Figure 7. A visualization of the comparison between the multi-granular reward and previous works (storyline- and event-granular reward). We take a set of samples in a batch as an example.

e_{i}

and

s_{i}

represent the event in the storyline and the sentence in the story, respectively.

Figure 8. Detailed statistics on the annotated results of narrative order. For each set, the first line shows the ratio of different narrative orders, and the second and third line show the NOA on <after> and <before>, respectively.

Figure 9. The perplexity (PPL) of different variants in the pretraining stage.

Table 1. Designed prompt for LLMs to orchestrate diverse narrative orders in story generation.

Instruction	You are a proficient storyteller and able to orchestrate diverse narrative orders (chronological and flashback) for crafting compelling stories.
Task definition	The different narrative orders of story are hinted at by pairwise event temporal orders. The types of event temporal order include <before>, <after>, and <vague>. Concretely, “sentence1 <before> sentence2” indicates the event in sentence1 occurs earlier than the event in sentences, which expresses the chronological narrative order. Likewise, “sentence1 <after> sentence2” indicates the event in sentence1 occurs later than the event in sentence2, which expresses the flashback narrative order. Last,“sentence1 <vague> sentence2” indicates the event temporal order between the event in sentence1 and the event in sentence2 is arbitrary, which can be both <before> and <after>. Taking sentence1 “the police were trying to catch a neighborhood thief” and sentence2 “the thief had stolen a car” as examples, “sentence1 <after> sentence2” and “sentence2 <before> sentence1” are both reasonable, because the event “trying to catch a neighborhood” in sentence1 occurs earlier than the event “had stolen a car” in sentence2. Next, we will provide the start sentence and the specific event temporal order between pairwise sentences to hint at the narrative order, your task is to come up with a five-sentence (including the start sentence) short story no more than 48 words, which is logically coherent and conforms to the narrative order.
Input/Output	Input: start sentence = “nina needed blood tests done”, specific event temporal order of the story = “start sentence <before> sentence1 <after> sentence2 <before> sentence3 <before> sentence4” Output: nina needed blood tests done. she had been feeling unwell for weeks. the docter ordered it to determine the cause. when it was over, relief washed over her. the blood results came back perfectly! Input: start sentence = “a friend of mine just broke up”, specific event temporal order of the story = “start sentence <after> sentence1 <before> sentence2 <after> sentence3 <before> sentence4” Output: …

Table 2. Comparison results measured by automatic metrics on the ROCStories and WritingPrompts datasets. ∗ denotes the method leverage the external knowledge base. The best is in bold and the second-best is underlined. The ↑ and ↓ denote the indicators of the larger or smaller the better.

Models	ROCStories					WritingPrompts
Models	PPL↓	B-3↑	R-L↑	R-2↓	Dis↑	PPL↓	B-3↑	R-L↑	R-2↓	Dis↑	Tks↑
BART [30]	20.24	4.98	19.11	47.78	62.44	31.15	0.57	9.28	20.92	59.37	148.6
MEGATRON * [4]	-	2.57	19.29	60.75	85.42	-	-	-	-	-	-
BART-PLANNING-A [30]	26.93	5.02	19.14	49.98	60.65	31.04	0.67	9.43	23.40	60.27	160.2
BART-PLANNING-E [30]	27.30	5.13	19.29	49.77	61.80	30.65	1.76	9.41	23.70	57.31	218.7
FLASHBACK-VANILLA [5]	22.85	5.07	19.39	46.11	63.42	30.77	1.44	10.95	23.70	59.83	208.6
BPOT-VANILLA [42]	25.51	5.08	19.40	47.68	62.55	30.71	1.97	11.34	25.20	57.89	248.9
TemporalBART [8]	24.65	5.01	19.12	50.62	62.13	30.69	1.38	10.89	25.60	60.13	210.7
FLASHBACK [5]	15.45	5.20	19.49	50.05	64.76	30.73	1.64	11.03	24.20	58.86	222.4
TimeWeaver (Ours)	13.42	5.42	19.82	46.73	66.95	28.37	2.19	11.57	23.30	62.74	256.3
CONTENT-PLANNING [39]	-	-	-	-	-	-	3.46	14.40	-	78.16	252.3
TimeWeaver-Large (Ours)	9.72	5.61	19.97	45.26	67.73	26.81	3.41	13.51	-	64.81	326.3

Table 3. Manual evaluation results on the ROCStories dataset. The best is in bold, the second-best with an underline. The ↑ and ↓ denote the indicators of the larger or smaller the better.

Method	NOD↑	NOA↑	Coherence↑	Overall↓
BPOT-VANILLA	0.654	0.903	2.408	6.35
TEMPORALBART	0.714	0.905	2.536	5.99
FLASHBACK	0.709	0.915	2.922	4.65
Qwen2.5-7b	0.549	0.788	4.182	3.05
Llama3-8b	0.588	0.748	4.386	2.73
GPT4o	0.627	0.803	4.512	2.30
TimeWeaver	0.819	0.970	4.242	2.91

Table 4. Ablation study results of the ROCStories dataset on automatic evaluation. The ↑ and ↓ denote the indicators of the larger or smaller the better. Best in bold.

Method	PPL↓	B-3↑	R-L↑	R-2↓	Dis↑
Vanilla	25.51	5.08	19.40	47.68	62.55
+MRRL	14.52	5.31	19.67	48.06	65.21
+TMoE	13.86	5.36	19.75	47.23	66.18
+EOBP	13.42	5.42	19.82	46.73	66.95

Table 5. Ablation study results of the ROCStories dataset on manual evaluation. The ↑ and ↓ denote the indicators of the larger or smaller the better. Best in bold.

Method	NOD↑	NOA↑	Coherence↑
Vanilla	0.654	0.903	2.408
+MRRL	0.707	0.938	3.528
+TMoE	0.783	0.950	3.962
+EOBP	0.819	0.970	4.242

Table 6. Comparative experimental results of the ROCStories dataset on automatic evaluation. The ↑ and ↓ denote the indicators of the larger or smaller the better. Best in bold.

Method	PPL↓	B-3↑	R-L↑	R-2↓	Dis↑
EGNR	14.26	5.24	19.56	47.38	65.28
EGOTR	13.78	5.38	19.72	46.92	66.56
SGR	13.98	5.32	19.68	46.86	66.34
Ours	13.42	5.42	19.82	46.73	66.95

Table 7. Case studies on the ROCStories dataset. The black and brown texts denote the chronological and flashback narrative order, respectively. The yellow background denotes the failed flashback or logical incoherence.

Input	[s1] terry was in bed sleeping.
prompt	<before><before><after><before>
Flashback	[s2] suddenly, he heard a loud crash. [s3] he jumped out of bed to see what it was. [s4] it was a car accident. [s5] terry rushed to the hospital.
Qwen2.5-7b	[s2] his alarm clock had just stopped ringing. [s3] he finally managed to fall asleep. [s4] when he woke up, he realized he had overslept. [s5] the day ahead seemed long and tiring.
GPT4o	[s2] the moonlight cast shadows on the wall. [s3] suddenly, a loud crash woke him up. [s4] he rushed downstairs to find the window shattered. [s5] realizing it was just the wind, he returned to bed.
TimeWeaver	[s2] suddenly, he heard a loud noise. [s3] he jumped out of bed to see what it was. [s4] it was a burglar trying to break into his house. [s5] terry called the police and reported the burglar.
Input	[s1] a friend of mine just broke up.
prompt	<after><before><after><before>
Flashback	[s2] she had a boyfriend who was cheating on her. [s3] i tried to talk to him, but he wouldn’t talk to me. [s4] i asked him why he cheated on me. [s5] he told me that he didn’t want to be friends with me.
Qwen2.5-7b	[s2] she had been noticing the distance for some time. [s3] one day, she finally told him it was over. [s4] later, I comforted her. [s5] her friends planned a night out to cheer her up.
GPT4o	[s2] they had been dating him for two years. [s3] They met at a concert, sparks flying instantly. [s4] Now, she finds solace in music, healing slowly. [s5] Her heart will mend.
Timewaver	[s2] they had been dating for a year. [s3] i asked her why. [s4] she told me that he cheated on her with another woman. [s5] i felt bad for her.
Input	[s1] kim had a burning sensation in her bladder.
prompt	<before><after><vague><before>
Flashback	[s2] she went to the bathroom to find out what it was. [s3] she realized she had flushed the toilet. [s4] she called the plumber. [s5] he came and fixed the toilet for her.
Qwen2.5-7b	[s2] she decided to see a doctor right away. [s3] the nurse took her medical history first. [s4] then, she was given a urine test to confirm the issue. [s5] relief followed when the test came back negative.
GPT4o	[s2] she remembered drinking too much coffee the day before. [s3] the doctor diagnosed her with a mild infection. [s4] after taking prescribed medication, she felt better. [s5] grateful for the quick recovery, she vowed to drink more water.
Timewaver	[s2] she went to the bathroom to see what was wrong. [s3] she realized that she had a clogged bladder. [s4] kim had to go to the emergency room. [s5] the doctor treated her and gave her some medication.
Input	[s1] tom was depressed.
prompt	<after><before><vague><before>
Flashback	[s2] he didn’t have much money. [s3] he decided to buy a lottery ticket. [s4] he spent a lot of money. [s5] tom won the jackpot.
Qwen2.5-7b	[s2] he decided to talk to a therapist. [s3] during their session, they discussed his feelings. [s4] afterwards, he felt a bit lighter. [s5] finally, he started to see small improvements in his mood.
GPT4o	[s2] he remembered happier times with friends. [s3] they had drifted apart over the years. [s4] seeking solace, he visited the old park bench where they once laughed. [s5] as he sat there, a sense of peace slowly returned.
Timewaver	[s2] he didn’t have anything to cheer him up. [s3] he decided to go to a bar. [s4] he drank a lot of beer. [s5] tom felt much better.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lu, Z.; Jia, W.; Tian, C.; Jin, L.; Bai, Y.; Xu, G. TimeWeaver: Orchestrating Narrative Order via Temporal Mixture-of-Experts Integrated Event–Order Bidirectional Pretraining and Multi-Granular Reward Reinforcement Learning. Electronics 2025, 14, 3880. https://doi.org/10.3390/electronics14193880

AMA Style

Lu Z, Jia W, Tian C, Jin L, Bai Y, Xu G. TimeWeaver: Orchestrating Narrative Order via Temporal Mixture-of-Experts Integrated Event–Order Bidirectional Pretraining and Multi-Granular Reward Reinforcement Learning. Electronics. 2025; 14(19):3880. https://doi.org/10.3390/electronics14193880

Chicago/Turabian Style

Lu, Zhicong, Wei Jia, Changyuan Tian, Li Jin, Yang Bai, and Guangluan Xu. 2025. "TimeWeaver: Orchestrating Narrative Order via Temporal Mixture-of-Experts Integrated Event–Order Bidirectional Pretraining and Multi-Granular Reward Reinforcement Learning" Electronics 14, no. 19: 3880. https://doi.org/10.3390/electronics14193880

APA Style

Lu, Z., Jia, W., Tian, C., Jin, L., Bai, Y., & Xu, G. (2025). TimeWeaver: Orchestrating Narrative Order via Temporal Mixture-of-Experts Integrated Event–Order Bidirectional Pretraining and Multi-Granular Reward Reinforcement Learning. Electronics, 14(19), 3880. https://doi.org/10.3390/electronics14193880

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

TimeWeaver: Orchestrating Narrative Order via Temporal Mixture-of-Experts Integrated Event–Order Bidirectional Pretraining and Multi-Granular Reward Reinforcement Learning

Abstract

1. Introduction

2. Related Work

2.1. Story Generation

2.2. Optimal Transport

2.3. Event-Centric Pretraining

3. Method

3.1. Overview and Storyline Design

3.2. Temporal Mixture-of-Experts

3.3. Event–Order Bidirectional Pretraining

3.4. Reinforcement Learning Algorithm with Multi-Granular Optimal Transport Reward

4. Experiments

4.1. Datasets

4.2. Baseline Models

4.3. Evaluation Metrics

Implementation Details

4.4. Experimental Results

4.5. Ablation Study

4.6. Case Study

5. Conclusions and Future Works

6. Limitations

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI