CLICK: Integrating Causal Inference and Commonsense Knowledge Incorporation for Counterfactual Story Generation

Abstract: Counterfactual reasoning explores what could have happened if the circumstances had been different from what actually occurred. As a crucial subtask, counterfactual story generation integrates counterfactual reasoning into the generative narrative chain, which requires the model to make minimal edits while ensuring narrative consistency. Previous work prioritizes conflict detection as a first step and then replaces conflicting content with appropriate words. However, these methods mainly face two challenging issues: (a) the causal relationship between story event sequences is not fully utilized in the conflict detection stage, leading to inaccurate conflict detection, and (b) the absence of proper planning in the content rewriting stage results in a lack of narrative consistency in the generated story ending. In this paper, we propose a novel counterfactual generation framework called CLICK based on causal inference in event sequences and commonsense knowledge incorporation. To address the first issue, we utilize the correlation between adjacent events in the story ending to iteratively calculate the contents of the original ending affected by the condition. The content tied to the original condition is then effectively prevented from carrying over into the new story ending, thereby avoiding causal conflict with the counterfactual conditions. Considering the second issue, we incorporate structural commonsense knowledge about counterfactual conditions, equipping the framework with comprehensive background information on the potential occurrence of counterfactual conditional events. By leveraging a rich hierarchical data structure, CLICK gains the ability to establish a more coherent and plausible narrative trajectory for subsequent storytelling. Experimental results show that our model outperforms previous unsupervised state-of-the-art methods and achieves gains of 2.65 in BLEU, 4.42 in ENTScore, and 3.84 in HMean on the TIMETRAVEL dataset.


Introduction
Counterfactual reasoning has attracted significant attention in the field of natural language processing due to its wide range of applications in improving model robustness [1,2], interpretability [3][4][5], and data augmentation [6][7][8]. One significant aspect where counterfactual reasoning holds pivotal importance is text generation, leading to advancements in various applications such as dialogue systems [9], answer feedback generation [10], and creative content generation [11]. Building upon the foundation of counterfactual reasoning in text generation, researchers have recently explored a novel task known as counterfactual story generation [12]. By leveraging counterfactual thinking, the objective is to generate alternative narratives that explore different outcomes or events based on hypothetical changes to the initial context or plot. This area of research aligns with the current trend in large-scale language models [13][14][15], which is dedicated to pushing the boundaries of creative text generation and enabling more human-like interactions with AI systems. The significance of this task extends far beyond the theoretical realm, offering promising applications in diverse real-world scenarios. In education, it offers the potential to revolutionize learning materials. By rewriting stories under different counterfactual conditions, students can gain a deeper understanding of historical events, scientific phenomena, and complex cause-and-effect relationships. In the legal and ethical realms, the task can aid lawyers, judges, and policymakers. It allows for the exploration of alternative scenarios, thereby facilitating informed decisions and assessments of potential outcomes in legal cases and ethical dilemmas. From enhancing educational materials to guiding legal decisions, simulating business strategies, and aiding medical treatment planning, counterfactual story generation presents an invaluable tool for understanding causality and exploring multifaceted narratives in a multitude of domains.
As shown in Figure 1, in this generation task, given an entire original story (consisting of a one-sentence premise, a one-sentence condition, and a three-sentence ending) and an intervening counterfactual condition, some phrases in the original ending may conflict with the new condition. For example, an old man built a zoo where many of the animals were set free due to the occurrence of a hurricane. When the story condition is changed to a hotel, the subsequent story is adapted to portray the destruction of the rooms, ensuring narrative consistency with the counterfactual condition. Furthermore, the hurricane event in sentence s4 remains unaffected by either condition, making it optimal to retain the event entirely to ensure minimal editing. Traditional models [16][17][18] are based on understanding the given context to generate fluent and logically sound text. Therefore, leveraging a pre-trained language model enables the generation of fluent endings under counterfactual conditions. However, challenges arise when attempting to achieve accurate reasoning while making only minimal modifications to the ending and keeping it natural. Recent relevant research [19,20] adopts a two-stage framework with the objective of attaining accurate reasoning in order to achieve minimal editing. In the first stage, each token in the story context is examined individually to determine whether it requires modification. This approach enables accurate detection of each word, ensuring minimal editing. In the second stage, the words identified in the previous stage are modified to align with the story logic under the counterfactual condition, thereby ensuring narrative consistency. For instance, Hao et al. [19] trained a binary classifier using supervised learning to classify whether each token in the story ending represents causal content, while Chen et al. [20] conducted causal risk ratio calculations to detect causal conflicts.

However, despite achieving promising improvements, there are still two primary challenges associated with these approaches, elaborated as follows:

Conflict Detection: In previous methods, the causal invariance of a word in the story ending is determined through an assessment of its relevance to both the original condition and the counterfactual condition. However, it is challenging and confusing to compare the correlation of the phrase round them all up in s5 with the original condition against its correlation with the counterfactual condition, because them refers to the content in s4, and its relevance to the condition gradually weakens as the event progresses. In such a scenario, relying solely on conditions is insufficient to precisely locate the causally invariant content. The narrative of events in a story evolves gradually, leading to a diminishing impact of preceding information, whereas the latest plot developments bear a more immediate correlation with future events. Consequently, during the process of story rewriting, it is crucial to consider not only the influence of the initial conditions but also to incorporate a more comprehensive account of the entire story.
Causal Continuity: Previous approaches rely exclusively on the modeling capabilities of the language model to predict the output by leveraging the provided contextual information. However, although current generative models are capable of producing coherent text, they are also prone to defects such as self-contradiction and topic drift [21][22][23]. Given a story condition, subsequent events can unfold in numerous directions. Allowing the language model to select words for the next position solely based on statistical probabilities and the aforementioned information is not an effective mode of control for ensuring the generated content maintains narrative consistency with counterfactual conditions.
In terms of conflict detection, we leverage the causal relationship within the event sequences instead of relying solely on the story conditions. For example, the story in Figure 1 exhibits the following causal relationships: zoo in s2 is related to zoo, animal in s3; zoo, animal in s3 is related to animals, set free in s4; and animals, set free in s4 is related to round them all up in s5. Therefore, from an intuitive standpoint, a more effective approach for assessing the causal correlation between outcomes and conditions is to leverage the correlation of cause words among consecutive events. To explicitly explain the causal relationship, we formulate the story ending generation process as a causal graph. Considering the causal continuity issue, we integrate commonsense event knowledge into the rewriting process. Specifically, we introduce COMET [24], a powerful tool capable of generating diverse and structured commonsense knowledge specifically tailored to counterfactual conditions. We fine-tune a GPT-style [16] model on a large corpus of stories paired with corresponding commonsense knowledge pertaining to counterfactual conditions. Leveraging the vast world knowledge encoded within the pre-trained language model and integrating structured commonsense knowledge allows the model to deduce plausible event sequences that have not been previously observed and to seamlessly incorporate novel words and knowledge into the generated content.
In this paper, we propose CLICK, a counterfactual generation framework based on CausaL Inference in event sequences and Commonsense Knowledge incorporation. In the first stage, we propose a skeleton extractor, which leverages the causal relationship among event sequences to detect the contents in the story ending that are affected by the original story condition. These elements are then removed, resulting in a basic skeleton and mitigating any interference with the new counterfactual outcome. Furthermore, commonsense generators are employed to formulate structural knowledge associated with the counterfactual condition, which enhances the causal coherence between the story ending and the counterfactual condition. In the second stage, the skeleton and the knowledge are provided as contexts to a generator that produces proper words to fill in the skeleton in a sequence-to-sequence way. We conduct experiments on the TIMETRAVEL dataset, and the experimental results illustrate that our model achieves state-of-the-art performance compared to other strong baselines. Additionally, our model exhibits superior capabilities in terms of minimal editing of the original ending and ensuring causal coherence between the counterfactual ending and the corresponding counterfactual condition. The contributions of this work can be summarized as follows:

Knowledge-Enhanced Text Generation
Text generation is a task that takes text as input, processes it into semantic representations, and generates the desired output text. However, the inherent limitation of input text in providing sufficient knowledge poses challenges for neural generation models to achieve the desired output quality [25][26][27]. Many research efforts have been made to enhance the control of generation with various desired properties, such as topic [28], emotion [29], keywords [30], dialogue intent [31], etc. In particular, narrative generation requires models to produce fluent and logically coherent stories based on predefined conditions [32][33][34]. Nevertheless, current generative models have not yet attained the level of storytelling proficiency exhibited by human narrators. To bridge this gap, many studies seek to inject structured knowledge into the generation process. According to the method of integration, these works fall into two categories: knowledge enhanced by encoding or by text.
Knowledge Enhanced by Encoding. One line of research [35][36][37][38][39][40] encodes structured knowledge into low-dimensional vectors and then uses them to influence the word probability distribution during the generation process. Wang et al. [35] encoded entities retrieved from ConceptNet and then fed them into decoders to generate stories. Chen et al. [36] leveraged implicit relationships among keywords in stories by calculating the cosine similarity of word embeddings. However, this type of process happens as a black box, presenting challenges in terms of interpretability. To tackle this challenge, Liu et al. [37] proposed calculating a knowledge gain to define a reward at each step during the decoding process. Taking inspiration from such methods, we use vectors trained on knowledge graphs to detect causal invariance and calculate similarity scores between them to assess correlation. The resulting scores directly indicate the basis of our correlation detection and contribute to the interpretability of our method.
Knowledge Enhanced by Text. Another line of research views structured knowledge as material that can be learned in the same way as stories, feeding knowledge contexts into generative models to capture explicit information. Guan et al. [41] integrated commonsense knowledge graphs into GPT-2 [16] by post-training the model on knowledge examples. Xu et al. [42] transformed triples in ConceptNet [43] into natural language sentences and utilized them to generate stories. These studies make generative models inherently knowledge-enhanced. In contrast, [44,45] proposed a two-step generation pipeline with an independent knowledge reasoner instead of fine-tuning a PLM to directly generate discourse-level stories: they first generated successive events and subsequently expanded these events into coherent discourse sentences. In our method, we incorporate both fine-tuning and multi-step generation techniques. Different from these works, we employ commonsense knowledge as a guiding mechanism to assist the model in fill-in-the-blank tasks, rather than fine-tuning a model to generate sequences in a direct left-to-right way.

Causal Inference and NLP
Causal inference [46] aims to explore the cause-and-effect relationships between different variables. With the emergence of an interdisciplinary research field at the intersection of causal inference and NLP, there is a growing interest among researchers in exploring methods for estimating causal effects from textual data and leveraging causal mechanisms to enhance the current understanding and generation of natural language. Within NLP, distinguishing between causation and correlation remains a considerable challenge, leading to potential misconceptions in the results. Moreover, the prediction process is often treated as a black box, lacking interpretability and transparency in its outputs. To tackle these challenges, a causal mechanism can be employed to model the data generation process and enhance the comprehension of the causal relationships between events and the underlying constructs within the predictor [47,48]. To enhance causal reasoning in narratives, we primarily employ two methods: causal graph analysis and counterfactual reasoning. These approaches offer valuable insights and tools for effectively capturing causal relationships within the context.
Causal Graph Analysis. One line of research focuses on leveraging causal graph analysis in the data generation process, enabling the derivation of valid causal conclusions and ultimately enhancing the performance of NLP systems. The authors of [49,50] employ causal graph analysis to qualitatively analyze the impact of item popularity as a confounder, effectively boosting recommendation system performance. Tian et al. [2] employ a structural causal model to formulate biases in natural language understanding tasks, effectively alleviating the annotation biases of the datasets. Moreover, causal graph analysis is also widely used in various fields, including text classification [51], named entity recognition [52], pretrained language models [53], fake news detection [54], and even performance bottleneck detection in programming languages [55][56][57][58]. In this work, we utilize causal graph modeling to analyze the generation process of story event sequences, enabling the identification of elements in subsequent events that are impacted by changing conditions.
Counterfactual Reasoning. Another line of research focuses on enhancing current text generation mechanisms by incorporating counterfactual reasoning capabilities. Counterfactual reasoning refers to reasoning about what could have happened if the past had been different or if certain conditions or events had been altered. It deals with hypothetical scenarios and explores the causal relationships between variables. These efforts generate counterfactual samples that are used to improve model robustness [59], interpretability [60], and data augmentation [7]. However, the primary idea behind these works is to utilize language models and diverse sampling strategies to generate counterfactuals, without involving more complex narrative counterfactual reasoning. In 2019, Qin et al. [12] introduced the task of counterfactual story generation. They employed a seq2seq model to reconstruct stories; however, the resulting story endings diverged significantly from the original endings. To address this issue, Hao et al. [19] and Chen et al. [20] proposed a two-step approach: they first determined the editing position and then modified the content. Compared with the original method, they made improvements in terms of minimal editing and maintaining consistency with counterfactual conditions. However, the content generated by these methods still exhibits flaws in terms of logical rationality and consistency with counterfactual conditions. In our work, we address these limitations by incorporating causal relationships between story event sequences to more accurately assess the causal invariance of content. Additionally, we introduce structural commonsense knowledge to offer diverse and previously unseen planning guidance to the model, aiming to improve the overall quality and coherence of the generated output.

Causal Graph
A causal graph is a probabilistic graphical model used to describe how variables interact with each other, expressed as a directed acyclic graph (DAG) G = {V, E}, where V denotes the set of variables and E represents the causal correlations among those variables. A DAG is a collection of nodes (variables) and edges (associations) that define the assumed causal relationships in a data-generating process. In Figure 2, we show an example of a causal graph with three variables: Treatment, Outcome, and Confounder. In the context of causal models, the Treatment plays a direct causal role in determining the value of the Outcome, as indicated by the directed edge that links the Treatment to the Outcome. The Confounder influences both the Treatment and the Outcome, creating an association between them. However, it is important to note that the association between the Treatment and Outcome resulting from their shared cause is not part of the specific causal association being analyzed. In other words, a portion of the association between the Treatment and Outcome can be attributed to the biasing backdoor path that runs from the Treatment back through the Confounder to the Outcome. To accurately compute the causal effect, this biasing path must be blocked by adjusting for the influence of the Confounder. Appendix A supplements additional information regarding practical applications of causal graphs.

Causal Intervention
Causal intervention is employed to determine the true causal effect of one variable on another in the presence of confounders. In a causal graph, performing an intervening operation on a variable eliminates all edges directed towards it, thereby breaking the causal relationships from its parent nodes. The backdoor adjustment [46] using do-calculus provides a method for computing the intervened distribution when there are no additional confounders. For the example in Figure 2, the adjustment formula can be derived according to Bayes' theorem as follows, where z ranges over the values of the Confounder Z:

P(Y|do(X)) = ∑_z P(Y|X, z) P(z)    (1)
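As a quick sanity check of Equation (1), the following toy simulation (all distributions are illustrative assumptions, not from the paper) shows the naive association overstating the causal effect, while the backdoor adjustment recovers it:

```python
import random

random.seed(0)

# Toy model of the Figure-2 graph (probabilities are illustrative):
# binary confounder Z influences treatment X and outcome Y.
#   P(Z=1)=0.5,  P(X=1|Z)=0.2+0.6*Z,  P(Y=1|X,Z)=0.3*X+0.5*Z
# so the true causal effect of X on Y is 0.3.
def sample():
    z = int(random.random() < 0.5)
    x = int(random.random() < 0.2 + 0.6 * z)
    y = int(random.random() < 0.3 * x + 0.5 * z)
    return x, y, z

data = [sample() for _ in range(200_000)]

def p_y_given(x_val, z_val=None):
    rows = [y for x, y, z in data
            if x == x_val and (z_val is None or z == z_val)]
    return sum(rows) / len(rows)

p_z1 = sum(z for _, _, z in data) / len(data)

# Naive association mixes in the biasing path X <- Z -> Y (~0.6 here).
naive = p_y_given(1) - p_y_given(0)

# Backdoor adjustment, Equation (1): P(Y|do(X)) = sum_z P(Y|X,z) P(z).
def p_y_do(x_val):
    return p_y_given(x_val, 0) * (1 - p_z1) + p_y_given(x_val, 1) * p_z1

adjusted = p_y_do(1) - p_y_do(0)
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")  # adjusted ~ 0.3
```

The adjusted estimate matches the true effect of 0.3, while the naive difference of conditional means is inflated by the confounder.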

Methodology

Task Formulation
The input of the counterfactual story rewriting task is a five-sentence story S = {s1, s2, s3, ..., sm}, where m is the number of sentences and si = {wi1, wi2, ..., win} contains the n words of the i-th sentence, together with a counterfactual condition s2', which is counterfactual to the initial condition s2. In this representation, s1 is denoted as the premise p, s2 as the original condition c, s2' as c', and s3:m as the ending e. The goal of this task is to revise the ending e into an edited ending e' that minimally modifies the original one and regains narrative coherence with the counterfactual condition.
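A single instance of this formulation can be represented concretely as follows (field names are our own; the actual TIMETRAVEL data format may differ, and the story text is paraphrased from the running example):

```python
from dataclasses import dataclass

# Minimal sketch of one counterfactual rewriting instance.
@dataclass
class StoryInstance:
    premise: str          # s1 = p
    condition: str        # s2 = c
    counterfactual: str   # s2' = c'
    ending: list[str]     # s3..s5 = e (to be revised into e')

ex = StoryInstance(
    premise="An old man lived near a forest.",
    condition="He built a zoo.",
    counterfactual="He built a hotel.",
    ending=["The zoo had unusual animals that nobody had ever seen before.",
            "A hurricane set many of the animals free.",
            "He had to round them all up."],
)
# Five-sentence story: premise + condition + three-sentence ending.
assert len(ex.ending) == 3
```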

Causal Graph and Causal Path Analysis
To reveal the causal relationships between the event sequence in the story ending, we construct a causal graph that represents the generation process of each individual event in Figure 3. From the perspective of outcome event generation, an investigation into the sources and affected factors allows us to locate the specific modification points when past events undergo changes. This analysis enables us to pinpoint the corresponding positions in need of adjustment in order to maintain coherence and consistency within the overall narrative. Therefore, the causal invariance of words in the outcome can be detected based on causal graph modeling. Each causal graph for an event sequence comprises three variables: Treatment, Outcome, and Confounder. In the context of this task, an event in the story ending corresponds to the Outcome variable, while the preceding event adjacent to it represents the Treatment variable. The Confounder variables consist of the earlier contextual events that occurred before the Outcome event. The interventionist account characterizes a causal relationship between two variables C and E in the following way: C is a cause of E if there is at least one ideal intervention on C that changes the value of E. Referring to this definition, we provide the definition of causal invariance in the context of the story rewriting task: under counterfactual interventions, any changes observed in subsequent events indicate a causal relationship with the conditions, while the immune portions represent the causal invariance. Based on the causal graph presented in Figure 3, we narrow the calculation scope of causal invariance to consider only the relationship between Outcome and Treatment, that is, the causal effects between adjacent events.
For the calculation of causal invariance in s3: In Figure 3a, event s2 influences event s3 through the core path s2 → s3. Specifically, the goal of the third sentence's rewriting is to determine the specific components of s3 that are impacted by the intervention content in s2. This entails evaluating the causal invariance between the cause words in s2 and s3.
For the calculation of causal invariance in s4: In Figure 3b, event s3 influences event s4 through the core path s3 → s4. The story condition now plays the role of the Confounder in the causal graph, creating a spurious correlation by influencing both the s3 and s4 events. To mitigate the influence of the Confounder, only the effect of the Treatment on the Outcome is considered. The influence on s3 in the previous step serves as the cause of what affects the subsequent event in the current step. Consequently, the calculation of the affected content in s4 is performed based on the content in s3. This entails evaluating the causal invariance between the cause words in s3 and s4. Likewise, the calculation of causal invariance in s5 is converted to the computation between the causal content in s4 and s5, as illustrated in Figure 3c.
A formal description of the causal graph is shown in Figure 3d. In this graph, each sentence si in the story is modeled as the Outcome in the causal relationship. The preceding event si−1, which is adjacent to the sentence si, serves as the Treatment that directly influences it. Additionally, other events that transpired prior to sentence si−1 act as Confounders that influence both the Treatment and the Outcome. To effectively rewrite the sentence si, it is crucial to identify the underlying cause that impacts its generation process. As indicated by the causal graph, the sentence si is primarily influenced by the preceding event through the path si−1 → si. Hence, during the causal invariance detection stage, we calculate the impact on event si by using the causal result in the previous event si−1.

Model Overview
The framework of CLICK is shown in Figure 4. It consists of three components: (1) a skeleton extractor with narrative chain guidance, which removes words that are causally associated with the original condition, leading to the formation of a skeleton that exclusively consists of words unrelated to the original condition; (2) a knowledge-alignment commonsense generator, which employs COMET, a transformer-based tool, to generate structured commonsense knowledge about counterfactual conditions, whose output provides extensive and diverse information and serves as a valuable resource for the subsequent rewriting task; and (3) a commonsense-constrained generative model, which leverages the previously acquired skeleton and commonsense knowledge as prompts to rewrite the story ending.

Skeleton Extractor with Narrative Chain Guidance
The counterfactual story generation task investigates how subsequent events are altered when conditions are modified. Based on the causal graph modeling and causal path analysis of the event sequence in the ending depicted in Figure 3, we can formally summarize the causal influence between events as follows: factor X within an event leads to factor Y in the subsequent event, and factor Y further causes factor Z in the event after that. Factor X in the first process is referred to as the cause word, while factor Y is termed the effect word. In the second process, factor Y becomes the new cause word, and factor Z is the effect word affected by it. In the counterfactual story generation task, the counterfactual condition can be viewed as a causal intervention in the story event chain. For example, in Figure 1, the zoo scene in the original condition is intervened upon, so a natural idea is to remove the events or content associated with the zoo scene from the original ending. In view of the causal relationships among sequences of events, we employ a progressive approach to identify the influence of adjacent events. This involves determining the effect words in the current sentence based on the cause words from the preceding sentence. Subsequently, the resultant effect words are employed as cause words to compute the influenced effect words in the subsequent sentence.
The main purpose of this module is to find the words in the story ending that are highly related to the original condition, eliminate them from the ending, and obtain a skeleton consisting solely of words that are irrelevant to the intervention factor in the condition. The module is divided into the following three steps: (1) condition-guided intervention selection; (2) sequence-aware correlation calculation; and (3) skeleton acquisition.

Condition-Guided Intervention Selection
A counterfactual condition involves partial modifications to the original condition. Specifically, when investigating the influence of a specific element on subsequent events, we can modify that element and observe the corresponding changes in the subsequent events. By comparing the disparities between the original and counterfactual conditions, we can determine the intervened variables within the original conditions. This approach enables us to explore the causal effects of the modified element and gain insights into its impact on subsequent events.
Given the original story condition denoted as c = {w1, ..., wj, ..., wn} and the counterfactual condition denoted as c' = {w'1, ..., w'j, ..., w'm}, the transformation from the original condition to the counterfactual condition can be divided into two situations: (1) Word substitution: new counterfactual conditions are obtained by selectively modifying only a subset of words in the original condition. Thus, by comparing c and c', the modified content in c' is the intervention. (2) Word deletion and addition: new counterfactual conditions are obtained by selectively deleting words from or adding words to the original condition. In this case, all words in c' are considered the intervention.
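The two cases above can be sketched with a word-level diff (a simplified illustration; the paper does not specify its exact comparison procedure, so the tokenization and the use of `difflib` here are our own assumptions):

```python
import difflib

def find_intervention(original: str, counterfactual: str) -> list[str]:
    """Return the intervened words, following the two cases in the text:
    word substitution -> the replaced words in c'; otherwise (deletion or
    addition) -> all words of c'."""
    orig = [w.strip(".,!?") for w in original.lower().split()]
    cf = [w.strip(".,!?") for w in counterfactual.lower().split()]
    matcher = difflib.SequenceMatcher(a=orig, b=cf)
    # Collect words of c' that replace words of c.
    replaced = [w for tag, _, _, j1, j2 in matcher.get_opcodes()
                if tag == "replace" for w in cf[j1:j2]]
    if replaced:       # case (1): word substitution
        return replaced
    return cf          # case (2): deletion or addition

print(find_intervention("He built a zoo.", "He built a hotel."))  # ['hotel']
```

For a pure insertion such as "He built a small zoo.", no replacement is found, so every word of the counterfactual condition is treated as the intervention, matching case (2).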

Sequence-Aware Correlation Calculation
Building upon the previous causal pathway analysis, we utilize the relationships between adjacent events to calculate the elements in the story ending that are affected by interventions. To assess the correlation between tokens, we employ Numberbatch word embeddings [43] to calculate the similarity between them. Tokens with high similarity indicate a significant impact, whereas tokens with lower similarity suggest a lesser influence from the intervened variables. The Numberbatch word embeddings are trained on diverse resources, including ConceptNet [43], Word2Vec [61], GloVe [62], and OpenSubtitles [63]. By leveraging both the textual information and the structured knowledge graph of ConceptNet, these vectors capture semantic representations that surpass what can be directly learned from general language corpora, and Numberbatch achieves good performance on tasks related to commonsense knowledge [64]. For instance, when calculating the cosine similarity between the word zoo and the sentence The zoo had unusual animals that nobody had ever seen before in Numberbatch, the tokens zoo and animals exhibit the highest cosine similarity scores. This observation aligns with human intuition based on common knowledge.
In the previous step, we identified the location of the perturbation applied to the original condition, which we refer to as the intervention. We record the original story ending, comprising the third to fifth sentences, as e = {s3, s4, s5}. To calculate the words in the story ending that are influenced by the intervention, we utilize cosine similarity in the Numberbatch word embedding space. Algorithm 1 outlines the procedure. First, we take the intervention as the initial cause word and calculate its correlation with all the words in s3. We identify the words whose cosine similarity surpasses a predefined threshold and refer to them as the cause words in s3. Subsequently, using the cause words in s3, we compute their correlation with all the words in s4. This process is repeated iteratively to obtain the complete set of relevant words across the three sentences. Regarding the threshold setting, we determined it through an extensive series of experiments and comparisons, as discussed in detail in the subsequent ablation experiments section.

Through the preceding steps, we have identified the causal words in the story ending that are influenced by the initial condition. To create a counterfactual scenario where the ending is unaffected by the original condition, we replace these causal words with blank spaces and subsequently merge any consecutive blanks. This process yields a fundamental skeleton of the ending that is independent of the original story condition, ensuring that the ending under the counterfactual condition remains unaltered by the initial condition.
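The iterative propagation of Algorithm 1 and the skeleton construction can be sketched as follows. A tiny hand-made embedding table stands in for Numberbatch, and the threshold value is illustrative only:

```python
import math

# Toy stand-in for ConceptNet Numberbatch vectors; in the real pipeline
# these would be looked up from the numberbatch embedding file.
TOY_VEC = {
    "zoo":       [1.0, 0.1, 0.0],
    "animals":   [0.9, 0.2, 0.1],
    "free":      [0.7, 0.3, 0.2],
    "hurricane": [0.0, 1.0, 0.0],
    "round":     [0.6, 0.2, 0.3],
}

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def propagate(intervention, ending_sents, threshold=0.8):
    """For each ending sentence, mark words whose similarity to the current
    cause words exceeds the threshold; the hits become the next cause words."""
    causes, affected = set(intervention), []
    for sent in ending_sents:
        hits = {w for w in sent
                if w in TOY_VEC and any(
                    c in TOY_VEC and cos(TOY_VEC[c], TOY_VEC[w]) >= threshold
                    for c in causes)}
        affected.append(hits)
        causes = hits or causes   # keep old causes if a sentence has no hit
    return affected

def skeleton(ending_sents, affected):
    """Replace affected words with a blank and merge consecutive blanks."""
    out = []
    for sent, hits in zip(ending_sents, affected):
        toks, prev_blank = [], False
        for w in sent:
            if w in hits:
                if not prev_blank:
                    toks.append("[blank]")
                prev_blank = True
            else:
                toks.append(w)
                prev_blank = False
        out.append(" ".join(toks))
    return out

ending = [
    ["the", "zoo", "had", "unusual", "animals"],
    ["a", "hurricane", "set", "the", "animals", "free"],
    ["he", "had", "to", "round", "them", "up"],
]
affected = propagate(["zoo"], ending)
print(skeleton(ending, affected))
```

With these toy vectors, hurricane in the second sentence is never marked, mirroring the paper's observation that the hurricane event is causally invariant, while consecutive hits such as animals free collapse into a single blank.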

Knowledge-Alignment Commonsense Generator
The primary objective of this module is to generate relevant commonsense knowledge based on counterfactual conditions and incorporate this knowledge into the model through prompts. This guidance encourages the model to take into account the impact of commonsense knowledge when generating story endings. When humans create a story, they often employ commonsense reasoning based on the preceding text to develop a comprehension of the narrative being presented [65]. However, machines, constrained by their training data, lack a universal understanding of commonsense knowledge. To enhance their ability in this regard, it is necessary to incorporate relevant and accurate commonsense knowledge. For this purpose, we utilize COMET [24], a generative knowledge transformer that facilitates commonsense reasoning. Given a counterfactual condition as input, the model can generate natural language descriptions covering nine dimensions of commonsense knowledge; some examples are illustrated in Table 1. Following the classification of ATOMIC [66], these nine structured commonsense descriptions can be divided into three categories:

• If-Event-Then-Mental-State: Defines three relations relating to the mental pre- and post-conditions of an event, including XIntent (why X causes the event), XReact (how X feels after the event), and OReact (how others feel after the event). Our focus lies on knowledge of events related to explicitly mentioned participants, specifically the XIntent and XReact categories.

• If-Event-Then-Event: Defines five relations relating to events that constitute probable pre- and post-conditions of a given event, including XNeed (what X needs to do before the event), XEffect (what effects the event has on X), XWant (what X would likely want to do after the event), OWant (what others would likely want to do after the event), and OEffect (what effects the event has on others). Our focus is on knowledge of events related to explicitly mentioned participants, encompassing the XNeed, XEffect, and XWant categories.

• If-Event-Then-Persona: Defines a stative relation that describes how the subject of an event is described or perceived, including XAttr (how X would be described).
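To make the categorization concrete, the selected relations can be linearized into a single knowledge string for prompting. The example COMET outputs, the relation tags, and the separator format below are hypothetical assumptions for the sketch, not the paper's actual serialization:

```python
# hypothetical COMET outputs for a counterfactual condition such as
# "Rain started to fall." (illustrative values, not real model output)
comet_out = {
    "xIntent": "none",
    "xReact": "wet",
    "xNeed": "to be outside",
    "xEffect": "gets soaked",
    "xWant": "to find shelter",
    "xAttr": "unlucky",
}

# the relation types the paper focuses on for explicitly mentioned participants
SELECTED = ["xIntent", "xReact", "xNeed", "xEffect", "xWant", "xAttr"]

def knowledge_string(relations, selected=SELECTED):
    """Linearize the selected relations into the knowledge segment of the prompt."""
    return " ".join(f"<{r}> {relations[r]}" for r in selected if r in relations)
```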

Commonsense-Constrained Generative Model
Given the original story and a counterfactual condition, we first obtain a basic skeleton using the skeleton extractor module. Next, we use the commonsense generator module to extend the counterfactual condition with relevant structured commonsense knowledge. With these two components, we train the model to fill in the skeleton under the guidance of commonsense knowledge, enabling it to generate a story ending that aligns with the specified counterfactual condition. In our approach, we use GPT-2 [16] as the underlying language model. The input sequence for the model consists of four main components: the premise (p), the counterfactual condition (c′), the basic skeleton (s), and the commonsense knowledge (k′). These components are combined and represented as

{[PRE] p [CON] c′ [SKE] s [KNOW] k′ [END]},

where [PRE], [CON], [SKE], [KNOW], and [END] denote special tokens. The primary objective is to generate the counterfactual ending (e′) of the story. This input format allows the model to incorporate relevant information and to guide the generation process based on both the given context and the commonsense knowledge.
During the training phase, we use unsupervised training data that include only the original story and the counterfactual conditions, without the corresponding rewritten counterfactual endings. To construct training instances, we assemble the premise (p), the condition (c), the basic skeleton (s) extracted from the original story ending, and the commonsense knowledge extended from the original condition (k). These components are concatenated into the following sequence, which serves as the input to the GPT-2 model:

{[PRE] p [CON] c [SKE] s [KNOW] k [END]}.

The original ending is used as the target output. In this way, the GPT-2 model learns to preserve certain words from the skeleton while generating the final ending, employing the provided commonsense knowledge to guide the generation process and fill in the blanks. The [END] token serves as the starting symbol for decoding, and GPT-2 generates the ending word by word. The probability distribution over the output words is

P(y_t | y_<t, x) = GPT2([x; y_<t]),

where y_t (e_t in the training phase and ê_t in the inference phase) is the t-th token after the [END] token, y_<t represents the tokens between [END] and the t-th token, x represents the tokens preceding [END], and GPT2(z) is the function returning the current-step output distribution of GPT-2 fed z as its input.
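The training-instance assembly and the per-token objective can be sketched as below. The whitespace-joined string form of the special-token sequence and the toy probability lists are simplifying assumptions (in practice the sequence is tokenized for GPT-2 and the loss is computed over logits):

```python
import math

def build_input(premise, condition, skeleton, knowledge):
    """Assemble the input sequence {[PRE] p [CON] c [SKE] s [KNOW] k [END]}."""
    return f"[PRE] {premise} [CON] {condition} [SKE] {skeleton} [KNOW] {knowledge} [END]"

def nll_loss(token_probs):
    """L = -sum_t log P(e_t | e_<t, x), summed over the m tokens of the target ending."""
    return -sum(math.log(p) for p in token_probs)
```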
During the training phase, we optimize the model using the following loss:

L = − Σ_{t=1}^{m} log P(e_t | e_<t, x),

where e_t is the t-th word in the original ending, e_<t represents the words before the t-th word, and m is the length of the original ending.
In the inference phase, given the original story and the counterfactual condition, the skeleton extractor module is employed to obtain the basic skeleton (s) representing the structure of the original ending. Next, the commonsense generator module is used to generate extended commonsense knowledge (k′) specific to the counterfactual condition. The ending generator leverages the basic skeleton (s) and the extended commonsense knowledge (k′), along with the premise (p) and the counterfactual condition (c′), to generate the counterfactual ending. The input sequence for the model is constructed as follows:

{[PRE] p [CON] c′ [SKE] s [KNOW] k′ [END]}.

The role of the ending generator is to retain the essential words from the given skeleton and to generate new words that fill in the blanks based on the available input information. By incorporating the provided context, the counterfactual condition, and the commonsense knowledge, the generator produces a coherent and contextually appropriate counterfactual ending for the story. Specifically, the GPT-2 editor predicts the counterfactual ending token by token:

ê_t ∼ sample(top-k(GPT2([x; ê_<t]))), ê_t ∈ V,

where sample represents the top-k [67] sampling method and V is the vocabulary. When a sentence terminator is predicted, the decoding process stops, producing a generated counterfactual ending {ê_1, ê_2, . . ., ê_n} of length n.
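The decoding step can be sketched as temperature-scaled top-k sampling over raw vocabulary logits. This is a generic illustration of top-k sampling using the paper's inference settings (k = 40, temperature = 0.7), not the actual GPT-2 decoding code:

```python
import math
import random

def top_k_sample(logits, k=40, temperature=0.7, rng=random):
    """Sample the next token id from the top-k of the temperature-scaled distribution."""
    scaled = [l / temperature for l in logits]
    # keep only the k highest-scoring vocabulary entries
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    m = max(scaled[i] for i in top)
    # softmax over the retained entries (shifted by the max for numerical stability)
    weights = [math.exp(scaled[i] - m) for i in top]
    total = sum(weights)
    return rng.choices(top, weights=[w / total for w in weights], k=1)[0]
```

With k = 1 this reduces to greedy decoding; larger k trades determinism for diversity.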

Dataset
We run experiments with CLICK on TIMETRAVEL [12], a standard counterfactual story rewriting dataset built on the ROCStories [68] corpus. In TIMETRAVEL, the initial condition is rewritten by humans into a counterfactual condition, followed by edited endings; only part of the training set is annotated with the edited endings. We train CLICK in an unsupervised manner, i.e., without access to the manually edited endings. The unsupervised dataset contains 96,867 pairs of original and counterfactual five-sentence stories for training. The development and test sets each contain 1871 original stories, and each original story has one counterfactual condition and three rewritten counterfactual endings.

Evaluation Metrics
In prior research conducted by Qin et al. [12], model performance is evaluated using metrics such as BLEU [69] and BERTScore [70]. The BLEU metric counts the overlapping n-grams between the generated and reference endings, and BERTScore computes their cosine similarity using BERT encodings. However, it was found that while BLEU effectively measures the minimal-edits property, its correlation with human judgments is relatively weak; BERTScore faces the same problem. In more recent investigations, Chen et al. [20] introduced two novel metrics, ENTScore and HMean, which align more closely with human evaluation judgments and provide improved consistency when evaluating model performance. Specifically, ENTScore estimates the probability that an ending is entailed by the counterfactual context, and HMean is the harmonic mean of ENTScore and BLEU, providing a balanced assessment of coherence and minimal edits. We therefore focus on the HMean metric in our experiments.
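As a concrete reading of the metric, HMean is simply the harmonic mean of ENTScore and BLEU; a minimal sketch, assuming both scores are reported on the same 0-100 scale:

```python
def hmean(bleu, entscore):
    """Harmonic mean of BLEU and ENTScore: high only when BOTH
    minimal editing (BLEU) and coherence (ENTScore) are high."""
    if bleu + entscore == 0:
        return 0.0
    return 2 * bleu * entscore / (bleu + entscore)
```

Because the harmonic mean collapses toward the smaller of its two arguments, a model that excels on only one of the two metrics cannot obtain a high HMean.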

Implementation Details
All experiments are implemented on an NVIDIA Tesla V100 GPU with the PyTorch framework (https://pytorch.org, accessed on 5 October 2023). In the skeleton extractor module, we use ConceptNet NumberBatch word embeddings (https://github.com/commonsense/conceptnet-numberbatch, accessed on 5 October 2023) to calculate the similarity between tokens. In the commonsense generator module, we use the COMET model (https://github.com/atcbosselut/comet-commonsense, accessed on 5 October 2023) to expand the commonsense knowledge for counterfactual conditions. In the ending generator module, we use the medium version of GPT-2 from HuggingFace's Transformers library (https://huggingface.co/gpt2-medium, accessed on 5 October 2023) as the base decoder. We use Adam optimization for both models, with initial learning rates of 5 × 10−5 and 1.5 × 10−4, respectively. A warm-up strategy is applied with the number of warm-up steps set to 2000. The batch size for the training phase is set to 8. We train the CLICK model for 15 epochs and select the best models on the validation set. During the inference stage, we use top-k sampling with the temperature set to 0.7 and k set to 40.
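The warm-up strategy can be sketched as a linear ramp over the first 2000 steps; whether and how the learning rate decays afterwards is not specified in the text, so the constant tail below is an assumption:

```python
def warmup_lr(step, base_lr, warmup_steps=2000):
    """Linearly scale the learning rate from ~0 up to base_lr over warmup_steps,
    then hold it constant (assumed tail behaviour)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```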

Compared Approaches
We compare CLICK with the following baselines:

• GPT2-M [12]: uses the pre-trained GPT-2 model for story ending rewriting. The method receives the story premise and the counterfactual condition as input, without any training on the dataset.

• GPT2-M + FT [12]: fine-tunes the pre-trained GPT-2 model to maximize the log-likelihood of the stories in the ROCStories corpus. The premise and the counterfactual condition are provided as input.

• DELOREAN [71]: an unsupervised decoding algorithm that can flexibly incorporate both past and future contexts using only off-the-shelf, left-to-right language models and no supervision. The method receives the story premise and the counterfactual condition as input, without any training on the dataset.

• EDUCAT [20]: an editing-based unsupervised approach for counterfactual story rewriting, which includes a target position detection strategy and a modification action.

• Human: one of the three ground-truth counterfactual endings edited by humans. The results are from [20].

• CLICK-α-w/o-kno: a version of CLICK that does not use commonsense knowledge. This variant receives the story premise, the counterfactual condition, and the skeleton as input; the correlation threshold α in the skeleton extractor module is set to 0.2.

• CLICK-w/o-ske: a version of CLICK that does not use the skeleton. This variant receives the story premise, the counterfactual condition, and the commonsense knowledge as input.

• CLICK: the full version of our method.

Main Results
To verify the effectiveness and superiority of our proposed method, we conducted a comprehensive comparison with the state-of-the-art baseline models; the experimental results are shown in Table 2. (1) Compared to GPT2-M, GPT2-M + FT, and DELOREAN, which are either zero-shot or fine-tuned on the ROCStories dataset, our CLICK method performs better on the comprehensive evaluation metric HMean. While these methods exhibit high ENTScore values, this can be attributed to the fluency and consistency of unconstrained free generation, where the original story ending exerts minimal control over the generated content. Furthermore, all of them exhibit lower BLEU scores. This suggests that pre-trained generation models cannot naturally adapt to counterfactual generation tasks, and that fine-tuning on similar story datasets fails to teach the model counterfactual rewriting and the minimal-editing constraint. In contrast, our approach achieves a better balance between minimal editing and consistency, making it more appropriate for counterfactual generation tasks.
(2) In comparison to the editing-based approach EDUCAT, our method achieves improvements of 2.65 and 4.42 in BLEU and ENTScore, respectively, as well as a 3.84 increase in the comprehensive metric HMean. The improvement in BLEU indicates that our method better preserves words from the original ending that do not need modification, demonstrating enhanced accuracy in detecting causal conflicts between the original ending and the counterfactual condition. The improvement in ENTScore shows that the endings generated by our method adhere more effectively to the guidance provided by the counterfactual condition, demonstrating the ability of CLICK to enhance the alignment between the counterfactual condition and the generated ending.
(3) To comprehensively assess the model's capabilities in terms of minimal editing and counterfactual consistency, we concentrate on the comprehensive evaluation metric HMean. While previous methods excel on individual metrics, this does not guarantee their ability to fulfill both requirements of the task. To illustrate this, we introduce two variants of the CLICK method. CLICK-α-w/o-kno, which maximally preserves words from the original ending in the counterfactual ending, achieves the highest BLEU and BERTScore but falls short of human-level performance on ENTScore. Conversely, CLICK-w/o-ske relies solely on commonsense knowledge to guide the generation of counterfactual endings, resulting in the highest ENTScore. However, the endings generated by CLICK-w/o-ske fail to achieve minimal editing, as they do not make adequate use of the original ending as a constraint, and they perform comparably to the GPT2-M and GPT2-M + FT methods on BLEU. Thus, while a variant of our method may achieve superior performance on a single metric, this is insufficient to evaluate overall performance on the task, necessitating a comparison using the comprehensive evaluation metric.

Ablation Study
We conduct ablation studies to assess the effectiveness of each component and introduce the following variant models for comparison. The experimental results of the ablation studies are presented in Table 3. It is evident that the model exhibits inferior performance on the comprehensive evaluation metric upon removing each component. From these findings, we can draw the following conclusions: (1) After removing the skeleton extractor module, the model experiences a significant 36.3% drop in HMean and a 44.2% drop in BLEU, indicating that the skeleton module has a positive impact on the minimal editing of the original ending and on the preservation of words in the generated ending that are consistent with the counterfactual condition. Moreover, this observation further demonstrates the validity of utilizing causal relationships between event sequences to test causal invariance.
(2) After removing the commonsense generator module, the performance of the model decreases by 1.03% in HMean and 1.24% in ENTScore, suggesting that this module helps promote consistency between the generated ending and the counterfactual condition. Moreover, this observation further demonstrates the effectiveness of incorporating commonsense knowledge to strengthen the guidance of the counterfactual condition.
(3) The removal of both the skeleton and knowledge modules leads to a 38.4% decrease in HMean, highlighting the insufficiency of relying solely on a generic generative model for the task. This can be primarily attributed to the model's inadequate comprehension of counterfactual invariance within the causal narrative chain and its limited ability to perform precise minimal editing based on the original ending.

In the correlation detection stage, we employ a similarity threshold α to determine the relationship between the similarity score and correlation: when the cosine similarity between two word vectors surpasses the threshold, the corresponding words are considered correlated. For this set of experiments, we use the NumberBatch word vectors and test various thresholds, with the results presented in Figure 5. The observations reveal that higher thresholds lead to higher BLEU and BERTScore values but a lower ENTScore. This outcome can be attributed to the fact that a higher threshold excludes highly relevant words from removal. Consequently, a larger number of words are retained in the skeleton, and thus in the generated counterfactual ending, which leads to a higher BLEU score. However, this also retains more erroneous words, which can interfere with the model as part of the input and lead to a lower ENTScore. As a result, we select the NumberBatch vectors for our method and consider the HMean composite metric; after careful analysis, we determine that a threshold of 0.1 is the optimal parameter.
The skeleton extractor module evaluates the correlation between two tokens by computing the cosine similarity between their word embeddings. We compare two types of word embeddings for this task: BERT [72], which is based on a pre-trained language model, and NumberBatch, which is derived from a knowledge graph. The results in Figure 6 indicate that the method performs better with NumberBatch word vectors than with BERT across varying cosine similarity thresholds. With BERT vectors, the model obtains a relatively high BLEU score but a low ENTScore. This suggests that the method prioritizes preserving words from the original ending in the counterfactual ending, raising BLEU, but inadvertently retains certain words that contradict the counterfactual condition, lowering ENTScore. These observations highlight the limitations of BERT vectors for effectively detecting causal invariance. Word vectors from pre-trained language models like BERT offer extensive semantic representation capabilities, whereas word vectors trained on corpora like knowledge graphs leverage structured knowledge, making them more effective at capturing entity relationships and aligning better with our objectives. We therefore ultimately choose the NumberBatch vectors for our method.

Effect of Commonsense Knowledge
In this section, we conduct experiments to investigate the influence of different types of commonsense knowledge on the performance of the model; the experimental results are summarized in Table 4. To ensure a fair comparison, we augment the counterfactual conditions with various types of commonsense knowledge as guidance for the model, while using the same basic skeleton (NumberBatch word embeddings, threshold = 0.1). Sap et al. [66] propose nine if-then relation types to differentiate causes from effects, agents from themes, voluntary events from involuntary events, and actions from mental states. Given the counterfactual conditions, we utilize the COMET model to generate nine natural language expressions and examine the influence of various combinations of relation types on the performance of the model. It should be emphasized that we focus on the mental states, events, and character attributes related to explicitly mentioned participants. Specifically, we concentrate on the following categories: (XEffect + XWant + XNeed) in the If-Event-Then-Event category, (XIntent + XReact) in the If-Event-Then-Mental-State category, and XAttr in the If-Event-Then-Persona category.

Firstly, as shown in Table 4, commonsense knowledge can evidently serve as a valuable aid to the model. Augmenting the counterfactual conditions with XIntent, XNeed, XAttr, or OEffect commonsense knowledge improves the model on the HMean metric compared to relying solely on the skeleton. This improvement can be attributed to the provision of more detailed counterfactual conditions; for example, XIntent indicates the intention of subject X to perform a specific event, and such reasoning information assists the model in predicting potential subsequent events. It is worth noting that certain types of knowledge may not have a positive impact on the model, but instead introduce interference that leads to a slight decrease in performance. For instance, XReact reflects the emotional response of subject X after the event. Given the numerous possibilities generated by COMET, we select only one possibility as input to the model for testing. Consequently, the endings generated by the model may exhibit emotional states that deviate from the narrative direction of the manually rewritten endings in the test set. This discrepancy highlights the inherent conflict between fostering varied text generation and maintaining consistency with the original data distribution.
Secondly, integrating multiple types of knowledge enhances the model more than relying on a single type of information. In terms of the overall HMean metric, the knowledge combinations in the If-Event-Then-Event category and the If-Event-Then-Mental-State category outperform their corresponding single knowledge types. This evidence demonstrates that comprehensive knowledge can amplify the enhancement provided by individual knowledge types: by increasing the richness of the input knowledge, the model can develop a deeper understanding of past events and improve its ability to plan for the future.
Furthermore, we conduct experiments to examine the impact of incorporating commonsense knowledge about the other participants involved in the event, specifically the If-Event-Then-Others relations in the table. Surprisingly, this knowledge type also yields a slight improvement in story rewriting. The improvement in the overall metric primarily manifests as a higher BLEU score, indicating that knowledge about other participants helps the model generate counterfactual endings that closely resemble the original ones. We believe this is possible for the following reason: the presence of auxiliary characters and their interactions with the protagonist can drive the story's development, and while changes in the story conditions often alter the protagonist's behavior, auxiliary characters tend to maintain their original personality and behavior across both the original and rewritten narratives. Thus, incorporating commonsense information about these auxiliary characters helps the model stay closer to the original ending when rewriting the counterfactual ending.

Case Study
Table 5 shows two examples of counterfactual endings generated by different methods. The Sketch&Customize method is a supervised approach proposed by Hao et al. [19]. In the first example, the EDUCAT method yields logically implausible content and exhibits incoherent story progression. Conversely, the Sketch&Customize method generates logically coherent content but suffers from excessive editing. In contrast, the CLICK method shows the advantages of minimal editing and counterfactual consistency: it seamlessly integrates the rainy scene, incorporating new elements such as a rainbow and having lunch by the window, while maximizing the retention of the original ending's wording.
In the second example, the EDUCAT method fails to make the essential targeted alterations to the original ending in light of the counterfactual condition: water-related events from the original scenario persist in the new ending, and no specific adjustments are made for the ingredients mentioned in the counterfactual condition. The content generated by the Sketch&Customize method suffers from logical incoherence and semantic expression issues. While the method attempts to preserve the original ending content as much as possible, the replacement and modification of words are often misplaced, leading to semantic confusion in the corresponding sections. The content generated by our CLICK method successfully maintains a significant portion of the original ending while incorporating new words, including mold and pasta, in relation to the mentioned ingredients.
Table 5. Two cases of counterfactual endings generated by CLICK and baselines. Red words represent logical incoherence and low counterfactual consistency. Green words represent the preservation of the original ending. The gray boxes represent problems with the generated ending.

Case 1 Premise
The day was sunny and warm, a perfect day for a picnic.

Orig Condition
Mom, James, and Renee went to the park.

CF condition
Rain started to fall.

Orig Ending
First they went for a walk. Then they had a picnic by the river. They all had a good time.

CF Ending
They found a covered seating area. Then they had a picnic there. They all had a good time.

Generated Counterfactual Ending

EDUCAT
So I went for a walk. Yes, I had a picnic by the river. I had a great time. (Logical Incoherence)

Sketch&Customize
Rain was then followed by a thunderstorm. All the picnic food was soaked. Then it was a cold day. (Overediting)

CLICK
First they found a rainbow. Then they had a great lunch by the window. They all had a good time.

Case 2 Premise
Tom was making some pasta.

Orig Condition
He boiled some water.

CF condition
He took all of the ingredients out of the pantry.

Orig Ending
He left the kitchen to answer an important phone call. When he came back there was water all over the ground. He turned off the stove and cleaned up the kitchen.

CF Ending
He left the kitchen to answer an important phone call. When he came back the dog had knocked everything over. He picked up the food and cleaned up the kitchen.

EDUCAT
He left the kitchen to take an urgent phone call. When he got home, there was water all over the ground. He turned off the water and left the kitchen. (Low Counterfactual Consistency)

Sketch&Customize
He sat down to answer an important phone call. When he came back he was the ground. Tom turned off the oven to start it up. (Logical Incoherence)

CLICK
He left the kitchen to answer an important phone call. When he came back he found there was mold all over the pasta. He cleaned off the mess and cleaned up the kitchen.

Conclusions
In this paper, we propose a counterfactual generation framework based on causal inference in event sequences and commonsense knowledge incorporation. The primary objective is to maintain the minimal-editing constraint while also incorporating new words and generating plausible story endings. To eliminate content in the story ending that may conflict with the counterfactual condition, we employ causal graph analysis and utilize the correlation between adjacent events in the story ending to iteratively identify the content of the original ending affected by the condition. To enhance the causal consistency between the story ending and the counterfactual condition, we integrate diverse, structured commonsense knowledge, facilitating the construction of coherent causal relationships and the modification of conflicting words to ensure a cohesive narrative. In the future, we intend to explore counterfactual rewriting in longer texts. This expanded exploration will provide a more rigorous evaluation of the model's ability to balance counterfactual consistency and minimal editing of the original text. Furthermore, analyzing causal narrative chains within longer texts presents a heightened challenge, and our forthcoming work aims to tackle this complexity.

Limitations
Our approach primarily relies on causal relationships between adjacent events for counterfactual invariance detection. We have experimentally validated the effectiveness of this approach on the short-text datasets commonly used in the current research domain. However, when dealing with the complexity and diversity of real-world texts, particularly longer documents, we acknowledge the presence of more intricate narrative structures, more complex causal relationships, and longer temporal dependencies. These intricacies necessitate methodological adjustments and adaptations to ensure applicability in longer and more complex text settings. Therefore, in the future, we intend to extend our approach to tackle the challenges posed by complex long-text scenarios.
signifying whether an individual is administered a specific medication, while Outcome represents a binary variable indicating the occurrence of a particular side effect. If, in this context, men are more likely than women to receive the medication and also exhibit a higher propensity to experience the side effect, gender acts as the common cause (Confounder) linking Treatment and Outcome in Figure A1. After constructing a causal graph for the three variables, the subsequent task is to quantitatively assess the causal impact of the treatment on the outcome. Conceptually, confounding introduces an association between treatment and outcome because it acts as a causal factor for both. However, the observed partial association between Treatment and Outcome can be attributed to the biased pathway originating from Treatment, passing through Confounder, and eventually reaching Outcome. To precisely compute the causal effect, it is imperative to block this biasing pathway through adjustments involving the Confounder.

Figure 1. An example of an original story and counterfactual story pair. Given an original story and a counterfactual condition, the task requires changing the original story ending to a counterfactual ending. The words highlighted in red in the original condition represent the elements that are intervened with by the counterfactual condition. The words highlighted in red in the original ending indicate the content requiring modification due to conflicts with the counterfactual condition. The blue words in the counterfactual story represent the modified content. The dotted lines connecting the red contents indicate a causal connection.

Figure 2. An example of a causal graph. X: Treatment plays a direct causal role in determining the value of Outcome. Y: Outcome is the effect of the causal path. Z: Confounder is a common cause of both Treatment and Outcome.

Figure 3. Causal graphs describing the generation process of each individual event sequence. s_i denotes the i-th sentence in the story, and s_{3:5} denotes the original ending, which is the target for rewriting in this task.

Figure 4. Overview of our proposed framework, which consists of three main components: the skeleton extractor, the commonsense generator, and the generative model. The skeleton extractor takes the entire input context as input and aims to eliminate words that have strong associations with the original condition by detecting causal invariance among the words in the original ending. The internal structure of the skeleton extractor is depicted on the left side of the diagram; it acquires, step by step, the words highly correlated with the original condition through the calculation of word vector similarity, and these correlated words are then removed to obtain the fundamental skeleton. The commonsense generator takes the counterfactual condition as input and primarily utilizes COMET to provide extensive and diverse structured commonsense information for subsequent rewriting. The outputs of the skeleton extractor and the commonsense generator, along with the premise and the counterfactual condition, are combined and fed into the generative model, which produces the counterfactual story ending.

• w/o skeleton: removes the skeleton extractor module;
• w/o knowledge: removes the commonsense generator module;
• w/o ske w/o kno: removes both the skeleton extractor module and the commonsense generator module.

Effect of Skeleton

In this section, we perform experiments to investigate the effect of the similarity threshold and the word embedding selection on the skeleton extraction process. The experimental results are summarized in Figures 5 and 6 and focus solely on the skeleton extractor module, without incorporating commonsense knowledge into the counterfactual conditions. Specifically, we utilize the skeleton extractor module to obtain the fundamental skeleton and construct the input sequence {[PRE]p[CON]c[SKE]s[END]} for both the training and inference stages.

Figure 5. Analysis experiment of similarity threshold selection in the skeleton extractor module.

Figure 6. Analysis experiment of word embedding selection in the skeleton extractor module.

Figure A1. An example of a causal graph.

Algorithm 1 Sequence-aware correlation computation
Input: intervention: initial intervention words; e = {s_n, s_{n+1}, ..., s_m}: the three sentences in the ending; α: threshold used to determine correlation
Output: {causal_n, causal_{n+1}, ..., causal_m}: the causal word set in each sentence
1:  causal_words ← intervention            ▷ Initialize the cause-word set used for the first sentence
2:  for j ← n to m and s_j ∈ e do
3:      causal_j ← ∅
4:      for each word ∈ s_j do
5:          similarity ← avg cosine_similarity(word, causal_words)   ▷ Average cosine similarity between word and all words in causal_words
6:          if similarity > α then
7:              causal_j ← causal_j ∪ {word}    ▷ Add word to the causal word set of the j-th sentence
8:          end if
9:      end for
10:     causal_words ← causal_j             ▷ Update the cause-word set used for the next sentence
11: end for
12: return {causal_n, causal_{n+1}, ..., causal_m}

Table 1. Examples of If-Event-Then-X commonsense knowledge generated by COMET. For inference dimensions, "x" and "o" pertain to PersonX and others, respectively.

Table 2. Automatic evaluation results on the test set of TIMETRAVEL. Bold numbers denote the best results.

Table 3. Experimental results of the ablation study on CLICK. Bold numbers denote the best results. The numerical values following the downward arrows indicate the extent to which the model's performance declines after the removal of that module.

Table 4. Effect of various types of commonsense knowledge. Bold numbers denote the best results.