A Multi-Modal Story Generation Framework with AI-Driven Storyline Guidance

Abstract: An automatic story generation system continuously generates stories with a natural plot. The major challenge of automatic story generation is maintaining coherence between consecutively generated stories without human intervention. To address this, we propose a novel multi-modal story generation framework that includes automated storyline decision-making capabilities. Our framework consists of three independent models: a transformer encoder-based storyline guidance model, which predicts a storyline by solving a multiple-choice question-answering problem; a transformer decoder-based story generation model, which creates a story that describes the storyline determined by the guidance model; and a diffusion-based story visualization model, which generates a representative image visually describing a scene to help readers better understand the story flow. Our proposed framework was extensively evaluated through both automatic and human evaluations, which demonstrate that our model outperforms the previous approach, suggesting the effectiveness of our storyline guidance model in making proper plans.


Introduction
Story generation is one of the most creative tasks, one that guides people from being readers to becoming writers. With the emergence of models such as GPT-2 [1], BART [2], and more advanced natural language generation models [3,4], techniques for completing short stories have made enormous progress. However, maintaining coherence remains a key challenge of automatic story generation. The best way to maintain coherence is to plan before writing each paragraph, just as a human would do when writing a novel. In practice, storytellers determine the elements that make up the story, or plan the plot with characters, motifs, and background, to keep the story coherent. Since a story involves several scenes, each composed of a number of paragraphs, paragraph-level coherence is the most important aspect of automatic story generation. Just as humans plan a storyline before writing a text, the system needs to accomplish two goals: (1) planning a storyline and (2) generating a paragraph conditioned on the storyline.
When the system tries to generate the next paragraph directly from the current paragraph without any planning, it may struggle to produce a story with a natural flow. Several studies introduce planning approaches to maintain the coherence of a story. They utilize diverse approaches, such as using character personalities [5,6] and matching scene-level contexts using events [7][8][9][10]. Other attempts leverage commonsense knowledge [11,12] or extract keywords by global planning [13,14]. However, these methods do not provide continuous supervision of the system. Instead, they plan all paragraphs at the same time, non-sequentially. For the system to generate a story that flows well, it requires a controller that can provide appropriate directions whenever the system generates a paragraph. This task has traditionally been accomplished by human experts. Previous research [15] proposed a machine-in-the-loop system that allows users to generate paragraphs by directly feeding a detailed storyline made up of various entity combinations into the model. Although this approach considerably improves the performance of automatic story generation, it involves human intervention, which hinders the automation itself. Motivated by the way people write novels, we introduce a method in which the system plans and generates on its own. As shown in Figure 1, the system automatically and continuously predicts the storyline before generating stories. Without any human intervention, coherence can be achieved by continuously guiding the generation process based on the predicted storyline in real time. By doing so, the system can generate full stories from an initial source with both coherence and engagement. We propose an integrated framework that enables planning and generation.
First, we propose a model that can predict the storyline (multiple entities) while maintaining paragraph-level coherence. We introduce a storyline guidance model that predicts three types of entities (characters, events, and places) by leveraging the multiple-choice question answering (MCQA) approach. Our storyline guidance model predicts the source from which the next paragraph is composed. As shown in Figure 2, the predicted entities can be newly ordered while maintaining coherence, even if a story is based on the same character, event, and place. Second, we propose a GPT-2-based story generation model, which generates a paragraph based on the predicted storyline. The two models iteratively predict and generate what the other needs. Additionally, to further arouse the readers' interest, our system includes images representing each paragraph. We recognize that a story often revolves around a single visual concept, which is why we have introduced the concept of story visualization [16][17][18][19][20] with a multi-modal setting [21]. Story visualization aims to generate visual representations that correspond to the themes depicted in the story. Our approach involves generating a paragraph-representative image that captures the background information of the story produced by our proposed framework. Previous story visualization models mainly relied on GAN-based models. Moreover, they were trained on limited datasets [18] containing simple captions and corresponding images. While they succeed in generating images from captions, our goal is to tackle the greater complexity of full-length stories in terms of both their length and content. To address this problem, we leverage a diffusion-based text-to-image generation model, which is more advanced than GAN-based models, for our story visualization approach.
As shown in Figure 1, our proposed story visualization model explicitly extracts background information from the generated story and provides a visually descriptive image. The resulting image enhances readers' imagination and improves their engagement with the story.
In summary, we present a novel multi-modal story generation framework that incorporates all these models. The framework generates a sequence of stories paragraph by paragraph, where the storyline guidance and story generation models are employed sequentially for each paragraph. Once all paragraphs constituting one scene are generated, the story visualization model creates an image representing the scene. To plan the next paragraph, the BERT-based storyline guidance model predicts the entities of its storyline: character, event, and place. For story generation, we introduce a GPT-2-based story generation model. Finally, the story visualization model is a diffusion-based text-to-image generation model that creates an image to aid readers in understanding the story. We evaluate each model independently using appropriate evaluation metrics and report the results. Our experiments demonstrate that the generated stories are coherent both logically and visually.
The remainder of this paper is organized as follows. Section 2 provides a brief review of the relevant literature. Section 3 defines the task, presents our framework, and introduces our models separately. Notably, Section 3.2 provides an in-depth discussion of our storyline guidance model. Section 4 outlines the details of our experiments and provides a discussion of our findings. Lastly, Section 5 presents our conclusions and future work.

Related Work
This section is divided into three parts. First, we provide a brief introduction to neural story generation using language generation models. Second, we describe approaches to controllable story generation and how ours relates to them. Finally, we introduce story visualization models, which generate images to accompany the generated stories.

Neural Story Generation
Prior to the development of deep learning, language generation models performed sentence-level generation rather than paragraph-level generation. As the attention mechanism has been widely used since the emergence of Seq2Seq [22], pre-trained end-to-end neural models such as GPT-2 [1] and BART [2] have become established as the main models in language generation [23][24][25][26]. In story generation, several studies have modeled the dependency between the current sentence and the generated next sentence through entities [27,28], personas [6], or events [10]. However, as sentences become paragraphs, the models have difficulty maintaining paragraph-level coherence. To solve this problem, several attempts have been made to decompose story generation into a multi-stage framework [7,29-32], as real-world settings must be taken into account [33]. These models use a hierarchical strategy that plans first and then creates a coherent story based on the plan. However, automated storyline planning does not yet match that of human experts, since all of the planning is limited to one entity.

Controllable Story Generation
Controllable text generation means controlling a language generation model so that it decodes a sentence with a specified semantic meaning. The author of [34] explained this by introducing two types of control: soft control and hard control. Soft control aims to generate text on a general topic, whereas hard control aims to generate a specific word directly in the decoding stage (for example, in beam search). For story generation, a story is created in a constrained way using plot-based or planning-based control [23,29,35] or persona-based control [6,36], and these methods correspond to soft control. Unlike PersonaChat [36], which is used for dialogue models, a story generation model needs to generate long sentences. Story datasets do not explicitly indicate their storylines, so it is extremely challenging to control the story generation process throughout the long sentences. To address this, Ref. [37] attempts to control the emotional trajectory of the generation process by applying reinforcement learning. Outline-conditioned generation mitigates the problem in that it provides more flexibility than plan-based, persona-based, or event-based methods. The previous study [15] solves this problem by clarifying each entity with fine-grained annotations in its dataset. Our purpose in controlling the story is to let the model provide storyline guidance to the story generation model. We expect that multiple entities will control our system properly throughout the long sentences.

Story Visualization
The story visualization task with an image generation model was proposed by [18]. As image generation models based on GANs have been actively studied in the computer vision field [38], story visualization models also rely heavily on GANs. Several attempts have been made to generate high-quality images in the field of GAN-based text-to-image synthesis [16,39-41]. StoryGAN [18], which includes a text encoder, text and image discriminator, and image generator, was the first proposed model. DuCo-StoryGAN [17] proposed dual learning and copy-transform to improve the semantic alignment of generated images, and it introduced a character-based evaluation metric. VLC-StoryGAN [42] focused on structured input text and guided image generation using a parse tree structure, commonsense knowledge, dense captioning, and foreground-background information. To preserve the global consistency of characters and scenes, a character-preserving coherent story visualization method was introduced in [43]. More recently, StoryDALL-E [19] leveraged DALL-E [44], resulting in greatly improved performance.

Methods
This section is divided into four parts. In the first part, we provide an overview of our proposed framework and task definition. The following three parts focus on the three models that make up the framework and describe each of their functions.

Task Definition and Our Framework Overview
The goal of our task is to minimize human intervention in neural story generation. To accomplish this, we propose a multi-modal story generation framework composed of three stages. First, our storyline guidance model predicts the next storyline entities. Second, our story generation model generates a paragraph based on the entities predicted by the guidance model. Finally, a story visualization model generates representative images of the scenes to arouse the readers' interest in the story. Previous studies have shown that language generation models rely on storylines selected by humans to generate stories. In our framework, however, the storyline guidance model predicts multiple entities automatically. These entities have semantic relationships with each other, which allows us to leverage them to generate a new story. Once the storyline is predicted, our story generation model generates a story paragraph using the predicted storyline. By using an image generation model, we ensure scene-level coherence to aid the readers' understanding. In our framework, all models run sequentially, generating a new story with a new sequence. The framework is illustrated in Figure 3.
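The three-stage loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the callable signatures, and the dictionary-based storyline are all assumptions introduced here, and the three models are passed in as plain callables so the control flow can be read on its own.

```python
from typing import Callable, Dict, List

# Hypothetical storyline record with the three entity types from the text.
Storyline = Dict[str, str]  # keys: "character", "event", "place"


def generate_scene(
    first_paragraph: str,
    num_paragraphs: int,
    predict_storyline: Callable[[str, str], Storyline],
    generate_paragraph: Callable[[str, Storyline], str],
    visualize_scene: Callable[[List[str], str], str],
) -> Dict[str, object]:
    """Guidance then generation for each paragraph, then one image per scene."""
    paragraphs = [first_paragraph]
    storylines: List[Storyline] = []
    for _ in range(num_paragraphs):
        current = paragraphs[-1]
        # Stage 1: plan the next paragraph from the first and current paragraphs.
        storyline = predict_storyline(first_paragraph, current)
        storylines.append(storyline)
        # Stage 2: write the next paragraph conditioned on the plan.
        paragraphs.append(generate_paragraph(current, storyline))
    # Stage 3: visualize the finished scene from the last predicted place.
    place = storylines[-1]["place"] if storylines else ""
    image = visualize_scene(paragraphs, place)
    return {"paragraphs": paragraphs, "storylines": storylines, "image": image}
```

Passing the models as callables keeps the sketch independent of any particular BERT, GPT-2, or diffusion library; in practice each callable would wrap the corresponding fine-tuned model.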

Storyline Guidance Model
The most crucial elements in crafting a story plot are the characters, events, and places, which are the entities we predict. The character is the central protagonist in the paragraph, while the event refers to what the protagonist does. The place is the backdrop against which the protagonist operates. To predict these entities in our story, we employ BERT [45], a language representation model that excels at extracting deep, pre-trained, bidirectional representations. We fine-tune BERT to predict all three entities by training it on the MCQA problem, which requires a context, a single specified answer, and candidates. We set the first and current paragraphs of a story as the context, while the specified characters, events, and places that will feature in the next scene are set as the answer. During training, BERT takes the context, question, and answer separated by [SEP] tokens, which mark the boundaries between the different segments. The model then calculates the categorical cross-entropy loss over the candidates and learns through backpropagation. Finally, the model selects the candidate with the highest probability as the answer to the question "What is the entity in the next paragraph?". Our model learns the context of the current story during the training stage and answers the question above. Using our storyline guidance model, we select an appropriate answer from the five candidates we created. Figure 4 illustrates the details of our model. Our model can independently predict the next storyline entities based on the current paragraph, allowing the story generation process to continue seamlessly. With this model, users can also control specific entity types and guide the automated story generation process in real time according to their preferences.
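The multiple-choice formulation can be illustrated with a small, library-free sketch. It assumes that each candidate is paired with the shared context and question and that the encoder returns one score (logit) per candidate; the exact segment layout and the helper names `build_mcqa_inputs`, `cross_entropy`, and `predict` are assumptions made here for illustration, not details confirmed by the paper.

```python
import math
from typing import List, Sequence

QUESTION = "What is the entity in the next paragraph?"


def build_mcqa_inputs(context: str, candidates: Sequence[str]) -> List[str]:
    """Pair the shared context and question with each candidate answer.

    The segment layout is an assumption; [CLS] and [SEP] are BERT's
    standard special tokens.
    """
    return [f"[CLS] {context} [SEP] {QUESTION} [SEP] {c} [SEP]" for c in candidates]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one score per candidate."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], gold_index: int) -> float:
    """Categorical cross-entropy of the gold candidate (training loss)."""
    return -math.log(softmax(logits)[gold_index])


def predict(logits: Sequence[float]) -> int:
    """Index of the highest-probability candidate (inference)."""
    return max(range(len(logits)), key=lambda i: logits[i])
```

With five candidates and uninformative scores, the loss equals log 5, which is a useful sanity check for the start of training.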

Story Generation Model
The language generation model takes sentences and generates the next tokens using an auto-regressive process. To obtain the best sequence for the subsequent tokens, the model predicts a probability distribution for the next token given the previous tokens. The probability of a sequence $y$ can be obtained iteratively using the chain rule. GPT2-medium [1] is a general model consisting of multiple transformer blocks of multi-head self-attention modules, and its training objective is to minimize the negative log-likelihood of the target tokens. Storium-GPT2 [15] can be defined as follows: given an input $V = (v_1, v_2, \ldots, v_M)$ with maximum length $M$, the model generates a coherent story $Y = \{y_1, y_2, \ldots, y_{|Y|}\}$.
The final embedding $E_t$ at position $t$ is computed by summing the positional embedding $p_t$ and the token embedding $v_t$ with a set of $n$ segment token embeddings $\{s_1, s_2, \ldots, s_n\}$, which is the method proposed in [46]. The probability distribution of $y$ is computed from $H_t$, the decoder's hidden state at the $t$-th position computed from the context (the story), together with the trainable parameters $W$ and $b$, which are learned by gradient descent on the loss function. During training, the summed embedding vector carries the information introduced in the Storium dataset [15]. The most critical challenge for generation models used for story generation [1,2] is that previous stories are too long to be used recurrently as inputs to the model. Therefore, when an embedding is generated, it is necessary to limit the maximum sequence length of each field so that as much information as possible can be included in the input. Ref. [15] uses the Cassowary solver [47] to solve this problem and ensures that each segment of the input receives at least a minimum length. We apply this method as it is, but we fine-tune the model to have greater consistency with respect to the last sentences of the previous scene by appropriately trimming unnecessary tokens. It is difficult to fit all of the context in the input of [15] within the maximum sequence length; thus, a larger model would be needed to use the existing method unchanged. To help the attention mechanism work better, we delete the character description from the input and add the last two complete sentences of the first entry (establishment) as input, and then pad it to fit the maximum sequence length. The entire embedding is created by applying the method in [15] as it is, stacking it in 2 stacks, and adding each segment so that it does not exceed the maximum sequence length of the GPT2-medium model.
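The chain rule and the training objective referred to above can be written in their standard form as follows. This is a sketch in assumed notation ($y_{<t}$ for the tokens before position $t$, $\theta$ for the model parameters); the original formulation may differ in detail.

```latex
% Autoregressive factorization (chain rule) of a token sequence y
\begin{equation}
  p(y) = \prod_{t=1}^{|y|} p\left(y_t \mid y_{<t}\right)
\end{equation}

% Negative log-likelihood of the story Y given the input V
\begin{equation}
  \mathcal{L}(\theta) = -\sum_{t=1}^{|Y|} \log p_\theta\left(y_t \mid y_{<t}, V\right)
\end{equation}
```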
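The input preparation described above (keeping the last two complete sentences of the establishment entry, fitting every field into the length limit, then padding) can be sketched as below. Note that this sketch swaps in a simple greedy allocation instead of the Cassowary solver used in [15]; the field names, the minimum length, and the pad token are hypothetical values chosen only for illustration.

```python
import re
from typing import Dict, List


def last_complete_sentences(text: str, k: int = 2) -> str:
    """Return the last k sentences that end with '.', '!' or '?'."""
    sentences = re.findall(r"[^.!?]+[.!?]", text)
    return " ".join(s.strip() for s in sentences[-k:])


def allocate_budget(lengths: Dict[str, int], max_len: int, min_len: int) -> Dict[str, int]:
    """Greedy stand-in for the constraint solver (not Cassowary).

    Every field first receives up to min_len tokens; the remaining budget
    is then handed out in field order until max_len is used up.
    """
    budget = {name: min(n, min_len) for name, n in lengths.items()}
    remaining = max_len - sum(budget.values())
    if remaining < 0:
        raise ValueError("max_len is too small for the minimum field lengths")
    for name, n in lengths.items():
        extra = min(n - budget[name], remaining)
        budget[name] += extra
        remaining -= extra
    return budget


def pad_tokens(tokens: List[str], max_len: int, pad_token: str = "<pad>") -> List[str]:
    """Truncate or right-pad a token list to exactly max_len entries."""
    return tokens[:max_len] + [pad_token] * max(0, max_len - len(tokens))
```

A real solver can express softer preferences between fields than this greedy pass, which is presumably why [15] uses one; the sketch only shows the hard constraints (a per-field minimum and a global maximum).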

Story Visualization Model
Our story visualization model is based on the Latent Diffusion Model (LDM) [48]. Before describing the LDM, we briefly explain the Diffusion Model (DM) [49]. The diffusion model is an image generation model defined by a fixed Markov chain: a forward diffusion process gradually adds Gaussian noise, and a reverse process removes it. The forward process (also called the diffusion process) starts from a sample of the real data, $x_0 \sim q(x_0)$, and runs for $T$ steps. This process defines the approximate posterior $q(x_{1:T} \mid x_0)$ through a Gaussian transition at each time-step $t$, in which $\beta_t$ denotes the variance schedule for each time-step up to the last time-step $T$ and is gradually increased ($\beta_{t-1} < \beta_t$).
The reverse process is represented as the joint distribution $p_\theta(x_{0:T})$, and it is also defined as a Markov chain, starting from $p(x_T) = \mathcal{N}(x_T; \mathbf{0}, \mathbf{I})$. The objective of the diffusion model is to approximate the mean $\mu_\theta(x_t, t)$ of the noise distribution in the reverse process. The LDM [48] was introduced to reduce computational cost while providing equal or better performance. It contains a pre-trained autoencoder whose encoder $\mathcal{E}$ produces a latent vector $z = \mathcal{E}(x)$ from an input $x \in \mathcal{D}_x$. The latent vector $z$ is the new input to the diffusion model. The LDM creates an embedding of the caption for a given image using a text encoder and augments the DM's UNet backbone with this conditioning. In the optimization, $v_*$ is a specialized token that the user inputs into the model, and $c_\theta(y)$ is a conditioning vector mapped from a conditioning input $y$.
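For reference, the forward process described above is commonly written as follows in the diffusion model literature. This is the standard formulation and is assumed, not confirmed, to match the equation intended here.

```latex
% Forward (diffusion) process: a fixed Markov chain that adds Gaussian noise
\begin{equation}
  q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad
  q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)
\end{equation}
```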
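The reverse process, the learned mean, and the embedding optimization mentioned above are usually stated as below. These are the standard forms from the diffusion model and Textual Inversion literature; the auxiliary symbols $\alpha_t$, $\bar{\alpha}_t$, $\Sigma_\theta$, and $\epsilon_\theta$ do not appear in the text and are introduced here as assumptions.

```latex
% Reverse process: a learned Markov chain starting from pure noise
\begin{equation}
  p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad
  p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
\end{equation}

% Mean of the reverse transition, with \alpha_t = 1 - \beta_t and
% \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s
\begin{equation}
  \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}
  \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)
\end{equation}

% Textual Inversion: optimize the embedding v_* of the new token
% while the latent diffusion model itself stays frozen
\begin{equation}
  v_* = \arg\min_{v}\ \mathbb{E}_{z \sim \mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}
  \left[ \left\| \epsilon - \epsilon_\theta\left(z_t, t, c_\theta(y)\right) \right\|_2^2 \right]
\end{equation}
```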
LDM outperforms other models in the field of text-to-image synthesis. However, decoding differs depending on the context (e.g., the background source), which makes it difficult to maintain the concept the user wants. To solve this problem, fine-tuning on a specific concept using few-shot images has been proposed [50,51]. We utilize Textual Inversion [50] for story visualization, and through this, we create an image that maintains one concept for the scene. Textual Inversion leverages CLIP ImageNet templates [52]: it runs the diffusion process on texts randomly sampled from several templates to fine-tune the model for a one-shot image given as input by the user. Our reference dataset [15] does not specify images for each character, so instead of creating a picture depicting a character, we focus on a background that can stimulate the writer's imagination. Through this generated image, the coherence of the entire story is enhanced for newly appearing background information, and the story maintains one concept for successive scene entries. When using Textual Inversion, the appropriate guidance scale depends on the complexity and level of detail of the textual descriptions being processed. If the textual descriptions contain a high level of detail and complexity, the guidance scale may need to be relatively small in order to capture all of the relevant information and generate high-quality images. On the other hand, if the textual descriptions are relatively simple or abstract, the guidance scale may be larger. Therefore, we test several guidance scales from 1 to 10 to find a proper value.
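To make the role of the guidance scale concrete, the sketch below assumes classifier-free guidance, which is the usual meaning of "guidance scale" for latent diffusion text-to-image models; the text does not state the guidance mechanism explicitly, so this is an assumption. The noise predictions are plain lists of floats standing in for the UNet outputs.

```python
from typing import Dict, List, Sequence


def guided_noise(eps_uncond: Sequence[float], eps_cond: Sequence[float], scale: float) -> List[float]:
    """Classifier-free guidance: move from the unconditional prediction
    toward (and past) the text-conditioned one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def sweep_scales(eps_uncond: Sequence[float], eps_cond: Sequence[float]) -> Dict[int, List[float]]:
    """Evaluate the integer guidance scales 1..10 mentioned in the text."""
    return {s: guided_noise(eps_uncond, eps_cond, float(s)) for s in range(1, 11)}
```

A scale of 1 reproduces the conditional prediction unchanged, while larger scales push the sample harder toward the text prompt at the cost of diversity.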

Experiments
This section is divided into five parts. First, we introduce the dataset used in this work. Second, we describe the experimental settings. In the following three parts, we present the experimental results for each model in our framework, along with the relevant metrics.

Dataset
In this study, we conduct experiments using the Storium dataset [15]. The dataset consists of 5743 stories, comprising 25,092 scenes and 448,264 scene entries, which we use to train both our story generation model and our storyline guidance model. Compared to conventional benchmarks such as ROCStories [53], which is composed of short sentences, the Storium dataset contains varied information about each story. The original Storium dataset includes information on event entities, character personalities, and places. To use the dataset in a multiple-choice question-answering (MCQA) setting, we reorganized it. Specifically, we concatenated the initial paragraph and the current paragraph for each event to form a context, and we provide an event with its corresponding description as the answer. To create a multiple-choice setting, we randomly select four other entities and combine them with the correct answer to form the candidates. The answers and candidates for characters and places are constructed in the same way. We use the Storium dataset because it contains numerous entities and long paragraphs, which brings our setting closer to real stories. We also exclude stories with fewer than five characters or events to ensure that enough candidates are available for a meaningful choice. An example of our modified dataset is illustrated in Figure 5.
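The reorganization into multiple-choice examples can be sketched as follows. The sketch assumes that the four distractors are drawn from the other entities of the same story, which is consistent with excluding stories that have fewer than five entities but is not stated outright; the function names and the dictionary layout are hypothetical.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def build_candidates(
    answer: str,
    entity_pool: Sequence[str],
    num_candidates: int = 5,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], int]:
    """Sample distractors from the pool and shuffle them with the answer."""
    rng = rng or random.Random()
    # Deduplicate while keeping order, and never use the answer as a distractor.
    distractor_pool = [e for e in dict.fromkeys(entity_pool) if e != answer]
    if len(distractor_pool) < num_candidates - 1:
        raise ValueError("story has too few entities to build the candidates")
    candidates = rng.sample(distractor_pool, num_candidates - 1) + [answer]
    rng.shuffle(candidates)
    return candidates, candidates.index(answer)


def build_example(
    first_paragraph: str,
    current_paragraph: str,
    answer: str,
    entity_pool: Sequence[str],
    rng: Optional[random.Random] = None,
) -> Dict[str, object]:
    """One MCQA example: context, five candidates, and the gold label."""
    candidates, label = build_candidates(answer, entity_pool, rng=rng)
    return {
        "context": first_paragraph + " " + current_paragraph,
        "candidates": candidates,
        "label": label,
    }
```

The `ValueError` branch mirrors the filtering step in the text: a story with too few entities cannot supply four distinct distractors.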

Experimental Settings
All experiments are conducted on two NVIDIA RTX 3090 cards with an Intel i7-6700 CPU (3.40 GHz) and 32 GB RAM. Our story generation model is initialized with the weights of GPT-2 medium (355 M parameters). We fine-tuned GPT-2 for 45,000 training steps with a batch size of 4, which took 10 h, using the same configuration as the Storium paper except for the input texts and the output size. For training, we apply a learning rate of $1 \times 10^{-5}$ with 5000 warm-up steps; for decoding, we use a temperature of 0.9 and a repetition penalty of 1.2. Our storyline guidance model is initialized with the weights of BERT-base (110 M parameters), and its training takes approximately 8 h. For visualization, we use the pre-trained Latent Diffusion Model (1.4B parameters) trained on the LAION-400M dataset [54] and follow the same procedure as Textual Inversion. We set the model's hyperparameters to an image resolution of 512 × 512, a batch size of 4, 4 gradient accumulation steps, 2000 training steps, and a learning rate of $1 \times 10^{-4}$. During inference, we set the guidance scale to 5, which we determined empirically. Fine-tuning takes less than 10 min with the original settings.
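The two decoding settings above (temperature 0.9 and repetition penalty 1.2) can be illustrated with a short sketch. It assumes the widely used CTRL-style penalty, in which the logits of already generated tokens are divided by the penalty when positive and multiplied by it when negative; the text does not say which variant was used, so this is an assumption.

```python
import math
from typing import Iterable, List, Sequence


def adjust_logits(
    logits: Sequence[float],
    generated_ids: Iterable[int],
    temperature: float = 0.9,
    repetition_penalty: float = 1.2,
) -> List[float]:
    """Penalize tokens that were already generated, then apply temperature."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= repetition_penalty
        else:
            adjusted[token_id] *= repetition_penalty
    # A temperature below 1 sharpens the distribution.
    return [x / temperature for x in adjusted]


def to_probabilities(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the vocabulary."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Sampling the next token from `to_probabilities(adjust_logits(...))` makes repeated tokens less likely without forbidding them outright.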

Automatic Evaluation
We adopt the following automatic metrics. Recall@k, originally used in retrieval tasks, measures the proportion of relevant items that are retrieved among the top-k items. We use it to check whether our storyline guidance model predicts the same entities as the original dataset. We set k = 1, 2. Perplexity (PPL) measures the uncertainty of the tokens predicted by the natural language generation model. BLEU-n (B-n) [55], the Bilingual Evaluation Understudy, is a commonly used metric for the quality of generated text. It measures the similarity between the generated text and the reference text by comparing their n-grams (contiguous sequences of n words). We set n = 2, 3, 4. Lexical Repetition (LR-n) [56] computes the percentage of generated stories that repeat a 4-gram at least n times. We set n = 5 to evaluate our model. These metrics cannot capture the semantic similarity between the generated text and the reference text, since they are computed directly on surface words or tokens. Therefore, we introduce BERTScore-reference (BS-r), a pre-trained model-based measurement of the semantic similarity between the generated text and the reference text. To compute it, we split sentences into tokens using the NLTK module and use the BERTScore-recall [57] with RoBERTa-large as the backbone model. This process does not require fine-tuning. We also introduce a metric, BERTScore-coherence (BS-c), to evaluate the coherence between two consecutive paragraphs. This metric computes the BERTScore-recall between the n-th generated paragraph and the (n+1)-th generated paragraph and is particularly suited to evaluating the effectiveness of our storyline guidance strategy.
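Three of the metrics above are simple enough to state directly in code. The sketch below is a minimal reading of the definitions in the text, assuming whitespace tokenization for LR-n and natural-log token probabilities for perplexity; the function names are illustrative and not taken from any evaluation toolkit.

```python
import math
from collections import Counter
from typing import Sequence


def recall_at_k(rankings: Sequence[Sequence[str]], golds: Sequence[str], k: int) -> float:
    """Fraction of examples whose gold entity is among the top-k candidates."""
    hits = sum(1 for ranked, gold in zip(rankings, golds) if gold in ranked[:k])
    return hits / len(golds)


def lexical_repetition(stories: Sequence[str], n: int = 5, gram: int = 4) -> float:
    """Percentage of stories in which some 4-gram occurs at least n times."""
    flagged = 0
    for story in stories:
        tokens = story.split()
        counts = Counter(tuple(tokens[i:i + gram]) for i in range(len(tokens) - gram + 1))
        if counts and max(counts.values()) >= n:
            flagged += 1
    return 100.0 * flagged / len(stories)


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Exponential of the mean negative log-probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

For example, a model that assigns probability 0.25 to every token has a perplexity of 4, meaning it is as uncertain as a uniform choice among four tokens.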
Our approach emphasizes the importance of maintaining consistency between successive paragraphs to create a cohesive and engaging narrative. BS-c is designed to assess the degree of coherence between two paragraphs and provides a more appropriate metric for evaluating the effectiveness of our approach.
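The pairing logic behind BS-c can be sketched independently of the scoring model. In the sketch below, `recall_fn` is a pluggable scorer; the token-overlap function is only a toy stand-in for BERTScore-recall, and treating the earlier paragraph as the reference and averaging over pairs are assumptions made for illustration.

```python
from typing import Callable, Sequence


def token_overlap_recall(candidate: str, reference: str) -> float:
    """Toy recall: share of reference tokens that also occur in the candidate.

    A stand-in for BERTScore-recall, which matches contextual embeddings
    instead of surface tokens.
    """
    ref_tokens = reference.split()
    cand_tokens = set(candidate.split())
    return sum(1 for tok in ref_tokens if tok in cand_tokens) / len(ref_tokens)


def coherence_score(
    paragraphs: Sequence[str],
    recall_fn: Callable[[str, str], float] = token_overlap_recall,
) -> float:
    """Average recall between each paragraph and the one that follows it."""
    if len(paragraphs) < 2:
        raise ValueError("need at least two consecutive paragraphs")
    scores = [
        recall_fn(paragraphs[i + 1], paragraphs[i])  # (candidate, reference)
        for i in range(len(paragraphs) - 1)
    ]
    return sum(scores) / len(scores)
```

Swapping `token_overlap_recall` for a function that wraps a real BERTScore implementation would turn the sketch into the metric described above.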

Human Evaluation
We conducted a human evaluation to compare our framework with a baseline using four different metrics. Fluency evaluates the quality of individual sentences from a linguistic perspective, taking into account factors such as grammatical correctness and the accuracy of semantic meaning representation. This evaluation was performed on a sentence-by-sentence basis, with each sentence considered in isolation. Coherence measures the logical relatedness between two consecutive paragraphs. This metric aims to assess how well the generated text flows and whether it makes sense as a cohesive whole. Relevance measures the contextual relevance between stories and the storyline. This metric aims to assess how well the generated text aligns with the given topic or subject matter and whether it is relevant to the overall storyline. Likability measures the degree of positive sentiment or engagement that the generated text elicits from human annotators. This metric aims to assess how appealing and engaging the generated text is to the target audience. To evaluate the annotators' agreement on these properties, we used Fleiss' κ coefficient, which measures the inter-rater reliability of the annotators' judgments.
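Fleiss' κ can be computed from a table that counts, for each rated item, how many annotators chose each score. The sketch below follows the standard definition; the helper `to_count_matrix` and the convention of returning 1.0 when all ratings fall into a single category are choices made here for illustration.

```python
from typing import List, Sequence


def to_count_matrix(ratings: Sequence[Sequence[int]], categories: Sequence[int]) -> List[List[int]]:
    """Turn per-item rater scores into per-item counts for each category."""
    return [[list(item).count(c) for c in categories] for item in ratings]


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a matrix of shape (items, categories)."""
    num_items = len(counts)
    num_raters = sum(counts[0])
    if any(sum(row) != num_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c * c for c in row) - num_raters) / (num_raters * (num_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / num_items
    # Chance agreement from the overall category proportions.
    num_categories = len(counts[0])
    p_cats = [
        sum(row[j] for row in counts) / (num_items * num_raters)
        for j in range(num_categories)
    ]
    p_e = sum(p * p for p in p_cats)
    if p_e == 1.0:
        return 1.0  # all ratings fall into one category
    return (p_bar - p_e) / (1.0 - p_e)
```

With three annotators and scores from 1 to 5, `to_count_matrix(ratings, [1, 2, 3, 4, 5])` produces the input expected by `fleiss_kappa`.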

Storyline Guidance Model
To verify the effectiveness of our approach, we evaluated our model on our modified dataset, which includes rich story entities and long paragraphs, in terms of Recall@1 and Recall@2. Specifically, we focused on three storyline entities (character, event, and place) that can lead to the next paragraph in the narrative. We recognize that achieving semantic coherence between successive paragraphs in the Storium dataset can be challenging, as character descriptions often involve personal information such as occupation and age, rather than a well-defined persona. Nonetheless, our evaluation results demonstrate that our model is able to generate a coherent sequence of the entire story. We observed that the recall for events is higher than for characters, indicating that the coherence between the current and next paragraphs has a semantic dependency on the event entity. Overall, our results, which are summarized in Table 1, demonstrate the effectiveness of our approach in generating coherent and engaging narratives, even on a challenging dataset. As each paragraph of our dataset consists of a large number of sentences, we show an example based on event descriptions in the predicted order and compare it with the existing data.

Story Generation Model
We analyzed the performance both quantitatively and qualitatively. Table 2 presents the automatic evaluation results of our story generation model compared to the Storium-GPT2 baseline. The results demonstrate that our model outperforms the baseline in perplexity, indicating its superior ability to model the text in the test set. In addition, our model generates more word overlaps with the reference texts, as evidenced by higher BLEU-3 and BLEU-4 scores. Although our model's BLEU-2 score does not surpass that of the baseline, we attribute this to the dataset, which contains numerous paragraphs with long sentences. Nonetheless, BLEU-3 and BLEU-4 are more appropriate for our dataset, and our results on them demonstrate our model's ability to generate high-quality stories. Moreover, our model reduces lexical repetition. As shown in Table 3, the BERTScore-reference of our model is lower than that of the baseline, likely because we reduced the input context to enhance coherence in the generated stories. In contrast, on our metric, BERTScore-coherence, our model outperforms the baseline by a significant margin. This result suggests that our storyline guidance model effectively plans and guides the generation of coherent and engaging narratives. We attribute this improvement to our approach's ability to maintain consistency between successive paragraphs. To further investigate the performance of our framework, we conducted a human evaluation and compared our framework with the baseline model. Since automatic evaluation metrics alone are limited in assessing story writing performance, and many studies therefore do not rely on them exclusively, we included human evaluation to ensure the quality of our system's performance. Specifically, we conducted the human evaluation on the decoded stories, which are the final output texts generated by our story generation model.
We randomly sampled 100 stories from the test set of the Storium dataset, since the complete set was too voluminous to include in our evaluation. We hired three experts to rate the generated paragraphs on a scale of 1 to 5. Our framework incorporates a storyline guidance model that automatically selects the input storyline, while the baseline model relies on human guidance. We generate a storyline for the randomly sampled stories using the experts' intuition. As shown in Table 4, our model outperforms the baseline in terms of fluency, coherence, relevance, and likability. Our model demonstrates a grammatical generation ability comparable to the baseline in terms of fluency, while the annotators' interest is significantly enhanced in terms of likability. Moreover, our storyline guidance model significantly improves coherence and relevance, indicating its high reliability in predicting the storyline. Therefore, the generated paragraphs have natural plots, and the predicted storyline is well aligned with the human annotators' expectations.

Story Visualization Model
Our image generation model relies on a prompt constructed by applying CLIP ImageNet style templates [52] to the predicted place identified by our storyline guidance model. Following the original paper's recommendation of using 3-5 image inputs, we empirically selected three images as inputs and applied them to various backgrounds in the story, ensuring that the underlying concept was preserved. Our visualization model generates images that contain appropriate background information and maintain a cohesive visual concept. We train the model using three-shot learning, which tunes it to support the readability of the story. Sample outputs from our story visualization model can be found in Figure 6.
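Prompt construction from the predicted place can be sketched as below. The templates are illustrative examples written in the spirit of the style templates used with Textual Inversion, the placeholder token `<scene-style>` is hypothetical, and the way the place and the learned token are combined is an assumption, since the text does not give the exact prompt format.

```python
import random
from typing import Optional

# Illustrative style templates; the placeholder is filled with the learned token.
STYLE_TEMPLATES = [
    "a painting in the style of {}",
    "a rendering in the style of {}",
    "a picture in the style of {}",
]


def build_prompt(
    place: str,
    placeholder: str = "<scene-style>",
    rng: Optional[random.Random] = None,
) -> str:
    """Combine the predicted place with a randomly sampled style template."""
    rng = rng or random.Random()
    template = rng.choice(STYLE_TEMPLATES)
    return f"{place}, {template.format(placeholder)}"
```

Keeping the learned token fixed while the place changes is what lets successive scene images share one visual concept.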

Conclusions and Future Work
In this paper, we propose a novel multi-modal story generation framework that includes automated storyline decision-making capabilities, which can replace the human role and allow the system itself to maintain story coherence. The proposed framework consists of three independent models. The first is a BERT-based storyline guidance model, which predicts a storyline by solving a multiple-choice question-answering problem. For each entity of the next storyline, the model selects, among the five candidates, the one most relevant to the current paragraph. The second is a GPT2-medium-based story generation model that creates a story describing the storyline determined by the guidance model. Lastly, to aid readers' understanding, we also propose a diffusion-based story visualization model that produces a representative image from the place of the current scene predicted by our storyline guidance model. We evaluated the performance of our framework and its generated stories both quantitatively and qualitatively using a large-scale dataset. We analyzed the performance of our storyline guidance model, which plans the storyline before each paragraph is generated, with Recall@1 and Recall@2. We evaluated the quality of the generated stories with human evaluation, which suggests that the coherence of the generated stories is improved. We also provided meaningful results of story visualization based on our generated stories.
Our proposed storyline guidance model is designed to select one of five candidate entities. This assumption differs from real-world scenarios, in which a storyline can proceed in far more than five ways. Therefore, in future work, we will explore improved methodologies that can overcome this limitation. Additionally, we plan to investigate new metrics for evaluating the quality of multi-modal story generation, specifically for measuring the similarity between the generated stories and images.