Article

Fic2Bot: A Scalable Framework for Persona-Driven Chatbot Generation from Fiction

School of AI Convergence, Sungshin Women’s University, Seoul 02844, Republic of Korea
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Electronics 2025, 14(19), 3859; https://doi.org/10.3390/electronics14193859
Submission received: 23 August 2025 / Revised: 20 September 2025 / Accepted: 24 September 2025 / Published: 29 September 2025
(This article belongs to the Special Issue Feature Papers in Artificial Intelligence)

Abstract

This paper presents Fic2Bot, an end-to-end framework that automatically transforms raw novel text into in-character chatbots by combining scene-level retrieval with persona profiling. Unlike conventional RAG-based systems that emphasize factual accuracy but neglect stylistic coherence, Fic2Bot ensures both factual grounding and consistent persona expression without any manual intervention. The framework integrates (1) Major Entity Identification (MEI) for robust coreference resolution, (2) scene-structured retrieval for precise contextual grounding, and (3) stylistic and sentiment profiling to capture linguistic and emotional traits of each character. Experiments conducted on novels from diverse genres show that Fic2Bot achieves robust entity resolution, more relevant retrieval, highly accurate speaker attribution, and stronger persona consistency in multi-turn dialogues. These results highlight Fic2Bot as a scalable and domain-agnostic framework for persona-driven chatbot generation, with potential applications in interactive roleplaying, language and literary studies, and entertainment.

1. Introduction

Rapid progress in large language models (LLMs) has accelerated the adoption of chatbots in various domains, particularly in the entertainment industry, where immersive and interactive experiences are highly valued [1]. Recent studies increasingly focus on developing character-based chatbots that emulate the personas of fictional characters from novels, films, and animations [2,3], with Character.AI emerging as a notable commercial example [4].
Such systems must not only recall the factual details of a fictional narrative but also maintain the distinctive linguistic style, emotional tone, and behavioral patterns of a specific character [5,6,7]. Achieving this dual objective is essential for sustaining user immersion and delivering an authentic narrative experience.
However, maintaining both factual grounding and persona coherence in literary-based chatbots remains a significant challenge. While Retrieval-Augmented Generation (RAG) approaches have proven effective in improving factual accuracy by referencing narrative events [8], the consistent preservation of character personas across multi-turn dialogues remains unresolved [9,10]. As a result, persona consistency is frequently underdeveloped, leading to responses that may be factually correct but stylistically generic. Without integrating factual knowledge and persona fidelity, chatbots risk being perceived as mere information retrievers rather than as living, breathing characters.
The difficulty is compounded by the nature of literary texts. Unlike conversational datasets, novels rarely provide extensive character-specific dialogue, resulting in data sparsity that limits the effectiveness of fine-tuning approaches [2]. Narrative structures also introduce complexity in entity resolution, with characters referred to through multiple aliases, pronouns, or descriptive expressions [11]. Furthermore, most existing systems require manual curation, character-specific tuning, or external knowledge integration, hindering scalability and automation [12,13].
To address these gaps, we propose Fic2Bot (Fiction-to-Bot), an end-to-end framework that automatically extracts and operationalizes a character’s persona directly from the raw novel text. Our approach integrates three key dimensions of persona representation:
(1) Worldview Knowledge—The raw novel text is segmented into scene-level units and structured with metadata, enabling the selective retrieval of key passages according to query and context. This design enhances narrative coherence while improving factual accuracy.
(2) Speech Style—Character utterances are analyzed through TF-IDF-based lexical profiling to extract characteristic vocabulary, while syntactic patterns such as sentence length and the frequency of interrogatives, negations, and exclamations are also examined.
(3) Emotional Tone—Pre-trained sentiment analysis models are applied to characterize the emotional tendencies of character utterances, which are quantified into a five-level distribution across positive and negative polarities.
The main contributions of this work are as follows:
  • End-to-end automated framework: We propose Fic2Bot, the first framework that automatically constructs in-character chatbots directly from raw novel text, eliminating the need for manual curation or fine-tuning.
  • Scene-structured RAG retrieval: We introduce a retrieval strategy that segments narratives into scene-level units with metadata, achieving higher recall and ranking precision compared to semantic chunking.
  • Comprehensive persona profiling: We develop a multidimensional analysis of lexical, syntactic, and sentiment features, enabling consistent stylistic and emotional expression across dialogues.
  • Extensive evaluation across genres: We validate the framework on three novels spanning children’s literature, romance comedy, and thriller, demonstrating robust performance in entity identification (F1 > 0.94), retrieval (a maximum improvement of +46.7 percentage points in Recall@3), and speaker attribution (up to +60.7 percentage points over BookNLP).
The remainder of this paper is organized as follows: Section 2 reviews related work; Section 3 details each component of the framework; Section 4 presents the key experiments; Section 5 discusses the results; Section 6 outlines the limitations of our work; and Section 7 concludes.

2. Related Work

2.1. Persona Chatbot

A persona is a key element for delivering consistent interactions: it continuously reflects the unique personality and linguistic characteristics of a specific character and is essential for providing an immersive experience. In particular, character chatbots based on entertainment content aim to go beyond simple question-and-answer exchanges to offer users the experience of interacting with the original character, and active research is being conducted in this area [14,15].
Accordingly, recent efforts toward maintaining persona consistency have primarily focused on fine-tuning pre-trained language models on character-specific dialogue data to internalize a character’s stylistic and personality traits within the model [12,14,16]. However, such approaches face limitations in reconstructing existing characters, as they require separate fine-tuning or incorporation of external knowledge for each character, along with access to carefully curated datasets [17].
Consequently, this study proposes Fic2Bot, a framework that automatically extracts characters’ speaking styles, personalities, and world knowledge solely from original fictional texts and integrates them into an LLM-based chatbot.

2.2. Coreference Resolution

Narrative texts such as novels often intermingle character names, aliases, and pronouns, making it difficult to clearly identify the speaker of each utterance [18]. Therefore, coreference resolution—the task of determining whether different expressions refer to the same entity—is an essential preprocessing step.
Representative models in coreference resolution research include AllenNLP [19], and more recently, BERT-based pre-trained models such as SpanBERT, which learn span boundary information and have achieved high performance in this task [20,21]. However, existing studies have mainly focused on structured texts such as news articles and Wikipedia, leading to performance degradation when applied to literary texts.
To address this, BookNLP [22], which is tailored for literary text analysis, integrates speaker identification, entity extraction, and coreference resolution, and is capable of handling both singular and plural references. Nevertheless, as Bamman et al. [23] pointed out, literary texts clearly distinguish between the narrative domains of narrators and characters, with frequent shifts between general and specific references. Moreover, in long-form novels, characters’ personalities and relationships may evolve throughout the story, and because BookNLP relies on surface-level linguistic cues such as pronoun-based references and name–pronoun associations, it faces limitations in maintaining consistent coreference matching for the same character across the entire narrative.
To overcome these limitations, recent studies have proposed prompt-based coreference resolution approaches leveraging large language models (LLMs). In particular, a method called Major Entity Identification (MEI) has been introduced to selectively identify coreferences referring only to key entities [24]. This approach employs a two-stage prompting strategy, in which key terms are first extracted and then expanded into full phrases. This helps reduce unnecessary coreference links in literary texts with many characters and complex contexts, enabling more accurate resolution focused on key entities.
Building upon these prompt-based approaches, in this paper, we apply a modified MEI prompt structure, tailored to the goal of improving coreference resolution performance in literary texts.

2.3. Retrieval-Augmented Generation (RAG)

RAG is a representative approach designed to overcome the knowledge limitations of large language models (LLMs). It follows a two-stage architecture in which external knowledge is first retrieved and then used to generate a response to a given query [25]. This method is particularly useful in scenarios that demand accurate, fact-based responses, and has been highlighted as an effective means to mitigate the hallucination problem commonly observed in generative models [26].
Naive RAG models simply convert a user query into a vector and retrieve semantically similar documents from the entire corpus before generating a response. However, this approach often suffers from a lack of semantic cohesion between the query and the retrieved context, or the injection of irrelevant information, ultimately degrading the quality of the response. These limitations become more pronounced when dealing with narrative texts such as long-form novels, where the relationships between characters and the progression of events are highly complex [27].
To address these challenges, recent research has proposed Advanced RAG techniques, which refine both the retrieval and generation phases to enhance accuracy and coherence. These include methods such as query optimization through expansion or rewriting, metadata-based targeted retrieval, and scene-level document segmentation. Compared to simple vector similarity-based retrieval, these strategies offer greater contextual relevance and improved response quality [28].
Such Advanced RAG strategies are particularly effective in the context of literature-based character chatbot systems. Scene-level retrieval allows for a more precise reflection of intricately intertwined elements such as characters, events, and settings, while metadata filtering enables targeted retrieval of scenes focused on specific characters. In this study, we move beyond naive retrieval approaches and apply an Advanced RAG framework optimized for the Fic2Bot system, leveraging scene-based structuring and character-centric tagging.

2.4. Chatbot Systems Based on Retrieval-Augmented Generation (RAG)

RAG-based chatbot systems have recently been actively investigated for their potential applications across various domains, including healthcare, customer support, education, and entertainment [29,30,31,32,33]. The core principle of RAG lies in retrieving external knowledge sources and injecting them into the context of the generative model, allowing systems to incorporate up-to-date and rich information without relying solely on internal parameters. This approach has shown strong effectiveness in fact-intensive tasks such as open-domain question answering. However, in persona-oriented chatbots, which require consistent reproduction of a character’s speaking style, personality, and narrative worldview, this simple retrieval–injection mechanism remains insufficient [14].
To address these shortcomings, ChatHaruhi [17] combines script-based memory retrieval, persona-aware prompt design, and sentence embedding search to reproduce a character’s tone and behavior. While this approach demonstrates success in maintaining character consistency, it remains restricted to predefined characters. Incorporating new ones requires extensive manual work, including script collection, data curation, and prompt engineering, which severely limits both scalability and automation.
In contrast, the proposed Fic2Bot framework requires only the raw novel text provided by the user. Through coreference resolution, scene-level RAG, and stylistic–sentiment profiling, it automatically extracts each character’s speech style, personality, and world knowledge and integrates them into response generation. As a result, no manual dataset construction or character-specific tuning is needed. More importantly, Fic2Bot is not bound to a fixed set of characters or a specific domain: it can generalize to entirely new fictional characters across different novels and genres, thereby overcoming the generalization limitations of prior approaches. Thus, Fic2Bot achieves scalability, generalization, and minimal human intervention simultaneously, complementing the limitations of the ChatHaruhi approach.

3. Methods

This section details the proposed Fic2Bot framework, an automated pipeline that transforms raw novel text into an in-character chatbot. Given long-form narrative text as the input, the system generates dialogue that reflects each character’s personality, tone, and style. To ensure both factual accuracy and persona consistency, the framework combines scene-level RAG with persona profiling.
We first outline the overall architecture (Section 3.1) and then describe each component in execution order:
Major Entity Identification (MEI) (Section 3.2) resolves ambiguity in character references across names, nicknames, and pronouns by consolidating them into unified entity representations, improving the accuracy and efficiency of subsequent modules.
Scene-level RAG (Section 3.3) restructures the novel into discrete scene units and retrieves only those relevant to a query, reducing irrelevant context and enhancing response accuracy and contextual appropriateness.
Character Style Extraction (Section 3.4) analyzes character utterances to build quantitative profiles of stylistic features—tone, speech patterns, and emotional tendencies—which model each character’s linguistic and affective traits.
Persona-based Response Generation (Section 3.5) integrates retrieved scene context with style profiles to produce responses that are both factually grounded and consistent with how the character would plausibly speak within the narrative world.
For clarity, the module order in this section matches the evaluation order in Section 4.

3.1. Overall Pipeline Architecture

Figure 1 illustrates the end-to-end Fic2Bot framework. The input is a raw novel text file (e.g., .txt), which undergoes preprocessing to resolve coreference and segment the story into discrete scenes. The preprocessed data then flows into two parallel, complementary paths:
  • Factual Path: Constructs a scene-level retrieval database for accurate, fact-based responses. This path supports the chatbot in recalling specific events, settings, and in-world facts from the novel.
  • Persona Path: Builds a character-specific style and personality profile by aggregating and analyzing all utterances attributed to each character.
The two paths converge in the final chatbot generation module, which integrates retrieved factual content with persona-specific stylistic and emotional features to produce character-consistent, contextually grounded responses.

3.2. Preprocessing: Major Entity Identification and Scene Structuring

The preprocessing stage serves two purposes: (1) link all mentions (names, aliases, pronouns) to their corresponding major characters, and (2) restructure the novel into scene-level JSON objects for downstream retrieval.
The Major Entity Identification (MEI) task employs a two-stage prompting strategy [24]. The process begins with the selection of major entities and then proceeds through two stages, as illustrated in Figure 2.
  • Major Entity Selection: Characters that appear five or more times in the novel are designated as primary entities; their names are then supplied to the prompt.
  • Stage 1—Word-Level Mention Detection: Identify the head word of each mention referring to any major entity.
  • Stage 2—Span Expansion and Entity Linking: Expand each mention to its full phrase and link all variants to the corresponding entity.
MEI was proposed to address a limitation of conventional coreference resolution models, which perform well in mention clustering but exhibit lower accuracy in mention detection. This approach provides major entities as prior inputs along with the text to guide the reference resolution process. Its generalization ability has been validated across diverse domain datasets, and it has demonstrated robustness in long-form narratives such as novels, where pronouns and aliases frequently occur.
In this study, we adapted the prompts proposed in prior work [24] to fit our research objectives. Specifically, we restricted the recognized entities to character entities within the novel, thereby eliminating irrelevant mentions such as objects or locations, and explicitly incorporated the recognition of plural pronouns. The complete Stage 1 prompt incorporating MEI is presented in Appendix A.1, whereas Appendix A.2 details the subsequent mention-span expansion and the final matching procedure with the primary entities.
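To make the flow concrete, the two-stage prompting can be sketched as follows. This is a minimal illustration, assuming a generic `llm` callable that sends a prompt to the backbone model and returns its text output; the template strings are abridged stand-ins for the full prompts in Appendix A.1 and Appendix A.2.

```python
# Abridged prompt templates; the complete versions appear in Appendix A.1/A.2.
STAGE1_TEMPLATE = (
    "Perform coreference resolution on the text, tagging each mention of a key "
    "entity as word#cluster_id.\nKey Entities: {entity_description}\nText: {text}"
)
STAGE2_TEMPLATE = (
    "Any word marked with # is the head of a noun phrase. Expand each head to its "
    "full span, wrapped as (FULL SPAN)#ClusterID.\nText: {text}"
)

def select_major_entities(mention_counts: dict[str, int], min_count: int = 5) -> dict[str, str]:
    """Characters mentioned at least five times become major entities with cluster IDs."""
    return {name: f"#{name.split()[-1].lower()}"
            for name, count in mention_counts.items() if count >= min_count}

def run_mei(llm, chapter: str, entities: dict[str, str]) -> str:
    """Two-stage MEI prompting: Stage 1 tags head words, Stage 2 expands their spans."""
    tagged = llm(STAGE1_TEMPLATE.format(entity_description=entities, text=chapter))
    # Stage 1 output, e.g., "He#pooh saw Piglet#piglet. They#pooh,#piglet talked."
    return llm(STAGE2_TEMPLATE.format(text=tagged))
    # Stage 2 output, e.g., "(He)#pooh saw (Piglet)#piglet. (They)#pooh,#piglet talked."
```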
Once entity linking is completed, the text is segmented into scenes through LLM-based prompting. The prompt instructs the segmentation of scenes based on changes in time, location, or main characters as defined in SceneML [34]. Each scene is stored as a JSON object containing key metadata, such as a list of speakers, dialogue with matched speaker–utterance pairs, scene ID, and scene position (Chapter). This provides the foundation for the retrieval process in Section 3.3 and the persona analysis in Section 3.4. The complete prompt specification used in this step is presented in Appendix A.3.
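For reference, a single scene record produced by this step follows the schema of Appendix A.3; the field values below are taken directly from the paper’s own prompt example.

```python
# One scene-level JSON object, as stored for retrieval and persona analysis.
scene = {
    "Scene ID": "CH5_1",                 # unique identifier within the novel
    "Position": "CHAPTER_5",             # chapter-level location of the scene
    "Speakers": ["Winnie-the-Pooh", "Christopher Robin"],
    "Scene": [                           # full scene text, narration included
        "Winnie-the-Pooh walked through the forest.",
        "“Hello!” said Christopher Robin.",
        "“Hi!” said Winnie-the-Pooh.",
    ],
    "Dialogue": [                        # speaker-attributed utterances only
        "Christopher Robin: Hello!",
        "Winnie-the-Pooh: Hi!",
    ],
}
```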

3.3. Scene-Level Retrieval-Augmented Generation (RAG)

In the factual path of our framework, we adopt a scene-level RAG approach. Compared to naive chunk-based retrieval, scene segmentation aligns retrieval units with narrative boundaries, thereby preserving contextual coherence [35]. The overall process is schematically illustrated in Figure 3.
The input query is first rewritten by the LLM to improve semantic matching and then searched against the scene-level structured data constructed during preprocessing. The prompt used for rewriting is provided in Appendix B. It is designed to eliminate ambiguity within the question while preserving the user’s original intent, enforce a consistent interrogative format, and prevent the model from inferring unsupported context. Moreover, it encourages concise and keyword-rich phrasing to enhance compatibility with retrieval systems. The retrieved scenes are subsequently reranked to improve precision and then passed to the response generation module, thereby ensuring factual accuracy in the final outputs.
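The factual path thus reduces to a rewrite–embed–search–rerank loop. The sketch below assumes hypothetical helpers `rewrite_query`, `embed`, and `rerank` wrapping the concrete components named in Section 4.2 (Gemini-based query rewriting, Gemini-embedding-001, and the bge reranker); the Faiss and NumPy calls are real library APIs, while the helpers are assumptions.

```python
import numpy as np
import faiss

def build_scene_index(scene_vectors: np.ndarray) -> faiss.IndexFlatIP:
    """Index L2-normalized scene embeddings for cosine (inner-product) search."""
    vectors = np.ascontiguousarray(scene_vectors, dtype="float32")
    faiss.normalize_L2(vectors)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index

def retrieve_scenes(query, scenes, index, embed, rewrite_query, rerank, k=3):
    """Rewrite the query, search the scene index, then rerank the candidates."""
    rewritten = rewrite_query(query)                   # LLM rewriting (Appendix B)
    qvec = np.asarray(embed(rewritten), dtype="float32").reshape(1, -1)
    faiss.normalize_L2(qvec)
    _, ids = index.search(qvec, k * 3)                 # over-retrieve before reranking
    candidates = [scenes[i] for i in ids[0] if i >= 0]
    return rerank(rewritten, candidates)[:k]           # cross-encoder reranking
```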

3.4. Character Style Extraction

To ensure consistent persona representation, the persona path aggregates all dialogue attributed to each character from the preprocessed scene data and performs two main analyses:
  • Stylistic Feature Analysis: Extracts signature vocabulary (via TF–IDF) and syntactic patterns (sentence length, question, exclamation, negation frequency). These features reflect how the character speaks.
  • Sentiment Analysis: Applies a pre-trained sentiment analysis model from Hugging Face to score each character’s utterances on a scale from 1 (very negative) to 5 (very positive), providing a quantitative measure of the overall emotional state and tendencies of each character’s speech.
Combining lexical, syntactic, and sentiment features produces a multi-dimensional style profile that captures both linguistic habits and emotional tendencies. This profile directly informs the persona conditioning in the chatbot generation stage.
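A condensed sketch of the profiling step is shown below, assuming non-empty utterance lists per character. The TfidfVectorizer and the nlptown sentiment model match the tools named in Section 4.2; the helper name and the top-n cutoff are illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import pipeline

def style_profile(utterances_by_char: dict[str, list[str]], top_n: int = 20) -> dict:
    """Build lexical (TF-IDF), syntactic, and sentiment features per character."""
    chars = list(utterances_by_char)
    docs = [" ".join(utterances_by_char[c]) for c in chars]  # one document per character
    tfidf = TfidfVectorizer(stop_words="english")
    matrix = tfidf.fit_transform(docs)
    vocab = tfidf.get_feature_names_out()

    sentiment = pipeline("sentiment-analysis",
                         model="nlptown/bert-base-multilingual-uncased-sentiment")
    profiles = {}
    for i, char in enumerate(chars):
        weights = matrix[i].toarray().ravel()
        signature = [vocab[j] for j in weights.argsort()[::-1][:top_n]]
        utts = utterances_by_char[char]
        syntax = {  # simple surface statistics reflecting how the character speaks
            "avg_words": sum(len(u.split()) for u in utts) / len(utts),
            "question_rate": sum("?" in u for u in utts) / len(utts),
            "exclaim_rate": sum("!" in u for u in utts) / len(utts),
        }
        scores = [int(r["label"][0]) for r in sentiment(utts)]  # "5 stars" -> 5
        profiles[char] = {"vocab": signature, "syntax": syntax, "sentiment": scores}
    return profiles
```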

3.5. Persona-Based Chatbot Generation

In the final stage, the system integrates the following four components:
  • Persona Instruction Prompt: Instructs the model to generate responses in the character’s style while referencing retrieved scenes. Additional rules enforce concise outputs and require that a reply is always generated, preventing empty answers.
  • Relevant Context from RAG: Supplies factually relevant scenes to ground the response.
  • Persona Analysis Data: Provides the extracted stylistic and sentiment features.
  • User Query: The incoming question or statement from the end user.
All components are combined into a single structured prompt for the LLM, which generates a response that is both factually accurate and stylistically faithful to the target character. Figure 4 illustrates the integration of the four components into a single prompt.
Merging the outputs of the factual and persona paths ensures that generated responses are neither generic fact lists nor style-only imitations, but coherent, immersive replies that maintain the integrity of the fictional world.
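As a rough illustration, the four components can be merged into one structured prompt as below; the wording is a hypothetical stand-in for the actual persona instruction prompt shown in Figure 4, and the profile keys (`vocab`, `sentiment_mean`) are assumed.

```python
def build_prompt(character: str, profile: dict, scenes: list[str], query: str) -> str:
    """Combine persona instruction, retrieved scenes, style profile, and user query."""
    scene_block = "\n".join(scenes)                  # RAG-retrieved factual context
    return (
        f"You are {character}. Stay in character, keep replies concise, and never "
        f"return an empty answer.\n"                 # persona instruction rules
        f"Signature vocabulary: {', '.join(profile['vocab'])}\n"
        f"Typical sentiment (1-5): {profile['sentiment_mean']:.1f}\n"
        f"Relevant scenes:\n{scene_block}\n"
        f"User: {query}\n{character}:"
    )
```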

4. Experiments

This section presents the procedures and results of key experiments conducted to verify whether the proposed Fic2Bot framework can effectively ensure persona consistency and factual accuracy in novel-based chatbots. The primary objective of these experiments is to evaluate the extent to which each module of the framework contributes to the chatbot’s overall performance.
To this end, Section 4.1 introduces the datasets used in our experiments. Section 4.2 details the implementation of the proposed pipeline. Section 4.3 quantitatively evaluates each component of the framework. Section 4.3.1 assesses Entity Identification in the preprocessing stage; Section 4.3.2 evaluates the RAG component, which provides factual context from the novel; Section 4.3.3 examines speaker–utterance matching, which supports character style extraction; and Section 4.3.4 analyzes character speech patterns and emotional expressions based on the extracted corpus. Section 4.4 evaluates the final chatbot’s persona consistency across multi-turn dialogues, and Section 4.5 presents ablation and efficiency analyses, which investigate both the contribution of each component to the framework and the practical feasibility in terms of inference latency and computational cost.

4.1. Dataset

To verify the generality and scalability of our approach, we selected three novels according to two criteria: (1) Popularity diversity—including both widely known bestsellers and lesser-known titles to test performance when the LLM has varying prior exposure; (2) Genre diversity—covering distinct styles (children’s literature, romance comedy, and thriller) to avoid genre-specific overfitting.
In this study, we used three novels as source texts for persona-chatbot generation. Their titles and genres are listed in Table 1, and the original texts were collected exclusively for research purposes.

4.2. Implementation

All experiments were conducted in Python 3.10, with Gemini-2.5-pro serving as the primary LLM backbone. Texts were segmented at the chapter level, and Major Entity Identification (MEI) was performed using a two-stage prompting strategy.
For retrieval, novels were converted into JSON format with metadata fields. Embeddings were generated with Gemini-embedding-001 [36] and searched using the vector store Faiss [37]. To improve ranking quality, we applied the Advanced RAG framework with a reranker (BAAI/bge-reranker-base [38]) and Gemini-based query rewriting.
Speaker attribution was performed with Gemini-2.5-pro, using BookNLP as a baseline, and the resulting speaker–utterance pairs were used for stylistic profiling. For sentiment analysis, we applied the nlptown/bert-base-multilingual-uncased-sentiment model [39], aggregating polarity scores (1–5) by character.
Finally, Gemini-2.5-pro generated multi-turn dialogue responses conditioned on persona profiles and retrieved scenes, while GPT-4 served as an independent evaluator for persona consistency. All experiments were conducted using the default hyperparameters provided by the Gemini-2.5-pro model (temperature = 1.0, max tokens = 8192, top P = 0.95, etc.) and GPT-4 (temperature = 1.0, max tokens = 1024, top P = 1.0, etc.), without additional tuning.
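For reference, a default-parameter call to the backbone corresponds roughly to the following, assuming the google-generativeai Python SDK; the exact API surface may vary across SDK versions, and the prompt string is illustrative only.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")              # assumption: key supplied by the user
model = genai.GenerativeModel("gemini-2.5-pro")
response = model.generate_content(
    "You are Winnie-the-Pooh. Respond accordingly.",  # illustrative prompt only
    generation_config={"temperature": 1.0, "max_output_tokens": 8192, "top_p": 0.95},
)
print(response.text)
```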

4.3. System Component Evaluation

4.3.1. Entity Identification

In Section 3.2, we adopt the Major Entity Identification (MEI) task together with a two-stage prompting strategy to evaluate how accurately the preprocessing module links referring expressions (e.g., pronouns, aliases) to predefined major entities. Major entities are restricted to named characters that appear at least five times in each novel, and all texts are partitioned at the chapter level to ensure consistency across evaluation units.
Two metrics were used for the evaluation:
  • NER (Named Entity Recognition): Measures the model’s ability to accurately identify referring expressions (mentions) for each character, based on precision and recall. For each entity $e_j$ in document $d$, let the gold mentions be $G(e_j)$ and the predicted mentions be $P(e_j)$. We define
    $$\mathrm{NER\ Precision}(e_j) = \frac{|G(e_j) \cap P(e_j)|}{|P(e_j)|}, \qquad \mathrm{NER\ Recall}(e_j) = \frac{|G(e_j) \cap P(e_j)|}{|G(e_j)|},$$
    $$\mathrm{NER\ F1}(e_j) = \frac{2 \cdot \mathrm{Precision}(e_j) \cdot \mathrm{Recall}(e_j)}{\mathrm{Precision}(e_j) + \mathrm{Recall}(e_j)}.$$
  • MEI (Major Entity Identification): Measures whether each identified mention is correctly linked to the corresponding character across the text. For each entity $m_j$, let $G(m_j)$ and $P(m_j)$ denote the gold and predicted mention sets. Precision, recall, and F1 are then computed as
    $$\mathrm{MEI\ Precision}(m_j) = \frac{|G(m_j) \cap P(m_j)|}{|P(m_j)|}, \qquad \mathrm{MEI\ Recall}(m_j) = \frac{|G(m_j) \cap P(m_j)|}{|G(m_j)|},$$
    $$\mathrm{F1}(m_j) = \frac{2 \cdot \mathrm{Precision}(m_j) \cdot \mathrm{Recall}(m_j)}{\mathrm{Precision}(m_j) + \mathrm{Recall}(m_j)}.$$
    We then compute document-level performance with both macro and micro averaging:
    $$\text{Macro MEI-F1} = \frac{1}{|M|} \sum_{m_j \in M} \mathrm{F1}(m_j),$$
    $$\text{Micro MEI-F1} = \frac{2 \sum_{m_j \in M} |TP(m_j)|}{2 \sum_{m_j \in M} |TP(m_j)| + \sum_{m_j \in M} |FP(m_j)| + \sum_{m_j \in M} |FN(m_j)|}.$$
Macro F1 averages entity-level F1 scores with equal weight, reflecting the model’s consistency across both frequent and infrequent characters under class imbalance. Micro F1 aggregates true positives, false positives, and false negatives across all entities before computing a single F1 score, giving more weight to frequent characters and representing overall performance proportional to their occurrence. By reporting both macro and micro F1, we capture not only the model’s overall accuracy but also its ability to generalize fairly across characters of varying frequency.
In the multi-entity case, where a single mention may correspond to multiple characters, we applied the same set-based precision, recall, and F1 metrics as in the single-entity case. However, rather than framing all mentions as a single classification problem, we evaluated each mention independently by comparing its predicted and gold sets. This formulation directly measures how well the predicted and gold mentions overlap for each case.
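Since both tasks use the same set-based definitions, the evaluation reduces to a few lines; a minimal sketch, assuming gold and predicted mention sets keyed by entity:

```python
def prf(gold: set, pred: set) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one entity's mentions."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_micro_f1(gold_sets: dict, pred_sets: dict) -> tuple[float, float]:
    """Macro: mean of entity-level F1. Micro: pool TP/FP/FN across all entities."""
    f1_scores, tp, fp, fn = [], 0, 0, 0
    for entity, gold in gold_sets.items():
        pred = pred_sets.get(entity, set())
        f1_scores.append(prf(gold, pred)[2])
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    macro = sum(f1_scores) / len(f1_scores)
    micro = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return macro, micro
```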

4.3.2. RAG Retrieval with Metadata Structuring

This experiment evaluates the effectiveness of scene-level metadata structuring for retrieval. The evaluation dataset consisted of fact-based questions automatically generated for each scene, where the answer was explicitly contained in the same scene. From this pool, 30 questions were sampled as the test set.
We report Recall@k and MRR@k, comparing against a baseline built with semantic chunking of the entire text. Recall@k evaluates whether the correct scene is included within the top-k retrieved results for a given query. It is calculated as follows:
$$\mathrm{Recall@}k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\mathrm{rank}_i \le k)$$
In contrast, MRR@k measures the average reciprocal rank of the first correct result, defined as
$$\mathrm{MRR@}k = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i} \cdot \mathbb{1}(\mathrm{rank}_i \le k)$$
Here, $\mathrm{rank}_i$ denotes the rank position of the correct scene for the $i$-th query. While Recall@k reflects the system’s ability to retrieve the correct scene, MRR@k captures how early in the ranking the correct result appears. Using both metrics allows us to evaluate the retrieval system in terms of both accuracy and efficiency.
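Both metrics follow directly from the rank of the correct scene per query; a minimal sketch (ranks are 1-indexed, and a correct scene outside the top-k counts as a miss):

```python
def recall_at_k(ranks: list[int], k: int) -> float:
    """Fraction of queries whose correct scene appears within the top-k results."""
    return sum(rank <= k for rank in ranks) / len(ranks)

def mrr_at_k(ranks: list[int], k: int) -> float:
    """Mean reciprocal rank, counting only correct scenes ranked within the top-k."""
    return sum(1.0 / rank for rank in ranks if rank <= k) / len(ranks)

# Example: ranks of the correct scene for five queries.
ranks = [1, 3, 2, 7, 1]
print(recall_at_k(ranks, 3), mrr_at_k(ranks, 3))  # 0.8, 0.5667
```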

4.3.3. Speaker Identification

The goal of this experiment is to assess whether the automatically generated speaker–utterance corpus is accurate enough for persona extraction. Evaluation was conducted against the gold-standard dataset using Accuracy and F1 score. Accuracy was computed over the entire set of gold-standard dialogues, including cases where utterances were not recognized as dialogues. F1 was calculated only on successfully recognized quotations, measuring whether speakers were correctly attributed.

4.3.4. Stylistic Feature Analysis

This section presents the stylistic analysis conducted on the speaker–utterance matching corpus generated in Section 4.3.3. First, the top TF–IDF terms for each character were extracted to identify the lexical cues that most strongly characterize their utterances. Figure 5 visualizes these terms as word clouds, which make the contrasts in vocabulary usage and expression tendencies across characters more salient. Through this visualization, one can intuitively observe differences in preferred word choices and recurrent expressions, providing clear evidence of each character’s unique linguistic profile. Importantly, these lexical distinctions are not limited to descriptive purposes but also serve as input for subsequent analyses, including sentiment evaluation. The outcomes of these analyses are then incorporated into the chatbot response generation process, ensuring that each character’s linguistic tendencies are faithfully reproduced in dialogue and that differences in speech styles are more distinctly reflected in the generated responses.
To assess the separability of character-specific speech patterns, we measured both Jaccard similarity and cosine similarity between the extracted word groups. Lower similarity values indicate more distinct linguistic differences between characters.
$$\mathrm{JaccardSim}(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad \mathrm{CosineSim}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
In addition, we analyzed syntactic features for each character’s utterances, including the average number of words per utterance, average number of characters, the proportion of exclamation marks, the proportion of question marks, the proportion of negative expressions, and the proportion of first-person pronouns.
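Both similarity measures are computed pairwise over the characters’ extracted word groups; a sketch treating each group as a term set (Jaccard) and, as an assumption about the vectorization, a term-frequency vector (cosine):

```python
import math
from collections import Counter

def jaccard_sim(a: set[str], b: set[str]) -> float:
    """Overlap between two characters' signature-term sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine_sim(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```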

4.3.5. Sentiment Analysis

Building on the extracted utterances illustrated in Figure 5, this experiment evaluates the emotional tendencies of characters by analyzing the sentiment of their utterances. Scores were assigned on a 1–5 scale, where lower values indicate a more negative tone and higher values indicate a more positive tone. The average score and distribution were examined to capture each character’s overall emotional atmosphere and expressive patterns.

4.4. Persona Consistency Evaluation in Multi-Turn Settings

This experiment aims to evaluate whether the chatbot consistently maintains its persona throughout multi-turn conversations. To this end, we constructed three dialogue scenarios: (1) ordinary conversation, (2) emotional conversation, and (3) fact-based conversation grounded in the fictional world of the novel. For ordinary and emotional dialogues, we utilized categories 1 (ordinary) and 4 (emotion and attitude) of the high-quality multi-turn dataset DailyDialog [40]. For fact-based fictional dialogues, we randomly selected scenes from the novel and assumed that the user and the character chatbot were situated within those scenes.
For each scenario, dialogues were generated with 6, 12, and 20 turns to simulate a variety of multi-turn situations. Each dialogue was constructed by alternating responses between a user-role LLM (User LLM) and the character chatbot. The User LLM generated utterances based on pre-defined scenario-specific prompts as shown in Appendix C Table A5. To ensure naturalness and comparability, the User LLM prompts explicitly restricted utterances to 1–2 sentences, preventing the User LLM from producing overly long messages. In addition, scenario-specific response styles were enforced to guarantee consistency and fairness across experimental conditions.
Finally, the persona consistency of the chatbot’s responses was evaluated using an LLM judging prompt (Appendix C Table A6). As reference, examples were drawn from the original corpus of the novel to represent the authentic speaking style of each character. The evaluation considered four aspects—Linguistic Style, Tone, Vocabulary, and Emotional Expression—each scored on a 1–5 scale with an averaged score. In addition, the judge determined whether the chatbot’s utterances sounded as if they were spoken by the same character from the novel (Yes/No), and provided a confidence score (0–100%) along with a concise rationale.
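The generation-and-judging loop can be sketched as follows, with `user_llm`, `character_bot`, and `judge` as hypothetical callables wrapping the scenario prompt (Table A5), the Fic2Bot pipeline, and the judging prompt (Table A6), respectively.

```python
def simulate_dialogue(user_llm, character_bot, scenario_prompt: str, turns: int) -> list[dict]:
    """Alternate User-LLM and chatbot turns to produce a transcript for judging."""
    transcript, history = [], ""
    for turn in range(turns):
        if turn % 2 == 0:  # user turn, constrained to 1-2 sentences by the prompt
            message = user_llm(f"{scenario_prompt}\nConversation so far:\n{history}")
            role = "user"
        else:              # character turn, persona- and scene-conditioned
            message = character_bot(history)
            role = "character"
        transcript.append({"role": role, "text": message})
        history += f"{role}: {message}\n"
    return transcript

def evaluate_persona(judge, transcript: list[dict], reference_examples: list[str]) -> dict:
    """Score linguistic style, tone, vocabulary, and emotional expression (1-5 each)."""
    bot_turns = [t["text"] for t in transcript if t["role"] == "character"]
    return judge(utterances=bot_turns, references=reference_examples)
```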

4.5. Module Effectiveness and Efficiency Analysis

4.5.1. Module Effectiveness and Baseline Comparison

To assess the effectiveness of individual components within Fic2Bot, we analyzed both the overall pipeline performance and the contribution of each module. Instead of full ablation for every component, we combined a baseline comparison with results from earlier sections that isolate each module’s effect. As a baseline, we implemented a full-text prompting approach, in which the entire novel text was provided as input and the model was given only a simple persona prompt (e.g., “You are [Character]. Respond accordingly.”) This setting removes modular preprocessing and relies solely on the LLM’s ability to infer character style and context. For response evaluation, we adopted the persona consistency evaluation methodology described in Section 4.4, conducting multi-turn dialogues primarily under the Original scenario. In addition, we introduced the Dialogue Coherence metric to provide a more fine-grained assessment of the chatbot’s conversational ability.
Furthermore, we reorganized the RAG performance results from Section 4.3.2 together with the outcomes of this effectiveness study, allowing a direct comparison of the incremental contributions of each module to factual accuracy and persona consistency. This design provides evidence of the causal relationship between the modular framework and the improvements in persona-driven dialogue generation performance.

4.5.2. Efficiency Analysis

To evaluate the practical applicability of the framework, we conducted an efficiency analysis focusing on chatbot build time, response latency, and computational cost. Specifically, we measured the time to run the entire chatbot generation framework—including the MEI procedure, scene-level metadata structuring, embedding generation for vector-based retrieval, and persona extraction—as well as the end-to-end response time covering query rewriting, retrieval, reranking, and response generation. For comparison, we also measured the end-to-end response time of the baseline described in Section 4.5.1.
Chatbot build time was measured in seconds, and response latency was reported as the median latency (p50) and the worst-case latency (p95). In addition, we quantified the token usage of LLM calls, reporting the average number of input and output tokens consumed per query, thereby providing an estimate of potential costs. Through this analysis, the framework can be comprehensively evaluated in terms of its overall cost in real-world deployment environments, taking into account both chatbot build and query processing times.
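The reported p50/p95 figures correspond to simple percentile summaries over per-query timings; for example:

```python
import statistics

def latency_summary(latencies_s: list[float]) -> dict[str, float]:
    """Median (p50) and worst-case (p95) latency from per-query timings in seconds."""
    cuts = statistics.quantiles(latencies_s, n=100)  # 99 cut points: cuts[49]=p50, cuts[94]=p95
    return {"p50": cuts[49], "p95": cuts[94]}
```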

5. Results and Discussion

This section presents the results of each evaluation task introduced in Section 4, following the same order as in the Methods sections.

5.1. Entity Identification Performance

Table 2 reports NER and MEI F1 scores for singular and plural references across the three novels. All models achieved consistently high performance (F1 > 0.90) in both tasks, demonstrating robust mention detection and accurate entity linking.
Performance on plural references and pronouns is particularly notable. As discussed in Section 2.2, although numerous studies have been conducted on coreference resolution, sometimes plural pronouns have been excluded from evaluation [41]. Even in studies addressing both singular and plural entities jointly, such as Zhou et al. [42], the best reported scores for plural linking were macro F1 = 0.205 and micro F1 = 0.417.
In contrast, our method attains substantially higher scores without additional neural network training, relying solely on the MEI-focused LLM prompting design described in Section 3.2. These results highlight the method’s effectiveness in handling the challenging problem of plural reference linking, while underscoring its efficiency and practical applicability.

5.2. RAG Retrieval Performance

Table 3 compares retrieval performance between structured scene-level data and the original text chunked using a semantic chunking library.
Across all three books, the proposed structured format consistently outperformed the original on both evaluation metrics. Compared to the original format, Winnie-the-Pooh achieved improvements of +34.45 percentage points in Recall@3 and +25.0 percentage points in MRR@3. Similar improvements were observed in the other datasets, with Off the Record nearly doubling its performance and And What Can We Offer You Tonight showing more moderate but still clear gains. These results indicate that scene segmentation with metadata (Section 3.3) enhances retrieval relevance by preserving narrative structure, thereby boosting both accuracy and ranking.

5.3. Speaker Identification Performance

Table 4 compares the performance of Fic2Bot with the BookNLP baseline in terms of Accuracy, macro F1, and micro F1. For Winnie-the-Pooh, the proposed model achieved an Accuracy of 0.9617, representing an improvement of approximately +21.85 percentage points over the baseline. Furthermore, the macro F1 increased from 0.7465 to 0.9995, and the micro F1 improved from 0.7898 to 0.9989, indicating near-perfect attribution performance.
A similar trend was observed for Off the Record, where Fic2Bot achieved an Accuracy of 0.9975 compared to 0.9366 for BookNLP. Macro F1 and micro F1 also improved substantially by +18.63 percentage points and +24.47 percentage points, respectively.
For And What Can We Offer You Tonight, the frequent use of pronouns and implicit speaker references posed significant challenges for speaker attribution, leading to relatively low performance for BookNLP. However, Fic2Bot achieved an Accuracy of 0.9014, an improvement of +60.71 percentage points, while macro F1 and micro F1 increased by +58.50 and +65.53 percentage points, respectively, indicating its effectiveness under such challenging narrative conditions.
These findings demonstrate that the LLM-based prompting approach employed by Fic2Bot is highly effective for speaker–dialogue attribution in literary texts. By leveraging the contextual understanding capabilities of LLMs, Fic2Bot robustly handles implicit speaker cues and complex narrative structures.

5.4. Results of Stylistic Feature Analysis

Table 5 and Table 6 present the lexical similarity analysis across characters. Lexical similarity evaluation serves to verify whether each character employs a distinctive set of vocabulary and expressions. This demonstrates that their personas are clearly separable, providing evidence that such differentiated speaking styles can be faithfully incorporated into chatbot responses. The overall averages for each book remain below 0.14, indicating strong lexical separability and confirming that characters’ speech patterns are sufficiently distinct for style profiling. At the same time, a few character pairs with close narrative ties exhibit relatively high overlap—for instance, Jewel and Nero in And What Can We Offer You Tonight reach a cosine similarity of 0.7099. These cases reflect the natural vocabulary sharing that emerges between characters who are closely intertwined throughout the narrative, rather than indicating a lack of separability.
This suggests that the speaker–utterance segmentation technique described in Section 3.4 enables precise characterization of each character’s speech patterns and can be effectively leveraged for style profiling.
Table 7 presents an example of syntactic feature analysis results, focusing on representative characters from Winnie-the-Pooh among the three books analyzed. The results indicate that Winnie-the-Pooh exhibits the longest average utterance length (9.87 words, 50.86 characters) and the highest proportion of exclamation usage (28.21%), suggesting a more expressive emotional style. These findings imply that each speaker’s linguistic profile can be leveraged for style-based character modeling and maintaining persona consistency.

5.5. Results of Sentiment Analysis

Table 8 presents the sentiment score distribution of the five main characters in Off the Record, classified on a 1–5 scale, where 1 indicates the most negative sentiment and 5 indicates the most positive sentiment, with 3 representing a neutral or moderately balanced tone. The full results for all characters are provided in Appendix D Table A7.
For instance, Phil Jenkins exhibited the most pronounced positive tendency, with 75% of his utterances rated at the maximum score of 5 and no instances in the neutral-to-moderately-positive range (scores 3–4). In contrast, Simone Blake and Paisley McConkie displayed more balanced sentiment profiles, with utterances distributed across negative, neutral, and positive categories. These differences in emotional expression may reflect the distinct narrative roles and personality traits assigned to each character, serving as a key basis for defining emotional consistency and tone in persona construction.

5.6. Results of Persona Consistency Evaluation in Multi-Turn Settings

Table 9 presents the results of the persona consistency evaluation for chatbot responses assessed using LLMs. The generated multi-turn dialogues were evaluated on a 5-point scale for four dimensions—linguistic style, tone, vocabulary, and emotional expression—measuring how closely they resembled the original characters.
The results show that across all novels and scenarios, the chatbot achieved an average score of 3.86 or higher, demonstrating a generally high level of consistency. Moreover, in response to the question, “Do the chatbot-generated utterances clearly match the style of the same character as portrayed in the original novel?”, over 76.7% of cases were judged as Yes, indicating that the chatbot successfully reproduced the characters’ speaking styles.
In particular, Pooh from Winnie-the-Pooh achieved a perfect score of 5.0/5.0 across all scenarios. This outcome suggests that the repetitive and simplified speech patterns of children’s literature characters (e.g., “Oh, bother,” “a Bear of Very Little Brain”) and their exaggerated expressions contributed positively to persona extraction and preservation.
Overall, the proposed framework achieved a high level of persona consistency across diverse genres and characters, with the results indicating that the more salient the linguistic features of a character, the stronger the consistency achieved.

5.7. Results of Module Effectiveness and Efficiency Analysis

5.7.1. Analysis of Module Effectiveness

Table 10 presents the comparison between the baseline and the full Fic2Bot pipeline, showing consistent improvements in persona consistency and dialogue quality across all three novels. To further clarify the source of these gains, we link the baseline comparison with the individual module analyses reported in Section 4.3.
Major Entity Identification (MEI): As reported in Section 5.1, MEI achieved macro F1 scores above 0.94, with notable improvements in handling plural references compared to previous approaches. This module provides a reliable foundation for consistent entity tracking, which is essential for sustaining persona coherence.
Scene-level RAG: Section 5.2 demonstrated that Recall@3 improved by +34–47 percentage points compared to chunk-based retrieval. MRR@3 also increased by +20–44 percentage points, confirming that scene segmentation directly enhances factual grounding in generated responses.
Speaker Attribution and Style Profiling: Section 5.4 showed that our speaker attribution and lexical–syntactic profiling achieved low inter-character similarity (average Jaccard < 0.14), enabling distinct stylistic separability and supporting consistent persona reproduction.
These findings complement the full pipeline comparison in Table 10. For example, And What Can We Offer You Tonight exhibited the most substantial improvement, rising from 3.70 to 4.38 (+0.68), indicating the framework’s effectiveness even in more structurally complex narratives. Winnie-the-Pooh also showed a clear gain, improving from 4.62 to 5.00 (+0.38), suggesting that highly distinctive characters benefit strongly from the integration of all modules. Off the Record showed smaller gains (+0.10), reflecting the difficulty of distinguishing characters in modern conversational settings.
In addition, outputs generated with Fic2Bot tended to contain richer factual details than those from the baseline. This can be attributed to the improved retrieval accuracy of scene-level RAG, which supplied more relevant narrative context to the generation process. However, because the evaluation relied on GPT-4, the scoring may have overemphasized surface-level stylistic features, potentially underestimating factual richness. This limitation highlights the need to complement LLM-based automatic evaluation with human studies to more comprehensively assess both style and factual fidelity.
In summary, the analysis indicates that (i) MEI secures consistent entity tracking, (ii) scene-level RAG strengthens factual accuracy, (iii) style profiling enforces persona distinctiveness, and (iv) their integration within Fic2Bot yields additive improvements in persona consistency and factual grounding across diverse genres.

5.7.2. Efficiency Analysis

Table 11 presents the results of the efficiency analysis conducted as described in Section 4.5.2.
For Winnie-the-Pooh, the response latency was the longest, with a median (p50) of 32.43 s and a 95th percentile (p95) of 37.41 s. This latency can be attributed to the generation of output tokens more than ten times longer than those of the other two novels. The verbose and descriptive speaking style of the character led to lengthy outputs, producing more detailed responses but also incurring the highest cost (USD 0.0053).
For Off the Record, the response latency was p50 14.31 s and p95 19.97 s, with a relatively small gap between the two, indicating stable response times. The average input and output tokens were 2149.02 and 22.95, respectively. The character’s direct and concise speaking style kept the outputs short, resulting in efficient operation at the lowest cost (USD 0.0029).
For And What Can We Offer You Tonight, the response latency was p50 16.20 s and p95 20.50 s, remaining within a narrow range. The average input and output tokens were 2654 and 30, respectively, and the character’s compact and poetic style contributed to shorter outputs, resulting in a relatively low cost (USD 0.0036).
Among the three novels, Off the Record required the longest chatbot build time, approximately 110 minutes. This appears to stem from the frequent scene transitions characteristic of modern drama, which likely imposed additional overhead in the scene-level embedding process. In contrast, children’s literature such as Winnie-the-Pooh has a simpler scene structure, resulting in shorter build times, while And What Can We Offer You Tonight fell between the two. Since chatbots are built only once when a novel is ingested, such differences do not translate into significant burdens for real-time operation. Finally, the baseline, which relies on the entire text without additional retrieval or compression, exhibited shorter response latencies, with p50 and p95 ranging from a few to ten seconds. However, the number of input tokens per query increased excessively, resulting in high costs across all three novels, which could lead to rapid cost escalation in real-time operation.

6. Limitations

Several limitations remain in this study. First, the evaluation was confined to text-based novels, leaving applicability to multimodal narratives such as scripts or graphic novels untested.
Second, while prompt-based methods avoid additional fine-tuning, their heavy reliance on LLM prompting makes them vulnerable to performance variability across versions and providers. Notably, MultiLLM-Chatbot [43] evaluated five major LLM families across diverse metrics and domains, clearly demonstrating model-level differences. Although the overall architecture is still applicable, subtle variations in reasoning, style adherence, and retrieval alignment may cause inconsistencies, highlighting the need for systematic cross-LLM evaluation.
Third, human evaluation of subjective qualities such as fluency, empathy, and engagement was not included. Automated LLM-based judgments offer scalability but cannot fully capture human perception. Large-scale human studies will therefore be essential for validating persona consistency and understanding user experience in practice.
Finally, character style modeling relied on shallow linguistic features such as TF-IDF statistics, sentence length, and exclamation frequency. While informative, these features fail to capture deeper semantic, rhetorical, or narrative-level styles, limiting the richness of character expression. In addition, potential biases in sentiment analysis were not addressed in this study, which should be systematically examined in future work.

7. Conclusions

This paper presented Fic2Bot, an end-to-end framework that automatically transforms raw novel text into in-character chatbots by extracting character-specific linguistic and emotional profiles and integrating these with scene-grounded retrieval. Unlike conventional approaches, Fic2Bot achieves high performance without task-specific fine-tuning, making it both efficient and scalable.
Experiments across three novels demonstrated that the framework improves entity identification, retrieval precision, and persona consistency in multi-turn dialogues. These results highlight its ability to jointly enhance factual grounding and stylistic coherence. The significance of Fic2Bot lies not only in its empirical performance but also in its broader applicability, as it demonstrates scalability across diverse genres and narrative styles while adopting prompting strategies instead of additional training.
However, several promising directions remain for future research. First, generalizability should be validated beyond text-only novels, extending to diverse modalities and languages. Incorporating visual context, supported by recent advances in vision–language models [44] and multimodal RAG approaches [45], may enable richer narrative forms such as illustrated novels, scripts, and graphic narratives. Second, to mitigate dependence on LLM prompting, hybrid architectures that combine prompt-based methods with task-specific pretrained models should be explored, alongside systematic cross-LLM evaluations to assess robustness across model families. Third, large-scale human evaluations are needed to complement automated LLM-based judgments, particularly for subjective qualities such as fluency, empathy, and engagement. Finally, style modeling should move beyond shallow linguistic features (e.g., TF-IDF, sentence length) toward deeper semantic, rhetorical, and narrative-level modeling, while deployment feasibility must be further investigated with attention to efficiency, scalability, and lightweight implementations for real-world applications.

Author Contributions

Conceptualization, S.K., C.L. and S.J.; Funding acquisition, M.L.; Investigation, S.K., C.L. and S.J.; Methodology, S.K., C.L., S.J. and M.L.; Project administration, M.L.; Software, S.K., C.L. and S.J.; Supervision, M.L.; Writing—original draft, S.K., C.L. and S.J.; Writing—review & editing, M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Sungshin Women’s University Research Grant of 2025.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1

This prompt corresponds to the first stage of the MEI task.
Table A1. Example prompt and corresponding input/output for the Coreference Resolution task.
Prompt: You will receive a Text along with a list of Key Entities and their corresponding Cluster IDs as input. Your task is to perform Coreference Resolution on the provided text to categorize “each word belonging to a cluster” with its respective cluster id.
Follow the format below to label a word with its cluster ID:
word#cluster_id
Please keep in mind:
- Do not label general nouns, objects, or places.
- Only tag names, pronouns, or descriptive phrases that refer to characters.
- For example, tag both singular and plural references like: he, she, it, I, you, we, they, them, etc.
- If a pronoun refers to multiple known characters, tag it with all relevant cluster IDs (e.g., they#pooh,#piglet).
- Do not alter the content or order of the original text.
- Ensure the output adheres to the specified format for easy parsing.
- For characters not listed in Key Entities but appearing in the text, assign them the tag: #others
Key Entities: {entity_description}
Text: {text}
Input Example:
Key Entities:
“Winnie-the-Pooh”: “#pooh”,
“Piglet”: “#piglet”
Text: Pooh went to the forest. He saw Piglet. They talked for hours.
Output Example: Pooh#pooh went to the forest. He#pooh saw Piglet#piglet. They#pooh,#piglet talked for hours.

Appendix A.2

Table A2. Prompt for expanding head words into full noun phrases.
Description: This prompt instructs the task of expanding an already tagged head word into its complete noun phrase, marking the full span accordingly.
Prompt: Any word marked with # is the head of a noun phrase. Expand this head by including its full span (e.g., determiners, adjectives, and modifiers). Do not add or remove content beyond bracketing. Wrap the full span using the format: (FULL SPAN)#ClusterID. Only bracket the expanded noun phrase, do not modify other parts of the sentence. If the head is a possessive determiner, ONLY bracket the determiner, NOT the noun it modifies, UNLESS the noun itself has a #ClusterID tag.
Text: {text}
Input Example: Piglet#piglet wished very much that his#piglet Grandfather T. W.#others were there, instead of elsewhere.
Output Example: (Piglet)#piglet wished very much that (his Grandfather T. W.)#others were there, instead of elsewhere.

Appendix A.3

Table A3. Prompt for scene-based structuring after MEI.
DescriptionThis prompt instructs the task of segmenting text data, after completing Major Entity Identification (MEI), into distinct scenes and converting them into a structured JSON format.
Prompt: You are a Scene Annotator. You will receive a story text input from {position}. Your task is to segment the story into distinct scenes based on any change in one or more of the following elements:
- Location
- Time
- Characters present
For each scene,
- “Scene ID”: e.g., “CH5_1”, “CH5_2”
- “Position”: e.g., “CHAPTER_5”
- “Speakers”: list of speakers
- “Scene”: list of strings (full scene text including narration and dialogue)
- “Dialogue”: list of “Speaker: utterance” lines only
Use consistent and full speaker names in both the “Speakers” list and the “Dialogue” lines. Do not use shortened or partial names.
Refer to the following character list and ID mapping, and always use the full name when writing speaker names:
{character_entities}
Use this output format:
[{"Scene ID": "CH5_1", "Position": "CHAPTER_5", "Speakers": ["Winnie-the-Pooh", "Christopher Robin"], "Scene": ["Winnie-the-Pooh walked through the forest...", "\"Hello!\" said Christopher Robin.", "\"Hi!\" said Winnie-the-Pooh."], "Dialogue": ["Christopher Robin: Hello!", "Winnie-the-Pooh: Hi!"]}]
Now annotate the following chapter using SceneML:
{{chapter_text}}
Input Example:
Position: CHAPTER_5
Character Entities:
“Winnie-the-Pooh”: “pooh”,
“Christopher Robin”: “robin”
Text: Winnie-the-Pooh walked through the forest. “Hello!” said Christopher Robin. “Hi!” said Winnie-the-Pooh.
Output Example:
[{"Scene ID": "CH5_1", "Position": "CHAPTER_5", "Speakers": ["Winnie-the-Pooh", "Christopher Robin"], "Scene": ["Winnie-the-Pooh walked through the forest.", "\"Hello!\" said Christopher Robin.", "\"Hi!\" said Winnie-the-Pooh."], "Dialogue": ["Christopher Robin: Hello!", "Winnie-the-Pooh: Hi!"]}]
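The structured scenes can then be consumed directly by downstream steps. The sketch below is our illustration (field names taken from the prompt's output format above, everything else assumed) of collecting each character's utterances, the input later used for style and sentiment profiling:

import json
from collections import defaultdict

# Minimal sketch: read scene-structured JSON and collect utterances per speaker.
scenes = json.loads('[{"Scene ID": "CH5_1", "Position": "CHAPTER_5", '
                    '"Speakers": ["Winnie-the-Pooh", "Christopher Robin"], '
                    '"Scene": ["Winnie-the-Pooh walked through the forest."], '
                    '"Dialogue": ["Christopher Robin: Hello!", "Winnie-the-Pooh: Hi!"]}]')

utterances = defaultdict(list)
for scene in scenes:
    for line in scene["Dialogue"]:
        speaker, utterance = line.split(": ", 1)
        utterances[speaker].append(utterance)

print(dict(utterances))
# {'Christopher Robin': ['Hello!'], 'Winnie-the-Pooh': ['Hi!']}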

Appendix B

Table A4. Prompt for question rewriting.
Description: Instructs to rewrite the question while preserving its intent, maintaining the question format, and making it suitable for retrieval.
Prompt: You are a helpful assistant that rewrites user questions to make them more effective for document retrieval systems.
Instructions: Please rewrite the question below to:
  • Remove ambiguity while preserving the original intent.
  • Keep the format as a question, not a declarative sentence.
  • Avoid guessing specific contexts not mentioned in the question.
  • Use concise and keyword-rich phrasing suitable for retrieval.
Original Question: {original_query}
Rewritten Question: {rewritten_query}
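As an illustration of how this template might be wired into the retrieval pipeline, the sketch below fills the prompt and delegates to an LLM client; chat is a hypothetical completion function, not part of the framework:

# Minimal sketch: fill the Table A4 template before the LLM call.
# `chat` is a hypothetical completion function (assumption).
REWRITE_PROMPT = (
    "You are a helpful assistant that rewrites user questions to make them "
    "more effective for document retrieval systems.\n"
    "Please rewrite the question below to:\n"
    "- Remove ambiguity while preserving the original intent.\n"
    "- Keep the format as a question, not a declarative sentence.\n"
    "- Avoid guessing specific contexts not mentioned in the question.\n"
    "- Use concise and keyword-rich phrasing suitable for retrieval.\n"
    "Original Question: {original_query}\n"
    "Rewritten Question:"
)

def rewrite_query(original_query, chat):
    return chat(REWRITE_PROMPT.format(original_query=original_query))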

Appendix C

Appendix C provides the prompt templates used in Section 4.4 for persona consistency evaluation. They were applied to simulate user utterances in different scenarios and to guide GPT-4 in judging consistency. Each subsection presents the exact prompt texts given to the models.
Table A5. User prompt guidelines for each scenario.
Ordinary conversation: You are simulating the USER in a two-person chat. Write the user’s next single message in a neutral tone. Avoid role labels. Keep it short (1–2 sentences).
Emotional conversation: You are simulating the USER in a two-person chat. Write the user’s next single message with emotional intensity, responding to a conflict or disagreement. Avoid role labels. Keep it short (1–2 sentences).
Fact-based conversation (first): You are the USER inside a novel scene. Read the scene excerpt(s) below and write your first chat message as if you are physically present in that situation, addressing {character_name} directly.
– The message should be a curious or probing question about what’s happening in the scene now.
– Use 1–2 short sentences. Avoid role labels.
— Scene Excerpts —
{Scene}
Now write just the single user message (no quotes, no labels).
Fact-based conversation (follow-up): You are the USER inside the same ongoing novel scene. Stay fully in-world and speak as if you are physically here with the characters right now.
Rules:
· React directly to the assistant’s last line in-character.
· Use present tense; deictic words like here/now/this are fine if natural.
· Keep it to 1–2 short sentences.
· No role labels, no quotes, no stage directions, no meta-commentary.
· If possible, reference a concrete detail from the assistant’s line.
Table A6. Evaluation guidelines for character consistency analysis.
Role: You are an expert evaluator for character consistency analysis. You will be shown utterances from THREE different characters: Paisley, Hudson, and Leo. Each block contains multiple utterances representing that character’s style. You will then be shown a [Target] block of utterances.
Task:
1. Compare [Target] against ALL THREE character styles (Paisley, Hudson, Leo).
2. Based on similarity across four aspects, decide whether [Target] matches Paisley specifically.
3. Output ONLY in the format below.
Evaluation Criteria (1–5 scale):
1. Linguistic Style: Word choice patterns, punctuation usage (periods, ellipses, exclamations, questions), formality level, sentence length
2. Tone: Emotional attitude (formal/casual, warm/cold), confidence level (certain vs hesitant)
3. Vocabulary: Characteristic or idiosyncratic words/phrases, simplicity vs complexity, catchphrases or signature expressions
4. Emotional Expression: How the character expresses feelings (worry, joy, curiosity, etc.)
5. Dialogue Coherence: Appropriateness of the response to preceding user utterance, variability of expression (avoid repetition)
Guidelines:
- Compare [Target] against ALL THREE characters, not just Paisley.
- “Yes” only if [Target] clearly matches Paisley’s stylistic and tonal patterns better than Hudson or Leo.
- If evidence is weak or ambiguous, reflect lower confidence.
- In the reason, cite specific linguistic evidence.
Output Format:
“Linguistic Style”: 1–5,
“Tone”: 1–5,
“Vocabulary”: 1–5,
“Emotional Expression”: 1–5,
“Dialogue Coherence”: 1–5,
“average”: (rounded average of the four scores, 1 decimal),
“answer”: “Yes|No”,
“confidence”: “0–100%”,
“reason”: “≤50 words explaining your decision with reference to the criteria”
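For completeness, the sketch below shows one way (our own plumbing, not the authors' evaluation script) to parse the judge's JSON output and aggregate per-dialogue results into the Match rate, average score, and confidence reported in Table 9:

import json
import statistics

# Minimal sketch: aggregate per-dialogue judge outputs into summary statistics.
def aggregate(judgements):
    parsed = [json.loads(j) for j in judgements]
    return {
        "match_rate": sum(p["answer"] == "Yes" for p in parsed) / len(parsed),
        "avg_score": statistics.mean(p["average"] for p in parsed),
        "avg_confidence": statistics.mean(float(p["confidence"].rstrip("%")) for p in parsed),
    }

sample = ['{"Linguistic Style": 5, "Tone": 5, "Vocabulary": 5, "Emotional Expression": 5, '
          '"Dialogue Coherence": 5, "average": 5.0, "answer": "Yes", "confidence": "100%", '
          '"reason": "Consistently matches the target style."}']
print(aggregate(sample))
# {'match_rate': 1.0, 'avg_score': 5.0, 'avg_confidence': 100.0}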

Appendix D

This table presents the sentiment score distribution for the characters in Off the Record, derived from the emotion analysis described in the study.
Table A7. Sentiment score distribution for the characters in Off the Record (%).
Character | 1 Star | 2 Stars | 3 Stars | 4 Stars | 5 Stars
Andrea | 27.78 | 8.33 | 22.22 | 5.56 | 36.11
Carrie | 33.33 | 4.17 | 25.00 | 8.33 | 29.17
Dalton | 33.33 | 16.67 | 33.33 | 0.00 | 16.67
Dorian | 42.86 | 14.29 | 14.29 | 14.29 | 14.29
Hudson Owens | 23.58 | 6.50 | 31.30 | 17.07 | 21.54
Kyla Langford | 0.00 | 0.00 | 100.00 | 0.00 | 0.00
Leo Davis | 20.00 | 5.00 | 15.00 | 5.00 | 55.00
Linus | 28.57 | 14.29 | 14.29 | 14.29 | 28.57
Lucy | 0.00 | 0.00 | 0.00 | 50.00 | 50.00
Luke | 100.00 | 0.00 | 0.00 | 0.00 | 0.00
Moria Owens | 50.00 | 50.00 | 0.00 | 0.00 | 0.00
Paisley McConkie | 29.50 | 10.25 | 26.09 | 9.94 | 24.22
Phil Jenkins | 25.00 | 0.00 | 0.00 | 0.00 | 75.00
Prescott | 0.00 | 0.00 | 66.67 | 0.00 | 33.33
Simone Blake | 29.76 | 7.14 | 27.38 | 11.90 | 23.81
Stan | 40.00 | 20.00 | 20.00 | 0.00 | 20.00
Tina | 25.00 | 12.50 | 0.00 | 25.00 | 37.50
Ralph | 0.00 | 0.00 | 0.00 | 0.00 | 100.00
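To summarize such a distribution in a single number, each row can be collapsed into an expected star rating. A minimal sketch using two rows from Table A7 (the summary statistic itself is our illustration, not a reported metric):

# Minimal sketch: expected star rating from a 1-5 star percentage distribution.
distributions = {
    "Paisley McConkie": [29.50, 10.25, 26.09, 9.94, 24.22],
    "Leo Davis": [20.00, 5.00, 15.00, 5.00, 55.00],
}
for name, dist in distributions.items():
    expected = sum(star * pct for star, pct in enumerate(dist, start=1)) / 100
    print(f"{name}: {expected:.2f} stars")
# Paisley McConkie: 2.89 stars
# Leo Davis: 3.70 stars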

Appendix E

Figure A1. Comparative Analysis of Star Ratings and Linguistic Features Across Conversation Types. Abbreviations: Wds = Avg. Words; Chars = Avg. Characters; Excl = Exclamation; Ques = Question; Neg = Negation; 1stP = First-Person.

References

  1. Mardiana, N.; Da, R.O. The Phenomenon of Use of Chatbot Character AI by ARMY-BTS. In Proceedings of the International Seminar Enrichment of Career by Knowledge of Language and Literature, Tokyo, Japan, 23–25 November 2024; pp. 65–74. [Google Scholar]
  2. Han, S.; Kim, B.; Yoo, J.Y.; Seo, S.; Kim, S.; Erdenee, E.; Chang, B. Meet your favorite character: Open-domain chatbot mimicking fictional characters with only a few utterances. arXiv 2022, arXiv:2204.10825. [Google Scholar] [CrossRef]
  3. Wang, X.; Wang, H.; Zhang, Y.; Yuan, X.; Xu, R.; Huang, J.-t.; Yuan, S.; Guo, H.; Chen, J.; Zhou, S. Coser: Coordinating llm-based persona simulation of established roles. arXiv 2025, arXiv:2502.09082. [Google Scholar]
  4. Character.AI. Available online: https://character.ai (accessed on 9 July 2025).
  5. Liu, Y.; Wei, W.; Liu, J.; Mao, X.; Fang, R.; Chen, D. Improving personality consistency in conversation by persona extending. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–22 October 2022; pp. 1350–1359. [Google Scholar]
  6. Xue, B.; Wang, W.; Wang, H.; Mi, F.; Wang, R.; Wang, Y.; Shang, L.; Jiang, X.; Liu, Q.; Wong, K.-F. Improving Factual Consistency for Knowledge-Grounded Dialogue Systems via Knowledge Enhancement and Alignment. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 7829–7844. [Google Scholar]
  7. Kovacevic, N.; Boschung, T.; Holz, C.; Gross, M.; Wampfler, R. Chatbots with attitude: Enhancing chatbot interactions through dynamic personality infusion. In Proceedings of the 6th ACM Conference on Conversational User Interfaces, Waterloo, ON, Canada, 8–10 July 2024; pp. 1–16. [Google Scholar]
  8. Wang, Y.; Leung, J.; Shen, Z. RoleRAG: Enhancing LLM Role-Playing via Graph Guided Retrieval. arXiv 2025, arXiv:2505.18541. [Google Scholar]
  9. Liu, D.; Wu, Z.; Song, D.; Huang, H. A Persona-Aware LLM-Enhanced Framework for Multi-Session Personalized Dialogue Generation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 27 July–1 August 2025; pp. 103–123. [Google Scholar]
  10. Zhu, S.; Ma, T.; Rong, H.; Al-Nabhan, N. A Personalized Multi-Turn Generation-Based Chatbot with Various-Persona-Distribution Data. Appl. Sci. 2023, 13, 3122. [Google Scholar] [CrossRef]
  11. Vishnubhotla, K.; Rudzicz, F.; Hirst, G.; Hammond, A. Improving Automatic Quotation Attribution in Literary Novels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; pp. 737–746. [Google Scholar]
  12. Wang, N.; Peng, Z.y.; Que, H.; Liu, J.; Zhou, W.; Wu, Y.; Guo, H.; Gan, R.; Ni, Z.; Yang, J.; et al. RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 14743–14777. [Google Scholar]
  13. Hong, M.; Zhang, C.J.; Chen, C.; Lian, R.; Jiang, D. Dialogue Language Model with Large-Scale Persona Data Engineering. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, Albuquerque, NM, USA, 29 April–4 May 2025; pp. 961–970. [Google Scholar]
  14. Liu, Y.; Zhang, Y.; Patel, S.O.; Zhu, Z.; Guo, S. HonkaiChat: Companions from Anime that feel alive! arXiv 2025, arXiv:2501.03277. [Google Scholar] [CrossRef]
  15. Lee, O.; Joseph, K. A large-scale analysis of public-facing, community-built chatbots on Character.AI. arXiv 2025, arXiv:2505.13354. [Google Scholar]
  16. Lu, K.; Yu, B.; Zhou, C.; Zhou, J. Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; pp. 7828–7840. [Google Scholar]
  17. Li, C.; Leng, Z.; Yan, C.; Shen, J.; Wang, H.; Mi, W.; Fei, Y.; Feng, X.; Yan, S.; Wang, H. Chatharuhi: Reviving anime character in reality via large language model. arXiv 2023, arXiv:2308.09597. [Google Scholar] [CrossRef]
  18. Roesiger, I.; Schulz, S.; Reiter, N. Towards Coreference for Literary Text: Analyzing Domain-Specific Phenomena. In Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Santa Fe, NM, USA, 25 August 2018; pp. 129–138. [Google Scholar]
  19. Gardner, M.; Grus, J.; Neumann, M.; Tafjord, O.; Dasigi, P.; Liu, N.F.; Peters, M.; Schmitz, M.; Zettlemoyer, L. AllenNLP: A Deep Semantic Natural Language Processing Platform. In Proceedings of the Workshop for NLP Open Source Software (NLP-OSS), Melbourne, Australia, 15–20 July 2018; pp. 1–6. [Google Scholar]
  20. Joshi, M.; Levy, O.; Zettlemoyer, L.; Weld, D. BERT for Coreference Resolution: Baselines and Analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 5803–5808. [Google Scholar]
  21. Joshi, M.; Chen, D.; Liu, Y.; Weld, D.S.; Zettlemoyer, L.; Levy, O. Spanbert: Improving pre-training by representing and predicting spans. Trans. Assoc. Comput. Linguist. 2020, 8, 64–77. [Google Scholar] [CrossRef]
  22. BookNLP. BookNLP: A Natural Language Processing Pipeline for Books. GitHub Repository. 2023. Available online: https://github.com/booknlp/booknlp (accessed on 9 August 2025).
  23. Bamman, D.; Lewke, O.; Mansoor, A. An Annotated Dataset of Coreference in English Literature. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 44–54. [Google Scholar]
  24. Sundar, K.M.; Toshniwal, S.; Tapaswi, M.; Gandhi, V. Major Entity Identification: A Generalizable Alternative to Coreference Resolution. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 11679–11695. [Google Scholar]
  25. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
  26. Fan, W.; Ding, Y.; Ning, L.; Wang, S.; Li, H.; Yin, D.; Chua, T.-S.; Li, Q. A survey on rag meeting llms: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 6491–6501. [Google Scholar]
  27. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, H.; Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv 2023, arXiv:2312.10997. [Google Scholar]
  28. Olsson, F. Beyond the Basics: Advanced Retrieval Techniques for RAG Systems: Assessing the Impact of Sentence-Window Retrieval and Auto-Merging Retrieval on the Performance of a RAG System in a Swedish Management Consulting Company. Master’s Thesis, KTH, School of Electrical Engineering and Computer Science, Stockholm, Sweden, 2024; p. 53. [Google Scholar]
  29. Swacha, J.; Gracel, M. Retrieval-Augmented Generation (RAG) Chatbots for Education: A Survey of Applications. Appl. Sci. 2025, 15, 4234. [Google Scholar] [CrossRef]
  30. Akkiraju, R.; Xu, A.; Bora, D.; Yu, T.; An, L.; Seth, V.; Shukla, A.; Gundecha, P.; Mehta, H.; Jha, A. Facts about building retrieval augmented generation-based chatbots. arXiv 2024, arXiv:2407.07858. [Google Scholar] [CrossRef]
  31. Khan, U.A.; Khan, F.; Khan, E.; Hasnain, M.A.; Moinuddin, A.A. Towards Efficient Educational Chatbots: Benchmarking RAG Frameworks. In Proceedings of the 2025 International Conference on Emerging Technologies in Computing and Communication (ETCC), Bangalore, India, 26–27 June 2025; pp. 1–6. [Google Scholar]
  32. Gargari, O.K.; Habibi, G. Enhancing medical AI with retrieval-augmented generation: A mini narrative review. Digit. Health 2025, 11, 20552076251337177. [Google Scholar] [CrossRef]
  33. Khasanova Zafar kizi, M.; Suh, Y. Design and Performance Evaluation of LLM-Based RAG Pipelines for Chatbot Services in International Student Admissions. Electronics 2025, 14, 3095. [Google Scholar] [CrossRef]
  34. Alrashid, T.; Gaizauskas, R.J. A Pilot Study on Annotating Scenes in Narrative Text using SceneML. In Proceedings of the Text2Story Workshop at ECIR, Online, 1 April 2021; pp. 7–14. [Google Scholar]
  35. Zeng, N.; Hou, H.; Yu, F.R.; Shi, S.; He, Y.T. SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding. arXiv 2025, arXiv:2506.07600. [Google Scholar]
  36. Lee, J.; Chen, F.; Dua, S.; Cer, D.; Shanbhogue, M.; Naim, I.; Ábrego, G.H.; Li, Z.; Chen, K.; Vera, H.S. Gemini embedding: Generalizable embeddings from gemini. arXiv 2025, arXiv:2503.07891. [Google Scholar] [CrossRef]
  37. FAISS. Available online: https://faiss.ai/ (accessed on 8 August 2025).
  38. BAAI. Bge-Reranker-Base. Hugging Face. 2023. Available online: https://huggingface.co/BAAI/bge-reranker-base (accessed on 9 August 2025).
  39. NLPTown. BERT Base Multilingual Uncased Sentiment. Available online: https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment (accessed on 8 August 2025).
  40. Li, Y.; Su, H.; Shen, X.; Li, W.; Cao, Z.; Niu, S. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, Taipei, Taiwan, 27 November–1 December 2017; Volume 1: Long Papers, pp. 986–995. [Google Scholar]
  41. Chen, H.Y.; Zhou, E.; Choi, J.D. Robust coreference resolution and entity linking on dialogues: Character identification on tv show transcripts. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, BC, Canada, 3–4 August 2017; pp. 216–225. [Google Scholar]
  42. Zhou, E.; Choi, J.D. They exist! introducing plural mentions to coreference resolution and entity linking. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 24–34. [Google Scholar]
  43. Chakraborty, S.; Chowdhury, R.; Shuvo, S.R.; Chatterjee, R.; Roy, S. A scalable framework for evaluating multiple language models through cross-domain generation and hallucination detection. Sci. Rep. 2025, 15, 29981. [Google Scholar] [CrossRef]
  44. Yu, X.; Yoo, S.; Lin, Y. Clipceil: Domain generalization through clip via channel refinement and image-text alignment. Adv. Neural Inf. Process. Syst. 2024, 37, 4267–4294. [Google Scholar]
  45. Joshi, P.; Gupta, A.; Kumar, P.; Sisodia, M. Robust Multi Model RAG Pipeline for Documents Containing Text, Table & Images. In Proceedings of the 2024 3rd International Conference on Applied Artificial Intelligence and Computing (ICAAIC), Rajkot, India, 27–29 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 993–999. [Google Scholar]
Figure 1. Overall architecture of the Fic2Bot framework, which consists of parallel factual and persona processing paths that are integrated in the final generation stage.
Figure 2. Two-stage Major Entity Identification process. The original story text is first processed with Prompt 1 for word-level mention detection and Prompt 2 for span expansion and entity linking. Using the LLM, all mentions are linked to major characters, resulting in a structured entity-annotated text for downstream tasks.
Figure 3. Proposed RAG pipeline optimized for character-consistent chatbot generation.
Figure 4. Final prompt used for generating chatbot responses, which integrates the results of scene retrieval and persona extraction.
Figure 5. Character-specific linguistic patterns visualized through distinctive keywords. This word cloud not only highlights distinctive lexical tendencies but also illustrates the extracted utterances that served as input for subsequent analyses, including sentiment evaluation.
Table 1. Basic statistical information of the novels used as experimental data in this study.
  | Winnie-the-Pooh | Off the Record | And What Can We Offer You Tonight
Author | A. A. Milne | Kasey Stockton | Premee Mohamed
Genre | Child | Romance Comedy | Revenge Drama
Tokens 1 | 28,228 | 24,795 | 29,466
Narrative Perspective | third-person point of view | first-person point of view | first-person point of view
Main Characters 2 | 12 | 18 | 9
1 Token counts are based on the gemini-2.5-pro preview model. 2 Number of characters appearing more than five times.
Table 2. NER and MEI F1 scores for each book.
Book | NER F1 | MEI Singular Macro F1 | MEI Singular Micro F1 | MEI Plural Macro F1 | MEI Plural Micro F1
Winnie-the-Pooh | 0.9933 | 0.9570 | 0.9954 | 0.9460 | 0.9510
Off the Record | 0.9914 | 0.9620 | 0.9908 | 0.9490 | 0.9518
And What Can We Offer You Tonight | 0.9738 | 0.9633 | 0.9650 | 0.9389 | 0.9463
Table 3. Comparison of retrieval performance (Recall@3 and MRR@3) between structured (ours) and original formats for each book.
Book | Version | Recall@3 | MRR@3
Winnie-the-Pooh | Structured | 0.7667 | 0.7333
Winnie-the-Pooh | Original | 0.4222 | 0.4833
Off the Record | Structured | 0.7667 | 0.7000
Off the Record | Original | 0.3000 | 0.2611
And What Can We Offer You Tonight | Structured | 0.5000 | 0.4389
And What Can We Offer You Tonight | Original | 0.2333 | 0.2333
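For reference, a minimal sketch of how Recall@3 and MRR@3 are conventionally computed with one gold scene per query (our reading of the standard definitions, not the authors' evaluation script):

# Minimal sketch: Recall@k and MRR@k with one gold scene per query.
def recall_at_k(retrieved, relevant, k=3):
    return float(relevant in retrieved[:k])

def mrr_at_k(retrieved, relevant, k=3):
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc == relevant:
            return 1.0 / rank
    return 0.0

queries = [(["CH5_1", "CH2_3", "CH1_1"], "CH2_3")]  # (retrieved scene IDs, gold scene ID)
print(sum(recall_at_k(r, g) for r, g in queries) / len(queries))  # 1.0
print(sum(mrr_at_k(r, g) for r, g in queries) / len(queries))     # 0.5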
Table 4. Accuracy and F1 scores of speaker attribution models across books. BookNLP is used as the baseline, while Fic2Bot represents our proposed model. All models are evaluated on the task of matching utterances to their corresponding speakers.
Book | System | Accuracy | Macro F1 | Micro F1
Winnie-the-Pooh | BookNLP | 0.7432 | 0.7465 | 0.7898
Winnie-the-Pooh | Fic2Bot | 0.9617 | 0.9995 | 0.9989
Off the Record | BookNLP | 0.9366 | 0.8146 | 0.7458
Off the Record | Fic2Bot | 0.9975 | 0.9979 | 0.9905
And What Can We Offer You Tonight | BookNLP | 0.2943 | 0.2712 | 0.3217
And What Can We Offer You Tonight | Fic2Bot | 0.9014 | 0.8562 | 0.9770
Table 5. Character style separability across books based on Jaccard and cosine similarity. Lower scores indicate better separation of character-specific styles.
Book | Avg. Jaccard Similarity | Avg. Cosine Similarity
Winnie-the-Pooh | 0.0495 | 0.1010
Off the Record | 0.0213 | 0.0325
And What Can We Offer You Tonight | 0.1317 | 0.0427
Table 6. Jaccard and cosine similarity for character pairs in various books.
Book | Character Pair | Jaccard | Cosine
Winnie-the-Pooh | Pooh vs. Piglet | 0.1111 | 0.2664
Winnie-the-Pooh | Pooh vs. Robin | 0.0000 | 0.0000
Off the Record | Paisley vs. Hudson | 0.3333 | 0.4966
Off the Record | Paisley vs. Leo | 0.0526 | 0.0566
And What Can We Offer You Tonight | Jewel vs. Winfield | 0.1765 | 0.4386
And What Can We Offer You Tonight | Jewel vs. Nero | 0.4286 | 0.7099
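The sketch below shows one plausible computation of the two measures over character keyword profiles; the feature choice (keyword sets for Jaccard, term-frequency vectors for cosine) is our assumption, and the example keywords are illustrative:

from collections import Counter
import math

# Minimal sketch: Jaccard over keyword sets, cosine over term-frequency vectors.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

pooh = {"honey", "bother", "hums"}
piglet = {"small", "brave", "hums"}
print(jaccard(pooh, piglet))                   # 0.2
print(cosine(Counter(pooh), Counter(piglet)))  # ~0.333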
Table 7. Stylistic feature statistics for each character. Abbreviations: Wds = Avg. Words; Chars = Avg. Characters; Excl = Exclamation; Ques = Question; Neg = Negation; 1stP = First-Person.
Character | Wds | Chars | Excl (%) | Ques (%) | Neg (%) | 1stP (%)
Narrator | 6.8621 | 33.0000 | 0.0000 | 20.6897 | 13.7931 | 41.3793
Christopher_Robin | 8.1127 | 40.8803 | 12.6761 | 41.5493 | 4.9296 | 46.4789
Winnie-the-Pooh | 9.8652 | 50.8560 | 28.2132 | 26.3323 | 11.2853 | 57.3668
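A minimal sketch of extracting such surface features from a character's utterances; the tokenization and the first-person/negation word lists are our assumptions, not the paper's exact feature extractor:

# Minimal sketch: per-character stylistic surface features as in Table 7.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
NEGATIONS = {"not", "no", "never", "nothing", "nobody"}

def style_features(utterances):
    n = len(utterances)
    tokens = [t.strip(".,!?\"'").lower() for u in utterances for t in u.split()]
    return {
        "avg_words": len(tokens) / n,
        "avg_chars": sum(len(u) for u in utterances) / n,
        "excl_pct": 100 * sum("!" in u for u in utterances) / n,
        "ques_pct": 100 * sum("?" in u for u in utterances) / n,
        "neg_pct": 100 * sum(any(t.strip(".,!?\"'").lower() in NEGATIONS for t in u.split()) for u in utterances) / n,
        "first_person_pct": 100 * sum(any(t.strip(".,!?\"'").lower() in FIRST_PERSON for t in u.split()) for u in utterances) / n,
    }

print(style_features(["Oh, bother!", "I think we should go home, Piglet."]))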
Table 8. Sentiment score distribution of characters from Off the Record, based on 1–5 star ratings assigned by the sentiment analysis model.
Character | 1 Star | 2 Stars | 3 Stars | 4 Stars | 5 Stars
Paisley McConkie | 28.57 | 14.29 | 14.29 | 14.29 | 28.57
Hudson Owens | 23.58 | 6.50 | 31.30 | 17.07 | 21.54
Simone Blake | 29.76 | 7.14 | 27.38 | 11.90 | 23.81
Leo Davis | 20.00 | 5.00 | 15.00 | 5.00 | 55.00
Phil Jenkins | 25.00 | 0.00 | 0.00 | 0.00 | 75.00
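Such a distribution can be produced with the publicly available 1–5 star sentiment model cited as [39]. The sketch below is our illustration of the scoring step, not the exact experimental script; the example utterances are invented:

from collections import Counter
from transformers import pipeline

# Minimal sketch: 1-5 star sentiment scoring of character utterances [39].
sentiment = pipeline("sentiment-analysis",
                     model="nlptown/bert-base-multilingual-uncased-sentiment")

utterances = ["I can't believe you printed that story.",
              "This is the best news we've had all year!"]
stars = Counter(result["label"] for result in sentiment(utterances))
total = sum(stars.values())
print({label: round(100 * count / total, 2) for label, count in stars.items()})
# e.g., {'1 star': 50.0, '5 stars': 50.0}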
Table 9. Evaluation results of persona consistency across novels and conversation types. Abbreviations: Or (Ordinary), Em (Emotional), Fa (Fact-based), Sty = Linguistic Style, Vcb = Vocabulary, Emo = Emotional Expression, Dial = Dialogue Coherence, Conf. = Confidence (%).
Novel | Type | Match | Sty | Tone | Vcb | Emo | Avg. | Conf.
Winnie-the-Pooh | Or | 100% | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 | 100.0%
Winnie-the-Pooh | Em | 100% | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 | 100.0%
Winnie-the-Pooh | Fa | 100% | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 | 100.0%
Off the Record | Or | 76.7% | 4.13 | 4.13 | 3.97 | 3.87 | 4.15 | 79.5%
Off the Record | Em | 100% | 4.03 | 3.93 | 3.23 | 4.17 | 3.87 | 79.7%
Off the Record | Fa | 100% | 4.13 | 4.07 | 3.77 | 4.00 | 4.01 | 81.2%
And What Can We Offer You Tonight | Or | 100% | 4.50 | 4.50 | 4.47 | 4.42 | 4.47 | 87.6%
And What Can We Offer You Tonight | Em | 96.7% | 4.00 | 4.00 | 3.43 | 4.00 | 3.86 | 79.3%
And What Can We Offer You Tonight | Fa | 100% | 4.20 | 4.20 | 3.77 | 4.23 | 4.12 | 82.7%
Table 10. Ablation study results across novels. Abbreviations: Sty = Linguistic Style, Vcb = Vocabulary, Emo = Emotional Expression, Dial = Dialogue Coherence, Conf. = Confidence (%).
Novel | Match | Sty | Tone | Vcb | Emo | Dial | Conf. | Recall@3 | MRR@3
Winnie-the-Pooh | 100% | 4.60 | 4.60 | 4.73 | 4.43 | 4.73 | 92.2 | 0.4222 | 0.4833
Winnie-the-Pooh+Fic2Bot | 100% | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 | 100 | 0.7667 | 0.7333
Off the Record | 76.7% | 4.07 | 3.87 | 3.83 | 3.77 | 4.70 | 79.5 | 0.3000 | 0.2611
Off the Record+Fic2Bot | 76.7% | 4.13 | 4.13 | 3.97 | 3.87 | 4.67 | 80 | 0.7667 | 0.7000
And What Can We Offer You Tonight | 66.7% | 3.78 | 3.79 | 3.68 | 3.54 | 3.73 | 74 | 0.2333 | 0.2333
And What Can We Offer You Tonight+Fic2Bot | 100% | 4.50 | 4.50 | 4.47 | 4.42 | 4.00 | 87.6 | 0.5000 | 0.4389
Table 11. Latency and resource analysis across novels. Build denotes the time required to build chatbots, measured in seconds.
Novel | Build (s) | p50 | p95 | Token (In) | Token (Out) | Cost (USD)
Winnie-the-Pooh | - | 8.72 | 15.61 | 44,182.61 | 37.93 | 0.0556
Winnie-the-Pooh+Fic2Bot | 4219.88 | 32.43 | 37.41 | 1852.50 | 299.80 | 0.0053
Off the Record | - | 4.51 | 14.08 | 50,001.85 | 39.18 | 0.0629
Off the Record+Fic2Bot | 6587.48 | 14.31 | 19.97 | 2149.02 | 22.95 | 0.0029
And What Can We Offer You Tonight | - | 7.10 | 16.90 | 40,289 | 40 | 0.0508
And What Can We Offer You Tonight+Fic2Bot | 5835.21 | 16.20 | 20.50 | 2654 | 30 | 0.0036
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
