Enhancing Character-Coherent Role-Playing Dialogue with a Verifiable Emotion Reward
Abstract
1. Introduction
- Inadequate Data Diversity, Dialogue Depth, and Role Conditioning: Most datasets and fine-tuning methods cover dialogues of fewer than ten turns, which is too short to verify model coherence and stability in scenarios spanning 20–50 turns. Existing datasets also skew toward varied topics with neutral emotions, which hinders nuanced emotion regulation and dynamic emotion generation under complex conditions. Moreover, prevalent approaches such as zero-shot or few-shot prompting often neglect fine-grained character trait modeling because they lack systematic character profiles and behavioral guidelines, limiting the long-term maintenance of a consistent role identity and personality.
- Context Memory and Knowledge Updating Deficiencies: Although retrieval augmentation and memory modules have been explored, most methods lack a unified integration of retrieval, memory, and generation. The result is information redundancy, noisy retrieval, and delayed responses, which undermine the effective incorporation of dialogue history and external knowledge.
- Incomplete Evaluation Frameworks: Conventional metrics such as BLEU [28] and ROUGE [29] correlate poorly with human preferences. Existing evaluation protocols also suffer from small sample sizes, short dialogues, and inconsistent evaluation dimensions, resulting in deficient assessments of coherence, role fidelity, and emotional dynamics.
- The CHARCO dataset offers a comprehensive collection of diverse, emotionally rich, and character-coherent dialogues designed for multi-turn role-playing with large language models. These dialogues are supported by a three-level quality filtering pipeline (heuristic rules, GPT-4 scoring, and human spot-checks) to ensure high data quality, minimal redundancy, and consistency in dialogue generation.
- A novel retrieval-augmented memory module improves contextual awareness by updating knowledge dynamically, reducing redundancy, and ensuring relevance in long-form dialogues.
- Fine-tuning on the CHARCO dataset enables smaller models to achieve role-playing capabilities comparable to GPT-4, significantly enhancing role fidelity, emotional diversity, and coherence in multi-turn dialogues.
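The three-level quality filtering pipeline mentioned above (heuristic rules, GPT-4 scoring, and human spot-checks) can be sketched as follows. This is a minimal illustration under assumed interfaces: the function names, the minimum-turn rule, and the score threshold are placeholders, not the paper's implementation; the judge model is passed in as an opaque `score_fn`.

```python
# Illustrative sketch of a three-stage quality filter:
# stage 1 = cheap heuristic rules, stage 2 = LLM-judge scoring,
# stage 3 (human spot-checks) would sample from the survivors.
# All names and thresholds are hypothetical.

def heuristic_pass(dialogue, min_turns=20):
    """Stage 1: rule-based checks (length, empty turns, verbatim repeats)."""
    if len(dialogue) < min_turns:
        return False
    if any(not turn.strip() for turn in dialogue):
        return False
    # Reject dialogues in which any utterance is repeated verbatim.
    return len(set(dialogue)) == len(dialogue)

def filter_dialogues(dialogues, score_fn, threshold=7.0):
    """Stage 2: keep dialogues whose judge score clears the bar."""
    survivors = [d for d in dialogues if heuristic_pass(d)]
    return [d for d in survivors if score_fn(d) >= threshold]
```

In practice, stage 3 would draw a random sample from the output of `filter_dialogues` for human review rather than inspecting every dialogue.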
2. Proposed Framework
2.1. Preliminaries for Multi-Turn Dialogue in Large Language Models
2.2. Emotion-Consistent Reward Optimization
2.3. The Construction of the CHARCO Dataset
2.3.1. Meta-Topic Selection
2.3.2. Semantic-Enhanced Retrieval for Context Grounding
2.3.3. Scenario-Driven Dialogue Generation
Algorithm 1: Multi-turn dialogue generation with retrieval.
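A hedged sketch of the generation loop behind Algorithm 1, reconstructed from the symbol table in Appendix A (inquirer/responder policies, query extractor that may return no valid query, turn-forwarding function, and a maximum turn budget). The callables and their signatures are stand-ins; the paper's exact interfaces may differ.

```python
# Sketch of multi-turn dialogue generation with retrieval.
# inquirer/responder are the two policies; retrieve, extract_query, and
# forward are placeholder callables mirroring the Appendix A symbols.

def generate_dialogue(inquirer, responder, retrieve, extract_query,
                      forward, init_prompt, max_turns):
    history = []              # dialogue history D: recorded (inquiry, reply) pairs
    prompt = init_prompt      # output of the dialogue initialization function
    for _ in range(max_turns):
        u_inq = inquirer(prompt)       # inquirer utterance at this turn
        query = extract_query(u_inq)   # extracted sub-prompt; None if no valid query
        context = retrieve(query) if query is not None else ""
        augmented = u_inq + "\n" + context if context else u_inq
        u_res = responder(augmented)   # responder conditions on retrieved context
        history.append((u_inq, u_res))
        prompt = forward(u_res)        # reformat previous reply for the next turn
    return history
```

The key design point is that retrieval is skipped entirely when the query extractor finds no valid query, which avoids injecting irrelevant context into the responder's input.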
2.3.4. Quality Control and Filtering
3. Experiment and Results
- How effective is CHARCO compared with existing datasets?
- Does CHARCO improve character coherence and emotional diversity?
- Why and how does the Verifiable Emotion Reward (VER) objective enhance emotional alignment and overall role-playing performance?
3.1. A Summary of the CHARCO Dataset
3.2. Implementation Details
3.3. Evaluation Strategy for the CHARCO Dataset
3.3.1. Measure of Textual Lexical Diversity
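The Measure of Textual Lexical Diversity (MTLD) [30] can be sketched in a few lines. A text is scanned left to right; each time the running type-token ratio (TTR) drops to the 0.72 threshold, one "factor" is closed and counting restarts, and the score is the token count divided by the factor count. The published measure averages a forward and a reversed pass; this sketch shows only the forward pass.

```python
# One-directional MTLD sketch (McCarthy & Jarvis, 2010).
# Higher scores mean longer stretches of text sustain lexical variety.

def mtld_forward(tokens, ttr_threshold=0.72):
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= ttr_threshold:
            factors += 1.0          # TTR dropped: close this factor
            types, count = set(), 0
    if count > 0:                   # partial credit for the unfinished factor
        ttr = len(types) / count
        factors += (1.0 - ttr) / (1.0 - ttr_threshold)
    return len(tokens) / factors if factors > 0 else float(len(tokens))
```

Dataset-level scores such as those reported in Section 3.3 are obtained by applying the measure over each corpus's text.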
3.3.2. Repetition Rate
3.4. Evaluation Strategy for the Multi-Turn Role-Playing Capabilities
3.4.1. Scoring Protocol
- 1–3 (Poor): Major errors in comprehension, factuality, or consistency; clear failure to perform the task.
- 4–6 (Acceptable): Some minor issues or incomplete performance, but the response remains partially useful or coherent.
- 7–8 (Good): Fully correct or plausible responses that satisfy most task requirements.
- 9–10 (Excellent): Responses that are not only correct but also demonstrate fluency, emotional depth, and strong alignment with the persona and task.
3.4.2. Final Dialogue Score
3.4.3. Reliability Validation
3.5. Role-Playing Performance of Different LLMs on the CHARCO Dataset
3.6. Comparison Study
3.7. Ablation Study
3.8. Impact of Data Mixing on Role-Playing Performance
4. Limitations
4.1. Dataset Dependency Consideration
4.2. Reward Signal Refinement Opportunity
4.3. Long-Term Dynamics for Future Exploration
5. Conclusions
- Customer Service and Support: Maintaining a consistent brand voice and empathetic tone throughout lengthy conversations and resolving issues without causing emotional distress.
- Mental Health and Counseling Bots: Maintaining therapeutic alliances through aligned emotional responses throughout extended self-disclosure sessions.
- Interactive Storytelling and Game NPCs: Leveraging semantic-enhanced retrieval and long-turn coherence to craft richer, emotionally dynamic narratives.
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Table of Symbols
Symbol | Description
---|---
 | Persona profile
 | Dialogue goal (e.g., information retrieval and emotional support)
 | Space of all utterances (token sequences)
 | Inquirer/responder policies
 | Dialogue initialization function
 | Utterances generated by inquirer/responder at turn t
 | Extracted sub-prompt (query)
 | Query extractor (returns ⌀ if no valid query)
 | Turn-forwarding function (reformats previous reply)
‖ | Sequence concatenation
 | End-of-sequence token
 | Maximum number of turns
D | Dialogue history (all recorded pairs)
 | Generated dialogue with emotion labels
 | Utterance at turn t
 | Corresponding emotion label (ground-truth or target)
T | Total number of turns
 | Output of pre-trained emotion classifier
 | Indicator function (1 if true, 0 otherwise)
 | Verifiable Emotion Reward (VER)
 | Parameterized generation policy
 | Expected return of policy
 | Policy gradient
 | Supervised generation loss
 | Weight coefficient for emotion reward
 | Total objective (supervised generation loss combined with the weighted emotion reward)
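The symbols above imply a simple computation for the Verifiable Emotion Reward: a pre-trained emotion classifier labels each generated utterance, an indicator checks agreement with the target emotion, and the mean indicator over the T turns is the reward, which is then weighted into the training objective. The sketch below assumes this mean-indicator form and a loss-minus-weighted-reward combination; the classifier is a stand-in callable and the sign/normalization of the combined objective are assumptions, not the paper's verified formulation.

```python
# Hedged sketch of the Verifiable Emotion Reward (VER) and total objective.
# `classify` stands in for the pre-trained emotion classifier; `lam` is the
# weight coefficient for the emotion reward from the symbol table.

def verifiable_emotion_reward(utterances, target_emotions, classify):
    """R = (1/T) * sum_t 1[classify(u_t) == e_t]  (mean indicator over T turns)."""
    assert len(utterances) == len(target_emotions)
    hits = sum(1 for u, e in zip(utterances, target_emotions) if classify(u) == e)
    return hits / len(utterances)

def total_objective(gen_loss, reward, lam=0.1):
    """Assumed form: supervised generation loss minus the weighted reward,
    so that minimizing the objective favors emotion-consistent generations."""
    return gen_loss - lam * reward
```

Because the reward is a verifiable 0/1 check per turn rather than a learned scalar, it can be computed exactly for any sampled dialogue, which is what makes it usable inside policy-gradient updates such as PPO or GRPO.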
References
- Raikov, A.; Giretti, A.; Pirani, M.; Spalazzi, L.; Guo, M. Accelerating human–computer interaction through convergent conditions for LLM explanation. Front. Artif. Intell. 2024, 7, 1406773. [Google Scholar] [CrossRef]
- Sowmiya, R.; Revathi, P.; Ragunath, D.; Gokila, P.; Kalaivani, T. Multi-Modal LLM Driven Computer Interface. In Proceedings of the 2024 8th International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud)(I-SMAC), Kirtipur, Nepal, 3–5 October 2024; IEEE: New York, NY, USA, 2024; pp. 484–489. [Google Scholar]
- Kumar, P. Large language models (LLMs): Survey, technical frameworks, and future challenges. Artif. Intell. Rev. 2024, 57, 260. [Google Scholar] [CrossRef]
- Chen, J.; Liu, Z.; Huang, X.; Wu, C.; Liu, Q.; Jiang, G.; Pu, Y.; Lei, Y.; Chen, X.; Wang, X.; et al. When large language models meet personalization: Perspectives of challenges and opportunities. World Wide Web 2024, 27, 42. [Google Scholar] [CrossRef]
- Nazi, Z.A.; Peng, W. Large language models in healthcare and medical domain: A review. Informatics 2024, 11, 57. [Google Scholar] [CrossRef]
- Bolpagni, M.; Gabrielli, S. Development of a comprehensive evaluation scale for LLM-powered counseling chatbots (CES-LCC) using the Edelphi method. Informatics 2025, 12, 33. [Google Scholar] [CrossRef]
- Pinto-Bernal, M.; Biondina, M.; Belpaeme, T. Designing Social Robots with LLMs for Engaging Human Interaction. Appl. Sci. 2025, 15, 6377. [Google Scholar] [CrossRef]
- Jedrzejczak, W.W.; Kobosko, J. Do Chatbots Exhibit Personality Traits? A Comparison of ChatGPT and Gemini Through Self-Assessment. Information 2025, 16, 523. [Google Scholar] [CrossRef]
- Klinkert, L.J.; Buongiorno, S.; Clark, C. Evaluating the efficacy of LLMs to emulate realistic human personalities. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Lexington, KY, USA, 18–22 November 2024; Volume 20, pp. 65–75. [Google Scholar]
- Chen, Y.C.; Lee, S.H.; Sheu, H.; Lin, S.H.; Hu, C.C.; Fu, S.C.; Yang, C.P.; Lin, Y.C. Enhancing responses from large language models with role-playing prompts: A comparative study on answering frequently asked questions about total knee arthroplasty. BMC Med. Inform. Decis. Mak. 2025, 25, 196. [Google Scholar] [CrossRef]
- Feng, Q.; Xie, Q.; Wang, X.; Li, Q.; Zhang, Y.; Feng, R.; Zhang, T.; Gao, S. EmoCharacter: Evaluating the Emotional Fidelity of Role-Playing Agents in Dialogues. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, NM, USA, 29 April–4 May 2025; pp. 6218–6240. [Google Scholar]
- Yu, Y.; Yu, R.; Wei, H.; Zhang, Z.; Qian, Q. Beyond Dialogue: A Profile-Dialogue Alignment Framework Towards General Role-Playing Language Model. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; pp. 11992–12022. [Google Scholar] [CrossRef]
- Dan, Y.; Zhou, J.; Chen, Q.; Tian, J.; He, L. P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 27 July–1 August 2025; pp. 6342–6362. [Google Scholar]
- Huang, L.; Lan, H.; Sun, Z.; Shi, C.; Bai, T. Emotional RAG: Enhancing role-playing agents through emotional retrieval. In Proceedings of the 2024 IEEE International Conference on Knowledge Graph (ICKG), Abu Dhabi, United Arab Emirates, 11–12 December 2024; IEEE: New York, NY, USA, 2024; pp. 120–127. [Google Scholar]
- Man, F.; Wang, H.; Fang, J.; Deng, Z.; Zhao, B.; Chen, X.; Li, Y. Context-Aware Sentiment Forecasting via LLM-based Multi-Perspective Role-Playing Agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; pp. 2687–2703. [Google Scholar] [CrossRef]
- Gomes, A.; Brito, E.; Morais, L.; Ferreira, N. How do Data Journalists Design Maps to Tell Stories? arXiv 2025, arXiv:2508.10903. Available online: http://arxiv.org/abs/2508.10903 (accessed on 24 July 2025).
- Yu, T.; Shi, K.; Zhao, Z.; Penn, G. Multi-Agent Based Character Simulation for Story Writing. In Proceedings of the Fourth Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2025), Albuquerque, NM, USA, 4 May 2025; pp. 87–108. [Google Scholar]
- Zhang, P.; An, S.; Qiao, L.; Yu, Y.; Chen, J.; Wang, J.; Yin, D.; Sun, X.; Zhang, K. RolePlot: A Systematic Framework for Evaluating and Enhancing the Plot-Progression Capabilities of Role-Playing Agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; pp. 12337–12354. [Google Scholar]
- Ding, N.; Chen, Y.; Xu, B.; Qin, Y.; Zheng, Z.; Hu, S.; Liu, Z.; Sun, M.; Zhou, B. Enhancing chat language models by scaling high-quality instructional conversations. arXiv 2023, arXiv:2305.14233. [Google Scholar] [CrossRef]
- Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; Hashimoto, T.B. Stanford Alpaca: An Instruction-Following LLaMA Model, 2023. Available online: https://github.com/tatsu-lab/stanford_alpaca (accessed on 24 July 2025).
- Xu, C.; Guo, D.; Duan, N.; McAuley, J. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. arXiv 2023, arXiv:2304.01196. [Google Scholar]
- Qi, Z.; Kaneko, T.; Takamizo, K.; Ukiyo, M.; Inaba, M. KokoroChat: A Japanese Psychological Counseling Dialogue Dataset Collected via Role-Playing by Trained Counselors. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; pp. 12424–12443. [Google Scholar] [CrossRef]
- Tao, M.; Liang, X.; Shi, T.; Yu, L.; Xie, Y. RoleCraft-GLM: Advancing Personalized Role-Playing in Large Language Models. arXiv 2024, arXiv:2401.09432. Available online: http://arxiv.org/abs/2401.09432 (accessed on 24 July 2025).
- Kim, H.; Hessel, J.; Jiang, L.; West, P.; Lu, X.; Yu, Y.; Zhou, P.; Bras, R.L.; Alikhani, M.; Kim, G.; et al. SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization. arXiv 2022, arXiv:2212.10465. [Google Scholar]
- Ji, Y. Exploring the Impact of Instruction Data Scaling on Large Language Models: An Empirical Study on Real-World Use Cases. arXiv 2023, arXiv:2303.14742. [Google Scholar] [CrossRef]
- Tu, Q.; Fan, S.; Tian, Z.; Yan, R. Charactereval: A chinese benchmark for role-playing conversational agent evaluation. arXiv 2024, arXiv:2401.01275. [Google Scholar]
- Wang, Z.M.; Peng, Z.; Que, H.; Liu, J.; Zhou, W.; Wu, Y.; Guo, H.; Gan, R.; Ni, Z.; Yang, J.; et al. Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. arXiv 2023, arXiv:2310.00746. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; pp. 311–318. [Google Scholar]
- Lin, C.Y.; Hovy, E. Automatic evaluation of summaries using N-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Edmonton, AB, Canada, 27 May–1 June 2003; Association for Computational Linguistics: Stroudsburg, PA, USA, 2003; Volume 1, pp. 71–78. [Google Scholar]
- McCarthy, P.M.; Jarvis, S. MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behav. Res. Methods 2010, 42, 381–392. [Google Scholar] [CrossRef]
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
- Lambert, N.; Morrison, J.; Pyatkin, V.; Huang, S.; Ivison, H.; Brahman, F.; Miranda, L.J.V.; Liu, A.; Dziri, N.; Lyu, S.; et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv 2024, arXiv:2411.15124. [Google Scholar] [CrossRef]
- Lehmann, M. The Definitive Guide to Policy Gradients in Deep Reinforcement Learning: Theory, Algorithms and Implementations. arXiv 2024, arXiv:2401.13662. Available online: http://arxiv.org/abs/2401.13662 (accessed on 24 July 2025). [CrossRef]
- Michailidis, P.; Michailidis, I.; Kosmatopoulos, E. Reinforcement learning for optimizing renewable energy utilization in buildings: A review on applications and innovations. Energies 2025, 18, 1724. [Google Scholar] [CrossRef]
- Shao, Z.; Wang, P.; Zhu, Q.; Xu, R.; Song, J.; Bi, X.; Zhang, H.; Zhang, M.; Li, Y.; Wu, Y.; et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv 2024, arXiv:2402.03300. [Google Scholar]
- Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024. [Google Scholar]
- Muennighoff, N.; Tazi, N.; Magne, L.; Reimers, N. Mteb: Massive text embedding benchmark. arXiv 2022, arXiv:2210.07316. [Google Scholar]
- Zhao, Y.; Gu, A.; Varma, R.; Luo, L.; Huang, C.C.; Xu, M.; Wright, L.; Shojanazeri, H.; Ott, M.; Shleifer, S.; et al. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. arXiv 2023, arXiv:2304.11277. Available online: http://arxiv.org/abs/2304.11277 (accessed on 24 July 2025). [CrossRef]
- Kim, T.; Vossen, P. EmoBERTa: Speaker-Aware Emotion Recognition in Conversation with RoBERTa. arXiv 2021, arXiv:2108.12009. Available online: http://arxiv.org/abs/2108.12009 (accessed on 24 July 2025).
- Xiao, S.; Liu, Z.; Zhang, P.; Muennighoff, N. C-Pack: Packaged Resources To Advance General Chinese Embedding. arXiv 2023, arXiv:2309.07597. Available online: http://arxiv.org/abs/2309.07597 (accessed on 24 July 2025).
- Yang, A.; Yang, B.; Hui, B.; Zheng, B.; Yu, B.; Zhou, C.; Li, C.; Li, C.; Liu, D.; Huang, F.; et al. Qwen2 Technical Report. arXiv 2024, arXiv:2407.10671. Available online: http://arxiv.org/abs/2407.10671 (accessed on 24 July 2025).
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. Available online: http://arxiv.org/abs/2307.09288 (accessed on 24 July 2025). [CrossRef]
- Bai, G.; Liu, J.; Bu, X.; He, Y.; Liu, J.; Zhou, Z.; Lin, Z.; Su, W.; Ge, T.; Zheng, B.; et al. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. arXiv 2024, arXiv:2402.14762. [Google Scholar]
- Cai, Z.; Cao, M.; Chen, H.; Chen, K.; Chen, K.; Chen, X.; Chen, X.; Chen, Z.; Chen, Z.; Chu, P.; et al. InternLM2 Technical Report. arXiv 2024, arXiv:2403.17297. Available online: http://arxiv.org/abs/2403.17297 (accessed on 24 July 2025). [CrossRef]
- Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. Available online: http://arxiv.org/abs/2310.06825 (accessed on 24 July 2025). [CrossRef]
- Li, A.; Gong, B.; Yang, B.; Shan, B.; Liu, C.; Zhu, C.; Zhang, C.; Guo, C.; Chen, D.; Li, D.; et al. Minimax-01: Scaling foundation models with lightning attention. arXiv 2025, arXiv:2501.08313. [Google Scholar] [CrossRef]
- GLM, T.; Zeng, A.; Xu, B.; Wang, B.; Zhang, C.; Yin, D.; Zhang, D.; Rojas, D.; Feng, G.; Zhao, H.; et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv 2024, arXiv:2406.12793. [Google Scholar]
Category | Dataset | Avg Conv Len | Total Dialogues | Avg Turns | Topic Breadth
---|---|---|---|---|---
Instruction/Non-Role-Play | Alpaca [20] | 91 | 52 k | 1.0 | Low
 | UltraChat [19] | 1467 | 146.8 k | 3.8 | High
 | SODA [24] | 232 | 148.6 k | 3.6 | Low
 | BELLE [25] | 102 | 143.6 k | 3.1 | Low
Role-Play-Centric | RoleCraft [23] | 33 | 27 k | 14.6 | Low
 | CharacterEval [26] | 370 | 1785 | 9.2 | Low
 | RoleLLM [27] | 50 | 140.7 k | 1.0 | Low
Prompt Name | Purpose | Template |
---|---|---|
Meta-topic Selection | Select topics that best match the character profile. | Select 5 meta-topics that best match the character profile: {profile} |
Question Generation | Generate one open-ended question per topic. | Based on the character profile, craft one question (≤50 words) about {Topic}. |
Conversation Requirements | Define style, format, and constraints. | Conversation Requirements: [new conflict, brief user replies, NPC style rules, no terminal phrases] |
Scenario Initialization | Initiate the first exchange embedding conflict. | Play the role of {Name}, in a {emotion} mood, and ask “{Question}”. |
Turn Continuation | Continue dialogue one role + one user turn. | Continue in {emotion} mood based on last 2–3 turns; obey requirements. |
Conversation End | Conclude the session without finality. | End the conversation in {emotion} mood; output single role turn. |
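The templates in the table above can be wired into a small helper that fills slot values at each stage of the generation pipeline. The dictionary keys and slot values below are illustrative; only the template strings come from the table (with the en dash in "2–3" written as a plain hyphen).

```python
# Minimal prompt-assembly sketch for the scenario-driven generation stages.
# Stage names are hypothetical labels for the rows of the template table.

TEMPLATES = {
    "meta_topic": "Select 5 meta-topics that best match the character profile: {profile}",
    "question": "Based on the character profile, craft one question (<=50 words) about {topic}.",
    "init": "Play the role of {name}, in a {emotion} mood, and ask \u201c{question}\u201d.",
    "continue": "Continue in {emotion} mood based on last 2-3 turns; obey requirements.",
    "end": "End the conversation in {emotion} mood; output single role turn.",
}

def build_prompt(stage, **slots):
    """Fill the slot values for one stage of the generation pipeline."""
    return TEMPLATES[stage].format(**slots)
```

A dialogue is then produced by issuing `init` once, `continue` for the intermediate turns, and `end` for the final role turn, with the `{emotion}` slot varied to drive emotional dynamics.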
Category | Value |
---|---|
Total Dialogues | 231,000 |
Average Rounds per Dialogue | 50 |
Number of Characters | 28 |
Number of Personality Traits | 45 |
Average Profile Length (tokens) | 381.4 |
Total Responses | 4,897,050 |
Aspect | Evaluation Criteria | Explanation
---|---|---
Perceptivity | Context Memory | Assessing whether the model correctly recalls earlier utterances and facts.
 | Situational Inference | Determining if the model understands subtle narrative cues or hidden intentions implied in previous context.
 | Anaphora Resolution | Testing the model’s ability to resolve pronouns or referential expressions such as “he,” “that,” or “this.”
Adaptability | Topic Switching | The ability to transition smoothly when the user introduces a new topic.
 | Code Completion/Format Conversion | Whether the model can rephrase or format content to match specific stylistic or structural goals.
 | Commonsense and Factual Reasoning | The ability to reason over user inputs that involve real-world knowledge.
Interactivity | Sentiment Analysis and Persona Inference | Whether the model can correctly interpret the user’s emotion or character traits.
 | Instruction Clarification | Whether the model actively asks for clarification when the user prompt is vague or under-specified.
 | Grounding Reasoning/Math Reasoning | Whether the model can follow multi-turn logic chains to solve problems collaboratively with the user.
Dataset | MTLD | Repetition Rate |
---|---|---|
Alpaca [20] | 42.8 | – |
UltraChat [19] | 74.3 | 50% |
SODA [24] | 38.6 | 75% |
BELLE [25] | 35.9 | 67% |
RoleCraft [23] | – | – |
CharacterEval [26] | 60.4 | – |
RoleLLM [27] | – | – |
CHARCO | 78.2 | 6% |
Model | Context Memory | Situational Inference | Anaphora Resolution | Average |
---|---|---|---|---|
Qwen2-7b [41] | 8.79 | 7.78 | 8.86 | 8.48 |
Qwen2-7b (LoRA) | 8.82 | 8.67 | 9.36 | 8.95 |
Qwen2-7b (VER) | 9.00 | 8.90 | 9.50 | 9.13 |
InternLM-7b [44] | 8.43 | 7.30 | 8.24 | 7.99 |
InternLM-20b [44] | 8.40 | 7.56 | 8.30 | 8.09 |
Llama2-7b [42] | 8.79 | 7.82 | 8.87 | 8.49 |
Llama2-7b (LoRA) | 8.42 | 7.77 | 9.57 | 8.59 |
Llama2-7b (VER) | 8.70 | 8.20 | 9.70 | 8.87 |
Llama2-13b [42] | 9.01 | 8.43 | 9.49 | 8.98 |
ChatGLM3-6b [26] | 6.43 | 4.90 | 6.94 | 6.09 |
Mistral-7b [45] | 5.52 | 6.64 | 8.09 | 6.75 |
MiniMax [46] | 8.32 | 8.01 | 9.52 | 8.62 |
GPT-3.5 [31] | 9.01 | 8.92 | 9.53 | 9.15 |
GPT-4 [31] | 9.23 | 9.13 | 9.59 | 9.32 |
Model | Topic Switch | Code Completion | Commonsense Reasoning | Factual Recall | Style Conversion | Average |
---|---|---|---|---|---|---|
Qwen2-7b [41] | 7.93 | 8.31 | 8.07 | 8.41 | 9.93 | 8.53 |
Qwen2-7b (LoRA) | 8.59 | 8.84 | 7.81 | 8.68 | 9.99 | 8.78 |
Qwen2-7b (VER) | 8.80 | 9.10 | 8.30 | 8.90 | 9.99 | 9.02 |
InternLM-7b [44] | 7.63 | 8.33 | 7.88 | 8.07 | 9.81 | 8.34 |
InternLM-20b [44] | 7.63 | 8.64 | 7.91 | 8.14 | 9.88 | 8.44 |
Llama2-7b [42] | 8.41 | 8.75 | 8.03 | 8.97 | 9.92 | 8.62 |
Llama2-7b (LoRA) | 8.86 | 8.40 | 8.31 | 8.74 | 9.64 | 8.79 |
Llama2-7b (VER) | 9.10 | 9.00 | 8.60 | 9.00 | 9.80 | 9.10 |
Llama2-13b [42] | 8.70 | 8.83 | 8.21 | 9.31 | 9.86 | 8.98 |
ChatGLM3-6b [47] | 5.86 | 5.62 | 6.57 | 7.54 | 8.96 | 6.91 |
Mistral-7b [45] | 8.03 | 8.34 | 7.87 | 8.59 | 9.86 | 8.54 |
MiniMax [46] | 8.76 | 9.10 | 8.86 | 8.78 | 9.20 | 8.94 |
GPT-3.5 [31] | 8.96 | 9.75 | 9.13 | 9.12 | 9.96 | 9.38 |
GPT-4 [31] | 9.45 | 9.56 | 9.23 | 9.36 | 9.96 | 9.51 |
Model | Sentiment Analysis | Math Reasoning | Grounding Reasoning | Instruction Comprehension | Personal Inference | Average |
---|---|---|---|---|---|---|
Qwen2-7b [41] | 7.91 | 7.13 | 5.87 | 7.87 | 8.21 | 7.40 |
Qwen2-7b (LoRA) | 8.64 | 7.07 | 6.43 | 9.08 | 8.80 | 8.00 |
Qwen2-7b (VER) | 8.90 | 7.80 | 7.10 | 9.30 | 9.00 | 8.42 |
InternLM-7b [44] | 7.29 | 7.27 | 6.14 | 7.02 | 8.03 | 7.15 |
InternLM-20b [44] | 7.96 | 7.27 | 6.21 | 7.52 | 8.18 | 7.43 |
Llama2-7b [42] | 7.44 | 6.76 | 5.90 | 7.86 | 8.24 | 7.24 |
Llama2-7b (LoRA) | 8.48 | 5.58 | 5.28 | 8.33 | 7.87 | 7.11 |
Llama2-7b (VER) | 8.80 | 6.80 | 6.80 | 8.90 | 8.50 | 7.76 |
Llama2-13b [42] | 7.86 | 7.00 | 6.06 | 8.28 | 8.64 | 7.57 |
ChatGLM3-6b [47] | 7.83 | 5.90 | 4.36 | 6.54 | 6.23 | 6.17 |
Mistral-7b [45] | 8.41 | 8.00 | 6.57 | 7.93 | 8.28 | 7.84 |
MiniMax [46] | 7.38 | 7.72 | 7.66 | 8.58 | 8.40 | 7.95 |
GPT-3.5 [31] | 7.96 | 8.24 | 7.08 | 8.99 | 8.72 | 8.20 |
GPT-4 [31] | 9.09 | 8.78 | 8.64 | 9.54 | 8.92 | 8.99 |
Model & Variant | Perceptivity | Adaptability | Interactivity |
---|---|---|---|
Qwen2-7b | | |
PPO | 8.85 | 8.80 | 7.80 |
PPO + VER | 9.13 | 9.02 | 8.42 |
GRPO | 8.75 | 8.70 | 7.70 |
GRPO + VER | 9.00 | 8.90 | 8.05 |
Llama2-7b | | |
PPO | 8.80 | 8.85 | 7.25 |
PPO + VER | 8.87 | 9.10 | 7.76 |
GRPO | 8.70 | 8.65 | 7.52 |
GRPO + VER | 8.95 | 8.85 | 7.70 |
Setting | Perceptivity | Adaptability | Interactivity |
---|---|---|---|
Full configuration | 9.13 | 9.02 | 8.42 |
– Retrieval removed | 7.75 | 8.20 | 8.15 |
– VER removed | 8.35 | 8.10 | 7.42 |
– LoRA | 8.95 | 8.78 | 8.00 |
– CHARCO-only (no mix) | 8.20 | 8.30 | 7.80 |
Depth | Avg. VER Reward | Persona Recall % | GPT-4 Score |
---|---|---|---|
Short (≤5 turns) | 0.72 | 85.4 | 7.8 |
Medium (10–20 turns) | 0.85 | 91.2 | 8.7 |
Full (up to 50 turns) | 0.93 | 95.6 | 9.1 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wang, J.; Wu, K.; Ouyang, Y. Enhancing Character-Coherent Role-Playing Dialogue with a Verifiable Emotion Reward. Information 2025, 16, 738. https://doi.org/10.3390/info16090738