Decoupled Dual-Stage Generation to Balance Factuality and Empathy in Customer-Support Dialogue Systems
Featured Application
Abstract
1. Introduction
2. Related Work
2.1. Dialogue Systems
2.2. The Customer-Support Domain
3. Dual-Stage Generation
3.1. Task Formulation and Framework Overview
3.2. Fact-Centric Drafting and Empathy-Aware Tuning
3.3. Training of the Query Interpretation Module
4. Dataset
4.1. Dataset Construction
4.2. Dataset Statistics
5. Experiments
5.1. Experimental Setup
5.2. Representation Analysis and Motivation
5.3. Quantitative Performance Evaluation
5.4. Scenario-Based Case Study
6. Limitations and Future Work
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Prompt Templates
Appendix A.1. Dual-Stage Generation




Appendix A.2. Dataset Construction

Appendix A.3. LLM-Based Evaluation









| Category | Statistic | Value |
|---|---|---|
| Dataset size | Dialogues | 4513 |
| | Utterances | 40,207 |
| Data split | Train/val/test (%) | 80/10/10 |
| Test set | Factuality-prioritized utterances | 633 |
| | Empathy-prioritized utterances | 967 |
| Dialogue length (empathy) | Min. turns | 1 |
| | Max. turns | 15 |
| | Avg. turns | 3.37 |
| Dialogue length (factuality) | Min. turns | 1 |
| | Max. turns | 17 |
| | Avg. turns | 4.62 |
| User profile | Visitor | 2638 (58.5%) |
| | Employee | 1875 (41.5%) |
| Annotations | Emotion labels | {joy, neutral, sadness, surprise, fear, anger, disgust} |
| | Utterance type labels | {empathy, factuality} |
| Emotion distribution (%) | Joy/neutral/sadness | 53.6/28.2/7.2 |
| | Surprise/fear/anger/disgust | 5.8/3.0/1.1/1.0 |
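
The percentages in the dataset statistics table above can be cross-checked against the raw counts with a few lines of arithmetic. The sketch below is only an illustration: the variable and function names (`profile_counts`, `shares`, and so on) are hypothetical and not taken from the paper, and the numbers are copied by hand from the table.

```python
# Counts and percentages copied by hand from the dataset statistics table.
# All names here are illustrative assumptions, not the authors' code.
dialogues = 4513
profile_counts = {"Visitor": 2638, "Employee": 1875}
test_utterances = {"factuality": 633, "empathy": 967}
emotion_pct = {
    "joy": 53.6, "neutral": 28.2, "sadness": 7.2,
    "surprise": 5.8, "fear": 3.0, "anger": 1.1, "disgust": 1.0,
}


def shares(counts, total):
    """Return each count as a percentage of total, rounded to one decimal."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# User-profile percentages recomputed from the raw dialogue counts.
profile_pct = shares(profile_counts, dialogues)

# Total number of labeled test utterances across both utterance types.
test_total = sum(test_utterances.values())

# The reported emotion percentages are rounded, so their sum need not be 100.
emotion_total = round(sum(emotion_pct.values()), 1)

print(profile_pct, test_total, emotion_total)
```

The two user-profile counts add up to the dialogue total, which suggests that each dialogue carries exactly one profile label. The emotion percentages sum to slightly under 100, which is consistent with per-class rounding.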

| Model | Variant | BLEU-3/4 | METEOR | BERTScore | Faithfulness |
|---|---|---|---|---|---|
| Qwen-3 (4B-Instruct) | Base | 9.76/7.49 | 0.294 | 0.501 | 3.341 |
| | + RAG | 11.75/8.55 | 0.306 | 0.517 | 3.445 |
| | + QI | 9.47/6.34 | 0.290 | 0.508 | 3.321 |
| | + QI, RAG | 12.91/6.69 | 0.334 | 0.532 | 3.716 |
| | Ours (E→F) | 12.65/9.30 | 0.324 | 0.524 | 3.801 |
| | Ours (F→E) | 10.65/7.88 | 0.338 | 0.530 | 3.769 |
| EXAONE-3.5 (7.8B-Instruct) | Base | 9.04/6.11 | 0.306 | 0.497 | 3.645 |
| | + RAG | 7.95/5.34 | 0.296 | 0.487 | 3.409 |
| | + QI | 7.25/4.72 | 0.296 | 0.482 | 3.518 |
| | + QI, RAG | 8.94/6.22 | 0.327 | 0.502 | 3.882 |
| | Ours (E→F) | 7.90/5.45 | 0.318 | 0.492 | 3.899 |
| | Ours (F→E) | 6.72/4.56 | 0.315 | 0.494 | 3.856 |

| Model | Variant | Specificity | Reflection Level | Emotional Alignment |
|---|---|---|---|---|
| Qwen-3 (4B-Instruct) | Base | 0.576 | 2.673 | 4.215 |
| | + RAG | 0.576 | 2.405 | 4.029 |
| | + QI | 0.601 | 2.411 | 4.016 |
| | + QI, RAG | 0.599 | 2.365 | 3.918 |
| | Ours (E→F) | 0.566 | 2.743 | 4.183 |
| | Ours (F→E) | 0.599 | 2.790 | 4.129 |
| EXAONE-3.5 (7.8B-Instruct) | Base | 0.547 | 2.947 | 4.424 |
| | + RAG | 0.553 | 3.038 | 4.407 |
| | + QI | 0.602 | 2.961 | 4.333 |
| | + QI, RAG | 0.603 | 2.891 | 4.250 |
| | Ours (E→F) | 0.578 | 3.614 | 4.477 |
| | Ours (F→E) | 0.610 | 3.859 | 4.468 |

(a) Factuality

| Model | Variant | BLEU-3/4 | METEOR | BERTScore | Faithfulness |
|---|---|---|---|---|---|
| Llama-3 (8B-Instruct) | Base | 8.61/5.67 | 0.275 | 0.496 | 3.120 |
| | Ours (F→E) | 7.72/5.82 | 0.296 | 0.469 | 3.294 |
| Llama-3.1 (8B-Instruct) | Base | 10.43/7.08 | 0.285 | 0.510 | 3.226 |
| | Ours (F→E) | 12.75/9.88 | 0.330 | 0.532 | 3.501 |

(b) Empathy

| Model | Variant | Specificity | Reflection Level | Emotional Alignment |
|---|---|---|---|---|
| Llama-3 (8B-Instruct) | Base | 0.640 | 2.204 | 3.850 |
| | Ours (F→E) | 0.678 | 2.430 | 3.733 |
| Llama-3.1 (8B-Instruct) | Base | 0.623 | 2.245 | 3.962 |
| | Ours (F→E) | 0.623 | 2.452 | 3.843 |

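
To read the result tables above more easily, the change from each Base row to the corresponding "Ours (F→E)" row can be computed directly. The sketch below does this for two columns only, Faithfulness and Reflection Level. The helper name `gain` and the dictionary layout are illustrative assumptions, the scores are copied by hand from the tables, and the derived differences are not figures reported in the source.

```python
# (Base, Ours F->E) score pairs copied by hand from the result tables.
# The structure and names below are illustrative assumptions.
scores = {
    "Qwen-3 (4B-Instruct)": {
        "Faithfulness": (3.341, 3.769), "Reflection Level": (2.673, 2.790)},
    "EXAONE-3.5 (7.8B-Instruct)": {
        "Faithfulness": (3.645, 3.856), "Reflection Level": (2.947, 3.859)},
    "Llama-3 (8B-Instruct)": {
        "Faithfulness": (3.120, 3.294), "Reflection Level": (2.204, 2.430)},
    "Llama-3.1 (8B-Instruct)": {
        "Faithfulness": (3.226, 3.501), "Reflection Level": (2.245, 2.452)},
}


def gain(base, ours):
    """Return (absolute change, relative change in percent), both rounded."""
    return round(ours - base, 3), round(100 * (ours - base) / base, 1)


# Per-model, per-metric change of Ours (F->E) relative to Base.
gains = {
    model: {metric: gain(*pair) for metric, pair in metrics.items()}
    for model, metrics in scores.items()
}

for model, metrics in gains.items():
    for metric, (abs_change, rel_change) in metrics.items():
        print(f"{model:28s} {metric:17s} {abs_change:+.3f} ({rel_change:+.1f}%)")
```

For these two columns the change is positive for all four models. Other columns do not move uniformly in the same direction: Emotional Alignment, for example, is lower for Ours (F→E) than for Base on Qwen-3 and on both Llama models.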
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Kim, S.; Choi, H.; Huang, J.-X. Decoupled Dual-Stage Generation to Balance Factuality and Empathy in Customer-Support Dialogue Systems. Appl. Sci. 2026, 16, 3123. https://doi.org/10.3390/app16073123

