MOSAIC: A Cognitively Motivated Multi-Agent Framework for Interpretable and Training-Free Empathetic Dialogue
Abstract
1. Introduction
1.1. Research Gap and Motivating Question
1.2. Proposed Approach and Contributions
- Cognitively motivated modular architecture: We design a dialogue pipeline whose four processing stages correspond, at the functional and behavioral level, to evidence for dissociable components of human empathy, as documented by Decety and Jackson [1] and extended by Singer and Lamm [2]. No claim is made that the agents replicate underlying neural mechanisms; rather, grounding the decomposition in established cognitive constructs yields two empirically verified properties: (a) functional dissociability, as evidenced by characteristically different impairment profiles under each agent’s ablation (Section 4.2), and (b) failure attributability, enabled by logging structured intermediate states at each turn for post hoc inspection. We emphasize that interpretability here denotes architectural transparency, that is, the ability to localize failures to a specific processing stage, rather than a practitioner-validated gain in diagnostic utility, which remains to be assessed in future work.
- Hierarchical three-tier emotional memory: We introduce a tripartite memory structure informed by Tulving’s [18] and Dolcos and colleagues’ [19] accounts of emotional memory organization, comprising perceptual, semantic, and episodic tiers retrieved via adaptive scoring across emotion, situation, and coping-strategy dimensions. Ablation analyses show that episodic memory provides the largest single-tier contribution to empathy quality, while combining all three tiers introduces retrieval redundancy that partially attenuates this advantage; the principal value of the hierarchical design lies in the richer, multi-dimensional retrieval vocabulary it affords rather than in an unconditional additive benefit across all memory types. The scoring policy is designed to exhibit emotional congruence and temporal recency biases, both of which are verified empirically as engineering confirmations rather than emergent discoveries (Section 4.3.3).
- Training-free heterogeneous LLM orchestration: We demonstrate that a coordinated ensemble of open-source and API-accessible language models can be orchestrated without task-specific fine-tuning, using only role-specific chain-of-thought prompts. Full prompt schemata are provided for all agents (Section 3.4) to facilitate replication. A latency and resource analysis (Section 4.5) shows that the asynchronous pipeline takes 2.6 s per turn, approximately 86% slower than the strongest single-model baseline. This trade-off is acceptable for non-real-time applications but is a meaningful constraint for latency-sensitive deployment.
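The adaptive retrieval described in the second contribution can be illustrated with a minimal sketch. The default weights (0.35/0.35/0.30 for emotion, situation, and coping) and the top-3 cutoff come from the hyperparameter table; the Jaccard overlap, the exponential recency decay, its half-life, and all names below are assumptions made for illustration, not the authors' scoring function. Diversity reranking is omitted.

```python
from dataclasses import dataclass

# Defaults reported in the hyperparameter sensitivity table.
WEIGHTS = {"emotion": 0.35, "situation": 0.35, "coping": 0.30}
TOP_K = 3


@dataclass
class Episode:
    emotion: set[str]
    situation: set[str]
    coping: set[str]
    age_turns: int = 0  # turns elapsed since the episode was stored


def jaccard(a: set[str], b: set[str]) -> float:
    """Keyword overlap in [0, 1]; an assumed similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score(query: Episode, ep: Episode, half_life: float = 20.0) -> float:
    """Weighted three-dimensional match, discounted by an assumed recency decay."""
    match = sum(
        w * jaccard(getattr(query, dim), getattr(ep, dim))
        for dim, w in WEIGHTS.items()
    )
    recency = 0.5 ** (ep.age_turns / half_life)  # hypothetical half-life in turns
    return match * recency


def retrieve(query: Episode, memory: list[Episode], k: int = TOP_K) -> list[Episode]:
    """Return the k highest-scoring episodes, best first."""
    return sorted(memory, key=lambda ep: score(query, ep), reverse=True)[:k]


query = Episode({"grief"}, {"loss"}, {"talking"})
recent = Episode({"grief"}, {"loss"}, {"talking"}, age_turns=0)
older = Episode({"grief"}, {"loss"}, {"talking"}, age_turns=20)
unrelated = Episode({"joy"}, {"promotion"}, {"celebrating"}, age_turns=0)
ranked = retrieve(query, [unrelated, older, recent])
```

With identical keywords, the more recent episode ranks first, which is the recency bias the scoring policy is designed to exhibit; the unrelated episode scores zero on every dimension.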
2. Related Work
2.1. Empathetic Dialogue Systems
2.2. Modular and Pipeline-Based Dialogue Architectures
2.3. Cognitive Architectures and Memory-Augmented Dialogue
3. Methods
3.1. Cognitive Motivation and Design Principles
3.2. Architectural Overview
3.3. Specialized Agent Design
3.3.1. Perception Agent
3.3.2. Cognition Agent
3.3.3. Event Agent
3.3.4. Response Agent
3.4. Prompt Engineering
3.5. Hierarchical Emotional Memory and Adaptive Retrieval
Memory Initialization and Cold-Start Behavior
3.6. Experimental Setup
3.6.1. Datasets and Evaluation Protocol
3.6.2. Evaluation Metrics
3.6.3. Human Evaluation Protocol
3.6.4. Baselines and Parameter-Scale Context
3.6.5. Implementation Details and Statistical Analysis
4. Results
4.1. Main Benchmark Results
4.2. Ablation Study: Functional Dissociability and Architectural Contributions
4.3. Memory Retrieval Dynamics and Memory-Type Contributions
4.3.1. Hierarchical Versus Flat Memory
4.3.2. Dimensional Retrieval Analysis
4.3.3. Verification of the Designed Scoring Policy
4.3.4. Memory-Type Contributions to Empathy Quality
4.4. Hyperparameter Sensitivity
4.5. Computational Cost and Latency Analysis
5. Discussion
5.1. Architectural Transparency as a Design Objective
5.2. Qualitative Analysis: Cross-Model Response Comparison
5.3. Scope and Limits of the Cognitive Grounding
5.4. Limitations and Directions for Future Work
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Full Prompt Templates
Appendix A.1. Perception Agent Prompt

Appendix A.2. Cognition Agent Prompt

Appendix A.3. Event Agent Prompt

Appendix A.4. Response Agent Prompt

Appendix B. Model Capacity Ablation Results
| Agent Substituted | Replacement Model | Empathy | Δ |
|---|---|---|---|
| MOSAIC (full) | — | 3.87 | — |
| Perception (Qwen-2.5-14B → 7B) | Qwen-2.5-7B | 3.74 | −0.13 |
| Cognition (Llama-3.1-70B → 13B) | Llama-3.1-13B | 3.73 | −0.14 |
| Event (Gemma-2-9B → 2B) | Gemma-2-2B | 3.76 | −0.11 |
| Response (Llama-3.1-70B → 13B) | Llama-3.1-13B | 3.73 | −0.14 |
Appendix C. FLOP Derivation
| Agent | N (Params) | T (Tokens, In + Out) | FLOPs |
|---|---|---|---|
| Perception (Qwen-2.5-14B) | 14 B | 350 | |
| Cognition (Llama-3.1-70B) | 70 B | 700 | |
| Event (Gemma-2-9B) | 9 B | 430 | |
| Response (Llama-3.1-70B) | 70 B | 900 | |
| Total (sum across four agents) | | | |
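
The FLOP values in this table did not survive extraction. As a rough illustration of how such a derivation is commonly done, the sketch below applies the widely used forward-pass approximation of about 2 FLOPs per parameter per token. That rule of thumb is an assumption here, not a formula confirmed by the source, so the resulting numbers should not be read as the paper's reported values. Parameter counts and token counts are taken from the agent and latency tables.

```python
# (parameters, tokens in + out) per agent, from the paper's tables.
AGENTS = {
    "Perception (Qwen-2.5-14B)": (14e9, 280 + 70),
    "Cognition (Llama-3.1-70B)": (70e9, 520 + 180),
    "Event (Gemma-2-9B)": (9e9, 340 + 90),
    "Response (Llama-3.1-70B)": (70e9, 760 + 140),
}


def forward_flops(n_params: float, n_tokens: int) -> float:
    """Assumed rule of thumb: ~2 FLOPs per parameter per processed token."""
    return 2.0 * n_params * n_tokens


per_agent = {name: forward_flops(n, t) for name, (n, t) in AGENTS.items()}
total_flops = sum(per_agent.values())

for name, flops in per_agent.items():
    print(f"{name}: {flops:.3e} FLOPs")
print(f"Total: {total_flops:.3e} FLOPs")
```

Under this assumption the two 70 B agents dominate the per-turn cost, which is consistent with their larger share of the measured latency.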
Appendix D. Per-Seed Variance on Main Metrics
| Seed | F1 | BLEU-2 | R-L | Emp |
|---|---|---|---|---|
| 42 | 76.2 | 8.0 | 26.7 | 3.86 |
| 123 | 76.5 | 8.1 | 26.9 | 3.88 |
| 456 | 76.4 | 8.1 | 26.8 | 3.87 |
| Mean ± SD |
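
The summary row can be recomputed from the three per-seed rows above. The sketch below uses the sample standard deviation (n − 1 denominator); whether the authors used the sample or the population form is not stated, so that choice is an assumption.

```python
from statistics import mean, stdev

METRICS = ("F1", "BLEU-2", "R-L", "Emp")

# Per-seed results copied from the table above.
SEEDS = {
    42: (76.2, 8.0, 26.7, 3.86),
    123: (76.5, 8.1, 26.9, 3.88),
    456: (76.4, 8.1, 26.8, 3.87),
}


def summarize(runs: dict[int, tuple[float, ...]]) -> dict[str, tuple[float, float]]:
    """Return (mean, sample SD) per metric across seeds."""
    columns = zip(*runs.values())
    return {m: (mean(col), stdev(col)) for m, col in zip(METRICS, columns)}


summary = summarize(SEEDS)
for metric, (mu, sd) in summary.items():
    print(f"{metric}: {mu:.2f} ± {sd:.2f}")
```

The resulting means round to the values reported in the main benchmark table (76.4, 8.1, 26.8, 3.87).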
Appendix E. Holm–Bonferroni Comparison
| Comparison vs. MOSAIC (Empathy) | Raw p | Bonferroni p | Holm p |
|---|---|---|---|
| GPT-4-Turbo | |||
| Claude-3.5 (zero-shot) | |||
| Llama-3.1-405B | |||
| Gemini-1.5-Pro (5-shot) | |||
| Claude-3.5 (5-shot) | |||
| GLHG (fine-tuned) | |||
| CEM (fine-tuned) | (n.s.) | (n.s.) | |
| MultiEMO (fine-tuned) | (n.s.) | (n.s.) | |
| EmpathGen (fine-tuned) | (n.s.) | (n.s.) |
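
The p-values in this table were lost in extraction and are not reconstructed here. To make the two correction procedures concrete, the following sketch implements Bonferroni and Holm step-down adjustment [56] on hypothetical raw p-values; the input numbers are placeholders for illustration only and are not the paper's results.

```python
def bonferroni(p_values: list[float]) -> list[float]:
    """Multiply each raw p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is scaled by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values, not taken from the paper.
raw = [0.01, 0.04, 0.03]
print("Bonferroni:", bonferroni(raw))
print("Holm:      ", holm(raw))
```

Holm adjustment is never more conservative than Bonferroni, which is why a comparison can remain significant under Holm while failing under Bonferroni.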
References
1. Decety, J. Dissecting the neural mechanisms mediating empathy. Emot. Rev. 2011, 3, 92–108.
2. Singer, T.; Lamm, C. The social neuroscience of empathy. Ann. N. Y. Acad. Sci. 2009, 1156, 81–96.
3. Fan, Y.; Han, S. Temporal dynamic of neural mechanisms involved in empathy for pain: An event-related brain potential study. Neuropsychologia 2008, 46, 160–173.
4. Saxe, R.; Kanwisher, N. People thinking about thinking people: The role of the temporo-parietal junction in “theory of mind”. NeuroImage 2003, 19, 1835–1842.
5. Svoboda, E.; McKinnon, M.C.; Levine, B. The functional neuroanatomy of autobiographical memory: A meta-analysis. Neuropsychologia 2006, 44, 2189–2208.
6. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
7. Andreas, J.; Rohrbach, M.; Darrell, T.; Klein, D. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 39–48.
8. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474.
9. Sabour, S.; Zheng, C.; Huang, M. CEM: Commonsense-aware empathetic response generation. Proc. AAAI Conf. Artif. Intell. 2022, 36, 11229–11237.
10. Peng, W.; Hu, Y.; Xing, L.; Xie, Y.; Sun, Y.; Li, Y. Control Globally, Understand Locally: A Global-to-Local Hierarchical Graph Network for Emotional Support Conversation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Vienna, Austria, 23–29 July 2022; pp. 4299–4305.
11. Shi, T.; Huang, S.L. MultiEMO: An Attention-Based Correlation-Aware Multimodal Fusion Framework for Emotion Recognition in Conversations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL); Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 14752–14766.
12. Park, J.S.; O’Brien, J.C.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative Agents: Interactive Simulacra of Human Behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), San Francisco, CA, USA, 29 October–1 November 2023; pp. 1–22.
13. Zhou, X.; Zhu, H.; Mathur, L.; Zhang, R.; Yu, H.; Qi, Z.; Morency, L.P.; Bisk, Y.; Fried, D.; Neubig, G.; et al. SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024.
14. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288.
15. Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku; Technical Report; Anthropic: San Francisco, CA, USA, 2024.
16. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774.
17. Young, S.; Gašić, M.; Thomson, B.; Williams, J.D. POMDP-based statistical spoken dialogue systems: A review. Proc. IEEE 2013, 101, 1160–1179.
18. Tulving, E. Elements of Episodic Memory; Clarendon Press: Oxford, UK, 1983.
19. Denkova, E.; Dolcos, S.; Dolcos, F. The Effect of Retrieval Focus and Emotional Valence on the Medial Temporal Lobe Activity during Autobiographical Recollection. Front. Behav. Neurosci. 2013, 7, 109.
20. Lin, Z.; Madotto, A.; Shin, J.; Xu, P.; Fung, P. MoEL: Mixture of empathetic listeners. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 121–132.
21. Majumder, N.; Hong, P.; Peng, S.; Lu, J.; Ghosal, D.; Gelbukh, A.; Mihalcea, R.; Poria, S. MIME: MIMicking Emotions for Empathetic Response Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 8968–8979.
22. Li, Q.; Chen, H.; Ren, Z.; Ren, P.; Tu, Z.; Chen, Z. EMPDG: A multi-resolution empathetic dialogue generation framework. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4454–4466.
23. Liu, S.; Zheng, C.; Demasi, O.; Sabour, S.; Li, Y.; Yu, Z.; Jiang, Y.; Huang, M. Towards emotional support dialog systems. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), Virtual, 1–6 August 2021; pp. 3469–3483.
24. Wang, F.; Shen, X.; Yu, J.; Xia, R. Flexible Thinking for Multimodal Emotional Support Conversation via Reinforcement Learning. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, 4–9 November 2025; pp. 1341–1356.
25. Meng, T.; Shou, Y.; Ai, W.; Du, J.; Liu, H.; Li, K. A multi-message passing framework based on heterogeneous graphs in conversational emotion recognition. Neurocomputing 2024, 569, 127109.
26. Hu, T.; Zheng, C.; Liu, S.; Sun, L.; Sun, H.; Zhan, Q. A survey on emotional support dialogue systems. ACM Comput. Surv. 2024, 57, 1–43.
27. Cheng, Y.; Shen, Y.; Liu, Y.; Wang, J. ConTegas: Contextualized empathetic dialogue generation with parameter-efficient tuning. In Proceedings of the IEEE International Conference on Data Mining (ICDM), Abu Dhabi, United Arab Emirates, 9–12 December 2024; pp. 1147–1152.
28. Cao, X.; Xu, M.; Yu, X.; Yao, J.; Ye, W.; Huang, S.; Zhang, M.; Tsang, I.; Ong, Y.S.; Kwok, J.T.; et al. Analytical Survey of Learning with Low-Resource Data: From Analysis to Investigation. ACM Comput. Surv. 2025, 58, 1–47.
29. Liu, T.; Cheng, Y.; Wu, N.; Ma, D.; Sun, W. Can large language models understand context? A probing study on in-context reasoning and attention. Nat. Mach. Intell. 2024, 6, 932–941.
30. Zhang, J.; Qian, K.; Liu, Z.; Heinecke, S.; Meng, R.; Liu, Y.; Yu, Z.; Wang, H.; Savarese, S.; Xiong, C. DialogStudio: Towards richest and most diverse unified dataset collection for conversational AI. In Proceedings of the Findings of the Association for Computational Linguistics: EACL 2024, St. Julian’s, Malta, 17–22 March 2024; pp. 2299–2315.
31. Chen, Z.; Liu, B.; Moon, S.; Sankar, C.; Crook, P.; Wang, W.Y. KETOD: Knowledge-enriched task-oriented dialogue systems with entity spans. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, USA, 10–15 July 2022; pp. 2581–2593.
32. Khot, T.; Trivedi, H.; Finlayson, M.; Fu, Y.; Richardson, K.; Clark, P.; Sabharwal, A. Decomposed prompting: A modular approach for solving complex tasks. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
33. Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Zhang, S.; Zhu, E.; Li, B.; Jiang, L.; Zhang, X.; Wang, C. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024.
34. Cao, Y.; Chen, H.; Jin, L.; Liu, Y.; Wang, P.; Yu, Z. MetaAgents: Simulating interactive multi-agent cooperation and competition at scale. Proc. AAAI Conf. Artif. Intell. 2024, 38, 17945–17953.
35. Lin, Z.; Wang, Y.; Zhou, Y.; Du, F.; Yang, Y. MLM-EOE: Automatic Depression Detection via Sentimental Annotation and Multi-Expert Ensemble. IEEE Trans. Affect. Comput. 2025, 16, 2842–2858.
36. Anderson, J.R.; Bothell, D.; Byrne, M.D.; Douglass, S.; Lebiere, C.; Qin, Y. An integrated theory of the mind. Psychol. Rev. 2004, 111, 1036–1060.
37. Laird, J.E. The Soar Cognitive Architecture; MIT Press: Cambridge, MA, USA, 2012.
38. Li, J.; Saket, B.; Etessami, K.; Barzilay, R. LaMP: Large language model personalization with progressive retrieval and reflective writing. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 10821–10838.
39. Zhong, W.; Guo, L.; Gao, Q.; Ye, H.; Wang, Y. MemoryBank: Enhancing large language models with long-term memory. Proc. AAAI Conf. Artif. Intell. 2024, 38, 19724–19731.
40. Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language agents with verbal reinforcement learning. Adv. Neural Inf. Process. Syst. 2023, 36, 8634–8652.
41. LaBar, K.S.; Cabeza, R. Cognitive neuroscience of emotional memory. Nat. Rev. Neurosci. 2006, 7, 54–64.
42. Anderson, J.R. A spreading activation theory of memory. J. Verbal Learn. Verbal Behav. 1983, 22, 261–295.
43. Bower, G.H. Mood and memory. Am. Psychol. 1981, 36, 129–148.
44. Liu, W.; Chen, X.; Miao, D.; Zhang, H.; Qin, X.; Du, S.; Lu, P. SEAD-MGFE-Net: Schrödinger equation-based adaptive dropout multi-granular feature enhancement network for conversational aspect-based sentiment quadruple analysis. Inf. Sci. 2025, 723, 122684.
45. Blair, R.J.R. Responding to the emotions of others: Dissociating forms of empathy through the study of typical and psychiatric populations. Conscious. Cogn. 2005, 14, 698–718.
46. Davis, M.H. Measuring individual differences in empathy: Evidence for a multidimensional approach. J. Personal. Soc. Psychol. 1983, 44, 113–126.
47. Jiang, H.; Chen, X.; Miao, D.; Zhang, H.; Qin, X.; Du, S.; Lu, P. 3WD-DRT: A three-way decision enhanced dynamic routing transformer for cost-sensitive multimodal sentiment analysis. Inf. Sci. 2025, 725, 122704.
48. Rashkin, H.; Smith, E.M.; Li, M.; Boureau, Y.L. Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 5370–5381.
49. Huang, Y.; Wang, Y.; Lu, D.; Chen, Y.; Yu, D. Towards continuous emotional awareness: A multimodal emotion recognition framework with large language models. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 10851–10855.
50. Weiner, B. An attributional theory of achievement motivation and emotion. Psychol. Rev. 1985, 92, 548–573.
51. Premack, D.; Woodruff, G. Does the chimpanzee have a theory of mind? Behav. Brain Sci. 1978, 1, 515–526.
52. Zhang, Y.; Struhl, N.; Koster, U.; McCoy, R.T. A theory of mind emerges in large language models trained on cryptic crosswords. In Proceedings of the Annual Meeting of the Cognitive Science Society, Rotterdam, The Netherlands, 24–27 July 2024; Volume 46, pp. 4120–4127.
53. Lin, C.Y. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81.
54. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating text generation with BERT. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020.
55. Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174.
56. Holm, S. A Simple Sequentially Rejective Multiple Test Procedure. Scand. J. Stat. 1979, 6, 65–70.
57. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling Laws for Neural Language Models. arXiv 2020, arXiv:2001.08361.
58. Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; de Las Casas, D.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training Compute-Optimal Large Language Models. arXiv 2022, arXiv:2203.15556.
59. Flavell, J.H. Metacognition and cognitive monitoring: A new area of cognitive–developmental inquiry. Am. Psychol. 1979, 34, 906–911.







| Agent | Model | Access | Core Instruction | Output Schema |
|---|---|---|---|---|
| Perception | Qwen-2.5-14B | API | Identify primary and secondary emotions, intensity [0–5], linguistic markers, and affective trajectory from the utterance and prior context | {primary, secondary, intensity, markers, trajectory} |
| Cognition | Llama-3.1-70B | Local cluster | Perform causal appraisal (controllability, stability, locus), infer mental states and psychological needs, generate three-dimensional retrieval keywords | {appraisal, mental_state, need, and three retrieval keyword sets (emotion, situation, coping)} |
| Event | Gemma-2-9B | API | Score stored episodes against current query keys; select top-3 with diversity reranking; provide per-episode retrieval rationale | [{sit, traj, cope, out, score, why} × 3] |
| Response | Llama-3.1-70B | Local cluster | Generate an empathetic reply integrating all upstream signals; calibrate affective register to ; avoid toxic positivity and premature reframing when grief or ambivalence is present | Free-form utterance (50–150 tokens) |
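
The output schemas in this table lend themselves to a typed per-turn log, which is what makes post hoc failure localization possible. The sketch below is one way to represent them. Field names for Perception and Event follow the table; the class names, the `retrieval_keys` container for the three keyword sets, the `turn` index, and the example values are hypothetical and introduced only for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PerceptionState:
    primary: str
    secondary: str
    intensity: float  # the table specifies a 0-5 scale
    markers: list[str]
    trajectory: str

    def __post_init__(self) -> None:
        if not 0 <= self.intensity <= 5:
            raise ValueError(f"intensity {self.intensity} outside [0, 5]")


@dataclass
class CognitionState:
    appraisal: dict[str, str]  # keys: controllability, stability, locus
    mental_state: str
    need: str
    retrieval_keys: dict[str, list[str]]  # hypothetical: emotion/situation/coping


@dataclass
class RetrievedEpisode:
    sit: str
    traj: str
    cope: str
    out: str
    score: float
    why: str


@dataclass
class TurnLog:
    turn: int
    perception: PerceptionState
    cognition: CognitionState
    episodes: list[RetrievedEpisode]

    def to_json(self) -> str:
        """Serialize the structured intermediate states for later inspection."""
        return json.dumps(asdict(self), ensure_ascii=False)


# Illustrative values only; not drawn from the paper's data.
example = TurnLog(
    turn=1,
    perception=PerceptionState("sadness", "guilt", 4, ["I should have"], "worsening"),
    cognition=CognitionState(
        {"controllability": "low", "stability": "stable", "locus": "internal"},
        "self-blame",
        "validation",
        {"emotion": ["sadness"], "situation": ["loss"], "coping": ["talking"]},
    ),
    episodes=[RetrievedEpisode("loss of pet", "improving", "talking", "relief", 0.8, "same emotion")],
)
print(example.to_json())
```

Validating ranges at construction time means a malformed agent output is attributed to the stage that produced it rather than surfacing later in the Response agent.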

| System | Type | Approx. Parameters | Deployment |
|---|---|---|---|
| GLHG [10] | Fine-tuned | ∼330 M | Local |
| CEM [9] | Fine-tuned | ∼125 M | Local |
| MultiEMO [11] | Fine-tuned | ∼1.5 B | Local |
| EmpathGen [24] | Fine-tuned | ∼7 B | Local |
| GPT-4-Turbo [16] | Training-free | not disclosed | API |
| Claude-3.5 [15] | Training-free | not disclosed | API |
| Gemini-1.5-Pro | Training-free | not disclosed | API |
| Llama-3.1-405B [14] | Training-free | 405 B | Local |
| MOSAIC agent breakdown (sequential; ∼163 B total active per turn) | | | |
| Perception (Qwen-2.5-14B) | Training-free | 14 B | API |
| Cognition (Llama-3.1-70B) | Training-free | 70 B | Local cluster |
| Event (Gemma-2-9B) | Training-free | 9 B | API |
| Response (Llama-3.1-70B) | Training-free | 70 B | Local cluster |

| Model | F1 | BLEU-2 | R-L | Emp |
|---|---|---|---|---|
| Fine-tuned systems (2022–2025) | ||||
| GLHG (2022) [10] | 75.8 ** | 7.8 * | 26.3 ** | 3.79 ** |
| CEM (2022) [9] | 76.2 | 7.9 | 26.7 | 3.81 * |
| MultiEMO (2023) [11] | 77.1 * | 8.2 | 27.1 | 3.85 |
| EmpathGen (2025) [24] | 77.8 ** | 8.4 * | 27.8 ** | 3.92 * |
| Training-free baselines (zero-shot unless noted) | ||||
| GPT-4-Turbo [16] | 71.3 *** | 7.1 *** | 24.2 *** | 3.51 *** |
| Claude-3.5 [15] | 73.6 *** | 7.4 *** | 25.1 *** | 3.58 *** |
| Llama-3.1-405B [14] | 74.2 *** | 7.6 ** | 25.6 *** | 3.64 *** |
| Gemini-1.5-Pro (5-shot) | 74.8 *** | 7.7 ** | 25.9 ** | 3.69 *** |
| Claude-3.5 (5-shot) [15] | 75.3 ** | 7.9 | 26.1 * | 3.73 *** |
| MOSAIC (ours) | 76.4 | 8.1 | 26.8 | 3.87 |
| 95% CI | [75.8, 77.0] | [7.9, 8.3] | [26.4, 27.2] | [3.82, 3.92] |

| Model | BERT-S | Coherence | Empathy |
|---|---|---|---|
| EmpathGen (2025) [24] | 82.8 *** | 3.95 *** | 3.82 *** |
| Claude-3.5 (5-shot) [15] | 81.4 *** | 3.86 *** | 3.76 *** |
| MOSAIC (ours) | 84.1 | 4.12 | 4.02 |
| 95% CI | [83.4, 84.8] | [4.05, 4.19] | [3.95, 4.09] |
| Effect vs. Claude-3.5 (5-shot) |

| Model | Empathy | Coherence | Personalization | Overall |
|---|---|---|---|---|
| EmpathGen (2025) [24] | 3.68 *** | 3.91 *** | 3.28 *** | 3.62 *** |
| Claude-3.5 (5-shot) [15] | 3.54 *** | 3.88 *** | 3.24 *** | 3.55 *** |
| MOSAIC (ours) | 3.78 | 4.02 | 3.67 | 3.82 |
| Effect size |

| Variant | F1 | Emp | Pers | d (avg) |
|---|---|---|---|---|
| MOSAIC (full) | 76.4 | 3.87 | 3.67 | — |
| Agent ablations | ||||
| −Perception | 73.8 | 3.61 | 3.58 | 0.27 |
| −Cognition | 74.2 | 3.58 | 3.52 | 0.25 |
| −Event memory | 74.9 | 3.52 | 3.21 | 0.32 |
| Architecture ablations | ||||
| Modular + Uniform Llama-70B (all agents, local) | 75.3 | 3.74 | 3.55 | 0.14 |
| Single LLM (Llama-70B, no role prompts) | 74.1 | 3.61 | 3.45 | 0.24 |
| Single LLM (Llama-70B, structured CoT prompt) | 74.6 | 3.68 | 3.49 | 0.18 |
| Flat memory (no hierarchy) | 75.2 | 3.69 | 3.38 | 0.16 |
| Embedding-only retrieval | 75.8 | 3.73 | 3.49 | 0.11 † |
| 2D keywords ( + only) | 75.7 | 3.71 | 3.44 | 0.14 |
| No chain-of-thought prompts | 75.3 | 3.58 | 3.51 | 0.19 |

| Dimension | Matches | Recall | Utility | p-Value |
|---|---|---|---|---|
| Emotion | 3.4 | 0.78 | 0.68 | — |
| Situation | 2.1 | 0.62 | 0.71 | |
| Coping | 1.6 | 0.54 | 0.79 | |
| Combined 3D | 4.8 | 0.84 | 0.82 | |

| k | Weights (emotion/situation/coping) | Emp | Pers | Note |
|---|---|---|---|---|
| 1 | 0.35/0.35/0.30 | 3.71 † | 3.44 † | Single-episode, low diversity |
| 3 | 0.35/0.35/0.30 | 3.87 | 3.67 | Default |
| 5 | 0.35/0.35/0.30 | 3.86 | 3.65 | Marginal redundancy |
| 3 | 0.45/0.30/0.25 | 3.84 | 3.63 | Emotion-heavy |
| 3 | 0.25/0.45/0.30 | 3.83 | 3.64 | Situation-heavy |
| 3 | 0.25/0.30/0.45 | 3.85 | 3.66 | Coping-heavy |

| Agent | Model | Access | Latency (s) | Tokens (In/Out) | Note |
|---|---|---|---|---|---|
| Perception | Qwen-2.5-14B | API | 0.4 | 280/70 | Compact input |
| Cognition | Llama-3.1-70B | Local cluster | 1.3 | 520/180 | Chain of thought |
| Event | Gemma-2-9B | API | 0.3 | 340/90 | Dispatched after Cognition completes; overlaps with Response initialization (async) |
| Response | Llama-3.1-70B | Local cluster | 1.2 | 760/140 | Full context integration |
| Total (sequential) | | | 3.2 | 1900/480 | |
| Total (async, Event ‖ Response init) | | | 2.6 | 1900/480 | |
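
The asynchronous schedule noted in this table (Event dispatched after Cognition completes and overlapping with Response initialization) can be sketched with `asyncio`. The agent calls below are stand-in coroutines that only sleep; the function names, the split of the Response agent into an initialization and a generation phase, and the scaled-down durations are all hypothetical, so the sketch illustrates the dispatch order rather than reproducing the measured 2.6 s.

```python
import asyncio


async def call_agent(name: str, seconds: float, log: list[str]) -> str:
    """Stand-in for a model call: records start/end and sleeps."""
    log.append(f"{name}:start")
    await asyncio.sleep(seconds)
    log.append(f"{name}:end")
    return name


async def run_turn(scale: float = 0.01) -> list[str]:
    """One dialogue turn; durations are table latencies scaled down for a quick run."""
    log: list[str] = []
    await call_agent("perception", 0.4 * scale, log)
    await call_agent("cognition", 1.3 * scale, log)
    # Event retrieval runs concurrently with Response initialization.
    await asyncio.gather(
        call_agent("event", 0.3 * scale, log),
        call_agent("response_init", 0.2 * scale, log),  # hypothetical duration
    )
    await call_agent("response_generate", 1.0 * scale, log)  # hypothetical duration
    return log


turn_log = asyncio.run(run_turn())
print(turn_log)
```

Because `asyncio.gather` schedules both coroutines before either finishes, the wall-clock cost of that stage is the longer of the two calls rather than their sum.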

| System | Latency (s) | FLOPs/Turn | Tokens/Turn | Emp |
|---|---|---|---|---|
| Claude-3.5 (5-shot) [15] | 1.4 | — | ∼1500 | 3.73 |
| GPT-4-Turbo [16] | 1.6 | — | ∼1500 | 3.51 |
| Llama-3.1-405B (local) [14] | 2.1 | | ∼1500 | 3.64 |
| EmpathGen (fine-tuned) [24] | 0.9 | — | ∼800 | 3.92 |
| MOSAIC (sequential) | 3.2 | | 2380 | 3.87 |
| MOSAIC (async) | 2.6 | | 2380 | 3.87 |
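
The relative latency overhead implied by this table can be checked with a few lines of arithmetic. The sketch below uses only the latencies listed above; the function and variable names are illustrative. The 86% figure quoted in the contributions appears to correspond to the comparison against Claude-3.5 (5-shot), although the source does not name the baseline explicitly.

```python
# Per-turn latencies in seconds, copied from the table above.
BASELINE_LATENCY_S = {
    "Claude-3.5 (5-shot)": 1.4,
    "GPT-4-Turbo": 1.6,
    "Llama-3.1-405B (local)": 2.1,
    "EmpathGen (fine-tuned)": 0.9,
}
MOSAIC_LATENCY_S = {"sequential": 3.2, "async": 2.6}


def relative_overhead(system_s: float, baseline_s: float) -> float:
    """Fractional slowdown of a system relative to a baseline."""
    return (system_s - baseline_s) / baseline_s


for mode, mosaic_s in MOSAIC_LATENCY_S.items():
    for name, base_s in BASELINE_LATENCY_S.items():
        pct = 100 * relative_overhead(mosaic_s, base_s)
        print(f"MOSAIC ({mode}) vs {name}: {pct:+.0f}%")
```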
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Liu, K.; Xiong, H.; Zhang, J.; Peng, M. MOSAIC: A Cognitively Motivated Multi-Agent Framework for Interpretable and Training-Free Empathetic Dialogue. Electronics 2026, 15, 2078. https://doi.org/10.3390/electronics15102078

