MOSAIC: A Cognitively Motivated Multi-Agent Framework for Interpretable and Training-Free Empathetic Dialogue

Liu, Kai; Xiong, Hangyu; Zhang, Jinyi; Peng, Min

doi:10.3390/electronics15102078


¹ School of Computer Science, Wuhan University, Wuhan 430072, China

² Department of Computer Science, Technical University of Denmark (DTU), 2800 Copenhagen, Denmark

³ Department of Computer Science, University of California, Los Angeles, CA 90095, USA

* Author to whom correspondence should be addressed.

Electronics 2026, 15(10), 2078; https://doi.org/10.3390/electronics15102078

Submission received: 9 April 2026 / Revised: 4 May 2026 / Accepted: 10 May 2026 / Published: 13 May 2026

(This article belongs to the Special Issue Affective Computing in Human–Robot Interaction)


Abstract

Empathetic dialogue systems built upon large language models overwhelmingly adopt a monolithic inference paradigm that processes emotion perception, causal reasoning, memory retrieval, and response planning within a single forward pass without architecturally enforced intermediate representations, forfeiting intermediate-state transparency and long-horizon personalization. Drawing on neuroscientific and cognitive–psychological evidence that human empathy is functionally dissociable, we present MOSAIC (Multi-agent Orchestration with Structured Affective memory for Interpretable empathiC dialogue), a training-free framework that operationalizes empathetic dialogue as a four-stage cognitive pipeline: affective perception, causal appraisal, episodic memory retrieval, and response synthesis. Three innovations distinguish MOSAIC from prior work: (1) a cognitively motivated modular architecture whose functionally dissociable stages enable post hoc failure attribution through logged intermediate states; (2) a hierarchical three-tier emotional memory (perceptual, semantic, and episodic) coupled with adaptive three-dimensional retrieval over emotion, situation, and coping-strategy cues; and (3) a heterogeneous model orchestration strategy coordinating open-source and API-accessible models through role-specific chain-of-thought prompts, requiring no task-specific fine-tuning. We note that the EmpatheticDialogues evaluation pre-populates the memory store with 200 training-split episodes prior to test-set interaction, a data-access asymmetry relative to single-model baselines that must be borne in mind when interpreting comparative results. Experiments on EmpatheticDialogues and ESConv show that MOSAIC achieves a 76.4% weighted F1 and an empathy score of 3.87 (on a 1–5 Likert scale) and that it improves over single-model, training-free baselines on aggregate empathy and, most prominently, on human-rated personalization (3.67 vs. 3.24 against Claude-3.5 five-shot, $d = 0.48$). We caution that the comparison against training-free baselines is not data access-controlled (see the cold-start discussion in Methods); the personalization advantage, supported by the ablation without the Event Agent, is the result we treat as the primary practical contribution of this work.

Keywords: empathetic dialogue; multi-agent systems; large language models; cognitive architecture; hierarchical emotional memory; retrieval-augmented generation; training-free inference; modular natural language processing; affective computing

1. Introduction

Human empathy emerges from the coordinated engagement of several partially dissociable cognitive processes [1,2]: the detection of affective cues [3], perspective taking and causal appraisal [4], autobiographical memory retrieval, and the regulation of behavioral responses [5]. This neuroscientific account carries a direct engineering implication: a system capable of genuine empathetic interaction must not merely produce supportive surface language but must identify emotional signals accurately, infer their underlying causes and the speaker’s psychological needs, recruit contextually relevant prior experience, and select a response strategy calibrated to the current emotional context.

Contemporary large language model (LLM)-based dialogue systems, despite their impressive fluency [6], typically process all of these functions within a single inference pass without architecturally enforced intermediate representations. Chain-of-thought prompting can elicit intermediate reasoning steps, but these steps are not structurally enforced, logged, or verifiable across turns. Consequently, such monolithic designs do not reliably expose which perceptual signals drove an empathetic interpretation, why a given response strategy was selected, or how accumulated conversational history informed the reply. This opacity limits failure diagnosis, forecloses targeted optimization, and tends to undermine personalization as conversation length grows.

1.1. Research Gap and Motivating Question

The central gap we address is the absence of an explicit, cognitively motivated functional decomposition in current empathetic dialogue systems. Prior work on modular neural networks [7] and retrieval-augmented generation [8] has demonstrated that decomposition can improve factuality, controllability, and transparency in other natural language tasks, yet its application to empathetic dialogue along cognitively principled lines has received limited attention. Existing systems that incorporate external memory or multi-perspective reasoning, including CEM [9], GLHG [10], and MultiEMO [11], organize their components around response generation objectives rather than around the structure of human empathetic processing and rarely analyze whether retrieval dynamics exhibit interpretable behavioral signatures such as emotional congruence or recency preference. While prompt-based multi-agent frameworks for social interaction have begun to appear—including generative agent simulations [12] and structured social interaction platforms [13]—their explicit extension to empathetic dialogue with hierarchical affective memory and verifiable intermediate states remains largely unexplored.

This gap motivates the central question of the present work: can the quality and interpretability of empathetic dialogue be improved by decomposing the generation process into modules that mirror the functional anatomy of human empathy rather than delegating all computation to a monolithic prompted model?

Figure 1 illustrates the conceptual motivation of our approach. In contrast to monolithic LLM pipelines that conflate all empathetic processing within a single forward pass, we advocate for an explicit decomposition into four cognitively motivated stages—perception, cognition, event memory retrieval, and response synthesis—each exposing structured intermediate states that enable post hoc failure attribution and targeted optimization.

1.2. Proposed Approach and Contributions

We propose MOSAIC (Multi-agent Orchestration with Structured Affective memory for Interpretable empathiC dialogue), a training-free framework in which four functionally specialized agents implement affective perception, causal appraisal, episodic memory retrieval, and response synthesis. The term training-free is used throughout to denote the absence of task-specific fine-tuning on empathetic dialogue corpora; the backbone models did undergo general-purpose instruction tuning and RLHF as part of their standard training regimes [14]. Crucially, the EmpatheticDialogues evaluation pre-populates the memory store with 200 analogical episodes sampled exclusively from the training split prior to any test-set interaction. This initialization constitutes limited training-data access that single-model training-free baselines (e.g., Claude-3.5 [15] and GPT-4-Turbo [16] in zero- or few-shot configurations) do not receive; the EmpatheticDialogues performance comparisons must be interpreted with this asymmetry in view (see Section Memory Initialization and Cold-Start Behavior for details and the ablation without the Event Agent for the associated performance cost). Unlike prior pipeline systems that decompose along task-functional axes such as intent detection and slot filling [17], MOSAIC decomposes along cognitive-process axes, each corresponding to a distinct functional component of the cognitive–psychological account of empathy. The framework exposes structured intermediate states at every turn, making failure attribution architecturally straightforward while retaining the flexibility of modern open-source LLMs.

The manuscript advances three contributions, each addressing a distinct dimension of the identified research gap.

Cognitively motivated modular architecture: We design a dialogue pipeline whose four processing stages correspond, at the functional and behavioral level, to evidence for dissociable components of human empathy, as documented by Decety and Jackson [1] and extended by Singer and Lamm [2]. No claim is made that the agents replicate underlying neural mechanisms; rather, grounding the decomposition in established cognitive constructs yields two empirically verified properties: (a) functional dissociability, as evidenced by characteristically different impairment profiles under each agent’s ablation (Section 4.2), and (b) failure attributability, as enabled by logging structured intermediate states ($S_{t}$, $C_{t}$, and $E_{t}^{ret}$) at each turn for post hoc inspection. We emphasize that, here, interpretability denotes architectural transparency (the ability to localize failures to a specific processing stage) rather than a practitioner-validated gain in diagnostic utility, which remains to be assessed in future work.
Hierarchical three-tier emotional memory: We introduce a tripartite memory structure informed by Tulving’s [18] and Dolcos and colleagues’ [19] accounts of emotional memory organization, comprising perceptual, semantic, and episodic tiers retrieved via adaptive scoring across emotion, situation, and coping-strategy dimensions. Ablation analyses show that episodic memory provides the largest single-tier contribution to empathy quality, while combining all three tiers introduces retrieval redundancy that partially attenuates this advantage; the principal value of the hierarchical design lies in the richer, multi-dimensional retrieval vocabulary it affords rather than in an unconditional additive benefit across all memory types. The scoring policy is designed to exhibit emotional congruence and temporal recency biases, both of which are verified empirically as engineering confirmations rather than emergent discoveries (Section 4.3.3).
Training-free heterogeneous LLM orchestration: We demonstrate that a coordinated ensemble of open-source and API-accessible language models can be effectively orchestrated without task-specific fine-tuning using only role-specific chain-of-thought prompts. Full prompt schemata are provided for all agents (Section 3.4) to facilitate replication. A comprehensive latency and resource analysis (Section 4.5) shows that the asynchronous pipeline achieves 2.6 s per turn—approximately 86% slower than the strongest single-model baseline—a trade-off that is acceptable for non-real-time applications but constitutes a meaningful constraint for latency-sensitive deployment.

The remainder of this paper is organized as follows. Section 2 situates the work within the empathetic dialogue, cognitive architecture, modular pipeline, and memory-augmented generation literature. Section 3 presents the MOSAIC design, prompt engineering strategy, and experimental protocol. Section 4 reports benchmark, ablation, retrieval, and resource analyses. Section 5 addresses qualitative behavior, the scope of cognitive claims, limitations, and future directions. Section 6 concludes this paper.

2. Related Work

2.1. Empathetic Dialogue Systems

Research on empathetic dialogue has progressed from end-to-end emotional generation models toward architectures that integrate richer inferential signals. Early methods such as MoEL [20] improved response quality via emotion-aware latent components; MIME [21] introduced explicit emotion control. Subsequent architectures incorporated commonsense knowledge graphs [9], hierarchical reasoning [10,22], and emotional support planning [23] to better align responses with the speaker’s situational context. More recent systems explore multi-perspective emotion understanding [11] and reinforcement learning-based support strategy selection [24]. Emotion recognition in conversation has also benefited from heterogeneous graph fusion approaches that integrate multiple contextual cues [25], providing motivation for the Perception Agent’s multi-signal extraction strategy adopted in MOSAIC. A comprehensive survey of this field is provided by Hu et al. [26]. While these systems demonstrate that structural inductive biases improve empathetic response quality, their structural choices are driven primarily by generation objectives rather than by an explicit decomposition of the constituent cognitive processes. Consequently, failure modes remain difficult to isolate, and personalization is typically a secondary effect rather than an explicit design target.

Large language models substantially expand the design space. Cheng et al. [27] demonstrate that parameter-efficient tuning can improve empathetic generation quality. Zero-shot and few-shot prompted LLMs offer broad world knowledge and fluent generation, as shown by Achiam et al. [16] and Anthropic [15], making them attractive as training-free empathetic agents. The challenges of low-resource learning provide additional motivation for training-free approaches such as MOSAIC, as noted by Cao et al. [28]. However, prompted LLMs remain functionally monolithic: emotion recognition, causal attribution, and strategy selection are processed within a single forward pass without architecturally enforced intermediate representations [29], leaving intermediate reasoning states and personalization trajectories unverifiable.

2.2. Modular and Pipeline-Based Dialogue Architectures

The decomposition of dialogue systems into functionally explicit stages has a long history in task-oriented dialogue research [17]. Multi-domain benchmarks [30] and knowledge-enriched pipeline designs [31] have demonstrated that structured pipelines improve factual grounding and controllability. The modular paradigm has been revisited in neural NLP through work on compositional networks [7] and pipeline-based complex question answering [32]. Multi-agent frameworks have been applied to code generation, reasoning, and decision support [33]. Cao et al. [34] further demonstrate the scalability of multi-agent cooperation and competition.

Recent work has extended multi-agent frameworks to social and emotional domains. Generative agent simulations [12] demonstrated that LLM-based agents can exhibit emergent social behaviors when equipped with memory and reflection mechanisms. Sotopia [13] introduced structured multi-agent social interaction for evaluation of social intelligence. Multi-expert ensemble methods have also shown promise for depression detection [35], demonstrating that heterogeneous agent coordination can improve affective inference. These works collectively establish the viability of multi-agent decomposition for social–emotional tasks, though their explicit application to empathetic dialogue with structured hierarchical memory and verifiable intermediate states—the focus of MOSAIC—has received limited attention. The present work is, to the best of our knowledge, the first to combine a cognitively motivated four-stage decomposition, tripartite hierarchical emotional memory, and training-free heterogeneous model orchestration in a unified empathetic dialogue framework.

MOSAIC differs from prior pipeline systems in two respects. First, the decomposition follows cognitive-process boundaries rather than task-functional ones: the four agents correspond to functionally distinct components of empathetic cognition identified in affective neuroscience [1], not to dialogue management stages such as intent detection or slot filling. Second, MOSAIC requires no task-specific fine-tuning at any stage, relying entirely on role-specific prompt engineering; this renders the framework broadly applicable in the absence of labeled empathetic dialogue data.

2.3. Cognitive Architectures and Memory-Augmented Dialogue

Cognitive architectures such as ACT-R [36] and Soar [37] provide principled frameworks for modular reasoning, structured memory access, and goal-directed behavior. Their primary engineering lesson for conversational AI is that complex social cognition may benefit from explicit functional decomposition. Retrieval-augmented generation extends this principle to neural NLP, demonstrating that external memory improves factual grounding and contextual coherence [8]. Li et al. [38] show that progressive retrieval and reflective writing can enhance LLM personalization. Episodic and semantic memory management for language agents has been explored in systems such as MemoryBank [39] and Reflexion [40].

In empathetic dialogue, memory has typically been treated as retrieval support for factual grounding or user-profile conditioning rather than as a structured emotional resource. Prior systems seldom distinguish among perceptual affect traces [41], abstract propositional knowledge [42], and context-rich episodic experiences [18], and the behavioral dynamics of retrieval, such as emotional congruence [43] or recency weighting [19], are rarely analyzed. Fine-grained sentiment analysis in conversations, including quadruple-level analysis [44], has highlighted the value of multi-dimensional retrieval cues, supporting the three-dimensional ($K_{e}$/$K_{s}$/$K_{c}$) retrieval vocabulary adopted in MOSAIC. MOSAIC addresses this gap by coupling a cognitively motivated modular pipeline with a hierarchical memory system whose retrieval behavior is empirically characterized and attributable to explicit design choices.

3. Methods

3.1. Cognitive Motivation and Design Principles

The MOSAIC design is grounded in two empirical observations from cognitive and affective science. First, empathy is functionally dissociable: the detection of emotional cues [45], perspective-taking [4], autobiographical recollection [5], and behavioral response regulation each engage partially overlapping but distinguishable neural substrates, as established by Decety and Jackson [1] and replicated by Singer and Lamm [2]; Davis’s [46] multidimensional model similarly treats these as separable components. This dissociability is a functional and behavioral observation that motivates an analogous architectural decomposition; no claim of mechanistic equivalence between the agents and the underlying neural systems is made. Second, emotional memory is structurally heterogeneous: affect-laden recollection draws on sensory-affective traces [41], abstract propositional knowledge [42], and rich episodic experiences [18], and retrieval is governed by associative similarity [43] and temporal accessibility [19]. Advances in cost-sensitive reasoning for multimodal sentiment analysis [47] further motivate the Response Agent’s affective calibration strategy, which explicitly guards against premature positive reframing.

These observations motivate three design principles. First, empathetic dialogue processing should be decomposed into explicit functional stages, each instantiated by a dedicated model and representation, so that failures can be localized and remedied independently. Second, the memory substrate should preserve multiple levels of emotional and contextual abstraction rather than being reduced to a flat embedding store. Third, the architecture should remain free of task-specific fine-tuning, employing only backbone models trained under general-purpose instruction regimes, so that the contribution of cognitive structure can be evaluated without conflation with domain-supervised learning.

3.2. Architectural Overview

Figure 2 summarizes the MOSAIC architecture. At conversational turn $t$, the system receives the current user utterance ($u_{t}$), the accumulated dialogue context ($h_{<t}$), and the memory store ($M_{<t}$). Processing is decomposed into four sequential stages:

$$
\begin{aligned}
S_{t} &= f_{P}(u_{t}, h_{<t}), \\
C_{t} &= f_{C}(u_{t}, S_{t}, h_{<t}), \\
E_{t}^{ret} &= f_{E}(C_{t}, M_{<t}), \\
r_{t} &= f_{R}(u_{t}, S_{t}, C_{t}, E_{t}^{ret}),
\end{aligned}
\tag{1}
$$

where $f_{P}$ denotes affective perception, $f_{C}$ denotes cognitive appraisal and psychological need inference, $f_{E}$ denotes hierarchical memory retrieval, and $f_{R}$ denotes empathetic response synthesis. Each intermediate representation ($S_{t}$, $C_{t}$, and $E_{t}^{ret}$) is logged and available for post hoc attribution, constituting the primary failure-diagnosis mechanism of the system.
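The data flow of Equation (1) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Pipeline` class, the dictionary-based state format, and the toy lambda agents are hypothetical stand-ins for the LLM-backed agents described in Section 3.3.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

State = Dict[str, Any]  # hypothetical container for S_t, C_t, and stored episodes


@dataclass
class Pipeline:
    """Chains the four stages of Equation (1) and logs every intermediate state."""
    f_P: Callable[[str, List[str]], State]                    # affective perception
    f_C: Callable[[str, State, List[str]], State]             # causal appraisal
    f_E: Callable[[State, List[State]], List[State]]          # memory retrieval
    f_R: Callable[[str, State, State, List[State]], str]      # response synthesis
    log: List[Dict[str, Any]] = field(default_factory=list)

    def step(self, u_t: str, history: List[str], memory: List[State]) -> str:
        S_t = self.f_P(u_t, history)
        C_t = self.f_C(u_t, S_t, history)
        E_ret = self.f_E(C_t, memory)
        r_t = self.f_R(u_t, S_t, C_t, E_ret)
        # Logged states are what makes post hoc failure attribution possible.
        self.log.append({"turn": len(self.log), "S_t": S_t, "C_t": C_t,
                         "E_ret": E_ret, "r_t": r_t})
        return r_t


# Toy agents (assumptions for illustration only; real agents would call LLMs).
pipe = Pipeline(
    f_P=lambda u, h: {"e_primary": "sad", "I": 3.0},
    f_C=lambda u, S, h: {"N": "validation", "K_e": [S["e_primary"]]},
    f_E=lambda C, M: [e for e in M if e.get("emotion") in C["K_e"]][:3],
    f_R=lambda u, S, C, E: f"That sounds hard. ({len(E)} related episode(s) recalled.)",
)
reply = pipe.step("I lost my job.", [], [{"emotion": "sad"}, {"emotion": "joyful"}])
```

Because each stage receives only the arguments listed in Equation (1), a faulty reply can be traced by inspecting `pipe.log` stage by stage rather than by re-prompting a single model.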

3.3. Specialized Agent Design

3.3.1. Perception Agent

The Perception Agent performs fine-grained affective signal extraction. Given the current utterance and dialogue context, it produces a structured emotional-state representation:

$$
S_{t} = \langle e_{primary}, e_{secondary}, I, M, A \rangle, \tag{2}
$$

where $e_{primary}$ and $e_{secondary}$ denote the dominant and secondary emotions from a 32-class taxonomy aligned with the EmpatheticDialogues annotation scheme [48], $I \in [0, 5]$ quantifies emotional intensity, $M$ encodes surface-linguistic markers (hedges, intensifiers, and exclamations), and $A$ characterizes the affective trajectory (onset, persistence, escalation, resolution, or ambivalence). We assign Qwen-2.5-14B, accessed via a commercial API endpoint, to this agent on the basis of its demonstrated sensitivity to subtle emotional cues and strong cross-lingual affect representation [49]. Its moderate parameter count provides sufficient representational depth for fine-grained emotion discrimination while avoiding the latency overhead of a 70B-scale model. The resulting perceptual trace is stored in the hierarchical memory for prospective calibration.
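One way to enforce the structure of Equation (2) on an agent's output is a small validated record type. The sketch below is an assumption about how such a check might look; the class name `PerceptualState` and its field names are hypothetical, and only the intensity range and the five trajectory labels are taken from the text.

```python
from dataclasses import dataclass
from typing import List

# Trajectory labels listed in the description of A.
TRAJECTORIES = {"onset", "persistence", "escalation", "resolution", "ambivalence"}


@dataclass(frozen=True)
class PerceptualState:
    """Hypothetical container for S_t = <e_primary, e_secondary, I, M, A>."""
    e_primary: str
    e_secondary: str
    intensity: float      # I, constrained to [0, 5]
    markers: List[str]    # M, e.g. hedges, intensifiers, exclamations
    trajectory: str       # A, one of TRAJECTORIES

    def __post_init__(self) -> None:
        if not 0.0 <= self.intensity <= 5.0:
            raise ValueError(f"intensity {self.intensity} outside [0, 5]")
        if self.trajectory not in TRAJECTORIES:
            raise ValueError(f"unknown trajectory {self.trajectory!r}")


state = PerceptualState("sad", "anxious", 3.5, ["really", "!"], "escalation")
```

Rejecting malformed states at construction time keeps a parsing error in the Perception Agent from silently propagating to the later stages.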

3.3.2. Cognition Agent

The Cognition Agent performs top-down causal appraisal and theory-of-mind reasoning over the perceptual signal. Its output is a structured cognitive interpretation:

$$
C_{t} = \langle C, A, M, N, K_{e}, K_{s}, K_{c} \rangle, \tag{3}
$$

where $C$ is the appraisal-consistent emotion category; $A$ encodes the attributional profile along controllability, stability, and locus dimensions [50]; $M$ represents inferred mental states (beliefs, desires, or intentions) following a theory-of-mind formalization [51,52]; $N$ identifies the user’s dominant psychological need; and $K_{e}$, $K_{s}$, and $K_{c}$ are structured retrieval keywords along emotion, situation, and coping-strategy dimensions, respectively. Llama-3.1-70B, deployed on our local GPU cluster, is assigned to this stage; the demands of multi-step causal reasoning and need inference necessitate the representational capacity of a large model.

Keyword generation expands the current utterance across three retrieval dimensions:

$$
\begin{aligned}
K_{e} &= \mathrm{expand}(e_{primary}, e_{secondary}) \cup \mathrm{synonyms}, \\
K_{s} &= \mathrm{extract}_{NER}(u_{t}) \cup \mathrm{domain}_{concepts}, \\
K_{c} &= \mathrm{infer}_{needs}(N) \mapsto \mathrm{strategies}.
\end{aligned}
\tag{4}
$$

This decomposition yields three independent diagnostic handles on the empathetic situation: what the user feels, why they feel that way, and what form of support is most appropriate.
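To make the three keyword sets of Equation (4) concrete, the following sketch builds them from small lookup tables. Everything here is an illustrative assumption: in MOSAIC the expansion is produced by the prompted Cognition Agent, whereas the tables `SYNONYMS`, `DOMAIN_CONCEPTS`, and `NEED_TO_STRATEGIES` and the token matching that stands in for named-entity extraction are hypothetical.

```python
import re
from typing import Dict, Set

# Hypothetical lookup tables; a real system would obtain these from the LLM.
SYNONYMS: Dict[str, Set[str]] = {
    "sad": {"down", "unhappy"},
    "anxious": {"worried", "nervous"},
}
DOMAIN_CONCEPTS: Dict[str, Set[str]] = {
    "job": {"employment", "career"},
    "exam": {"school", "grades"},
}
NEED_TO_STRATEGIES: Dict[str, Set[str]] = {
    "validation": {"reflection of feelings", "affirmation"},
    "guidance": {"providing suggestions", "information"},
}


def keywords(e_primary: str, e_secondary: str, utterance: str, need: str) -> Dict[str, Set[str]]:
    """Return the emotion, situation, and coping keyword sets (K_e, K_s, K_c)."""
    K_e = {e_primary, e_secondary}
    for emotion in (e_primary, e_secondary):
        K_e |= SYNONYMS.get(emotion, set())

    # Crude stand-in for NER: keep tokens that appear in the concept table.
    tokens = set(re.findall(r"[a-z]+", utterance.lower()))
    anchors = {t for t in tokens if t in DOMAIN_CONCEPTS}
    K_s = set(anchors)
    for anchor in anchors:
        K_s |= DOMAIN_CONCEPTS[anchor]

    K_c = set(NEED_TO_STRATEGIES.get(need, set()))
    return {"K_e": K_e, "K_s": K_s, "K_c": K_c}


keys = keywords("sad", "anxious", "I lost my job last week.", "validation")
```

Keeping the three sets separate is what later allows each dimension to be weighted and scored independently during retrieval.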

3.3.3. Event Agent

The Event Agent retrieves contextually relevant prior experiences from the hierarchical memory. Each stored episode is represented as a five-tuple:

$$
E_{i} = \langle \mathrm{sit}_{i}, \mathrm{traj}_{i}, \mathrm{cope}_{i}, \mathrm{out}_{i}, t_{i} \rangle, \tag{5}
$$

where $\mathrm{sit}_{i}$ summarizes the triggering situation, $\mathrm{traj}_{i}$ describes the emotional trajectory, $\mathrm{cope}_{i}$ specifies the employed coping strategy, $\mathrm{out}_{i}$ records the observed interaction outcome, and $t_{i}$ is the encoding timestamp. Gemma-2-9B, accessed via a commercial API, is assigned to this stage; retrieval scoring and compact summarization are well-served by a moderate-capacity model, and the smaller footprint reduces round-trip latency at the retrieval bottleneck. The agent selects the top $k = 3$ candidates according to the adaptive scoring function (Equation (6)) and applies a diversity-promoting reranking step with a factor of $\lambda_{div} = 0.3$ to mitigate semantic redundancy in the retrieved set.
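The text does not specify the reranking algorithm, so the sketch below assumes a greedy, maximal-marginal-relevance-style procedure: at each step the candidate with the highest relevance score minus $\lambda_{div}$ times its maximum similarity to the already selected episodes is chosen. The function names, the token-set episode representation, and the example scores are hypothetical.

```python
from typing import Callable, List, Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rerank(candidates: Sequence[Set[str]], scores: Sequence[float],
           sim: Callable[[Set[str], Set[str]], float],
           k: int = 3, lam_div: float = 0.3) -> List[Set[str]]:
    """Greedy selection that penalises redundancy with already chosen items."""
    remaining = list(range(len(candidates)))
    chosen: List[int] = []
    while remaining and len(chosen) < k:
        def penalised(i: int) -> float:
            redundancy = max((sim(candidates[i], candidates[j]) for j in chosen),
                             default=0.0)
            return scores[i] - lam_div * redundancy
        best = max(remaining, key=penalised)
        chosen.append(best)
        remaining.remove(best)
    return [candidates[i] for i in chosen]


# Episodes reduced to situation tokens; scores stand in for Equation (6).
episodes = [{"job", "loss"}, {"job", "loss", "boss"}, {"exam", "stress"}, {"pet", "death"}]
relevance = [0.90, 0.85, 0.70, 0.68]
selected = rerank(episodes, relevance, jaccard)
```

In this toy example the second episode nearly duplicates the first, so its penalised score falls below those of the two unrelated episodes and it is left out of the final set of three, even though a plain top-3 cut would have kept it.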

3.3.4. Response Agent

The Response Agent synthesizes the final empathetic reply from the complete intermediate-state tuple $(u_{t}, S_{t}, C_{t}, E_{t}^{ret})$. It proceeds through four sequential sub-operations: (i) empathetic framing, which selects an appropriate relational stance given the inferred psychological need; (ii) memory integration, which coherently incorporates retrieved analogical episodes into the reply; (iii) response drafting; and (iv) intensity calibration, which aligns the tone and affective register of the final response with the intensity level estimated in $S_{t}$. Calibration against toxic positivity (the avoidance of unsolicited reframing toward the positive when grief or ambivalence is present) is explicitly encoded in the Response Agent prompt (Appendix A), echoing cost-sensitive reasoning strategies now pursued in multimodal affective computing [47]. Llama-3.1-70B, deployed on the same local GPU cluster as the Cognition Agent, is used for generation, ensuring faithful integration of the rich intermediate context into a fluent and contextually coherent response.

3.4. Prompt Engineering

Each agent is instantiated via a role-specific system prompt that constrains its output to the required structured format and elicits multi-step chain-of-thought reasoning prior to the final answer. Table 1 summarizes the principal components of each agent’s prompt; complete prompt texts are provided in Appendix A and in the publicly available code repository (see Data Availability Statement).

Model assignments reflect a principled trade-off between representational capacity, inference overhead, and deployment mode. The Perception and Event agents are assigned to API-accessible models because affect signal extraction and retrieval scoring are well-served by moderate-capacity architectures with fast response times. The Cognition and Response agents require deeper multi-step reasoning and faithful integration of complex intermediate context and are therefore assigned to the more capable Llama-3.1-70B on our local GPU cluster. Table A1 (Appendix B) reports the results of preliminary capacity ablations conducted to check this assignment: substituting a 13B-scale model at the Cognition or Response stage reduced empathy scores by 0.11–0.14 points, which supports the capacity-based assignment.

3.5. Hierarchical Emotional Memory and Adaptive Retrieval

A central design commitment of MOSAIC is the treatment of memory as a structured three-tier hierarchy rather than a flat embedding store. Perceptual memory records low-level affective signatures and surface interaction cues [41]. Semantic memory stores higher-order user tendencies, recurring situational themes, and coping-strategy knowledge [42]. Episodic memory preserves context-rich prior experiences that support analogical reasoning over support trajectories [18]. This tripartite organization is motivated by well-established distinctions in emotional memory research, as reviewed by Dolcos et al. [19] and Svoboda et al. [5].

Retrieval integrates emotion, situation, and coping information through an adaptive scoring function:

$$
\mathrm{score}(E_{i}) = \sum_{d \in \{e, s, c\}} w_{d}(\mathrm{context}) \cdot \mathrm{sim}_{d}(K_{d}, E_{i}) \cdot \mathrm{decay}(t - t_{i}), \tag{6}
$$

where $w_{d}(\mathrm{context})$ represents context-sensitive dimension weights updated per interaction and $\mathrm{sim}_{d}$ is a dimension-specific similarity function. Specifically, $\mathrm{sim}_{e}$ is computed as the cosine similarity between the TF-IDF vector representations of the emotion keyword set ($K_{e}$) and the stored emotion-label field of episode $E_{i}$; $\mathrm{sim}_{s}$ is the Jaccard similarity between $K_{s}$ and the situational descriptor tokens of $E_{i}$; and $\mathrm{sim}_{c}$ is the cosine similarity between $K_{c}$ and the coping-strategy field, with all keyword sets lowercased and stop-word-filtered prior to vectorization. The context-sensitive dimension weights ($w_{d}(\mathrm{context})$) are initialized to $(w_{e}, w_{s}, w_{c}) = (0.35, 0.35, 0.30)$ and updated after each turn via an exponential moving average: $w_{d}^{(t+1)} = \alpha \cdot \hat{w}_{d}^{(t)} + (1 - \alpha) \cdot w_{d}^{(t)}$, where $\hat{w}_{d}^{(t)}$ is the normalized downstream utility of dimension $d$ estimated from the current turn’s retrieval outcomes (Spearman rank correlation between per-dimension retrieval scores and downstream empathy) and $\alpha = 0.1$ is the update rate. Temporal recency is encoded through an adaptive exponential decay:

$$
\mathrm{decay}(\Delta t) = e^{-\lambda_{adaptive} \Delta t}, \qquad \lambda_{adaptive} = \lambda_{0} \cdot (1 + \gamma \cdot \mathrm{volatility}), \tag{7}
$$

where $\Delta t = t - t_{i}$ is the time elapsed since episode $E_{i}$ was encoded, and volatility is computed as the standard deviation of the primary emotion intensity scores ($I$) across the five most recent stored perceptual traces, $\mathrm{volatility} = \mathrm{std}(I_{t-4}, I_{t-3}, \dots, I_{t})$, normalized to $[0, 1]$ by dividing by the maximum observable intensity range of 5. The volatility term accelerates decay when recent emotional trajectories have been unstable, thereby down-weighting older memories whose contextual relevance may be compromised. By construction, Equation (6) encodes both emotional-similarity preference (via $\mathrm{sim}_{e}$) and temporal recency preference (via the exponential decay) directly into the scoring policy. The behavioral patterns reported in Section 4.3.3 are engineering consequences of this design, not emergent phenomena, and the analyses presented there serve as implementation verification rather than discovery.
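Equations (6) and (7) and the weight update can be sketched together in a few functions. This is a simplified illustration under several assumptions that go beyond the text: plain term-count cosine replaces the TF-IDF cosine, the population standard deviation is used for volatility, the values `lam0=0.05` and `gamma=1.0` are hypothetical placeholders for $\lambda_{0}$ and $\gamma$, and episodes are dictionaries with hypothetical field names.

```python
import math
from collections import Counter
from statistics import pstdev
from typing import Dict, Iterable, Sequence


def cosine(a: Iterable[str], b: Iterable[str]) -> float:
    """Cosine similarity over term counts (simplified stand-in for TF-IDF)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def volatility(intensities: Sequence[float]) -> float:
    """Std of the five most recent intensities, divided by the range of 5."""
    recent = list(intensities)[-5:]
    return pstdev(recent) / 5.0 if len(recent) > 1 else 0.0


def decay(dt: float, vol: float, lam0: float = 0.05, gamma: float = 1.0) -> float:
    """Equation (7); lam0 and gamma are placeholder values, not from the paper."""
    return math.exp(-lam0 * (1.0 + gamma * vol) * dt)


def score(keys: Dict[str, list], episode: Dict, weights: Dict[str, float],
          t: float, vol: float) -> float:
    """Equation (6): weighted per-dimension similarity times temporal decay."""
    sims = {
        "e": cosine(keys["e"], episode["emotion"]),
        "s": jaccard(keys["s"], episode["situation"]),
        "c": cosine(keys["c"], episode["coping"]),
    }
    # decay does not depend on d, so it can be factored out of the sum.
    return sum(weights[d] * sims[d] for d in "esc") * decay(t - episode["t"], vol)


def ema_update(weights: Dict[str, float], utility: Dict[str, float],
               alpha: float = 0.1) -> Dict[str, float]:
    """w_d <- alpha * w_hat_d + (1 - alpha) * w_d for every dimension d."""
    return {d: alpha * utility[d] + (1.0 - alpha) * weights[d] for d in weights}


weights = {"e": 0.35, "s": 0.35, "c": 0.30}
keys = {"e": ["sad", "down"], "s": ["job", "loss"], "c": ["validation"]}
episode = {"emotion": ["sad"], "situation": ["job", "loss", "boss"],
           "coping": ["validation"], "t": 0}
fresh = score(keys, episode, weights, t=0, vol=0.0)
stale = score(keys, episode, weights, t=10, vol=0.0)
stale_volatile = score(keys, episode, weights, t=10, vol=0.5)
```

With these inputs the same episode scores highest when it was just encoded, lower after ten time units, and lower still when recent intensities have been volatile, which is the recency behavior the scoring policy is designed to produce.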

Memory Initialization and Cold-Start Behavior

In the EmpatheticDialogues evaluation, each test conversation is treated as independent, and the memory store is pre-populated from a pool of 200 episodes sampled exclusively from the training split prior to any test-set interaction. We emphasize that this initialization grants MOSAIC access to training-split data that single-model training-free baselines (e.g., Claude-3.5 [15] and GPT-4-Turbo [16] in zero- or few-shot configurations) do not receive; the EmpatheticDialogues performance comparisons must be interpreted with this asymmetry in view. The pre-populated episodes serve exclusively as analogical retrieval support (informing response strategy rather than exposing test-set answers), and the pool is strictly disjoint from the test split, ensuring the absence of direct information leakage. The ablation without the Event Agent in Section 4.2 quantifies the performance cost of operating without pre-populated memory, reflecting the system’s behavior in a genuine cold-start scenario. An important open question, deferred to future work, is whether providing single-model baselines with the same 200 episodes via simple embedding-based in-context retrieval would recover a comparable personalization advantage; such a matched condition would isolate the architectural contribution from the data-access advantage. We strongly encourage this experiment as a next step and regard the “Modular + Uniform Llama-70B” ablation as the most principled currently available estimate of architectural contribution, holding model capacity constant. In zero-history deployments, the Event Agent gracefully degrades to a retrieval-free mode in which the Response Agent relies solely on the perceptual state ($S_{t}$) and cognitive appraisal ($C_{t}$).
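The initialization and fallback described above can be illustrated with a short sketch. It assumes that episodes are identified by string IDs and that sampling is uniform with a fixed seed; neither detail is stated in the text, and the function names are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def init_memory(train_ids: Sequence[str], test_ids: Sequence[str],
                n: int = 200, seed: int = 0) -> List[str]:
    """Sample n distinct training-split episodes after checking split disjointness."""
    if set(train_ids) & set(test_ids):
        raise ValueError("training and test splits overlap")
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    return rng.sample(list(train_ids), n)


def retrieve_or_degrade(memory: List[str],
                        retrieve: Callable[[List[str]], List[str]]) -> List[str]:
    """Return an empty retrieval set when no history exists (retrieval-free mode)."""
    return retrieve(memory) if memory else []


train_ids = [f"train-{i}" for i in range(1000)]
test_ids = [f"test-{i}" for i in range(100)]
memory = init_memory(train_ids, test_ids)
cold_start = retrieve_or_degrade([], lambda m: m[:3])
warm_start = retrieve_or_degrade(memory, lambda m: m[:3])
```

An empty retrieval set means the Response Agent is conditioned only on $S_{t}$ and $C_{t}$, which corresponds to the cold-start condition measured by the ablation without the Event Agent.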

3.6. Experimental Setup

3.6.1. Datasets and Evaluation Protocol

We evaluate MOSAIC on two established benchmark datasets, both publicly available to the research community.

EmpatheticDialogues (ED) [48] comprises approximately 25,000 conversations, each grounded in a speaker-described personal situation and annotated with one of 32 emotion categories. The dataset is available at https://github.com/facebookresearch/EmpatheticDialogues (accessed on 6 April 2026). We follow established practice [9,20] and use the standard train/validation/test partition (approximately 19,533/2770/2547 conversations). For each test dialogue, the complete conversation history up to the penultimate turn is provided as context, and the final reference utterance serves as the gold-standard response. Emotion recognition performance is assessed by comparing the system’s predicted primary emotion against the ground-truth annotation. Automatic response metrics are computed against the gold-standard reference utterance.

ESConv [23] contains approximately 1300 emotional support conversations with substantially longer interaction trajectories (mean of 29.8 turns), making it well suited to evaluate the memory-related benefits of the proposed architecture. The dataset is available at https://github.com/thu-coai/Emotional-Support-Conversation (accessed on 6 April 2026). We use the published test split (approximately 210 conversations) and evaluate at the turn level, averaging over all turns with at least 5 prior context turns to allow memory effects to manifest.

3.6.2. Evaluation Metrics

Emotion recognition is assessed by weighted F1 over the 32-class taxonomy. Response quality is measured by BLEU-2, ROUGE-L [53], and BERTScore

F_{1}

[54]. We note that BLEU-2, as a precision-based n-gram metric, is known to be an imperfect fit for open-ended empathetic dialogue where lexical diversity is desirable; ROUGE-L and BERTScore provide complementary recall-oriented and semantic-similarity perspectives, and human evaluation scores are treated as the primary quality criterion. Human evaluation employs four five-point Likert-scale dimensions—empathy, coherence, personalization, and overall quality—rated by three independent annotators per conversation.

3.6.3. Human Evaluation Protocol

Six annotators were recruited through an academic crowdsourcing platform; all reported native or near-native English proficiency and held at least an undergraduate background in psychology or linguistics. Annotators were divided into two independent groups of three (Group A: 130 ED + 30 ESConv; Group B: 70 ED + 30 ESConv), ensuring that each conversation received exactly three independent ratings. Conversation allocation to groups was performed by random sampling without replacement, stratified by emotion category to ensure comparable emotion-class distributions across Group A and Group B. Post hoc inspection confirmed that the two groups produced comparable rating distributions (Kolmogorov–Smirnov test,

p > 0.30

for all four dimensions), indicating that the allocation did not introduce systematic rater-group confounds. Annotation instructions operationally defined each dimension with concrete anchor examples (e.g., “Empathy: to what extent does the response demonstrate understanding of the speaker’s emotional state and needs?”) and included three calibration items prior to live annotation. System responses were presented in randomized order, with annotators blinded to system identity and to the hypotheses of the study. Disagreements exceeding two scale points were resolved by majority vote. Per-dimension inter-rater agreement is reported as follows: empathy,

κ = 0.72

; coherence,

κ = 0.74

; personalization,

κ = 0.63

; overall,

κ = 0.71

. Inter-rater agreement was moderate to substantial overall (mean Fleiss’

κ = 0.70

; [55]), indicating acceptable annotation reliability. The somewhat lower agreement for personalization (

κ = 0.63

) reflects the greater subjectivity of this dimension and should be borne in mind when interpreting between-system differences on that scale. The final human evaluation sample comprised 200 randomly selected ED conversations and 60 ESConv conversations. All procedures were conducted in accordance with institutional review board guidelines for minimal-risk studies; informed consent was obtained prior to participation.
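For readers who wish to reproduce the agreement statistics, the following is a minimal sketch of Fleiss' κ for a fixed number of raters per item. It treats the Likert points as nominal categories, which is an assumption on our part, since the exact variant used for the reported values is not specified here; the function name and the toy data are hypothetical. The last lines simply average the four per-dimension values reported above.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j (same number of raters for every item)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    # Observed agreement: per-item proportion of agreeing rater pairs, averaged.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement from the marginal category proportions.
    n_categories = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy data: 4 items, 3 raters, 2 categories.
toy = [[3, 0], [0, 3], [2, 1], [1, 2]]
toy_kappa = fleiss_kappa(toy)

# Mean of the four per-dimension values reported in the text.
reported = {"empathy": 0.72, "coherence": 0.74, "personalization": 0.63, "overall": 0.71}
mean_kappa = sum(reported.values()) / len(reported)
print(round(toy_kappa, 3), round(mean_kappa, 2))
```

The sketch does not guard against the degenerate case in which all ratings fall into a single category (chance agreement of 1), which would need separate handling in practice.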

3.6.4. Baselines and Parameter-Scale Context

Baselines comprise two groups: fine-tuned systems—GLHG [10], CEM [9], MultiEMO [11], and EmpathGen [24]—each trained with task-specific supervision on EmpatheticDialogues, and training-free LLM baselines—Claude-3.5 [15], GPT-4-Turbo [16], Gemini-1.5-Pro, and Llama-3.1-405B [14]—evaluated in zero-shot and five-shot settings.

Table 2 summarizes the parameter scale and deployment mode of all compared systems. The total active-parameter count per turn for MOSAIC (∼163 B) reflects the sum over four sequentially invoked agents; these parameters are never simultaneously resident in GPU memory. The comparison between MOSAIC and Llama-3.1-405B [14] is not parameter-controlled: MOSAIC issues four LLM calls per turn and benefits from pre-populated training-split memory, while Llama-3.1-405B is evaluated in a single-pass, zero-shot configuration without memory access. The controlled ablation in Section 4.2 (“Modular + Uniform Llama-70B”) provides a more principled estimate of the architectural contribution by holding both model identity and memory access constant.

3.6.5. Implementation Details and Statistical Analysis

Memory capacity is fixed at 500 entries per user. Matching thresholds are

θ_{p} = 1

,

θ_{c} = 2

, and

θ_{e} = 2

. Retrieval employs the top

k = 3

candidates with base dimension weights of

w_{e} = 0.35

,

w_{s} = 0.35

, and

w_{c} = 0.30

; a decay base of

λ_{0} = 0.04

; volatility sensitivity of

γ = 0.02

; and a diversity factor of

λ_{div} = 0.3

. These values were selected via grid search on the EmpatheticDialogues validation split; a hyperparameter sensitivity analysis (

k \in {1, 3, 5}

and weight perturbations) is reported in Section 4.4. All experiments used random seeds of 42, 123, and 456; reported values are means over three runs. Per-seed variance is tabulated in Appendix D to allow for independent verification of result stability. Local inference was conducted on our in-house cluster of NVIDIA A100-80GB GPUs (NVIDIA Corporation, Santa Clara, CA, USA); the Perception and Event agents were invoked through their respective commercial API endpoints. Reported model/API versions and vendor details correspond to Qwen-2.5-14B (Alibaba Cloud, Hangzhou, China), Gemma-2-9B (Google DeepMind, London, UK), Llama-3.1-70B and Llama-3.1-405B (Meta AI, Menlo Park, CA, USA), Claude-3.5 (Anthropic, San Francisco, CA, USA), GPT-4-Turbo (OpenAI, San Francisco, CA, USA), and Gemini-1.5-Pro (Google LLC, Mountain View, CA, USA).
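As a convenience for replication, the hyperparameters listed above can be gathered into a single configuration object. The sketch below is illustrative only: the class and field names are hypothetical and are not taken from the released code, and the mapping of the weights to the emotion, situation, and coping-strategy dimensions is our reading of the subscripts. Only the numeric values come from this section.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RetrievalConfig:
    """Values as reported in Section 3.6.5; all field names are illustrative."""
    memory_capacity: int = 500             # entries per user
    theta_p: int = 1                       # matching thresholds
    theta_c: int = 2
    theta_e: int = 2
    top_k: int = 3                         # retrieved candidates per query
    w_emotion: float = 0.35                # w_e (assumed: emotion dimension)
    w_situation: float = 0.35              # w_s (assumed: situation dimension)
    w_coping: float = 0.30                 # w_c (assumed: coping-strategy dimension)
    decay_base: float = 0.04               # lambda_0
    volatility_sensitivity: float = 0.02   # gamma
    diversity_factor: float = 0.3          # lambda_div
    seeds: Tuple[int, ...] = (42, 123, 456)

    def weights_sum(self) -> float:
        """Sum of the three base dimension weights."""
        return self.w_emotion + self.w_situation + self.w_coping


config = RetrievalConfig()
print(config.top_k, round(config.weights_sum(), 2))
```

Freezing the dataclass keeps the configuration immutable across the three seeded runs, so a run cannot silently alter a value that the other runs rely on.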

Statistical comparisons employ two-tailed paired t-tests for automatic metrics (preceded by Shapiro–Wilk normality tests and Levene’s test for variance homogeneity; all Shapiro–Wilk tests confirmed the approximate normality of metric differences at

α = 0.05

, and all tests satisfied the relevant assumptions), Wilcoxon signed-rank tests for ordinal human evaluation scores, and Bonferroni correction for multiple comparisons (

m = 9

, one correction factor per baseline). We note that the Holm–Bonferroni step-down procedure [56] would yield less conservative corrected p-values; under the Holm–Bonferroni procedure, all reported significant comparisons remain significant (Appendix E, Table A4), so the more conservative Bonferroni threshold is retained for the main benchmark comparisons reported below. Effect sizes are reported as Cohen’s d (paired t-tests), rank-biserial r (Wilcoxon tests), or Cramér’s V (chi-square tests).
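The relationship between the two corrections can be illustrated with a short sketch. The p-values below are hypothetical placeholders and are not results from this study; the function names are ours. The example shows why a comparison that survives Bonferroni correction necessarily survives the Holm–Bonferroni step-down procedure.

```python
from typing import List, Sequence


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Bonferroni-adjusted p-values: multiply each by the number of tests."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


def holm_bonferroni(p_values: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The rank-th smallest p-value is multiplied by (m - rank).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values for m = 9 comparisons (placeholders, not study results).
example = [0.0004, 0.001, 0.002, 0.004, 0.006, 0.012, 0.02, 0.03, 0.2]
bonf = bonferroni(example)
holm = holm_bonferroni(example)
print([round(p, 4) for p in bonf])
print([round(p, 4) for p in holm])
```

Because each Holm multiplier is at most m, every Holm-adjusted value is no larger than its Bonferroni counterpart.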

4. Results

4.1. Main Benchmark Results

Table 3 presents results on EmpatheticDialogues. MOSAIC achieves a 76.4% weighted F1, BLEU-2 of 8.1, ROUGE-L of 26.8, and an empathy score of 3.87 on the 1–5 Likert scale. The most salient finding is that the proposed training-free architecture outperforms all training-free baselines and comes within one standard error of the best fine-tuned systems; we caution, however, that the memory pre-population asymmetry described in the “Memory Initialization and Cold-Start Behavior” subsection contributes to this advantage and cannot be disentangled from architectural effects in the current evaluation. Against Claude-3.5 (five-shot), the strongest single-model, training-free baseline, automatic-metric empathy improves by 0.14 points (

t (2546) = 8.94

,

p < 0.001

, Cohen’s

d = 0.68

). This 0.14-point improvement on a 1–5 Likert scale is statistically robust but substantively modest, particularly relative to the 86% latency overhead; the more practically compelling advantage is the personalization gain (human evaluation: 3.67 vs. 3.24 for Claude-3.5,

d = 0.48

), which represents a qualitatively distinguishable improvement in user-tailored behavior and is treated as the primary performance claim of this work.

MOSAIC (∼163 B sequential) also outperforms Llama-3.1-405B [14] (3.64 empathy) despite a lower summed active-parameter count; however, as noted in Section 3.6.4, this comparison is confounded by MOSAIC’s four-call inference pipeline and pre-populated training-split memory. The controlled ablation in Section 4.2 provides a more principled estimate of the architectural contribution.

The performance advantage is reproduced on ESConv (Table 4), where longer interaction horizons amplify the importance of memory and strategy adaptation. MOSAIC improves over Claude-3.5 (five-shot) [15] on BERTScore, coherence, and empathy, and the effect sizes increase with conversation length (

r = 0.71

,

p < 0.001

), consistent with the memory-augmented design conferring increasing benefit as the dialogue horizon extends.

Human evaluation (Table 5) confirms that the most pronounced and practically meaningful gain is in personalization. MOSAIC achieves a personalization score of 3.67 versus 3.24 for Claude-3.5 [15] and 3.28 for EmpathGen [24] (

d = 0.48

,

r = 0.48

, rank-biserial), making personalization the strongest and most practically significant human-evaluation finding of this work. This indicates that structured memory retrieval produces qualitatively distinguishable user-tailored responses that neither a monolithic LLM nor a fine-tuned specialist system reliably matches on this dimension. The smaller human-rated empathy margin (3.78 vs. 3.68 for EmpathGen) relative to the automatic-metric gap likely reflects annotators’ broader consideration of conversational naturalness and should be interpreted alongside the per-dimension agreement statistics reported in Section 3.6.3.

4.2. Ablation Study: Functional Dissociability and Architectural Contributions

To verify that the modular design reflects genuine functional differentiation rather than redundant prompting overhead, we systematically ablate individual agents and architectural components. Table 6 reports the resulting performance and effect sizes; Figure 3 visualizes the dissociable impairment profiles.

Three findings emerge from the agent ablations. First, removing the Perception Agent predominantly degrades emotion recognition (F1:

- 2.6

points), consistent with its role as the dedicated affective signal extractor; its downstream contribution to empathy and personalization is moderate, as the Cognition Agent can partially recover signals from dialogue history. Second, removing the Cognition Agent produces broad degradation across both recognition and empathy (

- 7.5 %

change in empathy), indicating that causal appraisal and psychological need inference are central to contextually calibrated response synthesis. Third, the Event memory ablation incurs the largest personalization penalty (

- 12.5 %

), indicating that the memory component is the primary driver of user-adapted behavior and that its contribution is not recovered by the remaining modules. A one-way repeated-measures ANOVA confirms a significant main effect of module removal (

F (2, 5092) = 147.3

,

p < 0.001

, and

η^{2} = 0.055

).
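The reported effect size can be cross-checked from the F statistic and its degrees of freedom. The sketch below assumes that the reported value is a partial η², and that the error degrees of freedom arise from three ablation conditions evaluated on the 2547 test conversations; both are our assumptions rather than statements from the analysis itself.

```python
def partial_eta_squared(f_stat: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_stat * df_effect) / (f_stat * df_effect + df_error)


# Assumed design: 3 ablation conditions, 2547 test conversations.
n_conditions, n_conversations = 3, 2547
df_effect = n_conditions - 1
df_error = (n_conditions - 1) * (n_conversations - 1)

eta_p2 = partial_eta_squared(147.3, df_effect, df_error)
print(df_effect, df_error, round(eta_p2, 3))
```

Under these assumptions the degrees of freedom and the effect size agree with the values reported above.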

The architectural ablations reveal three additional findings. In absolute empathy-score terms, modular structure and model heterogeneity contribute approximately equally: role-prompted pipeline decomposition (Single LLM → Modular + Uniform Llama-70B) yields

+ 0.13

points (3.61→3.74), and heterogeneous model assignment (Modular + Uniform → full MOSAIC) yields a further

+ 0.13

points (3.74→3.87), together accounting for the total 0.26-point architectural gain. The “Single LLM (structured CoT prompt)” condition, which provides a concatenated role prompt requesting sequential perception, cognition, retrieval, and response reasoning within a single Llama-3.1-70B call, achieves an intermediate empathy of 3.68 (vs. 3.61 for no role prompts and 3.74 for the Modular + Uniform Llama-70B pipeline), indicating that structured prompting alone recovers some, but not all, of the benefit of architectural decomposition. The additional gain from separate model invocations (

+ 0.06

) suggests that enforced stage separation, not prompt structure alone, contributes meaningfully to performance. The chain-of-thought ablation (

d = 0.19

) further demonstrates that multi-step reasoning within each agent’s prompt makes an independent and non-trivial contribution beyond structural decomposition alone.
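The decomposition of the architectural gain is simple arithmetic over the four empathy scores quoted above, and can be reproduced as follows. The dictionary keys are illustrative labels for the ablation conditions, not identifiers from the released code.

```python
# Empathy scores quoted in the text for each configuration.
empathy = {
    "single_llm": 3.61,                  # no role prompts
    "single_llm_structured_cot": 3.68,   # concatenated role prompt, one call
    "modular_uniform_llama70b": 3.74,    # four calls, same backbone
    "full_mosaic": 3.87,                 # four calls, heterogeneous backbones
}


def gain(src: str, dst: str) -> float:
    """Difference in empathy score between two configurations, rounded to 2 d.p."""
    return round(empathy[dst] - empathy[src], 2)


structure_gain = gain("single_llm", "modular_uniform_llama70b")
heterogeneity_gain = gain("modular_uniform_llama70b", "full_mosaic")
total_gain = gain("single_llm", "full_mosaic")
prompt_only_gain = gain("single_llm", "single_llm_structured_cot")
stage_separation_gain = gain("single_llm_structured_cot", "modular_uniform_llama70b")

print(structure_gain, heterogeneity_gain, total_gain)
print(prompt_only_gain, stage_separation_gain)
```

Read this way, roughly half of the structural gain (0.07 of 0.13 points) is attainable through prompt structure alone, and the remainder (0.06) is associated with separate invocations.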

4.3. Memory Retrieval Dynamics and Memory-Type Contributions

4.3.1. Hierarchical Versus Flat Memory

Hierarchical memory organization outperforms flat storage on both empathy (3.87 vs. 3.69;

t (2546) = 5.47

,

p < 0.001

,

d = 0.22

) and personalization (3.67 vs. 3.38;

t (2546) = 8.93

,

p < 0.001

,

d = 0.35

). Three-dimensional retrieval additionally outperforms its two-dimensional counterpart (omitting coping-strategy keywords), demonstrating that action-relevant coping information provides a retrieval signal that is complementary to emotion and situation matching. It should be noted that this ablation—which removes the hierarchical organizational structure while retaining all memory content—is distinct from the memory-type analysis in Section 4.3.4, which varies which tiers are populated. Taken together, the two analyses establish that (a) organizing memory into functionally distinct tiers with dedicated retrieval mechanisms is beneficial relative to flat storage and (b) the realized gain is dominated by episodic content rather than by an unconditional additive combination of all three tiers.

4.3.2. Dimensional Retrieval Analysis

Table 7 disaggregates retrieval performance by dimension. Emotion keywords yield the highest recall (0.78), whereas coping-strategy keywords yield the highest downstream utility (Spearman

ρ = 0.79

), revealing an asymmetry: the dimension with the highest retrieval recall (emotion) is not the most informative for response quality. Utility scores are computed at the conversation level as the Spearman rank correlation between the per-episode retrieval score on dimension d and the downstream empathy score of the response generated using that episode, averaged across all retrieval events within a conversation. This pattern suggests that empathetic memory systems should prioritize retrieval of episodes with relevant support trajectories alongside those with matching affective profiles.
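One possible reading of the utility computation is sketched below: a Spearman correlation is computed within each conversation between per-episode retrieval scores on one dimension and the corresponding empathy scores, and the per-conversation correlations are then averaged. The aggregation order, the function names, and the toy data are assumptions made for illustration; the exact procedure may differ.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def dimension_utility(conversations: Sequence[Tuple[Sequence[float], Sequence[float]]]) -> float:
    """Mean per-conversation Spearman rho between retrieval and empathy scores."""
    return mean(spearman(scores, empathy) for scores, empathy in conversations)


# Hypothetical toy data: (retrieval scores on one dimension, empathy scores).
toy_conversations = [
    ([0.2, 0.5, 0.9], [3.0, 3.5, 4.0]),
    ([0.1, 0.4, 0.8, 0.6], [3.2, 3.1, 4.1, 3.9]),
]
print(round(dimension_utility(toy_conversations), 3))
```

The sketch does not handle conversations in which either score is constant, for which the correlation is undefined.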

4.3.3. Verification of the Designed Scoring Policy

The scoring function in Equation (6) encodes two structural biases by design: emotional-similarity preference (via the

{sim}_{e}

term) and temporal-recency preference (via the exponential decay term). The analyses below serve as engineering verifications—confirming that the implemented system faithfully realizes its intended scoring properties—rather than as independent empirical discoveries.

Figure 4 shows that 73% of retrieved memories share the primary emotion of the current interaction context, against a 31% random baseline (

χ^{2} (1) = 284.6

,

p < 0.001

, Cramér’s

V = 0.53

). This confirms that the emotional-similarity term operates as designed. The resulting pattern parallels mood-congruent memory effects documented by Bower [43] in the human cognition literature; however, this parallel is at the behavioral level and reflects our deliberate design choice rather than any claim of cognitive simulation.

Figure 5 shows that memories encoded within the last five conversational turns are retrieved 2.3 times more frequently than older memories under comparable relevance conditions. An exponential decay model (

p (t) = 0.32 e^{- 0.04 t}

, where

λ_{0} = 0.04

is the base decay rate specified in Section 3.6.5) fits the observed data closely (

R^{2} = 0.94

), confirming that the adaptive decay formulation produces its intended recency bias. The coefficient of

λ_{0} = 0.04

in the fitted model matches the implementation hyperparameter directly, which serves as a sanity check that the deployment configuration is consistent with the reported decay schedule. This mirrors recency effects in human autobiographical memory research reported by Dolcos et al. [19] at the behavioral level; the parallel is a consequence of adopting the exponential-decay functional form from that literature rather than evidence of mechanistic equivalence.
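To give a sense of scale for the fitted curve, the sketch below evaluates the reported model at a few memory ages and derives the implied half-life. It assumes that t counts conversational turns since encoding; the function and variable names are ours.

```python
import math

AMPLITUDE = 0.32   # fitted retrieval probability at t = 0
LAMBDA_0 = 0.04    # base decay rate reported in the text


def retrieval_probability(t: float) -> float:
    """Fitted model p(t) = 0.32 * exp(-0.04 * t); t assumed to be in turns."""
    return AMPLITUDE * math.exp(-LAMBDA_0 * t)


# Age at which the fitted retrieval probability halves.
half_life = math.log(2) / LAMBDA_0

for t in (0, 5, 10, 20):
    print(t, round(retrieval_probability(t), 3))
print(round(half_life, 1))
```

Under this reading, the fitted retrieval probability halves roughly every 17 turns.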

4.3.4. Memory-Type Contributions to Empathy Quality

In the presence of any retrieved memories, empathy scores increase from a no-memory baseline of 3.49 (95% CI: [3.43, 3.55]) to 3.87 (

t (1273) = 6.8

,

p < 0.001

,

d = 0.38

). Among individual tiers, episodic memory yields the largest incremental gain (+0.42), followed by semantic (+0.28) and perceptual (+0.21) memory. Figure 6 illustrates these contributions.

Importantly, the combined-memory condition (+0.38) falls below the episodic-only condition (+0.42), a result that merits explicit discussion. This reversal suggests that simultaneous retrieval across all three tiers introduces redundant signals that dilute the specificity of episodic guidance rather than providing complementary information. This finding qualifies the hierarchical memory contribution: the primary design value of the tripartite structure lies in the richer multi-dimensional retrieval vocabulary it provides (evidenced by the 3D vs. 2D keyword ablation and the combined utility advantage in Table 7), not in an unconditional additive benefit from all tiers simultaneously. Adaptive memory-type gating, that is, conditionally suppressing perceptual and semantic retrieval when high-confidence episodic matches are available, is a promising direction that could recover the episodic-only ceiling and is deferred to future work.
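The gating idea described above could take a form such as the following. This is a hypothetical sketch of a direction left to future work, not part of the evaluated system: the `Memory` type, the function name, and in particular the threshold of 0.75 are illustrative assumptions that would need to be tuned on validation data.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Memory:
    tier: str      # "perceptual", "semantic", or "episodic"
    score: float   # retrieval relevance, assumed to lie in [0, 1]
    text: str


def gate_memories(
    candidates: Sequence[Memory],
    episodic_threshold: float = 0.75,  # hypothetical value, not from the paper
    top_k: int = 3,
) -> List[Memory]:
    """Return episodic memories only when high-confidence matches exist;
    otherwise fall back to the top-k candidates across all tiers."""
    ranked = sorted(candidates, key=lambda m: m.score, reverse=True)
    strong_episodic = [
        m for m in ranked if m.tier == "episodic" and m.score >= episodic_threshold
    ]
    if strong_episodic:
        return strong_episodic[:top_k]
    return ranked[:top_k]


# Hypothetical candidates for one turn.
candidates = [
    Memory("episodic", 0.82, "prior thesis-defense episode"),
    Memory("semantic", 0.80, "general coping pattern"),
    Memory("perceptual", 0.60, "recent affect snapshot"),
    Memory("episodic", 0.76, "prior milestone episode"),
]
selected = gate_memories(candidates)
print([m.tier for m in selected])
```

A hard threshold is the simplest possible gate; a soft down-weighting of the lower tiers would be an alternative design with the same intent.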

4.4. Hyperparameter Sensitivity

Table 8 reports the sensitivity of empathy score and personalization to the two most influential hyperparameters: top-k retrieval and the emotion-dimension weight (

w_{e}

). Performance is broadly stable across the tested ranges. The largest single variation occurs at

k = 1

, where the absence of retrieval diversity degrades personalization meaningfully. Increasing k beyond 3 yields diminishing returns, consistent with the redundancy pattern observed in the memory-type analysis. Dimension-weight perturbations of

\pm 0.05

do not produce statistically significant changes in any metric (

p > 0.10

), confirming that the selected configuration lies within a plateau of near-optimal performance.

4.5. Computational Cost and Latency Analysis

Table 9 provides a per-agent latency and token-consumption breakdown averaged over 100 test queries. The Perception API call (Qwen-2.5-14B) requires approximately 0.4 s, reflecting the compact input representation. The Cognition Agent (Llama-3.1-70B, local) requires approximately 1.3 s, owing to multi-step chain-of-thought generation. The Event API call (Gemma-2-9B, approximately 0.3 s) is dispatched as soon as the Cognition Agent’s output is complete and, under asynchronous scheduling, partially overlaps with the subsequent stage, reducing its contribution to the critical path. We note that, while the Event Agent requires the Cognition Agent’s full structured output (

C_{t}

) as input, the overlap in Table 9 refers to the partial overlap between the round-trip Event API (dispatched immediately upon Cognition output completion) and the Response Agent’s initialization phase; the Event Agent is never dispatched before Cognition completes. The Response Agent (Llama-3.1-70B, local) requires approximately 1.2 s. Total sequential latency is 3.2 s; asynchronous scheduling reduces this to approximately 2.6 s.
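The latency figures above can be tallied directly. The sketch below uses only numbers stated in this article (the per-agent latencies, the 2.6 s asynchronous total, and the 1.4 s Claude-3.5 latency quoted in Section 5.4); the variable names are illustrative.

```python
# Approximate per-agent latencies in seconds, as quoted in the text.
latency_s = {
    "perception": 0.4,  # Qwen-2.5-14B, API
    "cognition": 1.3,   # Llama-3.1-70B, local
    "event": 0.3,       # Gemma-2-9B, API
    "response": 1.2,    # Llama-3.1-70B, local
}

sequential_total = round(sum(latency_s.values()), 1)
async_total = 2.6         # reported asynchronous latency
claude_latency = 1.4      # reported single-model baseline latency

saved_by_async = round(sequential_total - async_total, 1)
overhead_pct = round(100 * (async_total - claude_latency) / claude_latency)

print(sequential_total, saved_by_async, overhead_pct)
```

The result reproduces the 3.2 s sequential total and the approximately 86% latency overhead relative to the single-model baseline.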

Table 10 compares system-level latency, computational footprint, and empathy performance across all evaluated systems. For infrastructure-independent comparison, MOSAIC requires approximately

2.4 \times 10^{14}

FLOPs per turn versus approximately

1.2 \times 10^{15}

for a single dense Llama-3.1-405B [14] forward pass under consistent forward-only counting (

C \approx 2 N T

; Appendix C; [57,58]). This indicates that the modular pipeline’s summed inference cost is roughly one-fifth that of the largest single-model baseline because the individual MOSAIC agents are each substantially smaller than 405 B.
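The forward-only estimate C ≈ 2NT can be applied to the figures above as a rough check. In the sketch below, the two FLOP totals are taken from the text, whereas the token count for the 405 B model is back-calculated from them and is therefore an implied quantity, not a reported one; per-agent token counts (Table 9) are not reproduced here.

```python
def forward_flops(n_params: float, n_tokens: float) -> float:
    """Forward-only inference cost estimate C ~ 2 * N * T."""
    return 2.0 * n_params * n_tokens


mosaic_flops = 2.4e14      # reported per-turn total for MOSAIC
llama405_flops = 1.2e15    # reported for a single Llama-3.1-405B pass

ratio = mosaic_flops / llama405_flops

# Token count implied by the reported 405B figure (back-calculated, not reported).
implied_tokens_405b = llama405_flops / (2.0 * 405e9)

print(round(ratio, 2), round(implied_tokens_405b))
```

The ratio of 0.2 corresponds to the "roughly one-fifth" statement, and the implied context of about 1.5 k tokens per pass gives a sense of the sequence length underlying the estimate.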

Under asynchronous scheduling, MOSAIC sustains approximately 1.4 dialogue turns per second on the described hardware, a

1.55 \times

throughput improvement over sequential operation at no additional computational overhead. This throughput is adequate for non-real-time empathetic applications such as asynchronous counseling support; latency-sensitive synchronous deployments would benefit from further parallelization strategies, such as batching of concurrent user sessions across the two local-cluster agents.

5. Discussion

The experimental results support the central hypothesis that explicit, cognitively motivated functional decomposition can improve empathetic dialogue quality and failure attributability without task-specific fine-tuning. The overall performance gain is not attributable to any single component; rather, it reflects the combined effect of three design decisions. The modular pipeline separates affective signal extraction from causal interpretation and response planning, enabling stage-level failure attribution through logged intermediate states. The hierarchical memory design improves long-horizon personalization by retrieving analogically structured support trajectories, though the combined use of all three memory tiers introduces retrieval redundancy that constrains the achievable gain relative to episodic memory alone. The heterogeneous-model integration strategy, supported by role-specific chain-of-thought prompts, makes these benefits available without labeled dialogue data.

We reiterate the most important qualification of these results: MOSAIC’s memory is pre-populated with 200 training-split episodes that single-model baselines do not access, and the current evaluation cannot fully disentangle architectural advantages from data-access advantages. The most compelling claims of this work—that structured decomposition enables stage-level failure attribution and that hierarchical episodic memory provides qualitatively distinguishable personalization (

d = 0.48

)—are supported by the ablation analyses and do not depend on the baseline comparison.

5.1. Architectural Transparency as a Design Objective

From an engineering standpoint, a key advantage of MOSAIC over monolithic prompted systems is failure attributability. In a single-pass architecture, an inadequate response cannot readily be traced to a specific computational cause: the deficiency may originate in affect perception, causal reasoning, memory retrieval, or response planning, and the system provides no mechanism for distinguishing these possibilities. MOSAIC addresses this limitation by logging structured intermediate representations at every turn. The ablation study reinforces this architecture: perception failures manifest as emotion-recognition errors, cognition failures as causal-attribution errors, and memory failures as personalization degradation. This characteristic signature pattern provides concrete guidance for targeted improvement at each stage. We reiterate, however, that the interpretability demonstrated here—the ability to localize failures to a processing stage—is architectural rather than an empirically validated improvement in practitioner diagnostic utility, which remains an open question for future evaluation.

5.2. Qualitative Analysis: Cross-Model Response Comparison

Figure 7 presents a single representative scenario, a user reporting the successful completion of a doctoral thesis defense, and shows the responses produced by five of the evaluated systems (MOSAIC, EmpathGen, Claude-3.5, GPT-4-Turbo, and Llama-3.1-405B). This shared-input comparison allows qualitative differences in empathetic framing, personalization, and affective calibration to be assessed without confounding stimulus variation.

The scenario was selected because it exhibits high-confidence perceptual signals (MOSAIC perception confidence of

0.92

), which ensures that variability across systems reflects differences in reasoning and generation strategy rather than ambiguity in the stimulus itself.

Several qualitative differences merit discussion. MOSAIC’s response is the only one that names both the pride and the emptiness explicitly, validates them as co-occurring and individually legitimate states, and closes with an open question that invites the user to define their own next step—a strategy grounded in the inferred psychological need for existential re-orientation retrieved from the episodic memory store (relevance scores of 0.82 and 0.76). EmpathGen [24] produces an encouraging response but defaults to reassurance (“I am sure you will find your path soon”), a pattern that annotators consistently flagged as premature resolution. Claude-3.5 [15] recognizes the emotional duality but frames it generically (“emotional vacuum”) without personalizing to the speaker’s three-year investment or inviting further dialogue. GPT-4-Turbo [16] exhibits the widest gap between coherence (4.1) and empathy (3.1), consistent with its tendency to normalize rather than validate; the response reads more like psychoeducation than supportive dialogue, which aligns with the quantitative finding that fluency and emotional attunement are dissociable. Llama-3.1-405B [14] is the most colloquial and warm of the monolithic baselines but does not engage with the specific content of the user’s experience; the reference to “post-thesis blues” as a label substitutes naming for understanding.

These observations suggest that the primary qualitative advantage of MOSAIC lies not in overall language quality—all systems produce fluent, supportive text—but in its capacity to track and respond to the precise emotional structure of the user’s utterance, including co-occurring valences and expressed uncertainties, in a manner that reflects prior conversational and episodic context.

5.3. Scope and Limits of the Cognitive Grounding

The cognitive-science grounding of MOSAIC operates strictly at the level of functional analogy; three clarifications delimit the scope of this claim.

First, the modular decomposition is motivated by evidence that empathy involves dissociable cognitive processes, as documented by Decety and Jackson [1] and extended by Singer and Lamm [2]. The agents do not replicate the corresponding neural systems; rather, grounding the architecture in established cognitive constructs provides an externally interpretable decomposition. The four-stage structure is principled in that each stage corresponds to a recognized functional component of empathetic cognition, but it would also be defensible on purely engineering grounds. Future work should demonstrate design choices that are specifically predicted by the cognitive grounding and that differ from what a non-cognitively-motivated decomposition would suggest.

Second, the retrieval patterns verified in Section 4.3.3 are consequences of the designed scoring function, not emergent behaviors. Their parallel to mood-congruent memory [43] and recency effects [19] speaks to the cognitive plausibility of the design choices, not to any claim of mechanistic equivalence with human memory.

Third, human empathetic processing operates via parallel and interacting sub-processes, whereas MOSAIC’s pipeline is strictly serial. Future architectures incorporating lateral connections between agents—for example, allowing the Cognition Agent to query the Perception Agent for confidence-conditioned elaboration—could better approximate the interactive character of human empathetic cognition.

5.4. Limitations and Directions for Future Work

Several limitations of the current framework warrant explicit acknowledgment. First, the asynchronous pipeline latency of 2.6 s per turn is approximately 86% higher than the strongest single-model baseline (Claude-3.5 [15], at 1.4 s) and constitutes a substantive constraint for real-time interactive deployment; further parallelization strategies—such as batching of concurrent sessions or caching of stable Cognition outputs across turns—are necessary for latency-sensitive applications. Second, the hybrid API-plus-local deployment creates dependency on commercial endpoint availability for the Perception and Event agents; future work should evaluate fully local and fully API-based configurations to assess sensitivity to this design choice. Third, the current memory model does not support cross-session consolidation, schema formation, or forgetting-induced restructuring, limiting its utility for applications requiring persistent long-term personalization. Fourth, the system is text-only and does not exploit prosodic, facial, or physiological signals that contribute substantially to real-world empathetic communication. Fifth, affective ambiguity is the primary failure bottleneck; implementing a confidence-aware meta-cognitive regulator [59] for inter-module conflict resolution is the most direct architectural remedy. Sixth, generalization to domains beyond the benchmark datasets—including clinical counseling, peer-support platforms, and non-English conversations—remains to be established. Seventh, the hyperparameter sensitivity analysis covers the primary parameters; an exhaustive search over the full space (including

λ_{0}

,

γ

, and

λ_{div}

) is deferred to future work. Eighth, the memory pre-population with training-split episodes creates a data-access asymmetry relative to single-model, training-free baselines; a matched-baseline condition in which Claude-3.5 and/or Llama-3.1-405B receive the same 200 episodes via embedding-based in-context retrieval is a high-priority next experiment to disentangle architectural gains from data-access gains. Future evaluation should assess whether comparable personalization gains are achievable through purely conversational memory accumulation, which would render the training-free characterization fully symmetric across comparison conditions. Ninth, underperformance of the combined-memory condition relative to episodic-only retrieval (

+ 0.38

vs.

+ 0.42

) identifies adaptive memory-type gating as a design priority; conditionally suppressing lower-tier memory when high-quality episodic matches are available is a principled route to recovering and potentially exceeding the episodic-only ceiling. Tenth, the interpretability claim is strictly architectural; the practitioner-facing diagnostic utility of MOSAIC’s logged intermediate states—whether developers and counselors can effectively use

S_{t}

,

C_{t}

, and

E_{t}^{ret}

for failure diagnosis—remains to be validated in a user study.

6. Conclusions

This paper presents MOSAIC, a training-free multi-agent framework for empathetic dialogue that integrates a cognitively motivated modular architecture, hierarchical three-tier emotional memory, and heterogeneous open-source and API-accessible language models orchestrated through role-specific chain-of-thought prompts. The term training-free denotes the absence of task-specific fine-tuning; backbone models employ general-purpose instruction tuning, and the memory store is pre-populated with 200 analogical episodes from the training split prior to test-set evaluation. Experiments on EmpatheticDialogues and ESConv show that, subject to the memory-initialization asymmetry described in the “Memory Initialization and Cold-Start Behavior” subsection, MOSAIC improves over training-free monolithic baselines in terms of aggregate empathy and human-rated personalization. We do not claim parity with fine-tuned state-of-the-art systems with respect to aggregate metrics; the practically meaningful and ablation-supported gain is in personalization, where the ablation without the Event Agent attributes the effect to the memory component, although it does not by itself separate the memory architecture from the pre-populated training-split episodes. The personalization gain (human evaluation:

d = 0.48

vs. Claude-3.5 five-shot) is the strongest and most practically meaningful result, as it reflects MOSAIC’s distinctive capacity to track and respond to the precise emotional structure of user utterances across conversational turns. Ablation analyses establish functionally dissociable module contributions and reveal that modular pipeline structure and heterogeneous model assignment each contribute approximately equally to the total architectural gain. An additional ablation confirms that a structured single-pass prompt recovers partial—but not full—performance relative to the modular pipeline, suggesting that enforced stage separation contributes independently of prompt structure. Among memory tiers, episodic memory yields the largest single-tier contribution, while combining all three tiers introduces retrieval redundancy that moderately attenuates this advantage; adaptive memory-type gating is identified as a priority for future work. Retrieval analyses confirm that the designed scoring policy faithfully realizes its intended emotional-congruence and temporal-recency properties, behaviors that parallel established findings in emotional memory research at the functional level without implying mechanistic equivalence. A comprehensive resource analysis shows that asynchronous operation achieves 2.6 s per turn—a trade-off appropriate for non-real-time counseling applications but requiring further optimization for synchronous interactive use. The qualitative cross-model comparison confirms that MOSAIC’s principal advantage over monolithic systems lies in its capacity to track and respond to the precise emotional structure of user utterances—including co-occurring valences and expressed uncertainties—in a manner grounded in retrieved episodic context.

Taken together, these results indicate that explicit functional decomposition grounded in the cognitive science of empathy can serve as a productive organizing principle for empathetic dialogue systems, improving both response quality and failure attributability without task-specific training. The complete prompt schemata and implementation code are publicly available at https://github.com/zerkvii/MOSAIC (accessed on 6 April 2026) to facilitate replication and further development.

Author Contributions

Conceptualization, K.L. and M.P.; Methodology, K.L. and H.X.; Software, K.L.; Validation, K.L. and J.Z.; Resources, J.Z.; Data curation, H.X.; Writing—original draft, K.L.; Writing—review and editing, K.L., H.X., J.Z. and M.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (NSFC) under Grant No. U23A20316 and Grant No. 62072346 and by the Joint Laboratory on Credit Technology.

Institutional Review Board Statement

This study involved human participants (six paid annotators) who rated system outputs on Likert scales. All procedures were reviewed by the institutional review board and granted minimal-risk exemption status in accordance with institutional guidelines for research involving human subjects. Informed consent was obtained from all participants prior to their involvement in the annotation task.

Data Availability Statement

The datasets and code used in this study are publicly available. The EmpatheticDialogues dataset is available from Facebook Research: https://github.com/facebookresearch/EmpatheticDialogues (accessed on 6 April 2026). The ESConv dataset is available from the Tsinghua University COAI Group: https://github.com/thu-coai/Emotional-Support-Conversation (accessed on 6 April 2026). The complete implementation of MOSAIC, including all agent prompts, the retrieval code, and evaluation scripts, is available at https://github.com/zerkvii/MOSAIC (accessed on 6 April 2026).

Acknowledgments

During the preparation of this work, the authors used generative artificial intelligence tools (including ChatGPT (GPT-4o model, OpenAI, San Francisco, CA, USA) and Claude (Claude 3.5 Sonnet model, Anthropic, San Francisco, CA, USA)) solely for language polishing and stylistic refinement of the manuscript text. These tools were not used for generation of research ideas, the design of experiments, analysis of data, or the creation of original scientific content. The authors take full responsibility for the content, accuracy, and integrity of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in study design; data collection, analyses, or interpretation; manuscript writing; or publication decision.

Appendix A. Full Prompt Templates

This appendix provides the complete role-specific system prompts used to instantiate MOSAIC’s four agents. In deployment, each system prompt is prepended to the current dialogue context and relevant upstream structured state. All prompts direct the model to reason step by step prior to emitting the final structured output.

Appendix A.1. Perception Agent Prompt

Appendix A.2. Cognition Agent Prompt

Appendix A.3. Event Agent Prompt

Appendix A.4. Response Agent Prompt

Appendix B. Model Capacity Ablation Results

Table A1 reports the results of capacity ablations conducted during model assignment. Each substitution replaces a single agent’s backbone with a smaller alternative (Llama-3.1-13B for Cognition/Response; Qwen-2.5-7B for Perception; Gemma-2-2B for Event) while holding all other agents and hyperparameters constant. All substitutions reduce empathy by 0.11–0.14 points ($p < 0.01$, paired t-test), supporting the capacity-based assignment rationale.

Table A1. Model capacity ablation: effect of replacing each agent’s backbone with a smaller alternative. Empathy ($\Delta$) is measured relative to the full MOSAIC (3.87). All differences are significant at $p < 0.01$.

Agent Substituted	Replacement Model	Empathy	Δ
MOSAIC (full)	—	3.87	—
Perception (Qwen-2.5-14B → 7B)	Qwen-2.5-7B	3.74	−0.13
Cognition (Llama-3.1-70B → 13B)	Llama-3.1-13B	3.73	−0.14
Event (Gemma-2-9B → 2B)	Gemma-2-2B	3.76	−0.11
Response (Llama-3.1-70B → 13B)	Llama-3.1-13B	3.73	−0.14

Appendix C. FLOP Derivation

Per-turn inference FLOPs are estimated using the standard forward-pass approximation $C \approx 2NT$, where $N$ is the number of non-embedding parameters of the model and $T$ is the total token count (input + output) processed in the call [57,58]. We note that the alternative $C \approx 6NT$ figure, which is familiar from training-cost accounting, additionally counts the backward pass, which is not performed at inference time; the forward-only $2NT$ figure is therefore the appropriate counting for the deployment-time analysis reported in Table 10. Per-agent token counts are taken from Table 9.

Table A2. Derivation of per-agent inference FLOPs for MOSAIC using $C \approx 2NT$. Tokens are mean (input + output) per call from Table 9. Values are rounded to two significant figures.

Agent	N (Params)	T (Tokens, In + Out)	FLOPs ( $\approx 2 NT$ )
Perception (Qwen-2.5-14B)	$1.4 \times 10^{10}$	350	$\approx 9.8 \times 10^{12}$
Cognition (Llama-3.1-70B)	$7.0 \times 10^{10}$	700	$\approx 9.8 \times 10^{13}$
Event (Gemma-2-9B)	$9.0 \times 10^{9}$	430	$\approx 7.7 \times 10^{12}$
Response (Llama-3.1-70B)	$7.0 \times 10^{10}$	900	$\approx 1.3 \times 10^{14}$
Total (sum across four agents)			$\approx 2.4 \times 10^{14}$

Under this counting, MOSAIC’s per-turn inference cost is approximately $2.4 \times 10^{14}$ FLOPs. For Llama-3.1-405B in dense single-pass inference at $T \approx 1500$, the corresponding figure is $2 \times 4.05 \times 10^{11} \times 1500 \approx 1.2 \times 10^{15}$ FLOPs/turn; we report this value in Table 10. MOSAIC’s summed inference cost remains substantially below that of a single dense Llama-3.1-405B forward pass (roughly one fifth) because the four MOSAIC agents are each much smaller than 405B.
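
The arithmetic behind Table A2 and the Llama-3.1-405B comparison can be reproduced in a few lines of Python. The sketch below is illustrative only: the parameter and token counts are copied from Table A2 and the paragraph above, while the function and variable names are hypothetical and are not taken from the MOSAIC repository.

```python
# Forward-pass FLOP estimate C ~ 2 * N * T
# (N: non-embedding parameters, T: input + output tokens per call).
# Numbers are copied from Table A2; all names here are illustrative.


def inference_flops(n_params: float, n_tokens: float) -> float:
    """Return the forward-only FLOP estimate 2 * N * T for one model call."""
    return 2.0 * n_params * n_tokens


# (parameters, mean tokens per call) for each MOSAIC agent
AGENTS = {
    "Perception (Qwen-2.5-14B)": (1.4e10, 350),
    "Cognition (Llama-3.1-70B)": (7.0e10, 700),
    "Event (Gemma-2-9B)": (9.0e9, 430),
    "Response (Llama-3.1-70B)": (7.0e10, 900),
}

per_agent = {name: inference_flops(n, t) for name, (n, t) in AGENTS.items()}
mosaic_total = sum(per_agent.values())

# Dense single-pass reference: Llama-3.1-405B at T ~ 1500 tokens
llama_405b = inference_flops(4.05e11, 1500)

if __name__ == "__main__":
    for name, flops in per_agent.items():
        print(f"{name}: {flops:.2e} FLOPs")
    print(f"MOSAIC total:   {mosaic_total:.2e} FLOPs")
    print(f"Llama-3.1-405B: {llama_405b:.2e} FLOPs")
    print(f"Ratio (MOSAIC / 405B): {mosaic_total / llama_405b:.2f}")
```

The ratio printed on the last line corresponds to the "roughly one fifth" figure quoted above.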

Appendix D. Per-Seed Variance on Main Metrics

Table A3. Per-seed results for MOSAIC on EmpatheticDialogues (three random seeds). Values confirm result stability; reported means in Table 3 are averages across seeds.

Seed	F1	BLEU-2	R-L	Emp
42	76.2	8.0	26.7	3.86
123	76.5	8.1	26.9	3.88
456	76.4	8.1	26.8	3.87
Mean ± SD	$76.4 \pm 0.15$	$8.1 \pm 0.06$	$26.8 \pm 0.10$	$3.87 \pm 0.01$
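
The summary row of Table A3 can be recomputed from the three per-seed rows. The sketch below uses Python's `statistics` module with the sample standard deviation (n − 1 denominator); that choice of estimator is an assumption, since the text does not state which one was used, although it agrees with the reported values at the printed precision. Variable and function names are illustrative.

```python
import statistics

# Per-seed scores copied from Table A3 (seeds 42, 123, 456).
PER_SEED = {
    "F1": [76.2, 76.5, 76.4],
    "BLEU-2": [8.0, 8.1, 8.1],
    "R-L": [26.7, 26.9, 26.8],
    "Emp": [3.86, 3.88, 3.87],
}


def mean_and_sd(values):
    """Return (mean, sample standard deviation) with the n - 1 denominator."""
    return statistics.mean(values), statistics.stdev(values)


summary = {metric: mean_and_sd(scores) for metric, scores in PER_SEED.items()}

if __name__ == "__main__":
    for metric, (mean, sd) in summary.items():
        print(f"{metric}: {mean:.2f} +/- {sd:.2f}")
```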

Appendix E. Holm–Bonferroni Comparison

Table A4 reports raw p-values from the paired t-tests in Table 3 on the empathy metric, alongside Bonferroni-corrected and Holm–Bonferroni step-down-corrected p-values, for each MOSAIC vs. baseline comparison ($m = 9$). All comparisons that are significant under Bonferroni remain significant under Holm–Bonferroni; we therefore retain the more conservative Bonferroni values in Table 3 for clarity.

Table A4. Multiple-comparison correction for the MOSAIC vs. baseline pairwise tests on the empathy metric. Holm–Bonferroni step-down adjustment ranks raw p-values in ascending order and applies a factor of $(m - i + 1)$ at rank $i$.

Comparison vs. MOSAIC (Empathy)	Raw p	Bonferroni p	Holm p
GPT-4-Turbo	$< 10^{- 12}$	$< 10^{- 11}$	$< 10^{- 11}$
Claude-3.5 (zero-shot)	$< 10^{- 10}$	$< 10^{- 9}$	$< 10^{- 9}$
Llama-3.1-405B	$< 10^{- 9}$	$< 10^{- 8}$	$< 10^{- 8}$
Gemini-1.5-Pro (5-shot)	$< 10^{- 7}$	$< 10^{- 6}$	$< 10^{- 6}$
Claude-3.5 (5-shot)	$< 10^{- 5}$	$< 10^{- 4}$	$< 10^{- 4}$
GLHG (fine-tuned)	$0.0021$	$0.019$	$0.011$
CEM (fine-tuned)	$0.014$	$0.126$ (n.s.)	$0.056$ (n.s.)
MultiEMO (fine-tuned)	$0.41$	$1.000$ (n.s.)	$0.41$ (n.s.)
EmpathGen (fine-tuned)	$0.029$	$0.261$ (n.s.)	$0.087$ (n.s.)
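
The step-down adjustment described in the Table A4 caption can be sketched as follows. This is a generic implementation of the Bonferroni and Holm–Bonferroni procedures, not the authors' evaluation script; the function names and the four example p-values are illustrative assumptions and are not taken from the paper. The running maximum enforces the usual monotonicity of Holm-adjusted p-values.

```python
from typing import List


def bonferroni(p_values: List[float]) -> List[float]:
    """Multiply every raw p-value by m, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm_bonferroni(p_values: List[float]) -> List[float]:
    """Holm step-down adjustment, returned in the original input order.

    Raw p-values are ranked in ascending order; the value at rank i
    (1-based) is multiplied by (m - i + 1). A running maximum keeps the
    adjusted values non-decreasing along the ranking.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order, start=1):
        candidate = min(1.0, p_values[idx] * (m - rank + 1))
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values, for illustration only (not from Table A4).
example_p = [0.04, 0.001, 0.03, 0.01]
example_bonferroni = bonferroni(example_p)
example_holm = holm_bonferroni(example_p)

if __name__ == "__main__":
    print("Bonferroni:", [round(p, 4) for p in example_bonferroni])
    print("Holm:      ", [round(p, 4) for p in example_holm])
```

By construction, each Holm-adjusted value is no larger than its Bonferroni counterpart, so Holm can only retain or add rejections relative to Bonferroni.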

References

1. Decety, J. Dissecting the neural mechanisms mediating empathy. Emot. Rev. 2011, 3, 92–108.
2. Singer, T.; Lamm, C. The social neuroscience of empathy. Ann. N. Y. Acad. Sci. 2009, 1156, 81–96.
3. Fan, Y.; Han, S. Temporal dynamic of neural mechanisms involved in empathy for pain: An event-related brain potential study. Neuropsychologia 2008, 46, 160–173.
4. Saxe, R.; Kanwisher, N. People thinking about thinking people: The role of the temporo-parietal junction in “theory of mind”. NeuroImage 2003, 19, 1835–1842.
5. Svoboda, E.; McKinnon, M.C.; Levine, B. The functional neuroanatomy of autobiographical memory: A meta-analysis. Neuropsychologia 2006, 44, 2189–2208.
6. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
7. Andreas, J.; Rohrbach, M.; Darrell, T.; Klein, D. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 39–48.
8. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Proc. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474.
9. Sabour, S.; Zheng, C.; Huang, M. CEM: Commonsense-aware empathetic response generation. Proc. AAAI Conf. Artif. Intell. 2022, 36, 11229–11237.
10. Peng, W.; Hu, Y.; Xing, L.; Xie, Y.; Sun, Y.; Li, Y. Control Globally, Understand Locally: A Global-to-Local Hierarchical Graph Network for Emotional Support Conversation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Vienna, Austria, 23–29 July 2022; pp. 4299–4305.
11. Shi, T.; Huang, S.L. MultiEMO: An Attention-Based Correlation-Aware Multimodal Fusion Framework for Emotion Recognition in Conversations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL); Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 14752–14766.
12. Park, J.S.; O’Brien, J.C.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative Agents: Interactive Simulacra of Human Behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), San Francisco, CA, USA, 29 October–1 November 2023; pp. 1–22.
13. Zhou, X.; Zhu, H.; Mathur, L.; Zhang, R.; Yu, H.; Qi, Z.; Morency, L.P.; Bisk, Y.; Fried, D.; Neubig, G.; et al. SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024.
14. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288.
15. Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku; Technical Report; Anthropic: San Francisco, CA, USA, 2024.
16. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774.
17. Young, S.; Gašić, M.; Thomson, B.; Williams, J.D. POMDP-based statistical spoken dialogue systems: A review. Proc. IEEE 2013, 101, 1160–1179.
18. Tulving, E. Elements of Episodic Memory; Clarendon Press: Oxford, UK, 1983.
19. Denkova, E.; Dolcos, S.; Dolcos, F. The Effect of Retrieval Focus and Emotional Valence on the Medial Temporal Lobe Activity during Autobiographical Recollection. Front. Behav. Neurosci. 2013, 7, 109.
20. Lin, Z.; Madotto, A.; Shin, J.; Xu, P.; Fung, P. MoEL: Mixture of empathetic listeners. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 121–132.
21. Majumder, N.; Hong, P.; Peng, S.; Lu, J.; Ghosal, D.; Gelbukh, A.; Mihalcea, R.; Poria, S. MIME: MIMicking Emotions for Empathetic Response Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 8968–8979.
22. Li, Q.; Chen, H.; Ren, Z.; Ren, P.; Tu, Z.; Chen, Z. EMPDG: A multi-resolution empathetic dialogue generation framework. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4454–4466.
23. Liu, S.; Zheng, C.; Demasi, O.; Sabour, S.; Li, Y.; Yu, Z.; Jiang, Y.; Huang, M. Towards emotional support dialog systems. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), Virtual, 1–6 August 2021; pp. 3469–3483.
24. Wang, F.; Shen, X.; Yu, J.; Xia, R. Flexible Thinking for Multimodal Emotional Support Conversation via Reinforcement Learning. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, 4–9 November 2025; pp. 1341–1356.
25. Meng, T.; Shou, Y.; Ai, W.; Du, J.; Liu, H.; Li, K. A multi-message passing framework based on heterogeneous graphs in conversational emotion recognition. Neurocomputing 2024, 569, 127109.
26. Hu, T.; Zheng, C.; Liu, S.; Sun, L.; Sun, H.; Zhan, Q. A survey on emotional support dialogue systems. ACM Comput. Surv. 2024, 57, 1–43.
27. Cheng, Y.; Shen, Y.; Liu, Y.; Wang, J. ConTegas: Contextualized empathetic dialogue generation with parameter-efficient tuning. In Proceedings of the IEEE International Conference on Data Mining (ICDM), Abu Dhabi, United Arab Emirates, 9–12 December 2024; pp. 1147–1152.
28. Cao, X.; Xu, M.; Yu, X.; Yao, J.; Ye, W.; Huang, S.; Zhang, M.; Tsang, I.; Ong, Y.S.; Kwok, J.T.; et al. Analytical Survey of Learning with Low-Resource Data: From Analysis to Investigation. ACM Comput. Surv. 2025, 58, 1–47.
29. Liu, T.; Cheng, Y.; Wu, N.; Ma, D.; Sun, W. Can large language models understand context? A probing study on in-context reasoning and attention. Nat. Mach. Intell. 2024, 6, 932–941.
30. Zhang, J.; Qian, K.; Liu, Z.; Heinecke, S.; Meng, R.; Liu, Y.; Yu, Z.; Wang, H.; Savarese, S.; Xiong, C. DialogStudio: Towards richest and most diverse unified dataset collection for conversational AI. In Proceedings of the Findings of the Association for Computational Linguistics: EACL 2024, St. Julian’s, Malta, 17–22 March 2024; pp. 2299–2315.
31. Chen, Z.; Liu, B.; Moon, S.; Sankar, C.; Crook, P.; Wang, W.Y. KETOD: Knowledge-enriched task-oriented dialogue systems with entity spans. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, USA, 10–15 July 2022; pp. 2581–2593.
32. Khot, T.; Trivedi, H.; Finlayson, M.; Fu, Y.; Richardson, K.; Clark, P.; Sabharwal, A. Decomposed prompting: A modular approach for solving complex tasks. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
33. Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Zhang, S.; Zhu, E.; Li, B.; Jiang, L.; Zhang, X.; Wang, C. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024.
34. Cao, Y.; Chen, H.; Jin, L.; Liu, Y.; Wang, P.; Yu, Z. MetaAgents: Simulating interactive multi-agent cooperation and competition at scale. Proc. AAAI Conf. Artif. Intell. 2024, 38, 17945–17953.
35. Lin, Z.; Wang, Y.; Zhou, Y.; Du, F.; Yang, Y. MLM-EOE: Automatic Depression Detection via Sentimental Annotation and Multi-Expert Ensemble. IEEE Trans. Affect. Comput. 2025, 16, 2842–2858.
36. Anderson, J.R.; Bothell, D.; Byrne, M.D.; Douglass, S.; Lebiere, C.; Qin, Y. An integrated theory of the mind. Psychol. Rev. 2004, 111, 1036–1060.
37. Laird, J.E. The Soar Cognitive Architecture; MIT Press: Cambridge, MA, USA, 2012.
38. Li, J.; Saket, B.; Etessami, K.; Barzilay, R. LaMP: Large language model personalization with progressive retrieval and reflective writing. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 10821–10838.
39. Zhong, W.; Guo, L.; Gao, Q.; Ye, H.; Wang, Y. MemoryBank: Enhancing large language models with long-term memory. Proc. AAAI Conf. Artif. Intell. 2024, 38, 19724–19731.
40. Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language agents with verbal reinforcement learning. Proc. Adv. Neural Inf. Process. Syst. 2023, 36, 8634–8652.
41. LaBar, K.S.; Cabeza, R. Cognitive neuroscience of emotional memory. Nat. Rev. Neurosci. 2006, 7, 54–64.
42. Anderson, J.R. A spreading activation theory of memory. J. Verbal Learn. Verbal Behav. 1983, 22, 261–295.
43. Bower, G.H. Mood and memory. Am. Psychol. 1981, 36, 129–148.
44. Liu, W.; Chen, X.; Miao, D.; Zhang, H.; Qin, X.; Du, S.; Lu, P. SEAD-MGFE-Net: Schrödinger equation-based adaptive dropout multi-granular feature enhancement network for conversational aspect-based sentiment quadruple analysis. Inf. Sci. 2025, 723, 122684.
45. Blair, R.J.R. Responding to the emotions of others: Dissociating forms of empathy through the study of typical and psychiatric populations. Conscious. Cogn. 2005, 14, 698–718.
46. Davis, M.H. Measuring individual differences in empathy: Evidence for a multidimensional approach. J. Personal. Soc. Psychol. 1983, 44, 113–126.
47. Jiang, H.; Chen, X.; Miao, D.; Zhang, H.; Qin, X.; Du, S.; Lu, P. 3WD-DRT: A three-way decision enhanced dynamic routing transformer for cost-sensitive multimodal sentiment analysis. Inf. Sci. 2025, 725, 122704.
48. Rashkin, H.; Smith, E.M.; Li, M.; Boureau, Y.L. Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 5370–5381.
49. Huang, Y.; Wang, Y.; Lu, D.; Chen, Y.; Yu, D. Towards continuous emotional awareness: A multimodal emotion recognition framework with large language models. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 10851–10855.
50. Weiner, B. An attributional theory of achievement motivation and emotion. Psychol. Rev. 1985, 92, 548–573.
51. Premack, D.; Woodruff, G. Does the chimpanzee have a theory of mind? Behav. Brain Sci. 1978, 1, 515–526.
52. Zhang, Y.; Struhl, N.; Koster, U.; McCoy, R.T. A theory of mind emerges in large language models trained on cryptic crosswords. In Proceedings of the Annual Meeting of the Cognitive Science Society, Rotterdam, The Netherlands, 24–27 July 2024; Volume 46, pp. 4120–4127.
53. Lin, C.Y. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81.
54. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating text generation with BERT. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020.
55. Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174.
56. Holm, S. A Simple Sequentially Rejective Multiple Test Procedure. Scand. J. Stat. 1979, 6, 65–70.
57. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling Laws for Neural Language Models. arXiv 2020, arXiv:2001.08361.
58. Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; de Las Casas, D.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training Compute-Optimal Large Language Models. arXiv 2022, arXiv:2203.15556.
59. Flavell, J.H. Metacognition and cognitive monitoring: A new area of cognitive–developmental inquiry. Am. Psychol. 1979, 34, 906–911.

Figure 1. Conceptual motivation for MOSAIC. A user utterance (“I finally defended my thesis successfully!”) is processed through four sequential, functionally specialized stages: (1) Perception extracts affective signals (e.g., pride intensity, 4.5; confidence, 0.92); (2) Cognition performs causal appraisal along attributional dimensions (internal–stable–controllable); (3) Event Memory retrieves analogically relevant prior episodes; (4) Response synthesizes a warm, contextually calibrated reply. Each stage exposes an inspectable intermediate state, in contrast to monolithic single-pass architectures, where such reasoning remains opaque.

Figure 2. Overview of MOSAIC. The framework decomposes empathetic dialogue into four cognitively motivated stages (affective perception, causal appraisal, episodic memory retrieval, and response synthesis) supported by a tripartite hierarchical memory structure. Intermediate state representations ($S_{t}$, $C_{t}$, and $E_{t}^{ret}$) are logged at each turn to enable post hoc failure attribution. Modules are distinguished by both color and shape to facilitate reading in monochrome reproduction.

Figure 3. Dissociable impairment profiles across MOSAIC’s functional modules. Percentage drops are computed relative to the full model (recognition, 76.4%; empathy, 3.87; personalization, 3.67). The non-uniform degradation patterns confirm that each agent makes a functionally distinct contribution that is not replicated by the remaining modules. Error bars: 95% confidence intervals.

Figure 4. Verification of the emotional-similarity bias in memory retrieval. The positive correlation between retrieval frequency and emotional keyword overlap confirms that the $\mathrm{sim}_{e}$ term in Equation (6) operates as designed. The behavioral pattern parallels mood-congruent memory as reported by Bower [43] but reflects design intent rather than cognitive simulation. Error bars: 95% confidence intervals.

Figure 5. Verification of the temporal-recency bias in memory retrieval ($p(t) = 0.32\, e^{-0.04 t}$, $R^{2} = 0.94$). The decay coefficient $\lambda_{0} = 0.04$ in the fitted curve corresponds directly to the base-decay parameter in the implementation, serving as an implementation consistency check. The functional form was motivated by the autobiographical memory research of Denkova et al. [19] but does not imply mechanistic equivalence.
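
As a worked illustration of the fitted curve in Figure 5, the sketch below evaluates $p(t) = 0.32\, e^{-0.04 t}$ and the implied half-life $\ln 2 / \lambda_{0}$. Only the two constants come from the caption; reading $t$ as the age of a stored episode (in whatever unit the figure's horizontal axis uses) and the function names are assumptions made for illustration.

```python
import math

P0 = 0.32        # fitted amplitude, from the Figure 5 caption
LAMBDA_0 = 0.04  # fitted decay coefficient, from the Figure 5 caption


def retrieval_probability(t: float) -> float:
    """Evaluate the fitted curve p(t) = P0 * exp(-LAMBDA_0 * t)."""
    return P0 * math.exp(-LAMBDA_0 * t)


def half_life() -> float:
    """Age at which p(t) falls to half of its value at t = 0."""
    return math.log(2.0) / LAMBDA_0


if __name__ == "__main__":
    for t in (0, 10, 25, 50):
        print(f"t = {t:>2}: p(t) = {retrieval_probability(t):.3f}")
    print(f"half-life: {half_life():.1f}")
```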

Figure 6. Per-tier contribution to empathy-score improvement, measured relative to the no-memory baseline (empathy, 3.49; 95% CI: [3.43, 3.55]). The combined-memory condition (+0.38) falls below the episodic-only condition (+0.42), indicating that simultaneous retrieval across all three tiers introduces redundant signals that reduce retrieval specificity. Error bars: 95% confidence intervals.

Figure 7. Cross-model response comparison for a shared user utterance expressing pride, disbelief, and existential uncertainty following doctoral thesis defense. Human ratings are mean scores from three independent annotators on the 1–5 Likert scale. MOSAIC explicitly acknowledges the emotional duality (pride and emptiness) and introduces a personalized follow-up question grounded in retrieved episodic trajectories; monolithic systems offer support at a higher level of abstraction with comparatively less user-adaptive framing. GPT-4-Turbo (3.1 empathy) displays the widest gap between coherence (4.1) and empathy, consistent with the quantitative finding that fluency and emotional attunement are dissociable. Model references: EmpathGen [24], Claude-3.5 [15], GPT-4-Turbo [16], and Llama-3.1-405B [14].

Table 1. Summary of agent-specific prompt design. All prompts elicit step-by-step reasoning (“think step by step before responding”) prior to the required structured output. Access mode indicates whether the model is invoked via a commercial API endpoint or deployed on our local GPU cluster.

Agent	Model	Access	Core Instruction	Output Schema
Perception	Qwen-2.5-14B	API	Identify primary and secondary emotions, intensity [0–5], linguistic markers, and affective trajectory from the utterance and prior context	`{primary, secondary, intensity, markers, trajectory}`
Cognition	Llama-3.1-70B	Local cluster	Perform causal appraisal (controllability, stability, locus), infer mental states and psychological needs, generate three-dimensional retrieval keywords	`{appraisal, mental_state, need, $K_{e}$ , $K_{s}$ , $K_{c}$ }`
Event	Gemma-2-9B	API	Score stored episodes against current query keys; select top-3 with diversity reranking; provide per-episode retrieval rationale	`[{sit, traj, cope, out, score, why} × 3]`
Response	Llama-3.1-70B	Local cluster	Generate an empathetic reply integrating all upstream signals; calibrate affective register to $S_{t}$ ; avoid toxic positivity and premature reframing when grief or ambivalence is present	Free-form utterance (50–150 tokens)
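
The output schemas in Table 1 can be read as typed records. The sketch below is one possible Python rendering: the field names follow the table, but the field types, the class names, the mapping of $K_{e}$, $K_{s}$, $K_{c}$ to emotion, situation, and coping-strategy keywords, and the two validation helpers are assumptions for illustration and are not taken from the MOSAIC repository.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PerceptionState:
    """Perception Agent output; field types are assumed."""
    primary: str
    secondary: List[str]
    intensity: float      # Table 1 gives the range [0-5]
    markers: List[str]
    trajectory: str


@dataclass
class CognitionState:
    """Cognition Agent output; field types are assumed."""
    appraisal: Dict[str, str]  # e.g. controllability, stability, locus
    mental_state: str
    need: str
    k_e: List[str]  # K_e: emotion keywords (assumed reading)
    k_s: List[str]  # K_s: situation keywords (assumed reading)
    k_c: List[str]  # K_c: coping-strategy keywords (assumed reading)


@dataclass
class RetrievedEpisode:
    """One element of the Event Agent's top-3 list; field types are assumed."""
    sit: str
    traj: str
    cope: str
    out: str
    score: float
    why: str


def validate_perception(state: PerceptionState) -> None:
    """Reject intensities outside the [0, 5] range stated in Table 1."""
    if not 0.0 <= state.intensity <= 5.0:
        raise ValueError(f"intensity out of range: {state.intensity}")


def validate_retrieval(episodes: List[RetrievedEpisode]) -> None:
    """Check that exactly three episodes were returned, as in Table 1."""
    if len(episodes) != 3:
        raise ValueError(f"expected 3 episodes, got {len(episodes)}")


if __name__ == "__main__":
    state = PerceptionState("pride", ["relief"], 4.5, ["finally"], "rising")
    validate_perception(state)
    print(state)
```

Validating each record as it is logged is one way such intermediate states could support post hoc failure attribution, since a malformed record points to the stage that produced it.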

Table 2. Approximate active-parameter scale and deployment mode of all compared systems. Closed-source model sizes are not publicly disclosed. MOSAIC’s per-turn count reflects the sum of four sequentially invoked agents; parameters are not simultaneously resident in memory. MOSAIC ^a additionally uses 200 training-split episodes for memory initialization, which single-model, training-free baselines do not.

System	Type	Approx. Parameters	Deployment
GLHG [10]	Fine-tuned	∼330 M	Local
CEM [9]	Fine-tuned	∼125 M	Local
MultiEMO [11]	Fine-tuned	∼1.5 B	Local
EmpathGen [24]	Fine-tuned	∼7 B	Local
GPT-4-Turbo [16]	Training-free	not disclosed	API
Claude-3.5 [15]	Training-free	not disclosed	API
Gemini-1.5-Pro	Training-free	not disclosed	API
Llama-3.1-405B [14]	Training-free	405 B	Local
MOSAIC agent breakdown (sequential; ∼163 B total active per turn ^b)
Perception (Qwen-2.5-14B)	Training-free	14 B	API
Cognition (Llama-3.1-70B)	Training-free	70 B	Local cluster
Event (Gemma-2-9B)	Training-free	9 B	API
Response (Llama-3.1-70B)	Training-free	70 B	Local cluster

^a MOSAIC pre-populates memory with 200 training-split episodes; single-model baselines do not. ^b Sum of sequentially activated parameters; not simultaneously resident in memory.

Table 3. Performance on EmpatheticDialogues. Statistical markers denote significance of pairwise comparison against MOSAIC: *: $p < 0.05$; **: $p < 0.01$; ***: $p < 0.001$ (two-tailed paired t-tests with Bonferroni correction, $m = 9$). The 95% confidence interval for MOSAIC is reported in the row below. Bold values denote the best result in each column. All comparisons with training-free baselines should be interpreted in light of the memory-initialization asymmetry described in Section Memory Initialization and Cold-Start Behavior.

Model	F1	BLEU-2	R-L	Emp
Fine-tuned systems (2022–2025)
GLHG (2022) [10]	75.8 **	7.8 *	26.3 **	3.79 **
CEM (2022) [9]	76.2	7.9	26.7	3.81 *
MultiEMO (2023) [11]	77.1 *	8.2	27.1	3.85
EmpathGen (2025) [24]	77.8 **	8.4 *	27.8 **	3.92 *
Training-free baselines (zero-shot unless noted)
GPT-4-Turbo [16]	71.3 ***	7.1 ***	24.2 ***	3.51 ***
Claude-3.5 [15]	73.6 ***	7.4 ***	25.1 ***	3.58 ***
Llama-3.1-405B [14]	74.2 ***	7.6 **	25.6 ***	3.64 ***
Gemini-1.5-Pro (5-shot)	74.8 ***	7.7 **	25.9 **	3.69 ***
Claude-3.5 (5-shot) [15]	75.3 **	7.9	26.1 *	3.73 ***
MOSAIC (ours)	76.4	8.1	26.8	3.87
95% CI	[75.8, 77.0]	[7.9, 8.3]	[26.4, 27.2]	[3.82, 3.92]
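
The star markers in Table 3 can be read as thresholds on Bonferroni-adjusted p-values. The following is a minimal sketch, assuming the adjustment multiplies each raw p-value by the number of comparisons ($m = 9$ here) and caps the result at 1. The function names and the example p-values are hypothetical and are not taken from the paper.

```python
def bonferroni_adjust(p_raw: float, m: int) -> float:
    """Return the Bonferroni-adjusted p-value, capped at 1.0."""
    if not 0.0 <= p_raw <= 1.0:
        raise ValueError("p_raw must lie in [0, 1]")
    if m < 1:
        raise ValueError("m must be a positive integer")
    return min(1.0, p_raw * m)


def significance_marker(p_raw: float, m: int = 9) -> str:
    """Map a raw p-value to the star notation used in the table captions."""
    p_adj = bonferroni_adjust(p_raw, m)
    if p_adj < 0.001:
        return "***"
    if p_adj < 0.01:
        return "**"
    if p_adj < 0.05:
        return "*"
    return ""


# Hypothetical raw p-values, for illustration only.
for p in (0.00005, 0.0008, 0.004, 0.02):
    print(p, repr(significance_marker(p)))
```

Under this reading, a raw p-value of 0.02 carries no marker once it is adjusted for nine comparisons, which is why some small differences in the table are left unmarked.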

Table 4. ESConv results. ***: $p < 0.001$ versus MOSAIC (Wilcoxon signed-rank test, Bonferroni correction, $m = 2$). Bold values denote the best result in each column.

Model	BERT-S	Coherence	Empathy
EmpathGen (2025) [24]	82.8 ***	3.95 ***	3.82 ***
Claude-3.5 (5-shot) [15]	81.4 ***	3.86 ***	3.76 ***
MOSAIC (ours)	84.1	4.12	4.02
95% CI	[83.4, 84.8]	[4.05, 4.19]	[3.95, 4.09]
Effect vs. Claude-3.5 (5-shot)	$d = 0.52$	$d = 0.58$	$d = 0.62$

Table 5. Human evaluation results (200 ED + 60 ESConv conversations; 1–5 Likert scale). Wilcoxon signed-rank tests with Bonferroni correction ($m = 2$). ***: $p < 0.001$ versus MOSAIC. Bold values denote the best result in each column. Per-dimension inter-rater agreement (Fleiss’ $\kappa$): empathy, 0.72; coherence, 0.74; personalization, 0.63; overall, 0.71; mean, $\kappa = 0.70$ (moderate to substantial) [55].

Model	Empathy	Coherence	Personalization	Overall
EmpathGen (2025) [24]	3.68 ***	3.91 ***	3.28 ***	3.62 ***
Claude-3.5 (5-shot) [15]	3.54 ***	3.88 ***	3.24 ***	3.55 ***
MOSAIC (ours)	3.78	4.02	3.67	3.82
Effect size	$r = 0.42$	$r = 0.38$	$r = 0.48$	$r = 0.46$

Table 6. Ablation results on the EmpatheticDialogues test set. Cohen’s d (avg) is the mean effect size across F1, empathy, and personalization relative to the full MOSAIC model. All comparisons are significant at $p < 0.001$ unless noted (^†: $p < 0.01$). The “Modular + Uniform Llama-70B” variant replaces every agent with Llama-3.1-70B (all local) while preserving the four-agent pipeline and all role-specific prompts, thereby isolating modular structure from model heterogeneity. “Single LLM (structured CoT prompt)” provides a single Llama-3.1-70B call with a concatenated prompt that explicitly requests sequential perception, cognition, retrieval, and response outputs within one generation, isolating the effect of separate model invocations vs. structured single-pass generation.

Variant	F1	Emp	Pers	d (avg)
MOSAIC (full)	76.4	3.87	3.67	—
Agent ablations
−Perception	73.8	3.61	3.58	0.27
−Cognition	74.2	3.58	3.52	0.25
−Event memory	74.9	3.52	3.21	0.32
Architecture ablations
Modular + Uniform Llama-70B (all agents, local)	75.3	3.74	3.55	0.14
Single LLM (Llama-70B, no role prompts)	74.1	3.61	3.45	0.24
Single LLM (Llama-70B, structured CoT prompt)	74.6	3.68	3.49	0.18
Flat memory (no hierarchy)	75.2	3.69	3.38	0.16
Embedding-only retrieval	75.8	3.73	3.49	0.11 ^†
2D keywords ( $K_{e}$ + $K_{s}$ only)	75.7	3.71	3.44	0.14
No chain-of-thought prompts	75.3	3.58	3.51	0.19
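
The "d (avg)" column averages one effect size per metric. The sketch below shows one way such a figure could be computed, assuming the pooled-standard-deviation form of Cohen's d for two independent samples; the caption does not say which variant the authors use, so this choice, the function names, and the toy scores are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(full, ablated):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    n1, n2 = len(full), len(ablated)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two observations")
    pooled_var = (
        (n1 - 1) * stdev(full) ** 2 + (n2 - 1) * stdev(ablated) ** 2
    ) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(full) - mean(ablated)) / sqrt(pooled_var)


def average_effect(per_metric):
    """Mean of per-metric effect sizes, mirroring the 'd (avg)' column."""
    return mean([cohens_d(full, abl) for full, abl in per_metric.values()])


# Toy per-item scores for two metrics (hypothetical, not from the paper).
toy = {
    "F1": ([2.0, 4.0, 6.0], [1.0, 3.0, 5.0]),
    "Emp": ([2.0, 4.0, 6.0], [0.0, 2.0, 4.0]),
}
print(average_effect(toy))
```

Averaging across metrics gives a single summary per variant, at the cost of hiding cases where one metric moves much more than the others, as the personalization column does for the event-memory ablation.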

Table 7. Retrieval performance disaggregated by dimension, measured over 500 conversations. “Matches” is the mean number of episodes retrieved per query. “Utility” is the Spearman rank correlation between retrieval score and downstream empathy score, aggregated at the conversation level (mean over all retrieval events per conversation).

Dimension	Matches	Recall	Utility	p-Value
Emotion ( $K_{e}$ )	3.4	0.78	0.68	—
Situation ( $K_{s}$ )	2.1	0.62	0.71	$< 0.01$
Coping ( $K_{c}$ )	1.6	0.54	0.79	$< 0.001$
Combined 3D	4.8	0.84	0.82	$< 0.001$
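
The "Utility" column is described as a Spearman rank correlation computed after averaging retrieval scores within each conversation. A minimal sketch of that two-step computation follows, using only the standard library. The data layout (a list of retrieval scores plus one empathy score per conversation) and all names are assumptions for illustration, and the toy values are not from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


def conversation_level_utility(conversations):
    """Average retrieval scores per conversation, then correlate with empathy."""
    mean_scores = [sum(scores) / len(scores) for scores, _ in conversations]
    empathy = [emp for _, emp in conversations]
    return spearman(mean_scores, empathy)


# Toy data: (retrieval scores per retrieval event, empathy score).
toy = [([0.2, 0.4], 3.0), ([0.6, 0.8], 4.0), ([0.5], 3.5), ([0.9, 0.7], 4.5)]
print(conversation_level_utility(toy))
```

Aggregating to the conversation level before ranking avoids treating multiple retrieval events from the same conversation as independent observations.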

Table 8. Hyperparameter sensitivity on the EmpatheticDialogues validation split. Bold entries denote the default configuration. ^†: statistically different from the default at $p < 0.05$.

k	$w_{e} / w_{s} / w_{c}$	Emp	Pers	Note
1	0.35/0.35/0.30	3.71 ^†	3.44 ^†	Single-episode, low diversity
3	0.35/0.35/0.30	3.87	3.67	Default
5	0.35/0.35/0.30	3.86	3.65	Marginal redundancy
3	0.45/0.30/0.25	3.84	3.63	Emotion-heavy
3	0.25/0.45/0.30	3.83	3.64	Situation-heavy
3	0.25/0.30/0.45	3.85	3.66	Coping-heavy
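
To make the roles of $k$ and $w_{e} / w_{s} / w_{c}$ concrete, the sketch below ranks candidate episodes by a weighted sum of three per-dimension similarities and keeps the top $k$. The linear combination, the `Episode` fields, and the toy similarity values are assumptions for illustration only; the scoring policy actually used by MOSAIC is defined in Section 3.5 and may differ.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """Hypothetical candidate with per-dimension similarities in [0, 1]."""
    episode_id: str
    emotion_sim: float
    situation_sim: float
    coping_sim: float


def retrieval_score(ep, weights=(0.35, 0.35, 0.30)):
    """Weighted sum over emotion, situation, and coping similarities."""
    w_e, w_s, w_c = weights
    return w_e * ep.emotion_sim + w_s * ep.situation_sim + w_c * ep.coping_sim


def top_k(episodes, k=3, weights=(0.35, 0.35, 0.30)):
    """Return the k highest-scoring episodes, best first."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1")
    return heapq.nlargest(k, episodes, key=lambda ep: retrieval_score(ep, weights))


# Toy candidates (hypothetical values).
candidates = [
    Episode("a", 0.9, 0.2, 0.1),
    Episode("b", 0.5, 0.8, 0.6),
    Episode("c", 0.3, 0.4, 0.9),
    Episode("d", 0.7, 0.7, 0.2),
]
default_ids = [ep.episode_id for ep in top_k(candidates)]
coping_ids = [ep.episode_id for ep in top_k(candidates, weights=(0.25, 0.30, 0.45))]
print(default_ids, coping_ids)
```

In this toy setting, shifting weight toward coping reorders the lower-ranked episodes but leaves the top result unchanged, which is in keeping with the small differences among the weight settings in Table 8.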

Table 9. Per-agent latency and token-consumption breakdown for MOSAIC. Local timings measured on our in-house NVIDIA A100-80GB cluster; API timings are round-trip wall-clock measurements averaged over 100 test queries. The asynchronous total reflects the overlap between Event API dispatch and Response Agent initialization. Token counts are mean input/output tokens per turn.

Agent	Model	Access	Latency (s)	Tokens (In/Out)	Note
Perception	Qwen-2.5-14B	API	0.4	280/70	Compact input
Cognition	Llama-3.1-70B	Local cluster	1.3	520/180	Chain of thought
Event	Gemma-2-9B	API	0.3	340/90	Dispatched after Cognition completes; overlaps with Response initialization (async)
Response	Llama-3.1-70B	Local cluster	1.2	760/140	Full context integration
Total (sequential)			3.2	1900/480
Total (async, Event ‖ Response init)			2.6	1900/480

Table 10. System-level latency and per-turn computational footprint. The asynchronous MOSAIC latency (2.6 s) is approximately 86% higher than Claude-3.5 (1.4 s); this overhead is acceptable for non-real-time applications but is a substantive consideration for synchronous interactive deployment. FLOPs are reported under consistent forward-only counting ($C \approx 2 N T$) for systems with disclosed architectures; derivation provided in Appendix C.

System	Latency (s)	FLOPs/Turn	Tokens/Turn	Emp
Claude-3.5 (5-shot) [15]	1.4	—	∼1500	3.73
GPT-4-Turbo [16]	1.6	—	∼1500	3.51
Llama-3.1-405B (local) [14]	2.1	$\approx 1.2 \times 10^{15}$	∼1500	3.64
EmpathGen (fine-tuned) [24]	0.9	—	∼800	3.92
MOSAIC (sequential)	3.2	$\approx 2.4 \times 10^{14}$	2380	3.87
MOSAIC (async)	2.6	$\approx 2.4 \times 10^{14}$	2380	3.87
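
The figures in Table 10 can be checked against the per-agent numbers in Tables 2 and 9. The sketch below applies $C \approx 2 N T$ to each agent, assuming that $T$ is the sum of that agent's input and output tokens and that the pipeline total is the sum over agents. Appendix C is not reproduced here, so this decomposition is an assumption, although it does land on the reported order of magnitude. Function and variable names are illustrative.

```python
# (parameter count, mean input tokens, mean output tokens) per agent,
# copied from Tables 2 and 9.
AGENTS = {
    "Perception": (14e9, 280, 70),
    "Cognition": (70e9, 520, 180),
    "Event": (9e9, 340, 90),
    "Response": (70e9, 760, 140),
}


def forward_flops(n_params, n_tokens):
    """Forward-only estimate C ~ 2 * N * T."""
    return 2.0 * n_params * n_tokens


def pipeline_flops(agents):
    """Sum the per-agent estimates (assumed decomposition)."""
    return sum(
        forward_flops(n, t_in + t_out) for n, t_in, t_out in agents.values()
    )


def relative_overhead(value, reference):
    """Fractional increase of value over reference."""
    return (value - reference) / reference


total_tokens = sum(t_in + t_out for _, t_in, t_out in AGENTS.values())
mosaic_flops = pipeline_flops(AGENTS)
llama_flops = forward_flops(405e9, 1500)  # Llama-3.1-405B row, ~1500 tokens
latency_overhead = relative_overhead(2.6, 1.4)  # MOSAIC async vs. Claude-3.5

print(total_tokens)
print(f"{mosaic_flops:.1e}", f"{llama_flops:.1e}")
print(f"{latency_overhead:.0%}")
```

Under these assumptions the per-agent sum comes to roughly $2.4 \times 10^{14}$ FLOPs per turn and 2380 tokens, matching the MOSAIC rows, and the single 405 B model comes to roughly $1.2 \times 10^{15}$, about five times the pipeline total despite MOSAIC's higher token count.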

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, K.; Xiong, H.; Zhang, J.; Peng, M. MOSAIC: A Cognitively Motivated Multi-Agent Framework for Interpretable and Training-Free Empathetic Dialogue. Electronics 2026, 15, 2078. https://doi.org/10.3390/electronics15102078

