Review

Empathy by Design: Reframing the Empathy Gap Between AI and Humans in Mental Health Chatbots

1
School of Computer Science, University of Nottingham, Jubilee Campus, Wollaton Road, Nottingham NG8 1BB, UK
2
School of Health Sciences, University of Nottingham, Queen’s Medical Centre, Nottingham NG7 2HA, UK
3
NIHR Nottingham Biomedical Research Centre, Nottingham University Hospitals NHS Trust, Nottingham NG7 2UH, UK
*
Author to whom correspondence should be addressed.
Information 2025, 16(12), 1074; https://doi.org/10.3390/info16121074
Submission received: 17 October 2025 / Revised: 27 November 2025 / Accepted: 3 December 2025 / Published: 4 December 2025
(This article belongs to the Special Issue Internet of Things (IoT) and Cloud/Edge Computing)

Abstract

Artificial intelligence (AI) chatbots are now embedded across therapeutic contexts, from the United Kingdom’s National Health Service (NHS) Talking Therapies to widely used platforms like ChatGPT. Whether welcomed or not, these systems are increasingly used for both patient care and everyday support, sometimes even replacing human contact. Their capacity to convey empathy strongly influences how people experience and benefit from them. However, current systems often create an “AI empathy gap”, where interactions feel impersonal and superficial compared to those with human practitioners. This paper, presented as a critical narrative review, cautiously challenges the prevailing narrative that empathy is a uniquely human skill that AI cannot replicate. We argue this belief can stem from an unfair comparison: evaluating generic AIs against an idealised human practitioner. We reframe capabilities seen as exclusively human, such as building bonds through long-term memory and personalisation, not as insurmountable barriers but as concrete design targets. We also discuss the critical architectural and privacy trade-offs between cloud and on-device (edge) solutions. Accordingly, we propose a conceptual framework to meet these targets. It integrates three key technologies: Retrieval-Augmented Generation (RAG) for long-term memory; feedback-driven adaptation for real-time emotional tuning; and lightweight adapter modules for personalised conversational styles. This framework provides a path toward systems that users perceive as genuinely empathic, rather than ones that merely mimic supportive language. While AI cannot experience emotional empathy, it can model cognitive empathy and simulate affective and compassionate responses in coordinated ways at the behavioural level. However, because these systems lack conscious, autonomous ‘helping’ intentions, these design advancements must be considered alongside careful ethical and regulatory safeguards.

1. Introduction

1.1. Background

Empathy—the ability to recognise another’s emotions and respond in a manner sensitive to the context—is typically divided into three types: cognitive empathy (understanding another’s perspective), affective empathy (sharing their feelings), and compassionate empathy (motivation to help) [1]. This capacity is widely recognised for its critical role in enhancing well-being and alleviating psychological distress [2,3,4], particularly in mental health settings [5,6]. Conversely, invalidating or unempathic communication has been linked to heightened negative affect and poorer outcomes, including harm [7,8]. When this capability is implemented in artificial intelligence (AI) systems, it is often termed artificial or computational empathy [9,10]. While there is ongoing debate, including philosophical and phenomenological views that reserve “real” empathy for claims about consciousness, inner feeling, or moral depth [11], our focus is on how empathy can be implemented or simulated across these three commonly described components as part of an integrated whole. AI can arguably implement cognitive empathy through perspective modelling, which does not require experiencing emotions, whereas emotional and compassionate empathy are more plausibly seen as simulated, since AI systems lack conscious, autonomous caring intentions or consciously experienced feeling states.
As AI chatbots spread across therapeutic contexts, from clinical services to their use in social robots and mobile apps, delivering interactions that are perceived as empathic becomes a central design challenge. For example, Limbic Access, an AI chatbot used to augment clinical intake and assessment in NHS (National Health Service) Talking Therapies, has, according to the company, delivered assessments to nearly 400,000 patients across the United Kingdom (UK) [12]. Mental health conditions are often treated (and diagnosed) through language, making them an especially suitable domain for large language models (LLMs), which process unstructured text as both input and output [13]. Recent advances in natural language processing (NLP) and generative AI have enabled the development of systems capable of delivering increasingly human-like conversations and forming connections through personalised dialogue [14], with evidence of improvements in psychological outcomes such as reduced anxiety and stress [15,16]. Beyond these clinical gains, computational empathy can also offer interactional benefits, such as clearer understanding of user intent, allowing systems to better infer what individuals hope to achieve and provide more contextually appropriate support. These advances mark a substantial shift from early rule-based systems to modern chatbots that engage in more coherent, context-aware, and emotionally attuned exchanges.
Despite these advancements, a recurring narrative is that human interactions are intrinsically more empathic than AI, with the Topol Review, a UK government report, claiming that “empathy and compassion” are “essential human skill[s] that AI cannot replicate” [17]. Many head-to-head comparisons of perceived empathy currently use generic LLMs such as ChatGPT as the AI arm, not purpose-built therapeutic agents [18,19]. While these models can generate empathic-sounding language, such responses can often feel superficial. Users may detect deeper, subtler signals of empathy during human interactions, signals that go beyond comforting phrases [20]. Reflecting this, Scholich et al. [21] found that therapists conveyed empathy through probing questions and sensitivity to nonverbal cues, whereas chatbots tended to fall back on reassurance and generic advice, leading the authors to conclude that current systems cannot match the empathic depth required in sensitive situations. However, much current debate compares AI to an idealised human practitioner while evaluating AI by its generic implementations, which creates an asymmetry in expectations [22]. Moreover, many capabilities typically considered uniquely human—such as longitudinal rapport, memory for personal details, and sensitivity to non-verbal cues—can be engineered through long-term memory architectures, adaptive personalisation, and multimodal affect recognition. We therefore treat these not as categorical barriers but as concrete design targets, provided they are implemented with safeguards to prevent over-attachment, mitigate deception, and protect vulnerable users.
This paper therefore reframes the so-called “AI empathy gap” [23] as a problem of design and evaluation rather than an inherent technological shortfall. In text-based interactions, LLMs have already been rated as more empathic than clinicians in several studies, and they can outperform humans on standard emotional-intelligence tests [24,25]. Meanwhile, automatic emotion recognition from facial, vocal, and textual cues is now at or near human-level performance [26]. Where AI falls short, this typically reflects current design constraints rather than fundamental limitations.

1.2. Scope and Approach of This Critical Narrative Review

Within this context, we present a critical narrative review and conceptual framework to guide the design of future systems. A critical narrative review synthesises literature through an explicit interpretive lens, drawing together empirical, conceptual, and theoretical work to produce an integrated, theory-driven account of a field [27]. This approach complements systematic work that quantitatively synthesises evidence on perceived empathy in comparisons between AI chatbots and human healthcare practitioners [24]. Rather than providing an exhaustive systematic synthesis, our goal is to interpret and reframe this literature through a design-oriented lens. In this paper, we adopt this approach by combining our interpretive position with an iterative search of empirical, conceptual, and technical literature on mental health chatbots, computational empathy, and conversational systems. Because narrative reviews involve iterative cycles of searching, analysis, and interpretation rather than exhaustive systematic screening [27], we used purposive web-based searches, citation chaining, and the authors’ expertise to identify relevant literature. Drawing on this approach, the paper proceeds in three steps. First, we review the current state of chatbots to establish what current systems already achieve. Second, we outline the limitations that sustain the perception of an empathy gap, focusing on personalisation, memory, and real-time adaptation. Third, we propose a conceptual framework that integrates retrieval-augmented memory, feedback-driven adaptation, and optional adapter-based style control. We argue that the so-called empathy gap is contingent, not categorical, with differences in perceived empathy arising from current design and evaluation choices, not from properties that only humans can possess.

2. Review of Current Systems and Approaches

2.1. Early Chatbots and Contemporary Deployments

Early mental health chatbots were often rule-based. A well-known example is ELIZA, developed in the 1960s, with the aim of simulating a psychotherapist [28]. It used simple pattern matching to rephrase the user’s input. Contemporary systems are much more sophisticated, mixing decision-tree style conversation paths with machine learning. Woebot, a widely publicised and clinically evaluated mental health chatbot [29], illustrates this hybrid approach. In practice, the chatbot combined scripted dialogues with limited NLP methods (like keyword spotting) to manage brief check-ins and branched CBT (cognitive behavioural therapy) exercises, which allowed safe and predictable interactions but limited flexibility. Empathic responses came from pre-scripted options rather than being generated afresh, which constrained the system’s capacity for adaptive, person-specific empathy. The company ultimately retired its direct-to-consumer app on 30 June 2025, citing the cost and uncertainty of United States Food and Drug Administration (FDA) marketing authorisation as it considered shifting toward more sophisticated large-language-model features [30].
Wysa illustrates the current state of purpose-built therapeutic chatbots through its dual-product strategy. The Wysa Digital Referral Assistant, deployed in NHS Talking Therapies, supports intake and triage using structured dialogue and simple NLP rather than LLMs. This design makes responses predictable and easier to validate for clinical safety, enabling its classification as a Class I medical device and a conditional recommendation from the National Institute for Health and Care Excellence (NICE) for use during a three-year evidence-generation period [31]. Despite the limitations of a retrieval-based system, a mixed-methods study found that users formed a bond with the Wysa chatbot at levels comparable to those with human therapists [32]. However, limitations were apparent, with users expressing frustration over its frequent misunderstandings, irrelevant responses, and repetitiveness, which we can largely attribute to it not having an LLM’s ability to generate new content and process unstructured text. The company’s consumer app, Wysa+, does use LLMs with guardrails, but this version is not NHS-approved [33], likely reflecting the regulatory challenges in validating the safety and predictability of generative models. By contrast, Limbic Access integrates an LLM constrained by a proprietary control layer, allowing it to secure Class IIa certification and be deployed within NHS Talking Therapies. Together, these examples highlight the trade-off between limiting conversational flexibility to satisfy regulatory requirements and using LLMs with safeguards to deliver more adaptive and human-like empathy.

2.2. General-Purpose Large Language Models

General-purpose LLMs are also playing an increasingly prominent role in mental health support, even though they were not originally designed and validated for this purpose. OpenAI’s analysis of ChatGPT found that many people use it for comfort and therapeutic support [34], and participants in some studies have even rated ChatGPT as more empathic than clinicians [24].
LLMs achieve human-like empathy through training that captures emotional and social patterns in language. They are first pretrained on vast text corpora containing natural examples of emotional expression and social dialogue. This allows them to internalise patterns such as “I’m sorry to hear…” even without explicit emotional objectives. In effect, pretraining instils a baseline understanding of human language that includes emotional nuance. After pretraining, LLMs undergo supervised fine-tuning on narrower datasets to refine their style and improve performance on specific tasks [35]. For empathy, this often means training on example conversations labelled for compassionate responses. While supervised tuning provides examples, Reinforcement Learning from Human Feedback (RLHF) is used to align the model’s behaviour to human preferences on empathy and optimise the linguistic patterns of supportive communication learned during pre-training [36]. Through this process, models are systematically optimised to generate responses that human raters prefer, effectively ‘punishing’ outputs perceived as unhelpful or cold and rewarding those that signal warmth and understanding. Figure 1 illustrates this process.
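To make the preference-alignment step concrete, the sketch below trains a toy linear reward model on pairwise human preferences using the Bradley–Terry objective that underpins RLHF reward modelling. It is a minimal illustration only: the feature values are hypothetical stand-ins for response representations, not outputs of any real annotation pipeline or of the systems discussed above.

```python
# Minimal sketch (not a production RLHF pipeline): a toy reward model trained on
# pairwise human preferences, the core signal behind RLHF. Feature vectors are
# hypothetical, e.g. [warmth, length, specificity] scores for each response.
import numpy as np

chosen = np.array([[0.9, 0.7, 0.8], [0.8, 0.6, 0.9], [0.7, 0.9, 0.6]])    # preferred replies
rejected = np.array([[0.2, 0.3, 0.1], [0.4, 0.2, 0.3], [0.1, 0.5, 0.2]])  # dispreferred replies

w = np.zeros(3)   # parameters of a linear reward model r(x) = x . w
lr = 0.5

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected)
for _ in range(200):
    margin = chosen @ w - rejected @ w
    grad = -((1 - sigmoid(margin))[:, None] * (chosen - rejected)).mean(axis=0)
    w -= lr * grad

# After training, the kinds of replies raters preferred score higher.
print("reward(warm, specific reply):", np.array([0.85, 0.7, 0.8]) @ w)
print("reward(cold, terse reply):  ", np.array([0.10, 0.2, 0.1]) @ w)
```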
This process manifests in several key linguistic features. For instance, Small et al. [37] found that ChatGPT-4 tended to reflect users’ own wording back to them, along with more affiliative expressions and positive words, resulting in a 61.5% higher positive polarity score in its replies compared to clinicians (0.21 vs. 0.13). This matters because positively framed messages tend to increase perceived empathy and trustworthiness [38]. The length of LLM replies is also an important factor, as longer responses correlate with higher perceived empathy [24]. In a direct evaluation, ChatGPT averaged 211 words per reply versus 52 for clinicians; this difference in length may help explain why its answers were rated as more empathic, in addition to being rated as higher quality [39]. Clinicians’ replies, by contrast, can often appear rushed [40] due to short appointment times and competing demands: one study found patients were interrupted after a median of only 11 s [41]. Such brevity may contribute to a broader problem reported by many people with lived experience of mental illness, who describe feeling devalued, dismissed, and dehumanised by health professionals, for example, when spoken to in stigmatising or demeaning ways [42]. LLMs, on the other hand, are trained to mirror supportive language and can feel less judgemental [43], and they are not subject to the same time pressures. However, they can reflect biases from their training data [44], and they are not a substitute for professional care or crisis services.

2.3. Social Robots

Many conversational systems are now embodied in physical robots that are being used to promote social engagement, reduce loneliness, and support emotional well-being [45,46]. Embodiment adds channels such as touch, gesture, gaze, and vocal prosody, but these cues will feel hollow if the dialogue is generic. The framework proposed below is therefore a prerequisite for integrated multimodal empathy in social robots. Addressing empathic conversation first creates a foundation for adding nonverbal behaviours that feel coherent and personal.
Table 1 synthesises the capabilities of rule-based and generative AI systems against the benchmark of an idealised human practitioner. While we have demonstrated how AI chatbots can already convey empathy in ways comparable to, and sometimes exceeding, human practitioners, important limitations remain that go beyond linguistic markers. Section 3 therefore examines the key technological barriers.

3. Limitations of Current Approaches: Connectivity, Personalisation and Real-Time Adaptation

3.1. Shallow and Hollow Personalisation

Many chatbots still operate as “one-size-fits-all” systems, with limited capacity to learn from individual users or adapt style to user preferences. Reviews of mental health chatbots describe a heavy reliance on scripted, rule-based flows, which can lead to generic interactions [47]. Even for advanced LLMs, developers acknowledge the need for more individualisation. Sam Altman, CEO of OpenAI, remarked that their goal for ChatGPT is “more per-user customisation of model personality”, since users vary widely in whether they prefer concise factual replies, more emotionally expressive language, or a warmer conversational tone [48]. This highlights how personalisation remains a key challenge, much as therapists must adjust their approach to the unique needs of each client. Truly adaptive systems (those that remember personal details, adjust communication style, and modify therapeutic strategies based on what works for a specific user) are largely absent. This is an issue because personalisation (tailoring the conversation to an individual’s needs and context, like referencing past experiences) is highly valued in therapy, as it is known to be a key factor in improving adherence and outcomes [49]. Current systems are adept at using many empathic words and phrases (“I’m sorry to hear that…”), but there is a difference between “empathic-sounding” text and actually making the user feel empathically understood. Liu et al. [20] highlight how this results in some users perceiving chatbots as less genuinely empathic than humans. Interestingly, automated analyses of chatbot-based conversations show no large difference in the linguistic markers of empathy between bots and humans, implying chatbots can say the “right things”, yet users can still feel something is missing. This suggests that many current systems rely on superficial cues to signal empathy; they can sprinkle apologetic or sympathetic phrases, but often lack the deeper consistency, insight, or warmth that users subconsciously look for. Achieving genuine-feeling empathy on par with humans requires more than surface-level responses; it requires the chatbot to truly engage with the user’s experience in a nuanced and personal way, which current systems fail to do.

3.2. Limited Memory and Context Length

Most chatbots still operate within limited context windows. Even though some newer models can process hundreds of thousands of tokens at once (a token is just a small text unit, roughly the size of a short word or part of a word) [50], the window is still finite and tied to a single session. As a result, the AI can “forget” earlier parts of a conversation, especially in prolonged chats or across sessions. Important nuances, such as a deeply personal remark made earlier by the user, can be lost. Without a long-term memory of interactions, the chatbot cannot reliably follow up in a human-like way. Users may find themselves re-explaining things, which is frustrating, breaks the empathic rapport, and weakens the “illusion” of genuineness. Some systems now try to address this. For example, ChatGPT supports saved memories, where concise facts about a user can persist across chats and be reintroduced into future conversations [51]. This improves continuity but is still not the same as an open-ended, human-like memory. More generally, retrieval-based approaches are used to reinsert important context without overwhelming model capacity, and this retrieve-and-inject pattern is likely to be central to more empathic systems.
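As an illustration of the retrieve-and-inject pattern described above, the following minimal sketch shows persistent “saved memories” being filtered for relevance and prepended to the prompt so the model regains context it would otherwise lose between sessions. The memory entries and the keyword-overlap filter are simplified assumptions for illustration, not a description of any particular product’s mechanism.

```python
# Minimal sketch of the retrieve-and-inject pattern: stored user facts are filtered for
# relevance and placed into the prompt before generation. Entries are hypothetical.
saved_memories = [
    "User prefers brief, practical suggestions.",
    "User mentioned struggling with sleep before exams.",
    "User has a dog called Biscuit who helps with anxiety.",
]

def relevant_memories(message: str, memories: list[str]) -> list[str]:
    # Naive keyword overlap; a real system would use semantic (embedding) retrieval.
    words = set(message.lower().split())
    return [m for m in memories if words & set(m.lower().split())]

def build_prompt(message: str) -> str:
    context = "\n".join(relevant_memories(message, saved_memories))
    return (
        "You are a supportive mental health companion.\n"
        f"Relevant things you remember about this user:\n{context}\n\n"
        f"User: {message}\nAssistant:"
    )

print(build_prompt("My exams start next week and my sleep is terrible again."))
```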

3.3. Rigidity in Emotional Adaptation

Current chatbots often employ a fixed empathy style that does not evolve within conversations or persist across them. Kang et al. [52] found that these systems tend to rely on a narrow set of “safe” strategies, most often generic reassurance (“I’m sorry you’re going through this”), and struggle to transition toward more active forms of help or deeper exploration when the situation requires it. This rigidity contrasts with human supporters, who naturally shift between listening, problem-solving, or even humour based on continuous assessment of the other person’s needs. This limitation stems from multiple factors. In part, it reflects a design emphasis on safety, causing chatbots to default to risk-averse responses. But as Scholich et al. [21] reinforce, the challenge goes beyond safety constraints; therapists emphasise that effective adaptation depends on reading nonverbal cues—facial expressions, tone of voice, or subtle signals like sarcasm—that inform moment-to-moment adjustments. When lacking this capability, chatbots can remain locked into generalised response patterns.

3.4. Cloud Architectures

Most contemporary mental health chatbots operate on a cloud-based architecture, requiring an internet connection and the transmission of user data to a central server. This design inherently introduces privacy risks. Reviews of mental health apps have consistently found weak data protections, which erode user trust and engagement [53]. This erosion of trust is problematic, since patient trust correlates strongly and positively with perceived clinician empathy [54]. Aside from trust, there are legal and regulatory consequences, e.g., if developers fail to comply with laws like the General Data Protection Regulation (GDPR). Dependence on connectivity also creates fragility. A lost connection can abruptly shatter the illusion of presence, revealing an error message that undermines the agent’s perceived reliability and empathy. Furthermore, if a chatbot were to use multimodal signals (e.g., facial expressions), effective interaction would require very low latency, yet continuously streaming sensitive data from users’ devices to the cloud is both ethically and technically problematic. These factors make standard cloud-based architectures difficult to reconcile with the demands of empathic mental health support, particularly when multimodal.

4. Future Directions: A Conceptual Framework for Adaptive Empathy

To bridge the gap between the current state of AI empathy and its potential, we propose a conceptual framework to guide future implementations, built on three capabilities: Personalised Memory, Dynamic Adaptation, and Stylistic Flexibility. This framework directly addresses the identified shortfalls—shallow personalisation, limited memory, and emotional rigidity—by treating human-like empathy not as an unattainable trait but as a set of concrete engineering challenges.

4.1. Building a Personal Profile for Each User

Continuity is important for trust and feeling safe [55]. People feel understood when prior concerns are remembered and revisited, which in turn builds trust. We propose that an empathic chatbot should gradually build and update a personal model of each user, similar to how a skilled empathic human practitioner recalls important details about an individual over time. Rather than treating every interaction as isolated, as is typically done, the chatbot could track individual preferences, sensitivities, personalities, and conversation styles, and adjust its communication accordingly based not only on the current input but also by actively drawing on past conversations and stored personal information. In practice, this could involve creating a lightweight “profile” for each user, where key patterns (such as emotional triggers, preferred tones, and major life events) are stored (with consent) and appropriately referenced during conversations. Each time the chatbot generates a reply, it should retrieve relevant information from the user’s profile to guide its response—a process known as Retrieval-Augmented Generation (RAG)—ensuring the interaction feels more personal and contextually appropriate. A lightweight retrieval layer can support this process by comparing the user’s current message to their stored profile information and pulling out the most relevant details based on meaning, without overwhelming the model’s active context. To make this work, the system would need a way to connect new messages with past information. This kind of memory is possible with existing technology. For example, a model like MiniLMv2 can turn each message into a compact “fingerprint” of its meaning, while tools such as FAISS or Chroma can act as efficient filing systems that store and search these fingerprints [56,57,58]. Together, they would allow a chatbot to recognise when new messages connect to past ones and bring that context into the reply. In practice, this means the system could remember a user once mentioned loving golden retrievers and later connect that detail when the user says they are thinking of getting a pet, making the interaction feel more personal and continuous. To illustrate this difference in practice, Table 2 compares responses to a real-world user query adapted from the Reddit community r/asktherapists. The comparison highlights how a standard generative model offers generic reassurance, whereas applying the proposed framework—specifically the RAG component—enables the system to perform ‘reality testing’ by retrieving specific evidence of the user’s past success.
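The following sketch illustrates this retrieval layer under stated assumptions: a MiniLM-family sentence encoder from the sentence-transformers library (all-MiniLM-L6-v2, used here as a stand-in for MiniLMv2) embeds profile entries, and FAISS indexes and searches them. The profile entries and the query are hypothetical; Chroma or another vector store could equally fill this role.

```python
# Sketch of semantic profile retrieval with a MiniLM-family encoder and FAISS.
# Model name and profile entries are illustrative assumptions.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

profile = [
    "Finds breathing exercises more helpful than journaling.",
    "Mentioned loving golden retrievers and missing a childhood pet.",
    "Prefers a warm but direct conversational tone.",
]

# Embed profile entries once and index them; inner product on normalised
# vectors is equivalent to cosine similarity.
vectors = encoder.encode(profile, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.asarray(vectors, dtype="float32"))

def retrieve(message: str, k: int = 2) -> list[str]:
    query = encoder.encode([message], normalize_embeddings=True)
    _, ids = index.search(np.asarray(query, dtype="float32"), k)
    return [profile[i] for i in ids[0]]

# A message about getting a pet should surface the golden retriever detail.
print(retrieve("I'm thinking of getting a pet to keep me company."))
```

The retrieved entries would then be injected into the prompt alongside the current message, as in the earlier retrieve-and-inject sketch.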

4.2. Adjusting Responses Dynamically Based on User Feedback

Empathy is an interactive process that involves recognising, interpreting, and responding to another’s verbal and non-verbal cues [59]. A therapist demonstrates empathy by noticing a client’s change in tone, a hesitation, or a shift in engagement, and adapting their response accordingly. For conversational agents, this could extend to a “match, then steer” principle, where the system first mirrors the user’s current tone to build rapport, then gently nudges the exchange toward a more constructive emotional state (for example calming visible frustration or re-engaging a disengaged user). User outcomes can also be logged over time, for example, noting whether someone responds positively to humour or prefers concise, directive advice, and then used to adjust the chatbot’s behaviour. While this resembles reinforcement learning (RL), it does not require updating model weights. Instead, behaviour can be adapted dynamically through prompt manipulation guided by user feedback. Simple mechanisms like thumbs-up or thumbs-down can provide basic signals, but these are crude compared to real interaction. A more natural method would be to observe how a user’s affect changes after each response, noting whether they become more positive or negative following a chatbot reply. Because short-term variations in facial expression or vocal tone can be noisy, signals can be averaged over short time windows (for example, 30 s) to reflect more persistent emotional states rather than brief fluctuations. This smoothed signal, combined with text-based sentiment, could provide a more robust estimate of how effective the chatbot’s current conversational style is.
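A minimal sketch of this smoothing-and-steering loop is shown below. Per-frame valence estimates (which could come from any local affect toolkit; the values here are hypothetical) are averaged over a 30 s window, blended with text sentiment, and mapped to a prompt-level style instruction. The thresholds and equal weighting are illustrative assumptions, not validated parameters.

```python
# Sketch of feedback-driven style selection: windowed valence smoothing plus text
# sentiment drives a "match, then steer" prompt instruction. All numbers are illustrative.
import time
from collections import deque

WINDOW_SECONDS = 30
samples: deque[tuple[float, float]] = deque()  # (timestamp, valence in [-1, 1])

def add_valence_sample(valence: float, now: float | None = None) -> None:
    now = time.time() if now is None else now
    samples.append((now, valence))
    while samples and now - samples[0][0] > WINDOW_SECONDS:
        samples.popleft()  # drop readings older than the window

def smoothed_valence() -> float:
    return sum(v for _, v in samples) / len(samples) if samples else 0.0

def choose_style(text_sentiment: float) -> str:
    # Blend facial/vocal valence with text sentiment (equal weights as an assumption).
    score = 0.5 * smoothed_valence() + 0.5 * text_sentiment
    if score < -0.3:
        return "Match the user's low mood first, then gently steer toward one small, concrete next step."
    if score > 0.3:
        return "Keep the upbeat tone and reinforce what is going well."
    return "Stay neutral, ask an open question, and check how the user is feeling."

# Example: a run of mildly negative frames plus a negative message sentiment.
t0 = time.time()
for i, v in enumerate([-0.2, -0.4, -0.5, -0.3]):
    add_valence_sample(v, now=t0 + i * 5)
print(choose_style(text_sentiment=-0.4))
```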
We can also watch basic engagement cues, for example, gaze alignment with the camera or long pauses, since therapists often notice when someone seems distant. Prior work in multimodal virtual humans has shown that facial, gaze, and vocal features can be tracked during mental health conversations to keep interactions natural [60]. Toolkits like OpenFace 2.0 and on-device Software Development Kits (SDKs) such as those developed by BLUESKEYE AI already support fast, lightweight affect recognition on mobile and desktop devices, processing facial expressions and gaze locally, with some SDKs also incorporating vocal features, without needing cloud servers [61,62]. This demonstrates that real-time multimodal loops can run efficiently on mobile hardware, adding a “privacy-preserving” edge layer for adapting chatbot responses by keeping sensitive data local. This layer could be augmented with other on-device AI functionalities to create a more comprehensive edge architecture (Figure 2).

4.3. Potential for True Reinforcement Learning with Local Models

The adaptive methods described above work by adjusting prompts and retrieving user-specific context, without altering the underlying model. This raises the question of whether “true” RL, in the machine learning sense, could be applied instead. In principle, if a local model such as LLaMA 3 were used (rather than a closed API model like GPT-5), user feedback could directly fine-tune the model’s internal weights so that, over time, it learns what users collectively experience as empathic. This kind of group-level training could be valuable for tailoring the chatbot to the needs of a specific target population, since the model would gradually align with what that community finds empathic for them. However, this approach comes with major limitations. First, it requires large amounts of data and secure infrastructure for collecting and transmitting feedback. Second, even with sufficient data, fine-tuning produces a general model that reflects average user preferences rather than personalising responses for individuals. Finally, the cost of fine-tuning even a “small” model with billions of parameters makes it impractical to train separate models for each user. For these reasons, individual-level adaptation is likely better achieved by maintaining external user profiles and memory stores, which guide the chatbot’s responses dynamically without retraining the base model.

4.4. Exploring Adapter-Based Personalisation

As a practical middle ground between relying solely on external memory with prompt-based control and implementing full RL, a promising approach could be to use lightweight adapter modules on a fixed, fine-tuned base model. These modules, such as those enabled by Low-Rank Adaptation (LoRA), add a small number of trainable parameters without changing the model’s weights [63]. During operation (a process known as inference), the large base model remains unchanged; instead, specific adapters can be loaded on demand to steer the conversational style, for example making it more direct and action-oriented, or more reflective and nurturing, all without the significant computational cost of retraining the full model. While creating a unique adapter for every user does not scale effectively due to management overhead, a more realistic approach is micro-stratification. This involves maintaining a small, curated library of adapters designed for user cohorts that share common preferences or therapeutic needs. Adapters can also be combined with RAG. In this combined architecture, RAG could govern what is said by injecting relevant, user-specific facts into the context, while adapters shape how it is said by modulating the model’s stylistic tendencies. In practice, a service could maintain a versioned set of pre-validated adapters, for example ‘Supportive’, ‘Directive’, ‘Brief’, or ‘Expansive’, and then select among them based on recent feedback signals or a user’s declared preferences. This method keeps operational complexity modest, preserves a clean audit trail (since both the base model and the adapters are fixed and versioned), and offers finer stylistic control than prompting alone, while avoiding the substantial data and infrastructure burden of continuous, population-level RL.
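A sketch of this adapter-switching pattern is given below using the Hugging Face transformers and PEFT libraries. The base model name and adapter paths are hypothetical placeholders; the point is only that one frozen base model can serve several pre-validated style adapters, swapped at inference time without retraining.

```python
# Sketch of adapter-based style control with PEFT/LoRA. Base model name and adapter
# paths are placeholders; the pattern is one frozen base model plus a small library
# of versioned style adapters selected at inference time.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.2-1B-Instruct"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE)

# Load a curated library of style adapters (paths are placeholders).
model = PeftModel.from_pretrained(base_model, "adapters/supportive", adapter_name="supportive")
model.load_adapter("adapters/directive", adapter_name="directive")
model.load_adapter("adapters/brief", adapter_name="brief")

def reply(message: str, style: str) -> str:
    model.set_adapter(style)  # swap conversational style without touching base weights
    inputs = tokenizer(message, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=120)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Feedback signals or declared preferences would choose the adapter, e.g.:
print(reply("I keep putting off making a GP appointment.", style="directive"))
```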

4.5. Edge Architectures

Looking beyond the framework’s core capabilities, the choice between cloud and edge computing is important for trust and reliability. In response to the privacy and accessibility issues of cloud-based mental health chatbots, an edge model, where the chatbot runs entirely on the user’s device, offers a compelling alternative. It works offline and processes conversations locally, so sensitive data does not leave the device, and network latency is removed. When systems rely on many signals, for example, text, gaze, and facial affect, low end-to-end latency is critical for timely adaptation; performing these inferences locally can help by avoiding round-trip delays. However, if the local hardware is underpowered, there can be significant computational latency, making interactions slower despite being on-device. The practical value is therefore mixed, since some users will prioritise privacy while others will notice reduced speed or capability until hardware and model efficiency improve. A primary challenge is the computational demand of modern LLMs, which often exceeds consumer mobile capacity. Despite these hurdles, some applications use a full edge approach, such as Layla, a commercial offline chatbot that downloads a large model file and runs quantized Llama-2 variants on mobile CPUs [64]. The result preserves privacy, but reasoning depth and response speed remain lower than leading cloud systems.
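To illustrate what a fully on-device turn might look like, the sketch below uses llama-cpp-python, which runs quantised GGUF models on consumer CPUs. The model file path is a placeholder for whatever quantised model has been downloaded to the device; no user data leaves the device in this configuration.

```python
# Sketch of an on-device chatbot turn with llama-cpp-python. The GGUF path is a
# hypothetical placeholder for a locally stored quantised model.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b-chat.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=2048,      # context window; larger values cost memory and latency
    n_threads=4,     # tune to the device's CPU
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a supportive, empathic companion."},
        {"role": "user", "content": "I've had a rough day and can't switch off."},
    ],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```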
Figure 3 provides a high-level overview of the conceptual framework.

4.6. Challenges and Risks

Despite this potential, the move toward genuinely adaptive, empathic AI introduces new and heightened risks that demand careful consideration, especially in mental health contexts, where clinical systems operate under strict regulatory oversight such as NHS governance frameworks. As increasingly capable systems combine long-term memory, adaptive personalisation, and finely tuned empathic language, some interactions may start to feel indistinguishable from those with a caring human practitioner. This increasing realism raises important questions about how clearly users understand that empathy is being provided by an AI system, and whether some individuals might place more trust or emotional weight on these systems than intended. Because our proposed framework aims to support more coherent and contextually attuned empathic behaviour at the behavioural level, safeguards should be built in from the outset to promote transparency, appropriate use, and protection of vulnerable users. Current systems often rely on scripted dialogues precisely to mitigate harm (i.e., they are deterministic); our framework, by design, moves away from this model and therefore inherits well-documented issues such as AI hallucinations [65] and the potential for sycophancy, where a chatbot, in trying to validate the user’s feelings, may reinforce harmful thoughts or behaviours without appropriate challenge [66,67]. Creating an empathic chatbot increases this risk, as its supportive tone can unintentionally affirm delusions or unsafe choices [68]. While human therapists also make mistakes, errors from AI systems often attract greater scrutiny and may be more severe, underscoring the need for rigorous testing, escalation protocols, and strong governance. These challenges are compounded by technical and ethical hurdles, especially concerning data privacy, user consent, and the substantial computational demands of personalisation.
At a broader level, further societal risks emerge. Increasing sophistication raises the danger of user over-reliance, potentially discouraging individuals from seeking professional medical support. The capacity for users to form strong emotional bonds with empathic systems also presents risks of commercial exploitation. If high levels of perceived empathy and supportiveness can be achieved, such capabilities could be repurposed beyond therapeutic contexts—for instance, in romantic or companion-oriented applications—creating risks of dependency and social withdrawal. Reports from users of the Replika chatbot illustrate these concerns, describing romantic attachment and even grief when updates altered the bot’s behaviour [69].
However, these challenges do not constitute an argument against developing computational empathy but rather highlight the need for clear regulatory and ethical frameworks. Regardless of differing perspectives on its role, AI is going to be increasingly used in patient care—sometimes even replacing human contact—and people are already seeking emotional support from such systems [24]. Refusing to design for empathy would therefore be less responsible than doing so transparently and safely. Failing to imbue these systems with empathy could, at best, mean forgoing key benefits associated with human empathy, such as adherence and satisfaction, and at worst, risk causing patient distress through interactions perceived as cold or unsupportive [3].

5. Discussion

5.1. Implications for Care Delivery

This paper has argued that the perceived “empathy gap” in AI is not a fundamental technological limitation but a design challenge that can be addressed through a more sophisticated architectural approach. The proposed framework—integrating retrieval-augmented long-term memory, feedback-driven adaptation, and modular style control—offers a concrete roadmap for developing AI chatbots that can deliver more personalised, consistent, and emotionally attuned interactions. The implications of such a development, particularly within the current mental health landscape, are significant. Rather than replacing traditional therapy, these systems could serve as complementary tools that offer immediate, empathic support during times when human help may not be available. This is especially relevant given the persistent shortage of mental health professionals and the rising demand for accessible care around the clock [70]. In this environment, such a tool could act as a vital stopgap, providing continuity and engagement when human help is unavailable. Research shows that more than one-third of individuals diagnosed with mental health conditions in primary care receive no formal treatment [71]. For this significant cohort, AI chatbots designed with our proposed framework could help fill a critical support vacuum, offering a safe and scalable way to engage in empathic conversations.

5.2. Felt Versus Performed Empathy

When discussing AI and empathy, it is important to acknowledge the ongoing debate over whether empathy is uniquely human, something AI might imitate convincingly but cannot truly “feel” [72]. Yet, in clinical practice, human empathy is often a performed competence, enacted through learned behaviours rather than through genuinely experiencing each patient’s emotions [73]. While many practitioners do exhibit authentic compassion, the continual emotional demands of empathising with patients can naturally lead to compassion fatigue, which is prevalent across healthcare professions [74,75]. Compassion fatigue, a form of emotional and physical exhaustion arising from prolonged exposure to others’ suffering, can diminish empathic presence and reduce the quality of care. A well-designed AI, built on the principles outlined here, could consistently replicate the behaviours of empathy without such constraints. From the user’s perspective, the critical factor is whether they feel understood and supported, irrespective of whether empathy arises from genuine emotion or well-executed, learned behaviours. From this standpoint, AI-driven empathic interactions could closely parallel much of what humans routinely deliver in practice. However, the more effective an AI is at simulating support, the greater the risk of fostering an unhealthy over-dependency that prevents users from seeking professional help. A primary challenge, therefore, is to harness AI’s consistency without undermining the essential contribution of human clinicians.

6. Conclusions

In conclusion, the narrative of an inherent and unbridgeable “AI empathy gap” is a misleading oversimplification. As this paper has argued, the perceived shortfall in empathy between AI and human practitioners is not a fundamental technological limitation but rather a reflection of current design and evaluation constraints. By reframing empathy as a set of observable, replicable behaviours rather than an exclusively human trait, we can approach it as a concrete engineering challenge. Within this scope, our contribution is a critical narrative synthesis and a conceptual framework intended to inform and structure future system implementations. The proposed conceptual framework—integrating retrieval-augmented memory for deep personalisation, feedback-driven loops for dynamic adaptation, and adapter-based modules for stylistic control—provides a roadmap for developing AI systems that can listen, remember, and adapt in ways that make users feel genuinely understood. Empathic AI chatbots offer great potential to enhance healthcare service delivery, improve access to care, and support personalised treatment plans. However, this path comes with serious technical and ethical challenges, chiefly concerns about authenticity, data privacy, algorithmic bias, and accountability, which may undermine trust and the therapeutic relationship [76]. Progress must therefore be careful and responsible, focusing on safety, transparency, and human oversight. Ultimately, the goal is to ensure that effective, empathic support is accessible to everyone, whether delivered by a human or a sufficiently advanced and safe AI. Future research should test each component of the framework, both on its own and in combination, for example, via a factorial study, to measure perceived empathy and benefits (while ensuring safeguards are in place).

Author Contributions

A.H. conceptualized the project, conducted the research, and wrote the manuscript. H.B. provided supervision, feedback, and contributed to the review and editing of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) through the Turing AI World Leading Researcher Fellowship Somabotics: Creatively Embodying Artificial Intelligence (grant number EP/Z534808/1) and by the NIHR HealthTech Research Centre in Rehabilitation. The funders had no role in the design, analysis, interpretation, or decision to submit this review.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Written informed consent for publication of Figure 2 was obtained from the author who appears in the image.

Data Availability Statement

No new data were created or analysed in this study. Data sharing is not applicable to this article.

Acknowledgments

A.H. is a PhD student within the Somabotics programme at the University of Nottingham. H.B. is supported by the NIHR HealthTech Research Centre in Rehabilitation.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
API: Application Programming Interface
CBT: Cognitive Behavioural Therapy
CPU: Central Processing Unit
FAISS: Facebook AI Similarity Search
FDA: United States Food and Drug Administration
GDPR: General Data Protection Regulation
GPT: Generative Pretrained Transformer
LLaMA: Large Language Model Meta AI
LLM: Large Language Model
LoRA: Low-Rank Adaptation
MiniLMv2: MiniLM version 2
NHS: National Health Service
NICE: National Institute for Health and Care Excellence
NLP: Natural Language Processing
RAG: Retrieval-Augmented Generation
RL: Reinforcement Learning
RLHF: Reinforcement Learning from Human Feedback
SDK: Software Development Kit
SFT: Supervised Fine-Tuning
UK: United Kingdom

References

  1. Powell, P.A.; Roberts, J. Situational determinants of cognitive, affective, and compassionate empathy in naturalistic digital interactions. Comput. Hum. Behav. 2017, 68, 137–148. [Google Scholar] [CrossRef]
  2. Decety, J. Empathy in medicine, what it is, and how much we really need it. Am. J. Med. 2020, 133, 561–566. [Google Scholar] [CrossRef]
  3. Derksen, F.; Bensing, J.; Lagro-Janssen, A. Effectiveness of empathy in general practice, a systematic review. Br. J. Gen. Pract. 2013, 63, e76–e84. [Google Scholar] [CrossRef] [PubMed]
  4. Nembhard, I.M.; David, G.; Ezzeddine, I.; Betts, D.; Radin, J. A systematic review of research on empathy in health care. Health Serv. Res. 2023, 58, 250–263. [Google Scholar] [CrossRef]
  5. MacFarlane, P.; Timothy, A.; McClintock, A.S. Empathy from the client’s perspective, a grounded theory analysis. Psychother. Res. 2017, 27, 227–238. [Google Scholar] [CrossRef] [PubMed]
  6. Elliott, R.; Bohart, A.C.; Watson, J.C.; Murphy, D. Therapist empathy and client outcome, an updated meta-analysis. Psychotherapy 2018, 55, 399–410. [Google Scholar] [CrossRef]
  7. Benitez, C.; Southward, M.W.; Altenburger, E.M.; Howard, K.P.; Cheavens, J.S. The within-person effects of validation and invalidation on in-session changes in affect. Personal. Disord. 2019, 10, 406–415. [Google Scholar] [CrossRef]
  8. Moyers, T.B.; Miller, W.R. Is low therapist empathy toxic? Psychol. Addict. Behav. 2013, 27, 878–884. [Google Scholar] [CrossRef] [PubMed]
  9. Brännström, A.; Wester, J.; Nieves, J.C. A formal understanding of computational empathy in interactive agents. Cogn. Syst. Res. 2024, 85, 101203. [Google Scholar] [CrossRef]
  10. Zhu, Q.; Luo, J. Toward artificial empathy for human-centered design. J. Mech. Des. 2023, 146, 061401. [Google Scholar] [CrossRef]
  11. McStay, A. Replika in the Metaverse, the moral problem with empathy in “It from Bit”. AI Ethics 2023, 3, 1433–1445. [Google Scholar] [CrossRef]
  12. Limbic. Limbic for Talking Therapies. 2025. Available online: https://www.limbic.ai/nhs-talking-therapies (accessed on 17 September 2025).
  13. Malgaroli, M.; Schultebraucks, K.; Myrick, K.J.; Andrade Loch, A.; Ospina-Pinillos, L.; Choudhury, T.; Kotov, R.; De Choudhury, M.; Torous, J. Large language models for the mental health community, framework for translating code to care. Lancet Digit. Health 2025, 7, e282–e285. [Google Scholar] [CrossRef]
  14. Howcroft, A.; Hopkins, G. Exploring Mars, an immersive survival game for planetary education. In Proceedings of the 18th European Conference on Games Based Learning, Aarhus, Denmark, 3–4 October 2024; Academic Conferences International Limited: Reading, UK, 2024; pp. 1135–1144. [Google Scholar] [CrossRef]
  15. Chin, H.; Song, H.; Baek, G.; Shin, M.; Jung, C.; Cha, M.; Choi, J.; Cha, C. The potential of chatbots for emotional support and promoting mental well-being in different cultures, mixed methods study. J. Med. Internet Res. 2023, 25, e51712. [Google Scholar] [CrossRef] [PubMed]
  16. Zhou, L.; Gao, J.; Li, D.; Shum, H.Y. The design and implementation of XiaoIce, an empathetic social chatbot. Comput. Linguist. 2020, 46, 53–93. [Google Scholar] [CrossRef]
  17. The Topol Review. Preparing the Healthcare Workforce to Deliver the Digital Future; Health Education England: Leeds, UK, 2019. Available online: https://topol.digitalacademy.nhs.uk/the-topol-review (accessed on 2 December 2025).
  18. Reynolds, K.; Nadelman, D.; Durgin, J.; Ansah-Addo, S.; Cole, D.; Fayne, R.; Harrell, J.; Ratycz, M.; Runge, M.; Shepard-Hayes, A.; et al. Comparing the quality of ChatGPT- and physician-generated responses to patients’ dermatology questions in the electronic medical record. Clin. Exp. Dermatol. 2024, 49, 715–718. [Google Scholar] [CrossRef] [PubMed]
  19. Li, B.; Wang, A.; Strachan, P.; Séguin, J.A.; Lachgar, S.; Schroeder, K.C.; Fleck, M.S.; Wong, R.; Karthikesalingam, A.; Natarajan, V.; et al. Conversational AI in health, design considerations from a Wizard-of-Oz dermatology case study with users, clinicians and a medical LLM. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems; Association for Computing Machinery: Honolulu, HI, USA, 2024; p. 88. [Google Scholar] [CrossRef]
  20. Liu, T.; Giorgi, S.; Aich, A.; Lahnala, A.; Curtis, B.; Ungar, L.; Sedoc, J. The illusion of empathy, how AI chatbots shape conversation perception. Proc. AAAI Conf. Artif. Intell. 2025, 39, 14327–14336. [Google Scholar] [CrossRef]
  21. Scholich, T.; Barr, M.; Wiltsey Stirman, S.; Raj, S. A comparison of responses from human therapists and large language model–based chatbots to assess therapeutic communication, mixed methods study. JMIR Ment. Health 2025, 12, e69709. [Google Scholar] [CrossRef]
  22. Zohny, H. Reframing “dehumanisation”, AI and the reality of clinical communication. J. Med. Ethics 2025. Online ahead of print. [Google Scholar] [CrossRef]
  23. Kurian, N. AI’s empathy gap, the risks of conversational artificial intelligence for young children’s well-being and key ethical considerations for early childhood education and care. Contemp. Issues Early Child. 2023, 26, 132–139. [Google Scholar] [CrossRef]
  24. Howcroft, A.; Bennett Weston, A.; Khan, A.; Griffiths, J.; Gay, S.; Howick, J. AI chatbots versus human healthcare professionals, a systematic review and meta-analysis of empathy in patient care. Br. Med. Bull. 2025, 156, ldaf017. [Google Scholar] [CrossRef]
  25. Schlegel, K.; Sommer, N.R.; Mortillaro, M. Large language models are proficient in solving and creating emotional intelligence tests. Commun. Psychol. 2025, 3, 80. [Google Scholar] [CrossRef]
  26. Schuller, D.; Schuller, B.W. The age of artificial emotional intelligence. Computer 2018, 51, 38–46. [Google Scholar] [CrossRef]
  27. Sukhera, J. Narrative reviews, flexible, rigorous, and practical. J. Grad. Med. Educ. 2022, 14, 414–417. [Google Scholar] [CrossRef]
  28. Coheur, L. From Eliza to Siri and Beyond. In Proceedings of the Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 2020), Proceedings Part I, Lisbon, Portugal, 15–19 June 2020; Lecture Notes in Computer Science, LNCS 1237. pp. 29–41, ISBN 978-3-030-50145-7. [Google Scholar] [CrossRef]
  29. Fitzpatrick, K.K.; Darcy, A.; Vierhile, M. Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot), a randomized controlled trial. JMIR Ment. Health 2017, 4, e19. [Google Scholar] [CrossRef]
  30. Maheu, M.M. AI Psychotherapy Shutdown, What Woebot’s Exit Signals For Clinicians. Telehealth.org, 19 August 2025. Available online: https://telehealth.org/blog/ai-psychotherapy-shutdown-what-woebots-exit-signals-for-clinicians/ (accessed on 17 October 2025).
  31. NICE. Digital Front Door Technologies to Gather Service User Information for NHS Talking Therapies for Anxiety and Depression Assessments, Early Value Assessment (HTE30). 24 July 2025. Available online: https://www.nice.org.uk/guidance/hte30 (accessed on 17 October 2025).
  32. Beatty, C.; Malik, T.; Meheli, S.; Sinha, C. Evaluating the therapeutic alliance with a free-text CBT conversational agent (Wysa), a mixed-methods study. Front. Digit. Health 2022, 4, 847991. [Google Scholar] [CrossRef]
  33. Wysa. FAQs. 2025. Available online: https://www.wysa.com/faq (accessed on 17 October 2025).
  34. Phang, J.; Lampe, M.; Ahmad, L.; Agarwal, S.; Fang, C.M.; Liu, A.R.; Danry, V.; Lee, E.; Chan, S.W.; Pataranutaporn, P. Investigating affective use and emotional well-being on ChatGPT. arXiv 2025, arXiv:2504.03888. [Google Scholar] [CrossRef]
  35. Anisuzzaman, D.M.; Malins, J.G.; Friedman, P.A.; Attia, Z.I. Fine-tuning large language models for specialized use cases. Mayo Clin. Proc. Digit. Health 2025, 3, 100184. [Google Scholar] [CrossRef]
  36. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  37. Small, W.R.; Wiesenfeld, B.; Brandfield-Harvey, B.; Jonassen, Z.; Mandal, S.; Stevens, E.R.; Major, V.J.; Lostraglio, E.; Szerencsy, A.; Jones, S.; et al. Large Language Model-Based Responses to Patients’ In-Basket Messages. JAMA Netw. Open 2024, 7, e2422399. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  38. Tanco, K.; Rhondali, W.; Perez-Cruz, P.; Tanzi, S.; Chisholm, G.B.; Baile, W.; Frisbee-Hume, S.; Williams, J.; Masino, C.; Cantu, H.; et al. Patient perception of physician compassion after a more optimistic vs a less optimistic message, a randomized clinical trial. JAMA Oncol. 2015, 1, 176–183. [Google Scholar] [CrossRef]
  39. Ayers, J.W.; Poliak, A.; Dredze, M.; Leas, E.C.; Zhu, Z.; Kelley, J.B.; Faix, D.J.; Goodman, A.M.; Longhurst, C.A.; Hogarth, M.; et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern. Med. 2023, 183, 589–596. [Google Scholar] [CrossRef]
  40. Hogan, T.; Luger, T.; Volkman, J.; Rocheleau, M.; Mueller, N.; Barker, A.; Nazi, K.; Houston, T.; Bokhour, B. Patient centeredness in electronic communication, evaluation of patient-to-health care team secure messaging. J. Med. Internet Res. 2018, 20, e82. [Google Scholar] [CrossRef]
  41. Phillips, K.A.; Ospina, N.S.; Montori, V.M. Physicians interrupting patients. J. Gen. Intern. Med. 2019, 34, 1965. [Google Scholar] [CrossRef]
  42. Knaak, S.; Mantler, E.; Szeto, A. Mental illness-related stigma in healthcare, barriers to access and care and evidence-based solutions. Healthc. Manag. Forum 2017, 30, 111–116. [Google Scholar] [CrossRef]
  43. Papneja, H.; Yadav, N. Self-disclosure to conversational AI, a literature review, emergent framework, and directions for future research. Pers. Ubiquitous Comput. 2025, 29, 119–151. [Google Scholar] [CrossRef]
  44. Schnepper, R.; Roemmel, N.; Schaefert, R.; Lambrecht-Walzinger, L.; Meinlschmidt, G. Exploring biases of large language models in the field of mental health, comparative questionnaire study of the effect of gender and sexual orientation in anorexia nervosa and bulimia nervosa case vignettes. JMIR Ment. Health 2025, 12, e57986. [Google Scholar] [CrossRef]
  45. Blavette, L.; Dacunha, S.; Alameda-Pineda, X.; Hernández García, D.; Gannot, S.; Gras, F.; Gunson, N.; Lemaignan, S.; Polic, M.; Tandeitnik, P.; et al. Acceptability and usability of a socially assistive robot integrated with a large language model for enhanced human-robot interaction in a geriatric care institution, mixed methods evaluation. JMIR Hum. Factors 2025, 12, e76496. [Google Scholar] [CrossRef]
  46. Laban, G.; Morrison, V.; Kappas, A.; Cross, E.S. Coping with emotional distress via self-disclosure to robots, an intervention with caregivers. Int. J. Soc. Robot. 2025, 17, 1837–1870. [Google Scholar] [CrossRef]
  47. Haque, M.D.R.; Rubya, S. An overview of chatbot-based mobile mental health apps, insights from app description and user reviews. JMIR mHealth uHealth 2023, 11, e44838. [Google Scholar] [CrossRef]
  48. Webb, E. Sam Altman Says GPT-5’s “Personality” Will Get a Revamp, But It Will Not Be as “Annoying” as GPT-4o. Business Insider. 2025. Available online: https://www.businessinsider.com/sam-altman-openai-gpt5-personality-update-gpt4o-return-backlash-2025-8 (accessed on 16 October 2025).
  49. Hornstein, S.; Zantvoort, K.; Lueken, U.; Funk, B.; Hilbert, K. Personalization strategies in digital mental health interventions, a systematic review and conceptual framework for depressive symptoms. Front. Digit. Health 2023, 5, 1170002. [Google Scholar] [CrossRef]
  50. Anthropic. Claude 3.5 Sonnet. 2024. Available online: https://www.anthropic.com/news/claude-3-5-sonnet (accessed on 15 September 2025).
  51. OpenAI. What Is Memory? 2025. Available online: https://help.openai.com/en/articles/8983136-what-is-memory (accessed on 15 September 2025).
  52. Kang, D.; Kim, S.; Kwon, T.; Moon, S.; Cho, H.; Yu, Y.; Lee, D.; Yeo, J. Can large language models be good emotional supporter? Mitigating preference bias on emotional support conversation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; pp. 15232–15261. [Google Scholar] [CrossRef]
  53. Koh, J.; Tng, G.Y.Q.; Hartanto, A. Potential and pitfalls of mobile mental health apps in traditional treatment, an umbrella review. J. Pers. Med. 2022, 12, 1376. [Google Scholar] [CrossRef]
  54. Hojat, M.; Louis, D.Z.; Maxwell, K.; Markham, F.; Wender, R.; Gonnella, J.S. Patient perceptions of physician empathy, satisfaction with physician, interpersonal trust, and compliance. Int. J. Med. Educ. 2010, 1, 83–87. [Google Scholar] [CrossRef]
  55. Biringer, E.; Hartveit, M.; Sundfør, B.; Ruud, T.; Borg, M. Continuity of care as experienced by mental health service users: a qualitative study. BMC Health Serv. Res. 2017, 17, 763. [Google Scholar] [CrossRef]
  56. Wang, W.; Bao, H.; Huang, S.; Dong, L.; Wei, F. MiniLMv2: multi-head self-attention relation distillation for compressing pretrained transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 2140–2151. [Google Scholar] [CrossRef]
  57. Öztürk, E.; Mesut, A. Performance Analysis of Chroma, Qdrant, and Faiss Databases; Technical University of Gabrovo: Gabrovo, Bulgaria, 2024; Available online: https://unitechsp.tugab.bg/images/2024/4-CST/s4_p72_v3.pdf (accessed on 17 October 2025).
  58. Pan, J.J.; Wang, J.; Li, G. Survey of vector database management systems. VLDB J. 2024, 33, 1591–1615. [Google Scholar] [CrossRef]
  59. Leisten, L.M.; Findling, F.; Bellinghausen, J.; Kinateder, M.; Probst, T.; Lion, D.; Shiban, Y. The effect of non-lexical verbal signals on the perceived authenticity, empathy and understanding of a listener. Eur. J. Couns. Psychol. 2021, 10, 1–7. [Google Scholar] [CrossRef]
  60. Rizzo, A.; Scherer, S.; DeVault, D.; Gratch, J.; Artstein, R.; Hartholt, A.; Lucas, G.; Marsella, S.; Morbini, F.; Nazarian, A.; et al. Detection and computational analysis of psychological signals using a virtual human interviewing agent. J. Pain Manag. 2016, 9, 311–321. [Google Scholar]
  61. Baltrusaitis, T.; Zadeh, A.; Lim, Y.C.; Morency, L.-P. OpenFace 2.0: Facial Behavior Analysis Toolkit. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018. [Google Scholar] [CrossRef]
  62. BLUESKEYE AI. Health and Well-Being. 2025. Available online: https://www.blueskeye.com/health-well-being (accessed on 15 September 2025).
  63. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations, Online, 25–29 April 2022. [Google Scholar]
  64. Layla Network. Meet Layla. 2025. Available online: https://www.layla-network.ai (accessed on 3 October 2025).
  65. Geroimenko, V. Generative AI hallucinations in healthcare: a challenge for prompt engineering and creativity. In Human-Computer Creativity: Generative AI in Education, Art, and Healthcare; Geroimenko, V., Ed.; Springer Nature: Cham, Switzerland, 2025; pp. 321–335. [Google Scholar] [CrossRef]
  66. Gerken, T. Update That Made ChatGPT “Dangerously” Sycophantic Pulled. BBC News. 2025. Available online: https://www.bbc.co.uk/news/articles/cn4jnwdvg9qo (accessed on 16 October 2025).
  67. Østergaard, S.D. Will generative artificial intelligence chatbots generate delusions in individuals prone to psychosis? Schizophr. Bull. 2023, 49, 1418–1419. [Google Scholar] [CrossRef]
  68. Franzen, C. Ex-OpenAI CEO and Power Users Sound Alarm over AI Sycophancy and Flattery of Users. VentureBeat. 2025. Available online: https://venturebeat.com/ai/ex-openai-ceo-and-power-users-sound-alarm-over-ai-sycophancy-and-flattery-of-users (accessed on 16 October 2025).
  69. Purtill, J. Replika Users Fell in Love with Their AI Chatbot Companions. Then They Lost Them. ABC News. 2023. Available online: https://www.abc.net.au/news/science/2023-03-01/replika-users-fell-in-love-with-their-ai-chatbot-companion/102028196 (accessed on 14 October 2025).
  70. Brooks Holliday, S.; Matthews, S.; Bialas, A.; McBain, R.K.; Cantor, J.H.; Eberhart, N.; Breslau, J. A qualitative investigation of preparedness for the launch of 988: implications for the continuum of emergency mental health care. Adm. Policy Ment. Health Ment. Health Serv. Res. 2023, 50, 616–629. [Google Scholar] [CrossRef] [PubMed]
  71. Catalao, R.; Broadbent, M.; Ashworth, M.; Das-Munshi, J.; Schofield, L.H.; Hotopf, M.; Dorrington, S. Access to psychological therapies amongst patients with a mental health diagnosis in primary care: a data linkage study. Soc. Psychiatry Psychiatr. Epidemiol. 2024, 60, 2149–2161. [Google Scholar] [CrossRef] [PubMed]
  72. Montemayor, C.; Halpern, J.; Fairweather, A. In principle obstacles for empathic AI: why we cannot replace human empathy in healthcare. AI Soc. 2022, 37, 1353–1359. [Google Scholar] [CrossRef]
  73. Larson, E.B.; Yao, X. Clinical empathy as emotional labor in the patient–physician relationship. JAMA 2005, 293, 1100–1106. [Google Scholar] [CrossRef]
  74. Garnett, A.; Hui, L.; Oleynikov, C.; Boamah, S. Compassion fatigue in healthcare providers: a scoping review. BMC Health Serv. Res. 2023, 23, 1336. [Google Scholar] [CrossRef]
  75. Noor, A.M.; Suryana, D.; Kamarudin, E.M.E.; Naidu, N.B.M.; Kamsani, S.R.; Govindasamy, P. Compassion fatigue in helping professions: a scoping literature review. BMC Psychol. 2025, 13, 349. [Google Scholar] [CrossRef]
  76. American Psychological Association. Ethical Guidance for AI in the Professional Practice of Health Service Psychology. 2025. Available online: https://www.apa.org/topics/artificial-intelligence-machine-learning/ethical-guidance-ai-professional-practice (accessed on 17 September 2025).
Figure 1. Pipeline of training and aligning a language model: from pretraining on large text corpora, to supervised fine-tuning on labelled dialogues, to RLHF for helpful, safe, and empathic behaviour.
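To make the three stages in Figure 1 concrete, the minimal sketch below marks them out as placeholder functions in Python; every class, function name, and data value is a hypothetical stand-in used to show the order and inputs of each stage, not the training code of any particular model.

```python
from dataclasses import dataclass, field

# Schematic of the three alignment stages in Figure 1. All names and data
# here are illustrative placeholders, not a real training loop.

@dataclass
class Model:
    stage: str = "random-init"
    history: list[str] = field(default_factory=list)

def pretrain(model: Model, corpus: list[str]) -> Model:
    # Stage 1: next-token prediction over large unlabelled text corpora.
    model.history.append(f"pretrained on {len(corpus)} documents")
    model.stage = "base"
    return model

def supervised_finetune(model: Model, dialogues: list[tuple[str, str]]) -> Model:
    # Stage 2: supervised fine-tuning on labelled (prompt, ideal reply) pairs,
    # e.g. counsellor-style responses annotated for empathic wording.
    model.history.append(f"fine-tuned on {len(dialogues)} labelled dialogues")
    model.stage = "sft"
    return model

def rlhf_align(model: Model, preferences: list[tuple[str, str, int]]) -> Model:
    # Stage 3: a reward model learned from human preference judgements
    # (reply_a, reply_b, index of the preferred reply) steers the policy
    # towards helpful, safe, and empathic behaviour.
    model.history.append(f"aligned with {len(preferences)} preference judgements")
    model.stage = "rlhf-aligned"
    return model

model = pretrain(Model(), corpus=["web page text", "book chapter text"])
model = supervised_finetune(model, dialogues=[
    ("I feel low today.", "That sounds really hard. Do you want to talk about what happened?"),
])
model = rlhf_align(model, preferences=[("blunt reply", "warm, validating reply", 1)])
print(model.stage, model.history)
```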
Figure 2. Real-time facial affect recognition (via BLUESKEYE AI) running locally on a MacBook. The system tracks facial landmarks (left) to calculate Action Units (bottom-left), which are then mapped onto a Valence-Arousal circumplex (right) to classify the user’s emotional state (currently detecting “Sad”).
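The mapping illustrated in Figure 2 can be approximated in a few lines: Action Unit (AU) intensities are projected onto valence and arousal, and the resulting quadrant receives a coarse label. The AU selections and weights below are rough assumptions chosen purely for demonstration and do not reflect the calibrated models of BLUESKEYE AI or any other product.

```python
# Illustrative mapping from FACS Action Unit intensities (as estimated by
# facial analysis toolkits) onto the valence-arousal circumplex in Figure 2.
# Weights are assumptions for demonstration only.
VALENCE_WEIGHTS = {"AU12": +0.6, "AU06": +0.3, "AU04": -0.4, "AU15": -0.5}
AROUSAL_WEIGHTS = {"AU05": +0.5, "AU26": +0.3, "AU15": -0.3}

def to_circumplex(aus: dict[str, float]) -> tuple[float, float]:
    """Project AU intensities onto (valence, arousal), clipped to [-1, 1]."""
    valence = sum(w * aus.get(au, 0.0) for au, w in VALENCE_WEIGHTS.items())
    arousal = sum(w * aus.get(au, 0.0) for au, w in AROUSAL_WEIGHTS.items())
    return max(-1.0, min(1.0, valence)), max(-1.0, min(1.0, arousal))

def label_quadrant(valence: float, arousal: float) -> str:
    """Coarse emotion label from the circumplex quadrant."""
    if valence >= 0:
        return "Happy/Excited" if arousal >= 0 else "Calm/Content"
    return "Angry/Stressed" if arousal >= 0 else "Sad"

# Example frame: lowered brows (AU04) and lip-corner depressor (AU15).
v, a = to_circumplex({"AU04": 2.0, "AU15": 1.5})
print(label_quadrant(v, a))  # -> "Sad" (negative valence, low arousal)
```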
Figure 3. Conceptual Model of Adaptive, Multimodal Chatbot System.
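As a sketch of how the components in Figure 3 might be orchestrated, the snippet below folds a detected affect label, a retrieved long-term memory, and a per-user style preference into a single prompt. Both `build_prompt` and `generate_reply` are hypothetical placeholders for whichever cloud or on-device model a real deployment would call.

```python
# Hypothetical orchestration of the adaptive, multimodal system in Figure 3.
def build_prompt(user_text: str, affect: str, memory: str, style: str) -> str:
    return (
        f"The user currently appears {affect}. Reply in a {style} tone.\n"
        f"Relevant fact from earlier sessions: {memory}\n"
        f"User: {user_text}\nAssistant:"
    )

def generate_reply(prompt: str) -> str:
    # Placeholder model call; a real system would invoke an LLM here.
    return "[model reply conditioned on]\n" + prompt

print(generate_reply(build_prompt(
    user_text="I freak out and panic I'm gonna get fired.",
    affect="sad and anxious",                                    # from affect sensing (Figure 2)
    memory="User received a 'Top Performer' bonus last month.",  # from the RAG memory store (Table 2)
    style="warm and concise",                                    # from a per-user adapter/profile
)))
```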
Table 1. Comparison of rule-based chatbots, typical generative AI chatbots, and idealised human practitioners across key dimensions relevant to perceived empathy, flexibility, memory, safety, and overall consistency.
| Dimension | Rule-Based AI Chatbot | Typical Generative AI Chatbot (LLMs) | Human Practitioner (“Idealised” Standard) |
|---|---|---|---|
| Determinism/Coherence | Deterministic, highly predictable replies | Coherent and natural-sounding; capable of nuanced dialogue | Adapts how they speak based on the person and situation |
| Perceived empathy | Very generic, scripted empathy | Often perceived as empathic, but can feel “hollow” or superficial | Deep, context-sensitive empathy; can read non-verbal cues |
| Availability | Continuous 24/7 access, barring server outages or maintenance | Continuous 24/7 access, barring server outages or maintenance | Scheduled through dedicated appointments, though availability is often constrained by waiting lists and high workloads, potentially resulting in brief or brusque replies |
| Scope/Flexibility | Inflexible responses; works best in narrow, pre-defined scenarios | Detailed replies; broad conversational scope across many topics | High flexibility; can work across complex, messy, and ambiguous real-world problems |
| Memory/Continuity | Little or no memory of past conversations | Memory within a session; typically forgets across sessions | Can remember personal history and build rapport over time |
| Safety/Reliability | Safer because behaviour is tightly constrained and predictable | Risk of errors and hallucinations; needs guardrails | Professionally trained and regulated, but still subject to time pressure and errors |
| Judgement/Bias | Low sense of “judgement”, though design choices can add bias | Reduced overt judgement, but may reflect training-data biases | Strives for a non-judgemental stance, though susceptible to passing judgement (even if unconscious), potentially leading to less empathy and dismissive attitudes |
| Privacy/Deployment | Often local or on tightly controlled systems | Usually cloud-based; raises privacy and data-sharing concerns; on-device (edge) options are emerging | Bound by strict professional and legal duties of confidentiality |
| Overall Quality/Consistency | Consistent but rigid; quality is steady but extremely limited | Quality varies with prompts, model, and safeguards; can sometimes exceed humans in rated perceived empathy | Variable between practitioners and moment-to-moment (fatigue, workload, emotional state, and emotion-type differences; e.g., easier to empathise with grief than anger) |
Table 2. Comparison of a standard generative AI response versus the proposed Retrieval-Augmented Generation (RAG) framework applied to a workplace anxiety vignette.
| | Standard Generative AI | Retrieval-Augmented Generation |
|---|---|---|
| User Input | “I freak out and panic I’m gonna get fired [when scores aren’t the greatest].” | “I freak out and panic I’m gonna get fired [when scores aren’t the greatest].” |
| Internal Logic | Detects topic: “Job Security/Anxiety”. Generates generic supportive advice. | Retrieves Memory: “User received ‘Top Performer’ bonus last month”. |
| Response | “It’s understandable to feel anxious when metrics seem tied to your security, and many people notice their worry spike when scores dip. It can help to look at longer-term trends instead of any single week and remind yourself that consistency tends to matter more than brief fluctuations. Grounding techniques or a short reset break can also help steady things when the panic starts to rise.” | “Let’s pause and look at what’s real here. You earned a top-tier bonus just last month, which is strong evidence that your company recognizes your performance and values what you bring. With that in mind, does this week reflect an actual issue the company is concerned about, or is it more likely a normal dip that feels bigger because you hold yourself to a high standard and care about doing well?” |
| Potential User Impact | Generic: User feels patronised with basic advice. | Personal: User feels validated by their own track record. |
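The retrieval step behind the right-hand column of Table 2 can be sketched as follows. A production system would use neural sentence embeddings (e.g., a MiniLM-class encoder) with a vector database rather than the word-overlap similarity used here, and the memory store contents are invented for illustration. Retrieved facts are prepended to the prompt so the model can ground its reassurance in the user's own history rather than in generic advice.

```python
import math
import re
from collections import Counter

# Toy retrieval step for the RAG column of Table 2. Bag-of-words cosine
# similarity stands in for a neural embedding model; memories are invented.
MEMORY_STORE = [
    "Received a 'Top Performer' bonus last month for strong performance scores.",
    "Sister's wedding is coming up in June.",
    "Finds breathing exercises unhelpful but responds well to reframing.",
]

def embed(text: str) -> Counter:
    # Word-count vector as a stand-in for a sentence-embedding model.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    return sorted(MEMORY_STORE, key=lambda m: cosine(q, embed(m)), reverse=True)[:k]

user_input = "I freak out and panic I'm gonna get fired when my performance scores aren't the greatest."
memories = retrieve(user_input)
prompt = (
    "Known facts about this user:\n- " + "\n- ".join(memories)
    + f"\n\nUser: {user_input}\n"
    + "Respond empathically, grounding any reassurance in the facts above."
)
print(prompt)  # The bonus memory is retrieved and injected into the prompt.
```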
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.