Article

Development of a Comprehensive Evaluation Scale for LLM-Powered Counseling Chatbots (CES-LCC) Using the eDelphi Method

by Marco Bolpagni 1,2,* and Silvia Gabrielli 2

1 Department of General Psychology, University of Padova, 35121 Padova, Italy
2 Digital Health Research, Centre for Digital Health and Wellbeing, Fondazione Bruno Kessler, 38123 Trento, Italy
* Author to whom correspondence should be addressed.
Informatics 2025, 12(1), 33; https://doi.org/10.3390/informatics12010033
Submission received: 5 February 2025 / Revised: 9 March 2025 / Accepted: 12 March 2025 / Published: 20 March 2025
(This article belongs to the Section Human-Computer Interaction)

Abstract
Background/Objectives: With advancements in Large Language Models (LLMs), counseling chatbots are becoming essential tools for delivering scalable and accessible mental health support. Traditional evaluation scales, however, fail to adequately capture the sophisticated capabilities of these systems, such as personalized interactions, empathetic responses, and memory retention. This study aims to design a robust and comprehensive evaluation scale, the Comprehensive Evaluation Scale for LLM-Powered Counseling Chatbots (CES-LCC), using the eDelphi method to address this gap. Methods: A panel of 16 experts in psychology, artificial intelligence, human-computer interaction, and digital therapeutics participated in two iterative eDelphi rounds. The process focused on refining dimensions and items based on qualitative and quantitative feedback. Initial validation, conducted after assembling the final version of the scale, involved 49 participants using the CES-LCC to evaluate an LLM-powered chatbot delivering Self-Help Plus (SH+), an Acceptance and Commitment Therapy-based intervention for stress management. Results: The final version of the CES-LCC features 27 items grouped into nine dimensions: Understanding Requests, Providing Helpful Information, Clarity and Relevance of Responses, Language Quality, Trust, Emotional Support, Guidance and Direction, Memory, and Overall Satisfaction. Initial real-world validation revealed high internal consistency (Cronbach’s alpha = 0.94), although minor adjustments are required for specific dimensions, such as Clarity and Relevance of Responses. Conclusions: The CES-LCC fills a critical gap in the evaluation of LLM-powered counseling chatbots, offering a standardized tool for assessing their multifaceted capabilities. While preliminary results are promising, further research is needed to validate the scale across diverse populations and settings.

1. Introduction

1.1. Background

Counseling chatbots are conversational agents designed to provide mental health support, guidance, and therapeutic conversations [1]. These chatbots simulate human-like interactions to help individuals manage their emotional well-being, particularly when immediate access to human counselors is unavailable. By offering scalable and accessible mental health services, counseling chatbots address key barriers such as cost and geographical limitations, making them an appealing solution for both users and healthcare systems [2]. Their adoption has accelerated with increasing mental health awareness and the growing demand for scalable solutions, a trend further amplified by the COVID-19 pandemic [3]. In many regions, shortages of mental health professionals and long wait times for therapy [4] further highlight the need for alternative support systems. Counseling chatbots can bridge this gap by offering immediate, on-demand assistance. As mental health issues become more pressing globally, the role of these technologies is expected to expand in the coming years.

Traditionally, counseling chatbots relied on rule-based systems and Natural Language Processing (NLP) techniques to interpret and respond to user inputs [5]. These systems often used decision trees and pattern matching to generate responses, which limited their ability to handle complex or nuanced conversations. Recent advancements in Large Language Models (LLMs), such as GPT-3 and GPT-4, represent a significant step forward. LLMs leverage deep learning techniques, particularly transformer architectures [6], to process and generate text with a much deeper understanding of context and semantics. Unlike traditional NLP systems, LLMs are trained on vast amounts of data, which enables them to generate more coherent, personalized, and contextually aware responses. Consequently, where implemented, LLMs have vastly improved the quality of generated text, the level of personalization, and the contextual awareness of chatbot interactions, allowing for more natural and emotionally resonant conversations [7].

As counseling chatbots become more sophisticated through the introduction of LLMs, so must the methods used to evaluate them. Traditional evaluation scales, centered on user satisfaction and related concepts, remain valuable but may no longer capture the full spectrum of these systems' capabilities, including the unique strengths and challenges of modern LLM-powered counseling chatbots. For example, while traditional scales might assess whether a chatbot provides relevant answers, they often overlook aspects such as the chatbot's ability to express empathy, remember previous interactions, or maintain a natural and engaging conversation over time. The integration of LLMs introduces new aspects, such as enhanced empathy, seamless conversational flow, and contextually appropriate emotional responses, which require more nuanced and multifaceted evaluation tools.

1.2. Related Works

The evaluation of mental health chatbots is a complex task that involves assessing multiple aspects of their performance and impact on users. Effective mental health chatbots must engage users, provide a positive user experience, be easy to use, offer helpful and empathetic support, foster trust and alliance, and demonstrate strong technical performance in terms of language quality [8]. As such, researchers and developers have identified several key aspects that are crucial to the success of these systems. Engagement [9] and user experience [10], for example, can influence users’ motivation to continue using the chatbot and their overall satisfaction with the system. Usability [11] is critical to ensuring that users can effectively interact with the chatbot and access the support they need. Perceived helpfulness, empathy, trust, support, and alliance are all essential components of a therapeutic relationship and are critical to establishing a sense of rapport and connection between the user and the chatbot [8]. Technical performance and language quality, meanwhile, are fundamental to ensure that the chatbot can provide accurate and informative responses to users’ queries.
To evaluate these dimensions, researchers have employed a range of scales and metrics. Engagement, for instance, has been evaluated [12,13] using the User Engagement Scale (UES) [14], which provides insights into users' emotional and cognitive investment in interacting with the chatbot. User experience, in turn, has been assessed [15,16,17] through the User Experience Questionnaire (UEQ) [18], which captures users' subjective experience of the chatbot in terms of attractiveness, perspicuity, efficiency, dependability, stimulation, and novelty. To evaluate usability, researchers have utilized [19,20,21,22] several scales, including the System Usability Scale (SUS) [23], the Chatbot Usability Questionnaire (CUQ) [24], and the Bot Usability Scale (BUS) [25]. These scales provide a comprehensive understanding of users' perceptions of the chatbot's ease of use. Perceived helpfulness, which refers to users' beliefs about the chatbot's ability to provide effective support, has been evaluated [26,27] using frameworks such as the Unified Theory of Acceptance and Use of Technology (UTAUT) [28] and Perceived Usefulness and Ease of Use (PEOU) [29]. These frameworks help researchers understand the factors that influence users' intentions to use mental health chatbots. Empathy, a critical component of human–computer interaction in mental health contexts, has been assessed [30,31] using scales such as the Perceived Empathy of Technology Scale (PETS) [32] and the Empathy Scale for Human–Computer Communication (ESHCC) [33]. These scales capture users' perceptions of the chatbot's ability to understand and respond to their emotional needs. In addition to empathy, perceived trust, support, and alliance are essential aspects of mental health chatbots. The Virtual Therapist Alliance Scale (VTAS) [34] has been used [35] to evaluate these dimensions, providing insights into users' perceptions of the chatbot as a supportive and trustworthy therapeutic agent. Finally, technical performance and language quality are often evaluated [36,37,38,39] using automated metrics such as perplexity, BLEU [40], and ROUGE [41]. These metrics provide quantitative insights into the chatbot's ability to generate coherent, contextually appropriate, and grammatically accurate responses.
Despite the availability of these tools, many studies still rely on custom evaluation grids tailored to their research needs [42,43], introducing variability across studies. These grids are often designed to cover all relevant aspects in a compact form, since administering a full battery of standardized scales would be too lengthy and impractical for many studies. This lack of standardization, however, can hinder comparisons between studies and limit the generalizability of findings. As LLM-powered counseling chatbots become increasingly sophisticated, an integrated evaluation approach is essential to ensure that all relevant aspects are adequately assessed while keeping the length of the scale manageable.

1.3. Aim

To address the limitations of current evaluation methods and provide a compact, comprehensive tool tailored to the unique demands of LLM-powered counseling chatbots, we aim to develop a novel scale, the Comprehensive Evaluation Scale for LLM-Powered Counseling Chatbots (CES-LCC), using the eDelphi method [44]. This approach is particularly well suited to emerging fields like AI-driven counseling, where expert knowledge is still evolving and consensus on best practices has not yet been fully established.

2. Materials and Methods

2.1. Participants

Following the Delphi methodology [45], this study employed purposive sampling to assemble a panel of experts in counseling psychology, Artificial Intelligence (AI), and Human-Computer Interaction (HCI). The Delphi method does not mandate a statistically representative sample, thereby affording flexibility in panel size. Conventionally, panel sizes range from 15 to 30 participants [46,47]. Given this flexibility, invitations were extended to 22 experts, selected for their professional experience with counseling technologies, LLM-powered systems, and chatbot development. Of these, 16 experts agreed to participate. Eligibility criteria included a minimum of three years of professional experience, relevant academic publications or industry contributions, and familiarity with LLM technologies. Recruitment took place in a single round between 26 June 2024 and 4 July 2024. Invitations were distributed via email, outlining the study’s objectives and the participants’ role in defining and refining the evaluation scale. Participation was voluntary, with an estimated time commitment of 30 min per round. Experts were given a two-week time window to complete the first-round survey, with a reminder sent at the halfway point to encourage response. Following the completion of the first round, a second round of the eDelphi process was conducted to refine and consolidate the initial feedback. Participants were presented with a summary of the first-round results, including aggregated ratings and qualitative comments, and were invited to reassess their responses based on the group’s collective insights. The second-round survey was distributed on 20 July 2024, with a two-week response window and similar reminders to encourage participation.

2.2. Procedure

For our eDelphi study, we followed the four steps proposed by [48], which consist of (1) a preparatory phase, (2) eDelphi rounds, (3) data processing and analysis, and (4) conclusion and reporting. The procedure for the eDelphi study is visually summarized in Figure 1, which provides an overview of the steps and corresponding activities carried out at each stage.

2.2.1. Preparatory Phase

The development of CES-LCC began with the identification of key dimensions critical to the assessment of LLM-powered counseling chatbots. Given that LLM-powered counseling chatbots represent a relatively new and rapidly evolving field, we adopted a targeted approach to the literature. Instead of conducting a comprehensive review across a broad range of sources, we focused on key publications and resources that specifically address chatbot evaluation in the context of mental health [42,43,49,50] and LLM technologies [51,52,53,54] as well as on the scales described in Section 1.2. This targeted analysis allowed us to concentrate on the most relevant aspects to be evaluated, leading to the identification of recurring themes reflecting the technical, emotional, and linguistic dimensions relevant to the evaluation of LLM-powered counseling chatbots.
The initial pool of items was generated by M.B. and S.G., who have interdisciplinary expertise in AI, digital health system design, and psychology. The dimensions identified for the first draft of the scale included: Understanding my Requests, Providing Helpful Information, Clarity and Relevance of Responses, Ease of Use and Interaction, Language Quality, Trust, Emotional Support, Guidance and Direction, and Overall Satisfaction. Based on these dimensions, a total of 18 items (see Appendix A.1) were generated, with 2 items allocated to each category. We deliberately chose to limit the number of items per dimension to 2 to avoid guiding the experts too heavily and to allow for more open-ended feedback during the Delphi process.

2.2.2. eDelphi Rounds

The eDelphi process was structured into two iterative rounds aimed at refining and validating the scale. Both rounds were administered electronically via the Qualtrics XM platform [55].

First Round

During the initial round, experts were provided with the preliminary version of the scale comprising the eighteen items developed in the preparatory phase. Participants assessed each item’s relevance using a five-point Likert scale ranging from “Not relevant at all” to “Very relevant” and its priority on a separate five-point Likert scale from “Very low” to “Very high”. Additionally, they offered qualitative feedback regarding each item’s clarity and comprehensiveness and proposed additional items or dimensions. Following the first round, the original English scale was translated into Italian to ensure accessibility and applicability across both local and international contexts. The translation process followed a rigorous two-step approach to ensure linguistic and conceptual accuracy. Initially, the original scale (in English) was translated into Italian using a multilingual LLM, specifically Mistral Large [56]. Subsequently, two independent bilingual translators, both with expertise in artificial intelligence and psychology, undertook the refinement process. Each translator worked independently, leveraging their specialized knowledge to enhance the linguistic precision and conceptual clarity of the translation. Finally, the translators collaborated to consolidate their refinements into a cohesive and accurate final version.

Second Round

The revised scale, incorporating modifications from the first round, was examined in the second eDelphi round by the same panel of experts. In this round, participants reviewed the updated scale, which included retained, revised, and newly added items or dimensions. Experts re-evaluated each item's relevance using the same five-point Likert scale and assessed priority by ranking the items in order of importance. This ranking method facilitated a more precise determination of each item's relative significance, helping future efforts to develop shorter versions of the scale by identifying the most critical items. Additionally, experts were asked to flag any items they deemed redundant and, if flagged, to specify which other item(s) the redundant item overlapped with. Participants also provided additional qualitative feedback to confirm whether the revisions effectively addressed prior concerns and were encouraged to suggest further enhancements or provide specific recommendations for improvement. Experts whose mother tongue was Italian were also asked to evaluate the translation quality, using both a five-point Likert scale from "Very poor" to "Excellent" and an open-ended question: "Is there a better way to translate this item into Italian? If so, please provide the improved version below".

2.2.3. Data Processing and Analysis

Throughout both eDelphi rounds, data processing involved detailed quantitative and qualitative analyses. All computations were performed using Python 3.11.5, with the Pandas library [57] (version 2.1.1) used for data processing and statistical analysis. For the first round, descriptive statistics (mean, median, interquartile range, and standard deviation) were calculated for both the relevance and priority ratings of each item. Priority assessment in the second round was approached differently, using rankings aggregated through the Borda count method [58]. This method combines participant rankings by assigning points inversely proportional to rank positions, providing a more structured framework for determining the collective prioritization of items. As in [48], items were retained if over 75% of participants rated them as 4 or 5 in relevance and if the interquartile range was below 2. Items with a mean relevance score below 3, or those failing to meet the agreement criteria, were excluded, along with any dimensions left without items after removal. Qualitative feedback from both rounds underwent thematic analysis [59] to extract common themes and suggestions related to item clarity, comprehensiveness, and potential oversights. This analysis was performed independently by M.B. and S.G. and subsequently consolidated through consensus. The insights derived from this analysis informed the necessary revisions and additions to the scale, thereby enhancing its overall quality and comprehensiveness. In the second round, an additional analysis was performed to address redundancy among items. Since no established guidelines for redundancy were found in the literature, a statistical criterion was applied: items flagged as redundant were analyzed, and those exceeding the third quartile of redundancy flags were systematically removed. This redundancy-focused refinement ensured a more concise and efficient evaluation scale, aligned with expert consensus.
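To make these decision rules concrete, the sketch below shows how the retention criteria and the Borda count could be computed with pandas, the library used for the study's analyses. The data layout, column names, and example values are illustrative assumptions on our part; the paper does not publish its analysis code.

```python
import pandas as pd

def retain_item(scores: pd.Series,
                t_agree: float = 0.75, t_mean: float = 3.0,
                t_iqr: float = 2.0) -> bool:
    """Retention rule described above: over 75% of experts rate the item
    4 or 5, the mean relevance exceeds 3, and the IQR is below 2."""
    pct_agree = (scores >= 4).mean()
    iqr = scores.quantile(0.75) - scores.quantile(0.25)
    return pct_agree > t_agree and scores.mean() > t_mean and iqr < t_iqr

# Hypothetical first-round relevance ratings (rows: experts, columns: items).
relevance = pd.DataFrame({
    "UR1": [5, 5, 4, 5, 4, 5, 4, 5],
    "UR2": [2, 4, 3, 2, 4, 3, 2, 3],
})
print(relevance.apply(retain_item))  # UR1 -> True (retain), UR2 -> False (drop)

# Second-round priority via the Borda count: each expert ranks a dimension's
# items (1 = most important), rank r earns (n_items + 1 - r) points, and
# points are summed across experts to yield the collective prioritization.
rankings = pd.DataFrame({  # one row per expert, one column per item
    "UR1": [1, 1, 2],
    "UR3": [2, 3, 1],
    "UR4": [3, 2, 3],
})
n_items = rankings.shape[1]
borda = (n_items + 1 - rankings).sum().sort_values(ascending=False)
print(borda)  # higher totals indicate higher collective priority
```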
To systematically guide and organize the refinement of the evaluation scale, we introduced the Add, Modify, Drop (AMD) approach, summarized in Algorithm 1. This framework consolidates established methods for decision-making on items and dimensions when developing new scales with the Delphi method [60,61,62,63]. For each dimension, items were dropped based on the aforementioned quantitative criteria, new items were added in response to qualitative feedback to address identified gaps, and existing items were modified according to qualitative feedback to improve clarity, comprehensiveness, and relevance.
Algorithm 1. Add, Modify, Drop (AMD) algorithm

Require: D: dimensions; I_d: items in dimension d; quantitative thresholds t_agree, t_IQR, t_mean, t_redundancy; Q: qualitative feedback.
Ensure: Refined scale with updated dimensions and items.
1: Initialize D_refined ← D, I_d^refined ← I_d for all d ∈ D.
2: for each d ∈ D_refined do
3:     for each i ∈ I_d^refined do
4:         Compute % agree, mean, and IQR.
5:         if % agree ≥ t_agree and mean > t_mean and IQR < t_IQR then
6:             Retain i.
7:         else
8:             Drop i.
9:         end if
10:     end for
11:     Refine I_d^refined with Q: improve clarity, address gaps, and add missing items.
12:     Perform redundancy analysis:
13:     for each i ∈ I_d^refined do
14:         Compute the number of flags for i based on redundancy checks.
15:         if flags for i ≥ t_redundancy then
16:             Remove i from I_d^refined.
17:         end if
18:     end for
19:     if I_d^refined = ∅ then
20:         Remove d from D_refined.
21:     end if
22: end for
23: return D_refined with updated I_d^refined.
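As a runnable counterpart to the pseudocode, a minimal Python rendering of the AMD loop follows. The item representation is an assumption for illustration, and the Add/Modify step is left as a comment hook, since in this study those decisions were made by the authors from qualitative feedback rather than computed automatically.

```python
from statistics import mean, quantiles

def amd_refine(dimensions: dict, t_agree: float = 0.75, t_mean: float = 3.0,
               t_iqr: float = 2.0, t_redundancy: int = 4) -> dict:
    """One AMD pass. `dimensions` maps a dimension name to a list of items,
    each a dict {"text": str, "ratings": [int], "flags": int}, where
    `ratings` holds five-point relevance scores and `flags` counts how many
    experts marked the item as redundant."""
    refined = {}
    for dim, items in dimensions.items():
        kept = []
        for item in items:
            r = item["ratings"]
            q1, _, q3 = quantiles(r, n=4)             # quartiles of ratings
            agree = sum(s >= 4 for s in r) / len(r)   # share of 4/5 ratings
            # Drop step: the quantitative retention criteria (steps 5-9).
            if agree > t_agree and mean(r) > t_mean and (q3 - q1) < t_iqr:
                kept.append(item)
        # Add/Modify step (step 11): revise wording and add items based on
        # qualitative feedback -- performed manually by the authors here.
        # Redundancy step (steps 13-18): drop heavily flagged items.
        kept = [it for it in kept if it["flags"] < t_redundancy]
        if kept:  # a dimension left with no items is removed (steps 19-21)
            refined[dim] = kept
    return refined
```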

2.2.4. Conclusion and Reporting

Upon completing the two eDelphi rounds, the final version of CES-LCC was assembled, documented, and prepared for dissemination (a full version of the scale is provided in Section 3.3).

2.3. Initial Validation in Real-World

To assess the reliability of the CES-LCC, we conducted an initial validation in a real-world setting. This stage focused on testing the internal consistency of the scale items, both globally and across its dimensions. Data collection involved 49 users (participant details in Appendix C) engaging with an LLM-powered chatbot that delivered the first session of Self-Help Plus (SH+), an Acceptance and Commitment Therapy (ACT)-based intervention for stress management and prevention originally developed by the World Health Organization (WHO) [64]. In this session, participants were introduced to the chatbot and received information about stress, emotional storms, and exercises people can use to manage these situations (e.g., grounding, focused attention). Participants filled in the CES-LCC after a single interaction with the chatbot. Items were rated on a 5-point Likert scale ranging from "Strongly Disagree" to "Strongly Agree". The reliability of the evaluation scale was assessed by calculating Cronbach's alpha [65] for each of the nine dimensions, as well as for the overall scale. Additionally, both item-total correlations [66] and inter-item correlations were examined for each dimension to verify that individual items contribute meaningfully to their respective constructs without introducing strong redundancy.
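For reference, these reliability statistics can be computed directly from the participants × items response matrix; the sketch below, using pandas, shows one way to do so. The function names and the illustrative responses are our assumptions, not the study's code.

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a participants x items matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def item_total_correlations(items: pd.DataFrame) -> pd.Series:
    """Corrected item-total correlation: each item against the sum of the
    remaining items in the same dimension."""
    return pd.Series({col: items[col].corr(items.drop(columns=col).sum(axis=1))
                      for col in items.columns})

def mean_inter_item_correlation(items: pd.DataFrame) -> float:
    """Average of the off-diagonal entries of the item correlation matrix."""
    corr = items.corr()
    k = corr.shape[0]
    return (corr.to_numpy().sum() - k) / (k * (k - 1))

# Illustrative responses to the three CRR items from four participants.
# Reverse-coded items (CRR3 in the final scale) must be recoded first:
# on a 5-point scale, the recoded value is 6 - raw score.
crr = pd.DataFrame({"CRR1": [5, 4, 3, 5],
                    "CRR2": [5, 4, 3, 4],
                    "CRR3": [1, 2, 3, 2]})
crr["CRR3"] = 6 - crr["CRR3"]
print(cronbach_alpha(crr))             # high alpha for these toy data
print(item_total_correlations(crr))
print(mean_inter_item_correlation(crr))
```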

3. Results

3.1. Demographic Description of Experts

The expert panel consisted of 16 professionals with expertise in psychology, AI, HCI, and digital therapeutics (DTx) (Table 1). Gender representation was balanced, with 56.25% male and 43.75% female participants, and academic qualifications were predominantly at the master's (50%) and doctoral (31.25%) levels. Participants' average age was 34.5 years (SD = 10.66), indicating moderate age variability, while professional roles ranged from researchers (50%) to AI developers (37.5%) and psychologists (12.5%). The group covered a broad spectrum of professional experience, with half of the participants in junior roles (3–5 years in the field) and others contributing more senior-level expertise (18.75% with 21+ years of experience). All participants were based in Italy. While this shared geographical background offers cultural homogeneity, it might also limit the range of views represented on the panel (discussed further in Section 4).

3.2. First Round

During the first round of the eDelphi process, a thorough assessment of the original 18-item scale was undertaken. Item-level analysis revealed that 6 items did not meet the predetermined agreement criteria based on relevance and interquartile range (IQR) thresholds, leading to their exclusion from the scale (see Appendix A.1). Concurrently, experts provided qualitative feedback that resulted in the addition of 12 new items, enhancing the scale’s ability to capture significant aspects that need to be evaluated in LLM-powered counseling chatbots. Additionally, five existing items were split into separate subitems to more effectively address distinct aspects, as some items were initially found to cover multiple, overlapping areas. To improve clarity, two items were rephrased based on the qualitative suggestions provided (see Appendix A.3). The dimension “Ease of use and interaction” was removed due to the absence of remaining items after the exclusion process. Meanwhile, a new dimension called “Memory” was introduced to evaluate the chatbot’s ability to retain and utilize prior interactions effectively.
Experts recommended (see Appendix A.2) that each dimension should contain at least three items to meet the psychometric requirement for assessing internal consistency and to ensure the scale's reliability ("Psychometrically, factors should have at least 3 items to be considered reliable, with 2 items it is not even possible to calculate internal consistency"). Consequently, after initial revisions, new items were added to dimensions with fewer than three items to meet this requirement. Qualitative feedback also highlighted the necessity of assessing privacy and security concerns related to the chatbot. However, experts concluded that these aspects pertain more to production nonfunctional requirements than to intrinsic characteristics of the chatbot's functionality and were therefore excluded from the evaluation scale ("Items related to data privacy and security might be relevant in this scenario. However, in my experience, these items are more aligned with production or implementation processes and might be better addressed under regulations like the EU AI Act and GDPR".). Furthermore, in relation to the "Trust" dimension, experts raised concerns about anthropomorphizing chatbots, cautioning that attributing human-like qualities could skew the assessment of trustworthiness. As a result, all items containing any form of anthropomorphism were removed and replaced with new items proposed by the experts to better align with the construct while avoiding any reference to human characteristics. The priority assessment of the items did not provide a clear picture, as priority values ranged from 3.13 to 4.88, with an average value of 3.88 (SD = 0.41).

3.3. Second Round

The second round of the eDelphi process focused on refining the revised evaluation scale, which at this point consisted of 34 items across nine dimensions, incorporating the feedback provided during the first round. General feedback gathered in the second round indicated that the scale was comprehensive and complete. However, experts also remarked on some redundancy in the scale, with certain items overlapping or duplicating information. The average relevance score of the items in this round was 4.32 (SD = 0.40), reflecting strong agreement among experts on the importance of the included items. Despite this general consensus, five items were removed due to low agreement, as they failed to meet the thresholds for relevance and IQR established in the methodology. Additionally, two items were excluded based on redundancy. The Italian translation of the scale received high ratings for quality, with an average score of 4.59 (SD = 0.31). However, comments from mother-tongue experts (n = 12) highlighted the need for minor adjustments to improve linguistic precision and conceptual clarity. Consequently, 14 items were revised to enhance their translation quality and to maintain consistency between the English and Italian versions. Furthermore, four items were rephrased in both languages based on qualitative feedback. To complement these refinements, the collective ranking of the items within each dimension was computed. This analysis not only provides a foundation for the potential development of shorter versions of the scale, which may ease its deployment in practical settings, but also makes it possible to identify the most relevant item within each dimension. These rankings, along with the finalized version of the CES-LCC (27 items across 9 dimensions), are included in Table 2.

3.4. Initial Validation

The initial validation of the scale in a real-world setting demonstrated its reliability both globally and across individual dimensions. A total of 49 participants completed the evaluation scale after interacting with the LLM-powered chatbot delivering the first session of the Self-Help Plus (SH+) intervention. The overall scale exhibited excellent internal consistency, with a Cronbach's alpha of 0.94, exceeding the generally accepted threshold of 0.70 for reliability [67]. Across the nine dimensions, Cronbach's alpha values ranged from 0.47 to 0.91, with 8 out of 9 dimensions exceeding the 0.70 threshold (see Table 3). These results suggest strong consistency for all dimensions except Clarity and Relevance of Responses (CRR), which reached a Cronbach's alpha of only 0.47. Further investigation revealed that CRR3, the only reverse-coded item on the scale, was a key contributor to the lower reliability. Despite being recoded for the computation of Cronbach's alpha, it may have introduced additional cognitive complexity for respondents, potentially affecting the consistency of responses within this dimension. Average item-total correlations across the dimensions ranged from 0.33 to 0.82, with most dimensions showing satisfactory alignment between items and their respective constructs. The Emotional Support (ES) and Overall Satisfaction (OS) dimensions achieved the highest item-total correlations, with means of 0.82 and 0.80, respectively, reflecting strong coherence within these constructs. By contrast, the CRR dimension showed the lowest mean item-total correlation at 0.33, further highlighting the misalignment of CRR3 with the rest of the items in this dimension. Inter-item correlations provided additional insights into the internal structure of the scale. Mean inter-item correlations ranged from 0.28 (CRR) to 0.77 (ES). While most dimensions demonstrated inter-item correlations within the acceptable range (0.20–0.70) [68], ES and OS displayed notably high mean inter-item correlations (0.77 and 0.75, respectively), suggesting a need to investigate potential residual redundancies in these dimensions. These preliminary findings indicate that while the scale overall and most of its dimensions demonstrate acceptable psychometric properties, specific dimensions, such as CRR, need further investigation.

4. Discussion

This study aimed to develop a comprehensive evaluation scale for LLM-powered counseling chatbots (CES-LCC) by leveraging the domain knowledge of a pool of experts through the eDelphi method. Through two rounds of expert feedback, the scale was refined to address a broad spectrum of aspects related to the evaluation of this type of counseling chatbot. The results, particularly the qualitative feedback obtained through open-ended questions, highlight the importance of a multidisciplinary approach to developing tools that effectively evaluate the different aspects and functionalities of modern digital health solutions. The final version of the CES-LCC includes 27 items across nine dimensions and offers a robust framework to assess the unique challenges and capabilities of LLM-powered counseling chatbots.

The structured eDelphi process facilitated the identification and integration of critical evaluation dimensions, leading to significant refinements. For instance, the addition of a "Memory" dimension underscores the need to assess chatbots' abilities to retain and build upon previous interactions, a functionality critical for creating a coherent and personalized user experience with digital mental health interventions. The importance of this dimension is grounded in the scientific literature on Memory Support Interventions [69,70], in which the therapist embeds information from previous sessions in the ongoing dialogue, session summaries, and skill-building exercises to enhance retention, facilitate continuity, and promote the practical application of therapeutic concepts in the patient's daily life. In addition, recent research [71] on LLM-powered chatbots with long-term memory features suggests that embedding memories in interactions can improve engagement and user experience while fostering self-disclosure and a sense of familiarity. These findings underscore the significant role of memory in shaping the long-term user experience, distinguishing chatbots with memory capabilities from those without.

The exclusion of a "Privacy and security" dimension, by contrast, reflects the need for a focused approach to evaluation in which chatbots and their infrastructural aspects (e.g., production choices such as selecting encryption methods, implementing security protocols, and adhering to privacy regulations) are evaluated separately. This is in line with ISO/IEC 25010 [72], which distinguishes between different quality characteristics in system and security evaluation. In this standard, security (encompassing attributes such as confidentiality, integrity, and authenticity) is treated as a distinct nonfunctional requirement, separate from usability or functional suitability [73].

The expert panel also emphasized the importance of avoiding anthropomorphic language when assessing trust, so as not to misattribute human-like qualities to AI agents. This insight reflects a broader challenge related to both the design of transparent AI agents and the frameworks used to assess such systems. Anthropomorphism is often used to increase retention and promote self-disclosure [74]; however, in mental health contexts it carries a substantial risk of exacerbating maladaptive behaviors and thoughts (e.g., social isolation) [75]. As a result, evaluation methods must address the unique capabilities of AI-driven systems while avoiding the promotion of anthropomorphic views, particularly in contexts where such perspectives could pose significant risks.

4.1. Implications

By addressing gaps in existing methodologies, CES-LCC offers a framework for comprehensively assessing LLM-powered counseling chatbots. Unlike traditional evaluation tools that often focus on single aspects, this scale captures a broader spectrum of dimensions, including emotional support, memory retention, and trustworthiness. This study highlights the potential value and utility of the developed evaluation scale, although its applicability and impact require further validation and exploration. For researchers, the scale provides an integrated tool that can facilitate systematic investigations into the effectiveness of counseling chatbots. By combining technical and relational dimensions, the scale encourages multidisciplinary studies, potentially fostering deeper insights into how these technologies interact with users in complex, emotionally charged scenarios. This could contribute to the development of more sophisticated chatbot designs and the refinement of LLM technologies in therapeutic settings. In practice, the scale may serve as a useful tool for developers, mental health practitioners, and policymakers to evaluate and improve counseling chatbots. Developers could use the scale to identify specific areas for enhancement, ensuring their chatbots meet the demands of the users. Mental health practitioners might find the scale helpful when selecting chatbots to integrate into their services, as it provides a structured way to assess their potential. Finally, policymakers, particularly those involved in healthcare technology regulation, could leverage the scale to establish benchmarks for chatbot performance and safety.

4.2. Limitations and Future Research

While this study provides valuable insights, several limitations must be acknowledged. First, the expert sample size was limited, and because the participant pool included only experts from Italy, the generalizability of the findings to broader cultural or professional contexts may be limited. The selection of Italian experts was a strategic choice, as we aimed to develop both the English scale and an Italian variant. Given our familiarity with the Italian landscape, we were able to engage highly relevant professionals whose expertise aligned closely with the study's objectives. This approach ensured that the input collected was well informed and contextually grounded. While including experts from multiple countries at this stage could have provided broader insights, it would also have introduced challenges in ensuring the relevance and comparability of expert contributions. To enhance the generalizability of the findings, future research should expand the expert panel internationally, allowing for cross-cultural validation and adaptation of the scale. Additionally, although an initial real-world validation was conducted, it involved completing the CES-LCC after a single chatbot-delivered session with a limited group of users (n = 49), which restricts the robustness of the conclusions regarding the scale's practical application and reliability. Comprehensive real-world testing with diverse user groups is needed to assess the scale's reliability and utility across various scenarios. The scale's development is still in its nascent stages, and its psychometric properties require further investigation. In particular, while factor analysis was not employed in this study, it remains a crucial next step to reinforce the psychometric robustness of the scale. Given the study's exploratory nature and reliance on expert consensus through the eDelphi method, factor analysis was deferred to a later stage, when a larger and more diverse dataset will be available. Incorporating factor analysis in future research will help confirm the scale's latent structure and enhance its validity, supporting its application across different contexts. Finally, the reliance on the eDelphi method, which depends on subjective expert judgment, introduces potential biases despite efforts to ensure diverse expertise and minimize the influence of individual perspectives. Future iterations should integrate additional methodologies (e.g., factor analysis) to corroborate and enhance the objectivity of the findings.

5. Conclusions

This study presents the CES-LCC, a comprehensive evaluation scale developed to assess the unique challenges posed by evaluating LLM-powered counseling chatbots. Through an iterative eDelphi process involving multidisciplinary experts, the scale captures critical dimensions such as emotional support, trust, memory retention, and overall satisfaction. Initial validation in a real-world setting indicates strong reliability, emphasizing its potential utility for researchers, developers, and practitioners. The scale’s multidimensional approach encourages a holistic assessment of chatbot performance, facilitating the identification of areas for enhancement. Despite its limitations, including the reliance on a geographically restricted expert panel for its development and limited user validation, the CES-LCC represents a significant step forward in standardizing the evaluation of modern counseling chatbots. Future research should focus on broader validation efforts, integrating diverse user perspectives, exploring the scale’s psychometric properties, and examining its applicability in real-world contexts more extensively.

Author Contributions

M.B. and S.G. contributed substantially to the conception and design of the study, to the acquisition of data, and to the editing of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by Hub Life Science—Digital Health (LSH-DH) PNC-E3-2022-23683267—Project DHEAL-COM—CUP C63C22001970001, Ministry of Health (Italy) under the Piano Nazionale Complementare al PNRR Ecosistema Innovativo della Salute (Code PNC-E.3). Disclaimer: This publication reflects only the authors’ views, and the Italian Ministry of Health is not responsible for any use that may be made of the information it contains.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the HIT Ethics Committee of the University of Padova (Code: 2023_20241R2, Date: 2 May 2024).

Informed Consent Statement

Written informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Data are available on request. All data have been fully anonymized, with no personally identifiable information included. Participants have been de-identified using randomly generated unique identifiers, and no linkage key exists that could enable re-identification. The dataset is securely stored in an encrypted format on a GDPR-compliant server.

Acknowledgments

During the preparation of this work the authors used ChatGPT (GPT-4o) to improve readability and language. After using this service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Round 1 Results Overview (Items)

Each entry reports the dimension and item, Relevance (% Agree; M (SD); Mdn (R); IQR), Priority (M (SD); Mdn (R); IQR), and sample qualitative (QL) feedback.

UR | The chatbot consistently understands what I am saying or asking.
Relevance: 100.00% agree; 4.88 (0.34); Mdn 5 (4–5); IQR 0.00. Priority: 4.88 (0.34); Mdn 5 (4–5); IQR 0.00.
Feedback:
- I'd add another item about the tone (e.g., The chatbot understands the tone of my request)
- I believe an important feature is not only that it understands but also infers from the sentence

UR | I have to rephrase my requests very often for the chatbot to understand.
Relevance: 62.50% agree; 3.69 (1.14); Mdn 4 (1–5); IQR 1.25. Priority: 3.88 (0.89); Mdn 4 (3–5); IQR 2.00.
Feedback: N/A

PHI | The chatbot provides accurate and helpful information.
Relevance: 87.50% agree; 4.44 (1.09); Mdn 5 (1–5); IQR 1.00. Priority: 4.38 (1.15); Mdn 5 (1–5); IQR 1.00.
Feedback:
- I would divide the question into two separate questions. Answers might be accurate, but not helpful and vice versa, so (1) the chatbot provides accurate information; (2) the chatbot provides helpful information
- I think it's important to add that the information it provides is grounded in theoretical frameworks and scientific literature.

PHI | The chatbot often provides incorrect or incomplete information.
Relevance: 75.00% agree; 3.94 (1.18); Mdn 4 (1–5); IQR 1.25. Priority: 3.69 (1.25); Mdn 4 (1–5); IQR 0.75.
Feedback: N/A

CRR | The chatbot's responses are clear, concise, and easy to understand.
Relevance: 81.25% agree; 4.06 (1.00); Mdn 4 (2–5); IQR 1.00. Priority: 4.00 (1.03); Mdn 4 (2–5); IQR 1.25.
Feedback:
- Conciseness and easiness are very different dimensions, I don't feel like they should be evaluated together
- I'd add an item about verbosity (e.g., The chatbot adds superfluous information related to the query)

CRR | The chatbot's responses are confusing or irrelevant to my questions.
Relevance: 81.25% agree; 4.00 (1.32); Mdn 4 (1–5); IQR 1.00. Priority: 3.94 (1.12); Mdn 4 (1–5); IQR 1.00.
Feedback:
- I'd split the item in two (confusing/irrelevant)

EUI | The chatbot is easy to interact with.
Relevance: 62.50% agree; 3.94 (1.00); Mdn 4 (2–5); IQR 2.00. Priority: 3.63 (1.31); Mdn 3.5 (1–5); IQR 2.00.
Feedback:
- I have no better way. My doubt is about the definition of easy. What does it mean? How can someone evaluate this dimension?

EUI | Using the chatbot is frustrating or requires too many tentative interactions.
Relevance: 62.50% agree; 3.81 (1.11); Mdn 4 (1–5); IQR 2.00. Priority: 3.56 (1.21); Mdn 3.5 (2–5); IQR 2.25.
Feedback:
- The item is not very clear; does it refer to access or the actual internal use of the LLM?

LQ | The chatbot uses correct grammar and spelling in its responses.
Relevance: 75.00% agree; 3.88 (1.02); Mdn 4 (1–5); IQR 0.50. Priority: 3.50 (1.21); Mdn 4 (1–5); IQR 1.00.
Feedback: N/A

LQ | The chatbot's language style is natural and appropriate for the context.
Relevance: 68.75% agree; 4.00 (0.97); Mdn 4 (2–5); IQR 2.00. Priority: 3.50 (1.37); Mdn 3.5 (1–5); IQR 2.00.
Feedback:
- I would divide this item into 2 items: (1) The chatbot's language style is/seems natural; (2) The chatbot's language is appropriate for the context.

T | I believe the chatbot has my best interests at heart.
Relevance: 43.75% agree; 3.13 (1.50); Mdn 3 (1–5); IQR 2.25. Priority: 3.13 (1.50); Mdn 3 (1–5); IQR 2.25.
Feedback:
- I would avoid formulations that suggest that the chatbot might have a cognition.
- I believe that it is tricky to talk about the chatbot as if it has agency and conscience also because it might lead to overreliance and/or excessive anthropomorphization of the tool.

T | I am willing to rely on the chatbot in the future.
Relevance: 62.50% agree; 3.81 (0.91); Mdn 4 (2–5); IQR 1.25. Priority: 3.50 (1.15); Mdn 3.5 (2–5); IQR 1.50.
Feedback: N/A

ES | The chatbot makes me feel heard and understood.
Relevance: 75.00% agree; 4.06 (1.12); Mdn 4 (1–5); IQR 1.25. Priority: 3.94 (1.18); Mdn 4 (1–5); IQR 1.25.
Feedback:
- I'd add an item about sense of humor (e.g., The chatbot shows to have a sense of humor when required)

ES | The chatbot's responses feel empathetic and supportive.
Relevance: 87.50% agree; 4.19 (0.98); Mdn 4 (2–5); IQR 1.00. Priority: 4.00 (0.89); Mdn 4 (1–5); IQR 0.00.
Feedback:
- I'd add an item about feeling reassured (e.g., The chatbot's inputs and responses can make me feel reassured)

GD | The chatbot provides helpful advice and suggestions for coping with my problems.
Relevance: 93.75% agree; 4.38 (0.62); Mdn 4 (3–5); IQR 1.00. Priority: 4.25 (0.68); Mdn 4 (3–5); IQR 1.00.
Feedback:
- This question might crossload into the "Providing helpful information" factor. However, I would keep it, because in this case it specifically talks about coping, but maybe phrasing it as follows: The chatbot provides adjusted guidance in coping with my problems.

GD | The chatbot encourages me to take positive steps towards my goals.
Relevance: 75.00% agree; 3.88 (0.81); Mdn 4 (2–5); IQR 0.25. Priority: 3.81 (0.83); Mdn 4 (2–5); IQR 1.00.
Feedback:
- I believe it is important to assess an individual's goals carefully. For example, a person with an eating disorder might set a goal to lose an extreme amount of weight, which is unhealthy. Therefore, it's crucial to remember that a patient's goals are not always the best for their well-being.

OS | I am overall satisfied with the usability of this chatbot.
Relevance: 87.50% agree; 4.44 (0.73); Mdn 5 (3–5); IQR 1.00. Priority: 4.44 (0.73); Mdn 5 (3–5); IQR 1.00.
Feedback:
- I want to point out that not only the usability but also the effectiveness in helping is important.

OS | I would not recommend this chatbot to others due to usability issues.
Relevance: 75.00% agree; 3.88 (1.15); Mdn 4 (1–5); IQR 1.25. Priority: 3.88 (1.31); Mdn 4 (1–5); IQR 2.00.
Feedback:
- This statement sounds somehow redundant to the first one in terms that lower scores on statement 1 seem to be almost equivalent to high

Appendix A.2. Round 1 Qualitative Feedback (General and New Dimensions)

New dimensions:
- I think assessing memory quality is crucial when dealing with real-world implementations using LLMs. So, I suggest adding this dimension to the assessment.
- I believe there are missing items related to the perception of how the chatbot handles my privacy and data security, such as how it shares my personal information with third parties.
- In my opinion, privacy and data security, especially regarding how the chatbot shares personal information, are important aspects not included in the items. However, these seem more tied to production or implementation and might be better addressed in a separate, dedicated evaluation.
- Items related to data privacy and security might be relevant in this scenario. However, in my experience, these items are more aligned with production or implementation processes and might be better addressed under regulations like the EU AI Act and GDPR.

General:
- Almost all the statements sound actually very relevant, I provided some lower scores in some of them just to distinguish the ones I think are most relevant but in general, all are relevant!
- I felt like all of the shown questions were relevant in some way: that's why some evaluations were a bit harsh, just so that I could express what is more relevant from my point of view. Anyway, the questions were all pretty clear
- An important consideration for real-life implementation of LLM-powered chatbots is ensuring accessibility for a wide range of users. This includes compatibility with various devices, such as smartphones, tablets, and computers, to meet diverse user needs. Additionally, designing the chatbot to be inclusive is crucial—for example, allowing users to specify preferred names and pronouns to support transgender and gender-diverse individuals, and incorporating features like colorblind-friendly graphics or text presentation options to assist users with visual impairments or reading difficulties. These steps can significantly enhance user experience and inclusivity.
- Psychometrically, factors should have at least 3 items to be considered reliable, with 2 items it is not even possible to calculate internal consistency

Appendix A.3. Round 1 Decision

Each entry lists, per dimension, the items Added, Modified, and Dropped, with motivations in parentheses.

UR
Add:
- "The chatbot understands the tone of my request" (QL Feedback)
- "The chatbot asks specific questions to better understand my requests" (QL Feedback)
- "The chatbot infers information from my messages" (QL Feedback)
Modify: Nothing
Drop:
- "I have to rephrase my requests very often for the chatbot to understand". (% Agree Relevance)

PHI
Add:
- "The chatbot provides information grounded in theory and scientific literature". (QL Feedback)
- "The chatbot provides references". (QL Feedback)
Modify:
- Split "The chatbot provides accurate and helpful information". into "The chatbot provides accurate information". and "The chatbot provides helpful information". (QL Feedback)
- Split "The chatbot often provides incorrect or incomplete information". into "The chatbot often provides incorrect information". and "The chatbot often provides incomplete information". (QL Feedback)
Drop: Nothing

CRR
Add:
- "The chatbot adds superfluous information related to the query" (QL Feedback)
Modify:
- Split "The chatbot's responses are clear, concise, and easy to understand". into "The chatbot's responses are clear and easy to understand". and "The chatbot's responses are adequately concise". (QL Feedback)
- Split "The chatbot's responses are confusing or irrelevant to my questions". into "The chatbot's responses are confusing". and "The chatbot's responses are irrelevant to my questions". (QL Feedback)
Drop: Nothing

EUI
Add: Nothing
Modify: Nothing
Drop:
- The entire dimension (% Agree Relevance, IQR Relevance, QL Feedback)

LQ
Add: Nothing
Modify:
- Split "The chatbot's language style is natural and appropriate for the context". into "The chatbot's language style is/seems natural". and "The chatbot's language is appropriate for the context". (% Agree Relevance, IQR Relevance, QL Feedback)
Drop: Nothing

T
Add:
- "I feel safe sharing my personal matters with the chatbot" (QL Feedback)
- "I believe that the feedback/information provided by the chatbot are trustworthy" (QL Feedback)
- "I believe the chatbot is transparent about its limitations and capabilities". (QL Feedback)
Modify: Nothing
Drop:
- "I believe the chatbot has my best interests at heart" (% Agree Relevance, IQR Relevance, QL Feedback)
- "I am willing to rely on the chatbot in the future". (% Agree Relevance)

ES
Add:
- "The chatbot's responses can make me feel reassured" (QL Feedback)
- "The chatbot shows to have a sense of humor when required" (QL Feedback)
Modify: Nothing
Drop: Nothing

GD
Add:
- "The chatbot helps me set realistic and achievable goals". (QL Feedback)
Modify:
- Modify "The chatbot provides helpful advice and suggestions for coping with my problems". into "The chatbot provides adjusted guidance in coping with my problems" to avoid crossloading with another factor (QL Feedback)
- Modify "The chatbot encourages me to take positive steps". (QL Feedback)
Drop: Nothing

OS
Add:
- "I am overall satisfied with the effectiveness of this chatbot" (QL Feedback)
- "I feel that my interactions with the chatbot were worthwhile". (QL Feedback)
Modify: Nothing
Drop:
- "I would not recommend this chatbot to others due to usability issues". (Redundancy, QL Feedback)

M [New]
Add:
- "The chatbot accurately recalls key details from previous conversations". (QL Feedback)
- "The chatbot maintains consistency by integrating past interactions into current responses". (QL Feedback)
- "The chatbot adapts its advice based on information provided in earlier sessions". (QL Feedback)
Modify: Nothing
Drop: Nothing

Appendix B

Appendix B.1. Round 2 Results Overview (Items)

Each entry reports the dimension, the item with its Italian translation, Relevance (% Agree; M (SD); Mdn (R); IQR), redundancy flags, priority points (Borda count), translation quality (% Agree; M (SD)), and sample qualitative feedback ([Content] or [Translation]). Suggested alternative translations are reported in the original Italian.

UR | The chatbot consistently understands what I am saying or asking. (Il chatbot capisce sempre ciò che sto dicendo o chiedendo.)
Relevance: 100.00% agree; 4.87 (0.35); Mdn 5 (4–5); IQR 0.00. Flags: 1. Points: 56. Translation: 73.33% agree; 4.73 (0.47).
Feedback:
- I'd prefer "The chatbot consistently understands what I am saying AND asking". The "or" makes it hard to trust high scores. [Content]
- I would remove "sempre", which in Italian could introduce doubt rather than reinforce the statement. [Translation]

UR | The chatbot understands the tone of my request. (Il chatbot capisce il tono della mia richiesta.)
Relevance: 73.33% agree; 4.07 (0.80); Mdn 4 (3–5); IQR 1.50. Flags: 3. Points: 27. Translation: 73.33% agree; 4.67 (0.65).
Feedback:
- It is not clear to me what "understanding the tone" means here [Content]

UR | The chatbot asks specific questions to better understand my requests. (Il chatbot fa domande specifiche per capire meglio le mie richieste.)
Relevance: 86.67% agree; 4.20 (0.68); Mdn 4 (3–5); IQR 1.00. Flags: 0. Points: 30. Translation: 80.00% agree; 4.92 (0.29).
Feedback: N/A

UR | The chatbot infers information from my messages. (Il chatbot inferisce informazioni dai miei messaggi.)
Relevance: 80.00% agree; 4.43 (0.94); Mdn 5 (2–5); IQR 1.00. Flags: 2. Points: 37. Translation: 60.00% agree; 4.25 (1.06).
Feedback:
- The term "infer" is a bit ambiguous; I would suggest revising it as follows: "The chatbot is able to make adequate inferences based on my messages". [Content]
- Suggested: "Il chatbot deduce informazioni dai miei messaggi" [Translation]

PHI | The chatbot provides accurate information. (Il chatbot fornisce informazioni accurate.)
Relevance: 86.67% agree; 4.47 (1.09); Mdn 5 (2–5); IQR 1.00. Flags: 2. Points: 81. Translation: 80.00% agree; 4.83 (0.39).
Feedback: N/A

PHI | The chatbot provides helpful information. (Il chatbot fornisce informazioni utili.)
Relevance: 100.00% agree; 4.80 (0.41); Mdn 5 (4–5); IQR 0.00. Flags: 0. Points: 68. Translation: 80.00% agree; 4.92 (0.29).
Feedback: N/A

PHI | The chatbot often provides incorrect information. (Il chatbot fornisce spesso informazioni errate.)
Relevance: 80.00% agree; 4.13 (1.13); Mdn 4 (1–5); IQR 1.00. Flags: 4. Points: 52. Translation: 80.00% agree; 4.92 (0.29).
Feedback: N/A

PHI | The chatbot often provides incomplete information. (Il chatbot fornisce spesso informazioni incomplete.)
Relevance: 73.33% agree; 4.00 (0.76); Mdn 4 (3–5); IQR 1.00. Flags: 1. Points: 53. Translation: 80.00% agree; 4.92 (0.29).
Feedback: N/A

PHI | The chatbot provides information grounded in theory and scientific literature. (Il chatbot fornisce informazioni basate su teorie e letteratura scientifica.)
Relevance: 80.00% agree; 4.13 (1.13); Mdn 4 (1–5); IQR 1.00. Flags: 3. Points: 35. Translation: 73.33% agree; 4.50 (0.67).
Feedback:
- Suggested: "Il chatbot fornisce informazioni supportate da teorie e letteratura" [Translation]

PHI | The chatbot provides references. (Il chatbot fornisce riferimenti bibliografici.)
Relevance: 66.67% agree; 3.60 (1.06); Mdn 4 (1–5); IQR 1.00. Flags: 5. Points: 26. Translation: 60.00% agree; 4.42 (0.90).
Feedback:
- I don't think it's crucial for users to have a research paper attached to questions such as "I feel bad lately, I can't sleep". It would make the UX poorer in my opinion. This would make more sense if you are building a search engine kind of system. [Content]
- Suggested: "Il chatbot fornisce riferimenti alle fonti utilizzate" [Translation]

CRR | The chatbot's responses are clear and easy to understand. (Le risposte del chatbot sono chiare e facili da capire.)
Relevance: 100.00% agree; 4.93 (0.26); Mdn 5 (4–5); IQR 0.00. Flags: 0. Points: 57. Translation: 73.33% agree; 4.75 (0.62).
Feedback:
- Suggested: "Le risposte del chatbot sono chiare e semplici da capire" [Translation]

CRR | The chatbot's responses are adequately concise. (Le risposte del chatbot sono sufficientemente concise.)
Relevance: 80.00% agree; 4.13 (0.74); Mdn 4 (3–5); IQR 1.00. Flags: 1. Points: 53. Translation: 80.00% agree; 4.75 (0.45).
Feedback: N/A

CRR | The chatbot's responses are confusing. (Le risposte del chatbot sono confondenti.)
Relevance: 93.33% agree; 4.07 (0.96); Mdn 4 (1–5); IQR 0.50. Flags: 5. Points: 50. Translation: 53.33% agree; 3.83 (1.19).
Feedback:
- This is just the reverse of clear [Content]
- Suggested: "Le risposte del chat mi confondono" [Translation]

CRR | The chatbot's responses are irrelevant to my questions. (Le risposte del chatbot non sono pertinenti alle mie domande.)
Relevance: 100.00% agree; 4.73 (0.46); Mdn 5 (4–5); IQR 0.50. Flags: 1. Points: 42. Translation: 80.00% agree; 4.92 (0.29).
Feedback: N/A

CRR | The chatbot adds superfluous information related to the query. (Il chatbot aggiunge informazioni superflue relative alla richiesta.)
Relevance: 53.33% agree; 3.47 (0.92); Mdn 4 (1–5); IQR 1.00. Flags: 8. Points: 23. Translation: 60.00% agree; 4.45 (1.29).
Feedback:
- Suggested: "Il chatbot aggiunge informazioni superflue rispetto alla richiesta." [Translation]

LQ | The chatbot uses correct grammar and spelling in its responses. (Il chatbot fornisce risposte con grammatica e ortografia corrette.)
Relevance: 80.00% agree; 3.73 (1.22); Mdn 4 (1–5); IQR 0.00. Flags: 2. Points: 34. Translation: 60.00% agree; 4.42 (0.90).
Feedback:
- Suggested: "Il chatbot fornisce risposte grammaticalmente e ortograficamente corrette." [Translation]

LQ | The chatbot's language style is/seems natural. (Lo stile linguistico del chatbot è/sembra naturale.)
Relevance: 86.67% agree; 4.47 (0.74); Mdn 5 (3–5); IQR 1.00. Flags: 0. Points: 23. Translation: 66.67% agree; 4.67 (0.78).
Feedback:
- "The chatbot's language style sounds natural" and seems more fluent [Content]
- Suggested: "Lo stile linguistico del chatbot suona naturale" [Translation]

LQ | The chatbot's language is appropriate for the context. (Il linguaggio del chatbot è appropriato per il contesto.)
Relevance: 86.67% agree; 4.57 (0.65); Mdn 5 (3–5); IQR 1.00. Flags: 0. Points: 33. Translation: 73.33% agree; 4.58 (0.67).
Feedback:
- Suggested: "Il linguaggio del chatbot è appropriato al contesto." [Translation]

T | I feel safe sharing my personal matters with the chatbot. (Mi sento al sicuro nel condividere questioni personali con il chatbot.)
Relevance: 93.33% agree; 4.53 (1.06); Mdn 5 (1–5); IQR 0.50. Flags: 0. Points: 37. Translation: 80.00% agree; 4.75 (0.45).
Feedback: N/A

T | I believe the chatbot is transparent about its limitations and capabilities. (Credo che il chatbot sia trasparente riguardo alle sue limitazioni e capacità.)
Relevance: 75.00% agree; 4.13 (0.83); Mdn 4 (3–5); IQR 1.50. Flags: 0. Points: 21. Translation: 53.33% agree; 4.18 (1.08).
Feedback:
- Suggested: "Credo che il chatbot sia trasparente riguardo ai suoi limiti e alle sue capacità" [Translation]

T | I believe that the feedback/information provided by the chatbot is trustworthy. (Credo che i feedback/le informazioni fornite dal chatbot siano affidabili.)
Relevance: 93.33% agree; 4.67 (0.82); Mdn 5 (2–5); IQR 0.00. Flags: 2. Points: 32. Translation: 66.67% agree; 4.90 (0.32).
Feedback:
- Use "e" ("and") instead of the slash [Translation]

ES | The chatbot makes me feel heard and understood. (Il chatbot mi fa sentire ascoltato e capito.)
Relevance: 86.67% agree; 4.20 (1.08); Mdn 4 (1–5); IQR 1.00. Flags: 1. Points: 53. Translation: 66.67% agree; 4.80 (0.42).
Feedback:
- I would drop this. I feel like this evaluates how the system can trick the user in terms of feeling like they are talking to someone who listens to them and understands, while an LLM obviously cannot do that. [Content]

ES | The chatbot's responses feel empathetic and supportive. (Le risposte del chatbot sembrano empatiche e di supporto.)
Relevance: 93.33% agree; 4.60 (0.63); Mdn 5 (3–5); IQR 1.00. Flags: 1. Points: 43. Translation: 53.33% agree; 4.10 (1.29).
Feedback:
- I think this is different from the previous one because it focuses on the "look" of the answers more than on the ability to convince the user of something. This is something that makes sense to evaluate I think [Content]
- Use "e supportive" instead of "di supporto" [Translation]

ES | The chatbot's responses can make me feel reassured. (Le risposte del chatbot sono in grado di farmi sentire rassicurato.)
Relevance: 80.00% agree; 4.20 (0.77); Mdn 4 (3–5); IQR 1.00. Flags: 3. Points: 34. Translation: 60.00% agree; 4.70 (0.67).
Feedback: N/A

ES | The chatbot shows to have a sense of humor when required. (Il chatbot dimostra di avere senso dell'umorismo quando necessario.)
Relevance: 60.00% agree; 3.20 (1.42); Mdn 4 (1–5); IQR 2.00. Flags: 4. Points: 20. Translation: 60.00% agree; 4.70 (0.67).
Feedback: N/A

GD | The chatbot provides adjusted guidance in coping with my problems. (Il chatbot mi fornisce indicazioni adeguate per affrontare i problemi che riporto.)
Relevance: 86.67% agree; 4.53 (0.74); Mdn 5 (3–5); IQR 1.00. Flags: 0. Points: 38. Translation: 46.67% agree; 3.90 (1.29).
Feedback:
- In translating "coping", I would suggest "per gestire" rather than "per affrontare" [Translation]
- Suggested: "Il chatbot fornisce indicazioni adeguate per affrontare i miei problemi" [Translation]

GD | The chatbot helps me set realistic and achievable goals. (Il chatbot mi aiuta a stabilire obiettivi realistici e raggiungibili.)
Relevance: 100.00% agree; 4.53 (0.52); Mdn 5 (4–5); IQR 1.00. Flags: 1. Points: 25. Translation: 66.67% agree; 4.80 (0.42).
Feedback: N/A

GD | The chatbot encourages me to take positive steps. (Il chatbot mi incoraggia a compiere sforzi per il mio benessere.)
Relevance: 86.67% agree; 4.27 (1.03); Mdn 5 (2–5); IQR 1.00. Flags: 2. Points: 27. Translation: 53.33% agree; 3.82 (0.87).
Feedback:
- Suggested: "Il chatbot mi incoraggia a compiere azioni costruttive." [Translation]
- Suggested: "Il chatbot mi incoraggia a compiere passi positivi" [Translation]

M | The chatbot accurately recalls key details from previous conversations. (Il chatbot ricorda accuratamente i dettagli chiave delle conversazioni precedenti.)
Relevance: 100.00% agree; 4.73 (0.46); Mdn 5 (4–5); IQR 0.50. Flags: 1. Points: 39. Translation: 60.00% agree; 4.55 (0.82).
Feedback:
- I would delete "key", to not make it seem like the chatbot can understand personal alliance, but rather its capacity to recall information at large; this item is important. [Content]

M | The chatbot maintains consistency by integrating past interactions into current responses. (Il chatbot integra coerentemente le interazioni passate nelle risposte attuali.)
Relevance: 93.33% agree; 4.80 (0.56); Mdn 5 (3–5); IQR 0.00. Flags: 4. Points: 26. Translation: 53.33% agree; 4.50 (0.85).
Feedback:
- Suggested: "Il chatbot è coerente ed integra le interazioni passate nelle risposte attuali." [Translation]
- Suggested: "Il chatbot integra coerentemente le interazioni passate nelle risposte" [Translation]

M | The chatbot adapts its advice based on information provided in earlier sessions. (Il chatbot adatta i suoi consigli in base alle informazioni fornite nelle sessioni precedenti.)
Relevance: 93.33% agree; 4.67 (0.62); Mdn 5 (3–5); IQR 0.50. Flags: 3. Points: 25. Translation: 73.33% agree; 4.82 (0.40).
Feedback: N/A

OS | I am overall satisfied with the usability of this chatbot. (Sono complessivamente soddisfatto dell'usabilità di questo chatbot.)
Relevance: 93.33% agree; 4.53 (0.64); Mdn 5 (3–5); IQR 1.00. Flags: 0. Points: 39. Translation: 73.33% agree; 4.64 (0.50).
Feedback:
- Suggested: "Nel complesso, sono soddisfatto dell'usabilità di questo chatbot" [Translation]

OS | I feel that my interactions with the chatbot were worthwhile.
(Trovo che le mie interazioni con il chatbot siano state utili.)
86.674.20 (0.68)4 (3–5)1.0032766.674.73 (0.65)- Trovo che le mie interazioni con il chatbot siano state proficue [Translation]
I am overall satisfied with the effectiveness of this chatbot.
(Sono complessivamente soddisfatto dell’efficacia di questo chatbot.)
75.004.20 (1.01)5 (2–5)1.5022460.004.70 (0.67)- Nel complesso, sono soddisfatto… [Translation]
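For readers who wish to reproduce the per-item statistics above (percentage agreement, mean, median, IQR, and the Borda scores used to prioritize items [58]), the sketch below shows one way to compute them from a matrix of expert ratings with pandas [57]. The rating data are hypothetical, and the simple rank-based Borda scheme is an assumption for illustration, not the authors' published analysis code.

```python
import numpy as np
import pandas as pd

# Hypothetical 5-point relevance ratings from 15 experts for three items
# of one dimension (rows = experts, columns = items).
rng = np.random.default_rng(0)
ratings = pd.DataFrame(
    rng.integers(3, 6, size=(15, 3)),
    columns=["accurate", "helpful", "grounded"],
)

# Percentage agreement: share of experts rating the item 4 or 5.
pct_agree = (ratings >= 4).mean() * 100

# Simple Borda score: each expert implicitly ranks the items by rating;
# the top-ranked item earns (k - 1) points per expert, the last earns 0.
borda = (ratings.rank(axis=1, method="average") - 1).sum()

summary = pd.DataFrame({
    "% agree": pct_agree.round(2),
    "M": ratings.mean().round(2),
    "SD": ratings.std(ddof=1).round(2),
    "Mdn": ratings.median(),
    "IQR": ratings.quantile(0.75) - ratings.quantile(0.25),
    "Borda": borda,
})
print(summary)
```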

Appendix B.2. Round 2 Decision

UR
- Add: Nothing
- Modify:
  - Rephrase both the Italian translation and the original item (“The chatbot consistently understands what I am saying or asking.”) into “The chatbot consistently understands what I am saying and asking.” and “Il chatbot capisce ciò che sto dicendo e chiedendo.” (% Agree Translation, QL Feedback)
  - Rephrase both the Italian translation and the original item (“The chatbot infers information from my messages.”) into “The chatbot is able to make adequate inferences based on my messages.” and “Il chatbot è in grado di fare deduzioni appropriate basandosi sui miei messaggi.” (% Agree Translation, QL Feedback)
- Drop:
  - “The chatbot understands the tone of my request.” (% Agree Relevance, QL Feedback)

PHI
- Add: Nothing
- Modify:
  - Rephrase the Italian version of the item “The chatbot provides information grounded in theory and scientific literature.” into “Il chatbot fornisce informazioni supportate da teorie e letteratura scientifica.” (% Agree Translation, QL Feedback)
- Drop:
  - “The chatbot often provides incorrect information.” (Redundancy)
  - “The chatbot often provides incomplete information.” (% Agree Relevance)
  - “The chatbot provides references.” (% Agree Relevance, Redundancy)

CRR
- Add: Nothing
- Modify:
  - Rephrase the Italian version of the item “The chatbot’s responses are clear and easy to understand.” into “Le risposte del chatbot sono chiare e semplici da capire.” (% Agree Translation, QL Feedback)
- Drop:
  - “The chatbot’s responses are confusing.” (Redundancy, QL Feedback)
  - “The chatbot adds superfluous information related to the query.” (% Agree Relevance, Redundancy, QL Feedback)

LQ
- Add: Nothing
- Modify:
  - Rephrase the Italian version of the item “The chatbot uses correct grammar and spelling in its responses.” into “Il chatbot fornisce risposte grammaticalmente e ortograficamente corrette.” (% Agree Translation, QL Feedback)
  - Rephrase both the Italian translation and the original item (“The chatbot’s language style is/seems natural.”) into “The chatbot’s language style sounds natural.” and “Lo stile linguistico del chatbot suona naturale.” (% Agree Translation, QL Feedback)
  - Rephrase the Italian version of the item “The chatbot’s language is appropriate for the context.” into “Il linguaggio del chatbot è appropriato al contesto.” (% Agree Translation, QL Feedback)
- Drop: Nothing

T
- Add: Nothing
- Modify:
  - Rephrase the Italian version of the item “I believe the chatbot is transparent about its limitations and capabilities.” into “Credo che il chatbot sia trasparente riguardo ai suoi limiti e alle sue capacità.” (% Agree Translation, QL Feedback)
  - Rephrase both the Italian translation and the original item (“I believe that the feedback/information provided by the chatbot are trustworthy.”) into “I believe that the feedback and the information provided by the chatbot are trustworthy.” and “Credo che i feedback e le informazioni fornite dal chatbot siano affidabili.” (% Agree Translation, QL Feedback)
- Drop: Nothing

ES
- Add: Nothing
- Modify:
  - Rephrase the Italian version of the item “The chatbot’s responses feel empathetic and supportive.” into “Le risposte del chatbot risultano empatiche e supportive.” (% Agree Translation, QL Feedback)
- Drop:
  - “The chatbot shows to have a sense of humor when required.” (% Agree Relevance, Redundancy, QL Feedback)

GD
- Add: Nothing
- Modify:
  - Rephrase the Italian version of the item “The chatbot provides adjusted guidance in coping with my problems.” into “Il chatbot fornisce indicazioni personalizzate per aiutarmi a gestire i miei problemi.” (% Agree Translation, QL Feedback)
  - Rephrase the Italian version of the item “The chatbot encourages me to take positive steps.” into “Il chatbot mi incoraggia a compiere azioni costruttive.” (% Agree Translation, QL Feedback)
- Drop: Nothing

M
- Add: Nothing
- Modify:
  - Rephrase both the Italian translation and the original item (“The chatbot accurately recalls key details from previous conversations.”) into “The chatbot accurately recalls details from previous conversations.” and “Il chatbot ricorda accuratamente i dettagli delle conversazioni precedenti.” (% Agree Translation, QL Feedback)
  - Rephrase the Italian version of the item “The chatbot maintains consistency by integrating past interactions into current responses.” into “Il chatbot integra coerentemente le interazioni passate nelle risposte.” (% Agree Translation, QL Feedback)
- Drop: Nothing

OS
- Add: Nothing
- Modify:
  - Rephrase the Italian version of the item “I am overall satisfied with the usability of this chatbot.” into “Nel complesso, sono soddisfatto dell’usabilità di questo chatbot.” (% Agree Translation, QL Feedback)
  - Rephrase both the Italian translation and the original item (“I feel that my interactions with the chatbot were worthwhile.”) into “Overall, I feel that my interactions with the chatbot were worthwhile.” and “Nel complesso, trovo che le mie interazioni con il chatbot siano state proficue.” (% Agree Translation, QL Feedback)
  - Rephrase both the Italian translation and the original item (“I am overall satisfied with the effectiveness of this chatbot.”) into “I am overall satisfied with the support provided by this chatbot.” and “Nel complesso, sono soddisfatto del supporto offerto da questo chatbot.” (% Agree Translation, % Agree Relevance, QL Feedback)
- Drop: Nothing
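The decisions above combine several signals: agreement on relevance, agreement on the Italian translation, redundancy, and qualitative (QL) feedback. Purely as an illustration of how such a rule could be mechanized, the toy function below applies an assumed 70% cut-off; neither the threshold nor the function reflects the authors' actual procedure, which also weighed expert comments case by case.

```python
def round2_decision(pct_agree_relevance: float,
                    pct_agree_translation: float,
                    flagged_redundant: bool,
                    threshold: float = 70.0) -> str:
    """Toy Add/Modify/Drop rule; the 70% threshold is an assumption."""
    if pct_agree_relevance < threshold or flagged_redundant:
        return "Drop"
    if pct_agree_translation < threshold:
        return "Modify (rephrase the translation)"
    return "Keep as is"

# Example: "The chatbot provides references" (66.67% relevance agreement,
# flagged as redundant by the panel) would be dropped under this toy rule.
print(round2_decision(66.67, 60.00, flagged_redundant=True))  # -> Drop
```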

Appendix C

Demographic Profile of Users Who Participated in the Initial Validation

Characteristic | Category | Value or % (n)
Age | — | M = 32.02 (SD = 11.55)
Gender | Female | 57.14% (28)
Gender | Male | 40.81% (20)
Gender | Not specified | 2.05% (1)
Education | EQF1 | 0.00% (0)
Education | EQF2 | 8.16% (4)
Education | EQF3 | 2.04% (1)
Education | EQF4 | 14.29% (7)
Education | EQF5 | 0.00% (0)
Education | EQF6 | 28.57% (14)
Education | EQF7 | 30.61% (15)
Education | EQF8 | 16.33% (8)
Chatbot Experience | None | 18.37% (9)
Chatbot Experience | Basic | 32.65% (16)
Chatbot Experience | Intermediate | 34.69% (17)
Chatbot Experience | Expert | 14.29% (7)
LLM Experience | None | 24.49% (12)
LLM Experience | Basic | 38.78% (19)
LLM Experience | Intermediate | 26.53% (13)
LLM Experience | Expert | 10.20% (5)
Propensity to Trust in Technology [76] | — | M = 3.76 (SD = 0.51)
Country | Italy | 100% (49)

References

  1. Bendig, E.; Erb, B.; Schulze-Thuesing, L.; Baumeister, H. The Next Generation: Chatbots in Clinical Psychology and Psychotherapy to Foster Mental Health—A Scoping Review. Verhaltenstherapie 2022, 32 (Suppl. S1), 64–76. [Google Scholar] [CrossRef]
  2. Laymouna, M.; Ma, Y.; Lessard, D.; Schuster, T.; Engler, K.; Lebouché, B. Roles, Users, Benefits, and Limitations of Chatbots in Health Care: Rapid Review. J. Med. Internet Res. 2024, 26, e56930. [Google Scholar] [CrossRef]
  3. Balcombe, L. AI Chatbots in Digital Mental Health. Informatics 2023, 10, 82. [Google Scholar] [CrossRef]
  4. Kuehn, B.M. Clinician Shortage Exacerbates Pandemic-Fueled “Mental Health Crisis”. JAMA 2022, 327, 2179. [Google Scholar] [CrossRef] [PubMed]
  5. Boucher, E.M.; Harake, N.R.; Ward, H.E.; Stoeckl, S.E.; Vargas, J.; Minkel, J.; Parks, A.C.; Zilca, R. Artificially intelligent chatbots in digital mental health interventions: A review. Expert. Rev. Med. Devices 2021, 18 (Suppl. S1), 37–49. [Google Scholar] [CrossRef]
  6. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  7. Skjuve, M.; Følstad, A.; Brandtzaeg, P.B. The User Experience of ChatGPT: Findings from a Questionnaire Study of Early Users. In Proceedings of the 5th International Conference on Conversational User Interfaces, Eindhoven, The Netherlands, 19–21 July 2023; ACM: New York, NY, USA, 2023; pp. 1–10. [Google Scholar]
  8. Limpanopparat, S.; Gibson, E.; Harris, D.A. User engagement, attitudes, and the effectiveness of chatbots as a mental health intervention: A systematic review. Comput. Hum. Behav. Artif. Hum. 2024, 2, 100081. [Google Scholar] [CrossRef]
  9. O’Brien, H.L.; Toms, E.G. What is user engagement? A conceptual framework for defining user engagement with technology. J. Am. Soc. Inf. Sci. Technol. 2008, 59, 938–955. [Google Scholar] [CrossRef]
  10. Hassenzahl, M.; Tractinsky, N. User experience-a research agenda. Behav. Inf. Technol. 2006, 25, 91–97. [Google Scholar] [CrossRef]
  11. Shackel, B. Usability—Context, framework, definition, design and evaluation. Interact. Comput. 2009, 21, 339–346. [Google Scholar] [CrossRef]
  12. Moilanen, J.; Visuri, A.; Suryanarayana, S.A.; Alorwu, A.; Yatani, K.; Hosio, S. Measuring the Effect of Mental Health Chatbot Personality on User Engagement. In Proceedings of the 21st International Conference on Mobile and Ubiquitous Multimedia, Lisbon, Portugal, 27–30 November 2022; ACM: New York, NY, USA, 2022; pp. 138–150. [Google Scholar]
  13. Gabrielli, S.; Rizzi, S.; Bassi, G.; Carbone, S.; Maimone, R.; Marchesoni, M.; Forti, S. Engagement and Effectiveness of a Healthy-Coping Intervention via Chatbot for University Students During the COVID-19 Pandemic: Mixed Methods Proof-of-Concept Study. JMIR Mhealth Uhealth 2021, 9, e27965. [Google Scholar] [CrossRef] [PubMed]
  14. O’Brien, H.L.; Toms, E.G. The development and evaluation of a survey to measure user engagement. J. Am. Soc. Inf. Sci. Technol. 2010, 61, 50–69. [Google Scholar] [CrossRef]
  15. Denecke, K.; Vaaheesan, S.; Arulnathan, A. A Mental Health Chatbot for Regulating Emotions (SERMO)—Concept and Usability Test. IEEE Trans. Emerg. Top. Comput. 2021, 9, 1170–1182. [Google Scholar] [CrossRef]
  16. Escobar-Viera, C.G.; Porta, G.; Coulter, R.W.S.; Martina, J.; Goldbach, J.; Rollman, B.L. A chatbot-delivered intervention for optimizing social media use and reducing perceived isolation among rural-living LGBTQ+ youth: Development, acceptability, usability, satisfaction, and utility. Internet Interv. 2023, 34, 100668. [Google Scholar] [CrossRef] [PubMed]
  17. Lima, M.R.; Wairagkar, M.; Natarajan, N.; Vaitheswaran, S.; Vaidyanathan, R. Robotic Telemedicine for Mental Health: A Multimodal Approach to Improve Human-Robot Engagement. Front. Robot. AI 2021, 8, 618866. [Google Scholar] [CrossRef]
  18. Laugwitz, B.; Held, T.; Schrepp, M. Construction and Evaluation of a User Experience Questionnaire. In HCI and Usability for Education and Work; Holzinger, A., Ed.; Springer: Berlin/Heidelberg, Germany, 2008; pp. 63–76. ISBN 978-3-540-89349-3. [Google Scholar] [CrossRef]
  19. Shah, J.; DePietro, B.; D’Adamo, L.; Firebaugh, M.L.; Laing, O.; Fowler, L.A.; Smolar, L.; Sadeh-Sharvit, S.; Taylor, C.B.; Wilfley, D.E.; et al. Development and usability testing of a chatbot to promote mental health services use among individuals with eating disorders following screening. Int. J. Eat. Disord. 2022, 55, 1229–1244. [Google Scholar] [CrossRef]
  20. Boyd, K.; Potts, C.; Bond, R.; Mulvenna, M.; Broderick, T.; Burns, C.; Bickerdike, A.; Mctear, M.; Kostenius, C.; Vakaloudis, A.; et al. Usability testing and trust analysis of a mental health and wellbeing chatbot. In Proceedings of the 33rd European Conference on Cognitive Ergonomics, Kaiserslautern, Germany, 4–7 October 2022; ACM: New York, NY, USA, 2022; pp. 1–8. [Google Scholar]
  21. Islam, M.N.; Khan, S.R.; Islam, N.N.; Rezwan-A-Rownok, M.; Zaman, S.R.; Zaman, S.R. A Mobile Application for Mental Health Care During COVID-19 Pandemic: Development and Usability Evaluation with System Usability Scale; Springer: Cham, Switzerland, 2021; pp. 33–42. [Google Scholar]
  22. Valtolina, S.; Zanotti, P.; Mandelli, S. Designing Conversational Agents to Empower Active Aging. In Proceedings of the ACM International Conference on Intelligent Virtual Agents, Glasgow, UK, 16–19 September 2024; ACM: New York, NY, USA, 2024; pp. 1–4. [Google Scholar]
  23. Brooke, J. SUS: A “Quick and Dirty” Usability Scale. In Usability Evaluation In Industry; CRC Press: Boca Raton, FL, USA, 1996; pp. 207–212. [Google Scholar]
  24. Holmes, S.; Moorhead, A.; Bond, R.; Zheng, H.; Coates, V.; Mctear, M. Usability testing of a healthcare chatbot: Can we use conventional methods to assess conversational user interfaces? In Proceedings of the 31st European Conference on Cognitive Ergonomics, Belfast, UK, 10–13 September 2019; ACM: New York, NY, USA, 2019; pp. 207–214. [Google Scholar]
  25. Borsci, S.; Malizia, A.; Schmettow, M.; van der Velde, F.; Tariverdiyeva, G.; Balaji, D.; Chamberlain, A. The Chatbot Usability Scale: The Design and Pilot of a Usability Scale for Interaction with AI-Based Conversational Agents. Pers. Ubiquitous Comput. 2022, 26, 95–119. [Google Scholar] [CrossRef]
  26. Henkel, T.; Linn, A.J.; van der Goot, M.J. Understanding the Intention to Use Mental Health Chatbots Among LGBTQIA+ Individuals: Testing and Extending the UTAUT. In Proceedings of the 6th International Workshop, CONVERSATIONS 2022, Amsterdam, The Netherlands, 22–23 November 2022; Springer: Cham, Switzerland, 2023; pp. 83–100. [Google Scholar]
  27. Kamita, T.; Ito, T.; Matsumoto, A.; Munakata, T.; Inoue, T. A Chatbot System for Mental Healthcare Based on SAT Counseling Method. Mob. Inf. Syst. 2019, 2019, 1–11. [Google Scholar] [CrossRef]
  28. Venkatesh, V.; Morris, M.G.; Davis, G.B.; Davis, F.D. User Acceptance of Information Technology: Toward a Unified View. MIS Quarterly 2003, 27, 425. [Google Scholar] [CrossRef]
  29. Davis, F.D. Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology. MIS Quarterly 1989, 13, 319. [Google Scholar] [CrossRef]
  30. Ahuja, K.; Lio, P. Measuring Empathy in Artificial Intelligence: Insights From Psychodermatology and Implications for General Practice. Prim. Care Companion CNS Disord. 2024, 26, 24lr03782. [Google Scholar] [CrossRef] [PubMed]
  31. Zhao, J.; Plaza-del-Arco, F.M.; Genchel, B.; Curry, A.C. Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks. arXiv 2024, arXiv:2406.08598. [Google Scholar]
  32. Schmidmaier, M.; Rupp, J.; Cvetanova, D.; Mayer, S. Perceived Empathy of Technology Scale (PETS): Measuring Empathy of Systems Toward the User. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; ACM: New York, NY, USA, 2024; pp. 1–18. [Google Scholar]
  33. Concannon, S.; Tomalin, M. Measuring perceived empathy in dialogue systems. AI Soc. 2024, 39, 2233–2247. [Google Scholar] [CrossRef]
  34. Miloff, A.; Carlbring, P.; Hamilton, W.; Andersson, G.; Reuterskiöld, L.; Lindner, P. Measuring Alliance Toward Embodied Virtual Therapists in the Era of Automated Treatments With the Virtual Therapist Alliance Scale (VTAS): Development and Psychometric Evaluation. J. Med. Internet Res. 2020, 22, e16660. [Google Scholar] [CrossRef]
  35. Wei, S.; Freeman, D.; Rovira, A. A randomised controlled test of emotional attributes of a virtual coach within a virtual reality (VR) mental health treatment. Sci. Rep. 2023, 13, 11517. [Google Scholar] [CrossRef]
  36. Yu, H.Q.; McGuinness, S. An experimental study of integrating fine-tuned large language models and prompts for enhancing mental health support chatbot system. J. Med. Artif. Intell. 2024, 7, 16. [Google Scholar] [CrossRef]
  37. Crasto, R.; Dias, L.; Miranda, D.; Kayande, D. CareBot: A Mental Health ChatBot. In Proceedings of the 2021 2nd International Conference for Emerging Technology (INCET), Belagavi, India, 21–23 May 2021; pp. 1–5. [Google Scholar]
  38. Srivastava, A.; Pandey, I.; Akhtar, M.S.; Chakraborty, T. Response-act Guided Reinforced Dialogue Generation for Mental Health Counseling. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; ACM: New York, NY, USA, 2023; pp. 1118–1129. [Google Scholar]
  39. Kaysar, M.N.; Shiramatsu, S. Mental State-Based Dialogue System for Mental Health Care by Using GPT-3. In Proceedings of Eighth International Congress on Information and Communication Technology; Springer: Singapore, 2024; pp. 891–901. [Google Scholar]
  40. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL ’02), Philadelphia, PA, USA, 7–12 July 2002; Association for Computational Linguistics: Morristown, NJ, USA, 2002; pp. 311–318. [Google Scholar]
  41. Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81. [Google Scholar]
  42. Radziwill, N.M.; Benton, M.C. Evaluating Quality of Chatbots and Intelligent Conversational Agents. arXiv 2017, arXiv:1704.04579. [Google Scholar]
  43. Ding, H.; Simmich, J.; Vaezipour, A.; Andrews, N.; Russell, T. Evaluation framework for conversational agents with artificial intelligence in health interventions: A systematic scoping review. J. Am. Med. Inform. Assoc. 2024, 31, 746–761. [Google Scholar] [CrossRef]
  44. Donohoe, H.; Stellefson, M.; Tennant, B. Advantages and Limitations of the e-Delphi Technique. Am. J. Health Educ. 2012, 43, 38–46. [Google Scholar] [CrossRef]
  45. Belton, I.; MacDonald, A.; Wright, G.; Hamlin, I. Improving the practical application of the Delphi method in group-based judgment: A six-step prescription for a well-founded and defensible process. Technol. Forecast. Soc. Change 2019, 147, 72–82. [Google Scholar] [CrossRef]
  46. McMillan, S.S.; King, M.; Tully, M.P. How to use the nominal group and Delphi techniques. Int. J. Clin. Pharm. 2016, 38, 655–662. [Google Scholar] [CrossRef] [PubMed]
  47. Jünger, S.; Payne, S.A.; Brine, J.; Radbruch, L.; Brearley, S.G. Guidance on Conducting and REporting DElphi Studies (CREDES) in palliative care: Recommendations based on a methodological systematic review. Palliat. Med. 2017, 31, 684–706. [Google Scholar] [CrossRef]
  48. Denecke, K.; May, R.; Rivera Romero, O. Potential of Large Language Models in Health Care: Delphi Study. J. Med. Internet Res. 2024, 26, e52399. [Google Scholar] [CrossRef] [PubMed]
  49. Maroengsit, W.; Piyakulpinyo, T.; Phonyiam, K.; Pongnumkul, S.; Chaovalit, P.; Theeramunkong, T. A Survey on Evaluation Methods for Chatbots. In Proceedings of the 2019 7th International Conference on Information and Education Technology, Aizu-Wakamatsu, Japan, 29–31 March 2019; ACM: New York, NY, USA, 2019; pp. 111–119. [Google Scholar]
  50. Denecke, K.; Abd-Alrazaq, A.; Househ, M.; Warren, J. Evaluation Metrics for Health Chatbots: A Delphi Study. Methods Inf. Med. 2021, 60, 171–179. [Google Scholar] [CrossRef] [PubMed]
  51. Guo, Z.; Lai, A.; Thygesen, J.H.; Farrington, J.; Keen, T.; Li, K. Large Language Model for Mental Health: A Systematic Review. arXiv 2024, arXiv:2403.15401. [Google Scholar]
  52. Tam, T.Y.C.; Sivarajkumar, S.; Kapoor, S.; Stolyar, A.V.; Polanska, K.; McCarthy, K.R.; Osterhoudt, H.; Wu, X.; Visweswaran, S.; Fu, S.; et al. A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review. npj Digit. Med. 2024, 7, 1–20. [Google Scholar] [CrossRef]
  53. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 2024, 15, 39. [Google Scholar] [CrossRef]
  54. Peng, J.-L.; Cheng, S.; Diau, E.; Shih, Y.-Y.; Chen, P.-H.; Lin, Y.-T.; Chen, Y.-N. A Survey of Useful LLM Evaluation. arXiv 2024, arXiv:2406.00936. [Google Scholar]
  55. Qualtrics. Qualtrics XM. Provo (UT): Qualtrics. Available online: https://www.qualtrics.com (accessed on 11 March 2025).
  56. Mistral AI. Mistral Large. Version 2407. Mistral AI: Paris, France. Available online: https://mistral.ai (accessed on 11 March 2025).
  57. McKinney, W. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, Austin, TX, USA, 28 June–3 July 2010; pp. 56–61. [Google Scholar]
  58. Saari, D.G. Selecting a voting method: The case for the Borda count. Const. Political Econ. 2023, 34, 357–366. [Google Scholar] [CrossRef]
  59. Clarke, V.; Braun, V. Thematic analysis. J. Posit. Psychol. 2017, 12, 297–298. [Google Scholar] [CrossRef]
  60. Sheinis, M.; Selk, A. Development of the Adult Vulvar Lichen Sclerosus Severity Scale—A Delphi Consensus Exercise for Item Generation. J. Low. Genit. Tract. Dis. 2018, 22, 66–73. [Google Scholar] [CrossRef]
  61. Bauer, S.M.; Fusté, A.; Andrés, A.; Saldaña, C. The Barcelona Orthorexia Scale (BOS): Development process using the Delphi method. Eat. Weight. Disord.—Stud. Anorex. Bulim. Obes. 2019, 24, 247–255. [Google Scholar] [CrossRef] [PubMed]
  62. Xin, T.; Ding, X.; Gao, H.; Li, C.; Jiang, Y.; Chen, X. Using Delphi method to develop Chinese women’s cervical cancer screening intention scale based on planned behavior theory. BMC Womens Health 2022, 22, 512. [Google Scholar] [CrossRef] [PubMed]
  63. Scott, V.C.; Temple, J.; Jillani, Z. Development of the Technical Assistance Engagement Scale: A modified Delphi study. Implement. Sci. Commun. 2024, 5, 84. [Google Scholar] [CrossRef] [PubMed]
  64. World Health Organization. Doing What Matters in Times of Stress: An Illustrated Guide; World Health Organization: Geneva, Switzerland, 2020. [Google Scholar]
  65. Cronbach, L.J. Coefficient Alpha and the Internal Structure of Tests. Psychometrika 1951, 16, 297–334. [Google Scholar] [CrossRef]
  66. Guilford, J.P. The Correlation of an Item With a Composite of the Remaining Items in a Test. Educ. Psychol. Meas. 1953, 13, 87–93. [Google Scholar] [CrossRef]
  67. Tavakol, M.; Dennick, R. Making sense of Cronbach’s alpha. Int. J. Med. Educ. 2011, 2, 53–55. [Google Scholar] [CrossRef]
  68. Röschel, A.; Wagner, C.; Dür, M. Examination of validity, reliability, and interpretability of a self-reported questionnaire on Occupational Balance in Informal Caregivers (OBI-Care)—A Rasch analysis. PLoS ONE 2021, 16, e0261815. [Google Scholar] [CrossRef]
  69. Zieve, G.G.; Sarfan, L.D.; Dong, L.; Tiab, S.S.; Tran, M.; Harvey, A.G. Cognitive Therapy-as-Usual versus Cognitive Therapy plus the Memory Support Intervention for adults with depression: 12-month outcomes and opportunities for improved efficacy in a secondary analysis of a randomized controlled trial. Behav. Res. Ther. 2023, 170, 104419. [Google Scholar] [CrossRef]
  70. Dong, L.; Zieve, G.; Gumport, N.B.; Armstrong, C.C.; Alvarado-Martinez, C.G.; Martinez, A.; Howlett, S.; Fine, E.; Tran, M.; McNamara, M.E.; et al. Can integrating the Memory Support Intervention into cognitive therapy improve depression outcome? A randomized controlled trial. Behav. Res. Ther. 2022, 157, 104167. [Google Scholar] [CrossRef]
  71. Jo, E.; Jeong, Y.; Park, S.; Epstein, D.A.; Kim, Y.H. Understanding the Impact of Long-Term Memory on Self-Disclosure with Large Language Model-Driven Chatbots for Public Health Intervention. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; ACM: New York, NY, USA, 2024; pp. 1–21. [Google Scholar]
  72. ISO/IEC TS 25010:2023(en); Systems and Software Engineering—Systems and Software Quality Requirements and Evaluation (SQuaRE): Product Quality Model. ISO: Geneva, Switzerland, 2023.
  73. Ouhbi, S.; Idri, A.; Fernández-Alemán, J.L.; Toval, A.; Benjelloun, H. Applying ISO/IEC 25010 on Mobile Personal Health Records. In Proceedings of the BIOSTEC 2015: Proceedings of the International Joint Conference on Biomedical Engineering Systems and Technologies, Lisbon, Portugal, 12–15 January 2015; SCITEPRESS—Science and Technology Publications: Setubal, Portugal, 2015; pp. 405–412. [Google Scholar]
  74. Blut, M.; Wang, C.; Wünderlich, N.V.; Brock, C. Understanding anthropomorphism in service provision: A meta-analysis of physical robots, chatbots, and other AI. J. Acad. Mark. Sci. 2021, 49, 632–658. [Google Scholar] [CrossRef]
  75. Eyssel, F.; Reich, N. Loneliness makes the heart grow fonder (of robots)—On the effects of loneliness on psychological anthropomorphism. In Proceedings of the 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Tokyo, Japan, 3–6 March 2013; pp. 121–122. [Google Scholar]
  76. Jessup, S.; Schneider, T.; Alarcon, G.; Ryan, T.; Capiola, A. The Measurement of the Propensity to Trust Technology. Master’s Thesis, Wright State University, Dayton, OH, USA, 2018. [Google Scholar]
Figure 1. Overview of the process and activities carried out during the eDelphi study.
Table 1. Demographic and professional characteristics of the expert panel.
Characteristic | Category | Value or % (n)
Age | — | M = 34.50 (SD = 10.66)
Gender | Male | 56.25% (9)
Gender | Female | 43.75% (7)
Education | Bachelor | 12.50% (2)
Education | Master | 50.00% (8)
Education | Doctorate | 31.25% (5)
Education | PsyD Specialization | 6.25% (1)
Area of expertise | Psychology | 31.25% (5)
Area of expertise | Artificial Intelligence | 31.25% (5)
Area of expertise | Human–Computer Interaction | 18.75% (3)
Area of expertise | Digital Therapeutics | 18.75% (3)
Occupation | Researcher | 50.00% (8)
Occupation | Developer (AI) | 37.50% (6)
Occupation | Psychologist | 12.50% (2)
Job seniority | 3–5 years | 50.00% (8)
Job seniority | 6–10 years | 18.75% (3)
Job seniority | 11–15 years | 12.50% (2)
Job seniority | 16–20 years | 0.00% (0)
Job seniority | 21+ years | 18.75% (3)
Country | Italy | 100% (16)
Table 2. Final version of the CES-LCC. Each item is shown in English followed by its Italian version.
Dimension | Item (English / Italian) | Priority
Understanding requests [UR] | The chatbot consistently understands what I am saying and asking. / Il chatbot capisce ciò che sto dicendo e chiedendo. | 1
[UR] | The chatbot is able to make adequate inferences based on my messages. / Il chatbot è in grado di fare deduzioni appropriate basandosi sui miei messaggi. | 2
[UR] | The chatbot asks specific questions to better understand my requests. / Il chatbot fa domande specifiche per capire meglio le mie richieste. | 3
Providing helpful information [PHI] | The chatbot provides accurate information. / Il chatbot fornisce informazioni accurate. | 1
[PHI] | The chatbot provides helpful information. / Il chatbot fornisce informazioni utili. | 2
[PHI] | The chatbot provides information grounded in theory and scientific literature. / Il chatbot fornisce informazioni supportate da teorie e letteratura scientifica. | 3
Clarity and relevance of responses [CRR] | The chatbot’s responses are clear and easy to understand. / Le risposte del chatbot sono chiare e semplici da capire. | 1
[CRR] | The chatbot’s responses are adequately concise. / Le risposte del chatbot sono sufficientemente concise. | 2
[CRR] | The chatbot’s responses are irrelevant to my questions. / Le risposte del chatbot non sono pertinenti alle mie domande. | 3
Language quality [LQ] | The chatbot uses correct grammar and spelling in its responses. / Il chatbot fornisce risposte grammaticalmente e ortograficamente corrette. | 1
[LQ] | The chatbot’s language is appropriate for the context. / Il linguaggio del chatbot è appropriato al contesto. | 2
[LQ] | The chatbot’s language style sounds natural. / Lo stile linguistico del chatbot suona naturale. | 3
Trust [T] | I feel safe sharing my personal matters with the chatbot. / Mi sento al sicuro nel condividere questioni personali con il chatbot. | 1
[T] | I believe that the feedback and the information provided by the chatbot are trustworthy. / Credo che i feedback e le informazioni fornite dal chatbot siano affidabili. | 2
[T] | I believe the chatbot is transparent about its limitations and capabilities. / Credo che il chatbot sia trasparente riguardo ai suoi limiti e alle sue capacità. | 3
Emotional support [ES] | The chatbot makes me feel heard and understood. / Il chatbot mi fa sentire ascoltato e capito. | 1
[ES] | The chatbot’s responses feel empathetic and supportive. / Le risposte del chatbot risultano empatiche e supportive. | 2
[ES] | The chatbot’s responses can make me feel reassured. / Le risposte del chatbot sono in grado di farmi sentire rassicurato. | 3
Guidance and direction [GD] | The chatbot provides adjusted guidance in coping with my problems. / Il chatbot fornisce indicazioni personalizzate per aiutarmi a gestire i miei problemi. | 1
[GD] | The chatbot encourages me to take positive steps. / Il chatbot mi incoraggia a compiere azioni costruttive. | 2
[GD] | The chatbot helps me set realistic and achievable goals. / Il chatbot mi aiuta a stabilire obiettivi realistici e raggiungibili. | 3
Memory [M] | The chatbot accurately recalls details from previous conversations. / Il chatbot ricorda accuratamente i dettagli delle conversazioni precedenti. | 1
[M] | The chatbot maintains consistency by integrating past interactions into current responses. / Il chatbot integra coerentemente le interazioni passate nelle risposte. | 2
[M] | The chatbot adapts its advice based on information provided in earlier sessions. / Il chatbot adatta i suoi consigli in base alle informazioni fornite nelle sessioni precedenti. | 3
Overall satisfaction [OS] | I am overall satisfied with the usability of this chatbot. / Nel complesso, sono soddisfatto dell’usabilità di questo chatbot. | 1
[OS] | Overall, I feel that my interactions with the chatbot were worthwhile. / Nel complesso, trovo che le mie interazioni con il chatbot siano state proficue. | 2
[OS] | I am overall satisfied with the support provided by this chatbot. / Nel complesso, sono soddisfatto del supporto offerto da questo chatbot. | 3
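Although Table 2 itself does not prescribe a response format, scale items of this kind are typically answered on a Likert scale. The sketch below is one plausible scoring scheme, assuming 1–5 responses, hypothetical column names (UR1 ... OS3), and reverse-scoring of the one negatively worded CRR item (“The chatbot’s responses are irrelevant to my questions.”); none of these conventions are taken from the paper.

```python
import pandas as pd

# Hypothetical responses from two users; columns UR1...OS3 would hold the
# 27 items' 1-5 Likert answers (only two dimensions shown for brevity).
responses = pd.DataFrame({
    "UR1": [5, 4], "UR2": [4, 4], "UR3": [5, 3],
    "CRR1": [5, 4], "CRR2": [4, 4], "CRR3": [1, 2],  # CRR3 = "irrelevant" item
})

# Reflect the reverse-keyed item on a 1-5 scale: x -> 6 - x.
responses["CRR3"] = 6 - responses["CRR3"]

# Dimension score = mean of its items; an overall score averages all items.
dimensions = {"UR": ["UR1", "UR2", "UR3"], "CRR": ["CRR1", "CRR2", "CRR3"]}
scores = pd.DataFrame({d: responses[cols].mean(axis=1)
                       for d, cols in dimensions.items()})
scores["Overall"] = responses.mean(axis=1)
print(scores)
```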
Table 3. Summary of inter-item correlations, item-total correlations, and Cronbach’s α values for all the dimensions of the scale.
Dimension | Inter-Item Correlation: Mean | Inter-Item Correlation: Range | Item-Total Correlation: Mean | Item-Total Correlation: Range | Cronbach’s α
UR | 0.42 | 0.28–0.54 | 0.50 | 0.41–0.61 | 0.68
PHI | 0.58 | 0.44–0.73 | 0.65 | 0.55–0.76 | 0.79
CRR | 0.28 | 0.02–0.71 | 0.33 | 0.06–0.54 | 0.47
LQ | 0.40 | 0.23–0.61 | 0.47 | 0.31–0.63 | 0.63
T | 0.55 | 0.41–0.74 | 0.63 | 0.49–0.73 | 0.78
ES | 0.77 | 0.70–0.86 | 0.82 | 0.76–0.88 | 0.91
GD | 0.48 | 0.33–0.73 | 0.54 | 0.38–0.64 | 0.71
M | 0.55 | 0.45–0.65 | 0.63 | 0.55–0.71 | 0.78
OS | 0.75 | 0.72–0.77 | 0.80 | 0.78–0.82 | 0.90
Overall | N/A | N/A | N/A | N/A | 0.94
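The reliability indices in Table 3 follow their classical definitions: Cronbach’s α [65] and the corrected item-total correlation, i.e., the correlation of an item with the sum of the remaining items [66]. A self-contained sketch on simulated 1–5 ratings (the data below are synthetic, not the study’s):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def corrected_item_total(items: np.ndarray) -> np.ndarray:
    """Correlation of each item with the sum of the remaining items."""
    return np.array([
        np.corrcoef(items[:, i], np.delete(items, i, axis=1).sum(axis=1))[0, 1]
        for i in range(items.shape[1])
    ])

# Simulate 49 respondents answering a 3-item dimension on a 1-5 scale:
# a shared latent factor plus item-specific noise, rounded and clipped.
rng = np.random.default_rng(42)
trait = rng.normal(size=(49, 1))
noise = 0.6 * rng.normal(size=(49, 3))
ratings = np.clip(np.rint(3.5 + trait + noise), 1, 5)

print("alpha =", round(cronbach_alpha(ratings), 2))
print("item-total r =", corrected_item_total(ratings).round(2))
```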
