Development of a Comprehensive Evaluation Scale for LLM-Powered Counseling Chatbots (CES-LCC) Using the eDelphi Method
Abstract
1. Introduction
1.1. Background
1.2. Related Works
1.3. Aim
2. Materials and Methods
2.1. Participants
2.2. Procedure
2.2.1. Preparatory Phase
2.2.2. eDelphi Rounds
First Round
Second Round
2.2.3. Data Processing and Analysis
Algorithm 1. Add, Modify, Drop (ADM) algorithm
Require: D: dimensions; the items in each dimension d; quantitative thresholds; Q: qualitative feedback.
Ensure: Refined scale with updated dimensions and items.
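The Add, Modify, Drop logic named in Algorithm 1 can be illustrated with a minimal Python sketch. The threshold values (`MIN_PCT_AGREE`, `MAX_IQR`), the `Item` class, and the functions `decide` and `refine_scale` are hypothetical names and placeholders introduced only for illustration; the actual thresholds and decision rules of the study are not reproduced here, and the appendix decisions show that qualitative feedback could override purely quantitative criteria.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder thresholds: assumptions for illustration, not the study's values.
MIN_PCT_AGREE = 75.0  # minimum % of experts rating an item as relevant
MAX_IQR = 1.5         # maximum interquartile range tolerated for relevance


@dataclass
class Item:
    text: str
    pct_agree: float                 # % Agree (relevance)
    iqr: float                       # IQR of the relevance ratings
    rewording: Optional[str] = None  # rewording taken from qualitative feedback


def decide(item: Item) -> tuple[str, list[str]]:
    """Return (action, motivations) for a single item."""
    motivations = []
    if item.pct_agree < MIN_PCT_AGREE:
        motivations.append("% Agree Relevance")
    if item.iqr > MAX_IQR:
        motivations.append("IQR Relevance")
    if motivations:
        return "drop", motivations
    if item.rewording is not None:
        return "modify", ["QL Feedback"]
    return "keep", []


def refine_scale(scale: dict[str, list[Item]],
                 suggested: dict[str, list[str]]) -> dict[str, list[str]]:
    """Apply Drop and Modify to existing items, then Add suggested items."""
    refined: dict[str, list[str]] = {}
    dims = list(scale) + [d for d in suggested if d not in scale]
    for dim in dims:
        kept = []
        for item in scale.get(dim, []):
            action, _ = decide(item)
            if action == "keep":
                kept.append(item.text)
            elif action == "modify":
                kept.append(item.rewording)
        kept.extend(suggested.get(dim, []))  # Add step (items or new dimensions)
        if kept:  # a dimension left without items is dropped entirely
            refined[dim] = kept
    return refined


# Example using Round 1 figures reported in Appendix A.1 for two dimensions.
round1 = {
    "UR": [
        Item("The chatbot consistently understands what I am saying or asking.", 100.0, 0.0),
        Item("I have to rephrase my requests very often for the chatbot to understand.", 62.5, 1.25),
    ],
    "EUI": [
        Item("The chatbot is easy to interact with.", 62.5, 2.0),
        Item("Using the chatbot is frustrating or requires too many tentative interactions.", 62.5, 2.0),
    ],
}
additions = {"M": ["The chatbot accurately recalls key details from previous conversations."]}
refined = refine_scale(round1, additions)
```

Under these placeholder thresholds the sketch keeps the first UR item, drops the second, removes the EUI dimension altogether, and adds the new M dimension. Treating "drop" as taking precedence over "modify" is a design choice of the sketch, not a documented rule of the study.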
2.2.4. Conclusion and Reporting
2.3. Initial Validation in Real-World
3. Results
3.1. Demographic Description of Experts
3.2. First Round
3.3. Second Round
3.4. Initial Validation
4. Discussion
4.1. Implications
4.2. Limitations and Future Research
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
Appendix A.1. Round 1 Results Overview (Items)
Dim | Item | Relevance: % Agree | Relevance: M (SD) | Relevance: Mdn (R) | Relevance: IQR | Priority: M (SD) | Priority: Mdn (R) | Priority: IQR | Sample QL Feedback |
UR | The chatbot consistently understands what I am saying or asking. | 100.00 | 4.88 (0.34) | 5 (4–5) | 0.00 | 4.88 (0.34) | 5 (4–5) | 0.00 | - I’d add another item about the tone (e.g., The chatbot understands the tone of my request) - I believe an important feature is not only that it understands but also infers from the sentence |
UR | I have to rephrase my requests very often for the chatbot to understand. | 62.50 | 3.69 (1.14) | 4 (1–5) | 1.25 | 3.88 (0.89) | 4 (3–5) | 2.00 | N/A |
PHI | The chatbot provides accurate and helpful information. | 87.50 | 4.44 (1.09) | 5 (1–5) | 1.00 | 4.38 (1.15) | 5 (1–5) | 1.00 | - I would divide the question into two separate questions. Answers might be accurate, but not helpful and vice versa, so (1) the chatbot provides accurate information; (2) the chatbot provides helpful information - I think it’s important to add that the information it provides is grounded in theoretical frameworks and scientific literature. |
PHI | The chatbot often provides incorrect or incomplete information. | 75.00 | 3.94 (1.18) | 4 (1–5) | 1.25 | 3.69 (1.25) | 4 (1–5) | 0.75 | N/A |
CRR | The chatbot’s responses are clear, concise, and easy to understand. | 81.25 | 4.06 (1.00) | 4 (2–5) | 1.00 | 4.00 (1.03) | 4 (2–5) | 1.25 | - Conciseness and easiness are very different dimensions, I don’t feel like they should be evaluated together - I’d add an item about verbosity (e.g., The chatbot adds superfluous information related to the query) |
CRR | The chatbot’s responses are confusing or irrelevant to my questions. | 81.25 | 4.00 (1.32) | 4 (1–5) | 1.00 | 3.94 (1.12) | 4 (1–5) | 1.00 | - I’d split the item in two (confusing/irrelevant) |
EUI | The chatbot is easy to interact with. | 62.50 | 3.94 (1.00) | 4 (2–5) | 2.00 | 3.63 (1.31) | 3.5 (1–5) | 2.00 | - I have no better way. My doubt is about the definition of easy. What does it mean? How can someone evaluate this dimension? |
EUI | Using the chatbot is frustrating or requires too many tentative interactions. | 62.50 | 3.81 (1.11) | 4 (1–5) | 2.00 | 3.56 (1.21) | 3.5 (2–5) | 2.25 | - The item is not very clear; does it refer to access or the actual internal use of the LLM? |
LQ | The chatbot uses correct grammar and spelling in its responses. | 75.00 | 3.88 (1.02) | 4 (1–5) | 0.50 | 3.50 (1.21) | 4 (1–5) | 1.00 | N/A |
LQ | The chatbot’s language style is natural and appropriate for the context. | 68.75 | 4.00 (0.97) | 4 (2–5) | 2.00 | 3.50 (1.37) | 3.5 (1–5) | 2.00 | - I would divide this item into 2 items: (1) The chatbot’s language style is/seems natural; (2) The chatbot’s language is appropriate for the context. |
T | I believe the chatbot has my best interests at heart. | 43.75 | 3.13 (1.50) | 3 (1–5) | 2.25 | 3.13 (1.50) | 3 (1–5) | 2.25 | - I would avoid formulations that suggest that the chatbot might have a cognition. - I believe that it is tricky to talk about the chatbot as if it has agency and conscience also because it might lead to overreliance and/or excessive anthropomorphization of the tool. |
T | I am willing to rely on the chatbot in the future. | 62.50 | 3.81 (0.91) | 4 (2–5) | 1.25 | 3.50 (1.15) | 3.5 (2–5) | 1.50 | N/A |
ES | The chatbot makes me feel heard and understood. | 75.00 | 4.06 (1.12) | 4 (1–5) | 1.25 | 3.94 (1.18) | 4 (1–5) | 1.25 | - I’d add an item about sense of humor (e.g., The chatbot shows to have a sense of humor when required) |
ES | The chatbot’s responses feel empathetic and supportive. | 87.50 | 4.19 (0.98) | 4 (2–5) | 1.00 | 4.00 (0.89) | 4 (1–5) | 0.00 | - I’d add an item about feeling reassured (e.g., The chatbot’s inputs and responses can make me feel reassured) |
GD | The chatbot provides helpful advice and suggestions for coping with my problems. | 93.75 | 4.38 (0.62) | 4 (3–5) | 1.00 | 4.25 (0.68) | 4 (3–5) | 1.00 | - This question might crossload into the “Providing helpful information” factor. However, I would keep it, because in this case it specifically talks about coping, but maybe phrasing it as follows: The chatbot provides adjusted guidance in coping with my problems. |
GD | The chatbot encourages me to take positive steps towards my goals. | 75.00 | 3.88 (0.81) | 4 (2–5) | 0.25 | 3.81 (0.83) | 4 (2–5) | 1.00 | - I believe it is important to assess an individual’s goals carefully. For example, a person with an eating disorder might set a goal to lose an extreme amount of weight, which is unhealthy. Therefore, it’s crucial to remember that a patient’s goals are not always the best for their well-being. |
OS | I am overall satisfied with the usability of this chatbot. | 87.50 | 4.44 (0.73) | 5 (3–5) | 1.00 | 4.44 (0.73) | 5 (3–5) | 1.00 | - I want to point out that not only the usability but also the effectiveness in helping is important. |
OS | I would not recommend this chatbot to others due to usability issues. | 75.00 | 3.88 (1.15) | 4 (1–5) | 1.25 | 3.88 (1.31) | 4 (1–5) | 2.00 | - This statement sounds somehow redundant to the first one in terms that lower scores on statement 1 seem to be almost equivalent to high
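The summary columns of the table above (% Agree, M (SD), Mdn (R), IQR) can be derived from a vector of 1 to 5 ratings. The sketch below is illustrative only: the agreement cutoff of 4, the use of the sample standard deviation, the inclusive quartile method, and the rating vector itself are assumptions, since the raw expert ratings and the exact computation are not given here. The function name `summarize` is hypothetical.

```python
import statistics

AGREE_CUTOFF = 4  # assumption: ratings of 4 or 5 count as agreement


def summarize(ratings: list[int]) -> dict:
    """Summarize Likert ratings with the statistics reported per item."""
    # Quartiles with the 'inclusive' method; other quartile conventions
    # can give slightly different IQR values on small panels.
    q1, _, q3 = statistics.quantiles(ratings, n=4, method="inclusive")
    return {
        "pct_agree": 100 * sum(r >= AGREE_CUTOFF for r in ratings) / len(ratings),
        "mean": statistics.mean(ratings),
        "sd": statistics.stdev(ratings),  # sample SD (n - 1 denominator)
        "median": statistics.median(ratings),
        "range": (min(ratings), max(ratings)),
        "iqr": q3 - q1,
    }


# Hypothetical panel of 16 ratings: fourteen 5s and two 4s.
example = summarize([5] * 14 + [4] * 2)
```

For this hypothetical vector the summary gives 100% agreement, a mean of 4.875, a sample SD of about 0.34, a median of 5 with range 4 to 5, and an IQR of 0, which is consistent with the pattern reported for the first UR item.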
Appendix A.2. Round 1 Qualitative Feedback (General and New Dimensions)
Feedback Type | Content |
New dimensions | - I think assessing memory quality is crucial when dealing with real-world implementations using LLMs. So, I suggest adding this dimension to the assessment. - I believe there are missing items related to the perception of how the chatbot handles my privacy and data security, such as how it shares my personal information with third parties. - In my opinion, privacy and data security, especially regarding how the chatbot shares personal information, are important aspects not included in the items. However, these seem more tied to production or implementation and might be better addressed in a separate, dedicated evaluation. - Items related to data privacy and security might be relevant in this scenario. However, in my experience, these items are more aligned with production or implementation processes and might be better addressed under regulations like the EU AI Act and GDPR. |
General | - Almost all the statements sound actually very relevant, I provided some lower scores in some of them just to distinguish the ones I think are most relevant but in general, all are relevant! - I felt like all of the shown questions were relevant in some way: that’s why some evaluations were a bit harsh, just so that I could express what is more relevant from my point of view. Anyway, the questions were all pretty clear - An important consideration for real-life implementation of LLM-powered chatbots is ensuring accessibility for a wide range of users. This includes compatibility with various devices, such as smartphones, tablets, and computers, to meet diverse user needs. Additionally, designing the chatbot to be inclusive is crucial—for example, allowing users to specify preferred names and pronouns to support transgender and gender-diverse individuals, and incorporating features like colorblind-friendly graphics or text presentation options to assist users with visual impairments or reading difficulties. These steps can significantly enhance user experience and inclusivity. - Psychometrically, factors should have at least 3 items to be considered reliable, with 2 items it is not even possible to calculate internal consistency |
Appendix A.3. Round 1 Decision
Dim | Add (Motivation) | Modify (Motivation) | Drop (Motivation) |
UR | - “The chatbot understands the tone of my request” (QL Feedback) - “The chatbot asks specific questions to better understand my requests” (QL Feedback) - “The chatbot infers information from my messages” (QL Feedback) | Nothing | - “I have to rephrase my requests very often for the chatbot to understand”. (% Agree Relevance) |
PHI | - “The chatbot provides information grounded in theory and scientific literature”. (QL Feedback) - “The chatbot provides references”. (QL Feedback) | - Split the item “The chatbot provides accurate and helpful information”. into “The chatbot provides accurate information” and “The chatbot provides helpful information” (QL Feedback) - Split the item “The chatbot often provides incorrect or incomplete information”. into “The chatbot often provides incorrect information”. and “The chatbot often provides incomplete information”. (QL Feedback) | Nothing |
CRR | - “The chatbot adds superfluous information related to the query” (QL Feedback) | - Split the item “The chatbot’s responses are clear, concise, and easy to understand”. into two different items “The chatbot’s responses are clear, and easy to understand”, “The chatbot’s responses are adequately concise” (QL Feedback) - Split the item “The chatbot’s responses are confusing or irrelevant to my questions”. into “The chatbot’s responses are confusing” and “The chatbot’s responses are irrelevant to my questions”. (QL Feedback) | Nothing |
EUI | Nothing | Nothing | - The entire dimension (% Agree Relevance, IQR Relevance, QL Feedback) |
LQ | Nothing | - Split the item “The chatbot’s language style is natural and appropriate for the context”. into “The chatbot’s language style is/seems natural” and “The chatbot’s language is appropriate for the context”. (% Agree Relevance, IQR Relevance, QL Feedback) | Nothing |
T | - “I feel safe sharing my personal matters with the chatbot” (QL Feedback) - “I believe that the feedback/information provided by the chatbot are trustworthy” (QL Feedback) - “I believe the chatbot is transparent about its limitations and capabilities”. (QL Feedback) | Nothing | - “I believe the chatbot has my best interests at heart” (% Agree Relevance, IQR Relevance, QL Feedback) - “I am willing to rely on the chatbot in the future”. (% Agree Relevance) |
ES | - “The chatbot’s responses can make me feel reassured” (QL Feedback) - “The chatbot shows to have a sense of humor when required” (QL Feedback) | Nothing | Nothing |
GD | - “The chatbot helps me set realistic and achievable goals”. (QL Feedback) | - Modify “The chatbot provides helpful advice and suggestions for coping with my problems”. into “The chatbot provides adjusted guidance in coping with my problems” to avoid cross-loading with another factor (QL Feedback) - Modify “The chatbot encourages me to take positive steps towards my goals”. into “The chatbot encourages me to take positive steps”. (QL Feedback) | Nothing |
OS | - “I am overall satisfied with the effectiveness of this chatbot” (QL Feedback) - “I feel that my interactions with the chatbot were worthwhile”. (QL Feedback) | Nothing | - “I would not recommend this chatbot to others due to usability issues”. (Redundancy, QL Feedback) |
M [New] | - “The chatbot accurately recalls key details from previous conversations”. (QL Feedback) - “The chatbot maintains consistency by integrating past interactions into current responses”. (QL Feedback) - “The chatbot adapts its advice based on information provided in earlier sessions”. (QL Feedback) | Nothing | Nothing |
Appendix B
Appendix B.1. Round 2 Results Overview (Items)
Dim | Item (Italian) | Relevance: % Agree | Relevance: M (SD) | Relevance: Mdn (R) | Relevance: IQR | Redundancy: Flags | Priority: Points | Translation Quality: % Agree | Translation Quality: M (SD) | Sample QL Feedback |
UR | The chatbot consistently understands what I am saying or asking. (Il chatbot capisce sempre ciò che sto dicendo o chiedendo.) | 100.00 | 4.87 (0.35) | 5 (4–5) | 0.00 | 1 | 56 | 73.33 | 4.73 (0.47) | - I’d prefer “The chatbot consistently understands what I am saying AND asking”. The “or” makes it hard to trust high scores. [Content] - Toglierei il “sempre” che in italiano potrebbe inserire un dubbio invece che rafforzare [Translation] |
UR | The chatbot understands the tone of my request. (Il chatbot capisce il tono della mia richiesta.) | 73.33 | 4.07 (0.80) | 4 (3–5) | 1.50 | 3 | 27 | 73.33 | 4.67 (0.65) | - It is not clear to me what “understanding the tone” means here [Content] |
UR | The chatbot asks specific questions to better understand my requests. (Il chatbot fa domande specifiche per capire meglio le mie richieste.) | 86.67 | 4.20 (0.68) | 4 (3–5) | 1.00 | 0 | 30 | 80.00 | 4.92 (0.29) | N/A |
UR | The chatbot infers information from my messages. (Il chatbot inferisce informazioni dai miei messaggi.) | 80.00 | 4.43 (0.94) | 5 (2–5) | 1.00 | 2 | 37 | 60.00 | 4.25 (1.06) | - The term “infer” is a bit ambiguous; I would suggest revising it as follows: “The chatbot is able to make adequate inferences based on my messages”. [Content] - Il chatbot deduce informazioni dai miei messaggi [Translation] |
PHI | The chatbot provides accurate information. (Il chatbot fornisce informazioni accurate.) | 86.67 | 4.47 (1.09) | 5 (2–5) | 1.00 | 2 | 81 | 80.00 | 4.83 (0.39) | N/A |
PHI | The chatbot provides helpful information. (Il chatbot fornisce informazioni utili.) | 100.00 | 4.80 (0.41) | 5 (4–5) | 0.00 | 0 | 68 | 80.00 | 4.92 (0.29) | N/A |
PHI | The chatbot often provides incorrect information. (Il chatbot fornisce spesso informazioni errate.) | 80.00 | 4.13 (1.13) | 4 (1–5) | 1.00 | 4 | 52 | 80.00 | 4.92 (0.29) | N/A |
PHI | The chatbot often provides incomplete information. (Il chatbot fornisce spesso informazioni incomplete.) | 73.33 | 4.00 (0.76) | 4 (3–5) | 1.00 | 1 | 53 | 80.00 | 4.92 (0.29) | N/A |
PHI | The chatbot provides information grounded in theory and scientific literature. (Il chatbot fornisce informazioni basate su teorie e letteratura scientifica.) | 80.00 | 4.13 (1.13) | 4 (1–5) | 1.00 | 3 | 35 | 73.33 | 4.50 (0.67) | - Il chatbot fornisce informazioni supportate da teorie e letteratura [Translation] |
PHI | The chatbot provides references. (Il chatbot fornisce riferimenti bibliografici.) | 66.67 | 3.60 (1.06) | 4 (1–5) | 1.00 | 5 | 26 | 60.00 | 4.42 (0.90) | - I don’t think it’s crucial for users to have a research paper attached to questions such as “I feel bad lately, I can’t sleep”. It would make the UX poorer in my opinion. This would make more sense if you are building a search engine kind of system. [Content] - Il chatbot fornisce riferimenti alle fonti utilizzate [Translation] |
CRR | The chatbot’s responses are clear, and easy to understand. (Le risposte del chatbot sono chiare e facili da capire.) | 100.00 | 4.93 (0.26) | 5 (4–5) | 0.00 | 0 | 57 | 73.33 | 4.75 (0.62) | - Le risposte del chatbot sono chiare e semplici da capire [Translation] |
CRR | The chatbot’s responses are adequately concise. (Le risposte del chatbot sono sufficientemente concise.) | 80.00 | 4.13 (0.74) | 4 (3–5) | 1.00 | 1 | 53 | 80.00 | 4.75 (0.45) | N/A |
CRR | The chatbot’s responses are confusing. (Le risposte del chatbot sono confondenti.) | 93.33 | 4.07 (0.96) | 4 (1–5) | 0.50 | 5 | 50 | 53.33 | 3.83 (1.19) | - This is just the reverse of clear [Content] - Le risposte del chat mi confondono [Translation] |
CRR | The chatbot’s responses are irrelevant to my questions. (Le risposte del chatbot non sono pertinenti alle mie domande.) | 100.00 | 4.73 (0.46) | 5 (4–5) | 0.50 | 1 | 42 | 80.00 | 4.92 (0.29) | N/A |
CRR | The chatbot adds superfluous information related to the query. (Il chatbot aggiunge informazioni superflue relative alla richiesta.) | 53.33 | 3.47 (0.92) | 4 (1–5) | 1.00 | 8 | 23 | 60.00 | 4.45 (1.29) | - Il chatbot aggiunge informazioni superflue rispetto alla richiesta. [Translation] |
LQ | The chatbot uses correct grammar and spelling in its responses. (Il chatbot fornisce risposte con grammatica e ortografia corrette.) | 80.00 | 3.73 (1.22) | 4 (1–5) | 0.00 | 2 | 34 | 60.00 | 4.42 (0.90) | - Il chatbot fornisce risposte grammaticalmente e ortograficamente corrette. [Translation] |
LQ | The chatbot’s language style is/seems natural. (Lo stile linguistico del chatbot è/sembra naturale.) | 86.67 | 4.47 (0.74) | 5 (3–5) | 1.00 | 0 | 23 | 66.67 | 4.67 (0.78) | - “The chatbot’s language style sounds natural” and seems more fluent [Content] - Lo stile linguistico del chatbot suona naturale [Translation] |
LQ | The chatbot’s language is appropriate for the context. (Il linguaggio del chatbot è appropriato per il contesto.) | 86.67 | 4.57 (0.65) | 5 (3–5) | 1.00 | 0 | 33 | 73.33 | 4.58 (0.67) | - Il linguaggio del chatbot è appropriato al contesto. [Translation] |
T | I feel safe sharing my personal matters with the chatbot. (Mi sento al sicuro nel condividere questioni personali con il chatbot.) | 93.33 | 4.53 (1.06) | 5 (1–5) | 0.50 | 0 | 37 | 80.00 | 4.75 (0.45) | N/A |
T | I believe the chatbot is transparent about its limitations and capabilities. (Credo che il chatbot sia trasparente riguardo alle sue limitazioni e capacità.) | 75.00 | 4.13 (0.83) | 4 (3–5) | 1.50 | 0 | 21 | 53.33 | 4.18 (1.08) | - Credo che il chatbot sia trasparente riguardo ai suoi limiti e alle sue capacità [Translation] |
T | I believe that the feedback/information provided by the chatbot is trustworthy. (Credo che i feedback/le informazioni fornite dal chatbot siano affidabili.) | 93.33 | 4.67 (0.82) | 5 (2–5) | 0.00 | 2 | 32 | 66.67 | 4.90 (0.32) | - Mettere una e invece che la slash/[Translation] |
ES | The chatbot makes me feel heard and understood. (Il chatbot mi fa sentire ascoltato e capito.) | 86.67 | 4.20 (1.08) | 4 (1–5) | 1.00 | 1 | 53 | 66.67 | 4.80 (0.42) | - I would drop this. I feel like this evaluates how the system can trick the user in terms of feeling like they are talking to someone who listens to them and understands, while an LLM obviously cannot do that. [Content] |
ES | The chatbot’s responses feel empathetic and supportive. (Le risposte del chatbot sembrano empatiche e di supporto.) | 93.33 | 4.60 (0.63) | 5 (3–5) | 1.00 | 1 | 43 | 53.33 | 4.10 (1.29) | - I think this is different from the previous one because it focuses on the “look” of the answers more than on the ability to convince the user of something. This is something that makes sense to evaluate I think [Content] - “e supportive” invece che “di supporto” [Translation] |
ES | The chatbot’s responses can make me feel reassured (Le risposte del chatbot sono in grado di farmi sentire rassicurato.) | 80.00 | 4.20 (0.77) | 4 (3–5) | 1.00 | 3 | 34 | 60.00 | 4.70 (0.67) | N/A |
ES | The chatbot shows to have a sense of humor when required (Il chatbot dimostra di avere senso dell’umorismo quando necessario.) | 60.00 | 3.20 (1.42) | 4 (1–5) | 2.00 | 4 | 20 | 60.00 | 4.70 (0.67) | N/A |
GD | The chatbot provides adjusted guidance in coping with my problems. (Il chatbot mi fornisce indicazioni adeguate per affrontare i problemi che riporto.) | 86.67 | 4.53 (0.74) | 5 (3–5) | 1.00 | 0 | 38 | 46.67 | 3.90 (1.29) | - Nel tradurre coping suggerirei di dire “per gestire” invece che per affrontare [Translation] - Il chatbot fornisce indicazioni adeguate per affrontare i miei problem [Translation] |
GD | The chatbot helps me set realistic and achievable goals. (Il chatbot mi aiuta a stabilire obiettivi realistici e raggiungibili.) | 100.00 | 4.53 (0.52) | 5 (4–5) | 1.00 | 1 | 25 | 66.67 | 4.80 (0.42) | N/A |
GD | The chatbot encourages me to take positive steps. (Il chatbot mi incoraggia a compiere sforzi per il mio benessere.) | 86.67 | 4.27 (1.03) | 5 (2–5) | 1.00 | 2 | 27 | 53.33 | 3.82 (0.87) | - Il chatbot mi incoraggia a compiere azioni costruttive. [Translation] - Il chatbot mi incoraggia a compiere passi positivi [Translation] |
M | The chatbot accurately recalls key details from previous conversations. (Il chatbot ricorda accuratamente i dettagli chiave delle conversazioni precedenti.) | 100.00 | 4.73 (0.46) | 5 (4–5) | 0.50 | 1 | 39 | 60.00 | 4.55 (0.82) | - I would delete “key”, to not make it seem like the chatbot can understand personal alliance, but rather its capacity to recall information at large this item is important. [Content] |
M | The chatbot maintains consistency by integrating past interactions into current responses. (Il chatbot integra coerentemente le interazioni passate nelle risposte attuali.) | 93.33 | 4.80 (0.56) | 5 (3–5) | 0.00 | 4 | 26 | 53.33 | 4.50 (0.85) | - Il chatbot è coerente ed integra le interazioni passate nelle risposte attuali. [Translation] - Il chatbot integra coerentemente le interazioni passate nelle risposte [Translation] |
M | The chatbot adapts its advice based on information provided in earlier sessions. (Il chatbot adatta i suoi consigli in base alle informazioni fornite nelle sessioni precedenti.) | 93.33 | 4.67 (0.62) | 5 (3–5) | 0.50 | 3 | 25 | 73.33 | 4.82 (0.40) | N/A |
OS | I am overall satisfied with the usability of this chatbot. (Sono complessivamente soddisfatto dell’usabilità di questo chatbot.) | 93.33 | 4.53 (0.64) | 5 (3–5) | 1.00 | 0 | 39 | 73.33 | 4.64 (0.50) | - Nel complesso, sono soddisfatto dell’usabilità di questo chatbot [Translation] |
OS | I feel that my interactions with the chatbot were worthwhile. (Trovo che le mie interazioni con il chatbot siano state utili.) | 86.67 | 4.20 (0.68) | 4 (3–5) | 1.00 | 3 | 27 | 66.67 | 4.73 (0.65) | - Trovo che le mie interazioni con il chatbot siano state proficue [Translation] |
OS | I am overall satisfied with the effectiveness of this chatbot. (Sono complessivamente soddisfatto dell’efficacia di questo chatbot.) | 75.00 | 4.20 (1.01) | 5 (2–5) | 1.50 | 2 | 24 | 60.00 | 4.70 (0.67) | - Nel complesso, sono soddisfatto… [Translation] |
Appendix B.2. Round 2 Decision
Dim | Add (Motivation) | Modify (Motivation) | Drop (Motivation) |
UR | Nothing | - Rephrase both the Italian translation and the original item (“The chatbot consistently understands what I am saying or asking”.) into: “The chatbot consistently understands what I am saying and asking” and “Il chatbot capisce ciò che sto dicendo e chiedendo”. (% Agree Translation, QL Feedback) - Rephrase both the Italian translation and the original item (“The chatbot infers information from my messages”.) into: “The chatbot is able to make adequate inferences based on my messages”. and “Il chatbot è in grado di fare deduzioni appropriate basandosi sui miei messaggi”. (% Agree Translation, QL Feedback) | - “The chatbot understands the tone of my request”. (% Agree Relevance, QL Feedback) |
PHI | Nothing | - Rephrase the Italian version of the item “The chatbot provides information grounded in theory and scientific literature”. into “Il chatbot fornisce informazioni supportate da teorie e letteratura scientifica”. (% Agree Translation, QL Feedback) | - The chatbot often provides incorrect information. (Redundancy) - The chatbot often provides incomplete information. (% Agree Relevance) - The chatbot provides references. (% Agree Relevance, Redundancy) |
CRR | Nothing | - Rephrase the Italian version of the item “The chatbot’s responses are clear, and easy to understand”. into “Le risposte del chatbot sono chiare e semplici da capire”. (% Agree Translation, QL Feedback) | - The chatbot’s responses are confusing. (Redundancy, QL Feedback) - The chatbot adds superfluous information related to the query. (% Agree Relevance, Redundancy, QL Feedback) |
LQ | Nothing | - Rephrase the Italian version of the item “The chatbot uses correct grammar and spelling in its responses”. into “Il chatbot fornisce risposte grammaticalmente e ortograficamente corrette”. (% Agree Translation, QL Feedback) - Rephrase both the Italian translation and the original item (“The chatbot’s language style is/seems natural”.) into: “The chatbot’s language style sounds natural”. and “Lo stile linguistico del chatbot suona naturale”. (% Agree Translation, QL Feedback) - Rephrase the Italian version of the item “The chatbot’s language is appropriate for the context”. into “Il linguaggio del chatbot è appropriato al contesto”. (% Agree Translation, QL Feedback) | Nothing |
T | Nothing | - Rephrase the Italian version of the item “I believe the chatbot is transparent about its limitations and capabilities”. into “Credo che il chatbot sia trasparente riguardo ai suoi limiti e alle sue capacità” (% Agree Translation, QL Feedback) - Rephrase both the Italian translation and the original item (“I believe that the feedback/information provided by the chatbot are trustworthy”.) into: “I believe that the feedback and the information provided by the chatbot are trustworthy”. and “Credo che i feedback e le informazioni fornite dal chatbot siano affidabili”. (% Agree Translation, QL Feedback) | Nothing |
ES | Nothing | - Rephrase the Italian version of the item “The chatbot’s responses feel empathetic and supportive”. into “Le risposte del chatbot risultano empatiche e supportive”. (% Agree Translation, QL Feedback) | - “The chatbot shows to have a sense of humor when required” (% Agree Relevance, Redundancy, QL Feedback) |
GD | Nothing | - Rephrase the Italian version of the item “The chatbot provides adjusted guidance in coping with my problems”. into “Il chatbot fornisce indicazioni personalizzate per aiutarmi a gestire i miei problemi”. (% Agree Translation, QL Feedback) - Rephrase the Italian version of the item “The chatbot encourages me to take positive steps”. into “Il chatbot mi incoraggia a compiere azioni costruttive”. (% Agree Translation, QL Feedback) | Nothing |
M | Nothing | - Rephrase both the Italian translation and the original item (“The chatbot accurately recalls key details from previous conversations”.) into: “The chatbot accurately recalls details from previous conversations”. and “Il chatbot ricorda accuratamente i dettagli delle conversazioni precedenti”. (% Agree Translation, QL Feedback) - Rephrase the Italian version of the item “The chatbot maintains consistency by integrating past interactions into current responses”. into “Il chatbot integra coerentemente le interazioni passate nelle risposte”. (% Agree Translation, QL Feedback) | Nothing |
OS | Nothing | - Rephrase the Italian version of the item “I am overall satisfied with the usability of this chatbot”. into “Nel complesso, sono soddisfatto dell’usabilità di questo chatbot”. (% Agree Translation, QL Feedback) - Rephrase both the Italian translation and the original item (“I feel that my interactions with the chatbot were worthwhile”.) into: “Overall, I feel that my interactions with the chatbot were worthwhile”. and “Nel complesso, trovo che le mie interazioni con il chatbot siano state proficue”. (% Agree Translation, QL Feedback) - Rephrase both the Italian translation and the original item (“I am overall satisfied with the effectiveness of this chatbot”.) into: “I am overall satisfied with the support provided by this chatbot”. and “Nel complesso, sono soddisfatto del supporto offerto da questo chatbot”. (% Agree Translation, % Agree Relevance, QL Feedback) | Nothing |
Appendix C
Demographic Profile of Users Who Participated in the Initial Validation
Characteristic | Category | Value or % (n) |
Age | | M = 32.02 (SD = 11.55) |
Gender | Female | 57.14% (28) |
Gender | Male | 40.82% (20) |
Gender | Not specified | 2.04% (1) |
Education | EQF1 | 0.00% (0) |
Education | EQF2 | 8.16% (4) |
Education | EQF3 | 2.04% (1) |
Education | EQF4 | 14.29% (7) |
Education | EQF5 | 0.00% (0) |
Education | EQF6 | 28.57% (14) |
Education | EQF7 | 30.61% (15) |
Education | EQF8 | 16.33% (8) |
Chatbot Experience | None | 18.37% (9) |
Chatbot Experience | Basic | 32.65% (16) |
Chatbot Experience | Intermediate | 34.69% (17) |
Chatbot Experience | Expert | 14.29% (7) |
LLM Experience | None | 24.49% (12) |
LLM Experience | Basic | 38.78% (19) |
LLM Experience | Intermediate | 26.53% (13) |
LLM Experience | Expert | 10.20% (5) |
Propensity to Trust in Technology [76] | | M = 3.76 (SD = 0.51) |
Country | Italy | 100% (49) |
References
- Bendig, E.; Erb, B.; Schulze-Thuesing, L.; Baumeister, H. The Next Generation: Chatbots in Clinical Psychology and Psychotherapy to Foster Mental Health—A Scoping Review. Verhaltenstherapie 2022, 32 (Suppl. S1), 64–76.
- Laymouna, M.; Ma, Y.; Lessard, D.; Schuster, T.; Engler, K.; Lebouché, B. Roles, Users, Benefits, and Limitations of Chatbots in Health Care: Rapid Review. J. Med. Internet Res. 2024, 26, e56930.
- Balcombe, L. AI Chatbots in Digital Mental Health. Informatics 2023, 10, 82.
- Kuehn, B.M. Clinician Shortage Exacerbates Pandemic-Fueled “Mental Health Crisis”. JAMA 2022, 327, 2179.
- Boucher, E.M.; Harake, N.R.; Ward, H.E.; Stoeckl, S.E.; Vargas, J.; Minkel, J.; Parks, A.C.; Zilca, R. Artificially intelligent chatbots in digital mental health interventions: A review. Expert Rev. Med. Devices 2021, 18 (Suppl. S1), 37–49.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
- Skjuve, M.; Følstad, A.; Brandtzaeg, P.B. The User Experience of ChatGPT: Findings from a Questionnaire Study of Early Users. In Proceedings of the 5th International Conference on Conversational User Interfaces, Eindhoven, The Netherlands, 19–21 July 2023; ACM: New York, NY, USA, 2023; pp. 1–10.
- Limpanopparat, S.; Gibson, E.; Harris, D.A. User engagement, attitudes, and the effectiveness of chatbots as a mental health intervention: A systematic review. Comput. Hum. Behav. Artif. Hum. 2024, 2, 100081.
- O’Brien, H.L.; Toms, E.G. What is user engagement? A conceptual framework for defining user engagement with technology. J. Am. Soc. Inf. Sci. Technol. 2008, 59, 938–955.
- Hassenzahl, M.; Tractinsky, N. User experience-a research agenda. Behav. Inf. Technol. 2006, 25, 91–97.
- Shackel, B. Usability—Context, framework, definition, design and evaluation. Interact. Comput. 2009, 21, 339–346.
- Moilanen, J.; Visuri, A.; Suryanarayana, S.A.; Alorwu, A.; Yatani, K.; Hosio, S. Measuring the Effect of Mental Health Chatbot Personality on User Engagement. In Proceedings of the 21st International Conference on Mobile and Ubiquitous Multimedia, Lisbon, Portugal, 27–30 November 2022; ACM: New York, NY, USA, 2022; pp. 138–150.
- Gabrielli, S.; Rizzi, S.; Bassi, G.; Carbone, S.; Maimone, R.; Marchesoni, M.; Forti, S. Engagement and Effectiveness of a Healthy-Coping Intervention via Chatbot for University Students During the COVID-19 Pandemic: Mixed Methods Proof-of-Concept Study. JMIR Mhealth Uhealth 2021, 9, e27965.
- O’Brien, H.L.; Toms, E.G. The development and evaluation of a survey to measure user engagement. J. Am. Soc. Inf. Sci. Technol. 2010, 61, 50–69.
- Denecke, K.; Vaaheesan, S.; Arulnathan, A. A Mental Health Chatbot for Regulating Emotions (SERMO)—Concept and Usability Test. IEEE Trans. Emerg. Top. Comput. 2021, 9, 1170–1182.
- Escobar-Viera, C.G.; Porta, G.; Coulter, R.W.S.; Martina, J.; Goldbach, J.; Rollman, B.L. A chatbot-delivered intervention for optimizing social media use and reducing perceived isolation among rural-living LGBTQ+ youth: Development, acceptability, usability, satisfaction, and utility. Internet Interv. 2023, 34, 100668.
- Lima, M.R.; Wairagkar, M.; Natarajan, N.; Vaitheswaran, S.; Vaidyanathan, R. Robotic Telemedicine for Mental Health: A Multimodal Approach to Improve Human-Robot Engagement. Front. Robot. AI 2021, 8, 618866.
- Laugwitz, B.; Held, T.; Schrepp, M. Construction and Evaluation of a User Experience Questionnaire. In HCI and Usability for Education and Work; Holzinger, A., Ed.; Springer: Berlin/Heidelberg, Germany, 2008; pp. 63–76. ISBN 978-3-540-89349-3.
- Shah, J.; DePietro, B.; D’Adamo, L.; Firebaugh, M.L.; Laing, O.; Fowler, L.A.; Smolar, L.; Sadeh-Sharvit, S.; Taylor, C.B.; Wilfley, D.E.; et al. Development and usability testing of a chatbot to promote mental health services use among individuals with eating disorders following screening. Int. J. Eat. Disord. 2022, 55, 1229–1244.
- Boyd, K.; Potts, C.; Bond, R.; Mulvenna, M.; Broderick, T.; Burns, C.; Bickerdike, A.; Mctear, M.; Kostenius, C.; Vakaloudis, A.; et al. Usability testing and trust analysis of a mental health and wellbeing chatbot. In Proceedings of the 33rd European Conference on Cognitive Ergonomics, Kaiserslautern, Germany, 4–7 October 2022; ACM: New York, NY, USA, 2022; pp. 1–8.
- Islam, M.N.; Khan, S.R.; Islam, N.N.; Rezwan-A-Rownok Md Zaman, S.R.; Zaman, S.R. A Mobile Application for Mental Health Care During COVID-19 Pandemic: Development and Usability Evaluation with System Usability Scale; Springer: Cham, Switzerland, 2021; pp. 33–42. [Google Scholar]
- Valtolina, S.; Zanotti, P.; Mandelli, S. Designing Conversational Agents to Empower Active Aging. In Proceedings of the ACM International Conference on Intelligent Virtual Agents, Glasgow, UK, 16–19 September 2024; ACM: New York, NY, USA, 2024; pp. 1–4. [Google Scholar]
- Brooke, J. SUS: A “Quick and Dirty” Usability Scale. In Usability Evaluation In Industry; CRC Press: Boca Raton, FL, USA, 1996; pp. 207–212. [Google Scholar]
- Holmes, S.; Moorhead, A.; Bond, R.; Zheng, H.; Coates, V.; Mctear, M. Usability testing of a healthcare chatbot: Can we use conventional methods to assess conversational user interfaces? In Proceedings of the 31st European Conference on Cognitive Ergonomics, Belfast, UK, 10–13 September 2019; ACM: New York, NY, USA, 2019; pp. 207–214. [Google Scholar]
- Borsci, S.; Malizia, A.; Schmettow, M.; van der Velde, F.; Tariverdiyeva, G.; Balaji, D.; Chamberlain, A. The Chatbot Usability Scale: The Design and Pilot of a Usability Scale for Interaction with AI-Based Conversational Agents. Pers. Ubiquitous Comput. 2022, 26, 95–119. [Google Scholar] [CrossRef]
- Henkel, T.; Linn, A.J.; van der Goot, M.J. Understanding the Intention to Use Mental Health Chatbots Among LGBTQIA+ Individuals: Testing and Extending the UTAUT. In Proceedings of the 6th International Workshop, CONVERSATIONS 2022, Amsterdam, The Netherlands, 22–23 November 2022; Springer: Cham, Switzerland, 2023; pp. 83–100. [Google Scholar]
- Kamita, T.; Ito, T.; Matsumoto, A.; Munakata, T.; Inoue, T. A Chatbot System for Mental Healthcare Based on SAT Counseling Method. Mob. Inf. Syst. 2019, 2019, 1–11. [Google Scholar] [CrossRef]
- Venkatesh, V.; Morris, M.G.; Davis, G.B.; Davis, F.D. User Acceptance of Information Technology: Toward a Unified View. MIS Qarterly 2003, 27, 425. [Google Scholar] [CrossRef]
- Davis, F.D. Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology. MIS Qarterly 1989, 13, 319. [Google Scholar] [CrossRef]
- Ahuja, K.; Lio, P. Measuring Empathy in Artificial Intelligence: Insights From Psychodermatology and Implications for General Practice. Prim. Care Companion CNS Disord. 2024, 26, 24lr03782. [Google Scholar] [CrossRef] [PubMed]
- Zhao, J.; Plaza-del-Arco, F.M.; Genchel, B.; Curry, A.C. Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks. arXiv 2024, arXiv:2406.08598. [Google Scholar]
- Schmidmaier, M.; Rupp, J.; Cvetanova, D.; Mayer, S. Perceived Empathy of Technology Scale (PETS): Measuring Empathy of Systems Toward the User. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; ACM: New York, NY, USA, 2024; pp. 1–18. [Google Scholar]
- Concannon, S.; Tomalin, M. Measuring perceived empathy in dialogue systems. AI Soc. 2024, 39, 2233–2247. [Google Scholar] [CrossRef]
- Miloff, A.; Carlbring, P.; Hamilton, W.; Andersson, G.; Reuterskiöld, L.; Lindner, P. Measuring Alliance Toward Embodied Virtual Therapists in the Era of Automated Treatments With the Virtual Therapist Alliance Scale (VTAS): Development and Psychometric Evaluation. J. Med. Internet Res. 2020, 22, e16660. [Google Scholar] [CrossRef]
- Wei, S.; Freeman, D.; Rovira, A. A randomised controlled test of emotional attributes of a virtual coach within a virtual reality (VR) mental health treatment. Sci. Rep. 2023, 13, 11517. [Google Scholar] [CrossRef]
- Yu, H.Q.; McGuinness, S. An experimental study of integrating fine-tuned large language models and prompts for enhancing mental health support chatbot system. J. Med. Artif. Intell. 2024, 7, 16. [Google Scholar] [CrossRef]
- Crasto, R.; Dias, L.; Miranda, D.; Kayande, D. CareBot: A Mental Health ChatBot. In Proceedings of the 2021 2nd International Conference for Emerging Technology (INCET), Belagavi, India, 21–23 May 2021; pp. 1–5. [Google Scholar]
- Srivastava, A.; Pandey, I.; Akhtar, M.S.; Chakraborty, T. Response-act Guided Reinforced Dialogue Generation for Mental Health Counseling. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; ACM: New York, NY, USA, 2023; pp. 1118–1129. [Google Scholar]
- Kaysar, M.N.; Shiramatsu, S. Mental State-Based Dialogue System for Mental Health Care by Using GPT-3. In Proceedings of Eighth International Congress on Information and Communication Technology; Springer: Singapore, 2024; pp. 891–901. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics—ACL ’02, Philadelphia, PA, USA, 7–12 July 2002; Association for Computational Linguistics: Morristown, NJ, USA, 2001; p. 311. [Google Scholar]
- Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81. [Google Scholar]
- Radziwill, N.M.; Benton, M.C. Evaluating Quality of Chatbots and Intelligent Conversational Agents. arXiv 2017, arXiv:1704.04579. [Google Scholar]
- Ding, H.; Simmich, J.; Vaezipour, A.; Andrews, N.; Russell, T. Evaluation framework for conversational agents with artificial intelligence in health interventions: A systematic scoping review. J. Am. Med. Inform. Assoc. 2024, 31, 746–761. [Google Scholar] [CrossRef]
- Donohoe, H.; Stellefson, M.; Tennant, B. Advantages and Limitations of the e-Delphi Technique. Am. J. Health Educ. 2012, 43, 38–46. [Google Scholar] [CrossRef]
- Belton, I.; MacDonald, A.; Wright, G.; Hamlin, I. Improving the practical application of the Delphi method in group-based judgment: A six-step prescription for a well-founded and defensible process. Technol. Forecast. Soc. Change 2019, 147, 72–82. [Google Scholar] [CrossRef]
- McMillan, S.S.; King, M.; Tully, M.P. How to use the nominal group and Delphi techniques. Int. J. Clin. Pharm. 2016, 38, 655–662. [Google Scholar] [CrossRef] [PubMed]
- Jünger, S.; Payne, S.A.; Brine, J.; Radbruch, L.; Brearley, S.G. Guidance on Conducting and REporting DElphi Studies (CREDES) in palliative care: Recommendations based on a methodological systematic review. Palliat. Med. 2017, 31, 684–706. [Google Scholar] [CrossRef]
- Denecke, K.; May, R.; Rivera Romero, O. Potential of Large Language Models in Health Care: Delphi Study. J. Med. Internet Res. 2024, 26, e52399. [Google Scholar] [CrossRef] [PubMed]
- Maroengsit, W.; Piyakulpinyo, T.; Phonyiam, K.; Pongnumkul, S.; Chaovalit, P.; Theeramunkong, T. A Survey on Evaluation Methods for Chatbots. In Proceedings of the 2019 7th International Conference on Information and Education Technology, Aizu-Wakamatsu, Japan, 29–31 March 2019; ACM: New York, NY, USA, 2019; pp. 111–119. [Google Scholar]
- Denecke, K.; Abd-Alrazaq, A.; Househ, M.; Warren, J. Evaluation Metrics for Health Chatbots: A Delphi Study. Methods Inf. Med. 2021, 60, 171–179. [Google Scholar] [CrossRef] [PubMed]
- Guo, Z.; Lai, A.; Thygesen, J.H.; Farrington, J.; Keen, T.; Li, K. Large Language Model for Mental Health: A Systematic Review. arXiv 2024, arXiv:2403.15401. [Google Scholar]
- Tam, T.Y.C.; Sivarajkumar, S.; Kapoor, S.; Stolyar, A.V.; Polanska, K.; McCarthy, K.R.; Osterhoudt, H.; Wu, X.; Visweswaran, S.; Fu, S.; et al. A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review. npj Digit. Med. 2024, 7, 1–20. [Google Scholar] [CrossRef]
- Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 2024, 15, 39. [Google Scholar] [CrossRef]
- Peng, J.-L.; Cheng, S.; Diau, E.; Shih, Y.-Y.; Chen, P.-H.; Lin, Y.-T.; Chen, Y.-N. A Survey of Useful LLM Evaluation. arXiv 2024, arXiv:2406.00936. [Google Scholar]
- Qualtrics. Qualtrics XM. Provo (UT): Qualtrics. Available online: https://www.qualtrics.com (accessed on 11 March 2025).
- Mistral AI. Mistral Large. Version 2407. Mistral AI: Paris, France. Available online: https://mistral.ai (accessed on 11 March 2025).
- McKinney, W. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, Austin, TX, USA, 28 June–3 July 2010; pp. 56–61. [Google Scholar]
- Saari, D.G. Selecting a voting method: The case for the Borda count. Const. Political Econ. 2023, 34, 357–366. [Google Scholar] [CrossRef]
- Clarke, V.; Braun, V. Thematic analysis. J. Posit. Psychol. 2017, 12, 297–298. [Google Scholar] [CrossRef]
- Sheinis, M.; Selk, A. Development of the Adult Vulvar Lichen Sclerosus Severity Scale—A Delphi Consensus Exercise for Item Generation. J. Low. Genit. Tract. Dis. 2018, 22, 66–73. [Google Scholar] [CrossRef]
- Bauer, S.M.; Fusté, A.; Andrés, A.; Saldaña, C. The Barcelona Orthorexia Scale (BOS): Development process using the Delphi method. Eat. Weight. Disord.—Stud. Anorex. Bulim. Obes. 2019, 24, 247–255. [Google Scholar] [CrossRef] [PubMed]
- Xin, T.; Ding, X.; Gao, H.; Li, C.; Jiang, Y.; Chen, X. Using Delphi method to develop Chinese women’s cervical cancer screening intention scale based on planned behavior theory. BMC Womens Health 2022, 22, 512. [Google Scholar] [CrossRef] [PubMed]
- Scott, V.C.; Temple, J.; Jillani, Z. Development of the Technical Assistance Engagement Scale: A modified Delphi study. Implement. Sci. Commun. 2024, 5, 84. [Google Scholar] [CrossRef] [PubMed]
- World Health Organization. Doing What Matters in Times of Stress: An Illustrated Guide; World Health Organization: Geneva, Switzerland, 2020. [Google Scholar]
- Cronbach, L.J. Coefficient Alpha and the Internal Structure of Tests. Psychometrika 1951, 16, 297–334. [Google Scholar] [CrossRef]
- Guilford, J.P. The Correlation of an Item With a Composite of the Remaining Items in a Test. Educ. Psychol. Meas. 1953, 13, 87–93. [Google Scholar] [CrossRef]
- Tavakol, M.; Dennick, R. Making sense of Cronbach’s alpha. Int. J. Med. Educ. 2011, 2, 53–55. [Google Scholar] [CrossRef]
- Röschel, A.; Wagner, C.; Dür, M. Examination of validity, reliability, and interpretability of a self-reported questionnaire on Occupational Balance in Informal Caregivers (OBI-Care)—A Rasch analysis. PLoS ONE 2021, 16, e0261815. [Google Scholar] [CrossRef]
- Zieve, G.G.; Sarfan, L.D.; Dong, L.; Tiab, S.S.; Tran, M.; Harvey, A.G. Cognitive Therapy-as-Usual versus Cognitive Therapy plus the Memory Support Intervention for adults with depression: 12-month outcomes and opportunities for improved efficacy in a secondary analysis of a randomized controlled trial. Behav. Res. Ther. 2023, 170, 104419. [Google Scholar] [CrossRef]
- Dong, L.; Zieve, G.; Gumport, N.B.; Armstrong, C.C.; Alvarado-Martinez, C.G.; Martinez, A.; Howlett, S.; Fine, E.; Tran, M.; McNamara, M.E.; et al. Can integrating the Memory Support Intervention into cognitive therapy improve depression outcome? A randomized controlled trial. Behav. Res. Ther. 2022, 157, 104167. [Google Scholar] [CrossRef]
- Jo, E.; Jeong, Y.; Park, S.; Epstein, D.A.; Kim, Y.H. Understanding the Impact of Long-Term Memory on Self-Disclosure with Large Language Model-Driven Chatbots for Public Health Intervention. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; ACM: New York, NY, USA, 2024; pp. 1–21. [Google Scholar]
- ISO/IEC TS 25010:2023(en); Systems and Software Engineering—Systems and Software Quality Requirements and Evaluation (SQuaRE): Product Quality Model. ISO: Geneva, Switzerland, 2023.
- Ouhbi, S.; Idri, A.; Fernández-Alemán, J.L.; Toval, A.; Benjelloun, H. Applying ISO/IEC 25010 on Mobile Personal Health Records. In Proceedings of the BIOSTEC 2015: Proceedings of the International Joint Conference on Biomedical Engineering Systems and Technologies, Lisbon, Portugal, 12–15 January 2015; SCITEPRESS—Science and and Technology Publications: Setubal, Portugal, 2015; pp. 405–412. [Google Scholar]
- Blut, M.; Wang, C.; Wünderlich, N.V.; Brock, C. Understanding anthropomorphism in service provision: A meta-analysis of physical robots, chatbots, and other AI. J. Acad. Mark. Sci. 2021, 49, 632–658. [Google Scholar] [CrossRef]
- Eyssel, F.; Reich, N. Loneliness makes the heart grow fonder (of robots)—On the effects of loneliness on psychological anthropomorphism. In Proceedings of the 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Tokyo, Japan, 3–6 March 2013; pp. 121–122. [Google Scholar]
- Jessup, S.; Schneider, T.; Alarcon, G.; Ryan, T.; Capiola, A. The Measurement of the Propensity to Trust Technology. Master’s Thesis, Wright University,, Dayton, OH, USA, 2018. [Google Scholar]
Characteristic | Category | Value or % (n)
---|---|---
Age | | M = 34.50 (SD = 10.66)
Gender | Male | 56.25% (9)
Gender | Female | 43.75% (7)
Education | Bachelor | 12.50% (2)
Education | Master | 50.00% (8)
Education | Doctorate | 31.25% (5)
Education | PsyD Specialization | 6.25% (1)
Area of expertise | Psychology | 31.25% (5)
Area of expertise | Artificial Intelligence | 31.25% (5)
Area of expertise | Human–Computer Interaction | 18.75% (3)
Area of expertise | Digital Therapeutics | 18.75% (3)
Occupation | Researcher | 50.00% (8)
Occupation | Developer (AI) | 37.50% (6)
Occupation | Psychologist | 12.50% (2)
Job seniority | 3–5 years | 50.00% (8)
Job seniority | 6–10 years | 18.75% (3)
Job seniority | 11–15 years | 12.50% (2)
Job seniority | 16–20 years | 0.00% (0)
Job seniority | 21+ years | 18.75% (3)
Country | Italy | 100.00% (16)
Dimension | Item (English) | Item (Italian) | Priority
---|---|---|---
Understanding requests [UR] | The chatbot consistently understands what I am saying and asking. | Il chatbot capisce ciò che sto dicendo e chiedendo. | 1
Understanding requests [UR] | The chatbot is able to make adequate inferences based on my messages. | Il chatbot è in grado di fare deduzioni appropriate basandosi sui miei messaggi. | 2
Understanding requests [UR] | The chatbot asks specific questions to better understand my requests. | Il chatbot fa domande specifiche per capire meglio le mie richieste. | 3
Providing helpful information [PHI] | The chatbot provides accurate information. | Il chatbot fornisce informazioni accurate. | 1
Providing helpful information [PHI] | The chatbot provides helpful information. | Il chatbot fornisce informazioni utili. | 2
Providing helpful information [PHI] | The chatbot provides information grounded in theory and scientific literature. | Il chatbot fornisce informazioni supportate da teorie e letteratura scientifica. | 3
Clarity and relevance of responses [CRR] | The chatbot’s responses are clear and easy to understand. | Le risposte del chatbot sono chiare e semplici da capire. | 1
Clarity and relevance of responses [CRR] | The chatbot’s responses are adequately concise. | Le risposte del chatbot sono sufficientemente concise. | 2
Clarity and relevance of responses [CRR] | The chatbot’s responses are irrelevant to my questions. | Le risposte del chatbot non sono pertinenti alle mie domande. | 3
Language quality [LQ] | The chatbot uses correct grammar and spelling in its responses. | Il chatbot fornisce risposte grammaticalmente e ortograficamente corrette. | 1
Language quality [LQ] | The chatbot’s language is appropriate for the context. | Il linguaggio del chatbot è appropriato al contesto. | 2
Language quality [LQ] | The chatbot’s language style sounds natural. | Lo stile linguistico del chatbot suona naturale. | 3
Trust [T] | I feel safe sharing my personal matters with the chatbot. | Mi sento al sicuro nel condividere questioni personali con il chatbot. | 1
Trust [T] | I believe that the feedback and the information provided by the chatbot are trustworthy. | Credo che i feedback e le informazioni fornite dal chatbot siano affidabili. | 2
Trust [T] | I believe the chatbot is transparent about its limitations and capabilities. | Credo che il chatbot sia trasparente riguardo ai suoi limiti e alle sue capacità. | 3
Emotional support [ES] | The chatbot makes me feel heard and understood. | Il chatbot mi fa sentire ascoltato e capito. | 1
Emotional support [ES] | The chatbot’s responses feel empathetic and supportive. | Le risposte del chatbot risultano empatiche e supportive. | 2
Emotional support [ES] | The chatbot’s responses can make me feel reassured. | Le risposte del chatbot sono in grado di farmi sentire rassicurato. | 3
Guidance and direction [GD] | The chatbot provides adjusted guidance in coping with my problems. | Il chatbot fornisce indicazioni personalizzate per aiutarmi a gestire i miei problemi. | 1
Guidance and direction [GD] | The chatbot encourages me to take positive steps. | Il chatbot mi incoraggia a compiere azioni costruttive. | 2
Guidance and direction [GD] | The chatbot helps me set realistic and achievable goals. | Il chatbot mi aiuta a stabilire obiettivi realistici e raggiungibili. | 3
Memory [M] | The chatbot accurately recalls details from previous conversations. | Il chatbot ricorda accuratamente i dettagli delle conversazioni precedenti. | 1
Memory [M] | The chatbot maintains consistency by integrating past interactions into current responses. | Il chatbot integra coerentemente le interazioni passate nelle risposte. | 2
Memory [M] | The chatbot adapts its advice based on information provided in earlier sessions. | Il chatbot adatta i suoi consigli in base alle informazioni fornite nelle sessioni precedenti. | 3
Overall satisfaction [OS] | I am overall satisfied with the usability of this chatbot. | Nel complesso, sono soddisfatto dell’usabilità di questo chatbot. | 1
Overall satisfaction [OS] | Overall, I feel that my interactions with the chatbot were worthwhile. | Nel complesso, trovo che le mie interazioni con il chatbot siano state proficue. | 2
Overall satisfaction [OS] | I am overall satisfied with the support provided by this chatbot. | Nel complesso, sono soddisfatto del supporto offerto da questo chatbot. | 3
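
The item table above contains one negatively worded item (CRR, priority 3: "The chatbot’s responses are irrelevant to my questions."). A minimal sketch of how per-dimension scores might be computed is given here. It rests on assumptions that this section does not state: a 1–5 Likert response range, reverse-coding of the negatively worded item before aggregation, and a simple mean per dimension. The names `LIKERT_MIN`, `LIKERT_MAX`, `REVERSED`, `reverse`, and `dimension_scores` are hypothetical.

```python
from statistics import mean

# Assumed response range; the section does not state the scale anchors.
LIKERT_MIN, LIKERT_MAX = 1, 5

# (dimension code, item priority) pairs treated as negatively worded.
# Assumption: only the third CRR item is reverse-coded.
REVERSED = {("CRR", 3)}


def reverse(score):
    """Mirror a score around the midpoint of the assumed Likert range."""
    return LIKERT_MIN + LIKERT_MAX - score


def dimension_scores(responses):
    """Return the mean score per dimension for one respondent.

    responses maps a dimension code to its item scores, listed in the
    priority order used in the item table (priority 1 first).
    """
    result = {}
    for dim, scores in responses.items():
        adjusted = [
            reverse(s) if (dim, priority) in REVERSED else s
            for priority, s in enumerate(scores, start=1)
        ]
        result[dim] = mean(adjusted)
    return result


# Hypothetical answers from a single respondent.
example = {"CRR": [5, 4, 1], "ES": [3, 4, 5]}
print(dimension_scores(example))
```

If the negatively worded item is left uncoded, it pulls against the other two CRR items, which is one possible reason for weak within-dimension agreement on that dimension.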
Dimension | Inter-Item Correlation (Mean) | Inter-Item Correlation (Range) | Item-Total Correlation (Mean) | Item-Total Correlation (Range) | Cronbach’s α
---|---|---|---|---|---
UR | 0.42 | 0.28–0.54 | 0.50 | 0.41–0.61 | 0.68
PHI | 0.58 | 0.44–0.73 | 0.65 | 0.55–0.76 | 0.79
CRR | 0.28 | 0.02–0.71 | 0.33 | 0.06–0.54 | 0.47
LQ | 0.40 | 0.23–0.61 | 0.47 | 0.31–0.63 | 0.63
T | 0.55 | 0.41–0.74 | 0.63 | 0.49–0.73 | 0.78
ES | 0.77 | 0.70–0.86 | 0.82 | 0.76–0.88 | 0.91
GD | 0.48 | 0.33–0.73 | 0.54 | 0.38–0.64 | 0.71
M | 0.55 | 0.45–0.65 | 0.63 | 0.55–0.71 | 0.78
OS | 0.75 | 0.72–0.77 | 0.80 | 0.78–0.82 | 0.90
Overall | N/A | N/A | N/A | N/A | 0.94
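
The reliability table above reports three statistics per dimension: inter-item correlation, item-total correlation, and Cronbach’s α. As an illustration of what these quantities measure, the sketch below computes them from raw item scores using only the Python standard library. It is not the authors’ analysis code. The response data are hypothetical, and the item-total correlation is assumed to be the corrected form (each item against the sum of the remaining items).

```python
from itertools import combinations
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation of two equally long score lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / (ssx * ssy) ** 0.5


def cronbach_alpha(items):
    """alpha = k / (k - 1) * (1 - sum of item variances / variance of totals).

    items is a list of k score lists, one per item, aligned by respondent.
    """
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def mean_inter_item_correlation(items):
    """Average Pearson correlation over all distinct item pairs."""
    return mean(pearson(a, b) for a, b in combinations(items, 2))


def corrected_item_total(items, index):
    """Correlate one item with the sum of the remaining items."""
    rest = [sum(row) - row[index] for row in zip(*items)]
    return pearson(items[index], rest)


# Hypothetical scores for one three-item dimension, six respondents.
dimension = [
    [4, 5, 3, 4, 2, 5],
    [4, 4, 3, 5, 2, 5],
    [3, 5, 2, 4, 3, 4],
]
print(round(cronbach_alpha(dimension), 2))
print(round(mean_inter_item_correlation(dimension), 2))
print([round(corrected_item_total(dimension, i), 2) for i in range(3)])
```

With only three items per dimension, α is sensitive to a single weakly correlated item, so the inter-item range is worth reading alongside α.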
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Bolpagni, M.; Gabrielli, S. Development of a Comprehensive Evaluation Scale for LLM-Powered Counseling Chatbots (CES-LCC) Using the eDelphi Method. Informatics 2025, 12, 33. https://doi.org/10.3390/informatics12010033