A Systematic Review on Healthcare Artificial Intelligent Conversational Agents for Chronic Conditions

This paper reviews different types of conversational agents used in health care for chronic conditions, examining their underlying communication technology, evaluation measures, and AI methods. A systematic search was performed in February 2021 on PubMed Medline, EMBASE, PsycINFO, CINAHL, Web of Science, and ACM Digital Library. Studies were included if they focused on consumers, caregivers, or healthcare professionals in the prevention, treatment, or rehabilitation of chronic diseases, involved conversational agents, and tested the system with human users. The search retrieved 1087 articles. Twenty-six studies met the inclusion criteria. Out of 26 conversational agents (CAs), 16 were chatbots, seven were embodied conversational agents (ECA), one was a conversational agent in a robot, and another was a relational agent. One agent was not specified. Based on this review, the overall acceptance of CAs by users for the self-management of their chronic conditions is promising. Users’ feedback shows helpfulness, satisfaction, and ease of use in more than half of included studies. Although many users in the studies appear to feel more comfortable with CAs, there is still a lack of reliable and comparable evidence to determine the efficacy of AI-enabled CAs for chronic health conditions due to the insufficient reporting of technical implementation details.


Introduction
The availability and use of conversational agents have been increasing due to advances in technologies such as natural language processing (NLP), voice recognition, and artificial intelligence (AI). Conversational agents (CAs), also known as chatbots or dialogue systems, are computer systems that communicate with users through natural language user interfaces involving images, text, and voice [1,2]. Google Assistant, Apple Siri, Amazon Alexa, and Microsoft Cortana are common CAs with voice-activated interfaces. In the last decade, CAs' popularity has increased, particularly those that use unconstrained natural language [3][4][5]. For example, consumers can talk to CAs on their smartphones for daily tasks, such as managing their calendars and retrieving information [6,7].
Recently, AI-based CAs have demonstrated multiple benefits in many domains, especially in healthcare, where they are used to deliver scalable, less costly medical support solutions that can help at any time via smartphone apps or online [8,9].


Methods
The inclusion criteria covered primary research studies that focused on consumers, caregivers, or healthcare professionals in the prevention, treatment, or rehabilitation of chronic diseases using CAs, and that tested the system with human users. Reviews, perspectives, opinion papers, and news articles were excluded. Studies that evaluated only individual components of the CA (automatic speech recognition, natural language understanding, dialogue management, response generation, or text-to-speech synthesis), rather than human users interacting with the entire system, were also excluded. Finally, studies using "Wizard of Oz" methods, in which the dialogue is generated by a human operator rather than the CA, were excluded [1,6,9].

Screening, data extraction, and synthesis
All references identified through the searches were downloaded, and duplicates were eliminated using reference managers (EndNote and Mendeley). The titles and abstracts for each paper were then exported from the reference manager into an Excel spreadsheet. Before screening began, the screening procedures were agreed upon. The first filter screened papers based on the information contained in their titles and abstracts; two independent reviewers conducted this screening, and two independent reviewers also conducted the full-text screening. Disagreements over the exclusion of an article were resolved in a Zoom meeting between the two reviewers. Four reviewers extracted the following data for each study: first author, year of publication, study location, chronic condition, study aim, study types and methods, participants' characteristics, evaluation measures, and main findings (Table 1). Evaluation measures were extracted under three types: technical performance, user experience, and health-related measures. Technical performance was treated as an objective assessment of the technical properties of the whole system; because most papers did not report it, this measure is not included in Table 1 and is discussed in the following sections. User experience evaluation comprised subjective assessments, in which users judged the system's properties or components from their own perspectives via quantitative or qualitative methods [27]. Health-related measures covered any health outcomes reported in the included studies, such as diagnostic accuracy or symptom reduction. Table 2 presents the characteristics of the CAs (categories defined in Box 1) evaluated in the included studies, as well as the AI methods used, identified from a list of keywords drawn from three systematic literature reviews of CAs in health care [1,6,27].
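The deduplication step described above can be sketched as follows; the record fields and the title-plus-year matching key are assumptions for illustration, not the exact procedure performed by the reference managers.

```python
# Hypothetical sketch of reference deduplication: merge exported
# reference lists and drop duplicates by normalized title + year.

def normalize_title(title: str) -> str:
    """Lower-case and strip punctuation/whitespace so near-identical
    titles from different databases compare equal."""
    return "".join(ch for ch in title.lower() if ch.isalnum())

def deduplicate(references):
    """Keep the first occurrence of each (title, year) pair."""
    seen, unique = set(), []
    for ref in references:
        key = (normalize_title(ref["title"]), ref["year"])
        if key not in seen:
            seen.add(key)
            unique.append(ref)
    return unique

refs = [
    {"title": "Chatbots for Diabetes Care", "year": 2019, "source": "PubMed"},
    {"title": "Chatbots for diabetes care.", "year": 2019, "source": "EMBASE"},
    {"title": "ECAs in Mental Health", "year": 2020, "source": "CINAHL"},
]
print(len(deduplicate(refs)))  # 2
```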

Table 1 excerpts (selected study findings displaced from the table):

• Participants mentioned the app was useful and easy to use.
• The average answer ratio of participants in the intervention group was 0.71.
• The intention to change behavior in relation to impairment and pain intensity was positive (p = 0.01).
• Compared with the control group, the intervention group did not show a significant change in pain-related impairment.
• System naturalness, information quality, and coherence scores were consistent among participants.
• Participants needed initial time on the system to learn how it worked.
• The majority of participants faced difficulties with the speech recognition of some keywords.
• CARMIE proved capable of addressing pharmacological and treatment information for daily heart failure care.

Neerincx et al., 2019 (Netherlands and Italy; type 1 diabetes, T1DM): supporting and managing children's diabetes. An iterative refinement process over 6 months went through three cycles, covering the knowledge base, interaction, and additional functions, to achieve an effective partner for diabetes management. Participants were children aged 7-14 from diabetes camps and hospitals in the Netherlands and Italy.

A quasi-experimental dental study (6 months): 36 patients from a single dental practice used personal assistive devices (PDAs) and had their oral health tracked by a dentist. No information about age and gender was reported, and 9 participants left the study partway through. More than half of participants reported the PDAs not functioning correctly (mostly problems keeping the battery charged), and ten participants (40%) achieved improvement in at least three areas of oral health.

Results
The search of the six databases retrieved 1754 articles. After duplicates were removed, 1087 unique articles remained. Title and abstract screening reduced these to 110 articles, and full-text screening excluded 80 of them, leaving 30 articles considered eligible. Four more papers were excluded during data extraction based on the exclusion criteria, so 26 articles were included in the systematic literature review (Figure 1).
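The selection flow above reduces to a simple running tally, which can be checked as follows (numbers taken directly from the text and Figure 1):

```python
# Arithmetic of the study-selection flow (Figure 1) as a running tally.

retrieved = 1754
after_dedup = 1087                           # duplicates removed
after_title_abstract = 110                   # title/abstract screening
after_full_text = after_title_abstract - 80  # 80 excluded at full text
included = after_full_text - 4               # 4 excluded during extraction

print(after_full_text, included)  # 30 26
```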

Description of Conversational Agents and AI Methods
Different technologies have supported CAs, including independent platforms, apps delivered via web or mobile device, short message services (SMS), and telephone (Table 2). Out of 26 conversational agents, 16 were chatbots (computer programs that simulate human conversation via voice or text communication). Seven were embodied conversational agents (ECAs): virtual agents that appear on computer screens, equipped with virtual, human-like bodies, and capable of real-time conversations with humans. One was a conversational agent in a robot, and another was a relational agent, explicitly designed to remember history and manage future expectations in its interactions with users. One agent was not specified [43]. The characterisation of conversational agents is shown in Table 3; this summary is adapted from Laranjo et al. 2018 [27]. Table 3. Characterisation of conversational agents (Laranjo et al. 2018 [27]).

Dialogue management

Finite-state: The user is taken through a dialogue consisting of a sequence of pre-determined steps or states.

Frame-based: The user is asked questions that enable the system to fill slots in a template in order to perform a task. The dialogue flow is not pre-determined but depends on the content of the user's input and the information that the system has to elicit.

Agent-based: These systems enable complex communication between the system, the user, and the application. There are many variants of agent-based systems, depending on which aspects of intelligent behavior are designed into the system. In agent-based systems, communication is viewed as the interaction between two agents, each of which is capable of reasoning about its own actions and beliefs, and sometimes about the actions and beliefs of the other agent. The dialogue model takes the preceding context into account, so the dialogue evolves dynamically as a sequence of related steps that build on each other.

Dialogue initiative

User: The user leads the conversation.

System: The system leads the conversation.

Mixed: Both the user and the system can lead the conversation.

Input modality

Spoken: The user uses spoken language to interact with the system.

Written: The user uses written language to interact with the system.

Output modality

Spoken, written, or visual (e.g., non-verbal communication such as facial expressions or body movements).

Task-oriented

Yes: The system is designed for a particular task and is set up to have short conversations in order to obtain the information necessary to achieve the goal (e.g., booking a consultation).

No: The system is not directed at the short-term achievement of a specific end-goal or task (e.g., purely conversational chatbots).
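As an illustration of the frame-based pattern in Table 3 (slot filling, where the dialogue flow depends on what the user has already provided), the following is a minimal sketch. The booking frame, slot names, and prompts are hypothetical and not drawn from any reviewed system.

```python
# Frame-based dialogue sketch: the system asks only for slots still
# missing from a task template (a hypothetical appointment frame).

FRAME = {"date": None, "time": None, "clinician": None}

def next_question(frame):
    """Return the prompt for the first unfilled slot, or None when
    the frame is complete and the task can be performed."""
    prompts = {
        "date": "What day would you like to come in?",
        "time": "What time suits you?",
        "clinician": "Which clinician would you like to see?",
    }
    for slot, value in frame.items():
        if value is None:
            return prompts[slot]
    return None

frame = dict(FRAME)
frame["date"] = "Monday"     # the user volunteered a date up front,
                             # so the system skips that question
print(next_question(frame))  # What time suits you?
```

Unlike a finite-state design, the question order here adapts to the user's input: any slot the user fills spontaneously is never asked about.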
The CAs in the papers used various AI methods, such as speech recognition, facial recognition, and NLP. However, most studies did not provide sufficient information on the implementation details. To identify the AI methods, a list of common keywords (Appendix B) used in building AI CAs [1,6,27] was employed. Several papers reported that AI methods could improve the user's interaction with the system [1,2,5,6,27]; for example, speech recognition can capture input much faster than typing. Half of the included papers utilized speech recognition in their CAs (e.g., chatbot, ECA, or relational agent). However, speech recognition could also cause difficulties with some keywords because of misinterpreted words. Six studies did not report these technical methods.
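A minimal sketch of the keyword-based identification described above: scan a paper's text for terms from a predefined list to flag which AI methods it mentions. The keyword dictionary and helper below are hypothetical illustrations, not the actual Appendix B list.

```python
# Hypothetical keyword-matching to identify AI methods mentioned
# in a paper's text (illustrative terms only, not Appendix B).

AI_KEYWORDS = {
    "speech recognition": ["speech recognition", "asr"],
    "nlp": ["natural language processing", "nlp"],
    "facial recognition": ["facial recognition", "face recognition"],
}

def identify_methods(text: str):
    """Return the sorted list of methods whose keywords appear in text."""
    lowered = text.lower()
    return sorted(
        method
        for method, terms in AI_KEYWORDS.items()
        if any(term in lowered for term in terms)
    )

print(identify_methods(
    "The agent combines ASR with natural language processing."
))  # ['nlp', 'speech recognition']
```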

Evaluation Measures
Evaluation measures were identified under three types: technical performance (six studies), user experience (25 studies), and health-related measures (18 studies). The most common technical performance measures were accuracy (89-99.2% across five CAs) [31,37,40,43,48] and specificity (93-99.7% across three CAs) [37,40,46]. One study on hypertension reported that the CA's goal-achievement rate was 96%; the authors also clarified that the accuracy of the spoken dialogue system for cough and compliance was 81% and 41%, respectively [43]. Another study on glaucoma and diabetic conditions used Cohen's kappa to assess task completion (κ = 0.848), and the accuracy of the CA was 89% [40]. Two studies on depression (depressive symptoms and major depressive disorder) used finite-state dialogue management: the study on depressive symptoms noted that the written chatbot achieved an accuracy of 99.2% and a specificity of 99.7% [52], while the spoken system for major depressive disorder used an embodied CA and showed 49% sensitivity and 93% specificity [46]. Two further studies concerned urinary incontinence and various chronic conditions such as pain and anxiety: the urinary incontinence article mentioned accuracy without clarifying its percentage or rate [31], and the paper on various chronic conditions (pain, anxiety, and depression) reported almost 92% accuracy in measuring patients' breathing rate [48].
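The technical-performance measures named above (accuracy, sensitivity, specificity) are all derived from the same confusion-matrix counts. As a quick reference, a minimal sketch; the counts below are illustrative, not taken from any included study.

```python
# Standard classification metrics from confusion-matrix counts.

def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (true-positive rate), and specificity
    (true-negative rate) from true/false positive/negative counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

acc, sens, spec = metrics(tp=45, fp=5, tn=140, fn=10)
print(round(acc, 3), round(sens, 3), round(spec, 3))  # 0.925 0.818 0.966
```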
Almost all studies reported on user experience, except one [43]. Helpfulness, satisfaction, and ease of use were common findings in more than half of the included studies. Three studies reported that users were unsatisfied, and in two studies participants found the CAs hard to use. For diabetes type 2 [28], one study reported patient feedback through various measures, such as competency (85%) and helpfulness and friendliness (86%); on the other hand, some patients described the embodied CA as annoying (39%) or boring (30%). Another diabetes study [40] reported user experience through attractiveness (0.74), perspicuity (0.67), and efficiency (0.77), with reliability assessed using Cronbach's alpha coefficients. In mental health, a study on treatment and education reported that some users felt the chatbot was hard to engage with and offered no way to ask their own questions [33]. A study of users after cancer treatment found that they considered the chatbot nonjudgmental and helpful, and 69% supported recommending it to a friend.
Regarding health-related measures, 18 of the 26 studies included them. The most common method was quasi-experimental, used in 12 of the 26 studies; the second most common was the RCT, used in six. One quasi-experimental study evaluating a medication-adherence interaction-dialogue system found decreased delirium (p < 0.001) and loneliness (p = 0.01) [45]. Another study showed reductions in depression (p = 0.053) and anxiety (p = 0.029) [32]. One RCT measured outcomes on a 5-point Likert-type scale and found improved self-management for older people using the chatbot (p = 0.001) [28]. One study combined quasi-experimental and RCT designs to evaluate a medication-adherence intervention, finding that system use, medication adherence, physical activity, and satisfaction measures were high (84-89%) [44].

Discussion
The most commonly used method in the included studies was quasi-experimental, appearing in almost half of the included papers. This aligns with the findings of previous systematic reviews of CAs in healthcare [1,27]. Quasi-experimental designs involve real-world interventions rather than artificial laboratory settings, provide higher internal validity than non-experimental research, and require fewer resources and less expense than RCTs. This systematic review introduced a list of AI CAs in healthcare for chronic disease, reflecting the efficiency, acceptability, and usability of AI CAs in the daily education of, and support for, patients with chronic disease. The recency of the field is reflected in the fact that most of the included studies were published after 2016 (21 papers). Most included studies evaluated task-oriented AI CAs (23 of 26) that assist patients and clinicians through specific processes. The majority of the included studies focused entirely on designing, developing, or evaluating AI CAs specific to one chronic condition. This finding implies that AI CAs are evolving to provide tailored support for specific chronic conditions rather than general interventions for a broad range of chronic conditions. The outcomes of the included studies were assessed on three measures: technical performance, user experience, and health-related measures. Only six studies mentioned technical details. Owing to the lack of reported detail on the technical implementation of AI methods, it was not possible to establish consistent relationships between the interventions used, the disease areas, and the measured outcomes. The evaluation measures of the identified AI-based CAs, and their effects on the targeted chronic conditions, were not unified across studies.
This inconsistency shows the complexity of contrasting and comparing current AI CAs. Although some studies (four) reported complexity of use and chatbot constraints, most reported satisfaction with the agents, with users feeling more comfortable than with continuous follow-ups with a doctor in hospital. User experience was the most commonly reported measure (25 studies), reflecting the positive effect of AI CAs on the quality of life of patients with chronic conditions in most studies. This systematic review found that most included studies focused on designing, developing, or evaluating AI CAs for a specific chronic condition, which yielded greater accuracy, more tailored interaction with patients, and enhanced interventions. In dialogue management, nine studies used mixed initiative, whereas most of the rest applied system initiative; no included study used agent-based dialogue management. Moreover, these studies did not include CAs that can be used across other populations. No analysis has been applied on a broad scale, especially for communities or countries that struggle to manage chronic conditions because of high demand on hospitals or the cost and effort of following up with doctors. Targeting this area would help many people manage chronic conditions and live their lives, especially as CAs supporting preventive measures can prove very effective.
Compared with prior reviews of AI CAs in healthcare, we found only two reviews targeting AI CAs for chronic conditions, one of which focused on voice-based CAs only. Those reviews did not differentiate between the types of CAs used or the AI methods used in each study, so this review focused on investigating the different types of dialogue management along with the AI method used in each study, as well as on technical descriptions of the CAs. Clarifying the technical features of AI CAs will help in choosing the appropriate type of AI CA. Regarding limitations, most studies did not include technical performance details, which makes the replicability of the reviewed studies problematic. Another limitation of the reviewed literature is its heterogeneity and the prevalence of quasi-experimental studies, suggesting that this is still a nascent field.

Conclusions
Many studies in this review showed positive evidence for the usefulness and usability of AI CAs in supporting the management of different chronic diseases. The overall acceptance of CAs by users for the self-management of their chronic conditions is promising: users' feedback shows helpfulness, satisfaction, and ease of use in more than half of the included studies. Although users in many studies appear to feel more comfortable with CAs, there is still a lack of reliable and comparable evidence to determine the efficacy of AI-enabled CAs for chronic health conditions, mainly due to insufficient reporting of technical implementation details. Future research should provide more detailed accounts of the technical aspects of the CAs used, including the development of a comprehensive and clear taxonomy for CAs in healthcare. More RCTs are required to evaluate the efficacy of AI CAs in managing chronic conditions. Safety aspects of CAs are still a neglected area and need to be included as part of core design considerations.

Conflicts of Interest:
The authors declare no conflict of interest.

Study Protocol
Adapted from PRISMA-P (Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols) and PROSPERO.

Topic Content
Eligibility criteria

Inclusion criteria:

1.
We will include primary research studies that (i) focus on consumers, caregivers, or healthcare professionals in the prevention, treatment, or rehabilitation of chronic diseases; (ii) involve a conversational agent and the AI methods used; and (iii) test the system with human users.

Exclusion criteria:

1.
Reviews, perspectives, opinion papers, or news articles will be excluded.

2.
Studies must also have reported evaluations based on human users interacting with the full system. Studies evaluating only individual components of the conversational agent (automatic speech recognition, natural language understanding, dialogue management, response generation, text-to-speech synthesis) will be excluded.

3.
Studies using "Wizard of Oz" methods, where the dialogue is generated by a human operator rather than the conversational agent, will be excluded.

Information sources
A database search will be conducted in the PubMed Medline, EMBASE, PsycINFO, CINAHL, ACM Digital Library, and Web of Science databases. Search terms include synonyms, acronyms, and commonly known terms for the constructs "conversational agent" and "healthcare". Grey literature, such as posters, reviews, and presentations, will be excluded.

Search strategy
The following search strategy will be used in all six databases. Filters: none. Search conducted starting February 2021.

("conversational agent" OR "conversational agents" OR "conversational system" OR "conversational systems" OR "dialog system" OR "dialog systems" OR "dialogue systems" OR "dialogue system" OR "assistance technology" OR "assistance technologies" OR "relational agent" OR "relational agents" OR "chatbot" OR "chatbots" OR "digital agent" OR "digital agents" OR "digital assistant" OR "digital assistants" OR "virtual assistant" OR "virtual assistants") AND ("healthcare" OR "digital healthcare" OR "digital health" OR "health" OR "mobile health" OR "mHealth" OR "mobile healthcare")

Data collection and selection process
AS and AA will conduct the initial screening of the obtained studies based on titles and abstracts. Then, AS and IM will conduct full-text screening based on the eligibility/inclusion criteria. AS, AM, JS, and BY will extract data from eligible papers. Any disagreement will be discussed in a Zoom meeting. Dr Kocaballi and Dr Prasad will supervise all these processes to ensure they stay on track.
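To apply the same search string verbatim across all six databases, the boolean query can be assembled programmatically. A minimal sketch; the term lists below are abbreviated illustrations, not the full protocol lists.

```python
# Assemble a boolean search query of the form
# ("term" OR "term" OR ...) AND ("term" OR ...).

AGENT_TERMS = [
    "conversational agent", "conversational agents",
    "dialog system", "dialogue system", "chatbot", "chatbots",
    "relational agent", "virtual assistant",
]  # abbreviated; the protocol lists more synonyms

HEALTH_TERMS = ["healthcare", "digital health", "health", "mHealth"]

def or_group(terms):
    """Quote each term and join with OR inside parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

query = or_group(AGENT_TERMS) + " AND " + or_group(HEALTH_TERMS)
print(query)
```

Generating the string once and pasting it into each database's search box avoids transcription errors between databases.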
Data items for coding
The following data items will be extracted from each included study: first author, year of publication, study location, study design/type, study aim, conversational agent evaluation measures, main reported outcomes and findings, type of chronic condition, type of study participants, type of conversational agent, goal of the conversational agent, communication channel, interaction modality, technique, and system development. AS, AM, JS, and BY will conduct the data extraction, which will be discussed with Dr Kocaballi and Dr Prasad.

Outcomes and prioritisation
Main outcomes: Any healthcare related intervention outcomes (e.g., type of chronic condition, health goal, intervention targets), any architecture related outcomes (e.g., technique type, system development). Additional outcomes: Any conversational agent related outcomes (e.g., feasibility, accuracy, acceptability, functionality) and design features.

Risk of bias in individual studies
AS and IM will review the included papers to appraise their quality. Disagreement will be discussed to reach a consensus. Any disagreement will be resolved with Dr Kocaballi and Dr. Prasad.

Data synthesis
The PRISMA guidelines will be used for data synthesis. A narrative synthesis of the included studies will be performed.

Topic Content
Country: Australia
Anticipated or actual start date: February 2021
Anticipated or actual end date: September 2021